Unified Information Access Blog

Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.

Just because you can run a Google search and get some relevant data doesn't mean a web search engine is right for your organization's needs. In enterprise search, unlike web search, you don't want the most popular result; you want the right result. Users also expect better answers because they are far more familiar with the enterprise's content, and they want to explore that content by navigating through results instead of entering more queries.

Plus, enterprise content is far more complex, with a greater variety of formats and security concerns. To address these issues, your solution must at least begin to meet the following criteria.

Flexibility

Web search engines deal only with web pages. Web pages, whether on your intranet or even on your competitors' websites, may be of interest, but the vast repositories of email, office documents and enterprise application data are equally (if not more) important.  An enterprise search engine must be able to get at all of these sources, and more.

Also, an enterprise search engine should allow you to tune relevancy to balance between precision and recall as appropriate to the specifics of the application.  An e-Discovery application, for example, should be biased towards broad recall - you're trying to find ALL the material relevant to some topic, after all.  On the other hand an intranet search might want to be focused on precision.
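To make the trade-off concrete, here is a minimal sketch of how precision and recall are computed for a single query. The document IDs and the two result lists are made up for illustration; they do not come from any real engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of returned results that are relevant
    recall    = share of relevant documents that were returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: four documents are actually relevant.
relevant = {"d1", "d2", "d3", "d4"}

# A recall-biased configuration (e-Discovery style) returns a wide net.
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

# A precision-biased configuration (intranet style) returns a short list.
narrow = ["d1", "d2"]

broad_scores = precision_recall(broad, relevant)    # (0.5, 1.0)
narrow_scores = precision_recall(narrow, relevant)  # (1.0, 0.5)
```

The broad list finds everything but half of it is noise; the narrow list is all signal but misses half the relevant material. Tuning relevancy moves an application between these two ends.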

Security

Nearly all the information a web search engine indexes is public, so it doesn't need to be concerned with security.  But in the enterprise, access control is critical.  If you are going to bring in the most important repositories of information and make them available to search, the underlying technology must support robust access control.  Minimally, your enterprise search engine should include Access Control Lists (ACLs) from the source repositories, and require that a user ID and/or group ID be provided as part of every search.  A more robust implementation will include a last-second check against the live security authority to ensure that the user is truly authorized to see the content.
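The two levels of protection described above can be sketched roughly as follows. This is an illustration only: the `Doc` class, the principal naming scheme ("user:…", "group:…") and the `is_authorized` callback are assumptions, not the API of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed: frozenset  # principals copied from the source ACL at index time


def acl_filter(hits, user_id, group_ids):
    """Minimal protection: keep hits whose indexed ACL names the user or a group."""
    principals = {user_id, *group_ids}
    return [d for d in hits if d.allowed & principals]


def late_check(hits, user_id, is_authorized):
    """Stronger protection: re-check each hit against the live security authority.

    is_authorized is a hypothetical callback standing in for a call to the
    source system or directory service.
    """
    return [d for d in hits if is_authorized(user_id, d.doc_id)]


hits = [
    Doc("hr-1", frozenset({"group:hr"})),
    Doc("eng-1", frozenset({"group:eng", "user:alice"})),
    Doc("eng-2", frozenset({"group:eng"})),
]

# Step 1: filter on the ACLs stored in the index.
visible = acl_filter(hits, "user:alice", ["group:eng"])

# Step 2: suppose access to eng-2 was revoked after indexing.
revoked = {("user:alice", "eng-2")}
confirmed = late_check(visible, "user:alice",
                       lambda user, doc_id: (user, doc_id) not in revoked)
```

The index-time ACL alone would still show `eng-2` to the user; only the late check catches a permission that changed after the content was indexed.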

Content Density

The amount of content that is generated and stored within the enterprise is growing at a faster rate each year. You need to know the rough number of records (documents, web pages or database rows) that your application needs to load.  Then you need to understand how much data your enterprise search vendor can store on a single commodity server.  State-of-the-art density will be around 100M full text pages indexed on a single server.  Also, expect the index to be 20 to 40 percent of the size of the source data, and you may be able to reduce this number even more by trading off certain features.  Talk to your vendor references and see what sort of 'real world' numbers they're achieving.  Bear in mind that content density ties directly to the total cost of ownership of your enterprise search solution - it's very much in your interest to understand it.
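As a back-of-the-envelope aid, the figures above (roughly 100M pages per server, index at 20 to 40 percent of source size) can be turned into a quick sizing calculation. The function names and the sample workload of 250M records and 1,000 GB are hypothetical; substitute the numbers your vendor references actually report.

```python
import math


def servers_needed(total_records, records_per_server=100_000_000):
    """Servers required at a given density (default: the 100M-page figure)."""
    return math.ceil(total_records / records_per_server)


def index_size_range_gb(source_gb, low_pct=20, high_pct=40):
    """Expected index size range, using the 20-40 percent rule of thumb."""
    return source_gb * low_pct / 100, source_gb * high_pct / 100


# Hypothetical workload: 250 million records, 1,000 GB of source data.
servers = servers_needed(250_000_000)             # 3
index_low, index_high = index_size_range_gb(1000)  # (200.0, 400.0)
```

If a vendor's real-world density turns out to be half the headline figure, the server count in this example doubles, which is why density feeds so directly into cost of ownership.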

Incremental Scalability

Linear scalability is a key requirement. If you need twice the capacity of the one-machine limit, you only buy one more machine. At no point should you suddenly need to switch to a different hardware architecture.  Another important consideration is to make sure you don't need to buy the hardware up front to accommodate the "theoretical full size" of your index. You should be able to add capacity as needed, without having to re-index your content.
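The idea of adding capacity without re-indexing can be illustrated with a toy model. This is a deliberately simplified sketch, not how any specific engine partitions its index: the `Cluster` class and its fill-then-add-a-node policy are assumptions chosen to show that existing nodes stay untouched when a new one is added.

```python
class Cluster:
    """Toy model: each node holds a fixed number of documents (hypothetical)."""

    def __init__(self, capacity_per_node):
        self.capacity_per_node = capacity_per_node
        self.nodes = [[]]  # start with a single machine

    def add(self, doc_id):
        # When the newest node is full, add one more node of the same kind.
        # Documents already indexed on earlier nodes are never moved.
        if len(self.nodes[-1]) >= self.capacity_per_node:
            self.nodes.append([])
        self.nodes[-1].append(doc_id)

    def search(self, predicate):
        # A query fans out to every node and the results are merged.
        return [d for node in self.nodes for d in node if predicate(d)]


cluster = Cluster(capacity_per_node=2)
for i in range(4):
    cluster.add(f"doc-{i}")

first_node_before = list(cluster.nodes[0])
cluster.add("doc-4")  # two nodes are full, so a third is added
```

After the fifth document arrives, the cluster has three nodes and the first node's contents are exactly what they were before. Capacity grew by one machine, with no change of architecture and no re-indexing.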

Ease of Use

It seems to be a prevailing assumption that ease of use and functional sophistication are mutually exclusive. This doesn't need to be the case, but both must be designed into the product from the beginning.  When evaluating a search engine, make a list of the common tasks you expect to have to do; then see how long each one takes you.  Even if you don't use a real world case, you should be able to get a sense of what you're in for during implementation.

Connectors

Connectors interpret the format of content for ingestion. Does the solution support sources such as RDBMS, documents (Microsoft Office, PDF, and so on), e-mail, and specific applications (AS400, Salesforce.com)? Does the solution use commercial or individually developed connectors? Individually developed connectors can ensure performance and reliability, while commercial connectors may cause compatibility problems down the road due to upgrades to the connectors and updates to the underlying applications supporting the given format.

Language Support

Does the solution support a wide range of languages from all over the world? Can the solution provide extended language modules supporting tokenization of all languages, as well as extraction of noun phrases/entities for many of them? Does the solution provide for special and local dictionaries and lexicons that can be plugged in for customization?

Workflow

Content, queries, and results should each pass through a sequence of processing stages before reaching their destination. These stages should be arranged logically in a process workflow that supports branching, conditional logic, and parallel processing. You want to be sure, for example, that you can index e-mail attachments and zip files while maintaining their relationship to the original e-mail.
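A staged workflow of this kind can be sketched in a few lines. Everything here is illustrative: the dictionary-based document shape, the `children` field, and the stage names are assumptions rather than any product's API, and parallel processing is left out for brevity.

```python
def expand_children(doc):
    """Emit the document, then each attachment or archive member.

    Every child is tagged with its parent's id, so the relationship to the
    original e-mail survives indexing.
    """
    yield {k: v for k, v in doc.items() if k != "children"}
    for n, child in enumerate(doc.get("children", [])):
        child = {**child, "id": f"{doc['id']}/{n}", "parent_id": doc["id"]}
        yield from expand_children(child)


def when(predicate, stage):
    """Conditional branch: apply stage only to documents matching predicate."""
    def wrapped(doc):
        if predicate(doc):
            yield from stage(doc)
        else:
            yield doc
    return wrapped


def lowercase_body(doc):
    """Example transformation stage."""
    yield {**doc, "body": doc.get("body", "").lower()}


def run(stages, docs):
    """Pass every document through each stage in order."""
    for stage in stages:
        docs = [out for d in docs for out in stage(d)]
    return docs


# Hypothetical e-mail with a zip attachment that contains a PDF.
email = {
    "id": "mail-1", "type": "email", "body": "See Attached",
    "children": [
        {"type": "zip", "children": [{"type": "pdf", "body": "Q3 REPORT"}]},
    ],
}

stages = [expand_children, when(lambda d: d["type"] == "pdf", lowercase_body)]
indexed = run(stages, [email])
```

The run produces three records with ids `mail-1`, `mail-1/0` and `mail-1/0/0`; each child carries a `parent_id` pointing one level up, and only the PDF passes through the conditional stage.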

Testing

After researching your choices, make sure to actually test the different systems with real users and scenarios. Although you may not be able to do so on your own data, at least make sure to run the various solutions through their paces with real world data and not data contrived to provide favorable results.

