Unified Information Access Blog

Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.

Have you ever wanted to get all the documents that match your query? I'm not talking about all of the top 10, or even all of the top 100. I'm talking about all of them. Trying this with a few popular search engines reveals some interesting results. Google gives you a nice error message:

google-too-many-results.png

Yahoo, on the other hand, quietly (with no error message) returns the last page of results, numbered 990-999, no matter how many you ask for.

MarkMail (a site we really love by the way) isn't quite sure how many results there are:

markmail-too-many-results.png

Most eCommerce sites that are powered by traditional enterprise search engines are even worse: they never let you see more than 10-50 results, period, since showing more would make catalog scraping easier. A number of other legacy search vendors have hard-coded limits on the number of results that can be retrieved for any given query, or suffer significant performance issues as you ask for more and more results.

While this model may work well for internet portals, it can have disastrous results inside an enterprise. Imagine telling a federal judge you can only get him the top 1000 emails from your client about the insider trading case, or telling your CEO you can't get all of the data for the recent sales activity into the quarterly report but that the top 500 should be enough.

Traditionally speaking, search engines are very good at getting you the top 10, 20, or maybe even 100 results. Again, this is fine for portals, since most users of those systems never get past the first few results, much less the tenth page. Databases, on the other hand, are very good at returning all the results that exactly match a query, but they have no concept of the 'top 10' or a 'fuzzy match.'

Generally, each of these systems serves certain purposes quite well, but they definitely have their limitations. As mentioned above, a legal discovery situation might require finding all emails from person 'X' that mention any form of the company name 'Y'. Alternatively, you might want to get back all the news stories that mention a certain set of keywords for further review. A database is good at the 'return all' aspects of these queries, while a search engine is good at the 'any form of' and 'keyword' aspects.

On the input side, databases are very good at joining large sets of filter criteria against a result set, but search engines generally require you to formulate your searches with the full filter set expressed as a single expression (a OR b OR c OR d OR ...).
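To make the contrast concrete, here is a minimal Python sketch of the two input styles. The function names, the `customer` field and the sample records are all hypothetical; this is an illustration of the general idea, not the query syntax of any particular search engine or database.

```python
def build_or_expression(field, values):
    """Search-engine style: every filter value is spelled out in one query string."""
    return " OR ".join(f'{field}:"{v}"' for v in values)


def filter_by_join(results, field, allowed_values):
    """Database style: join the result set against a set of allowed values."""
    allowed = set(allowed_values)
    return [r for r in results if r[field] in allowed]


# Hypothetical filter criteria and result set, for illustration only.
customer_ids = ["c1", "c2", "c3"]
expression = build_or_expression("customer", customer_ids)

results = [
    {"id": 1, "customer": "c1"},
    {"id": 2, "customer": "c9"},
    {"id": 3, "customer": "c3"},
]
matched = filter_by_join(results, "customer", customer_ids)
```

With three values the OR expression is harmless, but it grows linearly with the filter list, while the join only needs the list to be available as a set.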

Getting one system to meet all of these requirements has been very difficult using existing tools. If you were able to request this kind of information, it usually meant you needed very large amounts of memory or that you had to address the possibility of results changing underneath you while executing a search. Even if you were able to make all of this work, it was very sensitive to data volumes, query types and a host of other issues... until now.

In AIE version 2.2 we shipped a new beta feature, which we plan to officially roll out in our next release, that gives users the ability to request that all results be returned for a given query. It also lets you request all the facet values for a particular query. On the input side, it lets you stream in a list of filter criteria instead of creating a huge filter expression. Most importantly, it works with any type of query and has no memory overhead on the client or server.
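As a rough picture of what streaming means for the consumer, here is a minimal Python sketch using generators. It is not AIE's API: `stream_matches` and `fake_corpus` are hypothetical names, and the corpus is generated on the fly purely for illustration. The point is only that each match is handed over as it is found, so nothing has to be buffered as a full result list.

```python
from typing import Iterable, Iterator


def stream_matches(documents: Iterable[dict], keyword: str) -> Iterator[dict]:
    """Lazily yield each matching document; no result list is built up."""
    for doc in documents:
        if keyword in doc["text"]:
            yield doc


def fake_corpus(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for an index: produces documents on demand."""
    for i in range(n):
        yield {"id": i, "text": "sales report" if i % 2 == 0 else "memo"}


# The consumer walks every match, not just a top-N page,
# while holding only one document at a time.
total = sum(1 for _ in stream_matches(fake_corpus(100_000), "sales"))
```

A real implementation also has to deal with the index changing during the scan, which this sketch ignores.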

So far we're using search result streaming with some integration projects where we're federating between AIE and large databases, with tens and hundreds of millions of records flowing in either direction. We send large results back to BI tools and databases, they send us large filter criteria lists, and no one has to worry about all of the edge cases that might cause system instability.

We're sure there are other use cases, though. Imagine what you could do if you could request all the results from an internet search provider for a given keyword, or all the pages, period. We'd love to hear about ideas you have for this kind of functionality.

I recently provided an answer to an online discussion group (LinkedIn Enterprise Search Engine Professionals Forum) and thought it might be helpful for some of our customers, so I wanted to share it directly through the Attivio blog.

The question posted by Bob Lawrence:

I noticed the GSA announcement talked about ACLs in the index to allow "early binding" to security rules. This was also alluded to in connection to another discussion on open source security. In that thread, it was implied that for new security rules to take effect for SOLR/LUCENE, the search index needed to be rebuilt (please correct me if I misinterpreted the discussion). Is this also true of the GSA and other search engines that tout early binding to security rules? How big a deal is this? I am assuming organizations need to update the index on at least a daily basis to account for added/deleted documents. Can they do this without rebuilding the index and thus not incorporating new security rules?

The original question makes a good point about ingestion issues when ACLs are bound directly to documents. It isn't great when you have to re-ingest a large PDF just because the permissions on it changed. But there are even bigger issues when taking into account user and group changes.

For example, suppose the "Company Group" is made up of the "Consulting Group" and the "Sales Group". If the "Consulting Group" is removed from the "Company Group", you don't want to have to re-ingest all documents that grant access to the "Company Group" just because the group hierarchy changed.

Then there's also the issue of users and ACLs existing in different realms. For example, if Tom@Windows and Tom@Unix are the same user, it would be best if both realms' ACLs were applied when Tom queries.
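A minimal Python sketch of the cross-realm idea follows. The alias table, the per-realm ACLs and the document names are all hypothetical and exist only to illustrate the behavior described above; this is not how any specific product stores them.

```python
# Hypothetical alias table: each realm-qualified login maps to one canonical identity.
aliases = {"Tom@Windows": "tom", "Tom@Unix": "tom"}

# Hypothetical per-realm ACLs: document id -> logins allowed to read it.
realm_acls = {
    "Windows": {"budget.xlsx": {"Tom@Windows"}},
    "Unix": {"deploy.sh": {"Tom@Unix"}, "secrets.txt": {"root@Unix"}},
}


def logins_for(login):
    """Return every login that shares a canonical identity with the given one."""
    identity = aliases.get(login, login)
    same = {name for name, ident in aliases.items() if ident == identity}
    return same | {login}


def readable(login):
    """Apply the ACLs of every realm in which the user has an alias."""
    logins = logins_for(login)
    return sorted(
        doc
        for acl in realm_acls.values()
        for doc, allowed in acl.items()
        if allowed & logins
    )
```

Here `readable("Tom@Windows")` covers documents from both realms, while a login with no alias, such as `root@Unix`, only sees what its own realm grants.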

All of these issues point out why it's best to separately store:

  • Users, groups and their membership
  • Documents
  • Document ACLs
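Here is a minimal Python sketch of why that separation helps, using the group example from above. The in-memory dictionaries, the user names and the document ids are hypothetical stand-ins for the three stores; the join is done at query time, so a change to the group hierarchy never touches a document.

```python
# Hypothetical in-memory stores, kept separate as the list above suggests.
group_members = {
    "Company Group": {"Consulting Group", "Sales Group"},
    "Consulting Group": {"alice"},
    "Sales Group": {"bob"},
}
documents = {"doc1": "Quarterly plan", "doc2": "Sales playbook"}
document_acls = {"doc1": {"Company Group"}, "doc2": {"Sales Group"}}


def principals_for(user):
    """Return the user plus every group containing them, directly or transitively."""
    found = {user}
    changed = True
    while changed:
        changed = False
        for group, members in group_members.items():
            if group not in found and members & found:
                found.add(group)
                changed = True
    return found


def visible_documents(user):
    """Join the user's principals against the document ACLs at query time."""
    principals = principals_for(user)
    return sorted(doc_id for doc_id, acl in document_acls.items() if acl & principals)


before = visible_documents("alice")  # alice reaches doc1 via Consulting Group -> Company Group
group_members["Company Group"].discard("Consulting Group")
after = visible_documents("alice")   # hierarchy changed; no document or ACL was rewritten
```

Only the membership store was updated between the two queries; `documents` and `document_acls` are untouched, which is the property you lose when resolved user lists are baked into each indexed document.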

Warning: shameless plug follows ...

As was pointed out by another participant in the discussion, Kevin Watters, Attivio has had built-in support for JOINs as well as ACLs for quite some time, which has allowed us to secure queries using early binding without re-indexing documents. We have recently released a new security module with advanced user and group support, which makes this process even easier and handles all the issues mentioned. The security module comes with multiple features, including:

  • An Active Directory scanner to ingest users, groups and group membership
  • A Windows file scanner to ingest documents along with their Windows ACLs
  • A Windows service to persist ACL updates in real-time to the index (without re-indexing the document)
  • An Exchange scanner to ingest emails along with their ACLs
  • A SharePoint scanner to ingest SharePoint documents along with their ACLs
  • The ability to create cross-realm aliases, so that user Tom@Windows and Tom@Unix can be treated as the same user with respect to permissions
  • An extensible API to support custom user, group and ACL ingestion
  • Complete security filtering of queries and their results (including facets) so users have no knowledge of applied security rules

I'd like to hear your thoughts on this topic as well. I think the hybrid approach using JOIN has changed this paradigm significantly.

If you’d like more information on the Attivio security model, check out our on-demand webinar called “Information Access Control: Can you really have faster, safer AND cheaper?”.

Author Bio

Since graduating from Carnegie Mellon University with a Master's degree in Information Networking, Greg George has had extensive experience developing high performance, large-scale data processing solutions. Greg was an early employee at Ab Initio Software Corporation where he solely designed and developed their database interface engine for all of the major database vendors. After Ab Initio, Greg worked at Lumigent Technologies as a technical leader designing auditing solutions for Oracle, DB2 and Sybase. Greg is a Principal Software Engineer working at Attivio.

The last few years have brought significant changes in the enterprise search market. Acquisitions, redirected strategies and discontinued products have altered the landscape, but the most dramatic development is the emergence of a new class of technology, unified information access (UIA), which combines the best of search and database applications. This article details the top reasons our customers have upgraded their legacy search technology and why selecting UIA is the optimal approach to expand your information access capabilities.

Top 10 Reasons to Replace Enterprise Search with UIA:

1. Current search solution is too expensive

It's not just the license and the required number of servers or appliances; it's the ongoing operational costs, the added fees for expanding document volume and servers, and the consultants required for long implementation, tuning and upgrade phases. And for all the expense, the results aren't that impressive either.

Why UIA is better: Spend less, get more

Although we can't speak for other UIA products, Attivio's UIA platform delivers better results at a lower total cost of ownership. The lower cost includes:

  • Dramatically faster development and deployment time, so lower fees for professional services and faster time to value

  • No requirement for taxonomy or manual facet development

  • Smaller number of servers required because of a combination of optimized ingestion and query performance (you can simultaneously ingest 100M documents while maintaining 20-40 QPS with query delays of no more than 2 seconds — try achieving that on one server with a legacy search engine)

  • Ability to add servers and new data formats without re-indexing

2. Current search solution doesn't scale easily

There's got to be an easier way to add new content sources, more servers and more navigation options without having to re-index. Changes are disruptive and expensive, often discouraging expansion and innovation.

Why UIA is better: UIA from Attivio scales seamlessly

Why guess what your needs will be later and buy all the hardware now? With Attivio AIE, there's no need to plan future expansion and invest in hardware upfront because AIE scales incrementally - and without disruptive re-indexing for additional hardware or data formats.
