Attivio | Unified Information Access Blog
Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.
| Can a Search Engine Replace a Relational Database? |
| Written by Sid Probstein | ||||
| Wednesday, 02 December 2009 | ||||
|
It's a common enough question: can a search engine replace a relational database? The answer is "sometimes". One challenge is transaction support and the ACID properties. If we focus on replacement for the purpose of querying only, we can leave that aside. This leaves some other interesting challenges. First and foremost is the need to translate the relational model to the search model. Relational databases are so named because they organize data in various ways – typically by commonality, but this is dependent on the application - with relationships between sets of data expressed through foreign keys. Data stored in relational database is sometimes normalized, but this is not required. Using this model, it is a straightforward matter to represent one-to-one, one-to-many and many-to-many relationships, and query them. In contrast, search engines map words to documents; the documents may have structure - typically fields - but content/data is generally not normalized. Converting relational data to an index model requires running a query in the database to produce a single table - a "flattened" view that can be indexed such that each row is a document, and each column is a field within it. Consider a simple case - a table of news stories, a table of people, and a comment table:
Another reasonable flattening scheme might be to use news as the parent, and roll-up comments, then users (with duplication as noted above.) This will allow analysis of news and comments, but as soon as you try to include users the same problem becomes evident. This is the downside of flattening: it is a lossy process in which useful relational data is discarded. Another potential issue when replacing a database with a search engine is update performance. Databases are designed to ingest and store data rapidly - essentially, all you are doing is appending a record to the end of a data structure. The performance penalty is taken at query time. This is why databases are typically used to store large numbers of small things - like sales data - or things that change frequently - like permission data. Database queries have a highly variable execution time, and it is almost always the case that certain queries will take a very long time to return. Search engines are designed in the opposite way: the performance penalty is at ingestion time, to generate an inverted index that is optimized for queries by mapping words to documents. But updating a large portion of your corpus can take a relatively long time. (Consider a change to a single news article in the flattening example above; reflecting that change in the index requires updating the record of every user who has commented on it.) Here's another example: a table consisting of blobs of text (perhaps documents), and another table with the access control list (ACL) for those blobs. (We could also imagine this scenario has nothing to do with a database; perhaps it is a set of documents stored on a file system.) In the search engine model, ACLs are flattened into the blobs/documents; they become a field.
This approach - of storing ACLs in the index and combining them with the end-user's query and credentials - is typically referred to as the "early bound" security model. It is quite challenging to make work with a document-centric search engine for even moderately volatile environments. An alternative is the "late bound" model in which ACLs are left out of the search index. Queries are run against an unsecured index, and each result is then checked against a live security authority - often a database that can keep up with the pace of ACL changes. The late bound approach is very problematic, however. Among other things it leads to long query times and can render useful search features like facets or spelling suggestions unusable, since they are usually calculated by the search engine before security is applied, and thus would include content from documents the user is not authorized to see. In the database model, of course, the ACL updates are simply applied to the separate table. Atomic, normalized storage models do have their advantages; this is one scenario in which it makes all the difference. But databases are notoriously bad at full-text querying. So what about using a search engine and a database together? There are certainly cases when this makes sense; if, for example, you want transactional storage and full-text querying. Using both for querying probably makes sense if you use one or the other for a given application or feature. Trying to use them at the same time will produce interesting new challenges. Search engines are not typically optimized at returning large numbers of results quickly - most requests want 10, 50 or 100 results. Bottlenecks on disk or network make this generally impractical. The additional requirement to write business logic (code) to route the right content, data and queries to the right server adds further complexity. An overall better model is Unified Information Access. Attivio's Active Intelligence Engine (AIE) uses a hybrid index structure that supports both flat and relational indexing models. (It combines the query-side performance of the inverted index with the ingest-side performance of a relational store.) AIE's query-side JOIN operator allows selection of relationships between tables as would be expressed in SQL, and use full-text search at the same time. JOIN elegantly solves the problems that result from the translation of relational data to inverted indexing. For example, AIE can index the news, people and comment tables as tables - no flattening required - and queries using the JOIN operator can ask most any question: JOIN(table:user, INNER(AND(table:comment, like)), ON="id=userid") This query returns users who used the word "like" in a comment they made. JOIN(table:news, INNER(AND(table:comment, like*)), ON="id=commentid") This query returns news articles that a user said they "like", "likes", "liked" (etc) in comments. (Note use of the wildcard.) JOIN(table:comment, INNER(AND(table:news, title:search)), ON="id=newsid") This query returns comments for news articles with "search" in the title. AIE also supports SQL queries. For example: SELECT * FROM news INNER JOIN AIE_TABLE_QUERY('AND(table:comment, like*)', 'Advanced')) ON news.id = comment.newsid This query is the same as the second query and returns news that a user said they 'like', 'likes', etc. The unique AIE_TABLE_QUERY clause allows you to use a full-text query to create a virtual table. (AIE also exposes an All_Tables() function that can be used to search across all content.) Beyond SQL ... AIE also makes early bound security work, even in highly volatile environments, by keeping ACLs in separate tables that can be rapidly and independently updated. No fetching or re-processing of related documents is required. At query time the JOIN operator, along with unique ALLOW and DENY operators, brings together the user's search with their ACLs: JOIN(AND(table:documents, xyzzy), ALLOW(AND(table:users, userid:1), on="allowacl"), DENY(AND(table:users, userid:1), on="denyacl")) The bottom line is that using a search engine instead of a database makes sense in some scenarios, mostly those where one particular relationship is of primary interest for search, no more than one level of additional related data is needed, and the rate of updates, particularly in the secondary table(s), is moderate. Using a search engine and a database together makes sense in similar scenarios, so long as each is used for separate purposes. Providing more complex unified information access capabilities requires a next-generation technology like AIE. To learn more about AIE's support for early bound security and how it is helping innovative companies like IntraLinks provide search in a highly secured environment, please check out our webinar on this topic. For more information about security risks and enterprise search in general here is a Network World podcast I did earlier this year on that very subject.
Bookmark
Email This
Trackback(0)
Comments (2)
![]()
Fredrick.William
said:
|
|
... Attivio Active Intelligence Engine's Permissioning Model and Real-Time Updates Prove Key Differentiators in Replacing Leading Search Solution at Traction. ----------------------------- Fredrick.William Melbourne |
live search engine
said:
|
... All search engines should capable of doing better search results as the people wanted. |
















