Unified Information Access Blog

Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.

Everyone wants computers to silently understand exactly what they want, and even to anticipate their needs. But so far technology has only allowed us to build software that approximates the holy grail of "artificial intelligence". We can play checkers or chess with computers (and they usually win), but software for extracting semantic relationships from text still chokes on sarcasm and elliptical language. One well-known knowledge extraction engine recently reported that Henry VIII was crowned Queen of England. From the same site I can also learn that Edison built foil and used appearances (actually, he built light bulbs, some of which contained foil, and he made publicity appearances). Googling for "white house paint" still returns more hits about the White House than about paint. Clearly, computers don't understand human language yet.

The "Semantic Web" is one approach researchers have proposed to help computers better "understand" documents. The collection of standards and technologies that make up the semantic web proposal was originally intended to improve the ability of computers to index, retrieve, and relate documents on the world-wide web, but it generalizes easily to documents that are not on the web and to formats other than HTML. The proposal formalizes the kinds of metadata that can be attached to documents. Instead of free-form keywords, a semantic web document is tagged with entities from a controlled vocabulary, such as "http://www.daml.org/2002/02/chiefs/bf#A4" (Note: these tags are identifier strings, not web sites!).

In this case, the controlled vocabulary is written in a language called RDF, which looks something like this:

<rdf:Description rdf:nodeID="A4">
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  <NS0:surname>Sears</NS0:surname>
  <NS0:givenname>Joshua</NS0:givenname>
</rdf:Description>
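
To make the structure of that fragment concrete, here is a minimal Python sketch that reads it into (subject, predicate, object) tuples using only the standard library. It is an illustration, not a full RDF parser. The fragment above does not show its namespace declarations, so the wrapping rdf:RDF element and the URI bound to NS0 are assumptions added to make the snippet parse; the NS0 URI in particular is a placeholder.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: the original fragment does not show what NS0 is bound to,
# so this URI is a placeholder chosen only to make the XML well-formed.
NS0 = "http://example.org/ns0#"

DOCUMENT = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:NS0="{NS0}">
  <rdf:Description rdf:nodeID="A4">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <NS0:surname>Sears</NS0:surname>
    <NS0:givenname>Joshua</NS0:givenname>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples for each rdf:Description."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}nodeID")
        for prop in description:
            # A property either points at a resource or carries literal text.
            value = prop.get(f"{{{RDF_NS}}}resource")
            if value is None:
                value = (prop.text or "").strip()
            # prop.tag is the predicate in "{namespace}localname" form.
            triples.append((subject, prop.tag, value))
    return triples


for triple in extract_triples(DOCUMENT):
    print(triple)
```

Each child element of the description becomes one statement about the node A4: its type, its surname, and its given name.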


It should come as no surprise that specialized tools are required to maintain these vocabulary lists. These tools also let you define relationships between entities, such as "The name of A4 is Joshua Sears", and logical inference rules, such as "if person P works at company C then P can access document D".
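
As a rough sketch of what such a rule looks like once it is written down, the Python below stores facts as (subject, predicate, object) tuples and applies a single rule to them. The rule as quoted does not say how document D relates to company C, so the sketch assumes an "owned_by" relationship; the predicate names, the company names, and the document names are all hypothetical.

```python
# Facts as (subject, predicate, object) tuples. Everything except the
# name of A4 is made up for illustration.
FACTS = {
    ("A4", "name", "Joshua Sears"),
    ("A4", "works_at", "ExampleCo"),
    ("report-1", "owned_by", "ExampleCo"),
    ("report-2", "owned_by", "OtherCo"),
}


def infer_access(facts):
    """Apply one rule: works_at(P, C) and owned_by(D, C) => can_access(P, D)."""
    employers = [(s, o) for s, p, o in facts if p == "works_at"]
    owners = [(s, o) for s, p, o in facts if p == "owned_by"]
    derived = set()
    for person, company in employers:
        for document, owner in owners:
            if owner == company:
                derived.add((person, "can_access", document))
    return derived


print(infer_access(FACTS))
```

Even this toy rule needed an extra relationship before it could be applied, which hints at how quickly real rule sets grow.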

Although the semantic web sounds like a panacea in theory, it does not have a great track record in practice. It is unclear what value is added by the ability to define relationships between entities. Inference rules are tricky to define because almost every rule has exceptions: all "chairs" have four "legs", except when they don't. On the world-wide web, there is no universal controlled vocabulary. Spammers routinely lie about the contents of their documents, and many other publishers just can't be bothered to tag theirs.

Controlled vocabularies are the most valuable thread in the semantic web. Controlled vocabularies, such as subject headings, author names, and thesauri, were used in libraries long before computers were used for information retrieval or documents were published on the world-wide web. Labeling documents with metadata from controlled vocabularies can significantly improve the user experience with an enterprise-class information retrieval system. However, controlled vocabularies are known to be expensive to build, use, and maintain. They are domain-dependent: Amazon's vocabulary of book authors probably looks nothing like Snap-On Tools' vocabulary of wrenches. They are also difficult to use. The keywords you use to tag a document may not be the keywords I use, so a list of preferred terms is required. For best results, you have to train your users in the use of your preferred terms. In a list of a thousand authors, there will invariably be duplicate names, so additional information is needed for disambiguation.
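
A list of preferred terms can be pictured as a simple lookup table. The Python sketch below maps whatever keyword a tagger typed to the preferred term and sets aside anything the vocabulary does not cover. The vocabulary entries and the function name are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical vocabulary: each accepted variant maps to one preferred term.
PREFERRED_TERMS = {
    "automobile": "car",
    "auto": "car",
    "car": "car",
    "spanner": "wrench",
    "wrench": "wrench",
}


def normalize_tags(tags, vocabulary=PREFERRED_TERMS):
    """Map free-form tags to preferred terms; collect tags the vocabulary lacks."""
    accepted, rejected = [], []
    for tag in tags:
        preferred = vocabulary.get(tag.strip().lower())
        if preferred is None:
            # Unknown tags need a human decision: add them or drop them.
            rejected.append(tag)
        elif preferred not in accepted:
            accepted.append(preferred)
    return accepted, rejected


print(normalize_tags(["Auto", "automobile", "Spanner", "gizmo"]))
```

The rejected list is where the maintenance cost shows up: someone has to decide whether each unknown keyword becomes a new entry, a synonym of an existing one, or is discarded.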

In designing Attivio's Active Intelligence Engine (AIE), we picked a few of the golden threads from the semantic web and combined them with a number of techniques that have been shown to improve the search experience for the vast majority of users while minimizing the cost of cataloging documents. These techniques include:

  • Extracting entities, such as names and dates, from documents

  • Improving queries by removing common words and by adding synonyms

  • Ranking documents using the correspondence between the words and phrases of the query and the documents

  • Using "quality" metrics, such as the number of times a document has been viewed or voted for
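
To give a feel for how three of these techniques fit together, here is a toy Python sketch covering common-word removal, synonym expansion, and ranking by word overlap with a small "quality" boost from view counts. It is not how AIE works internally: the stop-word list, the synonym table, the scoring formula, and the sample documents are all assumptions made for illustration, and entity extraction is left out.

```python
import math

# Hypothetical resources; a real system would use much larger lists.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "to", "and", "your"}
SYNONYMS = {"paint": {"coating"}, "house": {"home"}}


def expand_query(query):
    """Drop common words, then add synonyms of the remaining terms."""
    terms = [word for word in query.lower().split() if word not in STOP_WORDS]
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def score(query_terms, document):
    """Word overlap plus a damped popularity boost; zero if nothing matches."""
    words = set(document["text"].lower().split())
    overlap = len(query_terms & words)
    if overlap == 0:
        return 0.0
    # log1p keeps a heavily viewed document from swamping relevance.
    return overlap + 0.1 * math.log1p(document["views"])


def rank(query, documents):
    """Return ids of matching documents, best score first."""
    terms = expand_query(query)
    scored = [(score(terms, doc), doc["id"]) for doc in documents]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for value, doc_id in ordered if value > 0]


DOCS = [
    {"id": "white-house", "text": "history of the white house", "views": 1000},
    {"id": "paint-guide", "text": "choosing exterior paint and coating for your home", "views": 40},
    {"id": "recipes", "text": "soup recipes", "views": 50},
]

print(rank("paint for the house", DOCS))
```

In this sketch the popularity term only nudges the ordering, so a document that matches more of the expanded query still outranks a more heavily viewed one.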


If metadata is available, AIE will use it to assist the user in learning about and exploring the document collection.  If the metadata is drawn from a controlled vocabulary, the user experience will be even better.  But because of the costs involved, AIE does not require you to tag documents or maintain controlled vocabularies – it integrates a number of proven techniques to maximize your results.

Author Bio

Jonathan Young earned a Ph.D. in Computer Science from Yale University. While working at Dragon Systems, he built the speech recognition engine and speech user interface now known as Dragon NaturallySpeaking, and then built the Dragon AudioIndexing engine for multimedia information retrieval. Dr. Young is an inventor on six patents, and more recently has worked on statistical algorithms for intrusion detection, speech recognition, and natural language processing. Jonathan is a Senior Research Engineer at Attivio.
