We think about the semantic web in two complementary (and equivalent) ways. It can be viewed as:
• A large set of subject-verb-object triples, where the verb is a relation and the subject and object are entities
OR
• A large graph or network, where the nodes are entities and the directed edges (arrows) are the relations between those entities.
As a reminder, entities are proper names, like people, places, companies, and so on. Relations are meaningful events, outcomes or states, like BORN-IN, WORKS-FOR, MARRIED-TO, and so on. Each entity (like "John O'Neil", "Attivio" or "Newton, MA") has a type (like "PERSON", "COMPANY" or "LOCATION") and each relation is constrained to only accept certain types of entities. For example, WORKS-FOR may require a PERSON as the subject and a COMPANY as the object.
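To make the two views concrete, here is a minimal Python sketch. It is only an illustration, not part of any semantic web standard: the class names, the RELATION_TYPES table and the helper functions are all made up for this post. It stores facts as typed triples, checks each triple against its relation's type constraint, and then regroups the same triples as a graph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    type: str  # e.g. "PERSON", "COMPANY", "LOCATION"


@dataclass(frozen=True)
class Triple:
    subject: Entity
    relation: str
    obj: Entity


# Hypothetical constraint table: relation -> (subject type, object type).
RELATION_TYPES = {
    "WORKS-FOR": ("PERSON", "COMPANY"),
    "BORN-IN": ("PERSON", "LOCATION"),
}


def is_valid(triple):
    """True if the relation is known and both entity types match its constraint."""
    expected = RELATION_TYPES.get(triple.relation)
    return expected == (triple.subject.type, triple.obj.type)


def to_graph(triples):
    """Graph view of the same data: each subject node maps to its outgoing edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.relation, t.obj))
    return dict(graph)


john = Entity("John O'Neil", "PERSON")
attivio = Entity("Attivio", "COMPANY")

triples = [Triple(john, "WORKS-FOR", attivio)]
graph = to_graph(triples)

print(is_valid(triples[0]))                         # subject and object types fit
print(is_valid(Triple(attivio, "WORKS-FOR", john)))  # types are reversed
```

Nothing is lost or gained in moving between the two views: the graph is just the triples grouped by subject.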
How semantic web information is organized and transmitted is described by a blizzard of technical standards and XML namespaces. Once you escape from that, the basic goals of the semantic web are (1) to allow a lot of useful information about the world to be simply expressed, in a way that (2) allows computers to do useful things with it.
Almost immediately, some problems crop up. As generations of artificial intelligence researchers have learned, it can be really difficult to encode real-world knowledge into predicate logic, which is more-or-less what the semantic web is. The same AI researchers also learned that different people will almost inevitably create knowledge encodings that can't easily be compared, because they use different (sometimes subtly, maddeningly different) basic definitions and concepts. Another difficult problem is deciding when two entity names refer to the "same" real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged? For example, do an Internet search for "John O'Neil", and try to decide how many different people the results refer to, and which results belong to which person. Believe me, not all of the results are for the same person.
As for relations, it's difficult to tell when they really mean the same thing across different knowledge encodings. No matter how careful you are, if you want to use relations to infer new facts, you have few ways to check whether the combined information is valid.
So, when each web site can define its own entities and relations, independently of any other web site, how do you reconcile entities and relations defined by different people?
One technique is to require (or STRONGLY SUGGEST) the use of a shared ontology. (For our purposes, an ontology is one person's, or one company's, semantic web.)
Perhaps, if such a shared ontology were carefully designed, it would be possible to allow anyone to add to it without making it unusable. Wikipedia might serve as an inspiration here. However, this is generally impractical, for a number of reasons:
- A lot of smart people have tried to do this in the past, and they've obviously failed.
- Wikipedia has grown a community that is good (perhaps too good) at discussing how articles should be written. However, it's not clear that any community could become competent to discuss semantic web issues in detail, and to come to agreement about them.
The major problem is the "open-world" requirement implicit in the semantic web. In a closed world or a limited domain (even if the limited domain isn't small), it's possible to agree on the ontological issues and get to work. Many companies have put a lot of effort into creating their domain ontologies, and some have even found a day-to-day use for them. However, it takes a lot of continuous, ongoing work to maintain a good domain ontology.
Even if companies were willing to open-source their ontologies, their domain is closed — and once you start trying to knit different domain ontologies together, you quickly start seeing the problems discussed above.
By the way, the fact that the semantic web has failed to be widely adopted has, I think, a simple explanation: it's really difficult, much more so than learning HTML, and the practical payoff is not obvious, to put it mildly.
As an aside, Attivio's unified information access architecture allows corporate ontologies to be directly imported, so a user can search through them, or perform SQL queries on them, including joins. Joins, in particular, are a powerful tool for understanding semantic web ontologies, and for using them to improve search and other kinds of business intelligence work. (You can read about our newly awarded join patent here.)
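To show what a join over an ontology looks like, here is a small sketch using Python's built-in sqlite3 module as a stand-in. This is not Attivio's product or API. The table layout, the LOCATED-IN relation and all of the names in the data are hypothetical, chosen only for illustration. Triples are stored as rows, and a self-join chains two relations to answer "which people work for a company in which place?"

```python
import sqlite3

# In-memory database holding one row per (subject, relation, object) triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, relation TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical sample data.
        ("Alice", "WORKS-FOR", "ExampleCorp"),
        ("Bob", "WORKS-FOR", "OtherCorp"),
        ("ExampleCorp", "LOCATED-IN", "Springfield"),
    ],
)

# Self-join: the object of a WORKS-FOR triple must be the subject
# of a LOCATED-IN triple.
rows = conn.execute(
    """
    SELECT w.subject AS person, l.object AS place
    FROM triples AS w
    JOIN triples AS l ON w.object = l.subject
    WHERE w.relation = 'WORKS-FOR' AND l.relation = 'LOCATED-IN'
    ORDER BY person
    """
).fetchall()

print(rows)
```

Bob drops out of the result because nothing in the table says where OtherCorp is located. That is exactly the kind of gap a join makes visible.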
Is there a solution? Can the creation of domain ontologies be automated — or at least made easier? Will something make it possible to combine different domain (and different site) semantic webs — at least with some minimum guarantees about reliability? I think so, and here's why.
At Attivio, we've been working on using statistical machine learning to learn how to extract relations from plain text. We're still working on it — it's a difficult problem — but we're making real progress and I'm pretty sure that we'll discuss the details of our work in future blog posts. For now, though, it's clear to us that there's a real advantage in being able to associate probabilities with the entities and relations that we find in a document, especially when we can accumulate information from millions of documents (or more). If we build a knowledge graph with weights on the entity nodes and relational edges, we start having a way to measure the reliability of different parts of a semantic web. We can also determine, for two separate semantic webs, what entities and relations we know are the same or different, and where we're unsure.
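Here is one way the weighting idea could be sketched in Python. To be clear, this is not our actual method: the noisy-OR combination (which assumes each document's evidence is independent), the 0.8 threshold, the function names and the sample triples are all assumptions made for illustration. The first function accumulates per-document confidences into one weight per triple, and the second compares two weighted semantic webs.

```python
from collections import defaultdict


def accumulate(extractions):
    """Combine per-document confidences for each triple using a noisy-OR.

    `extractions` is an iterable of (triple, probability) pairs, one per
    mention. The combined weight is 1 - product(1 - p), which assumes the
    mentions are independent pieces of evidence.
    """
    miss = defaultdict(lambda: 1.0)  # probability that every mention is wrong
    for triple, p in extractions:
        miss[triple] *= 1.0 - p
    return {triple: 1.0 - m for triple, m in miss.items()}


def compare(web_a, web_b, threshold=0.8):
    """Label each triple as 'same', 'different' or 'unsure' across two webs."""
    report = {}
    for triple in set(web_a) | set(web_b):
        pa = web_a.get(triple, 0.0)
        pb = web_b.get(triple, 0.0)
        hi, lo = max(pa, pb), min(pa, pb)
        if lo >= threshold:
            report[triple] = "same"        # both webs are confident
        elif hi >= threshold and lo <= 1.0 - threshold:
            report[triple] = "different"   # one confident, the other nearly silent
        else:
            report[triple] = "unsure"
    return report


# Hypothetical triples, written as (subject, relation, object) tuples.
t1 = ("Alice", "WORKS-FOR", "ExampleCorp")
t2 = ("Alice", "BORN-IN", "Springfield")
t3 = ("Bob", "WORKS-FOR", "ExampleCorp")

weights = accumulate([(t1, 0.5), (t1, 0.5), (t2, 0.3)])
report = compare({t1: 0.9, t2: 0.9, t3: 0.5}, {t1: 0.95, t3: 0.9})
```

Two middling mentions of the same triple add up to a stronger weight than either alone, which is why accumulating over millions of documents matters.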
Human ontology builders can't create probabilities like that, since humans are even worse at statistics than they are at semantics. (No blame here: both are really confusing to think about!) However, there's been a lot of research into relation and event extraction, as well as into machine learning using big data (or extreme information, if you prefer). So it's now possible to create tools that substantially help the process of building ontologies.
And, making no promises we'll regret, we hope that we'll be able to talk more about it soon.
Author Bio
John O'Neil has written and designed software for search, natural language processing and machine learning for 10 years. After receiving a Ph.D. in computational linguistics from Harvard University, he worked for Lingo Motors, where he designed their main commercial product and ended up with his name on a number of their patents, as well as for other search engine companies, where he worked to increase search relevancy and accuracy. He also worked for over five years at Basis Technology, Inc., where he was the designer and lead developer for the Rosette Linguistics Platform, their language processing and entity extraction suite of products.