Unified Information Access Blog

Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, and Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.

Thinking Like a Tester

As a member of what was, back then, just a three-person QA team, my heart sank when I read the title of one of our early blog posts stating that quality is job 1 to 3000+. My manager had recently transitioned into another group and his vacancy had yet to be filled. Our CTO, Sid Probstein, had recently closed Attivio's maiden blog post with a promise to expound on Attivio's deliberate approach to quality. So here it is, I thought: had I taken on an impossible task with unrealistic objectives?

Fortunately, though, thinking like a tester once again resuscitated me. The logic of abductive inference compelled me to continue reading Sid's post. I had to hold my breath until the closing statements convinced me that my initial interpretation had been misguided, but I was relieved to discover that 3000+ was actually a reference to the number of unit tests covering 81.2% of AIE V1.2. I eyed the 81.2% assertion suspiciously... Ah, things were going to be OK (we are now on AIE V3, and actually have almost 19,000 automated unit tests covering close to 85% of the code base).

Sid's post highlighted the premium Attivio puts on product quality. With a tip of the hat to Lessons Learned in Software Testing (Cem Kaner, James Bach, Bret Pettichord), I'd like to introduce another dynamic and indexable facet of our approach to quality at Attivio: Exploratory Testing.

AIE, Attivio's unified information access platform, is incredibly flexible, configurable, and extensible. You can feed multi-language documents into sophisticated workflows via custom clients, command-line connectors, and in-process configured connectors. You may want your timely insight that matters pushed to an active dashboard, prefer searching and exploring your unified information store via a custom GUI, or perhaps have occasions when only a sick multi-level SQL join will suffice. Of course, it goes without saying that all of this must be secure, stable, scalable, performant, highly available, and fault-tolerant.

While we continue to review, expand, and augment the unit tests that are the foundation of our quality strategy, our now fully-staffed QA team has also formally embarked on an exploratory voyage. More than tourists, like C.T. Granville setting sail on a whale watch, Attivio QA is on a quest, a never-ending journey to boldly go where no test or tester has gone before. Everything is fair game: stories, requirements, design, usability and documentation. We delve into all of it, bringing our own experience, curiosity, and skepticism to bear. Most importantly, we take inspiration from our customers who are constantly finding new and interesting ways to deploy our platform, challenging us to look at things from an entirely new perspective.

Recently, a friend sent me a Slate article about two studies that appeared in Cognition, an international journal that publishes theoretical and experimental papers on the study of the mind, which focused on how children learn. The article's author, Alison Gopnik, a professor at UC-Berkeley, who conducted one of the studies, says the studies "provide scientific support for the intuitions many teachers have had all along: Direct instruction really can limit young children's learning. Teaching is a very effective way to get children to learn something specific...But it also makes children less likely to discover unexpected information and to draw unexpected conclusions."

I think this assertion has relevance to people of any age. In the province of software testing, it's reasonable to equate 'direct instruction' with test scripts, whether manual or automated. There is significant value in them (e.g., regression, smoke, and config testing). However, they have the potential to hamstring a tester's most valuable assets: creativity, curiosity, and judgment.

At Attivio, we strive to take full advantage of those assets. The effort produces benefits beyond simply identifying additional defects and regressions. It augments our culture of communication, collaboration, and continuous learning/discovery and enables fresh and deep insight. That this is commensurate with AIE's capabilities is only fitting.

Author Bio

John McEleney is a senior member of the Attivio QA team and has been with Attivio for over three years. Prior to Attivio, John worked on and managed teams at BEA and Plumtree.

What AIE and unified information access mean for developers

There has been a lot of press recently on unified information access and how it enables business users and IT staff to reduce the time it takes to provide information to make better business decisions. The Active Intelligence Engine (AIE) not only provides value to business users; it also offers developers a number of advantages that make life much easier:

Index now, think later

One of the greatest advantages of AIE is the ability to facet and join on fields without having to do a lot of preprocessing or, more importantly, design work. For example, you might not know ahead of time that your customer database records are linkable to customer comments on your website, but you can easily find out with a single query after both information sources are indexed. A number of POCs and development spikes we have conducted have followed the pattern of indexing everything possible and then trying to infer relationships using queries. Unlike a database, where primary and foreign keys must be set up ahead of time, the index does not require this sort of predefined and rigid schema definition. Also, many of these features are so cheap to leave on that tuning isn't always necessary.

That's not to say that there aren't advantages to doing some tuning of the schema or the ingestion workflows, but it isn't required in order to start seeing the power and ROI of using our system.
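To make the "index everything, then infer relationships" pattern concrete, here is a minimal sketch in plain Python. It does not use the AIE API; the record layouts, the field names, and the `candidate_joins` helper are all hypothetical, and value overlap is just one simple heuristic for spotting a joinable pair of fields after the fact.

```python
from itertools import product


def field_values(records, field):
    """Collect the distinct non-empty values a field takes across records."""
    return {r[field] for r in records if r.get(field)}


def candidate_joins(left, right, min_overlap=0.5):
    """Rank (left_field, right_field) pairs by the share of left values found on the right."""
    left_fields = {f for r in left for f in r}
    right_fields = {f for r in right for f in r}
    scored = []
    for lf, rf in product(sorted(left_fields), sorted(right_fields)):
        lv, rv = field_values(left, lf), field_values(right, rf)
        if not lv:
            continue
        overlap = len(lv & rv) / len(lv)
        if overlap >= min_overlap:
            scored.append((overlap, lf, rf))
    return sorted(scored, reverse=True)


# Hypothetical, already-indexed sources: no keys were declared up front.
customers = [
    {"customer_id": "c1", "email": "ann@example.com"},
    {"customer_id": "c2", "email": "bob@example.com"},
]
comments = [
    {"comment_id": "k1", "author_email": "ann@example.com", "text": "Great product"},
    {"comment_id": "k2", "author_email": "bob@example.com", "text": "Needs work"},
    {"comment_id": "k3", "author_email": "eve@example.com", "text": "Hello"},
]

joins = candidate_joins(customers, comments)
print(joins)
```

In this toy data the only pair that clears the threshold is `email` against `author_email`, which is the kind of link you would then confirm with a real join query.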

Develop locally, deploy globally

Developing for a single-node, single-JVM system is a straightforward process on almost any platform. Some platforms also make it fairly easy to write business logic for a large distributed system. The key advantage we've found is being able to develop and test in a small localized environment, but then deploy to a large distributed system and not be surprised by system behaviors. In addition, it's important to be able to use a standard debugger when developing locally in a fully functional system, and then to use the same debugger in a distributed environment. AIE topology files provide an abstraction that separates the system functionality from the system deployment. This lets operations teams scale the system for QA, staging, and production environments without having to worry about functional issues with the configuration.

Some other systems, like Hadoop, force users into somewhat complex development models. AIE's development models strive to support the "I want to do X to Y" use case in the simplest possible manner. We ship a sample transformer that implements some simple business logic and, more importantly, we ship a unit test for the transformer.

Learn one API, let us handle the details

One of the hardest parts of building enterprise-wide applications is the need to work with multiple APIs. In addition, each system has its own idea of what a user is, what it means to have permissions to read a document and, more importantly, what a document is to begin with. If you can't define and normalize all of these concepts, it's impossible to join, group, categorize, and make decisions based on the data. AIE not only provides connectors to these back-end systems; we also handle normalizing each system's data to a standard format that is accessible via our API. A user in Active Directory can have permissions to a document in Documentum and a document in SharePoint. More importantly, the permissions are applied transparently at search time so that developers don't have to worry about doing any sort of post-filtering of results.

Attivio's development environment is meant to hide all of the enterprise ugliness from developers and present a single user, group, document, ACL, and query concept. If userX can see records in 10 repositories, we handle the details. If you want to join data from your internal SharePoint server to your CRM system based on a support person's contact information, we can do that for you as well.
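As a rough illustration of what a single user, group, document, and ACL concept buys you, the sketch below normalizes documents from two repositories into one shape and applies permissions inside the search call. This is plain Python, not the AIE API: the `Principal` and `Document` classes, the `search` function, and the sample data are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Principal:
    """A normalized user, regardless of which directory it came from."""
    name: str
    groups: frozenset = frozenset()


@dataclass
class Document:
    """A normalized document; `allowed` holds user or group names from its ACL."""
    doc_id: str
    source: str
    text: str
    allowed: set = field(default_factory=set)


def can_read(user, doc):
    """True if the user, or any group the user belongs to, is on the document's ACL."""
    return user.name in doc.allowed or bool(user.groups & doc.allowed)


def search(index, user, term):
    """Return matching document ids, with the ACL check applied as part of the query."""
    return [
        d.doc_id
        for d in index
        if term.lower() in d.text.lower() and can_read(user, d)
    ]


index = [
    Document("dctm-1", "Documentum", "Q3 support plan", {"support"}),
    Document("sp-7", "SharePoint", "Support escalation contacts", {"userX"}),
    Document("sp-9", "SharePoint", "Support budget", {"finance"}),
]
user_x = Principal("userX", frozenset({"support"}))

results = search(index, user_x, "support")
print(results)
```

The caller never sees `sp-9` and never has to filter results afterwards, because the permission check lives inside `search` rather than in application code.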

Author Bio

Since graduating from MIT with a degree in Computer Science, Will Johnson has worked for Altavista and FAST for over 7 years. At Altavista, Will developed AV's real-time indexing solution used by news aggregators who demanded instantaneous access to news as it arrived. In addition, he was one of two engineers responsible for developing the Altavista QIndexer product that was used by the large majority of AV's customers. At FAST, Will developed high-speed database connectors as well as search UIs and tool sets used across the organization. Will also worked on many of the largest and most complex sales engagements and deployments for customers around the world, specializing in distributed systems for many of the largest internet publishers and directories as well as internal knowledge management systems. Will is a founder, one of the Chief Architects at Attivio and a really nice guy.

The (Real) Semantic Web Requires Machine Learning

We think about the semantic web in two complementary (and equivalent) ways. It can be viewed as:

• A large set of subject-verb-object triples, where the verb is a relation and the subject and object are entities

OR

• A large graph or network, where the nodes of the graph are entities and the graph's directed edges or arrows are the relations between nodes.

As a reminder, entities are proper names, like people, places, companies, and so on. Relations are meaningful events, outcomes or states, like BORN-IN, WORKS-FOR, MARRIED-TO, and so on. Each entity (like "John O'Neil", "Attivio" or "Newton, MA") has a type (like "PERSON", "COMPANY" or "LOCATION") and each relation is constrained to only accept certain types of entities. For example, WORKS-FOR may require a PERSON as the subject and a COMPANY as the object.
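The two views, and the type constraints on relations, can be sketched in a few lines of Python. The entity names, types, and relation names come from the text above; the specific facts and the BORN-IN signature are illustrative assumptions, not statements about real people or about how any particular semantic web toolkit stores its data.

```python
from collections import defaultdict

# Each entity has a type.
ENTITY_TYPES = {
    "John O'Neil": "PERSON",
    "Attivio": "COMPANY",
    "Newton, MA": "LOCATION",
}

# Each relation only accepts certain (subject type, object type) pairs.
RELATION_SIGNATURES = {
    "WORKS-FOR": ("PERSON", "COMPANY"),
    "BORN-IN": ("PERSON", "LOCATION"),  # assumed signature, for illustration
}


def is_valid(triple):
    """Check a subject-verb-object triple against the relation's type signature."""
    subject, relation, obj = triple
    signature = RELATION_SIGNATURES.get(relation)
    if signature is None:
        return False
    return (ENTITY_TYPES.get(subject), ENTITY_TYPES.get(obj)) == signature


def to_graph(triples):
    """View 2: nodes are entities, directed labeled edges are relations."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


def to_triples(graph):
    """Convert the graph back to view 1, a flat set of triples."""
    return [(s, r, o) for s, edges in graph.items() for r, o in edges]


# Illustrative fact only.
triples = [("John O'Neil", "WORKS-FOR", "Attivio")]
graph = to_graph(triples)

print(is_valid(triples[0]))
print(is_valid(("Attivio", "WORKS-FOR", "Newton, MA")))
print(sorted(to_triples(graph)) == sorted(triples))
```

Converting to a graph and back loses nothing, which is what "equivalent" means here; the second check fails because a COMPANY cannot be the subject of WORKS-FOR.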

How semantic web information is organized and transmitted is described by a blizzard of technical standards and XML namespaces. Once you escape from that, the basic goals of the semantic web are (1) to allow a lot of useful information about the world to be simply expressed, in a way that (2) allows computers to do useful things with it.

Almost immediately, some problems crop up. As generations of artificial intelligence researchers have learned, it can be really difficult to encode real-world knowledge into predicate logic, which is more-or-less what the semantic web is. The same AI researchers also learned that different people will almost inevitably create knowledge encodings that can't easily be compared, because they use different (sometimes subtly, maddeningly different) basic definitions and concepts. Another difficult problem is deciding when entity names refer to the "same" real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged? For example, do an Internet search for "John O'Neil", and try to decide how many different people the results refer to. Believe me, the results are not all for the same person.

As for relations, it's difficult to tell when they really mean the same thing across different knowledge encodings. No matter how careful you are, if you want to use relations to infer new facts, you have few resources to check to see if the combined information is valid.

So, when each web site can define its own entities and relations, independently of any other web site, how do you reconcile entities and relations defined by different people?

One technique is to require (or STRONGLY SUGGEST) the use of a shared ontology. (For our purposes, an ontology is one person's — or one company's — semantic web).

Perhaps, if it were carefully designed, it would be possible to allow anyone to add to it without making it unusable. Wikipedia might serve as an inspiration here. However, this is generally impractical, for a number of reasons:

  1. A lot of smart people have tried to do this in the past, and they've obviously failed.
  2. Wikipedia has grown a community that is good (perhaps too good) at discussing how articles should be written. However, it's not clear that any community could become competent to discuss semantic web issues in detail, let alone come to agreement about them.

The major problem is the "open-world" requirement implicit in the semantic web. In a closed world or a limited domain (even if the limited domain isn't small), it's possible to agree on the ontological issues and get to work. Many companies have put a lot of effort into creating their domain ontologies, and some have even found a day-to-day use for them. However, it takes a lot of continuously ongoing work to maintain a good domain ontology.

Even if companies were willing to open-source their ontologies, their domain is closed — and once you start trying to knit different domain ontologies together, you quickly start seeing the problems discussed above.

By the way, the fact that the semantic web has failed to be widely adopted has, I think, a simple explanation: it's really difficult, much more so than learning HTML, and the practical payoff is not obvious, to put it mildly.

As an aside, Attivio's unified information access architecture allows corporate ontologies to be directly imported, so a user can search through them, or perform SQL queries on them, including joins. Joins, in particular, are a powerful tool for understanding semantic web ontologies, and for using them to improve search and other kinds of business intelligence work. (You can read about our newly awarded join patent here.)

Is there a solution? Can the creation of domain ontologies be automated — or at least made easier? Will something make it possible to combine different domain (and different site) semantic webs — at least with some minimum guarantees about reliability? I think so, and here's why.

At Attivio, we've been working on using statistical machine learning to learn how to extract relations from plain text. We're still working on it — it's a difficult problem — but we're making real progress and I'm pretty sure that we'll discuss the details of our work in future blog posts. For now, though, it's clear to us that there's a real advantage in being able to associate probabilities with the entities and relations that we find in a document, especially when we can accumulate information from millions of documents (or more). If we build a knowledge graph with weights on the entity nodes and relational edges, we start having a way to measure the reliability of different parts of a semantic web. We can also determine, for two separate semantic webs, what entities and relations we know are the same or different, and where we're unsure.
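To show why weights on edges help, here is a minimal sketch of accumulating per-document extraction confidences into a single weight per relation. The combination rule (a noisy-OR, which treats extractions as independent), the thresholds, and the sample entities are all assumptions chosen for illustration; this is not a description of Attivio's actual method.

```python
from collections import defaultdict


def accumulate(extractions):
    """Combine per-document confidences for each triple with a noisy-OR.

    Assumes extractions are independent: the triple is wrong only if
    every individual extraction of it is wrong.
    """
    miss = defaultdict(lambda: 1.0)  # probability that all extractions so far are wrong
    for triple, confidence in extractions:
        miss[triple] *= 1.0 - confidence
    return {triple: 1.0 - m for triple, m in miss.items()}


def classify(weight, accept=0.9, reject=0.3):
    """Bucket an edge weight; the thresholds are arbitrary illustrative values."""
    if weight >= accept:
        return "reliable"
    if weight <= reject:
        return "unreliable"
    return "unsure"


# Hypothetical (triple, confidence) pairs, one per document mention.
works_for = ("Ann Lee", "WORKS-FOR", "Acme")
born_in = ("Ann Lee", "BORN-IN", "Springfield")
extractions = [
    (works_for, 0.6),
    (works_for, 0.7),
    (works_for, 0.5),
    (born_in, 0.4),
]

weights = accumulate(extractions)
labels = {triple: classify(w) for triple, w in weights.items()}
print(weights)
print(labels)
```

Three middling extractions of the same relation add up to a confident edge, while a single weak extraction stays in the "unsure" bucket. That bucket is exactly what you need when deciding whether two separate semantic webs agree.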

Human ontology builders can't create probabilities like that, since humans are even worse at statistics than they are at semantics. (No blame here — both are really confusing to think about!) However, there's been a lot of research into relation and event extraction, as well as in machine learning using big data (or extreme information, if you prefer). So it's now possible to create tools that substantially help the process of building ontologies.

And, making no promises we'll regret, we hope that we'll be able to talk more about it soon.

Author Bio

John O'Neil has written and designed software for search, natural language processing, and machine learning for 10 years. After receiving a Ph.D. in computational linguistics from Harvard University, he worked for Lingo Motors, where he designed their main commercial product and ended up with his name on a number of their patents, as well as for other search engine companies, where he worked to increase search relevancy and accuracy. He also worked for over five years at Basis Technology, Inc., where he was the designer and lead developer for the Rosette Linguistics Platform, their language processing and entity extraction suite of products.
