Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.
One of our colleagues at Attivio has a niece and nephew who are as fluent in Japanese as they are in English. Their mom is Japanese and their dad is American, so they have a completely bilingual household. In one moment they might talk to each other in English, their Mom or Dad might call to them from another room in Japanese, and they will answer in kind, switching between their two languages as easily as switching TV channels.
Unified information access (UIA) technology is a lot like being strongly bilingual: UIA also quickly and easily communicates information that spans different worlds, specifically structured data (databases) and unstructured content (documents/text), whether from internal or external sources.
Just as our colleague's niece and nephew can communicate as easily with anyone when visiting Japan as they can at home, a true UIA platform can also freely communicate with disparate information sources and with other applications, particularly BI tools, self-service dashboards, and analytic systems. Doing so requires supporting the widely used SQL (Structured Query Language) standard via ODBC/JDBC connectivity.
Background
Much energy and effort has gone into the production of tools and technologies to analyze data. From iPhone and iPad apps to spreadsheets, reporting tools, "self-service" dashboards, various analytic systems, right on up to full-blown ad-hoc drag & drop BI tools, we live in an era where everything is analyzed, and the tools we use for that analysis actually contribute to better decisions. One of the keys to the interoperability of this huge ecosystem is a standards-based approach: the broad use of the Structured Query Language (SQL) is the reason the ecosystem exists.
The downside to many of these tools is that they operate only on so-called structured data and, until recently, have ignored the valuable context contained in unstructured sources. Without an integrated and fully correlated view of the complete picture, organizations will miss out on a much wider world of business insights and understanding, not unlike relying on a really bad language interpreter (poor Bill Murray!).
Unified Information Access
Fortunately, Attivio’s Active Intelligence Engine (AIE) gives you “the best of both worlds.” Because AIE supports querying in SQL via ODBC and JDBC, organizations can use it to explore all information regardless of source or format. By deploying AIE as a back-end unified information source, your users can continue to use the BI and other tools they are comfortable with, but now with the added ability to access a far more complete picture of the business. That supports more informed decisions and a deeper understanding than is possible when working with structured or unstructured information alone.

One key to making this happen: Active Intelligence SQL (AI-SQL) - a set of full-text function extensions to SQL. AI-SQL functions make it easy for SQL query authors to incorporate AIE's unique features including operations like:
The fulltextsearch function enables blended search and analytic user interfaces by enabling applications to plug user search box input into a SQL query to allow the user to interact with data - but in a controlled, simple way.
Some examples of using AI-SQL extensions:
select r_regionkey,r_name from all_tables where r_name = regex('e.*e')
select p_partkey,p_container from all_tables where p_container = startswith('brown')
select p_partkey,p_container from all_tables where p_container = endswith('car')
select p_partkey,p_container from all_tables where p_container = near('brown','car','set(distance=1)')
select p_partkey,p_container from all_tables where p_container = onear('brown','car','set(distance=1)')
select company.company, company.ticker, count(news.newsarticleid)
FROM company INNER JOIN news ON company.ticker = news.ticker
WHERE news.content=fulltextsearch(?UserSearch)
GROUP BY company.company, company.ticker
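To make the "plug user search box input into a SQL query" idea concrete, here is a minimal sketch of binding search text as a query parameter instead of concatenating it into the SQL string. It runs against an in-memory SQLite database, not AIE: SQLite has no fulltextsearch function, so a plain LIKE substring match stands in for it, and the table contents, the `build_demo_db` helper, and the `count_matching_news` helper are all made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical company and news rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE company (company TEXT, ticker TEXT);
        CREATE TABLE news (newsarticleid INTEGER, ticker TEXT, content TEXT);
        INSERT INTO company VALUES ('Acme Corp', 'ACME'), ('Globex', 'GLBX');
        INSERT INTO news VALUES
            (1, 'ACME', 'Acme announces merger talks'),
            (2, 'ACME', 'Quarterly earnings beat estimates'),
            (3, 'GLBX', 'Globex merger approved by regulators'),
            (4, 'GLBX', 'Globex completes merger');
    """)
    return conn


def count_matching_news(conn, user_search):
    """Count news articles per company whose content contains the search text.

    The LIKE clause is only a stand-in for AIE's fulltextsearch; the point
    here is that the user's text is bound as a parameter, never spliced in.
    """
    sql = """
        SELECT company.company, company.ticker, COUNT(news.newsarticleid)
        FROM company INNER JOIN news ON company.ticker = news.ticker
        WHERE news.content LIKE '%' || ? || '%'
        GROUP BY company.company, company.ticker
        ORDER BY company.ticker
    """
    return conn.execute(sql, (user_search,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in count_matching_news(conn, "merger"):
        print(row)
```

Because the search text travels as a bound parameter, whatever the user types is treated as data to match, which is what keeps the interaction "controlled" from the application's point of view.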
The Importance of JOIN
It should be noted that it is no easy feat to support a detailed query language like SQL, while also serving as a UIA platform that can ingest and search across countless information sources at massive scale.
Some SQL capabilities, like a single field GROUP BY, are easily accommodated by unstructured search, and are relatively straightforward to handle. For example, given the following query:
SELECT name,count(*) FROM customers GROUP BY name
This can be easily issued as an AIE facet query, asking for facet values and counts on the name field:
query-request> table:customers
facet-request> name
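The equivalence between a single-field GROUP BY and a facet count can be checked on a toy data set. The sketch below is not AIE code: it counts values of a field over a list of documents (which is roughly what a facet request returns) and compares that with the result of the SQL query run in an in-memory SQLite database. The sample customers and both helper names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical documents, as a search index might hold them.
customers = [
    {"name": "Ann"},
    {"name": "Bob"},
    {"name": "Ann"},
    {"name": "Cy"},
    {"name": "Ann"},
]


def facet_counts(documents, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in documents if field in doc)


def sql_group_by_counts(documents):
    """Run SELECT name, count(*) ... GROUP BY name over the same rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?)",
        [(doc["name"],) for doc in documents],
    )
    rows = conn.execute(
        "SELECT name, count(*) FROM customers GROUP BY name"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(dict(facet_counts(customers, "name")))
    print(sql_group_by_counts(customers))
```

Both functions produce the same value-to-count mapping, which is why this class of SQL query maps so naturally onto a facet request.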
However, other SQL features, like JOIN, are among the most difficult to support. Happily, they are handled by AIE’s patented ability to dynamically JOIN data and content on an unstructured index without advance data modeling. These advanced AIE capabilities allow us to execute all of the queries of the TPC-H benchmark, including TPC-H "Query 3", which joins three tables and aggregates the results:
SELECT l_orderkey,
SUM(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
ORDER BY revenue DESC,
o_orderdate
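To see what this query computes, here is a small sketch that runs the same SQL against an in-memory SQLite database rather than AIE. The rows are a handful of made-up values, not TPC-H benchmark data, and the tables carry only the columns the query touches; `run_q3_demo` is a hypothetical helper name.

```python
import sqlite3

# The TPC-H "Query 3" text from above, unchanged apart from layout.
TPCH_Q3 = """
SELECT l_orderkey,
       SUM(l_extendedprice*(1-l_discount)) AS revenue,
       o_orderdate,
       o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < '1995-03-15'
  AND l_shipdate > '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate
"""


def run_q3_demo():
    """Load a few illustrative rows and return the Query 3 result set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (c_custkey INTEGER, c_mktsegment TEXT);
        CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER,
                             o_orderdate TEXT, o_shippriority INTEGER);
        CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                               l_discount REAL, l_shipdate TEXT);

        INSERT INTO customer VALUES (1, 'BUILDING'), (2, 'AUTOMOBILE');

        -- Order 11 is placed too late; order 12 belongs to another segment.
        INSERT INTO orders VALUES
            (10, 1, '1995-03-01', 0),
            (11, 1, '1995-03-20', 0),
            (12, 2, '1995-03-01', 0),
            (13, 1, '1995-02-10', 0);

        -- The third line item of order 10 shipped before the cutoff date.
        INSERT INTO lineitem VALUES
            (10, 100.0, 0.5,  '1995-03-20'),
            (10, 200.0, 0.0,  '1995-03-25'),
            (10,  50.0, 0.0,  '1995-03-10'),
            (11, 500.0, 0.0,  '1995-03-30'),
            (12, 300.0, 0.0,  '1995-03-20'),
            (13, 400.0, 0.25, '1995-03-16');
    """)
    rows = conn.execute(TPCH_Q3).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_q3_demo():
        print(row)
```

Only orders 10 and 13 survive all three filters, and each is reported once with its discounted revenue summed across the qualifying line items. ISO-formatted date strings are used so that plain string comparison orders the dates correctly in SQLite.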
Attivio, like many other companies, uses open source where appropriate. The Java community in particular has great open source technologies powering some of the hottest technology trends today, from big data (Hadoop and MapReduce) to columnar storage (MonetDB) to IDE frameworks (Eclipse).
There's also a trove of infrastructure technologies out there for logging, dependency injection, character set normalization, etc. Knowing when to use open source and when to write your own code is where the real value comes in as a developer and, more importantly, can save you and the company a great deal of time and resources.
All that being said, there are often bugs and functional gaps in open source code, and as a developer you need to have a system in place that allows you to handle these issues. For example, we may find a bug in a particular package and submit a patch, but since we don't control the release cycle of the open source projects, we can't simply wait around for the fix. We are also very picky about less critical run-time issues such as threads being left around after a process or unit test finishes. Many open source projects assume that they are going to be run as a standalone server that terminates with the JVM, and these sorts of assumptions can break or disturb our unit test environments.
We've recently switched to the vendor branching methodology described here against our Subversion repository. This allows us to import external projects into our revision control system.
When we encounter an issue in open source code, we:
Lastly, we strive to create a formal ticket, patch and test for contribution back to the open source community. In all fairness, this is our weakest part of the process, but one we are striving to improve. Many of our changes are small in nature and fix either esoteric edge cases or general code cleanliness like the thread example I mentioned above, but most changes are still useful for the wider community.
I recently attended the YES Boston panel on Big Data & Analytics at the Harvard Innovation Center. Overall it was an excellent discussion. At least one panelist indicated that "It feels like 1995 again", referring to that heady period when the Internet emerged and drove the dot-com era forward. Most of the focus was on the superb opportunities that Big Data creates for entrepreneurs. A few panelists also suggested that Big Data would lead to the death of the traditional relational database and data warehouse.
Earlier in the discussion, one panelist characterized Big Data as having "three V's": Volume, Velocity, and Variety. This has become the standard way to segment the various use cases that collectively add up to "Big Data", along with a number of other often-cited characteristics like value and complexity. However, another panelist then said that Big Data was mostly about unstructured information. I have previously written about how not all unstructured data is the same. Most "unstructured" information in the context of Big Data is "data" (variable-length log files, etc.), not truly unstructured CONTENT, which includes articles, web pages, documents, email, etc. It is important to understand the difference, especially from the entrepreneur's perspective. Much of the "volume" in Big Data is unstructured DATA (again, mostly log files) with very low individual value. It is only when analyzed in volume that it becomes interesting and valuable.
One of the more interesting questions at the end of the panel came from Wikibon's Dave Vellante. He asked, more or less, "Why are big data and the new technologies that are emerging to analyze it going to be disruptive to the enterprise data warehouse?"
Here, the panelists' answers seemed uncertain. Several spoke about the challenge of getting centralized IT to produce new information from legacy BI tools. While this is probably impossible to argue with, at least one panelist went too far, saying that data will just be kept in new systems and BI tools will work directly against them. I didn't buy this angle, and in a follow-up conversation with Dave, he agreed with me that the panel mostly missed the mark.
I would have answered Dave's question like this: the key with Big Data is to take the volume of low-value items and turn it into high-value analysis. That analysis then needs to be co-mingled with other information that has high item value. This includes email, documents, text in applications, rows in databases, ERP, CRM, etc. That isn't disruptive to the eDW in and of itself. Most big data, like behavioral information, won't be interesting to typical corporate decision makers. A few data scientists will analyze click streams and use them to optimize the end user experience. The transactions (hopefully sales) that result from that will go into the eDW. The yield improvements will also be analyzed and tracked over time, again probably by traditional BI tools.
The best example I can give of a real-world case is from Attivio's IT Knowledge Expert solution. ITKE analyzes log files from operating systems and applications to identify events that are interesting and/or problematic. For example, we may drop or summarize informative messages, keep warnings, and correlate errors. This is great because it helps system administrators quickly discover the symptoms of an issue. However, it is in the other data (the high-value articles, knowledge bases, SharePoint articles created by previous admins, etc.) that the solution to the problem is found. This is why I refer to content as high-value. It explains WHY things happen.
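As a rough illustration of the "summarize informative messages, keep warnings, correlate errors" idea, here is a minimal sketch. It is not ITKE code: the `LEVEL component message` line format, the `triage` function, and the choice to "correlate" errors simply by grouping them per component are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def triage(log_lines):
    """Reduce raw log lines to a summary, a warning list, and grouped errors.

    Assumes every line looks like 'LEVEL component message text'
    (a hypothetical format chosen for this sketch).
    """
    info_summary = Counter()               # INFO lines are only counted
    warnings = []                          # WARN lines are kept verbatim
    errors_by_component = defaultdict(list)  # ERROR messages grouped together

    for line in log_lines:
        level, component, message = line.split(" ", 2)
        if level == "INFO":
            info_summary[component] += 1
        elif level == "WARN":
            warnings.append(line)
        elif level == "ERROR":
            errors_by_component[component].append(message)

    return info_summary, warnings, dict(errors_by_component)


if __name__ == "__main__":
    sample = [
        "INFO db connection pool started",
        "INFO db checkpoint complete",
        "WARN web response time above threshold",
        "ERROR db disk quota exceeded",
        "ERROR db write failed",
        "INFO web request served",
    ]
    summary, warnings, errors = triage(sample)
    print(dict(summary))
    print(warnings)
    print(errors)
```

Even in this toy form, six low-value lines collapse into one related pair of errors on a single component. That is the symptom; the explanation of why the disk filled up would still have to come from the high-value content.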