Unified Information Access Blog

Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.

One of our colleagues at Attivio has a niece and nephew who are as fluent in Japanese as they are in English. Their mom is Japanese and their dad is American, so they have a completely bilingual household. One moment they might be talking to each other in English; the next, their mom or dad might call to them from another room in Japanese, and they will answer in kind, switching between their two languages as easily as switching TV channels.

Unified information access (UIA) technology is a lot like being strongly bilingual, in that UIA also quickly and easily communicates information that spans different worlds: specifically, structured data (databases) and unstructured content (documents/text), whether from internal or external sources.

Just as our colleague's niece and nephew can communicate as easily with anyone when visiting Japan as they can at home, a true UIA platform can also freely communicate with disparate information sources and with other applications, particularly BI tools, self-service dashboards and analytic systems. Doing so requires supporting the widely used SQL (Structured Query Language) standard via ODBC/JDBC connectivity.

Background

Much energy and effort has gone into the production of tools and technologies to analyze data. From iPhone and iPad apps to spreadsheets, reporting tools, "self-service" dashboards, various analytic systems, right on up to full-blown ad-hoc drag & drop BI tools, we live in an era where everything is analyzed, and the tools we use for that analysis actually contribute to better decisions. One of the keys to the interoperability of this huge ecosystem is a standards-based approach: the broad use of the Structured Query Language (SQL) is the reason the ecosystem exists.

The downside to many of these tools is that they operate only on so-called structured data and, until recently, ignored the valuable context contained in unstructured sources. Without an integrated and fully correlated view of the complete picture, organizations miss out on a much wider world of business insights and understanding, not unlike relying on a really bad language interpreter (poor Bill Murray!).

Unified Information Access

Fortunately, Attivio’s Active Intelligence Engine (AIE) gives you “the best of both worlds.” Because AIE supports querying in SQL via ODBC and JDBC, organizations can use it to explore all information regardless of source or format. By deploying AIE as a back-end unified information source, your users can continue to use the BI and other tools they are comfortable with, but now with access to a far more complete picture of the business. The result is better-informed decisions and a depth of understanding that is simply not possible when working with structured or unstructured information alone.

[Screenshot: Attivio and TIBCO Spotfire]

One key to making this happen is Active Intelligence SQL (AI-SQL), a set of full-text function extensions to SQL. AI-SQL functions make it easy for SQL query authors to incorporate AIE's unique features, including operations like:

  • REGEX: find rows based on a pattern in a field.
  • STARTSWITH: find rows in which a given field starts with a specified string.
  • ENDSWITH: find rows in which a given field ends with a specified string.
  • NEAR: find rows based on two or more terms being within a specific number of words of each other.
  • ONEAR: find rows based on two or more terms being within a specific number of words of each other, in the order they are specified to the function.
  • FULLTEXTSEARCH: apply a simple query language query as a filter to a specified field.
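
To make the difference between NEAR and ONEAR concrete, here is a minimal Python sketch of the two proximity tests. It is an illustration only, not AIE's implementation: the near() function, its simple word tokenizer, and the convention that adjacent words are at distance 1 are all assumptions made for this example.

```python
import re


def _positions(tokens, term):
    """Return every index at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def near(text, first, second, distance=1, ordered=False):
    """Return True if both terms occur within `distance` words of each other.

    With ordered=True (the ONEAR-style test), `first` must appear before
    `second`. Assumption for this sketch: adjacent words are at distance 1.
    """
    tokens = re.findall(r"\w+", text.lower())
    for i in _positions(tokens, first.lower()):
        for j in _positions(tokens, second.lower()):
            gap = j - i
            if ordered and gap <= 0:
                continue  # wrong order for the ordered variant
            if 0 < abs(gap) <= distance:
                return True
    return False
```

Under these assumptions, near("car brown", "brown", "car") matches, while the same call with ordered=True does not, because the terms appear in the opposite order.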

The fulltextsearch function makes blended search and analytic user interfaces possible: an application can plug the user's search box input into a SQL query, letting the user interact with the data in a controlled, simple way.

Some examples of using AI-SQL extensions:

select r_regionkey,r_name from all_tables where r_name = regex('e.*e')
select p_partkey,p_container from all_tables where p_container = startswith('brown')
select p_partkey,p_container from all_tables where p_container = endswith('car')
select p_partkey,p_container from all_tables where p_container = near('brown','car','set(distance=1)')
select p_partkey,p_container from all_tables where p_container = onear('brown','car','set(distance=1)')

select company.company, company.ticker, count(news.newsarticleid)
FROM company INNER JOIN news ON company.ticker = news.ticker
WHERE news.content=fulltextsearch(?UserSearch)
GROUP BY company.company, company.ticker
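
The "controlled" part of the last query comes from binding the user's search text as a parameter rather than pasting it into the SQL string. The sketch below shows that pattern in Python with the standard sqlite3 module. Note the assumptions: SQLite has no fulltextsearch() function, so a plain substring LIKE filter stands in for it, and the table contents and helper names (build_demo_db, count_articles) are invented for the example.

```python
import sqlite3

# Stand-in query: SQLite has no fulltextsearch(), so a substring LIKE filter
# takes its place. The "?" placeholder plays the role of ?UserSearch.
SEARCH_SQL = """
SELECT company.company, company.ticker, COUNT(news.newsarticleid)
FROM company INNER JOIN news ON company.ticker = news.ticker
WHERE news.content LIKE '%' || ? || '%'
GROUP BY company.company, company.ticker
ORDER BY company.ticker
"""


def build_demo_db():
    """Create an in-memory database with made-up company and news rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (company TEXT, ticker TEXT)")
    conn.execute(
        "CREATE TABLE news (newsarticleid INTEGER, ticker TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO company VALUES (?, ?)",
        [("Acme Corp", "ACME"), ("Globex", "GLBX")],
    )
    conn.executemany(
        "INSERT INTO news VALUES (?, ?, ?)",
        [
            (1, "ACME", "Acme reports record earnings"),
            (2, "ACME", "Acme opens new plant"),
            (3, "GLBX", "Globex earnings fall short"),
        ],
    )
    return conn


def count_articles(conn, user_search):
    """Count matching articles per company; the input is bound, never spliced."""
    return conn.execute(SEARCH_SQL, (user_search,)).fetchall()
```

Because the search text is only ever passed as a bound value, input such as x' OR '1'='1 is treated as a literal string to search for and cannot change the shape of the query.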

The Importance of JOIN

It should be noted that it is no easy feat to support a detailed query language like SQL, while also serving as a UIA platform that can ingest and search across countless information sources at massive scale.

Some SQL capabilities, like a single-field GROUP BY, map naturally onto unstructured search and are relatively straightforward to handle. For example, given the following query:

SELECT name,count(*) FROM customers GROUP BY name


This can be easily issued as an AIE facet query, asking for facet values and counts on the name field:

query-request> table:customers
facet-request>
name


However, other SQL features, like JOIN, are among the most difficult to support. Happily, they are handled by AIE’s patented ability to dynamically JOIN data and content on an unstructured index without advance data modeling. These advanced AIE capabilities allow us to execute all of the queries of the TPC-H benchmark, including TPC-H "Query 3", which joins three tables and aggregates the results:

SELECT l_orderkey,  
SUM(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
ORDER BY revenue DESC,
o_orderdate
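
To see what this query computes, it can help to run it against a handful of rows. The Python sketch below loads a tiny, made-up dataset into SQLite and executes the same statement. The rows, the reduced column set, and the run_query3 helper are assumptions for illustration, and dates are stored as ISO-8601 strings so that string comparison orders them correctly.

```python
import sqlite3

QUERY_3 = """
SELECT l_orderkey,
       SUM(l_extendedprice*(1-l_discount)) AS revenue,
       o_orderdate,
       o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < '1995-03-15'
  AND l_shipdate > '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate
"""


def run_query3():
    """Build a toy TPC-H-like schema in memory and run Query 3 against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (c_custkey INTEGER, c_mktsegment TEXT);
        CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER,
                             o_orderdate TEXT, o_shippriority INTEGER);
        CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                               l_discount REAL, l_shipdate TEXT);
    """)
    conn.executemany("INSERT INTO customer VALUES (?, ?)",
                     [(1, "BUILDING"), (2, "AUTOMOBILE")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
        (10, 1, "1995-03-01", 0),  # qualifies
        (11, 1, "1995-03-20", 0),  # ordered too late
        (12, 2, "1995-03-01", 0),  # wrong market segment
        (13, 1, "1995-02-01", 0),  # qualifies
    ])
    conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", [
        (10, 100.0, 0.5, "1995-03-20"),
        (10, 200.0, 0.0, "1995-03-25"),
        (10, 50.0, 0.0, "1995-03-10"),   # shipped too early, excluded
        (11, 500.0, 0.0, "1995-04-01"),
        (12, 300.0, 0.0, "1995-03-20"),
        (13, 400.0, 0.25, "1995-03-16"),
    ])
    rows = conn.execute(QUERY_3).fetchall()
    conn.close()
    return rows
```

With this data only orders 10 and 13 survive the filters, and they come back sorted by revenue, highest first.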

Partnering for Success

While AIE supports a wide range of SQL, as with any backend system, some SQL operations will not be available. According to a leading analyst, no relational database actually supports the full ANSI-92 standard.

BI vendors recognize this and provide mechanisms to tune the SQL that a tool tries to issue to a specific backend. To help our customers succeed with a given BI tool, Attivio has implemented a BI vendor certification program, which currently includes Tableau, TIBCO Spotfire and QlikView.

This program certifies tools at two levels:
  • Gold: The BI tool is certified to work with AIE using standard SQL.
  • Platinum: The BI tool is certified to work with AIE using standard SQL, and also supports the use of AI-SQL extension functions.

Opening up the flexibility of AIE’s universal index to a powerful industry-standard language like SQL, via ODBC and JDBC, enables BI tools and scores of other compatible applications to access and analyze unified information. This has proven to be a compelling value proposition for organizations looking for new opportunities to build revenue, cut costs or increase competitiveness. Clearly, AIE also speaks the universal language of business success.

Learn more about AIE’s support for SQL via ODBC and JDBC, as well as AIE’s Query-Time JOIN with no data modeling required.

Attivio, like many other companies, uses open source where appropriate. The Java community in particular has great open source technologies powering some of the hottest technology trends today, from big data (Hadoop and MapReduce) to columnar storage (MonetDB) to IDE frameworks (Eclipse).

There's also a trove of infrastructure technologies out there for logging, dependency injection, character set normalization, etc. Knowing when to use open source and when to write your own code is where the real value comes in as a developer and, more importantly, can save you and the company a great deal of time and resources.

All that being said, there are often bugs and functional gaps in open source code, and as a developer you need a system in place that allows you to handle these issues. For example, we may find a bug in a particular package and submit a patch, but since we don't control the release cycle of the open source project, we can't simply wait around for the fix. We are also very picky about less critical run-time issues, such as threads being left around after a process or unit test finishes. Many open source projects assume that they are going to be run as a standalone server that terminates with the JVM, and these sorts of assumptions can break or disturb our unit test environments.

We've recently switched to the vendor branching methodology in our Subversion repository. This allows us to import external projects into our revision control system.

When we encounter an issue in open source code, we:

  • Check in a copy of the source code for the class(es) we plan to fix or enhance. It's important to have this copy so we can easily compare and merge new revisions of the upstream project with our own modifications.
  • Make changes to the class(es), annotating them with //attivio start mod and //attivio end mod so that a quick scan makes the changes easy to tease out. We also put a reference to the internal developer ticket in the code, so later reviewers can easily trace those changes when upgrading upstream code.
  • Compile and build these changes into a JAR file that contains not only the class files, but also the source files in order to comply with many open source licenses. Generally speaking, there is no intellectual property in this code so we don't have to worry about anyone reviewing our changes.
  • Deploy this JAR to our internal Maven repository with an Attivio-specific revision. We then update our top-level POM (Project Object Model) to reference this Attivio version instead of the public Maven repository version.
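
The marker comments in the second step are what make later upgrades tractable, because they can be found mechanically. As a rough sketch of such a scan, the Python function below reports the line ranges enclosed by the markers in a source file. The function name and the ticket reference in the usage example are hypothetical, and this is not the tooling we actually use.

```python
import re

# Matches "//attivio start mod" or "//attivio end mod", tolerating extra
# spaces and any trailing text such as a ticket reference.
MARKER = re.compile(r"//\s*attivio\s+(start|end)\s+mod\b", re.IGNORECASE)


def find_mod_regions(source):
    """Return (start_line, end_line) pairs, 1-indexed, for each marked region.

    Raises ValueError if the markers are nested or left unbalanced.
    """
    regions = []
    start = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = MARKER.match(line.strip())
        if not match:
            continue
        kind = match.group(1).lower()
        if kind == "start":
            if start is not None:
                raise ValueError(f"nested start marker at line {lineno}")
            start = lineno
        else:
            if start is None:
                raise ValueError(f"end marker without start at line {lineno}")
            regions.append((start, lineno))
            start = None
    if start is not None:
        raise ValueError(f"start marker at line {start} is never closed")
    return regions


# Usage example with a made-up class and a hypothetical ticket id.
SAMPLE = """\
public class Pool {
    //attivio start mod TICKET-123
    private boolean daemon = true;
    //attivio end mod
    void run() {}
}
"""
```

Calling find_mod_regions(SAMPLE) yields one region covering lines 2 through 4, which is the span a reviewer would need to re-apply after importing a new upstream revision.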

Lastly, we strive to create a formal ticket, patch and test for contribution back to the open source community. In all fairness, this is the weakest part of our process, but one we are striving to improve. Many of our changes are small in nature and either fix esoteric edge cases or address general code cleanliness, like the thread example I mentioned above, but most changes are still useful for the wider community.

Related:

Enterprise Strategy Group Report - Today's Information Access Requirements Outpace Open Source Search Options

Thinking About Replacing Your Search Engine or Search Appliance?

Software at the Speed of Light

I recently attended the YES Boston panel on Big Data & Analytics at the Harvard Innovation Center. Overall it was an excellent discussion. At least one panelist indicated that "It feels like 1995 again", referring to that heady period when the Internet emerged and drove the dot-com era forward. Most of the focus was on the superb opportunities that Big Data creates for entrepreneurs. A few panelists also suggested that Big Data would lead to the death of the traditional relational database and data warehouse.

Earlier in the discussion, one panelist characterized Big Data by its "three V's": Volume, Velocity and Variety. This has become the standard way to segment the various use cases that collectively add up to "Big Data", alongside other often-cited characteristics like value and complexity. However, another panelist then said that Big Data was mostly about unstructured information. I have written previously about how not all unstructured data is the same. Most of what is called "unstructured" in the context of Big Data is really DATA, such as variable-length log files. It is not truly unstructured CONTENT, which includes articles, web pages, documents, email, etc. It is important to understand the difference, especially from the entrepreneur's perspective. Much of the "volume" in Big Data is unstructured DATA (again, mostly log files) with very low individual value. It is only when analyzed in volume that it becomes interesting and valuable.

One of the more interesting questions at the end of the panel came from Wikibon's Dave Vellante. He asked, more or less, "Why are big data and the new technologies that are emerging to analyze it going to be disruptive to the enterprise data warehouse?"

Here, the panelists' answers seemed uncertain. Several spoke about the challenge of getting centralized IT to produce new information from legacy BI tools. While this is probably impossible to argue with, at least one panelist went too far, saying that data will just be kept in new systems and BI tools will work directly against them. I didn't buy this angle, and in a follow-up conversation with Dave, he agreed with me that the panel mostly missed the mark.

I would have answered Dave's question like this: the key with Big Data is to take the volume of low-value items and turn it into high-value analysis. That analysis then needs to be commingled with other information that has high item value. This includes email, documents, text in applications, rows in databases, ERP, CRM, etc. That isn't disruptive to the eDW in and of itself. Most big data, like behavioral information, won't be interesting to typical corporate decision makers. A few data scientists will analyze click streams and use them to optimize the end-user experience. The transactions (hopefully sales) that result will go into the eDW. The yield improvements will also be analyzed and tracked over time, again probably by traditional BI tools.

The best real-world example I can give is Attivio's IT Knowledge Expert (ITKE) solution. ITKE analyzes log files from operating systems and applications to identify events that are interesting and/or problematic. For example, we may drop or summarize informative messages, keep warnings, and correlate errors. This is great because it helps system administrators quickly discover the symptoms of an issue. However, the solution to the problem is found in the other data: the high-value articles, knowledge bases, SharePoint articles created by previous admins, etc. This is why I refer to content as high-value. It explains WHY things happen.
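
The drop, keep and correlate steps can be pictured with a few lines of Python. This is a toy sketch, not ITKE's logic: the "LEVEL component message" line format, the triage function, and the choice to correlate errors simply by grouping them per component are all assumptions for the example.

```python
from collections import defaultdict


def triage(lines):
    """Summarize INFO lines, keep WARN lines, and group ERRORs by component.

    Assumes each line looks like "LEVEL component message"; lines that do
    not fit that shape are skipped.
    """
    info_count = 0
    warnings = []
    errors_by_component = defaultdict(list)
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # not in the assumed format
        level, component, message = parts
        if level == "INFO":
            info_count += 1  # summarized as a count, details dropped
        elif level == "WARN":
            warnings.append(line)  # kept verbatim
        elif level == "ERROR":
            errors_by_component[component].append(message)
    return info_count, warnings, dict(errors_by_component)


# Made-up log lines for illustration.
LOG = [
    "INFO db started",
    "WARN disk usage at 91%",
    "ERROR db connection refused",
    "INFO web request served",
    "ERROR db timeout after 30s",
]
```

Here triage(LOG) reduces five low-value lines to one count, one warning and a single group of related database errors, which is the kind of symptom summary an administrator would then look up in the high-value content.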

Related:

Missing Some Key Points with Big Data

Extreme Information: Completing the Big Data Picture

Attivio IT Knowledge Expert
