Attivio recently sponsored the Enzee Universe show in Boston. We've been a long-time partner of Netezza, dating back to before IBM's acquisition, and it is always great to see the growth in their user base and the continued innovation in structured analytics, and to reconnect with all the brilliant people who attend.
I also got the opportunity to present on the topic of Completing the Picture: Moving from "Big Data" to Extreme Information. Gartner defines Extreme Information as consisting of four main dimensions: Volume, Velocity, Variety and Complexity.
Here are the slides from my presentation:
There isn't an official definition of Big Data, but in my view, most people who refer to it are focused on volume. It's easy to understand why: sensors and systems of all kinds, from web servers to click trackers to smartphones and GPS units, are cranking out more and more data every day. From the point of view of a portal like Yahoo!, the challenge is clear. Millions of visitors produce billions of requests, advertisements and even a few purchases.
Taking a traditional BI approach to this volume is simply not possible; the number of servers and database software licenses required is beyond reason. Moreover, the classic relational database doesn't really add much value, as the records created by sensors tend to be what people call "unstructured data". This is a bit of a misnomer: it really refers to the fact that the sensor outputs are variable and unfamiliar. They are entirely informal compared to what you would typically see in an OLTP system. Thus it's no surprise that Hadoop was developed at Yahoo to deal with this particular challenge. There's no need for real-time processing, nor to store all of the data so it can be queried via SQL. It's much more efficient to take a bunch of servers, crunch the data over the course of an evening, and output summary records that can then be loaded into a BI system, so that people who need to know things like the number of clicks, ads served, conversions, etc. can get that information.
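As a rough illustration of that nightly crunch, here is a minimal Python sketch of a map step and a reduce step that turn raw event lines into summary counts. The log format, the event names and the sample lines are all assumptions made up for this example; a real Hadoop job would distribute the same two steps across many servers.

```python
from collections import Counter

# Hypothetical raw log lines in the form "<timestamp> <event_type> <ad_id>".
RAW_EVENTS = [
    "2011-06-20T10:00:01 click ad-17",
    "2011-06-20T10:00:02 ad_served ad-17",
    "2011-06-20T10:00:03 ad_served ad-42",
    "2011-06-20T10:00:05 conversion ad-17",
    "malformed line",
]


def map_event(line):
    """Map step: emit ((day, event_type), 1) for each well-formed line."""
    parts = line.split()
    if len(parts) != 3:
        return None  # skip records that do not match the expected shape
    timestamp, event_type, _ad_id = parts
    return (timestamp[:10], event_type), 1


def summarize(lines):
    """Reduce step: sum the counts for each (day, event_type) key."""
    totals = Counter()
    for line in lines:
        pair = map_event(line)
        if pair is not None:
            key, count = pair
            totals[key] += count
    return dict(totals)


# Summary records like these are what would be loaded into the BI system.
summary = summarize(RAW_EVENTS)
```

The point of the sketch is the shape of the output: a handful of summary rows per day rather than the billions of raw events behind them.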
During my presentation I talked about how Attivio's Active Intelligence Engine (AIE) completes the Big Data picture by bringing in the other dimensions. Most importantly, AIE provides access to a greater variety of data, including unstructured content. This is important because it helps explain "why" something happened; BI and reporting on structured data typically provide the "what". For example, a typical report using structured data will indicate that the number of units sold is decreasing, or that the price of an asset is increasing. Analyzing the documents, email and other user-generated content is the key to identifying WHY these changes occurred.
AIE also addresses information complexity. AIE transformers (components that you configure or write to perform analysis) automatically benefit from complete, "best practices" text analytics that enrich and normalize content regardless of language, encoding, etc. Transformers may be written in any JSR language, following a record-centric programming model that is much simpler than MapReduce. If you ultimately index their output, you can even have them query against all records that have come before. For example, it may be useful to know how many instances of the same record, or type of record, have already been seen.
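To make the record-centric idea concrete, here is a minimal Python sketch of a transformer that handles one record at a time and annotates it with how many records of the same type came before. This is not AIE's API: the class name, the `process` method, the `type` and `prior_count` fields, and the in-memory counter standing in for a query against the index are all assumptions for illustration.

```python
from collections import Counter


class SeenBeforeTransformer:
    """Hypothetical record-at-a-time transformer (not AIE's actual API)."""

    def __init__(self):
        # Stand-in for querying the index of previously processed records.
        self._seen_by_type = Counter()

    def process(self, record):
        """Return a copy of the record annotated with the prior count."""
        record_type = record.get("type", "unknown")
        enriched = dict(record)
        enriched["prior_count"] = self._seen_by_type[record_type]
        self._seen_by_type[record_type] += 1
        return enriched


transformer = SeenBeforeTransformer()
records = [
    {"id": 1, "type": "news"},
    {"id": 2, "type": "email"},
    {"id": 3, "type": "news"},
]
enriched = [transformer.process(r) for r in records]
```

Each call sees a single record and returns a single record, so there is no need to think in terms of separate map and reduce phases.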
AIE can also deal with the increasing velocity of information by supporting real-time processing. It can ingest streaming information rather than being limited to bulk loads, and can therefore provide more frequent insight. It can also be scaled linearly to deal with volume.
To demonstrate some of these complementary capabilities, I presented several queries using two different approaches. First, I queried Netezza's TwinFin iClass framework using SQL and had it hand table functions off to AIE. Using AIE's streaming API (described here in a superb post by Will Johnson), you can pass a long filter list to AIE, or get all the results streamed back with excellent performance. For example, here's a query that finds articles that mention public companies (by ticker symbol) and then finds the one-day change in their stock price. One can use something like this to see which companies were negatively impacted by oil prices, or perhaps by a major spill in the Gulf of Mexico.
select aie.title, aie.uri, e.name, e.changepercent
from table with final(aie_simple('oil', 'title,url,date,symbols')) as aie
join earnings e on strpos(aie.symbols,e.symbol) > 0 and date (aie.date+interval '1 day') = e.date
order by changepercent desc
(Refer to slides 11, 12 and 13 for more examples.)
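For readers less familiar with the join condition in the query above, here is a minimal Python sketch of the same logic: match an article to an earnings row when the row's symbol appears in the article's symbol list and the row's date is the day after the article's date. The tickers, company names, dates and percentages are made-up sample data, not results from the presentation.

```python
from datetime import date, timedelta

# Hypothetical rows standing in for the AIE table function output.
articles = [
    {"title": "Oil prices surge", "symbols": "AAA,BBB", "date": date(2011, 6, 1)},
    {"title": "Refinery outage", "symbols": "CCC", "date": date(2011, 6, 2)},
]

# Hypothetical rows standing in for the earnings table.
earnings = [
    {"symbol": "AAA", "name": "AAA Corp", "date": date(2011, 6, 2), "changepercent": -2.5},
    {"symbol": "BBB", "name": "BBB Corp", "date": date(2011, 6, 2), "changepercent": 1.0},
    {"symbol": "CCC", "name": "CCC Corp", "date": date(2011, 6, 2), "changepercent": 3.0},
    {"symbol": "CCC", "name": "CCC Corp", "date": date(2011, 6, 3), "changepercent": -0.5},
]


def one_day_changes(articles, earnings):
    """Join articles to the next day's price change for each mentioned symbol."""
    rows = []
    for article in articles:
        for row in earnings:
            # Same substring test as strpos(aie.symbols, e.symbol) > 0.
            mentioned = row["symbol"] in article["symbols"]
            next_day = article["date"] + timedelta(days=1) == row["date"]
            if mentioned and next_day:
                rows.append((article["title"], row["name"], row["changepercent"]))
    # Equivalent of "order by changepercent desc".
    return sorted(rows, key=lambda r: r[2], reverse=True)


changes = one_day_changes(articles, earnings)
```

Note that a plain substring test also matches one symbol inside a longer one (for example NWS inside NWSA), so a stricter comparison on the split symbol list may be preferable in practice.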
Second, I showed sending an SQL query to AIE and having part of it "federated" against the Netezza appliance, with results streamed back and merged into a single result set. For example, here's a query joining market data for specific companies in Netezza with news articles from AIE:
select * from news n join SymbolsInArticle s on n.articleid=s.articleid
where s.symbol in ('NCRB', 'NWS', 'NWSA', 'GOOG')
federated join on "symbol"
select symbol as "symbol", * from earnings where symbol in (%%QUOTED_KEYS%%) using workflow federated_db
(Refer to slides 14 and 15 for more detail.)
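One plausible reading of the `%%QUOTED_KEYS%%` placeholder is that the join keys from the first result set are quoted, substituted into the second query, and the two result sets are then merged on the key. The Python sketch below illustrates that reading only; it is an assumption, not a description of how AIE implements federation. An in-memory SQLite table stands in for the Netezza appliance, and the function names and change percentages are invented for the example.

```python
import sqlite3


def quote_keys(keys):
    """Render keys as a comma-separated list of SQL string literals."""
    unique = sorted(set(keys))
    return ", ".join("'" + k.replace("'", "''") + "'" for k in unique)


def federated_join(left_rows, key, template, conn):
    """Run the templated query for the left-side keys and merge on the key."""
    keys = [row[key] for row in left_rows]
    sql = template.replace("%%QUOTED_KEYS%%", quote_keys(keys))
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    right_by_key = {}
    for values in cursor:
        right = dict(zip(columns, values))
        right_by_key.setdefault(right[key], []).append(right)
    merged = []
    for left in left_rows:
        # Inner join: left rows without a match on the remote side are dropped.
        for right in right_by_key.get(left[key], []):
            merged.append({**left, **right})
    return merged


# In-memory stand-in for the remote earnings table (made-up numbers).
conn = sqlite3.connect(":memory:")
conn.execute("create table earnings (symbol text, changepercent real)")
conn.executemany(
    "insert into earnings values (?, ?)",
    [("GOOG", 1.2), ("NWS", -0.4), ("IBM", 0.3)],
)

# Stand-in for the rows returned by the news side of the query.
news_rows = [
    {"articleid": 1, "symbol": "GOOG"},
    {"articleid": 2, "symbol": "NWS"},
    {"articleid": 3, "symbol": "NWSA"},
]
template = "select symbol, changepercent from earnings where symbol in (%%QUOTED_KEYS%%)"
merged = federated_join(news_rows, "symbol", template, conn)
```

Sending only the keys that the first query produced keeps the remote query small, which is presumably why the placeholder exists.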
Using the streaming API and standard integration features like table functions and user-defined functions, AIE can be integrated with almost any analytic platform, from the new MPP databases to NoSQL/NewSQL and legacy relational systems, and all points in between.
We look forward to attending Enzee next year and thank everyone at IBM/Netezza for putting together such a tremendous show.
