Chief Data Officers and the Challenges of Big Data
Writing on O’Reilly.com back in August, Jesse Anderson, CEO of Smoking Hand, a training company for Big Data technologies, commented on the overall complexity of Big Data, NoSQL technologies, and the distributed systems they run on.
Chief Data Officers certainly have first-hand knowledge of this complexity and the hurdles it presents to extracting the maximum value out of business data. Complexity takes a variety of forms throughout the Big Data stack. Let’s start at the bottom.
Modern and Legacy Data Sources
All CDOs must create a modern data architecture from hybrid data sources. That means traditional data warehouses, NoSQL databases, a distributed file system such as Hadoop’s HDFS, and streaming data.
Typically, data warehouses are “inherited.” But, depending on the demands of the business, CDOs have quite a few options in the other three categories. For example, there are five flavors of NoSQL: column, document, key-value, graph, and multi-model—and many vendor offerings in each. Referencing Anderson again, “To have made the right decision in choosing, for example, a NoSQL cluster, you’ll need to first have learned the pros and cons of five to 10 different NoSQL technologies.” That kind of expertise is often hard to come by.
Hadoop expertise is similarly scarce. A survey by Snowflake Computing found that “only 12 percent have easy access to Hadoop expertise; in contrast, 93 percent have easy access to SQL expertise.” So it’s not surprising that the CDOs we talk to cite a lack of staff skills as a constraint on building apps on their Hadoop-based data lakes.
And then we have streaming data. A question posted on Quora about storing streaming data drew this response (edited for conciseness):
I would strongly recommend a horizontally scaling system like Apache Flume to collect streaming data from the sensors. Have Flume split the stream, sending one stream to HDFS for storage and later analysis and send the second stream to your complex event processing system for realtime analysis.
That sounds pretty straightforward—but not necessarily simple. And again, you need the resources to pull it off.
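To make the "split the stream" idea in the quoted recommendation concrete, here is a minimal sketch in plain Python. It is not Flume and does not use Flume's API; it only illustrates the fan-out pattern, where every incoming event is delivered to both a storage path and a real-time path. The names `fan_out`, `store_for_later`, and `detect_in_real_time`, the sample readings, and the threshold of 30.0 are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

Event = dict  # one sensor reading, e.g. {"sensor": "s1", "value": 21.5}


def fan_out(events: Iterable[Event], sinks: List[Callable[[Event], None]]) -> int:
    """Deliver every event to every sink and return the number of events seen."""
    count = 0
    for event in events:
        for sink in sinks:
            sink(event)
        count += 1
    return count


# Hypothetical stand-ins for the two destinations named in the quote.
archive: List[Event] = []  # plays the role of HDFS storage for later analysis
alerts: List[Event] = []   # plays the role of the real-time event processor


def store_for_later(event: Event) -> None:
    # Keep every event, unfiltered, for later batch analysis.
    archive.append(event)


def detect_in_real_time(event: Event) -> None:
    # Assumed rule, for illustration only: flag readings above 30.0.
    if event["value"] > 30.0:
        alerts.append(event)


# Made-up sample readings.
readings = [
    {"sensor": "s1", "value": 21.5},
    {"sensor": "s2", "value": 34.2},
    {"sensor": "s1", "value": 22.0},
]
processed = fan_out(readings, [store_for_later, detect_in_real_time])
```

In a real deployment the hard parts are the ones this sketch leaves out: buffering, delivery guarantees, and scaling the collectors horizontally. That is where the scarce expertise comes in.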
Things don’t get any less complex as we move up the stack. In my next post, we’ll look at the middle. Also be sure to check out our 5-minute Guide to the Challenges of a CDO.