Dynamically Scalable Index in the Attivio Platform
One of the biggest changes in the upcoming Sherlock release of the Attivio Platform is the move to a Hadoop-based architecture for big data applications. Hadoop provides a number of low-level capabilities that we no longer have to manage ourselves at this scale, such as resource management (YARN), coordination (ZooKeeper), storage (HDFS) and bookkeeping (HBase). One side benefit of YARN-based resource allocation is the ability to scale your system’s footprint up or down with a few simple YARN commands. At Attivio, we’ve used this ability to scale our index up and down, giving you flexibility in how you trade off performance, cost and index management.
For example, when building email surveillance solutions that index billions of messages over multiple years, the index often starts with no documents but grows at a predictable pace as more content flows into the system. Solutions without dynamic scaling require you to plan the final architecture 2, 3, 5 or even 7 years out and then purchase most, if not all, of the hardware up front. Worse, if an event drives query load above normal levels, you would need to take the system offline to reconfigure and redeploy the solution.
Attivio’s upcoming release addresses these concerns. For example, you can start on day 1 with a single process that hosts your entire index. Then, as your index grows, you can split it across multiple nodes (scale out for content volume) with a single command:
flexindex myIndex loadfactor 4
Then, if you start to see increased usage of your system, you can add two extra rows of searchers to serve queries (scale out for query volume) with another command:
flexindex myIndex addrow 2
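The two scaling axes above can be pictured as a grid: columns are index partitions (content volume) and rows are replicated sets of searchers (query volume). The following is a minimal Python sketch of that picture, not Attivio code. The `IndexTopology` class and its methods are hypothetical names, it assumes each row holds one searcher per partition, and it does not attempt to model what the `loadfactor` argument means, since this post does not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexTopology:
    """Hypothetical model of an index laid out as partitions x rows."""

    partitions: int = 1  # columns: slices of the index content
    rows: int = 1        # rows: replicated sets of searchers

    def __post_init__(self):
        if self.partitions < 1 or self.rows < 1:
            raise ValueError("partitions and rows must both be >= 1")

    @property
    def searcher_processes(self) -> int:
        # Assumption: every row serves the whole index,
        # with one searcher process per partition.
        return self.partitions * self.rows

    def split(self, partitions: int) -> "IndexTopology":
        """Scale out for content volume: spread the index over more partitions."""
        return IndexTopology(partitions, self.rows)

    def add_rows(self, count: int) -> "IndexTopology":
        """Scale out for query volume: add replicated rows of searchers."""
        return IndexTopology(self.partitions, self.rows + count)

    def docs_per_partition(self, total_docs: int) -> int:
        """Documents each partition holds if content is spread evenly (rounded up)."""
        return -(-total_docs // self.partitions)


day_one = IndexTopology()        # a single process hosts the whole index
grown = day_one.split(4)         # content volume grew: four partitions
busy = grown.add_rows(2)         # query volume grew: two extra searcher rows
print(busy.partitions, busy.rows, busy.searcher_processes)
```

Under these assumptions, adding rows leaves the amount of content per partition unchanged and only multiplies the number of searchers available to answer queries, which is why the two axes can be adjusted independently.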
If usage of your index decreases, you can reduce its resource utilization just as easily. Many enterprise applications see peak usage only during normal business hours. The architecture allows you to scale down your query capacity at night so that other, non-Attivio nightly batch jobs can use the shared resource pool.
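As an illustration of the nightly scale-down described above, the sketch below picks a target number of searcher rows from the time of day. It is a hypothetical helper, not part of the Attivio Platform: the function name, the 09:00 to 17:00 business window and the row counts are all assumptions, and it only computes a target rather than issuing any command.

```python
from datetime import time


def target_rows(now: time, peak_rows: int, offpeak_rows: int,
                business_start: time = time(9, 0),
                business_end: time = time(17, 0)) -> int:
    """Return how many searcher rows to run at the given time of day.

    Assumption: business hours are [business_start, business_end) on every
    day; weekends and holidays are not modeled in this sketch.
    """
    if offpeak_rows < 1 or peak_rows < offpeak_rows:
        raise ValueError("need 1 <= offpeak_rows <= peak_rows")
    in_business_hours = business_start <= now < business_end
    return peak_rows if in_business_hours else offpeak_rows


# Three rows during the working day, one row overnight.
for moment in (time(10, 30), time(23, 0)):
    print(moment, target_rows(moment, peak_rows=3, offpeak_rows=1))
```

A scheduler could compare this target with the current row count and scale accordingly, freeing the difference for other jobs in the shared pool overnight.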
Most importantly, these operations can be performed on a live system, without ingestion or query downtime and without any impact on users.