Transaction-like Document Processing in AIE

Attivio's Active Intelligence Engine (AIE) is built on a scalable, parallel, asynchronous messaging system (described in more detail at the end of this post). One consequence of such a design is that messages are not usually processed in the order in which they are sent. Paying this price for a high-speed, scalable content ingestion system is usually a no-brainer. But there are cases where ordered processing is a nice-to-have, and some cases where it is absolutely required for correctness. Some examples:

  • Consistent data view - AIE stores the contents of an archive, such as a zip file, as separate documents linked to a parent document for the archive. When re-ingesting the archive, we would ideally delete the previous contents (a delete message) and then ingest the new contents (new document messages), in that order.
  • Consistent data view - As detailed in this blog, AIE supports flexible and efficient updates of security information (stored as its own set of documents) independently of the secured document. When updating, the previous security information must be removed (a delete message). Whether updating or ingesting for the first time, it is critical that the update be consistent: the ACLs and the document changes must land on the same side of any commit of the index.
  • Grouping records and documents - AIE can combine records or documents together by key. The transformation component that does this requires all documents that share a key to arrive together, both for memory efficiency and for correctness. Without this guarantee, such a component can never determine when the last document with a shared key has arrived and the combined output should be generated.
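
The third case can be illustrated with a small sketch. It is not AIE code: `KeyedCombiner` and its methods are hypothetical names, and records are reduced to plain strings. It assumes that all records sharing a key arrive back to back, so a change of key is the only signal needed to emit the combined output, and only one run is ever held in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical combiner: emits one combined output per run of records sharing a key. */
final class KeyedCombiner {
    private final BiConsumer<String, List<String>> sink;
    private String currentKey;                       // key of the run being buffered
    private List<String> buffer = new ArrayList<>(); // values seen so far for that key

    KeyedCombiner(BiConsumer<String, List<String>> sink) {
        this.sink = sink;
    }

    /** Accepts the next record; a change of key means the previous run is complete. */
    void accept(String key, String value) {
        if (currentKey != null && !currentKey.equals(key)) {
            flush();
        }
        currentKey = key;
        buffer.add(value);
    }

    /** Emits the buffered run, if any. Call once more at the end of the input. */
    void flush() {
        if (currentKey != null) {
            sink.accept(currentKey, buffer);
            buffer = new ArrayList<>();
            currentKey = null;
        }
    }
}
```

If records for the same key arrive interleaved with another key (a, b, a), this combiner emits two partial outputs for key a. That is the correctness problem that delivering the records together, in order, avoids.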

The ability to support these types of operations while maintaining a high-throughput, scalable system that can ingest structured and unstructured content is a key requirement of unified information access (UIA). The alternative is to pre-join all your data, which dramatically limits both the types of queries that can be executed and the way updates to content are processed. AIE has always been able to handle these use cases, but until recently doing so was computationally expensive.

New in release 2.2 is the ability to process the messages within a group together and in order, while letting all other messages and groups be processed independently. This capability is unique among ETL-like scalable data processing systems such as ours.

How it works

Grouped message processing allows a (usually small) set of related messages to be processed as a group, in order, when needed. The capability is turned on or off per document processing transformer. When grouping isn't required (the transformer has no side effects that depend on the group ordering), messages are processed independently, delivering the maximum throughput. When a document transformer does require this transaction-like behavior, a simple configuration change is all that is necessary. This change brings the following semantics into play for the component:

  • All non-document messages are blocked while the group is being processed. A system commit or optimize message cannot be processed until the group is complete. This means all the messages of the group are processed together, either before or after the commit, ensuring a consistent state.
  • The documents in the group are sent to the component in the order they were added to the group.
  • Only a single instance of the component is used to process the documents in the group.
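
These semantics can be pictured with a toy, single-threaded model. This is only a sketch under simplifying assumptions: `GroupAwareStage` and its methods are hypothetical names rather than AIE classes, messages are plain strings, and the real engine enforces the rules across queues and component instances instead of inside one object.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a component with grouped message processing enabled. */
final class GroupAwareStage {
    private final List<String> processed = new ArrayList<>(); // order of processing
    private final Deque<String> deferred = new ArrayDeque<>(); // control messages held back
    private boolean groupOpen;

    void startGroup() {
        groupOpen = true;
    }

    /** Documents are processed in arrival order by this single instance. */
    void document(String id) {
        processed.add("doc:" + id);
    }

    /** Non-document messages (commit, optimize) wait until any open group is complete. */
    void control(String name) {
        if (groupOpen) {
            deferred.add(name);
        } else {
            processed.add("ctl:" + name);
        }
    }

    /** Closing the group releases the control messages that arrived in the meantime. */
    void endGroup() {
        groupOpen = false;
        while (!deferred.isEmpty()) {
            processed.add("ctl:" + deferred.poll());
        }
    }

    List<String> log() {
        return processed;
    }
}
```

In this model, a commit that arrives between doc1 and doc2 of the same group is recorded only after doc2, so both documents end up on the same side of the commit.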

The component that most heavily uses message groups is the ContentDispatcher (the gateway component for the AIE index).

Client Example

Message grouping is easy to use. In the client API example below, doc1 and doc2 are processed as one group, and a delete-by-query followed by doc3 is processed as a second group.

Content feeder example

ContentFeeder feeder = new ContentFeeder(...);
feeder.startMessageGroup();
feeder.feed(new AttivioDocument("doc1"));
feeder.feed(new AttivioDocument("doc2"));
feeder.endMessageGroup(); // must always be called at the end of a group

feeder.startMessageGroup();
// an example of replacing a doc and all of its children as a single atomic event
feeder.deleteByQuery(new WorkflowQueue("defaultQuery"), "parentid:doc3", QueryLanguages.SIMPLE);
feeder.feed(new AttivioDocument("doc3"));
feeder.endMessageGroup();
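
Because the end of a group must always be signalled, one defensive pattern is to close the group in a finally block. The sketch below is an assumption-laden illustration, not part of the AIE client API: `GroupFeeder` is a hypothetical stand-in interface exposing only the two group calls used above, and `MessageGroups.inGroup` is a hypothetical helper. Whether the real feeder should close a group after a failed feed is something to confirm against the product documentation.

```java
/** Hypothetical stand-in for the two group calls used in the example above. */
interface GroupFeeder {
    void startMessageGroup();

    void endMessageGroup();
}

/** Hypothetical helper that guarantees every started group is ended. */
final class MessageGroups {
    private MessageGroups() {
    }

    /** Runs the body inside a message group, ending the group even if the body throws. */
    static void inGroup(GroupFeeder feeder, Runnable body) {
        feeder.startMessageGroup();
        try {
            body.run();
        } finally {
            feeder.endMessageGroup();
        }
    }
}
```

With an adapter from the real feeder to `GroupFeeder`, each group in the example becomes a single call to `inGroup`, with the feed and delete calls placed inside the body.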

Background on the AIE architecture

The architecture of AIE is based on the Staged Event-Driven Architecture (SEDA) pattern. In SEDA, each pool of components has a work queue in front of it. Components (document transformers are one type of component) work on their input queue and forward system messages to the queue of the next component in the workflow. SEDA allows AIE to manage processing by sizing the queues and the number of component instances, all while processing content asynchronously. As a result, back pressure can be applied if one component gets overwhelmed, and that back pressure will eventually flow all the way back to the client application.
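
The queue-per-stage idea and the resulting back pressure can be sketched with standard Java blocking queues. This is a minimal illustration of the SEDA pattern, not AIE's implementation: `Stage`, `SedaSketch`, the `STOP` marker, and the queue capacity of 2 are all hypothetical choices, and each stage here has exactly one worker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

/** One stage: a bounded input queue drained by a single worker. */
final class Stage implements Runnable {
    static final String STOP = "__stop__"; // hypothetical end-of-stream marker

    private final BlockingQueue<String> input;
    private final BlockingQueue<String> next;
    private final UnaryOperator<String> work;

    Stage(int capacity, BlockingQueue<String> next, UnaryOperator<String> work) {
        this.input = new ArrayBlockingQueue<>(capacity);
        this.next = next;
        this.work = work;
    }

    BlockingQueue<String> input() {
        return input;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String msg = input.take();
                if (STOP.equals(msg)) {
                    next.put(STOP); // pass the marker downstream, then exit
                    return;
                }
                next.put(work.apply(msg)); // blocks while the next queue is full
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

final class SedaSketch {
    /** Pushes inputs through two stages and returns what reaches the sink. */
    static List<String> run(List<String> inputs) {
        BlockingQueue<String> sink = new LinkedBlockingQueue<>();
        Stage second = new Stage(2, sink, s -> s + "!");
        Stage first = new Stage(2, second.input(), String::toUpperCase);
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        t1.start();
        t2.start();
        try {
            for (String m : inputs) {
                first.input().put(m); // the client blocks here when stage one is saturated
            }
            first.input().put(Stage.STOP);
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<String> out = new ArrayList<>(sink);
        out.remove(Stage.STOP);
        return out;
    }
}
```

Because every `put` blocks when the target queue is full, a slow second stage eventually stalls the first stage and then the feeding client. With one worker per stage the output order matches the input order; adding more workers per stage raises throughput but gives up that ordering, which is the trade-off message groups address.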

(Figure: client diagram)

AIE supplies pluggable transports that allow components to be located in separate processes and on separate machines. In this way, processing can be scaled across multiple machines as needed. When a message (which may contain multiple documents) is received, it is handed to one of the available instances of the component for processing.
