Attivio Text Analytics and Language Processing
Language Processing in Attivio Active Intelligence Engine® (AIE®) provides linguistic capabilities that make information easier to find and analyze and that increase the relevancy of query results. AIE supports ingestion, basic tokenization, noun phrase extraction, named entity recognition and other analytic processes for many widely used languages, with some language-specific linguistics features for non-English languages, such as decompounding of German words at query time. Advanced linguistics features, such as statistical entity extraction and key phrase extraction, are also available as an option for English.
The strong linguistics foundation in AIE enhances the user experience, improves the discovery of relationships in information and bolsters the two pillars of information retrieval: relevancy and precision.
- Tokenization. This process divides a block of text, such as a paragraph or an email, into individual units ("tokens"). Tokens are single words, punctuation marks, numbers, etc. These tokens are used to index the block of text. AIE provides tokenizers for different languages, including complex languages that are not space-separated, such as Chinese and Japanese.
- Removal of stop words. Stop words are conjunctions, prepositions, articles and other words such as AND, TO and A that appear often in documents but typically carry little meaning. Stop words are usually removed from the query to improve relevancy, because they would otherwise match many documents that are not in fact relevant to the query terms.
- Stemming and Lemmatization. These related processes identify and group related words to improve the relevancy of search results. The less complex of the two, stemming reduces words to their root or stem form (e.g., the stemming process identifies "fishes," "fished," "fishy," "fishing," etc., as sharing the stem "fish"). Lemmatization finds the lemma, or base form of the word, and groups various inflected forms of the word so they can be analyzed as a single item (e.g., the lemmatization process identifies "good" as the lemma for "better"). Lemmatization is closely related to stemming, but unlike stemming, which operates on only a single word at a time, lemmatization operates on the full text and therefore can discriminate between words that have different meanings depending on part of speech or context. Both stemming and lemmatization can be used during ingestion or query preparation to ensure relevancy of results without requiring an exact match (e.g., using stemming and lemmatization, a search for "bike" would find documents with "bikes," "biked" and "biking" as well as "bike").
- Phrasing. Phrasing recognizes common phrases or idioms such as "end zone" or "diamond lane" and leaves them whole, so that each phrase is indexed as a single term and not as two separate terms. Phrasing ensures that a document that contains "hat trick" (a sports term) is not returned when the search is for "hat" or "trick." The reverse also holds: documents about hats aren't returned when the search term is "hat trick."
- Synonym Expansion. You can expand a query or document by adding synonyms, taken from a defined list (a built-in or plug-in dictionary), for the words it originally contains. Synonym expansion is mostly useful in query processing, so a search for "cheap hotels" would also match and return documents that contain "inexpensive hotels," "budget hotels" and "two-star hotels."
- Word Decompounding. For languages such as German or Japanese, which have a very rich compound word structure, AIE can automatically divide compound words into smaller words and other units of meaning (e.g., meaningful noun-phrases) for better query matching.
- Language Identification. A document or other text source does not, on its own, reveal what language it is in or what character encoding it is using. AIE solves this problem by analyzing each document, determining the most likely language and encoding for it.
- Spelling correction. When users make mistakes while typing queries, AIE can suggest alternatives and even correct queries automatically. Spelling correction can use a spelling dictionary that is supplied by the customer or extracted automatically from user documents.
- Entity extraction. Using entity extraction, AIE detects and indexes certain nouns and noun phrases in content (entities such as people, companies and locations), providing another useful source of metadata that can be indexed to improve query results. The entities can be based on supplemental dictionaries or on rules; customers can supply their own dictionaries (such as terms used in their industry) and rules. AIE Advanced Linguistics supports Statistical Entity Extraction, which is entity extraction based on statistical models that are trained on examples.
- AIE also includes text analytics capabilities such as classification and sentiment analysis.
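As a rough illustration of how several of the steps above fit together at query time, the following Python sketch chains tokenization, stop word removal, synonym expansion and stemming. It is a toy written for this overview, not AIE code: the word lists, the suffix-stripping rule and all function names are assumptions chosen for the example, and a production stemmer or tokenizer would be considerably more sophisticated.

```python
import re

# Hypothetical, tiny resources for illustration only; not Attivio's dictionaries.
STOP_WORDS = {"a", "and", "the", "to", "in", "of"}
SYNONYMS = {"cheap": ["inexpensive", "budget"]}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def expand_synonyms(tokens):
    """Return one list of alternatives per token: the token, then its synonyms."""
    return [[t] + SYNONYMS.get(t, []) for t in tokens]


def prepare_query(text):
    """Tokenize, remove stop words, expand synonyms, then stem every alternative."""
    tokens = remove_stop_words(tokenize(text))
    return [[stem(alt) for alt in alts] for alts in expand_synonyms(tokens)]
```

Calling `prepare_query("Cheap hotels in the city")` yields one group of alternatives per remaining query term, so a search layer could match any alternative within a group. Note that this naive stemmer maps "fishes," "fished" and "fishing" to "fish" but leaves "fishy" untouched, which is one reason real systems use rule sets or dictionaries tuned per language.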
Advanced Linguistics and Additional Languages Options
Customers who need support for languages other than English can select Attivio's Additional Languages options. For advanced features such as Key Phrase Extraction, select Attivio's Advanced Linguistics. The non-English language options support tokenization, linguistics and algorithmic entity identification and extraction methods for a number of European, Middle Eastern and Asian languages. Advanced Linguistics, which is available for English and in limited cases for other languages, includes these capabilities:
- Statistical entity extraction. AIE Advanced Linguistics supports Statistical Entity Extraction, which is entity extraction based on statistical models that are trained on examples. Statistical Entity Extraction weights the frequency of an entity's appearance in content. The resulting metadata can be used in navigation tools such as tag clouds and facet recommendations.
- Key phrase extraction. Key Phrase Extraction, a kind of Concept Extraction, uses statistical techniques during document ingestion to select phrases that stand out as important by virtue of being statistically improbable (i.e., unlikely to appear in any document that is not specifically about that phrase). These key phrases can then be used for classifying information, creating facets that can be used in menus to facilitate information discovery, etc.
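The idea of selecting phrases that are "statistically improbable" can be sketched with a simple TF-IDF-style score over two-word phrases: a phrase scores highly when it is frequent in the document but rare in a background collection. This is a stand-in technique chosen for illustration, not a description of how AIE computes key phrases; the function names and the smoothing formula are assumptions.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return the two-word phrases of a text, lowercased."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def key_phrases(document, background_docs, top_n=3):
    """Rank bigrams by in-document count times smoothed inverse document frequency."""
    doc_counts = Counter(bigrams(document))

    # Number of background documents that contain each bigram at least once.
    background = Counter()
    for text in background_docs:
        background.update(set(bigrams(text)))

    n_docs = len(background_docs)
    scores = {}
    for phrase, count in doc_counts.items():
        # Phrases unseen in the background get the largest weight;
        # phrases present in every background document get zero.
        idf = math.log((1 + n_docs) / (1 + background[phrase]))
        scores[phrase] = count * idf

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]
```

With a background of everyday sentences, a phrase such as "hat trick" that recurs in a sports document but never appears in the background rises to the top, while a common phrase such as "on the" sinks to the bottom. The resulting phrases could then feed classification or facet generation as described above.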