
Key Phrase Extraction

In the near future, Attivio will be rolling out key phrase extraction for most European languages. Fortunately, this required very little work, because of the way we extract key phrases. In an earlier Attivio blog post introducing key phrase extraction, our CTO, Sid, gave an executive summary. Here, I want to get into a few more details and explain why the incremental work to supply key phrase extraction for a new language is very small in AIE.

What is our approach to key phrase extraction? To be cute about it, key phrases have to satisfy two criteria: the "key" part and the "phrase" part. "Key" means that the word or phrase describes the document it's from; "phrase" means that the words are more likely to form a meaningful unit than to be merely a random sequence of tokens. We measure both of these using a language model.

A language model is a way of assigning probabilities to all the possible sequences of words in a language. Traditionally, a language model is created by collecting a corpus of text (as large a corpus as you can) and counting every sequence of words in the corpus, up to a cut-off length. The approach is straightforward, but it has two problems.
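Before getting to those problems, here is roughly what the counting step looks like. This is a minimal sketch in Python, not Attivio's code: the function name `count_ngrams`, the whitespace tokenization, and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter

def count_ngrams(tokens, max_n=3):
    """Count every run of 1..max_n consecutive tokens (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy "corpus"; a real one would be far larger and properly tokenized.
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens)
```

Dividing each count by the total number of sequences of the same length gives the simplest possible probability estimate, and that is exactly the estimate the two problems undermine.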

One problem is that, as the phrase length increases, the number of possible phrases grows very fast. Even if your language has only 20,000 words, there are about 400 million possible bigrams, 8 trillion possible trigrams, 160 quadrillion possible four-grams, and 3.2 sextillion possible five-grams. Since the largest corpus ever collected (Google's 1T corpus) contained about one trillion words and even fewer bigrams, trigrams, four-grams and five-grams, most of these possible phrases will never be seen. (And, remember, English has a lot more than 20,000 words.) So, if a phrase has never been seen at all, how do you assign a probability to it?
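The arithmetic is easy to check. The short Python sketch below raises the 20,000-word vocabulary to each phrase length and compares the result to a one-trillion-word corpus; the variable names are mine, and the coverage figure is only an upper bound that assumes every position in the corpus starts a different phrase.

```python
VOCAB_SIZE = 20_000          # vocabulary size used in the text
CORPUS_TOKENS = 10 ** 12     # about one trillion words

# Number of possible n-grams is VOCAB_SIZE ** n.
possible = {n: VOCAB_SIZE ** n for n in range(2, 6)}

# Upper bound on the fraction of possible n-grams a corpus of this size could contain.
max_coverage = {n: min(1.0, CORPUS_TOKENS / total) for n, total in possible.items()}

for n in possible:
    print(f"{n}-grams: {possible[n]:.1e} possible, coverage at most {max_coverage[n]:.1e}")
```

Even under that generous assumption, the corpus can hold only a small fraction of the possible trigrams and a vanishing fraction of the four-grams and five-grams.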

Another problem comes from Zipf's law, which says that word frequencies follow a power law distribution. That is, the second most common word is about one half as frequent as the most common word, the third most common is about one third as frequent, and so on. So, after the thousand or so most frequent words, there are a lot of words that don't occur very often. Bigrams and trigrams that contain those words are rarer still. For a language model, this means that as the corpus grows, most of the model's growth consists of previously unseen phrases that happen to be in your chosen corpus. It also means that most phrases that are seen are seen only once, or a small number of times, and the fewer times a phrase is seen, the more unreliable its observed count is.
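To get a feel for how lopsided this is, here is a small Python sketch of an idealized Zipf distribution. It assumes an exponent of exactly 1 and reuses the 20,000-word vocabulary from earlier; real text only approximates this, and the function name `zipf_frequencies` is hypothetical.

```python
def zipf_frequencies(vocab_size, exponent=1.0):
    """Relative frequencies where the word at rank r has weight 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_frequencies(20_000)

# Share of all running text taken up by the 1,000 most frequent words.
head_mass = sum(freqs[:1000])
# Share left over for the remaining 19,000 words.
tail_mass = 1.0 - head_mass
```

Under these assumptions, the top thousand words account for roughly 70% of all running text, so the other 19,000 words share what is left.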

These problems have been known for a long time, and there are a number of techniques that handle these challenges. I won't go into them here, but there are other places to find out about them. Here at Attivio, we've created a language modeling module that compresses a very large language model into a surprisingly small space.

Once we have a language model, we're ready to extract key phrases. We create a (very small) language model from a document being ingested, and we compare it to the (very large) language model we have already built. For every one-, two- and three-word sequence in the document (three being the current key phrase size limit), we use the standard likelihood ratio test, which returns a value that expresses how unusual the sequence's count in the document is, compared to its count in the background corpus and compared to the counts of its individual words. If the value is high enough, we consider the sequence a key phrase.
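For the curious, here is one common form of that test, sketched in Python. It is not the AIE implementation: it covers only the document-versus-background comparison (not the comparison against the individual words), it models counts as binomial, and every name in it is hypothetical.

```python
import math

def _log_l(k, n, p):
    """Binomial log-likelihood of k hits in n trials (constant term dropped)."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # clamp so log() never sees 0
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(k_doc, n_doc, k_bg, n_bg):
    """2 * log of (separate rates) versus (one shared rate) for a phrase.

    k_doc, k_bg: phrase counts in the document and the background corpus.
    n_doc, n_bg: total number of same-length sequences in each.
    """
    p_doc, p_bg = k_doc / n_doc, k_bg / n_bg
    p_all = (k_doc + k_bg) / (n_doc + n_bg)
    return 2 * (_log_l(k_doc, n_doc, p_doc) + _log_l(k_bg, n_bg, p_bg)
                - _log_l(k_doc, n_doc, p_all) - _log_l(k_bg, n_bg, p_all))

def keyness(k_doc, n_doc, k_bg, n_bg):
    """Score only phrases that are MORE frequent in the document than in the background."""
    if k_doc / n_doc <= k_bg / n_bg:
        return 0.0
    return log_likelihood_ratio(k_doc, n_doc, k_bg, n_bg)

# Same rate in both places: nothing unusual.
ordinary = keyness(5, 1_000, 5_000, 1_000_000)
# 20 hits in a 1,000-sequence document versus 50 in a million: very unusual.
unusual = keyness(20, 1_000, 50, 1_000_000)
```

A phrase that occurs at the background rate scores zero, while one that is heavily over-represented in the document scores high. Where to put the cut-off is a tuning decision that this sketch leaves open.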

"That's clear as mud," I hear you thinking, "but what does that have to do with multi-language key phrase extraction?" Everything, as it turns out. For key phrase extraction in a new language, we only need to build a language model for that language. To do this, we only need to collect enough documents in that language. (We also need to be able to tokenize and detect sentence boundaries in that language, but if we've gotten to the point where key phrase extraction is of interest, then we already have tokenization and sentence boundary detection complete.) So, the only labor involved for key phrase extraction is corpus collection, which in the internet age is not hard. It's not easy, exactly, but it's not hard. Then, we turn the crank on the language model module, and point the key phrase module at the resulting language model files.

The best part is, if a customer decides that they want a custom language model that does a better job of key phrase extraction - for example, because they're working in the biomedical domain where words are used differently - then they can build a language model themselves, point the key phrase extraction module at the language model they've built, and they're done! Of course, they might prefer to have Attivio lend a hand, which is always an option, but it's not required.

This is where our approach to information processing and natural language processing pays off. No magic, a clear explanation of how things work, and putting the power in customers' hands, if they want it. It even makes our lives easier!
