Unified Information Access Blog
Welcome to Attivio's Unified Information Access Blog. Join us for discussions on topics ranging from enterprise search solutions, information access insights, Agile software development methodology to programming with Java. We hope you'll find the articles informative and participate in the discussions by leaving a comment.
In previous installments of this series, we looked at tokenization and at sentence boundary detection. In part 3, we'll talk about stemming and lemmatization.
Stemming is the process of "normalizing" tokens -- giving a standard form for tokens -- by looking for prefixes or suffixes and removing (or occasionally rewriting) them. This happens over a number of phases; the stemmer looks for and deals with one set of affixes, and when it's dealt with them, it looks for another set. At the end of this process, the part of the token that remains is called the stem.
What's the rationale for stemming? The basic intuition is that tokens with the same stem tend to have the same meaning, frequently enough that (it's asserted) stemming will improve the accuracy of search. It isn't perfect -- as we'll see below, in English General and generous have the same stem -- but it can be useful for some applications.
I mentioned search accuracy in the previous paragraph, but really what stemming does is improve search recall. A search engine using stems will return a strictly greater number of documents for a search, compared to a search engine indexing tokens. The question is what proportion of the additional documents returned by the stemming search engine is relevant to the query? Once that question is answered, the question of whether to use stemming is answered.
The usefulness of stemming depends on the morphological complexity -- that is, the number and frequency of prefixes and suffixes -- for a particular language. No one bothers to stem Chinese, since there are hardly any prefixes or suffixes at all. On the other hand, languages like Finnish, Turkish, Japanese and Hungarian can benefit a great deal from stemming, because those languages have complex systems of inflections and conjugations that obscure when the same word is used more than once.
And, because stemming algorithms generally run very fast, stemming is seen as an almost free way to boost search recall. Empirically, the effect of stemming on search is small, but it ultimately improves search by a measurable amount, especially on smaller corpora where recall becomes more important. (By the same token, web search engines have less of a reason to use stemming, since boosting recall emphatically is not a concern of theirs.)
Let's look at an example, based on the standard stemmer originally designed and written by Martin Porter. The Porter stemmer is by far the most commonly used stemmer for English, and similar Porter-style stemmers have been created for many other languages. (For examples in other languages, see the Snowball web site.)
Without going into all the gory details, there are five steps, each of which consists of a set of rules. Each step is designed to delete certain types of suffixes, and to feed the next step's rules. The suffixes to be deleted include English inflectional endings (like '-s', '-ed' and '-ing') as well as English derivational endings like '-ful', '-able', '-ism' and '-ize'. The output of the final step is the stem.
Most of the time, the Porter stemmer produces reasonable results, where words with similar meanings are mapped onto the same stem: for example, car and cars both stem to car; and care, cared, careful and cares stem to care.
Note that the stem does not need to be a real English word. For example, beauties, beautiful and beauty are all given the stem beauti. This makes sense, especially in the context of the rest of the Porter stemmer's design, but it takes a little bit of getting used to.
There are also cases where words with different meanings are assigned the same stem. For example, general, generally, generation, generations, generative, generous, General and Generals are all assigned the stem gener. Similarly, sever, several, severally, severe, severed, severely, severing, severity and severs all have the same stem, sever. Such stemming collisions are not ideal from the standpoint of information retrieval, although such clear-cut stemming collisions are the exception rather than the rule.
There are other stemming algorithms, motivated in part by the desire to avoid such stemming collisions. Some stemmers use a comprehensive dictionary, others rely on statistical approaches.
Similar to stemming is lemmatization. In linguistics, a lemma is the "dictionary form" of a word. Lemmatization is the process of determining the lemma for a given word, so different inflected forms of a word can be analyzed as a single item. Lemmatization is closely related to stemming but unlike stemming, which operates only on a single word at a time, lemmatization operates on the full text and therefore can discriminate between words that have different meanings depending on part of speech.
Many wide-coverage parsers use lemmatization to correlate lexical information for morphology, syntax and semantics. For these uses, the lemmatizer is run before parsing, and it uses the stem (usually found by the Porter stemming algorithm) along with the part of speech, both of which are used to find the lemma unambiguously.
Lemmatization is also used in the search and IR domain for listing all the morphological variants of a word. Usually, this is done by looking up a list of related words in a dictionary. This kind of lemmatization is computationally simpler, since almost all the work is done off-line in compiling the dictionary of morphological variants.
Both stemming and lemmatization are computationally efficient ways to improve search. For English, the effects in most cases will be modest, but in some domains (and in other languages) it is clearly worthwhile.

SHANTARAM PATIL
said:
|
... i have porter algorithm which just does stemming but for getting the morphological words ,from where should i get the code? please help me |
John O'Neil
said:
|
... I'll assume you're working with English text, and from your comment, it sounds like you want a lemmatizer, rather than a stemmer. There are two ways to do lemmatization, depending on your preferences and needs. First, you can collect a lemma dictionary for English, which contains all the morphological variants for a large number of words. This is usually what search engines use, because it's relatively quick and fits in with the underlying technology well. However, for your comment, I think you want the unique "dictionary form" of a word. This requires more information about the word, usually the part of speech. There are a number of good open-source POS taggers for English, and most of the one that need to be trained come with a trained model. I think that the Stanford POS tagger (in Java) and OpenNLP (also in Java) as well as NLTK (in Python) are useful, and there are many others. With a POS tagger to do the disambiguation, it's possible to create a (word, pos) to lemma dictionary, but that's a lot of work unless someone else has already done it for you. On the other hand, you can use a morphological analyzer. All the systems I know of are commercial, except for KIMMO (http://www.sil.org/pckimmo/) and Hunmorph (http://mokk.bme.hu/resources/hunmorph), but they may or may not be what you're looking for. For ambiguous tokens, though, a morphological analyzer will produce more than one result, and so to get the correct one, you need to have the part of speech tag, which will almost always resolve the ambiguity (for English, anyway). |
Articles by Date
Recent Posts
Thinking Like a Tester
What AIE and unified information access mean for developers
The (Real) Semantic Web Requires Machine Learning
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8

