Unified Information Access
Doing Things with Words, Part One: Tokenization
Today, much of the search business is about handling plain old text. The basic fact about text is this: it's made of words. Search engines need words (they're what goes into the index), and of course queries are made of words, too. There's a chain of events that's the same for every application (and every language) when we ingest words, index a document and answer a query. The first link in that chain is tokenization.
Tokenization involves taking a big string (the text block) and turning it into a list of strings (the tokens). In English, it's usually pretty easy to find tokens, simply by splitting the text at the spaces. However, an English tokenizer does a little more than that:
• It separates punctuation like periods from the beginning or the end of other tokens.
• It splits off things like contractions (-n't, -'ll, -'ve, for example) and possessives (like -'s) and makes them tokens on their own.
• It also recognizes punctuation, numbers and things like email addresses and URLs, and sometimes deals with them specially.
• In some cases, it looks for multiword units (also called collocations) and treats them as a single word. Sometimes, they're closely associated parts of a proper name, like "White House". Other times, they're connective words that make much less sense when separated, like "because of" and "with respect to". Also, foreign phrases can be treated as a single unit, like "ab initio" or "et cetera".
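The steps in the list above can be sketched in a few lines of Python. This is only a minimal illustration, not a production tokenizer: the regular expressions are simplified, and the `MULTIWORD_UNITS` set, `TOKEN_PATTERN` and the `tokenize` function are hypothetical names made up for this example. A real system would use a much larger lexicon of multiword units and far more careful rules for URLs and email addresses.

```python
import re

# Hypothetical lexicon of multiword units (lowercased); purely illustrative.
MULTIWORD_UNITS = {
    ("white", "house"),
    ("because", "of"),
    ("with", "respect", "to"),
    ("ab", "initio"),
    ("et", "cetera"),
}
MAX_UNIT_LEN = max(len(unit) for unit in MULTIWORD_UNITS)

# Alternatives are tried in order at each position, so the special cases
# (URLs, email addresses, numbers, clitics) come before ordinary words.
TOKEN_PATTERN = re.compile("|".join([
    r"https?://\S*[^\s.,;:!?)]",      # URLs, minus trailing punctuation
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",  # email addresses (simplified)
    r"\d+(?:[.,]\d+)*\b",             # numbers such as 3.5 or 1,000
    r"n't\b",                         # the contraction -n't
    r"'(?:ll|ve|re|s|d|m)\b",         # other contractions and possessive -'s
    r"\w+?(?=n't\b)",                 # the part of a word before -n't
    r"\w+",                           # ordinary words
    r"[^\w\s]",                       # any other single punctuation mark
]), re.IGNORECASE)


def merge_multiword_units(tokens):
    """Join adjacent tokens that form a known multiword unit, longest first."""
    merged = []
    i = 0
    while i < len(tokens):
        for size in range(min(MAX_UNIT_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + size])
            if candidate in MULTIWORD_UNITS:
                merged.append(" ".join(tokens[i:i + size]))
                i += size
                break
        else:
            # No multiword unit starts here; keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


def tokenize(text):
    """Split text into tokens, then merge known multiword units."""
    return merge_multiword_units(TOKEN_PATTERN.findall(text))


print(tokenize("The White House isn't closed because of John's email to info@example.com."))
# ['The', 'White House', 'is', "n't", 'closed', 'because of', 'John', "'s",
#  'email', 'to', 'info@example.com', '.']
```

Note that the period at the end of the sentence becomes its own token, while the periods inside the email address stay put. That kind of distinction is exactly why a tokenizer has to do more than split at spaces.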
English is one of the easiest languages to write a tokenizer for, though. Some languages make much heavier use of elements like English's contractions, which are called "clitics". For example, in Arabic many space-separated tokens contain several distinct words. One example is the transliterated token "wbYAlKtAb". Even after you add in the unwritten Arabic vowels, you still have to separate off several layers of prefixes to get "wa biY Al KiTAb", meaning "and to the book".
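To make the idea of peeling off layers of prefixes concrete, here is a toy sketch. It assumes the transliteration used in the example above, and the prefix list, the `MIN_STEM_LEN` guard and the `split_clitics` function are all hypothetical, invented for this illustration. A real Arabic tokenizer needs a proper morphological analyzer, because a leading letter that looks like a prefix is often just part of the word itself.

```python
# Hypothetical prefix layers, outermost first, written in the same
# transliteration as the example token. Real inventories are larger.
PREFIX_LAYERS = [
    ("w",),    # first layer, as in "wa"
    ("bY",),   # second layer, as in "biY"
    ("Al",),   # third layer, as in "Al"
]

# Assumed guard: never strip a prefix if fewer than this many
# characters would be left for the stem.
MIN_STEM_LEN = 3


def split_clitics(token):
    """Greedily strip at most one prefix per layer, outermost layer first."""
    pieces = []
    rest = token
    for layer in PREFIX_LAYERS:
        for prefix in layer:
            if rest.startswith(prefix) and len(rest) - len(prefix) >= MIN_STEM_LEN:
                pieces.append(prefix)
                rest = rest[len(prefix):]
                break
    pieces.append(rest)
    return pieces


print(split_clitics("wbYAlKtAb"))
# ['w', 'bY', 'Al', 'KtAb']
```

The sketch only handles the unvowelled form; restoring the vowels to get "wa biY Al KiTAb" is a separate problem.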
Even more challenging is tokenizing languages like Japanese and Chinese, where there are no spaces between the words: the words are simply run together, likewhenyourspacebarisn'tworking. Most Chinese words are two or more characters long (and Japanese words are even longer); at the same time, the writing system, with several thousand characters, leads to much ambiguity. For all those reasons, it's tricky even for a native speaker to always know where the words begin and end. For a computer, tokenizing Chinese and Japanese accurately is one of the biggest challenges in the field of natural language processing.
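One classic baseline for text without spaces is greedy "maximum matching": at each position, take the longest dictionary word that fits, then move on. The sketch below applies it to the run-together English example, using a tiny made-up lexicon; the `LEXICON` set and the `max_match` function are hypothetical names for this illustration. It is a simple baseline for showing the problem, not a description of how any particular product segments Chinese or Japanese.

```python
# Tiny, made-up lexicon. It deliberately contains overlapping entries
# ("you"/"your", "space"/"spacebar", "work"/"working") to show where
# the ambiguity comes from.
LEXICON = {
    "like", "when", "you", "your", "space", "bar", "spacebar",
    "is", "isn't", "work", "working", "king",
}


def max_match(text, lexicon):
    """Segment text by repeatedly taking the longest lexicon entry that fits."""
    longest = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                break
        else:
            # Nothing in the lexicon starts here: emit one character alone.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


print(max_match("likewhenyourspacebarisn'tworking", LEXICON))
# ['like', 'when', 'your', 'spacebar', "isn't", 'working']
```

Here the longest-match rule happens to pick "your" over "you" and "spacebar" over "space", which is what we want. In general the longest word is not always the right one, which is one reason accurate segmentation is so hard.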
So, that's the blog-sized introduction to tokenization. In future installments, we'll look at other components, like stemmers, part-of-speech taggers, lemmatizers and entity extractors.
Author Bio
John O'Neil has written and designed software for search, natural language processing and machine learning for 10 years. After receiving a Ph.D. in computational linguistics from Harvard University, he worked for Lingo Motors, where he designed their main commercial product and ended up with his name on a number of their patents, as well as for other search engine companies, where he worked to increase search relevancy and accuracy. He also worked for over five years at Basis Technology, Inc., where he was the designer and lead developer for the Rosette Linguistics Platform, their language processing and entity extraction suite of products.
