In part 1 of "Doing Things with Words", we took a short look at tokenization. This time we're talking about sentence boundary detection.
Sentence Boundary Detection (SBD) starts with a stream of tokens, and looks for the beginnings and ends of sentences. For plain-old-vanilla search engines, SBD is often not used, because basic search engines only care about tokens (and token positions) in each document. However, if you want to extract additional information from text, the tools that do the extracting usually need to know where the sentences are. (Part-of-speech taggers and entity extractors usually operate on individual sentences, to name but two examples that we'll talk about in future installments.)
In English, finding sentence boundaries can be tricky. To native speakers, this is perhaps unexpected, since all us native English speakers (excuse me, all we native English speakers) learned that sentences end with periods, exclamation points or question marks. How difficult can that be?
For exclamation points and question marks, it usually isn't too difficult (although Yahoo! does complicate matters). But periods are a different story, because periods can be used in a number of different ways:
1) To end a sentence. (e.g. I saw a squirrel.)
2) To mark an abbreviation. (e.g. Attivio is on Walnut St. in Newton.)
3) To mark an abbreviation and end a sentence, at the same time. (e.g. Bob got a doctorate from M.I.T.)
4) To mark the end of a sentence, while allowing punctuation (usually double or single quotation marks) to follow. (e.g. I said, "Attivio is in Newton.")
5) To mark an ellipsis... like that.
6) To mark an ellipsis and end a sentence, at the same time.
So, if we determine that a particular period is case 1, 3, or 6 above, then we mark the period token as the end of a sentence. In case 2 or 5, we don't. Finally, in case 4, we move the sentence-end marker to the following punctuation token.
(Alternatively, we can treat certain quotation marks as sentence-ending tokens in their own right, which amounts to the same thing.)
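The case-4 adjustment is easy to picture in code. Here's a minimal sketch (the token list and helper function are made up for illustration): given a period we've already decided ends a sentence, slide the sentence-end marker forward over any closing quotation marks.

```python
CLOSING_QUOTES = {'"', "'", '\u201d', '\u2019'}  # straight and curly closers

def sentence_end_index(tokens, period_index):
    """Given the index of a sentence-ending period, return the index of the
    token that should carry the sentence-end marker (handles case 4)."""
    end = period_index
    # Shift the marker over any trailing quotation marks.
    while end + 1 < len(tokens) and tokens[end + 1] in CLOSING_QUOTES:
        end += 1
    return end

tokens = ['I', 'said', ',', '"', 'Attivio', 'is', 'in', 'Newton', '.', '"']
print(sentence_end_index(tokens, 8))  # -> 9: the closing quote ends the sentence
```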
There are a couple of ways to determine sentence boundaries automatically. One approach is to write a set of rules by hand, like this:
(a) If it's a period, it ends a sentence.
(b) If the preceding token is on my hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.
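Those three rules translate almost directly into code. Here's a toy version, with a deliberately tiny hand-compiled abbreviation list - a sketch of the idea, not a production implementation:

```python
ABBREVIATIONS = {'st', 'dr', 'mr', 'mrs', 'e.g', 'i.e'}  # tiny, hand-compiled

def ends_sentence(tokens, i):
    """Apply the three hand-written rules to the token at index i."""
    if tokens[i] != '.':
        return False
    end = True                           # rule (a): a period ends a sentence...
    prev = tokens[i - 1].lower() if i > 0 else ''
    if prev in ABBREVIATIONS:
        end = False                      # rule (b): ...unless it follows an abbreviation
    if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
        end = True                       # rule (c): ...but a capitalized next token wins
    return end

tokens = ['Attivio', 'is', 'on', 'Walnut', 'St', '.', 'in', 'Newton', '.']
print([i for i, t in enumerate(tokens) if ends_sentence(tokens, i)])  # -> [8]
```

Note that the rules are ordered: rule (c) overrides rule (b), so "Walnut St. In fact..." would (correctly or not) be split after "St." - exactly the kind of judgment call that makes hand-written rules top out around 95%.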
This, or something like this, gets about 95% of sentences correct. That isn't bad, considering the rules only took a few minutes to write, but it also isn't great, since two human annotators agree on sentence boundaries about 99.8% of the time. A more accurate approach is to build a "training set" of documents in which the sentence breaks are marked, and to learn a set of rules automatically. With some care, this can reach better than 99.5% accuracy - so the expected error rate drops from 1 in 20 to less than 1 in 200, ten times fewer mistakes. In the long run, the extra work is worth it.
It turns out that it's possible to improve on this using more powerful machine learning technology. Also, there's been a lot of research on automatically learning sentence boundaries without any labeled training data, using only plain text.
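To give a flavor of the unsupervised idea (greatly simplified here - real systems like the Punkt algorithm are considerably more sophisticated): scan plain, unlabeled text and treat any word that almost always appears with a trailing period as a likely abbreviation. The thresholds and corpus below are made up for illustration.

```python
from collections import Counter

def learn_abbreviations(text, min_ratio=0.8, min_count=2):
    """Guess abbreviations from raw text: words that nearly always carry
    a trailing period are probably abbreviations. Illustrative only."""
    with_period = Counter()
    total = Counter()
    for word in text.split():
        bare = word.rstrip('.').lower()
        if not bare:
            continue
        total[bare] += 1
        if word.endswith('.'):
            with_period[bare] += 1
    return {w for w in total
            if total[w] >= min_count and with_period[w] / total[w] >= min_ratio}

corpus = ("Attivio is on Walnut St. in Newton. "
          "Turn left on Main St. and go straight. "
          "The St. Louis office is on Oak St. too.")
print(learn_abbreviations(corpus))  # -> {'st'}
```

Even this toy version hints at why the problem is hard: "St." in "St. Louis" and "St." in "Walnut St." are different abbreviations that happen to look identical, and "Newton." appears with a period too - it's just too rare here to clear the frequency threshold.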
So far in this blog entry, we've talked only about English. The first reason is that everyone reading this is virtually guaranteed to speak English, although the examples could have been in other European languages. The second reason is equally important: languages like Japanese and Chinese have unambiguous sentence-ending markers, which makes SBD a non-issue for them. Of course, if you remember part 1 of this series, you know that tokenization in Japanese and Chinese is very difficult, so perhaps it's only fair.
So, that's the blog-sized introduction to sentence boundary detection. In future installments, we'll look at other components that use sentence boundary detection, like stemmers, part-of-speech taggers, lemmatizers, and entity extractors.