A common task on the Web is to insert spaces back into a sequence where they've been removed. The spaces are removed for any number of reasons, one of the most common being the creation of so-called hash tags (example). Domain names and even license plates similarly could be word-broken for clarity (example).
One approach is to use a dictionary, another with character N-gram data. But with our N-gram data, you have a few advantages:
- Ranking — with probability values, you can rank one word-break candidate over another.
- Currency — because the data is culled from the Web, our lexicon reflects how people compose text. The dataset incorporates leetspeak, lolspeak, textese, to name a few “languages”.
- Unsupervised — no human intervention is required to bootstrap the process, no special rules, exceptions, and so forth. Just the data.
To demonstrate N-gram based word-breaking, we've put together a few demo applications.
NgramWordBreaker/Break.svc
In a brute-force approach where any character boundary can have 0 or 1 space characters, you can have a potentially expensive lookup process, even with fairly agressive pruning. So instead of using the full lexicon we prepared a reduced-lexicon version of the Jun09 title stream, using only unigram and bigram data (also trimmed.) Using this data, we can serve the top-N word-break candidates, using a cutoff threshold of two orders of magnitude from the top candidate. The score reflects only the N-gram probability of the phrase, i.e. no insertion penalty is assessed for adding a space, etc.
Syntax
http://web-ngram.research.microsoft.com/ngramwordbreaker/break.svc/?p={phrase}[&format={format}]
{phrase} is the phrase you want word-broken.
{format} can be one of the following: text, json, xml. If omitted, text is assumed.
Example:
http://web-ngram.research.microsoft.com/ngramwordbreaker/break.svc/?p=ilovengrams&format=json
yields:
[{"Score":-13.0857868,"Subtokens":["i","love","ngrams"]},
{"Score":-15.0297565,"Subtokens":["i","love","n","grams"]}]
WordBreaker4Web
Unlike the lightweight REST interface, you can bring the full dataset to help you perform word-breaking. More details on this can be found here. The page hosts a Silverlight control that calls our Web N-Gram service. A screenshot of the control is shown below:
