Data-Driven Word-Breaking

A common task on the Web is to insert spaces back into a string from which they have been removed.  Spaces are removed for any number of reasons, one of the most common being the creation of so-called hash tags (example).  Domain names and even license plates could similarly be word-broken for clarity (example).

One approach is to use a dictionary; another is to use character N-gram data. With our N-gram data, you have a few advantages:

  • Ranking — with probability values, you can rank one word-break candidate over another.
  • Currency — because the data is culled from the Web, our lexicon reflects how people actually compose text.  The dataset incorporates leetspeak, lolspeak, and textese, to name a few “languages”.
  • Unsupervised — no human intervention is required to bootstrap the process: no special rules, no exceptions, and so forth.  Just the data.
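The ranking advantage can be illustrated with a minimal sketch. The unigram log-probabilities below are invented for illustration only (they are not values from the actual dataset); the idea is simply that summing per-word log probabilities lets you order competing segmentations of the same string:

```python
# Hypothetical unigram log10 probabilities -- illustrative values only,
# not taken from the real N-gram dataset.
log_prob = {
    "now": -3.0, "here": -3.2, "nowhere": -4.5,
    "no": -2.8, "where": -3.1,
}

OOV = -12.0  # back-off score for words missing from the toy lexicon

def phrase_score(words):
    """Score a segmentation as the sum of its words' log10 probabilities
    (equivalently, the product of the probabilities)."""
    return sum(log_prob.get(w, OOV) for w in words)

# Three ways to segment "nowhere", ranked best-first by probability.
candidates = [["now", "here"], ["no", "where"], ["nowhere"]]
ranked = sorted(candidates, key=phrase_score, reverse=True)
```

With these toy numbers, the unsegmented `["nowhere"]` wins because its single-word probability exceeds the product of any two-word split.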

To demonstrate N-gram based word-breaking, we've put together a few demo applications.


In a brute-force approach where each character boundary can receive 0 or 1 space characters, the lookup process can be expensive, even with fairly aggressive pruning. So instead of using the full lexicon, we prepared a reduced-lexicon version of the Jun09 title stream, using only unigram and bigram data (also trimmed).  Using this data, we can serve the top-N word-break candidates, with a cutoff threshold of two orders of magnitude from the top candidate.  The score reflects only the N-gram probability of the phrase, i.e., no insertion penalty is assessed for adding a space.
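The scheme above can be sketched in a few lines. This is a simplified unigram-only version (the service also uses bigrams), with a toy lexicon standing in for the reduced Jun09 title-stream data; the word list, probabilities, and back-off value are all assumptions for illustration:

```python
# Toy reduced lexicon: unigram log10 probabilities (illustrative only).
UNIGRAM = {"the": -1.5, "me": -2.5, "them": -3.0, "e": -5.0, "theme": -3.5}
OOV = -12.0  # back-off for out-of-vocabulary segments

def segmentations(s):
    """Brute force: every character boundary gets 0 or 1 space,
    so enumerate all 2^(len(s)-1) segmentations."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        for rest in segmentations(s[i:]):
            yield [head] + rest

def word_break(s, top_n=5, cutoff=2.0):
    """Rank candidates by summed log10 probability; keep the top-N that
    fall within `cutoff` orders of magnitude of the best candidate.
    No insertion penalty is charged for adding a space."""
    scored = [(sum(UNIGRAM.get(w, OOV) for w in c), c)
              for c in segmentations(s)]
    scored.sort(reverse=True)
    best_score = scored[0][0]
    return [(sc, c) for sc, c in scored[:top_n] if best_score - sc <= cutoff]
```

With this lexicon, `word_break("theme")` keeps `["theme"]` and `["the", "me"]`, while `["them", "e"]` falls outside the two-orders-of-magnitude window. A production version would memoize suffix scores (dynamic programming) rather than enumerate every segmentation.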


{phrase} is the phrase you want word-broken.

{format} can be one of the following: text, json, or xml.  If omitted, text is assumed.





Unlike with the lightweight REST interface, here you can bring the full dataset to bear on word-breaking. More details on this can be found here.  The page hosts a Silverlight control that calls our Web N-Gram service.  A screenshot of the control is shown below:

Silverlight control screenshot