Microsoft Web N-Gram API

The Microsoft Web N-Gram API currently provides two services for the community, Lookup and Generate. The former returns the probability of a sequence of words under a given model, and the latter returns a list of the words that are most likely to follow a given phrase, together with their probabilities.

1. Lookup Service

Web Service URL: http://weblm.research.microsoft.com/weblm/Lookup.svc

1.1. GetModels

Method: string[] GetModels()
Description: Get a list of currently supported N-Gram models.
Arguments: None
Returns: An array of URNs, one for each model supported by this service. Each URN has the following form:
urn:ngram:{model-name}:{version}:{order}
For example,
urn:ngram:bing-body:jun09:3
corresponds to a trigram (order=3) model named “bing-body”, version “jun09” (i.e. June 2009).
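
The URN layout above can be taken apart on the client side. The following is a minimal Python sketch, not part of the service; the names ModelUrn and parse_model_urn are hypothetical helpers introduced here for illustration.

```python
from typing import NamedTuple


class ModelUrn(NamedTuple):
    """Fields of a model URN (hypothetical helper type, not part of the API)."""
    name: str
    version: str
    order: int


def parse_model_urn(urn: str) -> ModelUrn:
    """Split 'urn:ngram:{model-name}:{version}:{order}' into its three fields."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "ngram":
        raise ValueError(f"not an N-Gram model URN: {urn!r}")
    _, _, name, version, order = parts
    return ModelUrn(name, version, int(order))


model = parse_model_urn("urn:ngram:bing-body:jun09:3")
print(model.name, model.version, model.order)
```

Knowing the order is useful later on, because it determines how many words of context a model actually takes into account.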

1.2. GetProbability

Method: float GetProbability(string authorizationToken,
                             string modelUrn,
                             string phrase)
Description: Find the joint probability of the words in a phrase in the specified model.
Arguments:
  authorizationToken: Authorization token as provided by Microsoft Research. The token is a GUID.
  modelUrn: One of the URNs returned by GetModels().
  phrase: A string containing a sequence of words for which to compute the probability. The words should be separated by spaces.
Returns: The base-10 log of the joint probability of the word sequence. For instance, if you have an order m model and the following word sequence:
w_1, w_2, …, w_n
The return value is the base-10 log of the following:
P(w_1) P(w_2 | w_1) … P(w_n | w_{n-m+1} … w_{n-1})
Notes:
  • The token <s> represents the beginning of a phrase.
  • Punctuation is generally ignored.
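
To make the product above concrete, here is a small Python sketch of the same chain-rule sum in log space. It does not call the service: the cond_log10 callback and the toy table are assumptions standing in for a real model, and the service's own tokenization may differ.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical callback: (context words, word) -> base-10 log of P(word | context).
CondLogProb = Callable[[Sequence[str], str], float]


def joint_log10(words: Sequence[str], order: int, cond_log10: CondLogProb) -> float:
    """Sum log10 P(w_i | at most order-1 preceding words) over the phrase."""
    total = 0.0
    for i, word in enumerate(words):
        # An order-m model conditions on at most m-1 preceding words.
        context = words[max(0, i - (order - 1)):i]
        total += cond_log10(context, word)
    return total


# Toy bigram table with made-up values, for illustration only.
TOY: Dict[Tuple[Tuple[str, ...], str], float] = {
    ((), "a"): -1.0,
    (("a",), "b"): -0.5,
    (("b",), "c"): -0.25,
}


def toy_lookup(context: Sequence[str], word: str) -> float:
    return TOY[(tuple(context), word)]


log_p = joint_log10(["a", "b", "c"], order=2, cond_log10=toy_lookup)
print(log_p, 10 ** log_p)  # log10 of the joint probability, then the probability itself
```

Because the return value is a base-10 log, raising 10 to that value recovers the probability, and the logs of independent factors add rather than multiply.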

1.3. GetProbabilities

Method: float[] GetProbabilities(string authorizationToken,
                                 string modelUrn,
                                 string[] phrases)
Description: Batch-mode method for GetProbability. Find the joint probability of the words in multiple phrases in the specified model.
Arguments:
  authorizationToken: Authorization token as provided by Microsoft Research. The token is a GUID.
  modelUrn: One of the URNs returned by GetModels().
  phrases: An array of strings containing sequences of words for which to compute the probability. The words should be separated by spaces.
Returns: An array of joint probabilities, one for each phrase in phrases. For details about the joint probability, see GetProbability.
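
A common use of the batch method is to compare several candidate phrases in one round trip. The Python sketch below assumes that the returned array is in the same order as phrases; get_probabilities is a hypothetical stand-in for a client call with the authorization token and model URN already bound, and fake_lookup returns made-up values.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for GetProbabilities with token and model URN bound.
BatchLookup = Callable[[Sequence[str]], Sequence[float]]


def rank_phrases(phrases: Sequence[str],
                 get_probabilities: BatchLookup) -> List[Tuple[str, float]]:
    """Pair each phrase with its log probability, most likely phrase first."""
    scores = get_probabilities(phrases)
    if len(scores) != len(phrases):
        raise ValueError("expected one score per phrase")
    # A higher (less negative) log probability means a more likely phrase.
    return sorted(zip(phrases, scores), key=lambda pair: pair[1], reverse=True)


def fake_lookup(phrases: Sequence[str]) -> List[float]:
    """Local stub with made-up log probabilities, for illustration only."""
    table = {"their house": -6.1, "there house": -8.4, "they're house": -9.0}
    return [table[p] for p in phrases]


for phrase, score in rank_phrases(["there house", "their house", "they're house"], fake_lookup):
    print(f"{score:6.1f}  {phrase}")
```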

1.4. GetConditionalProbability

Method: float GetConditionalProbability(string authorizationToken,
                                        string modelUrn,
                                        string phrase)
Description: Find the conditional probability of the last word in a phrase, given the preceding words as context, in the specified model.
Arguments:
  authorizationToken: Authorization token as provided by Microsoft Research. The token is a GUID.
  modelUrn: One of the URNs returned by GetModels().
  phrase: A string containing a sequence of words for which to compute the probability. The words should be separated by spaces.
Returns: The base-10 log of the conditional probability of the last word in the sequence, given the words that precede it. For instance, if you have an order m model and the following word sequence:
w_1, w_2, …, w_n
The return value is the base-10 log of the following:
P(w_n | w_{n-m+1} … w_{n-1})
Notes:
  1. If n=0, the return value is float.NaN.
  2. If n>m, only the last m words are used; the words at the beginning of phrase are ignored.
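
The two notes above can be mirrored on the client side to see which words a call actually depends on. This Python sketch uses a hypothetical helper, split_context, and plain whitespace splitting; the service's own tokenization is not specified here and may differ.

```python
from typing import List, Optional, Tuple


def split_context(phrase: str, order: int) -> Optional[Tuple[List[str], str]]:
    """Return (context words, last word) as an order-m model would see them.

    Returns None for an empty phrase (n=0), the case in which the service
    is documented to return float.NaN.
    """
    words = phrase.split()
    if not words:
        return None
    used = words[-order:]          # only the last m words matter
    return used[:-1], used[-1]     # at most m-1 context words, then the target


print(split_context("the quick brown fox", order=3))
print(split_context("fox", order=3))
print(split_context("", order=3))
```

With a trigram model, the first call keeps "quick brown" as context for "fox" and drops "the", which matches the rule for n>m.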

1.5. GetConditionalProbabilities

Method: float[] GetConditionalProbabilities(string authorizationToken,
                                            string modelUrn,
                                            string[] phrases)
Description: Batch-mode method for GetConditionalProbability. Find the conditional probability of the last word in its context, for multiple phrases in the specified model.
Arguments:
  authorizationToken: Authorization token as provided by Microsoft Research. The token is a GUID.
  modelUrn: One of the URNs returned by GetModels().
  phrases: An array of strings containing sequences of words for which to compute the probability. The words should be separated by spaces.
Returns: An array of conditional probabilities, one for each phrase in phrases. For details about the conditional probability, see GetConditionalProbability.

2. Generate Service

2.1. GetModels

Method: string[] GetModels()
Description: Get a list of currently supported N-Gram models.
Arguments: None
Returns: An array of URNs, one for each model supported by this service. Each URN has the following form:
urn:ngram:{model-name}:{version}:{order}
For example,
urn:ngram:bing-body:jun09:3
corresponds to a trigram (order=3) model named “bing-body”, version “jun09” (i.e. June 2009).

Note that this method and the GetModels method in the Lookup service do not return the same list of URNs. Not all models can be used by this service; none of the unigram (order=1) models, for example, can be used in generative mode.

2.2. Generate

Method: TokenSet Generate(string authorizationToken,
                          string modelUrn,
                          string phraseContext,
                          int maxN,
                          string cookie)
Description: Given a phrase (sequence of words), find the words that are the most likely to follow the phrase.
Arguments:
  authorizationToken: Authorization token as provided by Microsoft Research. The token is a GUID.
  modelUrn: One of the URNs returned by this service's GetModels() method.
  phraseContext: A string containing a sequence of words from which to generate the list of words likely to follow. The words should be separated by spaces.
  maxN: The maximum number of following words to return. At most 1000 words are ever returned per call, irrespective of this value.
  cookie: An opaque placeholder to facilitate subsequent calls.
Returns: A TokenSet object, shown here:
class TokenSet
{
   public string cookie;
   public float backoff;
   public string[] words;
   public float[] probabilities;
}
The TokenSet lists words in decreasing order of probability. The conditional probability of each word is also returned, allowing users to compare the likelihood of the words.

Because this method is not a streaming method, you must call it repeatedly if you require more than maxN results. After each call to Generate, pass the cookie returned in the TokenSet object to the next call.

The backoff value is returned in case you wish to call Generate again with a shortened phraseContext, so that the values returned by the two calls are comparable.

Notes:
  1. During the first call to Generate, cookie must be an empty string.
  2. It is an error to change any of the arguments other than maxN (and, of course, cookie) when cookie is non-empty. The results are not reliable otherwise.
  3. For a model of order m, if phraseContext contains more than m-1 words, the excess words at the beginning of phraseContext are ignored.
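
The cookie protocol described above can be sketched as a paging loop. This Python sketch makes several assumptions that the text does not confirm: generate is a hypothetical client call with the authorization token and model URN already bound, words and probabilities are treated as parallel arrays, and an empty words array is taken to mean that no results remain. The limit parameter guards against looping forever if that last assumption is wrong. The fake_generate stub and its values are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class TokenSet:
    """Python mirror of the TokenSet class shown above."""
    cookie: str = ""
    backoff: float = 0.0
    words: List[str] = field(default_factory=list)
    probabilities: List[float] = field(default_factory=list)


# Hypothetical client call: (phraseContext, maxN, cookie) -> TokenSet.
GenerateCall = Callable[[str, int, str], TokenSet]


def iter_following_words(generate: GenerateCall, phrase_context: str,
                         max_n: int = 100, limit: int = 5000) -> Iterator[Tuple[str, float]]:
    """Yield (word, probability) pairs across successive Generate calls."""
    cookie = ""  # the first call must pass an empty cookie
    seen = 0
    while seen < limit:
        result = generate(phrase_context, max_n, cookie)
        if not result.words:  # assumption: an empty page means we are done
            return
        for pair in zip(result.words, result.probabilities):
            yield pair
            seen += 1
            if seen >= limit:
                return
        cookie = result.cookie  # only the cookie changes between calls


# Local stub with made-up data; the cookie here is just a list offset.
VOCAB = [("city", -0.4), ("times", -0.7), ("state", -1.1), ("giants", -1.6), ("yankees", -1.9)]


def fake_generate(phrase_context: str, max_n: int, cookie: str) -> TokenSet:
    start = int(cookie) if cookie else 0
    page = VOCAB[start:start + max_n]
    return TokenSet(cookie=str(start + len(page)),
                    words=[w for w, _ in page],
                    probabilities=[p for _, p in page])


for word, prob in iter_following_words(fake_generate, "new york", max_n=2):
    print(word, prob)
```

Keeping phraseContext fixed inside the loop and changing only cookie follows note 2 above.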

Last Updated: July 8, 2010