Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a generative statistical model in which sets of observations are explained by unobserved groups, which account for why some parts of the data are similar. In natural language processing (NLP), where the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word is attributable to one of the document’s topics.

In LDA, each document may be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document. It has been noted, however, that the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.
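The effect of the Dirichlet prior on the topic mixtures can be illustrated with a minimal sketch. It assumes a symmetric prior with a single concentration value and uses NumPy; the function name `sample_topic_mixtures` and the default sizes are hypothetical choices for illustration, not part of the model's definition.

```python
import numpy as np

def sample_topic_mixtures(alpha, n_topics=5, n_docs=1000, seed=0):
    """Draw per-document topic mixtures from a symmetric Dirichlet(alpha).

    Illustrative helper (assumed name and defaults): each row of the
    returned array is one document's distribution over n_topics topics.
    """
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

# Small concentration: most of the mass tends to fall on a few topics.
sparse = sample_topic_mixtures(alpha=0.1)
# Large concentration: mixtures tend toward the uniform distribution.
flat = sample_topic_mixtures(alpha=10.0)

# Average weight of the single largest topic in each document.
print("mean largest weight, alpha=0.1:", sparse.max(axis=1).mean())
print("mean largest weight, alpha=10: ", flat.max(axis=1).mean())
```

With a concentration below 1, sampled documents are typically dominated by one or two topics, which matches the assumption that each document mixes only a small number of topics.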

Formalization

The model consists of the following elements:

  • \(\alpha\) is the parameter of the Dirichlet prior on the per-document topic distributions,
  • \(\beta\) is the parameter of the Dirichlet prior on the per-topic word distribution,
  • \(\theta_{i}\) is the topic distribution for document i,
  • \(\varphi_{k}\) is the word distribution for topic k,
  • \(z_{ij}\) is the topic for the jth word in document i, and
  • \(w_{ij}\) is the specific word.

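The generative model is expressed by the joint distribution:

\[P(\boldsymbol{W},\boldsymbol{Z},\boldsymbol{\theta},\boldsymbol{\varphi};\alpha,\beta)=\prod_{k=1}^{K}P(\varphi_{k};\beta)\prod_{i=1}^{M}P(\theta_{i};\alpha)\prod_{j=1}^{N_{i}}P(z_{ij}\mid\theta_{i})\,P(w_{ij}\mid\varphi_{z_{ij}})\]

where K is the number of topics, M is the number of documents, and \(N_{i}\) is the number of words in document i. Each \(\theta_{i}\) and \(\varphi_{k}\) is drawn from its Dirichlet prior, and each \(z_{ij}\) and \(w_{ij}\) is drawn from a categorical distribution with parameters \(\theta_{i}\) and \(\varphi_{z_{ij}}\), respectively.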
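The factorization can be read as a sampling recipe: draw the topics' word distributions, draw each document's topic distribution, then draw a topic and a word for every position. The following is a minimal sketch of that recipe using NumPy. It assumes symmetric Dirichlet priors and, for simplicity, the same length for every document; the function name `generate_corpus` and all default values are hypothetical. The variable names `phi`, `theta`, `z`, and `w` follow the symbols in the element list.

```python
import numpy as np

def generate_corpus(n_docs=4, n_topics=3, vocab_size=10, doc_len=20,
                    alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process.

    Assumptions for illustration: symmetric priors and a fixed document
    length. Words and topics are represented by integer ids.
    """
    rng = np.random.default_rng(seed)
    # phi[k]: word distribution of topic k, drawn from Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # theta[i]: topic distribution of document i, drawn from Dirichlet(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    z = np.empty((n_docs, doc_len), dtype=int)  # topic of each word position
    w = np.empty((n_docs, doc_len), dtype=int)  # word id at each position
    for i in range(n_docs):
        for j in range(doc_len):
            # Choose the topic of position j from the document's mixture.
            z[i, j] = rng.choice(n_topics, p=theta[i])
            # Choose the word from the word distribution of that topic.
            w[i, j] = rng.choice(vocab_size, p=phi[z[i, j]])
    return phi, theta, z, w

phi, theta, z, w = generate_corpus()
print("topic assignments of document 0:", z[0])
print("word ids of document 0:         ", w[0])
```

Inference in LDA runs in the opposite direction: only `w` is observed, and `theta`, `phi`, and `z` have to be estimated from it.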

LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Essentially the same model was also proposed independently by J. K. Pritchard, M. Stephens, and P. Donnelly in the study of population genetics in 2000.

See also

Latent Semantic Analysis
