For a long time, many social scientists
have conducted content analysis by using their substantive knowledge
and manually coding documents. In recent years, however, fully
automated content analysis based on probabilistic topic models has
become increasingly popular because of its scalability.
Unfortunately, applied researchers find that these models often fail
to yield topics of substantive interest, inadvertently creating
multiple topics with similar content and combining different themes
into a single topic. In this paper, we empirically
demonstrate that providing topic models with a small number of
keywords can substantially improve their performance. The proposed
keyword assisted topic model (keyATM) offers an important advantage:
specifying keywords requires researchers to label topics prior to
fitting a model to the data. This contrasts with the widespread
practice of post-hoc topic interpretation and adjustment, which
compromises the objectivity of empirical findings. In our
applications, we find that the keyATM provides more interpretable
results, achieves better document classification performance, and is
less sensitive to the number of topics than standard topic models.
Finally, we show that the keyATM can also incorporate covariates and
model time trends.
An open-source software package is available for implementing the
proposed methodology. (Last updated in April 2020)