Soc Sci Res. 2022 Nov;108:102798. doi: 10.1016/j.ssresearch.2022.102798. Epub 2022 Oct 1.
Since the beginning of this millennium, human-generated text in machine-readable form has become increasingly available to social scientists, offering a unique window into social life. However, harnessing vast quantities of this highly unstructured data in a systematic way poses a distinct combination of analytical and methodological challenges. Fortunately, our understanding of how to overcome these challenges has also developed greatly over the same period. In this article, I present a novel typology of the methods social scientists have used to analyze text data at scale in the interest of testing and developing social theory. I describe three “families” of methods: analyses of (1) term frequency, (2) document structure, and (3) semantic similarity. For each family, I discuss its logical and statistical foundations, its analytical strengths and weaknesses, and its prominent variants and applications.