Sci Data. 2025 Apr 19;12(1):662. doi: 10.1038/s41597-025-04681-x.
ABSTRACT
Unstructured text data have gained popularity in political science, owing to advancements in rigorous ‘text-as-data’ methods that allow extracting insights into election outcomes, candidates’ appeal to voters, ideologies and campaign strategies. Existing datasets on US presidential election campaign speeches are limited in size or source variation, and often contain speeches of different types (debates, rallies, official presidential events, e.g. inauguration), thus lacking consistency in their rhetorical content. The dataset introduced here comprises the campaign speeches of the Democratic and Republican tickets for the 2020 US presidential election (1,056 in total), covering the period between January 2019 and January 2021. Importantly, the dataset applies specific criteria to the rhetorical structure of each speech, ensuring the consistency that is critical for quantitative analysis. It has been carefully curated, but only to the extent necessary, so that it can still inform studies that require semantic or grammatical/syntactical structure. The corpus is hosted on Zenodo and GitHub under the CC BY-NC 4.0 license, and it aims to support timely studies on US presidential elections with high-quality text data.
PMID:40253428 | DOI:10.1038/s41597-025-04681-x