Chapter 2 Basic Frameworks

The purpose of this chapter is to direct you to seminal articles outlining how psychologists think about text data and to ensure you understand common concepts.

2.1 Theory

I recommend the following resources for getting to grips with the fundamental ideas behind text analysis and natural language processing in psychology.

Boyd, R. L., & Markowitz, D. M. (2025). Verbal behavior and the future of social science. American Psychologist, 80(3), 411–433. https://doi.org/10.1037/amp0001319
Boyd, R. L., & Schwartz, H. A. (2021). Natural language analysis and the psychology of verbal behavior: The past, present, and future states of the field. Journal of Language and Social Psychology, 40(1), 21–41. https://doi.org/10.1177/0261927X20967028
Dehghani, M., & Boyd, R. L. (Eds.). (2022). Handbook of language analysis in psychology. Guilford Publications.
Pennebaker, J. W., Mehl, M. R., & Niederhoffer, K. G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54(1), 547–577. https://doi.org/10.1146/annurev.psych.54.101601.145041

2.2 Definitions

Character: The smallest unit in text, such as a letter, digit, or symbol.
String: A sequence of characters treated as a single data element, often representing text.
Word: A basic unit of language consisting of one or more characters, typically separated by spaces or punctuation.
Type: A unique word or token in a text, regardless of how many times it appears.
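Token: An instance of a sequence of characters grouped as a useful unit, often corresponding to a word, subword, or punctuation mark. For example, "the cat saw the dog" contains five tokens but only four types, because "the" occurs twice.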
Lemma: The base or dictionary form of a word, such as "run" for "runs", "ran", and "running".
Term: A meaningful unit of language, often separated by white space, not necessarily a conventional word.
Document: A single unit of text treated as one item for analysis, such as an article, tweet, or email.
Text: A general sequence of written characters, ranging from a sentence to a book.
Corpus: A structured collection of texts or documents assembled for analysis, for example to study language use or to train language models.
Vocabulary: The set of all unique tokens or terms found in a corpus or collection of documents.
Natural Language Processing (NLP): A field of study focused on enabling computers to understand, interpret, and generate human language.
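
To make several of these definitions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a recommended pipeline: the two-sentence corpus is made up, and the `tokenize` helper is a hypothetical, deliberately simple regular-expression tokenizer that lowercases the text and treats each punctuation mark as its own token. Real tokenizers make different choices, and finding lemmas would need a dedicated NLP library, so lemmas are left out here.

```python
import re
from collections import Counter

# A tiny made-up corpus: each string is one document.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]


def tokenize(text):
    """Hypothetical helper: lowercase the text, then split it into
    word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Characters: the smallest units of the first document (a string).
characters = list(corpus[0])

# Tokens: every instance counts, document by document.
tokens_per_document = [tokenize(doc) for doc in corpus]
all_tokens = [tok for doc in tokens_per_document for tok in doc]

# Types: the distinct tokens. Together they form the vocabulary.
type_counts = Counter(all_tokens)
vocabulary = sorted(type_counts)

print("Documents in corpus:", len(corpus))          # 2
print("Characters in document 1:", len(characters))  # 23
print("Tokens in corpus:", len(all_tokens))          # 13
print("Types in vocabulary:", len(vocabulary))       # 9
print("Occurrences of 'the':", type_counts["the"])   # 4
```

Note that the counts depend entirely on the tokenization choices. Without lowercasing, "The" and "the" would be two different types, and a tokenizer that discards punctuation would report fewer tokens and a smaller vocabulary.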