Chapter 2 Basic Frameworks
The purpose of this chapter is to direct you to seminal articles outlining how psychologists think about text data and to ensure you understand common concepts.
2.1 Theory
I recommend the following resources for getting to grips with the fundamental ideas behind text analysis and natural language processing in psychology.
- Boyd, R. L., & Markowitz, D. M. (2025). Verbal behavior and the future of social science. American Psychologist, 80(3), 411–433. https://doi.org/10.1037/amp0001319
- Boyd, R. L., & Schwartz, H. A. (2021). Natural language analysis and the psychology of verbal behavior: The past, present, and future states of the field. Journal of Language and Social Psychology, 40(1), 21–41. https://doi.org/10.1177/0261927X20967028
- Dehghani, M., & Boyd, R. L. (Eds.). (2022). Handbook of language analysis in psychology. Guilford Publications.
- Pennebaker, J. W., Mehl, M. R., & Niederhoffer, K. G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54(1), 547–577. https://doi.org/10.1146/annurev.psych.54.101601.145041
2.2 Definitions
- Character: The smallest unit in text, such as a letter, digit, or symbol.
- String: A sequence of characters treated as a single data element, often representing text.
- Word: A basic unit of language consisting of one or more characters, typically separated by spaces or punctuation.
- Type: A distinct word or token in a text, counted once regardless of how many times it appears.
- Token: A single occurrence of a sequence of characters grouped as a useful unit, often corresponding to a word, subword, or punctuation mark.
- Lemma: The base or dictionary form of a word (e.g., "run" for "running" and "ran").
- Term: A unit of text treated as meaningful for analysis, often delimited by white space; it need not be a conventional word.
- Document: A single unit of text treated as one item for analysis, such as an article, tweet, or email.
- Text: A general sequence of written characters, ranging from a sentence to a book.
- Corpus: A structured collection of texts used for analyzing language or for training language models.
- Vocabulary: The set of all unique tokens or terms found in a corpus or collection of documents.
- Natural Language Processing (NLP): A field of study focused on enabling computers to understand, interpret, and generate human language.
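
To make several of these definitions concrete, here is a minimal Python sketch. It rests on assumptions that are not part of this chapter: the two example sentences are invented, `tokenize` is a hypothetical helper that lowercases text and splits it with a simple regular expression, and `toy_lemmas` is a hand-written lookup standing in for a real lemmatizer. Real tokenizers and lemmatizers make different choices, so treat this as an illustration rather than a recommended pipeline.

```python
import re
from collections import Counter

# Toy corpus (hypothetical sentences): each string is one document.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

def tokenize(text):
    """Lowercase a string and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Characters: the length of a string counts letters, spaces, and punctuation.
n_characters = len(corpus[0])

# Tokens: every occurrence counts, so repeated words appear repeatedly.
tokens = [tok for doc in corpus for tok in tokenize(doc)]

# Types: distinct tokens. Across the whole corpus they form the vocabulary.
type_counts = Counter(tokens)
vocabulary = sorted(type_counts)

# Lemmas: a hand-written lookup standing in for a real lemmatizer (assumption).
toy_lemmas = {"sat": "sit", "chased": "chase"}
lemmas = [toy_lemmas.get(tok, tok) for tok in tokens]

print(n_characters)        # characters in the first document
print(len(tokens))         # tokens in the corpus
print(len(vocabulary))     # types in the corpus
print(type_counts["the"])  # how often the type "the" occurs
```

Because "the" and "cat" recur across the two documents, the corpus contains more tokens than types. Note that this sketch treats punctuation marks as tokens and lowercases everything before counting; both are design choices that change the token and type counts.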