Research


Language production

What makes us say the things we do? The statistics of language, coupled with a rich lifetime of experience with it and a sprinkling of inductive biases, give us the power of infinite expressivity. Even so, cognitive and linguistic constraints still shape when and how we phrase things.

In the lab

The CaLiCo Lab uses a mixture of methods to study language production, ranging from simple surveys of preference in context, to large-scale cloze tasks in which people fill in a blank or attempt to predict the next word in a sentence, to highly constrained picture description tasks. We are interested in how aspects of language use, like short- and long-range phonological context, givenness, lexical frequency, episodic memory, and stylistic factors, constrain both word forms and lexical choice. Experiments are programmed for the web in jsPsych and for the lab in PsychoPy or PsychToolbox, and a mobile app is currently in development!
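To make the cloze setup concrete, here is a minimal sketch of a single typed-response cloze trial in PsychoPy. The sentence frame, display settings, and response handling are all illustrative choices, not the lab's actual experiment code.

```python
# A minimal sketch of a typed-response cloze trial in PsychoPy.
# The sentence frame and display settings are illustrative, not lab code.
from psychopy import visual, event, core

win = visual.Window(size=(1024, 768), color="white", units="pix")

frame = visual.TextStim(win, text="She stirred her coffee with a ___",
                        color="black", pos=(0, 60))
prompt = visual.TextStim(win, text="Type the next word, then press RETURN.",
                         color="black", pos=(0, -40), height=18)
echo = visual.TextStim(win, text="", color="black", pos=(0, -100))

typed = ""
while True:
    echo.text = typed
    frame.draw(); prompt.draw(); echo.draw()
    win.flip()
    keys = event.waitKeys()
    if "return" in keys:
        break
    if "backspace" in keys:
        typed = typed[:-1]
    elif len(keys[0]) == 1:  # single-character keys only (letters, digits)
        typed += keys[0]

print("response:", typed)  # a real experiment would log this per trial
win.close()
core.quit()
```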

Corpus studies

The experiments we design for the lab are supported by studies of psycholinguistically representative corpora of everyday language. We test for the influence of the same factors on lexical choice and phonological form, drawing on openly available datasets of spontaneous speech, episodic memory, social media, and more formal written corpora.
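One common shape for such an analysis is a regression of a property of the produced form on usage factors. The sketch below assumes a hypothetical token-level file, tokens.csv, with made-up column names (reduced, log_frequency, given); it is meant only to show the shape of the analysis.

```python
# Sketch of a corpus analysis; "tokens.csv" and its columns are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

tokens = pd.read_csv("tokens.csv")  # one row per word token from a speech corpus

# Predict whether a token surfaces in a reduced phonological form
# (e.g., "going to" -> "gonna") from log lexical frequency and givenness.
model = smf.logit("reduced ~ log_frequency + given", data=tokens).fit()
print(model.summary())
```

In practice, mixed-effects models with speaker and item random effects would usually be more appropriate; the formula interface is the same idea.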

Cognitive modeling

Connectionist frameworks. Using modern scientific computing software, we can upgrade classic connectionist and parallel distributed processing (PDP) models to test harder cognitive questions and build new, transparent neural network models in Python. Early-stage projects model production dynamics in response to phonetic vs. phonological context and extend classic models of production to the sentence level.
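As a flavor of what "transparent" means here, the toy network below maps binary semantic features to phonological features in plain NumPy, so every weight update is visible. Dimensions, data, and the learning rate are arbitrary, not drawn from any published model.

```python
# A toy PDP-style network in NumPy: semantic features in, phonological features out.
# Dimensions, data, and learning rate are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_sem, n_hid, n_phon = 10, 8, 6

# Random "lexicon": 20 items with semantic input and phonological target patterns.
X = rng.integers(0, 2, size=(20, n_sem)).astype(float)
Y = rng.integers(0, 2, size=(20, n_phon)).astype(float)

W1 = rng.normal(0, 0.1, (n_sem, n_hid))
W2 = rng.normal(0, 0.1, (n_hid, n_phon))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(2000):
    h = sigmoid(X @ W1)      # hidden activations
    out = sigmoid(h @ W2)    # phonological output
    err = out - Y            # prediction error
    # Backpropagate the error through both weight layers.
    d_out = err * out * (1 - out)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X)
    W1 -= lr * X.T @ d_hid / len(X)

print("final mean squared error:", float(np.mean((out - Y) ** 2)))
```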

Rational Speech Act models and information-theoretic approaches. In conversation, production and comprehension lack a clear boundary. The Rational Speech Act (RSA) framework incorporates producer ease directly into a model of audience design. I am interested in deriving better estimates of speaker utility from models of production difficulty to understand whether and how producers tailor their utterances for comprehenders.
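In its textbook form, the RSA pragmatic speaker chooses utterances in proportion to exp(alpha * (informativity - cost)), so estimates of production difficulty can enter through the cost term. A minimal sketch with a made-up two-utterance lexicon:

```python
# Minimal RSA sketch: a pragmatic speaker whose utility trades informativity
# against production cost. Lexicon, costs, and alpha are illustrative.
import numpy as np

utterances = ["glasses", "fancy glasses"]
meanings = ["plain", "fancy"]

# Literal semantics: which utterances are true of which meanings.
L = np.array([[1.0, 1.0],   # "glasses" is true of both referents
              [0.0, 1.0]])  # "fancy glasses" only of the fancy one

cost = np.array([0.0, 1.0])  # the longer utterance is harder to produce
alpha = 2.0                  # speaker rationality

# Literal listener: P(meaning | utterance), renormalized truth conditions.
L0 = L / L.sum(axis=1, keepdims=True)

# Pragmatic speaker: P(utterance | meaning) ~ exp(alpha * (log L0 - cost)).
with np.errstate(divide="ignore"):
    utility = np.log(L0) - cost[:, None]
S1 = np.exp(alpha * utility)
S1 /= S1.sum(axis=0, keepdims=True)

for j, m in enumerate(meanings):
    for i, u in enumerate(utterances):
        print(f"S1({u!r} | {m!r}) = {S1[i, j]:.2f}")
```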

Language comprehension

Readers have to juggle a lot of word and world knowledge to understand the texts in front of them. When we read, we quickly recognize and disambiguate words, resolve syntactic ambiguities, and build a representation of the discourse and where it might go. The CaLiCo Lab studies the content of what readers predict will come up next, from words to sentence structures to discourse relations. I am especially interested in how specific readers' predictions are at any point in time and what kinds of predictions they are making.

In the lab

Reading times. Procedures like self-paced reading and the Maze task (a two-alternative forced-choice variant of self-paced reading) can tell us about the dynamics of prediction during sentence reading. I am currently working within a surprisal theory framework, with an eye toward building models of the decision processes underlying the Maze task.
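The standard linking hypothesis in this framework is that a word's processing cost scales with its surprisal, -log P(word | context). A minimal sketch of the usual regression, assuming a hypothetical file spr_data.csv with made-up column names:

```python
# Sketch of the surprisal-theory linking regression: reading time as a
# function of surprisal. The data file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

rts = pd.read_csv("spr_data.csv")  # one row per word per participant

# Control for word length and log frequency alongside surprisal.
fit = smf.ols("rt ~ surprisal + word_length + log_frequency", data=rts).fit()
print(fit.summary())
```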

Neural signals. In collaboration with cognitive neuroscientists, I have been working to improve methodology for analyzing high-resolution time series (EEG) data using mediation analyses and representational similarity analysis (the other RSA).
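Representational similarity analysis compares pattern geometries rather than raw signals: build a dissimilarity matrix over items from the EEG data, build another from a candidate model, and correlate the two. A toy sketch on random data, with made-up shapes and no real preprocessing:

```python
# Toy RSA sketch: correlate a model RDM with an EEG RDM at one time point.
# Shapes and data are made up; real pipelines average over subjects and time.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_items, n_channels, n_features = 40, 64, 300

eeg = rng.normal(size=(n_items, n_channels))     # EEG pattern per item at time t
model = rng.normal(size=(n_items, n_features))   # model representation per item

# Representational dissimilarity matrices (condensed form), correlation distance.
rdm_eeg = pdist(eeg, metric="correlation")
rdm_model = pdist(model, metric="correlation")

rho, p = spearmanr(rdm_eeg, rdm_model)
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")
```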

Leveraging large language models. The growth of pre-trained (Transformer-based) neural language models has enabled a smorgasbord of options for computing measures of predictability like surprisal. With some additional creativity, we can make these models a better match to human predictions in context. The CaLiCo Lab uses large language models like RoBERTa, GPT-2, and monolingual translation models as one way of modeling the content of readers' predictions.
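As one concrete example, per-token surprisal can be read directly off an autoregressive model's next-token distribution. A minimal sketch using GPT-2 through the Hugging Face transformers library (the sentence is arbitrary, and this is not the lab's exact pipeline):

```python
# Per-token surprisal from GPT-2 via Hugging Face transformers; a minimal
# sketch, not the lab's actual pipeline.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

sentence = "The children went outside to play."
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# Surprisal of token t is -log2 P(token_t | tokens_<t). The first token is
# skipped because it has no left context in this simple setup.
for t in range(1, ids.shape[1]):
    bits = -log_probs[0, t - 1, ids[0, t]].item() / math.log(2)
    print(f"{tokenizer.decode(ids[0, t])!r}: {bits:.2f} bits")
```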


Machine learning for psycholinguistics

Creating stimuli and coding data are among the most error-prone stages of experimental work in psycholinguistics. As part of a growing software project, the CaLiCo Lab is developing tools to set new standards for data coding. Just as forced aligners brought advances to laboratory phonology, trained models can organize items for annotation and screen stimuli for potential problems by identifying outliers or near-duplicates. With this work, we can make resources such as large language models useful to the field as a whole.
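As a toy version of the near-duplicate check, the sketch below scores pairs of stimuli with TF-IDF cosine similarity; the item list and threshold are illustrative, and an embedding model could be swapped in for TF-IDF.

```python
# Sketch of near-duplicate screening for stimuli with TF-IDF cosine
# similarity; items and threshold are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stimuli = [
    "The cat chased the mouse across the kitchen.",
    "The dog buried a bone in the garden.",
    "The cat chased a mouse across the kitchen.",  # near-duplicate of item 0
]

tfidf = TfidfVectorizer().fit_transform(stimuli)
sims = cosine_similarity(tfidf)

threshold = 0.8  # flag pairs above this similarity for manual review
for i in range(len(stimuli)):
    for j in range(i + 1, len(stimuli)):
        if sims[i, j] > threshold:
            print(f"Items {i} and {j} look like near-duplicates "
                  f"(cos = {sims[i, j]:.2f})")
```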