When you chuang to close the chuang because you want to go to chuang: Modeling spoken word recognition in Chinese
We noted earlier this week that Chinese is a language without inflection of verbs or nouns. That is, whereas in English you “walk” and Fred “walks”, in Chinese the same word would be used irrespective of who does the walking.
So if Chinese does not inflect, what does it do instead? It turns out that Chinese, like many other East Asian languages, is a tonal language. That is, the tone of a word may determine its meaning, even if the pronunciation is otherwise identical. For speakers of non-tonal languages, such as English and almost all other European languages, this idea is not terribly easy to grasp and it is even harder to use during comprehension.
To illustrate, you should watch this YouTube video, which was created specifically for this post and explains the idea of tonality using Mandarin Chinese.
So there you have it. In Mandarin Chinese (and we only consider Mandarin in this post), chuang means window, bed, rush, or create, depending on a very subtle (to my ears) variation in the tone with which it is pronounced. Processing the phonemes in a word is not enough for a listener to access the meaning of a spoken word—we must also access tonal information.
And if you think differentiation between chuang (1), chuang (2), chuang (3), and chuang (4) is difficult or confusing, try this Chinese sentence: 麻 媽 罵 馬. Those four words are pronounced ma ma ma ma (albeit in different tones) and they translate to "the hemp's mother scolds the horse." (I suspect that sentence has little practical utility other than to illustrate the complexities of tonal languages.)
It should be self-evident that tonal languages pose a unique challenge to cognitive scientists interested in modeling spoken word recognition. It is difficult enough to train a computer to differentiate between “bat” and “pat”, but imagine training it to differentiate chuang from chuang or chuang.
A recent article in the Psychonomic Society’s journal Behavior Research Methods reported an implementation of just such a model. Researchers Shuai and Malins based their work on the TRACE model of spoken word recognition that was presented by McClelland and Elman in 1986.
The basic idea behind TRACE is that different components of analysis are handled by different layers of processing units. For example, phonemes and words are processed by different layers that are isolated from each other, thereby protecting each layer from interference that might otherwise arise from the inherently noisy speech signal. Within each layer, TRACE postulates that units compete with each other to determine a “winner” that is then ultimately recognized by the model. Because this process is modeled across (real) time, TRACE exhibits partial activation of competing words early on until it settles on a winner and the others are suppressed. For example, if the word “bald” is presented to a listener, initially the words bald, ball, bad, and bill (and potentially many others) are activated, although bad and bill drop out very quickly once the vowel sound is registered. The competition between the remaining candidates bald and ball continues until the final phoneme arrives, thereby knocking ball out of the competition.
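The competition dynamic described above can be illustrated with a toy sketch. It is not TRACE itself, nor the authors' code: it uses a single word layer, gives a word bottom-up support only while its phonemes match the input heard so far (a deliberate simplification), and lets active words inhibit one another. The function name `recognize`, the ASCII phoneme labels, and all parameter values are assumptions chosen purely for illustration.

```python
# Toy word-level competition, loosely inspired by the description of TRACE.
# All names, phoneme labels, and parameter values are hypothetical.

LEXICON = {
    "bald": ("b", "O", "l", "d"),
    "ball": ("b", "O", "l"),
    "bad": ("b", "a", "d"),
    "bill": ("b", "I", "l"),
}


def recognize(input_phonemes, lexicon, excite=0.3, inhibit=0.1, decay=0.1):
    """Return a list with one {word: activation} dict per time slice."""
    heard = tuple(input_phonemes)
    act = {word: 0.0 for word in lexicon}
    history = []
    for t in range(len(heard)):
        new_act = {}
        for word, phons in lexicon.items():
            # Bottom-up support only if the word still matches everything heard.
            matches = tuple(phons[: t + 1]) == heard[: t + 1]
            bottom_up = excite if matches else 0.0
            # Lateral inhibition from every other currently active word.
            rivals = sum(max(a, 0.0) for w, a in act.items() if w != word)
            value = act[word] + bottom_up - inhibit * rivals - decay * act[word]
            new_act[word] = min(1.0, max(0.0, value))  # keep within [0, 1]
        act = new_act
        history.append(dict(act))
    return history


history = recognize(LEXICON["bald"], LEXICON)
for t, snapshot in enumerate(history):
    print(t, {word: round(a, 3) for word, a in snapshot.items()})
```

With these assumed parameters, all four words start out equally active, bad and bill fall behind once the vowel arrives, and bald only pulls ahead of ball when the final phoneme comes in.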
Shuai and Malins extended TRACE to Mandarin, which required a representation of both phonemes and tones. To give an idea of the complexity of this representation, the phonemes were coded along 6 dimensions in total: voicing, place of articulation, and manner of articulation for consonants, and roundedness, tongue position, and tongue height for vowels.
To represent tone, two further dimensions were required: The first coded pitch height (in 5 levels), and the second coded pitch slope (that is, whether the pitch was level, rising, or falling during pronunciation). Taken together, those two dimensions created 15 unique combinations, which were identified for distinct time slices of a set of relevant tones.
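The arithmetic behind those 15 combinations is simply 5 pitch heights times 3 pitch slopes. A minimal sketch, assuming the height levels are labelled 1 to 5 (the labels and variable names here are illustrative, not taken from the article):

```python
from itertools import product

# Assumed labels: 5 pitch-height levels and 3 pitch-slope categories.
PITCH_HEIGHTS = (1, 2, 3, 4, 5)
PITCH_SLOPES = ("level", "rising", "falling")

# Every (height, slope) pair is one candidate tone feature for a time slice.
tone_features = list(product(PITCH_HEIGHTS, PITCH_SLOPES))

print(len(tone_features))
for height, slope in tone_features:
    print(height, slope)
```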
The resulting model is known as TRACE-T, and Shuai and Malins validated it by simulating data from an experiment on spoken word recognition in native Chinese speakers. The experiment was designed to manipulate the type of competition that TRACE assumes occurs during word recognition. Participants were shown an array of 4 pictures on a screen and had to identify which of those pictures corresponded to a spoken word presented at the same time. The crucial manipulation involved the similarity between the name of the target and the name of another picture in the array. For example, in an English version of this task, pictures of a rat and a rabbit might interfere with each other, at least initially, during recognition of the spoken word “rabbit”. This would show up in the task as either a slower response or as additional eye movements towards the distractor stimulus (compared to the other non-target pictures).
In the Chinese version of the task, competition between pictures could be introduced not only by names that shared initial phonemes (as with rat and rabbit), but also by manipulating the tone of words. Specifically, a tonal competitor would be the name of a picture that shared the tone—but nothing else—with the name of the target picture, and conversely a so-called “segmental” competitor would share all features but the tone.
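The two competitor types can be spelled out as a small classification rule. The sketch below is only an illustration of the definitions just given: the `Word` type, the `competitor_type` function, the ASCII segment labels, and the reading of "nothing else" as "no segment in common" are all assumptions, and the syllables are chosen for convenience rather than taken from the experiment.

```python
from typing import NamedTuple, Tuple


class Word(NamedTuple):
    segments: Tuple[str, ...]  # phonemes, written with illustrative ASCII labels
    tone: int                  # Mandarin tone number, 1 to 4


def competitor_type(target: Word, other: Word) -> str:
    """Classify `other` relative to `target` using the definitions in the text."""
    same_segments = target.segments == other.segments
    same_tone = target.tone == other.tone
    # Assumption: "shares nothing else" means no segment in common.
    no_shared_segments = set(target.segments).isdisjoint(other.segments)

    if same_segments and same_tone:
        return "identical"
    if same_segments:
        return "segmental"   # all segments shared, tone differs
    if same_tone and no_shared_segments:
        return "tonal"       # tone shared, nothing else
    if no_shared_segments:
        return "unrelated"   # baseline: no similarity at all
    return "other"           # partial overlap, e.g. a shared onset


target = Word(("ch", "u", "a", "ng"), 2)
print(competitor_type(target, Word(("ch", "u", "a", "ng"), 1)))
print(competitor_type(target, Word(("m", "i"), 2)))
print(competitor_type(target, Word(("m", "i"), 4)))
```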
The results of the experiment revealed that the tone of distractors mattered: compared to a baseline condition in which none of the distractors shared any similarity with the target, segmental competitors, which differed from the target only in tone, delayed looks to the target.
Clearly, people are affected by tonal similarity in this task. What is of greater interest is that Shuai and Malins showed that TRACE-T exhibited the same behavior. When the model was presented with a pseudo sound spectrum in real time that corresponded to the spoken word, the recognition of the target was delayed when the simulation contained a segmental competitor. TRACE-T was therefore able to capture within-syllable competitive effects arising from tonal differences.
Shuai and Malins presented several additional validation simulations that examined performance at an even finer level of detail. In all cases, the model’s performance was found to be in accord with human performance. This is encouraging news: If a model can acquire a tonal representation, then perhaps English-speaking humans can also learn to differentiate between ma, ma, ma, and ma.
Focus article of this post:
Shuai, L., & Malins, J. G. (2016). Encoding lexical tones in jTRACE: a simulation of monosyllabic spoken word recognition in Mandarin Chinese. Behavior Research Methods. DOI: 10.3758/s13428-015-0690-0.