Although speech sounds as if it is broken up into discrete word-size packets, word boundaries in speech aren't always acoustically clear.
Spectrographic analysis shows how acoustic energy is distributed across frequencies over time, but even in a spectrogram it is hard to pinpoint exact word boundaries in the speech stream.
The continuous nature of speech is more obvious to us when we listen to speech in a language other than our own.
A phoneme is the smallest unit of spoken language that reliably changes the meaning of a word: for example, changing the initial phoneme of “bat” from /b/ to /p/ gives “pat”.
Speech usually consists of up to 12 phonemes per second (though rates as high as 50 are possible; Werker & Tees, 1992).
In English there are about 40 different phonemes; the precise inventory varies from language to language.
Some African languages, such as Xhosa, include click sounds among their permissible phonemes.
Co-articulation refers to the way the precise pronunciation of each phoneme depends on its context, in particular on the phonemes that come before and after it.
The /b/ sound is slightly different in the words “bull”, “bell”, “ball”, etc.; this variation between versions of the same phoneme makes the job of recognizing phonemes in speech very difficult.
There are also differences depending on position in the word (compare the final /b/ of “nib” or “rob” with a word-initial /b/); this makes phoneme recognition harder still, but it can make the task of extracting words from the speech stream a little easier.
The different versions of the same phoneme are called allophones.
Switching (using a computer program) one allophone of /b/ (e.g., a word-final, offset /b/) for another allophone of /b/ (e.g., a word-initial, onset /b/) will not change the meaning of a word, because allophones are not separate phonemes.
Phonemes are categorically perceived: listeners tend to hear a given speech sound as belonging to one of the phoneme categories of their native language.
If I play you an acoustic blend of “bad” and “bat”, you are much more likely to hear the final phoneme as one sound or the other than as something in between.
Categorical perception is indicated by characteristic response patterns: identification shifts abruptly from one category to the other along an acoustic continuum, and discrimination is much better for pairs that straddle the category boundary than for pairs within a category.
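A minimal sketch of those two patterns (purely illustrative, not from any of the studies cited here), assuming a made-up voice-onset-time continuum from /b/ to /p/, a boundary at 25 ms, and a steep logistic identification function:

```python
import math

BOUNDARY_MS = 25.0   # hypothetical /b/-/p/ category boundary (voice onset time)
SLOPE = 0.5          # steepness of the identification function (arbitrary)

def prob_hear_p(vot_ms: float) -> float:
    """Probability of labelling a stimulus as /p/: a steep sigmoid, so
    labelling jumps abruptly at the boundary instead of changing gradually."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))

def predicted_discrimination(vot_a: float, vot_b: float) -> float:
    """Predicted discriminability if listeners rely on category labels:
    pairs straddling the boundary differ sharply, within-category pairs do not."""
    return abs(prob_hear_p(vot_a) - prob_hear_p(vot_b))

for vot in range(0, 51, 10):
    print(f"VOT {vot:2d} ms -> P(/p/) = {prob_hear_p(vot):.2f}")

print("within-category pair (5 vs 15 ms):", round(predicted_discrimination(5, 15), 2))
print("cross-boundary pair (20 vs 30 ms):", round(predicted_discrimination(20, 30), 2))
```

The identification curve stays near 0 or 1 except right around the boundary, and the predicted discriminability is near zero for the within-category pair but large for the cross-boundary pair.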
Japanese does not have separate phoneme categories for the sounds [l] and [r], which leads to the well-known difficulty that Japanese speakers and listeners of English have with this contrast.
Young infants are sensitive to differences between phonemic categories that they have never heard, but by as little as 6 months they begin to lose this sensitivity and tend to perceive only the contrasts between the phonemic categories of their own language.
Phonemes are said to incorporate various features relevant to their articulation: manner of production, place of articulation, and voicing.
Miller and Nicely (1955) provided evidence that featural information is used in speech processing: phonemes presented under noisy conditions were most often confused with phonemes sharing all but one feature.
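A toy sketch of that logic, using a deliberately simplified three-feature description (voicing, place, manner) of a handful of consonants; the feature assignments and the overlap count are illustrative and are not Miller and Nicely's actual analysis:

```python
# Simplified (voicing, place, manner) descriptions of a few consonants.
FEATURES = {
    "p": ("voiceless", "bilabial", "stop"),
    "b": ("voiced",    "bilabial", "stop"),
    "t": ("voiceless", "alveolar", "stop"),
    "d": ("voiced",    "alveolar", "stop"),
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced",    "alveolar", "fricative"),
}

def shared_features(a: str, b: str) -> int:
    """Count how many of the three features two phonemes share;
    more shared features should mean more confusions in noise."""
    return sum(fa == fb for fa, fb in zip(FEATURES[a], FEATURES[b]))

print(shared_features("p", "b"))  # 2: differ only in voicing, so easily confused
print(shared_features("p", "t"))  # 2: differ only in place
print(shared_features("p", "z"))  # 0: differ on every feature, rarely confused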
Once phonemes have been recognized from their component features, the task becomes recognizing words from sequences of phonemes (or possibly syllables).
Marslen-Wilson & Tyler (1980) proposed what they called cohort theory.
Words are activated to the extent that they are consistent with the phonemes that have been heard so far.
The set of partially activated words is called the word-initial cohort.
For example, by the time people have heard the /d/ and the /o/ of, say, “dog”, a cohort of words including “doll”, “dogma”, “dock”, “doctor”, “documentary”, etc., would be activated.
Words are knocked out of the cohort as more information becomes inconsistent with them.
For example, when the /g/ of “dog” arrives, “doll” will fall out of the cohort but “dogma” won't.
In the original model, words could also fall out of the cohort if they did not make sense in the context.
This process continues until only a single word remains in the cohort.
The point at which this happens is called the recognition point or the uniqueness point.
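The core of the original cohort mechanism can be sketched in a few lines. The tiny lexicon, the use of letter strings as stand-ins for phoneme sequences, and the function names below are all illustrative assumptions, not part of the published model:

```python
LEXICON = ["dog", "doll", "dogma", "dock", "doctor", "documentary", "dot"]

def cohort(heard: str, lexicon=LEXICON) -> list:
    """Words consistent with the input heard so far (the word-initial cohort);
    a candidate drops out as soon as the input mismatches it."""
    return [w for w in lexicon if w.startswith(heard)]

def uniqueness_point(word: str, lexicon=LEXICON):
    """Number of segments after which `word` is the only cohort member left,
    or None if it never becomes unique (e.g. because a longer word contains it)."""
    for i in range(1, len(word) + 1):
        if cohort(word[:i], lexicon) == [word]:
            return i
    return None

print(cohort("do"))                # all seven words remain in the cohort
print(cohort("dog"))               # ['dog', 'dogma']: 'doll' is out, 'dogma' is not
print(uniqueness_point("doctor"))  # 4: 'doct' rules out 'dock' and 'documentary'
print(uniqueness_point("dog"))     # None: 'dogma' never drops out
```

Note that in this toy lexicon “dog” never becomes unique before its end, because “dogma” contains it; short words embedded in longer ones raise exactly this kind of issue for uniqueness-point accounts.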
Marslen-Wilson (1984) asked people to decide whether spoken items were words or non-words; response times were closely tied to the uniqueness point (or, for non-words, to the point at which the item diverged from every real word) rather than to the end of the item, as the cohort model predicts.
O’Rourke & Holcomb (2002) used event-related potentials (ERPs) to show that a particular electrophysiological signature of word processing occurred earlier for words with early uniqueness points.
The original data suggesting an influence of meaning on the cohort came from Marslen-Wilson & Tyler (1980), who showed that words were detected faster in the context of a meaningful sentence than in a meaningless sentence or a list of unrelated words.
Zwitserlood (1989) used a technique called cross-modal priming.
Participants had to perform a visual lexical decision task (is it a word or not?) while listening to spoken words.
She showed that visual targets related to a word still in the auditory cohort were primed, even when that word did not fit the meaning of the context.
In later versions, the cohort became somewhat less strict about what could be included: words that start with phonemes similar to those actually presented are able to stay in the cohort. For example, given “shigarette”, on a strict view “cigarette” should fall out of the cohort after the initial [sh] phoneme, but the data suggested otherwise (e.g., Frauenfelder, Scholten, & Content, 2001).
In the modified model (Marslen-Wilson, 1990, 1994), therefore, the cohort comprises a set of candidate words with varying activation strengths rather than an all-or-none membership list.
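A sketch of that change in spirit: instead of the hard prefix filter above, each candidate keeps a graded activation that is merely reduced by mismatching segments. The rough phoneme transcriptions, the tiny lexicon, and the 0.4 penalty are arbitrary illustrative choices, not values from the model:

```python
LEXICON = {
    "cigarette": ["s", "I", "g", "@", "r", "E", "t"],
    "shin":      ["S", "I", "n"],
    "signal":    ["s", "I", "g", "n", "@", "l"],
}

def activation(candidate, heard, penalty=0.4):
    """Start at 1.0 and multiply by `penalty` for every heard segment that
    mismatches the candidate (or runs past its end); nothing is ever
    eliminated outright, unlike the strict all-or-none cohort."""
    strength = 1.0
    for i, segment in enumerate(heard):
        if i >= len(candidate) or candidate[i] != segment:
            strength *= penalty
    return strength

heard = ["S", "I", "g", "@"]      # roughly the onset of "shigarette"
for word, phones in LEXICON.items():
    print(f"{word:10s} {activation(phones, heard):.2f}")
# 'cigarette' keeps the highest strength (a single initial mismatch),
# whereas a strict prefix-based cohort would have excluded it immediately.
```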
Recent experimental work using cross-modal semantic priming has produced further evidence consistent with the cohort model.
The auditory string “capti...” primes visual processing of “ship” (from captain) and “prison” (from captive).
Neither completed word, however, primes the other word’s associate at these timescales: “captain” does not prime “prison”, and “captive” does not prime “ship”.