1Departments of Neurology and Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461; and 2Department of Surgery (Division of Neurosurgery), University of Iowa College of Medicine, Iowa City, Iowa 52242
ABSTRACT
Steinschneider, Mitchell, Igor O. Volkov, M. Daniel Noh, P. Charles Garell, and Matthew A. Howard III. Temporal Encoding of the Voice Onset Time Phonetic Parameter by Field Potentials Recorded Directly From Human Auditory Cortex. J. Neurophysiol. 82: 2346-2357, 1999. Voice onset time (VOT) is an important parameter of speech that denotes the time interval between consonant onset and the onset of low-frequency periodicity generated by rhythmic vocal cord vibration. Voiced stop consonants (/b/, /g/, and /d/) in syllable initial position are characterized by short VOTs, whereas unvoiced stop consonants (/p/, /k/, and /t/) contain prolonged VOTs. As the VOT is increased in incremental steps, perception rapidly changes from a voiced stop consonant to an unvoiced consonant at an interval of 20-40 ms. This abrupt change in consonant identification is an example of categorical speech perception and is a central feature of phonetic discrimination. This study tested the hypothesis that VOT is represented within auditory cortex by transient responses time-locked to consonant and voicing onset. Auditory evoked potentials (AEPs) elicited by stop consonant-vowel (CV) syllables were recorded directly from Heschl's gyrus, the planum temporale, and the superior temporal gyrus in three patients undergoing evaluation for surgical remediation of medically intractable epilepsy. Voiced CV syllables elicited a triphasic sequence of field potentials within Heschl's gyrus. AEPs evoked by unvoiced CV syllables contained additional response components time-locked to voicing onset. Syllables with a VOT of 40, 60, or 80 ms evoked components time-locked to consonant release and voicing onset. In contrast, the syllable with a VOT of 20 ms evoked a markedly diminished response to voicing onset and elicited an AEP very similar in morphology to that evoked by the syllable with a 0-ms VOT. Similar response features were observed in the AEPs evoked by click trains.
In this case, there was a marked decrease in amplitude of the transient response to the second click in trains with interpulse intervals of 20-25 ms. Speech-evoked AEPs recorded from the posterior superior temporal gyrus lateral to Heschl's gyrus displayed comparable response features, whereas field potentials recorded from three locations in the planum temporale did not contain components time-locked to voicing onset. This study demonstrates that VOT is at least partially represented in primary and specific secondary auditory cortical fields by synchronized activity time-locked to consonant release and voicing onset. Furthermore, AEPs exhibit features that may facilitate categorical perception of stop consonants, and these response patterns appear to be based on temporal processing limitations within auditory cortex. Demonstrations of similar speech-evoked response patterns in animals support a role for these experimental models in clarifying selected features of speech encoding.
INTRODUCTION
Recent technological advances have invigorated the investigation of neural mechanisms associated with speech perception. One exciting avenue of research involves the expanded use of functional neuroimaging to define regions of brain participating in specific aspects of language processing (e.g., Binder et al. 1996; Wise et al. 1991). These investigations have implicated both primary and secondary auditory cortex as important components of a neural network subserving the initial cortical stages of phonetic processing (Zatorre et al. 1992, 1996).
Event-related potentials and neuromagnetic data have complemented functional neuroimaging studies by identifying temporally dynamic features of speech sound encoding in auditory cortex and by using physiological markers to address important controversies surrounding phonetic perception (e.g., Kaukoranta et al. 1987; Kuriki et al. 1995; Poeppel et al. 1997; Sharma and Dorman 1998; Sharma et al. 1993). Methodological constraints of these techniques, however, preclude resolution of the specific synaptic events that activate auditory cortex, generate the evoked responses, or underlie phonetic discrimination. These considerations reinforce the need for more detailed physiological investigations of phonetic encoding that can best be obtained by performing invasive studies of auditory cortex.
Important features of phonetic discrimination that mirror human results occur in monkeys and other experimental animals (e.g., Hienz et al. 1996; Kuhl and Padden 1983; Lotto et al. 1997; Sinnott et al. 1997; Sommers et al. 1992). Findings such as these suggest that physiological studies in animals can reveal basic mechanisms underlying aspects of human speech perception common to multiple species. Similarities in speech sound discrimination have been especially well documented for perception of the voice onset time (VOT) phonetic parameter (Kluender and Lotto 1994; Kuhl and Miller 1978; Kuhl and Padden 1982; Sinnott and Adams 1987). VOT denotes the time interval between consonant release and the onset of low-frequency periodicity in the speech sound generated by rhythmic glottal pulsations. Voiced stop consonants (/b/, /d/, and /g/) are characterized by short-duration VOTs, whereas unvoiced consonants (/p/, /t/, and /k/) contain longer VOTs. As the VOT is increased in incremental steps, the perception abruptly changes between 20 and 40 ms from a voiced consonant (e.g., /d/) to an unvoiced consonant (e.g., /t/). This abrupt transition in consonant identification is an example of categorical speech perception, an essential feature of phonetic discrimination. The ability of animals to discriminate consonants differing in their VOT in a manner similar to humans makes this phonetic parameter a prime candidate for elucidating the neural events associated with its categorical perception.
Studies undertaken in primary auditory cortex (A1) of the awake macaque monkey have suggested a mechanism by which the VOT phonetic parameter can be encoded rapidly in a categorical manner (Steinschneider et al. 1994, 1995b). These studies found that acoustical transients associated with consonant release and voicing onset are represented in the temporal response patterns of neuronal ensembles. Consonant-vowel syllables with short VOTs evoked short-latency responses primarily time-locked to consonant release alone. In contrast, consonant-vowel syllables with longer VOTs evoked responses at the same cortical loci time-locked to both consonant release and voicing onset. These neural patterns led to the hypothesis that categorical perception of consonants varying in their VOT is based partially on temporal encoding mechanisms within A1. The occurrence of two transient response bursts time-locked to both consonant release and voicing onset would signal an unvoiced stop consonant, whereas voiced stop consonants would be represented by a single response time-locked only to consonant release. The VOT at which the second response to voicing onset dissipates would denote the categorical boundary for this phonetic contrast. Of special note was the replication of this activity pattern in the auditory evoked potential (AEP), suggesting that similar responses could be recorded in humans.
While speech-evoked activity in monkeys may be relevant for phonetic encoding of VOT in humans, these observations could simply represent an epiphenomenon of acoustical transient processing in A1. Support for the significance of these responses would be a demonstration that similar activity patterns occur in the human auditory cortex. Neurons in both A1 and secondary auditory cortex can respond in select ways to behaviorally relevant species-specific vocalizations (Rauschecker 1998; Rauschecker et al. 1995; Wang et al. 1995). Learning and neuronal response plasticity induced by exposure to behaviorally relevant sounds also modify the activity of auditory cortical cells (Ohl and Scheich 1997; Recanzone et al. 1993; Weinberger 1997). Therefore response patterns generated in human auditory cortex by biologically important speech sounds may bear little resemblance to the activity profiles seen in the naïve monkey. This study investigates whether response patterns evoked by speech sounds varying in their VOT are similar in the monkey and human. Speech-evoked AEPs were recorded directly from human auditory cortex in three patients undergoing evaluation for surgical treatment of medically intractable epilepsy. Specifically, we tested the hypothesis that acoustical transients associated with consonant release and voicing onset are represented by temporally discrete responses in human auditory cortex. Further, we tested whether these responses would display categorical-like features similar to those seen in the monkey.
METHODS
Subjects
Three right-handed patients with medically intractable epilepsy were studied. Experimental protocols were approved by the University of Iowa Human Subjects Review Board and reviewed by the National Institutes of Health, and informed consent was obtained from each subject before participation. Subjects underwent placement of intracranial electrodes to acquire the diagnostic electroencephalographic (EEG) information required to plan subsequent surgical treatment. Research recordings did not disrupt the simultaneous gathering of clinically necessary data, and patients incurred no additional risk by participating in this study. Patient information is summarized in Table 1. All had a suspected epileptic focus in or near the auditory cortex of the right hemisphere, and all had normal hearing as determined by standard audiometric tests. Patient 2 reported that occasional seizures were precipitated by acoustic stimuli. Recording sessions were carried out in a quiet room in the Epilepsy Monitoring Unit of the University of Iowa Hospitals and Clinics with the patients lying comfortably in their hospital beds. Patients were awake and alert throughout the recordings but were not asked to perform any specific task beyond listening to the presented sounds. For all patients, clinical evaluations ultimately indicated that Heschl's gyrus and nearby auditory cortical tissue were not epileptogenic foci, and these areas were not targeted for surgical extirpation.
Recording methods
Auditory evoked potentials (AEPs) were recorded at a gain of 5,000 using headstage amplification followed by differential amplification (BAK Electronics, Germantown, MD). Specific methods varied among the three subjects. This variability partly reflected time and other technical constraints imposed by the clinical needs of the patients as well as an ongoing process of optimizing recording parameters. Recordings were obtained in patient 1 from three EEG ring contacts spaced at 10-mm intervals incorporated into a depth electrode implanted in the right Heschl's gyrus and from subdural grid electrodes placed over the lateral convexity of the temporal lobe. The reference electrode was a subdural grid electrode located on the undersurface of the ipsilateral, anterior temporal lobe. AEPs were recorded at a band-pass of 2-500 Hz (3 dB down, roll-off 6 dB/octave) and a digitization rate of 1,000 Hz. Patient 2 was studied with hybrid clinical-research depth electrodes implanted in the right Heschl's gyrus and planum temporale (Howard et al. 1996a,b). Bipolar recordings at three depths were performed from closely spaced high-impedance recording contacts interspersed between the standard clinical EEG contacts (Radionics, Burlington, MA) at a band-pass of 1-5,000 Hz and a digitization rate of 2,050 Hz. In both patients, depth electrodes were placed stereotaxically along the long axis of Heschl's gyrus. AEPs in patient 3 were obtained from a subdural grid electrode placed over the lateral convexity of the posterior temporal lobe, referenced to the deepest low-impedance contact of a depth electrode located anterior to Heschl's gyrus (band-pass, 1-3,000 Hz; digitization rate, 2,000 Hz). This subdural electrode was chosen over other surrounding contacts in the grid after preliminary mapping of AEPs demonstrated focal activity restricted to this site. The relative inactivity of the reference was verified by recording speech-evoked AEPs between this site and a subdural electrode located beneath the temporal lobe. Field potentials were computer averaged using an analysis time of 1,000 ms (300-ms prestimulus) for patient 1 or an analysis time of 500 ms (25-ms prestimulus) for patients 2 and 3. Averages were generated from 50 to 75 stimulus presentations; presentations containing high-amplitude artifacts were automatically rejected from the averages. Raw EEG and appropriate timing pulses were stored on a multichannel FM tape recorder (Racal, Irvine, CA) for subsequent analysis. An independent search for isolated single units and their responses to pure tones also was performed in parallel with this AEP experiment (Howard et al. 1996a).
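The epoch-averaging procedure with automatic artifact rejection described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code; the function name, window arguments, and rejection criterion are assumptions:

```python
import numpy as np

def average_epochs(eeg, trigger_samples, fs, pre_ms, post_ms, reject_uv):
    """Average stimulus-locked epochs, rejecting high-amplitude artifacts.

    eeg             : 1-D array of continuous EEG (microvolts)
    trigger_samples : sample indices of stimulus onsets
    fs              : sampling rate in Hz (1,000-2,050 Hz in this study)
    pre_ms, post_ms : epoch window around each trigger (ms)
    reject_uv       : absolute-amplitude rejection threshold (assumed criterion)
    """
    pre = int(pre_ms * fs / 1000)
    post = int(post_ms * fs / 1000)
    kept = []
    for t in trigger_samples:
        epoch = eeg[t - pre : t + post]
        # retain only complete, artifact-free trials
        if len(epoch) == pre + post and np.max(np.abs(epoch)) < reject_uv:
            kept.append(epoch)
    return np.mean(kept, axis=0), len(kept)
```

In the study, averages were built from 50 to 75 such presentations per stimulus.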
Anatomic locations of all recording sites were identified on postimplantation MRIs. This procedure is illustrated in Fig. 1, which demonstrates implantation in patient 2 of the depth electrodes in Heschl's gyrus and the planum temporale. The coronal MRI scans (A-C) show the locations of the three EEG contacts in Heschl's gyrus, denoted by the geometric shapes in the images. Figure 1D is a schematic diagram of the superior temporal plane, and the tracks for the electrodes in Heschl's gyrus (E1) and planum temporale (E2) in patient 2. A schematic of the intracortical electrode is shown in Fig. 1E. The three recording contacts in patient 1 are denoted as depths 1-3, whereas depths 4-6 indicate the three recording sites with high-impedance wires from which AEPs were recorded in patient 2.
Stimuli
Two sets of synthetic speech stimuli previously used in primate
studies were presented to the subjects. The first set consisted of all
six stop consonants followed by the vowel /a/. They were constructed on
the parallel branch of a KLSYN88a speech synthesizer. Frequency
characteristics of the voiced consonant-vowel (CV) syllables (/ba/,
/ga/, and /da/) have been published previously (Steinschneider et al. 1995a). Steady-state formant frequencies were 700, 1,200, 2,500, and 3,600 Hz. Onset frequencies for the second (F2) and third (F3) formants of /ba/ were 800 and 2,000 Hz and 1,600 and 3,000 Hz for /da/. Starting frequencies for /ga/ were 1,600 and 2,000 Hz.
Onset frequency of the first formant (F1) was 200 Hz for all syllables. The F1 transition duration was 30 ms, whereas the F2 and F3 transitions lasted 40 ms.
Formant structure was such that /ba/ and /da/ had diffuse onset spectra maximal at either low or high frequencies, whereas /ga/ had a compact onset spectrum maximal at intermediate values (Stevens and Blumstein 1978). The unvoiced CV syllables (/pa/, /ka/, and /ta/) were identical to their voiced counterparts except for an increase in the VOT from 5 to 40 ms. The first 5 ms of all six syllables consisted of frication. For the unvoiced CV syllables, the next 35 ms contained aspiration noise. The second set of speech stimuli consisted of the three-formant syllables /da/ and /ta/ produced at the Haskins Laboratories (New Haven, CT). Five syllables with VOTs varying from 0 to 80 ms in 20-ms increments were created. Specific parameters of the sounds have been published (Steinschneider et al. 1995b). Steady-state formant frequencies were 817 Hz for F1, 1,181 Hz for F2, and 2,632 Hz for F3; onset frequencies were 200, 1,835, and 3,439 Hz, respectively. All syllables were 175 ms in duration and presented by computer. They were delivered binaurally by insert earphones (Etymotic Research, Elk Grove Village, IL) in patient 1 and to the ear contralateral to the recording sites by a Koss K240DF headphone coupled to a 4-cm cushion in patients 2 and 3. Syllable intensity was 70 dB SPL for patient 1 and 80 dB SPL for the other two subjects. Perceptual testing of multiple listeners in the laboratory of the lead investigator indicated that syllables with a VOT of 0 and 20 ms consistently sounded like /da/, whereas those with a longer VOT were perceived as /ta/. Informal questioning of patient 2 during recording yielded identical perceptions, whereas patient 3 gave variable responses only for /da/ with the 20-ms VOT, indicating that this stimulus was near the perceptual boundary. Patient 1 was not questioned.
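The temporal structure of the stimuli along the VOT continuum can be summarized programmatically. The sketch below is illustrative only; the function name and segment labels are assumptions, with the durations taken from the stimulus description above:

```python
def syllable_segments(vot_ms, total_ms=175.0, frication_ms=5.0):
    """Segment timeline of a synthetic CV syllable on the VOT continuum.

    Mirrors the stimulus description: an initial frication burst, then
    aspiration noise until voicing onset at t = VOT, then the voiced vowel.
    Each segment is (label, start_ms, end_ms).
    """
    segments = [("frication", 0.0, frication_ms)]
    if vot_ms > frication_ms:
        # unvoiced syllables carry aspiration noise up to voicing onset
        segments.append(("aspiration", frication_ms, vot_ms))
    segments.append(("voicing", max(vot_ms, frication_ms), total_ms))
    return segments
```

For example, a 40-ms VOT syllable contains 5 ms of frication, 35 ms of aspiration, and voicing from 40 ms onward, whereas a 5-ms VOT syllable has no aspiration segment.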
RESULTS
Neural representation of stop consonant-vowel syllables in Heschl's gyrus
Voiced CV syllables (/ba/, /ga/, and /da/) elicit a triphasic sequence of field potentials in Heschl's gyrus. These potentials are not specific to speech sounds and occur in response to other sounds such as tone and noise bursts. AEPs are illustrated in Fig. 2, which depicts patient 1 responses evoked by the six voiced and unvoiced stop CV syllables recorded simultaneously from the three EEG contacts within Heschl's gyrus. Onset and peak latencies of the three AEP components are earlier at more medial recording sites. Cortical activity begins at 11-12 ms at the most medial depth recording site (depth 1), and 19-21 ms at more lateral sites (depths 2 and 3). The first component is a small wave of positive polarity (wave A) that peaks at 26-28 ms at depth 1, 34-37 ms at depth 2, and 43-47 ms at depth 3. An additional positive deflection peaking at 55-59 ms also can be observed at depth 2. These positive waves are followed by a large amplitude negativity (wave B) peaking at 56-62 ms at depth 1, and 88-95 ms at the other two sites. The third component is a large amplitude positivity (wave C) peaking at 140 ms at depth 1, and 144-155 ms at depths 2 and 3. Additional potentials time-locked to stimulus offset conclude the responses (asterisks).
Unvoiced CV syllables (/pa/, /ka/, and /ta/) evoke a different pattern of activity that reflects both stimulus onset and the onset of voicing (Fig. 2). Thus the initial activity elicited by consonant onset consists of the same positive wave A. However, the following negativity (wave B) is truncated and replaced by a positive-going (solid arrows) and a negative-going wave (unfilled arrows) time-locked to the 40-ms VOT. Additionally, wave C peaks 40 ms later than when elicited by voiced CV syllables. These findings indicate that VOT is represented in Heschl's gyrus by synchronized activity time-locked to the onset of consonant release and voicing onset.
Observations are extended by the AEPs recorded at the three locations in Heschl's gyrus of patient 2 (Fig. 3). Because these AEPs are recorded between adjacent high-impedance contacts, amplitudes of the responses are diminished, depicted polarity of the waveforms is arbitrary, and phase shifts of the response peaks are likely. The strength of this method is the marked attenuation of far-field potentials and accentuation of local activity at the recording sites. Field potentials at depth 4, the most medial of the recording sites, are illustrated with a recording montage that places the electrode contact recording the largest amplitude response as the "active" electrode (contact 2), and the contact recording the smallest response as the reference. With this montage, waves A-C are clearly identified. Peak latencies for wave A are 27 ms for /da/, 28 ms for /ga/, and 37 ms for /ba/. A preceding positivity is present in the response to /da/ that peaks at 16 ms and can be seen as a positive deflection on the other responses. Wave B follows the initial activity and peaks at 54 ms for /da/, 57 ms for /ga/, and 63 ms for /ba/. Wave C peaks between 102 and 108 ms. The near-field responses also contain superimposed, low-amplitude waves phase-locked to the syllable fundamental frequency (f0). Response components time-locked to the offset of the syllables conclude the response. No consistent response time-locked to voicing onset is present. AEPs at depth 5, located 6 mm anterolateral from depth 4, are dominated by a near-field component with peak latencies of 63-66 ms that overlap in time with wave B. Low-amplitude oscillatory activity phase-locked to the syllable f0 is superimposed on the slower components, but responses to voicing onset are also absent.
Local field potentials evoked by the voiced CV syllables at depth 6, 14 mm anterolateral from depth 4, contain a large amplitude positive wave peaking at 42-47 ms. The positive wave is followed by a large negativity with variable peak latency (62-86 ms), which in turn is succeeded by a later positivity peaking at 134-144 ms for the voiced CV syllables. These components overlap in time with waves A-C recorded at the lateral sites in patient 1. The most relevant finding at this site is the additional positivity time-locked to voicing onset for the three unvoiced CV syllables. This finding, based on bipolar recordings between adjacent contacts, indicates that representation of VOT by synchronized activity to consonant release and voicing onset can occur at the same auditory cortical sites.
Spectral features of the CV syllables also are reflected at these
sites. These findings mirror previous observations in the monkey
(Steinschneider et al. 1995a). Independent analysis
detected and isolated multiple units at depths 4 and
6. Maximum tone responses of these units were 2,125 ± 252 and 736 ± 91 Hz (means ± SD), respectively, and
are in accord with findings that higher frequencies are encoded at more
posteromedial locations in human A1 (Howard et al. 1996a). The initial component in the AEP is largest to /ga/ at
depth 4. The positivity evoked by /ba/ is almost
identical in amplitude but is reduced to only 41% in the response to
/da/ (% maximum response indicated in Fig. 3). Similarly, the largest initial response to the unvoiced CV syllables is to /ka/ followed by
/ta/ and /pa/. This pattern is generally consistent with the spectral
content at consonant release for the syllables and the tonotopic
sensitivity of the sites. /Ga/ and /ka/ have a spectral maximum at
midfrequencies that overlap the maximum tone sensitivity of
depth 4. In contrast, the simultaneous recordings to the
CV syllables at depth 6 are maximal to the labial
consonants /b/ and /p/. These consonants have energy concentrated in
lower frequencies that overlap the tonal sensitivity of the site. Note
that the additional responses at depth 6 reflecting the
extended VOT for the unvoiced CV syllables are present regardless of
the preceding consonant place of articulation. These findings suggest
parallel representation for the spectral feature of place of
articulation and the temporal feature of voicing onset in human
auditory cortex. Furthermore, the tonotopic sensitivity at depth
6 suggests that voicing onset may be preferentially represented
in lower best frequency regions of Heschl's gyrus.
Neural representation of VOT in Heschl's gyrus
AEPs at depth 6 also exhibit marked changes in response
morphology that suggest a differential representation of VOT evoked by
the voiced consonant /d/ and the unvoiced consonant /t/ (Fig. 4). Figure 4, left, depicts
the responses to the syllables /da/ and /ta/ with VOTs of 0-80 ms in
20-ms increments. Field potentials evoked by the /ta/ stimuli with 40-, 60-, and 80-ms VOTs contain two positive polarity components that are time-locked to stimulus onset and the onset of voicing. There is a progressive 20-ms shift of the second component time-locked to voicing onset as the VOT is reduced from 80 to 60 and 40 ms. In contrast, the AEP evoked by /da/ with the 20-ms VOT fails to exhibit a discrete response time-locked to voicing onset. The expected location of this response is marked by the single-headed arrow. To ensure that the time-locked response to voicing onset is not masked by the large-amplitude negative slow wave that follows the initial positivity, the AEPs were high-pass filtered >10 Hz. This filter setting was chosen after Fourier analysis indicated that the slow-wave energy was <10 Hz. Filtered responses are shown in Fig. 4, right. The /ta/ stimuli still evoke response components time-locked to both consonant release and voicing onset, whereas /da/ with the 20-ms VOT fails to evoke a discrete response to voicing onset. Instead, the initial positive wave contains a shoulder of activity marking the location of the response to voicing onset.
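The filtering step described above can be illustrated with a simple frequency-domain high-pass. This is a hypothetical sketch: the study specifies only the >10-Hz cutoff chosen after Fourier analysis of the slow-wave energy, not the filter implementation, and the function names are assumptions:

```python
import numpy as np

def highpass_fft(aep, fs, cutoff_hz=10.0):
    """Remove components below cutoff_hz from an averaged waveform.

    A simple zero-phase frequency-domain high-pass: transform, zero the
    slow-wave band, and invert. Used here purely for illustration.
    """
    spec = np.fft.rfft(aep)
    freqs = np.fft.rfftfreq(len(aep), d=1.0 / fs)
    spec[freqs < cutoff_hz] = 0.0          # zero out the slow-wave band
    return np.fft.irfft(spec, n=len(aep))

def amplitude_spectrum(aep, fs):
    """Fourier amplitude spectrum, as used to verify that the slow-wave
    energy lay below 10 Hz before choosing the cutoff."""
    freqs = np.fft.rfftfreq(len(aep), d=1.0 / fs)
    return freqs, np.abs(np.fft.rfft(aep))
```

A frequency-domain filter like this introduces no phase (latency) shift, which matters when peak latencies are the measure of interest.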
Similar changes in AEP morphology are replicated in the field
potentials recorded by the low-impedance electrode contacts in
patient 1 (Fig. 5). AEPs
evoked by the stimuli with a prolonged VOT (40-80 ms) and recorded at
depth 2 display positive (solid arrow) and negative
(asterisk) components time-locked to voicing onset that are delayed
from the initial response complex by an amount equal to the VOT. The
syllable with the 20-ms VOT does not elicit a well-defined response to
voicing onset (open arrow) and instead evokes a response more similar
to that evoked by /da/ with the 0-ms VOT. The syllables evoke a similar pattern of activity at depth 3, though the response to voicing onset for the 40-ms VOT stimulus is seen only as a positive-going deflection on the following positivity. Although differences are noted in the AEPs at depth 1, no pattern reflecting a discrete representation of voicing onset is evident. This suggests that the synchronized response pattern reflecting VOT is generated in specific regions of auditory cortex and is supported by a similar localization of activity exhibited in the AEPs recorded from patient 2.
Findings are highlighted by the 10-Hz high-pass-filtered AEPs evoked by the VOT series and recorded from depths 2 and 1 (Fig. 6, A and B). Activity at depth 2 (A) contains prominent positive waves time-locked to voicing onset for the three stimuli with more prolonged VOTs, whereas a markedly diminished positivity, barely above baseline, is evoked by the syllable with the 20-ms VOT. In contrast, simultaneously recorded activity at depth 1 (B) consists of prominent ON components and oscillatory responses phase-locked to the syllable f0 that follow the response to consonant onset at a nearly constant delay. Additional evidence revealing major differences between the responses evoked by consonants with short and long VOTs is shown in C. These waveforms are derived by subtracting the response evoked by the completely voiced syllable /da/ with the 0-ms VOT from each of the other AEPs. The waveforms derived using the /ta/-evoked responses contain a large-amplitude negative wave that shifts in peak latency by increments of 20 ms as the VOT is increased by identical amounts. These waves are markedly diminished when the response is derived from the AEP evoked by the syllable with the 20-ms VOT. Thus the responses evoked by the two short-VOT syllables are nearly identical, whereas the CV syllables with prolonged VOTs elicit profoundly different responses.
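The subtraction used to derive these difference waveforms can be sketched as follows (a minimal illustration; the data structure and function name are assumptions):

```python
import numpy as np

def difference_waves(aeps_by_vot):
    """Subtract the fully voiced (0-ms VOT) response from each other AEP.

    aeps_by_vot maps VOT (ms) -> averaged waveform (equal-length arrays).
    Components shared with the 0-ms syllable cancel in the difference,
    isolating activity tied to voicing onset.
    """
    ref = np.asarray(aeps_by_vot[0], dtype=float)
    return {vot: np.asarray(w, dtype=float) - ref
            for vot, w in aeps_by_vot.items() if vot != 0}
```

A near-flat difference wave for the 20-ms syllable and large difference waves for the longer VOTs correspond to the pattern shown in Fig. 6C.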
Previous studies examining magnetic responses evoked by syllables or two-tone analogues of the VOT continuum have reported a categorical-like decrease in amplitude of the major surface negative wave when VOT or tone separation increased from 20 to 40 ms (Simos et al. 1998a-c). Data from these studies were acquired after low-pass filtering to improve signal quality. We examined whether equivalent findings would be demonstrated in the intracortical data by digitally filtering the syllable-evoked responses <20 Hz, thus mimicking AEPs that might be observed with scalp recordings using typical filter settings. Filtered AEPs recorded at depth 2 are shown in Fig. 6D. Similar to the noninvasive recordings of the previous studies, there is a decrease in the peak and trough-to-peak amplitudes for only the responses evoked by the prolonged VOT stimuli. Additionally, two sequential positive waves separated by the VOT interval are observed only for the responses evoked by /ta/ (*). These findings support the noninvasively acquired data and suggest that one reason for the amplitude decrement is a truncation of the principal slow waves by the introduction of components time-locked to voicing onset for the /ta/ stimuli.
Results for the low-pass filtered AEPs at all three depths in patient 1 were quantified by measuring the maximum baseline-to-positive peak and trough-to-peak deflections in the waveforms. For both measures, there is a dramatic reduction in amplitude of the slow wave when VOT increases from 20 to 60 ms at all three sites (Fig. 7). The summed trough-to-peak excursion was maximal for the syllables with the 0-ms (100%) and 20-ms (94%) VOTs. Syllables with the 40-, 60-, and 80-ms VOT intervals elicited responses 65, 52, and 54% of maximum, respectively. When baseline-to-peak measurements were made, syllables with the 0- and 20-ms VOT intervals evoked the largest responses (99 and 100%, respectively). Evoked responses for the 40-, 60-, and 80-ms VOT stimuli were 70, 47, and 52% of maximum, respectively.
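The two amplitude measures used in this quantification, and their normalization to the series maximum, can be sketched as follows (illustrative only; function names and the baseline convention are assumptions):

```python
import numpy as np

def amplitude_measures(aep, baseline=0.0):
    """Baseline-to-positive-peak and trough-to-peak deflections of one AEP."""
    a = np.asarray(aep, dtype=float)
    return a.max() - baseline, a.max() - a.min()

def percent_of_max(values):
    """Express each syllable's measure as a percentage of the largest value
    in the VOT series, as reported for Fig. 7."""
    v = np.asarray(values, dtype=float)
    return 100.0 * v / v.max()
```

Applying `percent_of_max` across the five VOT conditions yields the kind of normalized profile reported above (e.g., 100 and 94% for the short-VOT syllables versus 52-65% for the prolonged VOTs).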
Heschl's gyrus responses evoked by acoustic transients embedded in click trains
The previous data indicate that the acoustic transients produced at consonant release and voicing onset are partially represented by synchronized responses within auditory cortex. Furthermore, the markedly diminished response evoked by voicing onset embedded in the syllable with the 20-ms VOT suggests that there is a relative refractory time after stimulus onset when it is difficult to generate a second transient response. This refractory period may be partially responsible for defining the psychoacoustical boundary between voiced and unvoiced stop consonants. To further investigate this phenomenon, the ability of auditory cortex to respond to acoustic transients in the form of click trains was examined in patient 2 (Fig. 8). Phase-locked responses to the individual clicks were present at rates up to 50 Hz at depths 5 and 6 and up to 100 Hz at depth 4. Additionally, lower frequency click trains evoked well-defined second transient responses (solid arrows). However, there was a marked decrease in amplitude of the second phase-locked response at 40 and 50 Hz (unfilled arrows), corresponding to interpulse intervals of 20-25 ms. Further, the AEPs evoked by these click trains are similar in morphology to that evoked by the 100-Hz train. High-pass filtering the AEPs >10 Hz also failed to reveal a transient response peak to the second click in the higher frequency trains. The presence of activity time-locked to the individual clicks later in the waveforms, including trains at higher rates, suggests that stimulus onset engages a sequence of cortical events that inhibit the generation of a transient response within a time window similar to that of the VOT boundary.
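The correspondence between click-train rate and interpulse interval, and the window in which the second transient response was suppressed, can be made explicit (the window bounds simply restate the 20-25 ms intervals reported above; the function names are illustrative):

```python
def interpulse_interval_ms(rate_hz):
    """Interpulse interval (ms) of a periodic click train."""
    return 1000.0 / rate_hz

def second_response_suppressed(rate_hz, window_ms=(20.0, 25.0)):
    """True when the second click falls within the ~20-25 ms post-onset
    window in which the second transient response was markedly diminished
    (the 40- and 50-Hz trains); the window bounds are taken from the
    intervals reported here, not from an explicit model."""
    ipi = interpulse_interval_ms(rate_hz)
    return window_ms[0] <= ipi <= window_ms[1]
```

Thus the 40- and 50-Hz trains place the second click 25 and 20 ms after onset, inside the same temporal window as the voiced/unvoiced VOT boundary.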
VOT representation in nonprimary auditory cortex
Speech-evoked AEPs recorded from subdural strip electrodes placed
on the lateral surface of the superior temporal gyrus in patients
1 and 3 also contain features that accentuate
differences between syllables with short and long VOTs (Fig.
9, patient 3). The illustrated
electrode site was located lateral to the posterior margin of Heschl's
gyrus. Only the syllables with a VOT of 40 to 80 ms evoke two positive
waves separated by an interval equal to the VOT (A, 2nd peak
denoted by solid arrows). These positive waves are preceded by two
negative waves the separation of which also equals the extended VOT. In
contrast, the syllables with a VOT of 0 and 20 ms elicit a single
positive and negative wave complex. AEPs were high-pass filtered >10
Hz to investigate whether similar waves are masked by slower frequency
components for the 20-ms VOT stimulus (B). Again, the
stimuli with prolonged VOTs evoke two prominent positive waves
time-locked to consonant release and voicing onset. The filtered
AEP evoked by the syllable with the 20-ms VOT contains only a single
positive wave preceded by a small positive-going deflection embedded in
the initial negativity (asterisk). Similar findings also were observed
in the high-pass filtered AEPs recorded from the subdural electrodes
overlying the posterior portion of the superior temporal gyrus in
patient 1 (not shown).
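The high-pass filtering step used in this analysis (removing components below ~10 Hz so that slow waves do not mask transient peaks) can be sketched as follows. The paper does not specify the filter design, so a simple first-order high-pass filter and a synthetic waveform are assumed here purely for illustration.

```python
# Sketch of high-pass filtering an AEP above ~10 Hz (filter design and
# test waveform are assumptions; the paper does not specify them).
import math

def highpass(signal, fs_hz, cutoff_hz=10.0):
    """First-order high-pass filter in difference-equation form."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / fs_hz
    alpha = rc / (rc + dt)
    out = [signal[0]]
    for i in range(1, len(signal)):
        out.append(alpha * (out[-1] + signal[i] - signal[i - 1]))
    return out

# Synthetic example: a slow 2-Hz wave with a brief transient riding on it.
fs = 1000  # samples per second (1 sample per ms)
x = [math.sin(2 * math.pi * 2 * t / fs) for t in range(fs)]
x[500] += 1.0  # transient at 500 ms
y = highpass(x, fs)
print(max(range(len(y)), key=lambda i: abs(y[i])))  # → 500
```

The slow 2-Hz component is strongly attenuated while the transient at 500 ms survives as the largest deflection, which is the point of applying the filter before looking for a second response peak.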
Response components linearly related to the VOT also are present in the waveforms, indicating that auditory cortex is capable of generating responses that differentiate among all five syllables regardless of their consonant perception. This is evident in the progressive 20-ms shift of the later negative AEP component as VOT is increased from 0 to 80 ms (Fig. 9A, unfilled arrows). Additionally, a significant difference wave is generated when /da/ with the 0-ms VOT is subtracted from the syllable with the 20-ms VOT (C). The difference wave includes a positive component that shifts in peak latency with VOT (asterisks). Although this component is present in the difference wave between the two short VOT syllables, it is only 52% of the peak amplitude generated from /ta/ with the 60-ms VOT. The other /ta/ stimuli evoke a positive difference wave that is 85% of the maximum.
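The difference-wave analysis above (subtracting the AEP to the 0-ms VOT syllable from the AEP to each longer-VOT syllable) can be sketched with synthetic stand-in waveforms. The waveform shapes, amplitudes, and the helper `toy_aep` are hypothetical; only the subtraction and peak-comparison steps follow the analysis described in the text.

```python
# Sketch of the difference-wave computation (waveforms are synthetic
# stand-ins; toy_aep and its parameters are hypothetical).

def difference_wave(aep, reference):
    """Point-by-point subtraction of a reference AEP from another AEP."""
    return [a - r for a, r in zip(aep, reference)]

def peak_amplitude(wave):
    return max(abs(v) for v in wave)

def toy_aep(vot_ms, n=120):
    """Hypothetical AEP (1 sample/ms): a response to consonant release at
    10 ms, plus a voicing-onset deflection whose latency tracks VOT and
    whose amplitude is reduced for VOTs at or below 20 ms."""
    wave = [0.0] * n
    wave[10] = 1.0
    wave[10 + vot_ms] += 0.2 if vot_ms <= 20 else 1.0
    return wave

ref = toy_aep(0)  # the 0-ms VOT syllable serves as the reference
peaks = {vot: peak_amplitude(difference_wave(toy_aep(vot), ref))
         for vot in (20, 40, 60, 80)}
max_peak = max(peaks.values())
for vot, p in sorted(peaks.items()):
    print(vot, round(100 * p / max_peak))  # peak as % of the maximum
```

With these toy waveforms the 20-ms VOT syllable yields a much smaller difference-wave peak than the longer-VOT syllables, mirroring the reported 52% versus 85% amplitude relationship in form (the exact percentages depend on the real waveforms).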
Several considerations suggest that the VOT-related activity recorded from the lateral subdural region reflects local responses and not just far-field AEPs generated in Heschl's gyrus. First, the reference electrode for the subdural recordings shown in Fig. 9 failed to record AEPs that reflected syllable VOT when it was referenced to a subdural electrode located on the inferior surface of the temporal lobe. The absence of relevant activity emphasizes the focal nature of the responses encoding VOT and indicates that the previous AEPs were generated primarily by activity recorded from the lateral surface subdural electrode. Furthermore, AEPs recorded from the subdural site and evoked by click trains contained fundamentally different patterns from those seen in Heschl's gyrus (Fig. 9D). Whereas well-defined phase-locked responses that diminish with increasing frequency (low-pass) were recorded from Heschl's gyrus (Fig. 8), very weak phase-locking was present at the subdural site to low-frequency click trains. In contrast, phase-locked responses at the lateral subdural site were maximal at click train rates of 40 and 50 Hz, demonstrating band-pass characteristics (D, arrows).
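The low-pass versus band-pass distinction drawn above lends itself to a simple characterization of a recording site by its rate-amplitude profile. The sketch below is illustrative only: the amplitude values and the classification rule are assumptions, not the authors' analysis.

```python
# Sketch (hypothetical values and rule): classify a recording site as
# low-pass or band-pass from the amplitude of its phase-locked response
# at each click-train rate tested.

def rate_profile_shape(amplitudes_by_rate):
    """'low-pass' if the response is maximal at the lowest rate,
    'band-pass' if the maximum lies at an interior rate,
    'high-pass' if it lies at the highest rate."""
    rates = sorted(amplitudes_by_rate)
    amps = [amplitudes_by_rate[r] for r in rates]
    best = amps.index(max(amps))
    if best == 0:
        return "low-pass"
    if best < len(amps) - 1:
        return "band-pass"
    return "high-pass"

# Hypothetical phase-locked amplitudes (arbitrary units) at the rates used.
heschl = {10: 1.0, 20: 0.9, 40: 0.5, 50: 0.3, 100: 0.1}
lateral = {10: 0.2, 20: 0.4, 40: 1.0, 50: 0.9, 100: 0.1}
print(rate_profile_shape(heschl), rate_profile_shape(lateral))
# → low-pass band-pass
```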
AEPs also were recorded from three high-impedance sites on an electrode placed in the planum temporale of patient 2 (not shown). Waveforms evoked by the syllables consisted of a triphasic complex with peaks at 33, 61, and 107 ms. Latency differences between medial and lateral recording sites were not observed. Surprisingly, speech sounds failed to elicit response components time-locked to voicing onset. However, phase-locked responses were evoked at all three sites by click trains presented at rates of 10, 20, and 40 Hz, and were absent at a stimulus rate of 100 Hz. Similar to the findings recorded from Heschl's gyrus, the second phase-locked response evoked by the 40-Hz click train either was absent or markedly reduced in amplitude.
DISCUSSION
Relationship to perceptual investigations
Numerous psychoacoustical studies have established the importance
of VOT as a cue for the perceptual discrimination of voiced from
unvoiced English stop consonants (e.g., Borsky et al. 1998; Lisker 1975; Summerfield and Haggard 1977). This study demonstrates that the VOT temporal
cue is represented at least partly within auditory cortex by
synchronized responses of neuronal ensembles time-locked to both
consonant release and voicing onset. These neural responses exhibit
features that may facilitate categorical encoding of stop consonants.
AEPs evoked by syllables with a short VOT contain a large response
time-locked to consonant release followed by a variable, low-amplitude
component time-locked to voicing onset. In contrast, AEPs evoked by CV
syllables with a longer VOT contain prominent components time-locked to
both stimulus and voicing onset. The VOT at which an AEP component to
voicing onset becomes obvious is the same VOT at which the perceptual boundary occurs in most listeners. Present findings do not exclude additional phonetic encoding mechanisms or minimize the importance of
speech parameters other than VOT (e.g., 1st formant onset frequency) for discriminating voiced from unvoiced stop consonants (Liberman et al. 1958; McClaskey et al. 1983; Pegg and Werker 1997; Soli 1983; Summerfield and Haggard 1977; Treisman et al. 1995). Furthermore, present results need
to be tempered by the fact that epileptic foci were located in nearby
tissue and may have disrupted normal auditory cortical function and
that AEPs were recorded from variable sites in the nondominant right
hemisphere. Additionally, recording paradigms did not allow AEP
patterns to be correlated tightly with perceptual responses. Despite
these procedural limitations, results imply that auditory cortex is activated in synchronized fashion by temporal speech features and that
these components need to be incorporated into neural models that
characterize initial stages of language perception.
Temporal activity patterns representing VOT may contribute to
categorical perception, wherein stimuli drawn from a physical continuum
naturally segregate into discrete categories, and there is an increase
in perceptual discrimination between stimuli located on either side of
a perceptual boundary (Studdert-Kennedy et al. 1970). In
this scheme, rapid segregation into categories of voiced and unvoiced
stop consonants could be accomplished by determining whether the
syllable evokes one or two discrete response bursts to consonant
release and voicing onset. Heightened discrimination around the
perceptual boundary would be based on comparisons between syllables
that evoke one versus two response bursts. In contrast, discrimination
between two consonants located on the same side of the perceptual
boundary would require the more difficult task of comparing responses
with a similar temporal response pattern. Human listeners are capable, under certain experimental conditions, of discriminating stop consonants located on the same side of the VOT perceptual boundary (Carney et al. 1977; Kewley-Port et al. 1988; Tremblay et al. 1997). Responses reported here mirror
these findings, as subtle differences between the AEPs evoked by the two stimuli with short VOTs were observed, and the /ta/ stimuli could
conceivably be discriminated by identifying differences in timing of
the response burst to voicing onset relative to that evoked by
consonant release.
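The categorization scheme proposed above (one response burst for voiced, two for unvoiced stop consonants) can be sketched as a simple burst-counting rule. This is an illustration of the idea, not an algorithm from the paper; the threshold, minimum gap, and synthetic responses are all assumptions.

```python
# Sketch of the one-versus-two-burst categorization scheme (threshold,
# gap, and synthetic responses are hypothetical).

def count_bursts(response, threshold=0.5, min_gap_ms=10):
    """Count suprathreshold response bursts separated by at least
    min_gap_ms (response sampled at 1 kHz, one value per ms)."""
    bursts, last = 0, -min_gap_ms
    for t, v in enumerate(response):
        if v >= threshold and t - last >= min_gap_ms:
            bursts += 1
            last = t
    return bursts

def classify(response):
    """Two discrete bursts -> unvoiced stop consonant; one -> voiced."""
    return "unvoiced" if count_bursts(response) >= 2 else "voiced"

# Hypothetical responses: a 0-ms VOT syllable evokes a single burst at
# consonant release; an 80-ms VOT syllable evokes a second burst at
# voicing onset.
short_vot = [0.0] * 120
short_vot[10] = 1.0
long_vot = [0.0] * 120
long_vot[10] = 1.0
long_vot[90] = 0.8
print(classify(short_vot), classify(long_vot))  # → voiced unvoiced
```

Discrimination within a category, as discussed above, would then require the finer comparison of burst timing rather than this coarse burst count.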
AEP profiles demarcating English language voiced from unvoiced stop
consonants are likely indexing a natural psychoacoustic boundary
reflecting constraints imposed on temporal processing in the auditory
cortex. For instance, many languages use a different voicing contrast
from that used in English, with a distinction made between prevoiced
stop consonants (vocal cord vibrations precede consonant release) and
those with a short VOT. Studies in both adults and young infants whose
home language does not use the short versus long VOT distinction
demonstrate a natural tendency for enhanced discriminability of
consonants that straddle the English boundary of 20-40 ms
(Keating et al. 1981; Lasky et al. 1975; Streeter 1976). These findings suggest that the VOT phonetic boundary is not limited to learned language-specific categories but is based on the more general ability to identify the
temporal sequence of two acoustical events, such as consonant release
and voicing onset (Pisoni 1977). This conclusion is supported by categorical-like VOT discrimination in animals, which clearly do not possess language-specific capacities (Kuhl and Miller 1978; Kuhl and Padden 1982; Sinnott and Adams 1987). In his classic study, Hirsh (1959) found that a 15- to 20-ms separation in the onset of two sounds was required for their temporal sequence to be
perceived. Subsequent studies found categorical perception with
boundaries similar to those of speech for the temporal ordering of
various other nonspeech stimuli (e.g., Formby et al. 1993; Miller et al. 1976; Phillips et al. 1997; Pisoni 1977; Stevens and Klatt 1974). Our findings suggest that these perceptual limitations are manifested in auditory cortex by a threshold for generating discrete responses to the onsets of consonant release and voicing. The markedly decreased amplitude of the response to the second click in trains with rates of 40-50 Hz reinforces this suggestion.
Relationship to physiological investigations
Timing and morphology of the intracranial AEPs observed in this
study are consistent with previously reported values, suggesting that
although recording sites are limited, they represent a valid sample of
activity in human auditory cortex. Onset of activity evoked by clicks
in Heschl's gyrus is 8-10 ms (Liégeois-Chauvel et al.
1991), compared with 11-12 ms for the syllables. The slightly prolonged latency may reflect the slower rise time and lower frequency content of the speech sounds relative to clicks. The increase in
latency as recordings shift from more medial to lateral sites in
Heschl's gyrus supports previous observations
(Liégeois-Chauvel et al. 1994). Variability among
the few intracranial studies reported makes exact comparisons with
previous data difficult, but several reports describe the first large
amplitude wave as peaking at ~30 ms (Celesia 1976; Liégeois-Chauvel et al. 1994). Wave A of the
present study likely represents this component. The large amplitude
wave B overlaps in time with the prominent positivities seen in more
lateral sites of Heschl's gyrus, whereas wave C likely corresponds to
the N120 wave also observed in lateral Heschl's gyrus
(Liégeois-Chauvel et al. 1994). Polarity inversion of waves A-C relative to previously reported AEP components is expected, as
our electrodes were located on the underside of the current dipoles within auditory cortex. AEPs recorded from the lateral surface of the
superior temporal gyrus have component latencies consistent with the
N1, P2, and N3 click-evoked AEP components reported by Celesia (1976). Additionally, AEPs generated in the planum temporale have a triphasic morphology with polarity and peaks designated N30, P50, and N100 (Liégeois-Chauvel et al. 1994).
These components are nearly identical in latency to the waves seen from
the electrode in the planum temporale. Finally, it is evident from the
body of intracranial data that single dipole models of scalp-recorded AEPs and magnetic responses are extreme simplifications of a complex, overlapping sequence of activation extending from koniocortex to
secondary auditory areas. Similar conclusions have been stressed by
other investigators (Lütkenhöner and Steinsträter 1998; Schreiner 1998),
highlighting the concern that interpretations of results obtained via
noninvasive physiological techniques should be viewed with caution and
be confirmed by direct intracranial recordings.
Given these concerns, it is worthwhile to examine other reports of VOT
encoding by responses time-locked to both consonant release and voicing
onset (Kaukoranta et al. 1987; Kuriki et al. 1995
). Both studies concluded that the magnetic responses
equivalent to the N100 AEP component (N100m) and evoked by the two
speech components were generated by separate sources in auditory
cortex. The same conclusion was reached from analysis of magnetic
responses evoked by a noise burst-square wave complex and two-tone
stimuli, both analogous to syllables varying in their VOT
(Mäkelä et al. 1988; Simos et al. 1998a). The latter two studies are especially relevant because
they also report categorical-like changes in the amplitude of the N100m
with a temporal boundary of 20-30 ms (see also Simos et al. 1998b,c). These conclusions are not wholly supported by the
present data. Locally generated AEPs contain comparable response
features reflecting VOT that are due to activity from single sources
within auditory cortex. Present findings indicate that the nonlinear
decrease in component amplitude is based on attenuation of the slower
waves by new responses time-locked to voicing onset when VOT is more
prolonged than 20 ms. In a similar manner, categorical-like magnetic responses have been obtained with double-click stimulation, where the sources of activation for the two components should be identical (Celesia 1976; Joliot et al. 1994).
Nonlinear changes in this pattern of physiological activity can be
viewed as an example of forward masking. As such, it requires
interaction between the two response components, wherein activation
evoked by the initial portion of the stimulus (e.g., consonant release)
modifies the ability to generate a response to the second stimulus
segment (e.g., voicing onset). The ability of stimuli with disparate
frequency components to generate activity at single cortical locations
is consonant with the widespread activation of auditory cortex by
suprathreshold tones (Bakin et al. 1996; Howard et al. 1996a; Phillips et al. 1994; Schreiner 1998). Thus, speech sounds presented at
conversational levels should produce activation patterns in auditory
cortex that contain regions where responses evoked by
frequency-specific formants can interact to elicit forward masking phenomena.
In this study, activity recorded from three sites in the planum
temporale of one patient did not reflect the VOT of the syllables. While this observation is consistent with previous findings that AEPs
from the superior temporal plane were not highly dependent on the
physical characteristics of acoustic stimuli (Halgren et al.
1995), multiple other reasons may be responsible for this negative result. Sites representing VOT with responses time-locked to
consonant release and voicing onset could have been missed by the
limited sampling of the tissue. AEPs were recorded in the right
hemisphere, which, in this patient, was nondominant for speech.
Temporal response patterns representing VOT may be restricted to
locations in the homologous cytoarchitectonic area of the language-dominant hemisphere. Finally, other activity patterns that do not
represent VOT with synchronized responses of neuronal ensembles time-locked to consonant release and voicing onset may be the relevant
physiological encoding mechanism at these nonprimary sites.
Considerations such as these indicate that it is premature to place too
much emphasis on this negative result for generating schemes related to
speech sound processing.
In contrast, the positive finding that AEPs recorded from the convexity
of the posterior superior temporal gyrus maintain temporal
representation of VOT supports the evolving concept that lateral belt
areas of auditory cortex participate in the pattern recognition of
sound, including speech (Rauschecker 1998; Rauschecker et al. 1995). In this model of cortical
sound encoding, auditory cortex is organized into hierarchical streams
of processing that occur within distinct pathways. Spatial features of
sounds activate a dorsal pathway leading to the parietal cortex,
whereas spectral and temporal patterns of complex sounds are
represented in parallel within a ventral pathway that includes regions
of the lateral superior temporal gyrus. Functional neuroimaging studies
demonstrating greater activation of the lateral superior temporal gyrus
bilaterally by complex acoustic stimuli than by simple tones or noise bursts add additional support for the existence of a ventral pathway involved in the pattern recognition of sound (Strainer et al. 1997; Zatorre et al. 1992).
A critical facet of this study is its complementary relationship with
work performed in experimental animals (for review, see Phillips
1998). Temporal representation of VOT in the human AEP lends
relevance to similar findings in A1 of monkeys, cats, and guinea pigs
(Eggermont 1995a; McGee et al. 1996; Schreiner 1998; Steinschneider et al. 1982, 1994, 1995b). Detailed investigations in A1 of the anesthetized cat indicate that temporal representation of VOT is dependent on stimulus
intensity and is likely a graded, nonlinear, and saturating function of
the interval between consonant release and voicing onset
(Eggermont 1995b, 1999). In a more general framework,
present results reinforce the hypothesis that synchronized activity
within neuronal ensembles is a viable mechanism for encoding specific
features of complex acoustic stimuli (deCharms and Merzenich 1996; Eggermont 1994; Wang et al. 1995). Although human AEPs offer relevance to the animal work,
experimental models allow details of physiological processing to be identified at a level that is otherwise unobtainable. Studies in
monkeys demonstrate a response transformation between thalamocortical
fibers and A1 that accentuates the acoustic transients of consonant
release and voicing onset and have suggested the synaptic events that
generate the speech-evoked human AEP in A1 (Steinschneider et al. 1994). Detailed spatiotemporal patterns of A1 activation induced by speech sounds have been sequenced (Schreiner 1998). Some forms of developmental language disorders may be
based on temporal processing deficits (e.g., Mannis et al. 1997; Merzenich et al. 1996; Tallal et al. 1996; Wright et al. 1997), which in turn may be associated with dysfunction or dysgenesis of the medial geniculate and auditory cortex (Galaburda et al. 1985, 1994; Humphreys et al. 1990; Nagarajan et al. 1999). Experimentally induced cortical dysgenesis produces
temporal processing deficits specific for rapidly presented sounds that
mirror abnormalities present in language-impaired children
(Fitch et al. 1994; Herman et al. 1997).
These exciting findings emphasize that animal models may not only
assist in defining normal processes associated with carefully selected
features of speech encoding, but may help clarify neural mechanisms
underlying aberrant language development.
ACKNOWLEDGMENTS
The authors thank J. C. Arezzo and two anonymous reviewers for helpful comments on an earlier draft of this manuscript and acknowledge the assistance of H. Damasio, M. C. Ollendieck, S. Seto, D. H. Reser, and Y. I. Fishman.
This research was supported by National Institute of Deafness and Other Communications Disorders Grants DC-00120 and DC-00657.
FOOTNOTES
Address for reprint requests: M. Steinschneider, Dept. of Neurology, Kennedy Center, Room 322, Albert Einstein College of Medicine, 1300 Morris Park Av., Bronx, New York 10461.
The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Received 26 January 1999; accepted in final form 5 August 1999.
REFERENCES