1Sloan Center for Theoretical Neuroscience, 2Department of Psychiatry, and 3Department of Physiology, University of California, San Francisco 94143-0444; and 4Department of Psychology, University of California, Berkeley, California 94720-1650
ABSTRACT
Sen, Kamal, Frédéric E. Theunissen, and Allison J. Doupe. Feature Analysis of Natural Sounds in the Songbird Auditory Forebrain. J. Neurophysiol. 86: 1445-1458, 2001. Although understanding the processing of natural sounds is an important goal in auditory neuroscience, relatively little is known about the neural coding of these sounds. Recently we demonstrated that the spectral temporal receptive field (STRF), a description of the stimulus-response function of auditory neurons, could be derived from responses to arbitrary ensembles of complex sounds including vocalizations. In this study, we use this method to investigate the auditory processing of natural sounds in the birdsong system. We obtain neural responses from several regions of the songbird auditory forebrain to a large ensemble of bird songs and use these data to calculate the STRFs, which are the best linear model of the spectral-temporal features of sound to which auditory neurons respond. We find that these neurons respond to a wide variety of features in songs ranging from simple tonal components to more complex spectral-temporal structures such as frequency sweeps and multi-peaked frequency stacks. We quantify spectral and temporal characteristics of these features by extracting several parameters from the STRFs. Moreover, we assess the linearity versus nonlinearity of encoding by quantifying the quality of the predictions of the neural responses to songs obtained using the STRFs. Our results reveal successively complex functional stages of song analysis by neurons in the auditory forebrain. When we map the properties of auditory forebrain neurons, as characterized by the STRF parameters, onto conventional anatomical subdivisions of the auditory forebrain, we find that although some properties are shared across different subregions, the distribution of several parameters is suggestive of hierarchical processing.
INTRODUCTION
To understand how sounds are
heard and interpreted and ultimately influence an organism's behavior,
it is important to investigate the processing of natural sounds.
However, little is known about the neural encoding of natural sounds.
This is partly because the majority of studies have used synthetic
stimuli such as white noise or tones to characterize auditory
processing (for a review, see Eggermont et al. 1983c).
Although these studies have provided a wealth of information on the
organization of the auditory pathway and on the response
characteristics of auditory neurons, it has become increasingly clear
that it is difficult to use this knowledge to predict the neural
responses to complex natural sounds such as vocalizations
(Eggermont et al. 1983b; Theunissen et al. 2000). This is particularly problematic for characterizing
high-level auditory neurons that may be optimized to analyze natural
sounds. An alternative and more direct approach is to characterize
auditory neurons using these sounds.
Many natural sounds are structurally complex and contain both spectral
and temporal correlations (Attias and Schreiner 1997; Nelken et al. 1999; Theunissen et al. 2000). Until recently, this posed a methodological problem for
the systematic characterization of the stimulus-response function of
auditory neurons with natural sounds. This is because the reverse
correlation method that was used to estimate the spectral-temporal
receptive field (STRF) assumed a stimulus ensemble free of spectral and
temporal correlations (Aertsen and Johannesma 1981; Eggermont et al. 1983a). We recently extended the STRF
method to overcome this limitation by taking into account the spectral
and temporal correlations present in the stimulus ensemble
(Theunissen et al. 2000). Our method corrects for the
spectral and temporal correlations present in sounds by performing a
weighted average of the stimulus around each spike using a mathematical
operation that involves a de-correlation in frequency and
de-convolution in time. In this study, we apply this extended method to
investigate the processing of natural sounds in the birdsong system.
The birdsong system offers several advantages for studying the
processing of natural sounds. Songbirds display a remarkable ability to
process auditory information (for a review of the birdsong system and
behavior, see Konishi 1985). At birth, songbirds are endowed
with an inborn behavioral selectivity for the sounds of their own
species (Marler 1991). Auditory information plays a critical role in song learning in juvenile songbirds and in song maintenance in adult birds and is an important component of many social
behaviors in songbirds. For this highly sophisticated behavioral repertoire to be possible, a wide variety of natural sounds, especially songs, must be detected and discriminated by the auditory system of songbirds. Currently, the neural basis of these behaviors
is poorly understood.
Anatomical (Fortune and Margoliash 1992; Kelley and Nottebohm 1979; Vates et al. 1996) and physiological (Janata and Margoliash 1999; Langner et al. 1981; Lewicki and Arthur 1996; Mello and Clayton 1994; Muller and Leppelsack 1985) experiments suggest that auditory forebrain
areas such as field L may contribute to the ability of songbirds to
detect and discriminate a wide variety of complex natural sounds. In
the anatomical chain of acoustical processing stages of the avian
brain, the field L region lies between the thalamic auditory relay
nucleus ovoidalis (Ov) and higher-level auditory areas such as HVc and
the medial portion of the caudal neostriatum (NCM) (Vates et al. 1996) (Fig. 1). This location is
analogous to the location of auditory cortex in mammals. As in the
primary auditory areas of many other animals, field L in zebra finches
and other birds displays a tonotopic organization (Bonke et al. 1979; Gehr et al. 1999; Muller and Leppelsack 1985; Zaretsky and Konishi 1976).
Based on Nissl and Golgi staining studies, the field L region has been
divided into five subregions called L2a, L2b, L1, L3, and L (Fortune and Margoliash 1992). Neuroanatomical tracer
studies have shown that the thalamic input from Ov projects strongly to
areas L2a and L2b and more weakly to L1 and L3. L2a projects strongly to
L1 and L3, and all field L regions project to cHV (Fig. 1).
In zebra finches, the stages of auditory processing in field L and
other auditory forebrain areas are also likely to contribute to the
response properties of "song-selective" neurons found in high level
auditory areas such as HVc, since these areas are the primary source of
sensory input to HVc. Song-selective neurons, which respond more
strongly to the bird's own song (BOS) than to even very similar
auditory stimuli, have been well characterized in a number of studies
(Margoliash 1983, 1986; Margoliash and Fortune 1992; Mooney 2000; Theunissen and Doupe 1998; Volman 1993). However, the earlier stages of auditory processing that may participate in the generation of such highly selective neurons have only begun to be explored (Janata and Margoliash 1999; Lewicki and Arthur 1996).
So far, a systematic study of the stimulus-response function of auditory forebrain neurons has not been undertaken with natural sounds. Thus several interesting questions remain to be addressed. To what features of natural sounds do auditory forebrain neurons respond? What are the characteristic spectral and temporal parameters of such features? Do the distributions of parameters indicate the emergence of increasingly complex features in the auditory forebrain? In this paper, we address these questions by obtaining the STRFs for auditory forebrain neurons using a large ensemble of conspecific songs (CONs) and extracting several parameters from the STRFs to assess multiple aspects of the processing of songs in the auditory forebrain.
METHODS
Electrophysiology
All physiological recordings were done in urethane-anesthetized adult male zebra finches in acute experiments. Extracellular waveforms were obtained using parylene-coated tungsten electrodes (resistance 1-3 MΩ) that were inserted into the neostriatum of the bird at
locations that were previously marked with stereotaxic measurements.
The extracellular waveforms were transformed into spike trains, using a
window discriminator, by windowing the largest action potential.
Waveforms from successive spikes in the window were examined on a fast
time base to estimate the number of units. Cases where the waveform had
a single reliable and stereotyped spike shape were classified as single
units. Multiunit recordings consisted of spike waveforms that could be
easily distinguished from background activity but not from each other.
Single units (18/62) or small multiunit clusters consisting of two to
five neurons (44/62) were recorded in this manner. We did not observe any significant differences in our results for these two groups (see
RESULTS). At the end of the experiment, the bird was deeply anesthetized and transcardially perfused. The locations of the recordings were verified histologically in Nissl-stained brain sections. The location of the sites was classified into anatomical subregions of field L as described in Fortune and Margoliash
(1992). We considered L and L2b as a single composite region
since no clear border between these two regions was apparent. We will
refer to this composite region as L2b. The data presented here were obtained from 10 birds and 62 recording sites (6 in L2a, 21 in L2b, 13 in L1, 16 in L3, and 6 in cHV). (For a more detailed description of
recording methods, see Theunissen and Doupe 1998; Theunissen et al. 2000.)
Stimuli
An ensemble of 20 conspecific songs, previously used in
Theunissen et al. (2000), was used to obtain neural
responses in the auditory forebrain of each bird. The same set of
conspecific songs was used in all our experiments. For each bird, we
added the BOS to this ensemble giving a total of 21 songs. Stimuli were
played at a peak intensity of 80 dB SPL and randomly interleaved to
obtain 10 trials of responses to each song in the ensemble. The average song duration was 2.1 s. All of the songs, including BOS, were used to compute robust estimates of stimulus ensemble properties such
as the power spectrum and autocorrelation matrix, as previously described in Theunissen et al. (2000). We did not
separately characterize the STRFs in response to the BOS in this study,
since a reliable estimate of the STRF (see following text) requires
much more data than we had for the BOS alone. Moreover, this
calculation would also lead to significant methodological difficulties,
because the BOS alone samples only a very small part of stimulus space (Theunissen et al. 2000). To compare the responses of neurons to BOS versus the other songs in the ensemble, we used the d' measure of selectivity, previously used to quantify song selectivity in other areas of the song system (Janata and Margoliash 1999; Theunissen and Doupe 1998). We did not
detect a difference in response to the BOS compared with the other songs (P = 0.4, one-sample sign test). Our observation is
consistent with previous studies, which have found the majority of
field L neurons to be unselective for the BOS compared with other
conspecific songs or manipulations of the BOS such as reversed BOS and
syllable-order-reversed BOS (Janata and Margoliash 1999; Lewicki and Arthur 1996).
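For readers who wish to reproduce this comparison, a minimal sketch of the d' computation follows (in Python). The use of trial-by-trial mean firing rates as the response statistic is an assumption here, and the formula is one common form of the d' measure used in the song-system literature, not necessarily the exact implementation of the original studies.

```python
import numpy as np

def d_prime(responses_a, responses_b):
    """Selectivity of responses to stimulus A vs. stimulus B.

    responses_a, responses_b: arrays of trial-by-trial response strengths
    (e.g., mean firing rate during each presentation). One common form:
    d' = 2 * (mean_a - mean_b) / sqrt(var_a + var_b).
    """
    mu_a, mu_b = np.mean(responses_a), np.mean(responses_b)
    var_a, var_b = np.var(responses_a, ddof=1), np.var(responses_b, ddof=1)
    return 2.0 * (mu_a - mu_b) / np.sqrt(var_a + var_b)

# Example usage with 10 trials per song:
# print(d_prime(bos_trial_rates, con_trial_rates))
```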
STRF calculation
A detailed description of the calculation of STRFs from natural
sounds can be found in Theunissen et al. (2000). This is
briefly summarized here. We used an invertible spectrographic
representation of sound in which sound is first decomposed by passing
it through a set of Gaussian filters of 250 Hz width (SD) spanning
center frequencies between 250 and 8,000 Hz. The sound is then
represented by a set of functions of time {s_i(t)}, where s_i(t) is taken to be the log of the amplitude envelope of the signal in the frequency band i. The STRF is defined as the multi-dimensional linear Volterra filter h_i(t) such that

$$r_{\mathrm{pre}}(t) = \sum_{i} \int h_i(\tau)\, s_i(t - \tau)\, \mathrm{d}\tau$$

where r_pre(t) is the best linear prediction of the time-varying firing rate.
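As an illustration of this spectrographic representation, the sketch below computes log amplitude envelopes in Gaussian frequency bands. The 250-Hz band spacing, the sampling rate, and the use of the analytic signal for envelope extraction are illustrative assumptions, not details specified in the text.

```python
import numpy as np

def spectrographic_representation(sound, fs, centers=np.arange(250, 8001, 250),
                                  bw_sd=250.0):
    """Log amplitude envelopes of `sound` in Gaussian frequency bands.

    Each band is obtained by multiplying the spectrum with a Gaussian of
    SD bw_sd (Hz) centered at fc; keeping only positive frequencies gives
    an (unnormalized) analytic signal, whose magnitude is the envelope.
    """
    n = len(sound)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    spectrum = np.fft.fft(sound)
    bands = []
    for fc in centers:
        gauss = np.exp(-0.5 * ((freqs - fc) / bw_sd) ** 2)
        gauss[freqs < 0] = 0.0                 # analytic signal: drop negative freqs
        envelope = np.abs(np.fft.ifft(spectrum * gauss))
        bands.append(np.log(envelope + 1e-12))  # log envelope, floored to avoid log(0)
    return np.array(bands)                      # shape: (n_bands, n_samples)
```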
To determine the significance of regions in the STRFs obtained, we used
a jackknife resampling method where STRFs were calculated for multiple
subsets of the conspecific song ensemble that were obtained by deleting
one song at a time from the complete ensemble. The variance for each
spectral-temporal bin in the STRF estimate was calculated from this set
of STRFs using the jackknife formula
$$\sigma^2_{\mathrm{jack}} = \frac{N-1}{N} \sum_{k=1}^{N} \left( \hat{h}_{(k)} - \frac{1}{N} \sum_{j=1}^{N} \hat{h}_{(j)} \right)^2$$

where N is the number of songs in the ensemble and ĥ_(k) denotes the STRF estimated with song k deleted; the formula is applied separately to each spectral-temporal bin.
Figure 2A shows the raw STRF obtained from a site in L2a, and Fig. 2B shows the jackknife standard error for this STRF.
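A minimal sketch of this delete-one jackknife follows; here estimate_strf is a hypothetical stand-in for the decorrelation-based STRF estimator of Theunissen et al. (2000), which is not reproduced in this code.

```python
import numpy as np

def jackknife_strf_error(songs, responses, estimate_strf):
    """Per-bin jackknife standard error of the STRF estimate.

    estimate_strf(songs, responses) -> STRF array of shape (n_freq, n_lags);
    it stands in for the decorrelation-based estimator (not shown here).
    """
    n = len(songs)
    deletions = []
    for k in range(n):                                    # delete one song at a time
        keep = [j for j in range(n) if j != k]
        deletions.append(estimate_strf([songs[j] for j in keep],
                                       [responses[j] for j in keep]))
    deletions = np.array(deletions)                       # (n, n_freq, n_lags)
    mean_del = deletions.mean(axis=0)
    var_jack = (n - 1) / n * ((deletions - mean_del) ** 2).sum(axis=0)
    return np.sqrt(var_jack)                              # jackknife SE per bin
```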
To display the significant part of the STRF, we first estimated the noise level in the raw STRFs using a singular value decomposition (SVD) technique. The SVD decomposes the STRF into a weighted sum of a number of terms, each of which is an outer product of a function of time and a function of frequency. The weights corresponding to each of the terms are the singular values obtained from the SVD. For an ideal, completely noise-free STRF the nonzero singular values can be used to reconstruct the STRF without any loss of information. In practice, due to noise in the estimation of the STRF, the singular values do not drop abruptly to zero but tail off gradually. We therefore compared the SVD obtained from a window (width, 100 ms) containing all of the structure in the STRF to the SVD obtained from a window representing noise (a 100-ms window from the acausal portion of the STRF corresponding to stimulus following spikes). The singular values obtained from the raw STRF that exceeded the maximal singular value obtained from the noise were used to reconstruct the STRF (Fig. 2). We found that this method effectively filtered out the noise in the raw STRFs. Then, to illustrate the significance of the different regions of the STRF, we show the contours for one and two times the significance level superimposed on these reconstructed STRFs. As a conservative estimate, we defined the significance level to be the maximal jackknife standard error for the STRF.
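The following sketch illustrates this SVD-based denoising step, assuming the causal (signal) and acausal (noise) 100-ms windows have already been extracted as frequency-by-time-lag matrices:

```python
import numpy as np

def denoise_strf(strf_causal, strf_acausal):
    """SVD denoising: keep components above the noise floor.

    strf_causal:  (n_freq, n_lags) window containing the STRF structure.
    strf_acausal: window of the same size from the acausal portion
                  (stimulus following spikes), used as a noise estimate.
    Components whose singular values exceed the largest noise singular
    value are kept; the rest are discarded.
    """
    u, s, vt = np.linalg.svd(strf_causal, full_matrices=False)
    s_noise = np.linalg.svd(strf_acausal, compute_uv=False)
    keep = s > s_noise.max()
    return (u[:, keep] * s[keep]) @ vt[keep, :]   # reconstruct from kept terms
```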
Parameters describing STRFs
We obtained several parameters from each STRF characterizing its
temporal and spectral properties. Similar parameters have been obtained
from STRFs in the auditory (Depireux et al. 2001; Hermes et al. 1981, 1982; Keller and Takahashi 2000; Kim and Young 1994) as well as visual (Cai et al. 1997) domains. The time to peak
(Tpeak) was defined as the time to the
absolute maximal value of the STRF. We also used the STRF to directly
estimate the temporal characteristics of each neuron's processing of
amplitude envelopes of songs. We call this parameter the best
modulation frequency (BMF). To obtain the BMF, we took a slice through
the maximal value of the STRF along the temporal dimension and obtained
the peak of the power spectral density of this slice (Fig.
6D). The power spectral density was estimated using a fast
Fourier transform with a Hanning window. As defined here, this measure
may differ from the conventional BMF, which is obtained from neural
responses to simple amplitude modulated tone bursts, using a range of
AM frequencies. To quantify the spectral characteristics of neural responses, we took a slice through the maximal value of the STRF along
the frequency dimension to obtain the peak frequency (CF) and a width
at half-maximum (W). We used a quality factor, defined as
the peak frequency divided by the width, Q = CF/W, as a measure of sharpness of spectral tuning of the
largest spectral peak. The excitatory and inhibitory peak amplitudes
were the maximal and minimal values of the STRF, respectively.
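The parameter extraction described above can be sketched as follows. The exclusion of the DC bin from the power spectral density and the simple thresholding used to measure the half-maximum width are illustrative choices, not details taken from the original analysis.

```python
import numpy as np

def strf_parameters(strf, freqs_hz, dt_s):
    """Extract Tpeak, BMF, CF, W, Q, and the E-I ratio from an STRF.

    strf: (n_freq, n_lags) array of causal lags starting at 0;
    freqs_hz: band center frequencies; dt_s: time bin width in seconds.
    """
    fi, ti = np.unravel_index(np.argmax(strf), strf.shape)
    t_peak = ti * dt_s                            # time to excitatory peak

    # BMF: peak of the power spectral density of the temporal slice
    # through the STRF maximum (Hanning-windowed FFT), skipping DC.
    tslice = strf[fi, :]
    psd = np.abs(np.fft.rfft(tslice * np.hanning(len(tslice)))) ** 2
    fft_freqs = np.fft.rfftfreq(len(tslice), d=dt_s)
    bmf = fft_freqs[np.argmax(psd[1:]) + 1]

    # Spectral slice through the maximum: peak frequency (CF), width at
    # half-maximum (W) by thresholding, and quality factor Q = CF / W.
    fslice = strf[:, ti]
    cf = freqs_hz[fi]
    above = np.where(fslice >= fslice[fi] / 2.0)[0]
    width = freqs_hz[above[-1]] - freqs_hz[above[0]]
    q = cf / width if width > 0 else np.inf

    # Ratio of excitatory to inhibitory peak amplitudes (assumes the STRF
    # contains a negative, inhibitory region).
    e_i_ratio = strf.max() / abs(strf.min())

    return dict(t_peak=t_peak, bmf=bmf, cf=cf, width=width, q=q,
                e_i_ratio=e_i_ratio)
```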
We also used the SVD of the STRF to assess the degree of the
time-frequency separability of the STRF. Similar methods have been used
in the visual system to describe the space-time inseparability of the
spatio-temporal receptive fields of visual neurons (De Valois and Cottaris 1998; Jagadeesh et al. 1997; Kontsevich 1995) and the frequency-time inseparability of auditory neurons (Depireux et al. 2001). By
definition, a separable STRF can be expressed as a single product of a
function of time and a function of frequency. Thus for an ideal
separable STRF, only one of the singular values obtained from the SVD
should be nonzero. An index of separability could therefore be defined
as the magnitude of the leading singular value relative to the
sum of all the singular values. To avoid the effects of the noise tail
in the singular values in assessing the separability of the STRFs, we
defined a separability index SI as follows
$$SI = \frac{\lambda_1}{\sum_{i:\; \lambda_i > \lambda_{\mathrm{noise}}} \lambda_i}$$

where λ1 ≥ λ2 ≥ … are the singular values of the STRF and λ_noise is the maximal singular value obtained from the noise window (see preceding text).
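A short sketch of the SI computation, reusing the noise-window convention from the denoising step above:

```python
import numpy as np

def separability_index(strf_causal, strf_acausal):
    """SI = leading singular value / sum of significant singular values.

    Singular values not exceeding the largest singular value of the
    acausal (noise) window are excluded from the sum. SI = 1 for a
    fully separable STRF.
    """
    s = np.linalg.svd(strf_causal, compute_uv=False)       # sorted, descending
    s_noise = np.linalg.svd(strf_acausal, compute_uv=False)
    significant = s[s > s_noise.max()]
    if significant.size == 0:
        return np.nan                                      # no signal above noise
    return significant[0] / significant.sum()
```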
Prediction of responses
The method for obtaining a prediction of neural responses using
the STRF is described in detail in Theunissen et al.
(2000) and only briefly summarized here. The predicted firing
rate was obtained by convolving the STRF with the stimulus and
rectifying and scaling the result to minimize the squared error between
the predicted rate and the firing rates estimated from the actual data.
To obtain the predicted firing rate for each song, we used the STRF
calculated from all songs in the ensemble except for the song used to
generate the stimulus-response data being tested. We quantified the
quality of the prediction by calculating the cross-correlation
coefficient (CC) between the predicted and estimated firing rates. The
measured firing rate was obtained by smoothing the PSTH (but not the
predicted firing rate) with a Hanning window that gave the maximal CC.
We corrected the CC for bias and obtained the standard error for
the CC using a jackknife resampling method.
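A minimal sketch of this prediction step follows. The leave-one-out procedure, the PSTH smoothing, and the bias correction are omitted, and the least-squares gain after rectification is a simple stand-in for the scaling step described above, not necessarily the original implementation.

```python
import numpy as np

def predict_response(strf, stimulus, psth):
    """Linear prediction of the firing rate from the STRF.

    strf: (n_freq, n_lags) with causal lags starting at 0;
    stimulus: (n_freq, n_t) spectrographic representation;
    psth: (n_t,) measured firing rate.
    Returns the rectified, scaled prediction and its correlation
    coefficient (CC) with the measured rate.
    """
    n_freq, _ = strf.shape
    n_t = stimulus.shape[1]
    pred = np.zeros(n_t)
    for i in range(n_freq):                       # sum of per-band convolutions
        pred += np.convolve(stimulus[i], strf[i], mode='full')[:n_t]
    pred = np.maximum(pred, 0.0)                  # rectify
    gain = (pred @ psth) / (pred @ pred + 1e-12)  # least-squares scaling
    pred *= gain
    cc = np.corrcoef(pred, psth)[0, 1]            # quality of prediction
    return pred, cc
```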
RESULTS
The goal of this study was to investigate the processing of natural sounds in the songbird auditory forebrain. We began by systematically characterizing the stimulus-response function of auditory forebrain neurons in response to natural sounds. We obtained STRFs from the responses of auditory forebrain neurons in adult male zebra finches to a large ensemble of zebra finch songs. These STRFs show the spectral-temporal features of songs to which auditory forebrain neurons respond and describe the optimal linear component of the response to songs. By extracting a variety of parameters from the STRFs, we were able to quantify several aspects of the processing of natural sounds in the auditory forebrain. First, we obtained multiple STRF parameters to describe the spectral and temporal properties of features important to forebrain auditory neurons. Second, we characterized the spectral-temporal separability of the STRFs. Third, we assessed the linearity versus nonlinearity of the neuronal encoding of songs by quantifying the quality of response predictions obtained from the STRF model. Finally, to begin to assess the relationship between functional properties of auditory forebrain neurons and conventional anatomical subdivisions of auditory areas, we examined how the STRF parameters in our data set mapped onto different subregions of the auditory forebrain.
Neural responses
We obtained neural responses from throughout the auditory forebrain including subregions L2a, L2b, L1, and L3 of field L and the overlying region of cHV. Figure 3 illustrates examples of the trial-by-trial and average neural responses, one from each of the five subregions. As can be seen, the sites in all subregions of field L responded strongly to songs. The response in the site from cHV was weaker and more variable in comparison to field L. The average firing rate in the auditory forebrain was 9 ± 1 (SE) spikes/s.
STRF
Songs have a highly complex spectral-temporal structure including
strong time-varying correlations across different frequencies. Consequently, as illustrated in Fig. 4,
A and B, it is difficult to assess to which spectral-temporal features of songs neurons respond simply by comparing the song, in its spectrographic representation (Fig. 4B), with the neural response (Fig. 4A). The STRF
method addresses this difficulty by analyzing the stimulus preceding each neural response, for many stimuli and many spikes, and calculating what weightings of the spectral and temporal components of the stimuli
produce the best linear estimate of the actual neural response [for
the mathematical definition of the STRF used in this paper, see
METHODS and Theunissen et al. (2000); for
discussions on the interpretation of the STRF, see Eggermont et al. (1983c); Klein et al. (2000); Theunissen et al. (2000, 2001)]. The resulting STRF can
be thought of as a filter that characterizes the linear component of
the stimulus-response function of auditory forebrain neurons and that
can reveal the features of song critical to the neuronal response. The
relationship between the STRF and the neural response to a particular
stimulus can be seen by sliding a window (Fig. 4B)
containing the time-reversed STRF (Fig. 4C) over the stimulus and obtaining a moment-to-moment prediction of the response. In each window, the stimulus is weighted by the overlapping part of the
STRF, point-wise at the corresponding time and frequency, and the
results from all points in the window are summed to obtain the
predicted response (Fig. 4D). Mathematically, this amounts to a convolution operation. Intuitively, the time-reversed STRF can therefore be thought of as the most effective stimulus that could drive this neuron, if the neuron were completely linear. In this example
drawn from our data from region L2a, the STRF, which has a relatively
simple structure, provides a good prediction of the neural response to
a very complex auditory stimulus (the goodness of the linear STRF model
is quantified and discussed in Linearity versus
nonlinearity).
Feature analysis of songs by auditory forebrain neurons
Figure 5 shows 15 examples of STRFs obtained from the auditory forebrain (3 from each of the different subregions in field L and cHV), which illustrate the range of STRFs we observed in our data. As can be seen, the STRFs in Fig. 5, A and D (subregions L2a and L2b, respectively), indicate sensitivity to a simple, narrowband component of song. In contrast, much more complex features are observed in some other examples (Fig. 5, F, I, K, L, and O). For instance, the STRF in Fig. 5L (subregion L3) shows an excitatory-inhibitory component that reverses in time, and the STRF in Fig. 5O (subregion cHV) shows a multi-peaked frequency stack. Figure 8D (subregion L3) shows another STRF with a complex feature, a frequency sweep. A further observation that can be made from Fig. 5 is the difference in the time-scales of the STRFs. The STRF in Fig. 5A from L2a has a short delay and width. In contrast, the STRFs in Fig. 5, L and O, from L3 and cHV show longer delays and are extended over much longer durations. Collectively, these STRFs illustrate the variety of ways in which songs are analyzed by neurons in the auditory forebrain and the wide range of time scales associated with this analysis. In the following sections, we quantify some of these qualitative observations by examining a variety of parameters describing different aspects of the STRFs.
STRF parameters
To characterize some of the spectral-temporal properties of the
STRFs and to quantify the differences between the STRFs in different
subregions, we first extracted several simple parameters from each
STRF. Such parameters have previously been used to characterize the
response of auditory neurons to simple sounds such as white noise or
tone pips. However, we extracted these parameters directly from the
STRFs obtained with natural sounds to quantify several aspects of the
processing of these sounds. This was an important step since we had
previously observed that the STRFs obtained from natural sounds could
be dramatically different from the STRFs obtained from simple stimuli
for auditory forebrain neurons (Theunissen et al. 2000).
Figure 6 illustrates the parameters for a particular STRF in subregion L2a. As illustrated in the figure, the parameters are obtained from the spectral and temporal slices of the STRF taken along its maximal value. We obtained the time to peak (Tpeak), which is a measure of delay between the stimulus and response (Fig. 6B); the Q factor, defined as the ratio of the best frequency to the width at half-maximum, which is a measure of the sharpness of spectral tuning of the largest spectral peak (Fig. 6C); the BMF, which is defined as the frequency corresponding to the peak of the power spectral density and is a measure of the frequency of amplitude modulations (AM) in songs that drive neurons best (Fig. 6D); and the ratio of the excitatory and inhibitory peak amplitudes of the STRF (see METHODS for definitions). Figure 5 can be used to illustrate how the values of these parameters correspond to the particular STRF from which they were obtained. For example, Fig. 5A shows an STRF that has a short delay with a Tpeak of 11 ms, whereas the STRF shown in Fig. 5N has a much longer delay with a Tpeak of 55 ms. The STRFs in Fig. 5, A and L, have BMF values of 70 and 10 Hz, indicating preferences for relatively high and low modulation frequencies, respectively. An example of an STRF that has relatively sharp spectral tuning with a Q value of 3.2 is shown in Fig. 5B, whereas Fig. 5C shows an STRF that has a more broadly tuned spectral peak with a Q value of 0.71.
Figure 7 shows the distribution of these parameters for the auditory forebrain and quantifies the diversity of processing of songs in the auditory forebrain, confirming our qualitative observations in Fig. 5. Although Tpeak (Fig. 7A) ranged from 7 to 55 ms, the majority of the sites we examined fell into an intermediate range, consistent with the location of the auditory forebrain between the auditory thalamus and HVc. The distribution of the values for BMF (Fig. 7B) shows that the majority of sites (~90%) in our data set preferred relatively lower frequency AM in songs (<30 Hz). Almost half the sites in our data (~48%) had a Q factor close to 1 (between 0.5 and 1.5), indicating that for many sites the width of the largest spectral peak of the STRF was comparable to the peak frequency (Fig. 7C; also see Fig. 6 and METHODS). The ratio of excitatory and inhibitory peaks of the STRFs (E-I ratio; Fig. 7D) was distributed around a peak value at 1.3, indicating an approximate balance between the relative magnitudes of the excitatory and inhibitory peaks within a range around this value.
Our data consisted of both single units and small clusters of units (see METHODS). In theory, complex STRFs could be created artifactually by the simultaneous recording of single units with markedly different properties; however, we saw no evidence that this was occurring. The range of complexity of STRFs from single units was similar to that seen with the small clusters (examples of STRFs obtained from single units are shown in Figs. 5, A, D, G, K, and L, and 8A). Moreover, we did not observe a significant difference between single units and clusters for any STRF parameter (P = 0.8 for Tpeak; P = 0.9 for BMF and Q; P = 0.5 for E-I ratio; Wilcoxon rank sum test).
Separability versus inseparability
A parameter that describes the complexity of STRFs is the degree of separability in time and frequency. Separable features can be described as a product of a spectral function and a temporal function, whereas inseparable features cannot be described in this simple manner. Using the singular value decomposition technique (SVD; see METHODS and Fig. 2), we analyzed the separability of song-derived STRFs and defined a separability index (SI) ranging from 0 to 1, with 1 indicating a fully separable STRF. We observed both separable and inseparable STRFs in the auditory forebrain. Figure 8A shows an example of an approximately separable STRF from subregion L2b. Figure 8B shows the STRF obtained using only the first component of the SVD of this STRF, and Fig. 8C shows the difference between this first component and the full STRF. As can be seen, the first component accounts for most of the structure of the full STRF, and thus this STRF is separable. This STRF had an SI of 0.91. In contrast, Fig. 8, D-F, shows an example of an inseparable STRF from subregion L3, which contains a frequency sweep. Unlike the separable STRF in Fig. 8, A-C, the difference (Fig. 8F) between the leading component and the full STRF is much larger in this case, and this STRF had an SI of 0.53. Figure 5, A and K, shows additional examples of STRFs with relatively high and low separability indices (SI = 0.82 and 0.52, respectively). Figure 8G shows the broad distribution of SIs obtained from the auditory forebrain for our entire data set. We did not observe a significant difference between the SI distributions of single units and small clusters of units in our data (P = 0.3).
Linearity versus nonlinearity
The STRF is a linear model in that it describes only the
linear component of the neural encoding of the stimulus. Thus one can
use the quality of the predictions of the neuronal responses obtained
from the linear STRF model to assess the linearity or nonlinearity of
the neural encoding of the stimulus. We used the STRFs to obtain
predictions of the neuronal responses to songs (see
METHODS) and quantified the quality of the prediction by the correlation coefficient (CC) between an estimation of the deterministic part of the actual response and the response predicted by
the STRF (see METHODS; see also Theunissen et al. 2000). Figure 9A shows
the estimated response from the actual data (top) and predicted response (bottom) using the STRF shown on the
right of the traces, to a section of the stimulus ensemble for a site in L2a. For this site, a relatively good prediction could be obtained (CC = 0.68), indicating that a substantial component of the
encoding of this site was linear. However, this linear component varied over a wide range for our data set (range of CCs: 0.07-0.72), indicating both relatively linear as well as nonlinear encoding of
songs. Illustrative examples with different values of CC are shown in
Fig. 9, B-E. These examples illustrate the range of
performance of the linear STRF model in being able to predict the
neural response. For example, in Fig. 9, A and B,
the timing and widths as well as the relative
amplitudes of the peaks and troughs in the responses appear to be well
predicted. Figure 9C shows an example where the timing and
width of the responses are still relatively well predicted but the STRF
fails to capture the relative amplitudes of the peaks and troughs in
the response. Figure 9, D and E, illustrates examples where the STRF makes errors in predicting the timing and width
of the response peaks and troughs as well. Figure 9F shows
the distribution of CCs obtained for our entire data set. A comparison
of the distribution of CCs for single units and small clusters of units
did not indicate a significant difference (P = 0.7).
Table 1 summarizes the values of all the
parameters we obtained from the STRFs.
We also investigated whether the parameters obtained from the STRF were correlated with each other. We examined pair-wise scatter plots for all pairs of parameters (data not shown) and calculated the correlation coefficient between parameters and its significance (Fisher's r-to-z test). We found that many of the parameters were significantly correlated with each other. These values are summarized in Table 2. In particular, Tpeak and BMF, Tpeak and CC, BMF and CC, and E-I ratio and CC were strongly correlated. This suggests that short-latency responses, short integration times, and a preponderance of excitation tended to co-occur with increased linearity.
Mapping STRF parameters onto anatomical subregions
To begin to investigate the relationship between the functional properties of auditory forebrain neurons, as indicated by the STRF parameters, and conventional anatomical subdivisions of the auditory forebrain, we compared the STRF parameters in our data across the different subregions of the auditory forebrain: L2a, L2b, L1, L3, and cHV. Figure 10A shows the mean and the inter-quartile range of values for the parameter Tpeak across the different subregions. We observed a significant difference in the mean Tpeak across the subregions (P = 0.024, F = 3.0, ANOVA; see figure legend for further statistics). Tpeak was shortest in L2a (mean Tpeak = 14 ms) and longest in cHV (31 ms), with subregions L2b (20 ms), L1 (21 ms), and L3 (22 ms) showing intermediate values. This pattern reflects the timing of song processing in the different subregions: on average, neurons in the thalamo-recipient area L2a responded fastest, followed by the subsequent areas. The range of Tpeak in regions L1, L3, and cHV was larger compared with the range in L2a, indicating a more heterogeneous distribution in these areas (P = 0.002, 0.001, and 0.0006 for L2a vs. the other areas, respectively, F test with Bonferroni correction; see Table 1 for SEs and ranges of values). We did not observe significant differences in heterogeneity between the remaining areas.
The average values of BMF (Fig. 10B) showed a significant difference across subregions (P = 0.006, F = 4.1). We observed a preference for high modulation frequencies in L2a (mean BMF = 38 Hz) compared with lower modulation frequencies in L2b (21 Hz), L1 (22 Hz), L3 (15 Hz), and cHV (17 Hz). The inverse of the BMF parameter can be thought of as a characteristic time scale of integration of songs. The average values of this parameter indicated a short time scale of integration for sites in L2a (26 ms), followed by L1 (46 ms) and L2b (48 ms), cHV (59 ms), and L3 (67 ms).
A comparison of the Q factor (see METHODS and Fig. 10C) did not show a significant difference (P = 0.17) across subregions. Thus on average the features obtained from the different subregions were comparable in the sharpness of spectral tuning of the largest spectral peak (see Table 1).
We compared the magnitudes of peak excitatory and inhibitory STRF amplitudes in each of the auditory areas. As can be seen by comparing Fig. 10, E and F, the ratios of the excitatory to inhibitory peak in the different subregions were approximately equal (P = 0.7; see Table 1) even though both the excitatory and inhibitory amplitudes varied significantly across the subregions (P = 0.02).
When we examined the SI for different subregions (Fig. 10D), we found that although the subregions L2b, L1, L3, and cHV contained the sites that were the most spectral-temporally inseparable, there was no statistically significant difference in the mean SI across the different subregions (P = 0.8; see Table 1 for values).
Figure 10G shows the mean values for the CC across the different subregions. These values indicate a significant difference across subregions (P = 0.023, F = 3.1) with the CCs being highest in L2a (mean CC = 0.63) and significantly different from the CCs in all the other regions, followed by L1 (0.48), L2b (0.44), L3 (0.37), and cHV (0.37). Although the sample size is small, the range of CCs in L2a was also significantly smaller compared with L2b, L1, and L3, indicating a more heterogeneous distribution in these regions (P = 0.002, 0.004, 0.002, respectively; see Table 1 for ranges). There were no significant differences in heterogeneity between the regions L2b, L1, L3, and cHV. These results suggest a difference in the nonlinear component of the neural encoding of songs in different regions, with region L2a showing relatively linear encoding of songs and subsequent areas showing linear as well as nonlinear encoding of songs.
DISCUSSION
An important goal of auditory neuroscience is to understand the
processing of natural sounds by auditory neurons, which may have
evolved to efficiently encode these sounds (Attias and Schreiner 1998; Rieke et al. 1995) and which respond much more strongly to such sounds in higher-level auditory areas (Margoliash 1983; Rauschecker et al. 1995; Theunissen et al. 2000; Wang et al. 1995). However, due to the complexity of natural sounds such as
human speech and birdsong, it has been difficult to obtain the
stimulus-response properties of auditory neurons with such sounds using
conventional methods. Previously, the STRF approach has been
successfully employed to characterize the responses of auditory neurons
to synthetic sounds (deCharms et al. 1998; Depireux et al. 2001; Eggermont et al. 1983a,c; Escabi et al. 1998; Keller and Takahashi 2000; Klein et al. 2000; Kowalski et al. 1996a,b). In this study, we used our recent extension of the STRF approach (Theunissen and Doupe 1998; Theunissen et al. 2000, 2001) to analyze
the processing of natural sounds in the songbird auditory forebrain.
In the few physiological studies that have been done to date with small
sets of natural vocalizations and complex synthetic sounds, auditory
neurons in field L were found to be quite diverse, ranging from broadly
responsive to selective (Langner et al. 1981; Muller and Leppelsack 1985; Scheich et al. 1979; Uno et al. 1991). However, these studies
could not identify the components of the stimuli responsible for the
neuronal response. Our approach here was to use the extended STRF
method to investigate directly the features of songs to which auditory
forebrain neurons responded.
Our results revealed a diverse range of processing of songs in the auditory forebrain with some neurons responding to simple tonal components of songs and others responding to more complex spectral-temporal structures such as frequency sweeps and multi-peaked frequency stacks. We quantified multiple aspects of the processing of songs in the auditory forebrain by extracting several parameters from the STRFs. Using the parameter Tpeak, we characterized the timing of responses in the auditory forebrain. The range of values indicated both fast and relatively slower processing of song features.
Another important temporal parameter of complex sounds, such as speech
and birdsong, is the modulation in the amplitude envelope of sounds.
Complex sounds typically contain a broad range of modulation frequencies. The BMF parameter, extracted from the STRF, allowed us to
characterize the preferred modulation frequency for auditory forebrain
neurons and showed that, as a group, auditory forebrain neurons could
encode a broad range of AM frequencies. The majority of the neurons,
however, preferred lower modulation frequencies, approximately matching
the dominant range of modulation frequencies found in songs
(Theunissen et al. 2000). By itself, the value of the
BMF does not necessarily imply sharp band-pass tuning to an AM
frequency corresponding to the BMF. It is, nevertheless, a useful
indicator of the AM frequency in songs that is most effective in
driving neurons. In addition, the value of the BMF parameter obtained
here could be different from the conventional BMF parameter obtained
with simple amplitude modulated tone bursts because, as we found in our
previous study, many auditory forebrain neurons show different
stimulus-response properties when probed with natural versus synthetic
sounds (Theunissen et al. 2000). The inverse of the BMF
parameter also gave us an indication of the time scale of integration
for auditory forebrain neurons. High-level auditory neurons displaying
context-dependent phenomena such as combination sensitivity have often
been found to integrate their inputs over a relatively long duration
(Lewicki and Arthur 1996; Margoliash 1983; Margoliash and Fortune 1992; Ohlemiller et al. 1996). In our data, the time scales of
the features to which neurons responded in some of the auditory
forebrain regions were surprisingly long, in some cases showing
integration times on the order of 100 ms. Integration of input over
such a long duration could contribute to the known sensitivity of some
field L neurons to combinations of song syllables as well as to the
selectivity for BOS seen in high-level auditory areas (Lewicki and Arthur 1996).
The quality of the predictions of neural responses obtained from the
STRF model, as assessed by the CC, indicated the presence of both
relatively linear as well as more nonlinear encoding of songs in the
auditory forebrain. Here, it is important to point out that, although
we were able to estimate the magnitude of the nonlinear component of
the stimulus-response function by assessing the quality of predictions
obtained from the linear STRF model, this model could not provide any
information about the exact nature of the nonlinearity. We have
previously shown that part of the nonlinearity across different
stimulus ensembles can be described by constructing separate STRFs for
each stimulus ensemble (Theunissen et al. 2000). This is
analogous to constructing a piece-wise linear approximation of a
nonlinear function. However, describing the residual nonlinearities
within a particular stimulus ensemble remains an important challenge
for current methods in auditory neuroscience. In principle, one
could include higher-order terms in the Volterra expansion
describing the stimulus-response relationship. However, estimating
these terms and interpreting their biophysical significance is quite
difficult. Examination of the linear prediction showed several types of
errors. In some cases, the timing and width of responses were well
predicted but the amplitude was not. In such cases, it may be possible
to improve the prediction by incorporating a static nonlinearity in the
model for predicting responses. In other cases, errors occurred in
predicting the timing and width of responses as well, suggesting
dynamic nonlinearities. Such nonlinearities could arise from underlying
nonlinear cellular and synaptic processes such as adaptation,
facilitation, and depression. Further elucidation of the nonlinearities
may require modeling them based on a detailed description of such
underlying biophysical mechanisms or developing new methods that
describe such nonlinearities.
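As an illustration of the static-nonlinearity idea mentioned above, one generic approach (a sketch of the idea discussed in this paragraph, not the authors' procedure) is to estimate a pointwise function mapping the linear prediction onto the measured rate:

```python
import numpy as np

def fit_static_nonlinearity(pred, measured, n_bins=25):
    """Estimate a static output nonlinearity g such that measured ~ g(pred).

    Bins the linear prediction and takes the mean measured rate within
    each bin; interpolating through these points gives a memoryless
    correction to the linear model.
    """
    edges = np.linspace(pred.min(), pred.max(), n_bins + 1)
    idx = np.clip(np.digitize(pred, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    g = np.array([measured[idx == b].mean() if np.any(idx == b) else np.nan
                  for b in range(n_bins)])
    valid = ~np.isnan(g)
    # Return a callable that applies the estimated nonlinearity.
    return lambda x: np.interp(x, centers[valid], g[valid])
```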
The auditory forebrain showed narrowly as well as broadly tuned STRFs,
suggesting that neurons in this region analyze songs at a variety of
spectral resolutions. Analysis over a range of spectral resolutions is
thought to be a prominent principle of the organization of mammalian
auditory cortex as well (Schreiner et al. 2000).
We found that the ratio of the excitatory and inhibitory peaks of the
STRFs was approximately balanced in the auditory forebrain, which may
reflect properties of the local circuitry in the auditory forebrain.
Models of auditory neurons have suggested how neural responses can be
shaped by the local excitatory and inhibitory circuitry (Nelken and Young 1997; Shamma 1989). STRFs with
excitatory and inhibitory regions could be the result of such
excitatory and inhibitory interactions. Such a balance of excitatory
and inhibitory regions, organized in an appropriate way in the
time-frequency domain, could result in more temporally phasic and/or
more spectrally selective responses. For example, in cases in which the
excitatory region precedes the inhibitory region, the response would be
initiated by the activation of the excitatory region but subsequently
terminated or attenuated by the activation of the inhibitory region,
thus producing a more temporally phasic response. One possible way to
directly investigate the relation between the STRF and the local
excitatory and inhibitory circuitry in the auditory forebrain would be
to manipulate the amounts of inhibition or excitation in these areas
and examine the resultant changes in the STRFs.
We observed both separable and inseparable STRFs in the auditory
forebrain. Neurons with inseparable STRFs could be used to detect
spectral temporal structures of sound that change with time, such as
frequency sweeps, analogous to direction selective neurons found in the
visual system. Such STRFs might be important in the analysis of songs,
since frequency sweeps are prominent in many zebra finch songs. In the
visual system, a simple model for motion-sensitive neurons was
proposed, in which two spatio-temporally separable receptive fields
combine in quadrature to produce a spatio-temporally inseparable
receptive field (Adelson and Bergen 1985; Watson
and Ahumada 1985
). In the auditory system, a similar principle
could apply in the spectral-temporal domain. Thus the inseparable STRFs
found in the auditory forebrain could be generated by combining inputs
from the separable STRFs in the same or previous regions.
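To make the quadrature idea concrete, the sketch below builds a spectral-temporally inseparable STRF (a sweep detector) as the sum of two separable Gabor components; all parameter values are arbitrary choices for illustration, not measurements from the data.

```python
import numpy as np

def quadrature_strf(freqs_hz, lags_s, f0=3000.0, sf=0.002, tf=20.0,
                    f_sd=1000.0, t_sd=0.02, t0=0.03):
    """Spectral-temporal analogue of the Adelson-Bergen construction.

    sf: spectral modulation (cycles/Hz); tf: temporal modulation (Hz).
    Each component is separable (an outer product of 1-D functions); the
    quadrature sum is oriented in the time-frequency plane.
    """
    f = freqs_hz - f0
    t = lags_s - t0
    env_f = np.exp(-f**2 / (2 * f_sd**2))   # spectral envelope
    env_t = np.exp(-t**2 / (2 * t_sd**2))   # temporal envelope
    strf_a = np.outer(env_f * np.cos(2*np.pi*sf*f), env_t * np.cos(2*np.pi*tf*t))
    strf_b = np.outer(env_f * np.sin(2*np.pi*sf*f), env_t * np.sin(2*np.pi*tf*t))
    # cos(a)cos(b) + sin(a)sin(b) = cos(a - b): iso-amplitude contours run
    # along sf*f - tf*t = const, i.e., an inseparable frequency sweep.
    return strf_a + strf_b
```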
The preceding discussion highlights the diversity of the auditory forebrain in the distribution of STRF parameters, reflecting the range of complexity we observed in the STRFs. In our data, we also observed that some parameters indicative of more complex processing tended to co-occur. For example, neurons with long time scales of integration also tended to have more nonlinear encoding properties, indicating that some neurons found in the auditory forebrain could be jointly complex in multiple attributes. Thus several functional stages of song processing, ranging from simple to quite complex, appear to occur within the auditory forebrain and suggest that the auditory forebrain may be involved in the analysis of many different aspects of song structure. The resultant multiple representations of songs, of varying complexity and time scales, could together provide useful information to higher level auditory areas that are likely to be involved in the perception of highly complex, behaviorally relevant stimuli.
Mapping STRF parameters
A problem of great interest in the study of auditory systems has been to understand the organization of auditory maps of different parameters of sounds. To begin to look for patterns in the mapping of functional properties of auditory forebrain neurons onto conventional anatomical subregions of the auditory forebrain, we compared the STRF parameters across the different subregions. Clearly, more data will be required for a complete analysis of the different subregions, especially subregions such as L2a and cHV, where we had a relatively small number of neurons. This is even more important for subregion cHV, which was quite heterogeneous in the distribution of STRF parameters, unlike L2a. Nevertheless we observed several significant and suggestive trends in our data.
A comparison of the parameter Tpeak
across the different regions revealed a significant difference in the
timing for the processing of songs in the auditory forebrain, with L2a
responding fastest, followed by L2b, L1, and L3, and then cHV, which
had the slowest responses of all the areas studied here. This pattern
is consistent with the known anatomical connectivity in the auditory
forebrain (see Fig. 1; see also Vates et al. 1996).
When we compared the time scales of integration in different regions of
the auditory forebrain, we found that L2a showed relatively short
integration time scales compared with regions L2b, L1, L3, and cHV. A
similar increase in the time scale of integration, as indicated by the
best modulation frequency, has also been observed in successive areas
of the auditory cortex of cats (Schreiner and Urbas 1988).
The quality of the predictions of neural responses obtained from the
STRF model, as assessed by the CC between the estimated and predicted
response, also varied significantly across the auditory forebrain
regions. CCs were highest in area L2a, followed by L1, L2b, L3, and
cHV. This difference is suggestive of an increase in the nonlinear
component of the neural encoding of songs from L2a to L1, L2b, L3, and
cHV, respectively. Such an increase in nonlinearity could be reflective
of preparatory stages of processing for the generation of highly
nonlinear properties such as BOS selectivity, seen later in the
auditory pathway (Janata and Margoliash 1999; Margoliash 1983; Margoliash and Fortune 1992; Theunissen and Doupe 1998). An increase in
the nonlinear component of the processing of sensory stimuli in
successive stages of a sensory system has also been reported in the
electric fish system (Gabbiani et al. 1996).
The rough mapping of the STRF parameters discussed in the preceding
text onto conventional anatomical subdivisions of the auditory
forebrain is suggestive of hierarchical processing, with the
thalamo-recipient area L2a showing parameters characteristic of simpler
processing and subsequent areas revealing the gradual emergence of more
complex processing properties. However, other observations indicate
that the auditory forebrain may not be organized in a strictly serial
hierarchy. Not all STRF parameters varied significantly across the
different subregions. For instance, the separability index did not
reveal a significant difference among the subregions. It remains
possible, however, that qualitatively different types of inseparability
occur in the different subregions. The subregions also shared
properties such as the sharpness of spectral tuning and the ratio of
excitatory to inhibitory peaks. Thus instead of being organized in a
strictly serial hierarchy, the auditory forebrain may be organized in a
more elaborate way, performing both serial and parallel processing of
auditory information. The known, extensive interconnectivity between
the anatomical subregions of the auditory forebrain also supports this
idea (Vates et al. 1996). Thus the complex processing
properties we observed could arise via a combination of hierarchical
and parallel processing in the network of auditory forebrain
subregions. The intrinsic circuitry within each of the subregions may
also play a role in the emergence of this complexity.
Overall our data are consistent with L2a being the major input region
of the auditory forebrain, responding to relatively simple features of
complex sounds with short delays, short integration times and more
linear processing. Surprisingly, area L2b often showed complex STRFs,
even though it is anatomically described as an early auditory area
similar to L2a. There are several possible explanations for this
finding. First, although L2b receives direct thalamic input, the parts of Ov
that project to L2b and L2a are distinct, thus potentially contributing
to the differences in the response properties of these two areas
(Vates et al. 1996). Second, in this study, area L2b was
defined to include area L, thus making it a much larger composite
region. Since the inputs to area L have not been described in detail so
far, it remains possible that the strongest sources of inputs to parts
of this composite region are from other auditory forebrain regions and
not directly from the thalamus, which could lead to more complex
response properties. Our results suggest a gradual emergence of more
complex features, longer delays and integration times, and nonlinear
processing properties in the auditory forebrain subsequent to area L2a.
As auditory forebrain areas begin to be probed in much more detail, it
is likely that additional differences between the subregions of the
auditory forebrain will be identified. The stages of processing in
these areas are likely to contribute both to the generation of song
selective neurons found in higher-level areas in the songbird brain as
well as to the detection and discrimination of a wide variety of
natural sounds behaviorally relevant to songbirds.
ACKNOWLEDGMENTS
We thank M. Brainard, M. Escabi, and C. Schreiner for comments and discussion on an earlier version of the manuscript; R. Kimpo, C. Roddey, G. Carrillo, and A. Arteseros for technical assistance; and two anonymous referees for critical comments on the manuscript.
This work was supported by research grants from the Alfred P. Sloan Foundation (to K. Sen, F. E. Theunissen, and A. J. Doupe) and the National Institute of Neurological Disorders and Stroke (NS-34835 to A. J. Doupe).
FOOTNOTES
Address for reprint requests: K. Sen, Dept. of Physiology, Box 0444, University of California, 513 Parnassus Ave., San Francisco, CA 94143-0444 (E-mail: kamal@phy.ucsf.edu).
Received 19 January 2001; accepted in final form 7 May 2001.
REFERENCES