ABSTRACT
Objective:
Speech perception relies on precise spectral and temporal cues. However, cochlear implant (CI) processing is confined to a limited number of spectral channels and a restricted frequency range, constraining the information transmitted to the auditory system. This study analyzes the influence of channel interaction and the number of channels on word recognition scores (WRS) within a CI simulation framework.
Methods:
Two distinct experiments were conducted. The first experiment (n=29, mean age 23 years, 14 females) evaluated the effect of the number of channels on WRS using 8-, 12-, 16-, and 22-channel vocoded word lists and a non-vocoded list. The second experiment (n=29, mean age 25 years, 16 females) examined channel interaction under low-, middle-, and high-interaction conditions.
Results:
In the first experiment, participants scored 57.93%, 80.97%, 83.59%, 91.03%, and 95.45% under the 8-, 12-, 16-, and 22-channel vocoder and non-vocoded conditions, respectively. The number of vocoder channels significantly affected WRS; pairwise differences were significant (p<0.01) for all comparisons except between the 12- and 16-channel conditions. In the second experiment, participants scored 2.2%, 20.6%, and 50.6% under the high-, mid-, and low-interaction conditions, respectively. Differences were statistically significant across all channel interaction conditions (p<0.01).
Conclusions:
While the number of channels had a notable impact on WRS, one comparison (12 vs. 16 channels) did not yield a statistically significant difference. The differences in WRS attributable to channel number were eclipsed by the pronounced effects of channel interaction: all conditions in the channel interaction experiment differed significantly from one another. These findings underscore the importance of prioritizing channel interaction in signal processing and CI fitting.
INTRODUCTION
Cochlear implant (CI) technology has improved dramatically since its invention and has become a widely accepted and successful intervention for individuals with severe or profound sensorineural hearing loss, enabling successful auditory intervention and rehabilitation1. However, although CI recipients perform remarkably well in various auditory tasks compared with their preimplantation performance, they still fall short of normal hearing (NH) listeners. Patient-related factors and technological constraints are the two critical aspects underlying this gap. Various studies have aimed to understand patient-related predictors in this regard, yet technical and fundamental limitations in CI sound processor technology and signal processing still require detailed investigation.
The CI processor is a multi-band filter bank with a fixed number of channels; it extracts the temporal envelope of each band and converts that information into pulses delivered at a limited rate to the electrode array inside the cochlea2. During this process, rich, finely structured acoustical events in the real world are (a) filtered through a limited number of channels, around 12-22 depending on the manufacturer, (b) restricted to a reduced frequency range of roughly 50-8000 Hz, and (c) compressed within a narrow amplitude envelope, resulting in a reduced dynamic range of around 20-80 dB3. Consequently, some critical acoustical information is lost during CI processing, limiting the auditory performance of CI recipients. Considering that a healthy cochlea contains approximately 1,000 inner hair cells transmitting information through approximately 30,000 auditory neurons, the limited number of spectral channels with poor spatial selectivity imposes a massive constraint on signal encoding. Therefore, the number of channels required for acceptable speech perception is an intriguing topic for researchers.
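For illustration, the amplitude compression stage can be sketched as a simple mapping from an acoustic input range onto a normalized electric range. The following minimal Python sketch uses illustrative parameter values only; actual CI strategies apply manufacturer-specific loudness growth functions between each electrode's threshold and comfort levels.

```python
import numpy as np

def compress_envelope(level_db, in_range=(20.0, 80.0), out_range=(0.0, 1.0)):
    """Map an acoustic envelope level (dB SPL) onto a normalized electric range.

    Illustrative linear-in-dB mapping only; real CI strategies use
    manufacturer-specific loudness growth functions.
    """
    (in_lo, in_hi), (out_lo, out_hi) = in_range, out_range
    level_db = np.clip(level_db, in_lo, in_hi)  # saturate outside the input window
    return out_lo + (level_db - in_lo) * (out_hi - out_lo) / (in_hi - in_lo)

# A ~60 dB acoustic range collapses onto the narrow electric scale:
print(compress_envelope(np.array([10.0, 50.0, 90.0])))  # -> [0.  0.5 1. ]
```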
Speech perception as a function of the number of spectral channels has naturally attracted considerable interest, because speech is the fundamental aspect of hearing once audibility is ensured. Dorman and Loizou4 assessed vowel perception as a function of the number of channels, varied between 1 and 9, and found that in the most and least difficult test conditions, the performance increase was no longer statistically significant beyond 8 and 5 channels, respectively. The same research group used sentences in noise and varied the number of channels between 6 and 20; performance peaked at 12 channels at a +2 dB signal-to-noise ratio (SNR) and at 20 channels at -2 dB SNR5. Similarly, Başkent6 showed that in quiet, vowel and consonant perception peaks at around 8 channels; when background noise was introduced at 0 and -5 dB SNR, peak performance required 12 and 16 channels, respectively. Friesen et al.7 compared the vowel, consonant, consonant-vowel-consonant (CVC) word, and sentence perception of NH listeners presented with vocoded stimuli against that of CI recipients. Although the performance of NH listeners improved up to 20 channels, the CVC word recognition score (WRS) exceeded 90% only at around 12 channels, and CI recipients reached peak performance on all tests at around 8-10 electrodes. It has therefore been suggested that the overlap in neural excitation between neighboring electrode channels in CI processing may degrade spectrotemporal resolution and reduce the number of distinct information channels available7,8. Nonetheless, recent research has yielded conflicting results: it has been demonstrated that adults9 and children10 with CIs can achieve performance gains with as many as 22 channels, surpassing the previously suggested saturation point of 7-16 channels. While the ideal channel count for CI recipients remains a topic of discussion, channel interaction emerges as a crucial factor warranting further investigation in shaping auditory outcomes.
Numerous studies have examined the impact of excitation spread by simulating varying degrees of channel interaction, employing techniques such as shallow filter slopes for high interaction or steep filter slopes for low interaction. The consistent finding across these studies is that heightened channel interaction degrades auditory performance8,11,12, particularly when spectral resolution is already compromised, as with a limited number of channels13. Recently, Goehring et al.14 deliberately induced spectral blurring to increase channel interaction in CI recipients, and the outcomes confirmed the detrimental consequences of increased excitation spread.
As stated previously, patient-related factors that impact CI performance are hard-to-control variables in CI research. Although the number of channels and the frequency response of each channel can be manipulated with CI fitting software, every CI recipient has a different etiology, auditory exposure, residual hearing, and so on. Hence, researchers have proposed a “vocoder” approach, sometimes called “CI simulation (CIS),” for assessing the effects of channel number and interaction on auditory perception4. A realistic and precise CIS allows researchers and engineers to assess and pre-test signal processing strategies in homogeneous NH listeners before moving to a heterogeneous and limited pool of CI listeners. It also helps to better understand the signal degradation that occurs during CI processing15. Currently and widely used CIS models4,5,15,16,17 include a bank of bandpass filters with various cut-off points and bandwidths, a temporal envelope generator for each band, a low-pass filter to remove fine temporal cues from the envelopes, and a carrier (noise, sinusoid, or pulse-spreading harmonic complex) modulated with the corresponding envelope to resynthesize the signal.
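To make this pipeline concrete, the following minimal Python sketch outlines the four stages of such a CIS model (the filter orders, envelope cutoff, and noise carrier here are illustrative assumptions; the exact settings used in this study are described in the Methods):

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def cis_vocoder(signal, fs, band_edges, env_cutoff_hz=160.0):
    """Schematic channel vocoder: (1) bandpass analysis, (2) envelope
    extraction, (3) low-pass smoothing, (4) noise-carrier resynthesis."""
    lp = butter(2, env_cutoff_hz, btype="low", fs=fs, output="sos")
    carrier = np.random.default_rng(0).standard_normal(len(signal))
    out = np.zeros(len(signal))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        bp = butter(3, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(bp, signal)              # (1) analysis band
        env = np.abs(hilbert(band))                 # (2) temporal envelope
        env = sosfiltfilt(lp, env).clip(min=0.0)    # (3) smoothed envelope
        out += sosfiltfilt(bp, carrier) * env       # (4) modulated carrier band
    return out
```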
It is worth noting that most of the studies referenced were conducted with either English or Dutch speakers; consequently, language-specific effects should be considered. It is established that the suprasegmental properties of individual languages can influence spectral characteristics to varying degrees18,19,20,21,22. Therefore, Turkish-specific data would offer a more precise understanding of vocoder settings. The choice of speech materials in previous research is also worth highlighting: while phoneme recognition yields valuable insights into speech comprehension, the monosyllabic WRS is perhaps the most widely used measure of speech perception, integral to both routine audiological evaluations and CI hearing assessments. Accordingly, this study assesses the impact of varying numbers of spectral channels on Turkish monosyllabic word recognition, aiming to provide a language-specific and clinically oriented measurement.
MATERIALS and METHODS
Participants
Two distinct experiments were conducted in this study, each involving different sets of participants. In the first experiment, which focused on the number of channels, 29 adult participants with NH were enrolled (15 males and 14 females; mean age 23 years, range 18-38 years). In the second experiment, which focused on channel interaction, a separate group of 29 NH adults was recruited (13 males and 16 females; mean age 25 years, range 18-41 years). To mitigate potential learning effects and accommodate the extended duration of the tests, we divided the study into two distinct experiments with separate participant groups. The second group was matched to the first in age and gender to ensure consistency and rigor across both sets of participants.
NH status was confirmed through the presence of a type A tympanogram, detection of acoustic reflexes, distortion product otoacoustic emissions (noise-to-harmonics ratio ≥6 dB) in at least three tested frequency bands, and a pure-tone average of 20 dB HL or better (250-8000 Hz) on behavioral audiometry. In addition to NH, the inclusion criteria were being at least 18 years of age and having no cognitive or neurodevelopmental disorder. Ethical approval for all protocols was obtained from the Ankara Medipol University Non-invasive Clinical Studies Ethics Committee (decision no: 147, date: 01.08.2022). Furthermore, all participants received detailed information about the study and provided informed consent. The research was conducted in accordance with the principles outlined in the Declaration of Helsinki.
Speech Test Material and Procedure
Although various monosyllabic word lists have been developed for audiological assessments in Turkey, the Durankaya et al.23 list is the only one that has been validated and found to be phonemically balanced, homogeneous, and familiar; hence, we used this word list for speech perception testing. All words in the Durankaya et al.23 list were recorded in an audio recording studio by one male and one female professional voice user, using an RME Babyface Pro FS (RME, Germany) audio interface and a Rode NT2-A (The Freedman Group, Australia) condenser microphone. The same recordings were used in another ongoing study, and preliminary analysis showed that word recognition performance was better with the female voice than with the male voice. Therefore, the word lists recorded by the female speaker were used in this study.
The test interface was developed with the JavaScript-based jsPsych24 framework, which provides a library of tools for building behavioral experiments. The program automatically presented each audio file with the prompt “The word you will now hear is:” and then asked the listener to type the word they heard into an input box. After each 25-word list was presented under a given condition, the program automatically calculated the percentage of correctly typed words (WRS). The number of channels and the channel interaction conditions for the vocoder settings were selected based on the literature, which often suggests that good speech perception requires 8 channels with middle-to-low channel interaction (filter slopes of at least 60 dB/octave)6,7,25, and to mimic commercially used CI processors, at least to a degree. Although there are considerable technical and signal processing differences between CI devices, and assessing manufacturer-specific processor features is beyond the scope of this manuscript, we believe this approach provides valuable information for various situations.
Each participant in experiment 1 (number of channels) completed a 25-word list under five test conditions: 8-, 12-, 16-, and 22-channel vocoded word lists and one unprocessed (non-vocoded) word list. The order of the word lists was randomized for each participant to prevent acclimatization or listening-fatigue effects. In experiment 2 (channel interaction), participants completed the test under three channel interaction conditions. The WRS for each condition was calculated as the percentage of correctly typed words and used in the subsequent statistical analysis.
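The scoring step itself is straightforward; a minimal sketch is shown below (in Python rather than the JavaScript of the actual interface, and with a whitespace/case normalization rule that is our assumption rather than the study's documented matching rule):

```python
def score_wrs(presented, typed):
    """Percentage of correctly typed words for one 25-word list.

    Responses are compared case-insensitively after trimming whitespace;
    the exact matching rule of the study's interface may differ.
    """
    normalize = lambda word: word.strip().casefold()
    correct = sum(normalize(p) == normalize(t) for p, t in zip(presented, typed))
    return 100.0 * correct / len(presented)

# Example: 23 of 25 words correct -> 92.0% WRS
```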
Testing was conducted in the university audiology laboratory with Telephonics TDH-39P (Telephonics Corporation, USA) headphones and an Inventis Harp (Inventis, Padova, Italy) audiometer connected to the testing computer. Stimuli were presented binaurally at a level of 60 dB SPL. Before testing, each participant attended a short practice session and then proceeded to the actual test.
Vocoder Settings
A custom MATLAB (MathWorks, USA) program developed by Gaudrain26 for CIS research was used to vocode the stimuli in the WRS test. The speech spectrum between 80 and 8000 Hz was analyzed using discrete zero-phase Butterworth bandpass filters. In experiment 1 (number of channels), the filters were designed for the respective number of channels (8, 12, 16, 22) using the Greenwood function27, which reflects the nonlinear frequency mapping within the cochlea. The filter slope was set at 72 dB/octave for every condition in experiment 1, chosen to provide a realistic listening condition for testing the effects of channel number. In experiment 2 (channel interaction), an 8-channel vocoder was employed to emulate a challenging listening condition similar to that experienced by CI recipients. The filter slopes were varied to create the interaction conditions: 4th order (24 dB/octave; high interaction), 8th order (48 dB/octave; middle), and 10th order (60 dB/octave; low). Apart from the filter slope, experiment 2 applied the same vocoder settings as experiment 1 to all test stimuli.
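For reference, the Greenwood mapping used to place the channel cut-offs can be sketched as follows, with the standard human parameters from Greenwood27 (the exact implementation in Gaudrain's toolbox26 may differ in detail):

```python
import numpy as np

A, a, k = 165.4, 2.1, 0.88  # human cochlea constants (Greenwood, 1990)

def greenwood_f(x):
    """Characteristic frequency (Hz) at relative cochlear position x (0=apex, 1=base)."""
    return A * (10.0 ** (a * x) - k)

def greenwood_x(f):
    """Inverse mapping: relative cochlear position for frequency f (Hz)."""
    return np.log10(f / A + k) / a

def channel_edges(n_channels, f_lo=80.0, f_hi=8000.0):
    """n_channels+1 band cut-offs equally spaced along the simulated cochlea."""
    positions = np.linspace(greenwood_x(f_lo), greenwood_x(f_hi), n_channels + 1)
    return greenwood_f(positions)

print(np.round(channel_edges(8)))  # nine cut-offs spanning 80-8000 Hz
```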
After discrete filtering, a Hilbert transform was applied to the output of each channel, and the amplitude envelope was obtained using half-wave rectification and second-order Butterworth low-pass filtering at 160 Hz to remove periodicity cues. We employed Gaussian broadband noise as the carrier, which was modulated for each channel with its respective temporal envelope. These modulated noise bands were then summed to create the final set of test materials. The same settings were consistently applied to both the analysis and synthesis filters.
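Because the description above combines a Hilbert transform with half-wave rectification, both common readings of the envelope stage are sketched here, together with the resynthesis step (a minimal sketch; Gaudrain's MATLAB implementation26 may differ in detail):

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def envelope_hilbert(band, fs, cutoff_hz=160.0):
    """Hilbert-magnitude envelope smoothed by a 2nd-order zero-phase
    Butterworth low-pass at 160 Hz."""
    lp = butter(2, cutoff_hz, btype="low", fs=fs, output="sos")
    return np.maximum(sosfiltfilt(lp, np.abs(hilbert(band))), 0.0)

def envelope_rectified(band, fs, cutoff_hz=160.0):
    """Alternative reading: half-wave rectification followed by the same low-pass."""
    lp = butter(2, cutoff_hz, btype="low", fs=fs, output="sos")
    return np.maximum(sosfiltfilt(lp, np.maximum(band, 0.0)), 0.0)

def resynthesize(envelopes, band_sos, carrier):
    """Modulate a shared Gaussian-noise carrier with each band's envelope,
    refilter with the same bandpass (synthesis = analysis), and sum."""
    out = np.zeros_like(carrier)
    for env, sos in zip(envelopes, band_sos):
        out += sosfiltfilt(sos, carrier) * env
    return out
```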
Statistical Analysis
Statistical analysis was performed using SPSS 24.0 (SPSS Inc., Chicago, IL, USA). Descriptive statistics (mean, standard deviation, range, and interquartile range) were used to describe the study variables. A repeated measures ANOVA was run for each experiment: across the five test conditions of experiment 1 (non-vocoded and 8-, 12-, 16-, and 22-channel vocoder settings) and the three conditions of experiment 2 (high, mid, and low channel interaction). The significance level was set at 0.05.
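For readers without SPSS, the same analysis can be sketched in Python with the pingouin package (v0.5+). The data frame below is filled with simulated scores around the group means reported in the Results, purely to illustrate the layout; `correction=True` requests the Greenhouse-Geisser adjustment used below:

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)
conditions = ["8ch", "12ch", "16ch", "22ch", "clear"]
means = [58, 81, 84, 91, 95]  # approximate group means from the Results
rows = [{"subject": s, "condition": c, "wrs": rng.normal(m, 5)}
        for s in range(1, 30)            # 29 simulated participants
        for c, m in zip(conditions, means)]
df = pd.DataFrame(rows)

# Omnibus repeated measures ANOVA with sphericity (Greenhouse-Geisser) correction
aov = pg.rm_anova(data=df, dv="wrs", within="condition", subject="subject",
                  correction=True)
# Bonferroni-corrected post-hoc pairwise comparisons
post = pg.pairwise_tests(data=df, dv="wrs", within="condition",
                         subject="subject", padjust="bonf")
print(aov, post, sep="\n")
```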
RESULTS
Experiment 1: Number of Channels
In the first experiment, participants correctly recognized on average 57.93% [standard deviation (SD)=14.88], 80.97% (SD=5.92), 83.59% (SD=7.95), 91.03% (SD=6.82), and 95.45% (SD=0.95) of the words under the 8-, 12-, 16-, and 22-channel vocoder and non-vocoded settings, respectively (Figure 1).
A repeated measures ANOVA was performed to compare the WRS values obtained from the five test conditions in experiment 1. Mauchly’s test indicated a violation of the sphericity assumption, χ2(2)=70.69, p<0.001; therefore, the Greenhouse-Geisser (ε=0.535) corrected results are reported. The results showed that WRS was affected by the number of vocoder channels, F(4, 108)=92.41, p<0.001.
Because there was a significant difference between test conditions in the WRS measurement, a post-hoc test with a Bonferroni correction was applied to conduct pairwise comparisons. Statistically significant differences were observed across all channel conditions (p<0.01), except for the 12-channel vs. 16-channel comparison (p=0.881). Refer to Table 1 for detailed p-values.
Experiment 2: Channel Interaction
In the second experiment, participants attained an average WRS of 2.2% (SD=2.04), 20.6% (SD=9.11), and 50.6% (SD=6.27) under the high-, medium-, and low-interaction conditions, respectively.
A repeated measures ANOVA was performed to compare the WRS obtained from the three test conditions in experiment 2. Mauchly’s test did not indicate a violation of the sphericity assumption, χ2(2)=85.71, p=0.249. The results showed that WRS was affected by the degree of channel interaction, F(2, 54)=108.81, p<0.001.
Because there was a significant difference between test conditions in the WRS measurement, a post-hoc test with a Bonferroni correction was applied to conduct pairwise comparisons. Differences were statistically significant among all channel interaction conditions (p<0.001).
DISCUSSION
The results of our study underscore the critical role of channel configuration in CI processing. We found that achieving a relatively good monosyllabic WRS (>80%) required a minimum of 12 channels in our CI simulation. Furthermore, our investigation into channel interaction revealed its significant impact on WRS: participants exhibited average WRS values of 2.2%, 20.6%, and 50.6% under the high-, mid-, and low-interaction conditions, respectively. Notably, with 22 channels, word recognition performance closely approached that of the non-vocoded listening condition (91% vs. 95%). However, a statistically significant difference remained between the 22-channel vocoded and non-vocoded conditions, suggesting that while 22 channels can offer clinically comparable performance to natural listening, a discernible statistical distinction persists. This discrepancy may be attributed to the intricate interplay of channel interaction, which warrants further investigation.
The WRS in the 8-channel vocoder setting was notably poor (57.93%). However, performance increased dramatically with the 12-channel vocoder setting (80.97%). At first, this finding may seem inconsistent with previous reports showing that maximum speech perception can be achieved with only 8 channels5,6,25; however, those studies used different speech materials and speech perception tasks. We used monosyllabic words as speech material, which we believe is more suitable for practical, clinical use. Our findings are consistent with previous reports that used similar speech perception tasks, which showed that performance gains are possible beyond eight channels28,29 and up to at least 20 channels7. Therefore, at least 12 spectral channels appear to be required to provide acceptable peripheral stimulation for the speech perception tasks widely used in audiology clinics, at least within the range we tested and with the very low channel interaction setting (72 dB/octave) used in this experiment.
Moreover, performance in the 12-channel (80.97%) and 16-channel (83.59%) vocoder settings was remarkably similar, with no statistically significant difference between these conditions. This intriguing result prompts a closer examination of the complex relationship between the number of channels and the usable information those channels actually convey, which is influenced by channel interaction. Previous studies have consistently indicated a plateau in speech perception between 8 and 16 channels, contingent on the specific speech material and the nature of channel interaction in a given experimental context5,6,9,12. Our findings align with this trend, suggesting that the perceptual benefits may saturate within this channel range. The 16-channel condition, despite its increased spectral resolution, did not yield a significant advantage over the 12-channel condition. This lack of distinction may indicate perceptual blending, in which the additional channels carry overlapping information, rendering the 16-channel condition comparable to the 12-channel condition. Furthermore, 22 channels provided enough cues for better speech perception, although this should not be taken for granted with actual CI processing, considering the aforementioned complex relationship between the number of channels and channel interaction.
The second experiment, which delved into the effects of channel interaction, yielded intriguing insights. The participants demonstrated notably improved WRS under conditions of lower channel interaction, where the spectral and temporal fine structure was better preserved. This finding aligns with previous studies, suggesting that a more refined channel interaction, characterized by lower levels of spectral overlap, may play a pivotal role in enhancing speech perception8,12,13,14. These results prompt us to consider the possibility that achieving an optimal channel interaction may hold greater significance than simply increasing the number of channels. Indeed, it raises the intriguing prospect that prioritizing an improved channel interaction may substantially contribute to the overall efficacy of CI processing strategies, as suggested in previous studies12. In clinical practice, this suggests that professionals should prioritize achieving better channel interaction over simply maximizing the number of channels in CI fitting procedures.
Another critical finding was the superior performance in the 22-channel setting (91.03%). Although still statistically poorer than the non-vocoded setting, a WRS above 90% is considered a sign of healthy auditory function in routine audiological evaluations30; hence, 22 channels can be considered adequate input in terms of spectral information, at least in quiet test conditions with a very low channel interaction setting. However, this suggestion should be approached carefully: although adequate peripheral perception and decoding are essential, healthy central auditory function and cognitive capacity are also required for good speech perception. Both the scientific literature and anecdotal evidence show that many CI recipients have some form of auditory deprivation, which may lead to poor central auditory development31,32,33. Post-implantation benefit is highly influenced by pre-implantation auditory experience and development; thus, findings from NH participants tested with CIS should be viewed within this scope. Moreover, the non-vocoded condition (95.45%) was still superior to the 22-channel vocoder setting, indicating that even though a good WRS can be achieved with 22 channels, some spectral information is still missing.
As stated previously, speech test materials may affect outcomes in clinical and research settings. Although monosyllabic word lists in quiet are widely used in audiology clinics, this approach is slowly being replaced by speech-in-noise tests with sentences or words; for example, the British Society of Audiology currently recommends speech-in-noise tests for routine clinical assessments34. Although no legal or professional guidelines currently exist in Turkey in this regard, the use of monosyllabic WRS in a quiet background may be considered a limitation of the current study. However, our data should be considered a starting point for similar research focusing on Turkish speech test materials, whether sentences, syllables, or phonemes, in quiet or noisy backgrounds.
CONCLUSION
This study advances our understanding of the spectral and temporal constraints inherent in CI processing, particularly in the context of Turkish word recognition tests. Given the variability in CI devices and the potential issues necessitating the deactivation of specific electrodes, such as surgical complications, device malfunctions, or cochlear malformations, the insights gleaned from this research are of considerable significance. Our findings emphasize that a minimum of 12 spectral channels is required to deliver a high-fidelity signal to the auditory system and thereby ensure proficient speech perception. Moreover, the study underscores the critical role of channel interaction in optimizing CI processing, suggesting that achieving optimal interaction among channels may prove more pivotal than simply increasing their number. This insight holds substantial implications for clinicians, encouraging them to prioritize refining channel interaction in CI fitting procedures, potentially revolutionizing the way we approach auditory rehabilitation.


