Mel scale
The Mel scale is a perceptual scale of pitches that approximates the nonlinear way humans perceive differences in sound frequency, such that equal intervals on the scale correspond to subjectively equal steps in pitch height. Developed through psychophysical experiments, it maps physical frequencies in hertz (Hz) to a unit called the mel, with a reference point of 1000 mels assigned to a 1000 Hz tone at moderate loudness. Unlike linear frequency scales, the Mel scale is roughly linear for low frequencies (below about 1000 Hz) and logarithmic for higher frequencies, reflecting the human auditory system's greater resolution at lower pitches and compression at higher ones. The scale originated in 1937 from experiments by S. S. Stevens, J. Volkmann, and E. B. Newman, who used the method of bisection (asking listeners to identify tones that halved the perceived pitch interval between two reference tones) to construct a subjective measure of pitch magnitude across frequencies from 200 Hz to 8000 Hz. Their work, published in the Journal of the Acoustical Society of America, revealed that perceived pitch differences align with the integration of just-noticeable differences (differential thresholds) in frequency, suggesting a uniform psychological scaling along the basilar membrane in the inner ear. Although the original data lacked a simple closed-form equation, later approximations facilitated practical use; a widely adopted formula to convert frequency f (in Hz) to mels m is
m = 2595 \log_{10} \left(1 + \frac{f}{700}\right),
with the inverse
f = 700 \left(10^{m/2595} - 1\right).
This approximation, refined in subsequent psychoacoustic studies, ensures that 1000 mels corresponds closely to 1000 Hz and captures the scale's quasi-logarithmic behavior. In audio signal processing and machine learning, the Mel scale is foundational for features like Mel-frequency cepstral coefficients (MFCCs), introduced by Davis and Mermelstein in 1980 as compact representations of speech spectra that mimic human hearing. MFCCs apply Mel-scaled filter banks to short-time Fourier transforms of audio, followed by logarithm and discrete cosine transform, enabling robust applications in automatic speech recognition, speaker identification, music information retrieval, and sound classification systems. By prioritizing perceptually salient frequency regions, the scale improves model performance in tasks where linear frequency representations fall short, such as distinguishing vowels or detecting audio anomalies.
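As an illustration of the MFCC pipeline just described, the following sketch uses the open-source librosa library; the file name and parameter values (40 Mel bands, 13 coefficients, an 8 kHz upper limit) are illustrative choices, not fixed requirements:

```python
import librosa

# Load a recording (file name is illustrative); sr=16000 resamples to 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)

# Mel spectrogram: STFT power weighted by a Mel-spaced triangular filter bank.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, fmax=8000)

# MFCCs: log of the Mel band energies followed by a discrete cosine transform.
mfccs = librosa.feature.mfcc(S=librosa.power_to_db(S), sr=sr, n_mfcc=13)
```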
Perceptual Foundations
Human Pitch Perception
Pitch is a perceptual attribute of sound that arises from the auditory system's processing of acoustic stimuli, distinct from the physical property of frequency, which measures the number of vibrations per second in hertz (Hz).[1] Human listeners perceive pitch changes nonlinearly with respect to frequency, such that equal intervals in perceived pitch correspond to multiplicative rather than additive changes in frequency. For instance, the octave interval between 100 Hz and 200 Hz is readily distinguishable as a major perceptual shift, whereas a similar absolute difference of 100 Hz at higher frequencies, such as between 9000 Hz and 9100 Hz, produces a much subtler pitch variation relative to the overall scale. Psychophysical experiments have demonstrated this logarithmic-like perception of pitch for frequencies above approximately 500 Hz, where the just noticeable difference (JND) in frequency, the smallest detectable change, scales proportionally to the base frequency, aligning with Weber's law in audition. According to Weber's law, the JND (Δf) is a constant fraction (k) of the stimulus frequency (f), expressed as Δf / f = k, with k around 0.006 for frequencies above 1 kHz at moderate sensation levels. Below 500 Hz, the JND remains relatively constant at about 3 Hz, indicating higher absolute sensitivity in the low-frequency range. This pattern reflects the Weber-Fechner law's application to auditory pitch, where perceived pitch magnitude grows logarithmically with frequency, as evidenced in discrimination tasks using pure tones. Seminal measurements by Wier, Jesteadt, and Green confirmed these thresholds through adaptive forced-choice procedures, showing that frequency discrimination improves relatively at higher frequencies but requires larger absolute changes to achieve equivalent perceptual differences.[2] The basilar membrane in the cochlea plays a central role in this frequency-to-place mapping, exhibiting tonotopic organization where high frequencies stimulate the base and low frequencies the apex, resulting in a nonlinear transformation of frequency into spatial excitation patterns. This place coding contributes to the perceptual nonlinearity, as the membrane's mechanical properties cause broader displacement envelopes at low frequencies, enhancing resolution there. Critical bands, frequency ranges of about 100-200 Hz width that increase with center frequency, further define the auditory filters underlying pitch perception; sounds within the same critical band interact strongly, limiting independent resolution and influencing discrimination. Zwicker's foundational work established these bands through masking experiments, revealing that perceptual pitch judgments depend on integrated activity across these cochlear filters rather than precise frequency tuning alone.[3]
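As a worked example of the Weber-fraction regime quoted above: at 2000 Hz the just noticeable difference is roughly \Delta f = k f \approx 0.006 \times 2000 = 12 Hz, whereas near 200 Hz it sits at the roughly constant 3 Hz floor, about 1.5% of the base frequency. Absolute frequency resolution is therefore finest at low frequencies even though the relative Weber fraction is smaller at high ones.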
Motivation for the Mel Scale
Human auditory perception of pitch does not align linearly with physical frequency in hertz (Hz), as equal frequency intervals produce unequal perceived pitch differences. For example, a 100 Hz increase from 100 Hz to 200 Hz is heard as a substantially larger pitch shift than the same 100 Hz increase from 3000 Hz to 3100 Hz, reflecting the compressed sensitivity of the ear at higher frequencies.[4] This nonlinearity stems from cochlear mechanics, where low frequencies elicit broader neural activation than high ones, distorting linear scales for perceptual tasks.[5] To rectify this, the mel scale introduces a perceptual unit called the "mel," designed such that equal intervals in mels correspond to equal subjective pitch distances as judged by listeners. It is anchored by defining a 1000 Hz tone at 40 dB sound pressure level (SPL) as exactly 1000 mels, providing a reference for scaling other frequencies.[6] The scale's core objective is to remap physical frequencies onto a perceptually uniform domain, enabling precise quantification of pitch sensations for psychoacoustic research and engineering designs such as audio processing systems.[7] This transformation supports applications where perceptual equivalence must align with physical measurements, such as in speech analysis and hearing models.[8]
Mathematical Definition
Standard Formula
The standard formula for converting an acoustic frequency f (in hertz) to its corresponding value m on the Mel scale (in mels) is given by m = 2595 \log_{10}\left(1 + \frac{f}{700}\right). This expression approximates the nonlinear relationship between physical frequency and perceived pitch in human audition, exhibiting approximately linear behavior for low frequencies below around 1000 Hz, where \log_{10}(1 + x) \approx x / \ln(10) for small x yields m \approx \frac{2595}{700 \ln(10)} f \approx 1.61 f, and transitioning to logarithmic scaling at higher frequencies, which aligns with the compressive nature of pitch perception in that range. The inverse formula, which maps a Mel value back to frequency, is f = 700 \left(10^{m / 2595} - 1\right). This bidirectional mapping facilitates applications requiring perceptual frequency warping while preserving the scale's empirical foundations. The constants in the formula arise from a least-squares fitting process applied to psychophysical data on pitch judgments, ensuring close alignment with listener-reported equal-percept intervals; specifically, the knee point at 700 Hz reflects the frequency where auditory sensitivity begins to exhibit logarithmic compression, as observed in experiments using fractionation and equisection methods on tones from 20 Hz to several kHz. The prefactor 2595 normalizes the scale such that m = 1000 at f = 1000 Hz, establishing a convenient reference tied to the original data's anchoring at a 1000-Hz tone judged as 1000 mels. This particular approximation, among several possible fits to the data, has become the most widely adopted due to its simplicity and accuracy across the audible range.
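Both directions of the mapping are simple to compute; the following minimal Python sketch (function names are illustrative) verifies the 1000 Hz anchor, the near-linear low-frequency behavior, and the round trip through the inverse:

```python
import math

def hz_to_mel(f):
    # Standard form: m = 2595 * log10(1 + f / 700).
    return 2595 * math.log10(1 + f / 700)

def mel_to_hz(m):
    # Inverse: f = 700 * (10 ** (m / 2595) - 1).
    return 700 * (10 ** (m / 2595) - 1)

print(round(hz_to_mel(1000), 1))            # ~1000.0: the 1000 Hz anchor
print(round(hz_to_mel(100), 1))             # ~150.5: near-linear region (slope ~1.61 as f -> 0)
print(round(mel_to_hz(hz_to_mel(440)), 1))  # 440.0: round trip recovers the input
```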
Approximations and Implementations
Practical approximations of the Mel scale for computational use often adopt a piecewise structure, applying a linear transformation for frequencies below 1000 Hz with spacing of approximately 66.7 Hz per mel (corresponding to a slope of ≈0.015 mels/Hz), and a logarithmic transformation above this breakpoint to capture the nonlinear perception of higher pitches. This design, as in the Slaney-style implementation, enhances efficiency in digital signal processing by avoiding complex continuous functions while maintaining perceptual fidelity, though the low-frequency scaling differs from the psychophysical slope of ≈1.61 mels/Hz. A widely adopted logarithmic approximation, derived from empirical fits to psychophysical data and equivalent to the standard formula, is given by m = 1127 \ln\left(1 + \frac{f}{700}\right), where f is the frequency in Hz and m is the corresponding Mel value; this formula uses the natural logarithm and approximates the linear behavior at low frequencies since \ln(1 + x) \approx x for small x, yielding m \approx 1.61 f.[9] In implementations, the Mel scale is typically realized through a filter bank consisting of overlapping triangular filters centered at frequencies spaced uniformly in the Mel domain. These filters, often 20 to 40 in number, weight the power spectrum to produce Mel-frequency coefficients, as introduced in the seminal MFCC framework. The lower edge of the filter bank starts near 0 Hz, while the upper limit is set to approximately 8000 Hz for standard speech processing (corresponding to a 16 kHz sampling rate) or up to about 11025 Hz for wider-band audio, ensuring the analysis stays within the Nyquist limit of half the sampling rate. To handle varying sampling rates, implementations normalize the frequency range by scaling the maximum frequency proportionally (e.g., f_{\max} = 0.5 \times sr, where sr is the sample rate), and adjust filter bandwidths accordingly for consistent perceptual coverage.[10] A concrete example appears in Python's librosa library, where the mel_frequencies function computes Mel-spaced center frequencies using Slaney's piecewise approximation: linear below 1000 Hz (with ≈66.7 Hz per mel) and logarithmic above, with parameters tuned to replicate the MATLAB Auditory Toolbox behavior for 128 bins spanning 0 Hz to 11025 Hz (half of the 22.05 kHz default sampling rate). This facilitates efficient computation of Mel spectrograms via librosa.feature.melspectrogram. For illustration, a Python implementation of the common logarithmic approximation is:
```python
import math

def hz_to_mel(f):
    # Natural-log form of the standard conversion: 1127 = 2595 / ln(10).
    return 1127 * math.log(1 + f / 700)

def mel_to_hz(m):
    # Inverse mapping back to frequency in Hz.
    return 700 * (math.exp(m / 1127) - 1)
```

Such functions are inverted for generating filter centers and can be adapted to Slaney-style piecewise variants using the library's implementation, which applies linear spacing (about 66.7 Hz per mel) below 1000 Hz rather than the logarithmic curve.[10][11]
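Building on those conversions, the triangular filter bank described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the natural-log (HTK-style) mapping throughout, and the function name, default filter count, and bin-rounding convention are expository choices, not a transcription of any particular library:

```python
import numpy as np

def mel_filter_bank(n_filters=26, n_fft=512, sr=16000, f_min=0.0, f_max=None):
    """Triangular Mel filter bank; returns shape (n_filters, n_fft // 2 + 1)."""
    f_max = f_max or sr / 2  # stay within the Nyquist limit
    hz_to_mel = lambda f: 1127.0 * np.log(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (np.exp(m / 1127.0) - 1.0)

    # Filter edges: uniformly spaced in mels, mapped back to Hz, then to FFT bins.
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for k in range(left, center):   # rising edge of the triangle
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling edge of the triangle
            fbank[i, k] = (right - k) / max(right - center, 1)
    return fbank
```

Multiplying this matrix with a short-time power spectrum yields the Mel band energies whose logarithm and discrete cosine transform produce MFCCs.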
Historical Development
Early Experiments
The foundational empirical basis for the Mel scale emerged from psychophysical experiments in the 1930s and 1940s, focusing on how humans perceive equal pitch intervals across frequencies. In 1937, Stevens, Volkmann, and Newman conducted a study where five observers fractionated tones at 10 different frequencies to determine the "half-value" of pitches, with loudness held constant at 60 dB above threshold.[12] This method of bisection involved listeners judging tones that bisected the perceived pitch interval between a standard tone and silence or a low-frequency reference, establishing an initial subjective pitch scale in units later termed mels, with a 1000 Hz tone arbitrarily assigned 1000 mels.[12] The experiment covered frequencies from approximately 125 Hz to 12,000 Hz, yielding early mel values up to around 3000 mels for higher pitches, and revealed that perceived pitch intervals, such as octaves, expand in subjective size at higher frequencies.[12] During the 1940s, wartime research at the Harvard Psycho-Acoustic Laboratory, established in 1940 to enhance communication systems for military applications, expanded the dataset underlying the Mel scale.[13] These efforts addressed challenges in telephone quality assessment and sound localization in noisy environments, requiring detailed mappings of pitch perception to improve signal intelligibility.[13] Building on the 1937 work, a 1940 revision incorporated additional psychophysical data from equisection tasks, where listeners divided pitch intervals into equal perceptual segments, refining the scale across a broader range of intensities and frequencies.[14] Key methods in these early studies included absolute pitch matching, where observers adjusted variable tones to match the perceived pitch of standards, and magnitude estimation, in which listeners assigned numerical values to the subjective intensity of pitch differences.[12][14] Bisection and equisection judgments provided consistent data points, such as equating 1000 Hz at 40 phon to exactly 1000 mels, anchoring the scale to a perceptually uniform reference.[14] These approaches prioritized direct listener reports to quantify nonlinear pitch perception, laying the groundwork for the Mel scale's empirical validity without relying on musical training.[12][14]
Key Contributors and Publications
The Mel scale was primarily developed by psychologist Stanley Smith Stevens, who led the foundational research on perceptual scaling of pitch. Stevens, along with collaborators John Volkmann and Edwin B. Newman, introduced the scale in their seminal 1937 paper published in the Journal of the Acoustical Society of America, where they proposed a unit called the "mel" to quantify subjective pitch intervals based on listener judgments. The unit was named "mel" after the word "melody."[15] This work established the scale's core empirical basis, defining one mel as one-thousandth of the pitch of a 1000 Hz reference tone. In the early 1940s, Stevens further refined his theoretical framework for measurement scales, including perceptual ones like the mel, in his influential 1946 paper in Science, which categorized scales into nominal, ordinal, interval, and ratio types and emphasized ratio scales for sensory magnitudes such as pitch.[16] A key practical refinement to the mel scale itself came in 1940 through Stevens and Volkmann's collaborative paper in The American Journal of Psychology, which revised the original formulation to better align with frequency-to-pitch mappings across a wider auditory range, incorporating adjustments for low-frequency tones.[14] During the 1950s, Stevens continued to refine psychophysical methods applicable to the mel scale through his broader work on sensory scaling, including direct magnitude estimation techniques that reinforced the scale's ratio properties.[17] By the 1960s, the mel scale had become integrated into emerging models of auditory perception, with Stevens and contemporaries citing it in publications within the Journal of the Acoustical Society of America to link pitch perception to physiological and computational auditory processes.[18] These efforts by Stevens, Volkmann, and Newman solidified the mel scale as a standard tool in psychoacoustics.
Alternative Formulations
Early psychophysical studies on pitch perception, such as those by S.S. Stevens and colleagues in the 1930s and 1940s, showed that direct magnitude estimation of pitch height follows a power-law relation m \propto f^{0.3} to f^{0.4}, reflecting compressive nonlinearity. However, the mel scale was constructed using interval bisection methods, leading to a quasi-logarithmic form rather than a power law. A widely used approximation for the mel scale, proposed by Douglas O'Shaughnessy in his 1987 book Speech Communication: Human and Machine, is the formula
m = 2595 \log_{10} \left(1 + \frac{f}{700}\right),
which applies across the frequency range and ensures 1000 mels at 1000 Hz. This logarithmic mapping captures the scale's behavior, linear at low frequencies and compressive at high ones, based on fits to historical pitch discrimination data. An equivalent natural-logarithm form, common in implementations such as HTK, is
m = 1127 \ln \left(1 + \frac{f}{700}\right),
which differs from the O'Shaughnessy form only in the base of the logarithm, since 2595 / \ln 10 \approx 1127. A genuinely distinct variant is the piecewise Slaney formulation (1998) used in computational auditory models such as MATLAB's Auditory Toolbox, which is linear below 1000 Hz and logarithmic above, as described in the Approximations and Implementations section. These approximations facilitate applications in speech processing, differing slightly in high-frequency scaling due to fitting choices. Other variants include adjustments for specific listener data or cochlear models, such as inverse mappings for low-frequency emphasis. These formulations highlight the mel scale's evolution toward precise, computationally efficient models prioritizing perceptual linearity.
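The difference between the two conventions can be made concrete with a short sketch; note that the Slaney-style scale is expressed in different units (it is not anchored at 1000 mels for 1000 Hz), so its absolute values are smaller and not directly comparable:

```python
import math

def hz_to_mel_htk(f):
    # Natural-log form, equivalent to 2595 * log10(1 + f / 700).
    return 1127 * math.log(1 + f / 700)

def hz_to_mel_slaney(f):
    # Slaney-style piecewise mapping: linear below 1 kHz, logarithmic above.
    if f < 1000:
        return f / (200.0 / 3.0)  # about 66.7 Hz per mel
    return 15.0 + 27.0 * math.log(f / 1000.0) / math.log(6.4)

for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_mel_htk(f), 1), round(hz_to_mel_slaney(f), 2))
```

The two branches of the piecewise form meet continuously at 1000 Hz, which maps to 15 in Slaney units.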
Comparisons to Other Perceptual Scales
The Bark scale, introduced by Eberhard Zwicker in 1961, divides the audible frequency range into approximately 24 critical bands, each one Bark wide, based on psychoacoustic masking experiments; sounds within the same band interact strongly. Unlike the mel scale's focus on pitch intervals, the Bark scale is more linear at high frequencies, suiting spectral masking and loudness models. The conversion from frequency f in Hz to Bark z uses
z = 13 \arctan(0.00076 f) + 3.5 \arctan\left(\left(\frac{f}{7500}\right)^2\right).
The equivalent rectangular bandwidth (ERB) scale, developed by Brian R. Glasberg and Brian C. J. Moore in 1990, approximates cochlear filter bandwidths measured with notched-noise masking; on this scale the audible range spans roughly 40 ERB-rate units. The bandwidth at center frequency f is
\text{ERB}(f) = 24.7 + 0.108 f,
with narrower filters at low frequencies. The ERB-rate (spectral position) is often approximated as
\text{ERB-rate}(f) = 21.4 \log_{10}\left(\frac{0.108 f + 24.7}{24.7}\right) \approx 21.4 \log_{10}(0.00437 f + 1),
obtained by integrating the reciprocal of the bandwidth function above. It emphasizes frequency selectivity, differing from Bark's discrete bands. All three scales warp frequency nonlinearly, expanding low frequencies and compressing high ones: mel prioritizes equal pitch intervals, Bark captures critical bands for spectral integration, and ERB captures filter bandwidths for frequency resolution. Bark and ERB align on bandwidth phenomena, while mel suits logarithmic pitch tasks. Example mappings using the standard formulas (mel: m(f) = 2595 \log_{10}(1 + f/700), Bark as above, ERB-rate via the Glasberg-Moore approximation, which yields ~15.6 at 1000 Hz):
| Frequency (Hz) | Mel (mels) | Bark | ERB-rate (ERBs) |
|---|---|---|---|
| 100 | 150 | 1.0 | 3.4 |
| 1000 | 1000 | 8.5 | 15.6 |
| 4000 | 2146 | 17.3 | 27.1 |
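These values can be reproduced with a short script applying the three formulas quoted above (a sketch; the printed values match the table up to rounding):

```python
import math

def mel(f):
    # O'Shaughnessy mel formula.
    return 2595 * math.log10(1 + f / 700)

def bark(f):
    # Zwicker-style Bark conversion.
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500) ** 2)

def erb_rate(f):
    # Glasberg-Moore ERB-rate approximation.
    return 21.4 * math.log10(0.00437 * f + 1)

for f in (100, 1000, 4000):
    print(f"{f:>5} Hz  mel={mel(f):7.1f}  bark={bark(f):5.2f}  erb={erb_rate(f):5.2f}")
```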