
Have you ever wondered how machines understand the nuances of sound? In my previous post, we explored spectral analysis and learned how spectrograms reveal the frequency content of audio signals using the Short-Time Fourier Transform (STFT). Now, let’s dive deeper into advanced spectral representations for audio analysis, including Mel Spectrograms, CQT, and HCQT, and show how they can be used for perceptual audio analysis and feature extraction. These tools are essential for building machine learning models for tasks like audio classification, a field I’m currently exploring.
Why Feature Extraction?
Spectral analysis provides us with a visual map of audio frequencies, but for machine learning, we need compact, meaningful features that capture the essence of sound. Raw spectrograms are rich but high-dimensional, making them inefficient to feed directly into models. By refining them into perceptually relevant or musically meaningful representations, we can extract features that align with how we hear or interpret audio. This is crucial for applications like genre classification, pitch detection, or environmental sound recognition.
Advanced Spectral Representations
Let’s explore three advanced spectral representations that address the limitations of STFT-based spectrograms: Mel Spectrograms, Constant-Q Transform (CQT), and Harmonic-CQT (HCQT). Each of these tools offers unique advantages for audio analysis and feature extraction.
Mel Spectrogram (MEL) and Log-Mel Spectrogram (LMS)
What Are They?
The Mel Spectrogram adapts the STFT to the Mel scale, a perceptual scale of pitch that reflects how humans hear frequency differences (e.g., we’re more sensitive to changes at lower frequencies). It compresses the frequency axis into Mel bins, reducing dimensionality while prioritizing auditory perception. The Log-Mel Spectrogram takes this further by applying a logarithmic transformation to the amplitude, mimicking the logarithmic response of our ears to loudness.
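To get a feel for this compression, you can compare a few frequencies in Hz with their Mel values. The snippet below is a quick sketch using librosa's built-in converter; the specific frequencies are just illustrative:
```python
import librosa

# Equal steps in Hz correspond to smaller and smaller steps on the Mel scale,
# mirroring our reduced sensitivity to frequency differences at high frequencies.
for f_hz in [100, 200, 1000, 2000, 4000, 8000]:
    print(f"{f_hz:>5} Hz -> {librosa.hz_to_mel(f_hz):7.1f} Mel")

# The same 100 Hz gap spans many Mels at low frequencies, very few at high ones
print(librosa.hz_to_mel(200) - librosa.hz_to_mel(100))
print(librosa.hz_to_mel(8100) - librosa.hz_to_mel(8000))
```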
Why Use Them?
- Perceptual Relevance: Mel Spectrograms align with human hearing, making them ideal for speech and music analysis.
- Machine Learning Ready: Log-Mel Spectrograms are compact and widely used as input features for deep learning models.
Example in Python
```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
# Load audio
y, sr = librosa.load('/content/drive/My Drive/audio_files/sample.wav')
# Compute Mel Spectrogram
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max) # Log-Mel Spectrogram
# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Violin Log-Mel Spectrogram')
plt.show()
```
- n_mels=128: Number of Mel bins (adjustable based on your needs; a quick shape check follows below).
- Output: Time vs. Mel frequency, with colour showing log-amplitude.
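As a quick sanity check (reusing S_db, y, and sr from the example above), the shape of the output follows directly from n_mels and the frame hop:
```python
# Rows = Mel bins, columns = analysis frames (default hop_length is 512 samples)
print(S_db.shape)                    # (128, n_frames), n_frames ≈ len(y) / 512

# Fewer Mel bins give a more compact, coarser feature matrix
S_small = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
print(S_small.shape)                 # (64, n_frames)
```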

Constant-Q Transform (CQT)
What Is It?
The Constant-Q Transform (CQT) is an alternative to STFT that uses a logarithmic frequency scale, where the frequency resolution is constant relative to the center frequency (constant Q-factor). Unlike STFT’s fixed window size, CQT’s window size varies—longer for low frequencies, shorter for high ones.
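You can see the constant-Q property directly from the bin centre frequencies: each bin sits a fixed ratio above the previous one, so resolution is constant on a logarithmic (musical) scale. A minimal check with librosa:
```python
import librosa
import numpy as np

# Centre frequencies of a 2-octave CQT starting at C1, 12 bins per octave
freqs = librosa.cqt_frequencies(n_bins=24, fmin=librosa.note_to_hz('C1'),
                                bins_per_octave=12)
print(freqs[:4])                      # ~32.7, 34.6, 36.7, 38.9 Hz (semitone steps)

# The ratio between neighbouring bins is constant: 2**(1/12) ≈ 1.0595
print(np.round(freqs[1:] / freqs[:-1], 4))
```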
Why Use It?
- Musical Advantage: Its logarithmic scale matches the intervals of musical notes (e.g., octaves), making it perfect for pitch-related tasks like chord recognition or music transcription.
- Better Resolution: It captures low-frequency details (e.g., bass notes) better than STFT.
Example in Python
The following example builds on the code above and reuses the audio already loaded into y and sr.
```python
# Compute CQT
C = librosa.cqt(y, sr=sr)
C_db = librosa.amplitude_to_db(abs(C), ref=np.max)
# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(C_db, sr=sr, x_axis='time', y_axis='cqt_note')
plt.colorbar(format='%+2.0f dB')
plt.title('Violin Constant-Q Transform')
plt.show()
```
- y_axis='cqt_note': Labels the y-axis with musical notes (e.g., C4, D4), emphasizing its musical focus.

Harmonic-CQT (HCQT)
What Is It?
The Harmonic Constant-Q Transform (HCQT) extends CQT by analysing harmonic structures. It computes CQTs at multiple harmonic multiples (e.g., fundamental frequency and its overtones) and stacks them into a 3D representation.
Why Use It?
- Pitch-Related Applications: HCQT excels at separating harmonic content (e.g., a piano’s notes) from noise or percussive elements, ideal for pitch detection or source separation.
- Research Edge: It’s less common in standard toolkits and features in recent research on multi-pitch estimation and music transcription.
Note on Implementation
`Librosa` doesn’t directly provide HCQT, but you can approximate it by computing CQTs for harmonic multiples manually, or use an external library like `nnAudio`. Here are simplified examples using both approaches.
With `Librosa`:
```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
# Load audio file
y, sr = librosa.load('/content/drive/My Drive/audio_files/sample.wav', sr=22050) # Replace with your file path
hop_length = 512 # Number of samples between successive frames
harmonics = [1, 2, 3] # Harmonics to analyze (fundamental + overtones)
# Compute HCQT for the fundamental (h=1)
fmin = librosa.note_to_hz('C1') * 1 # Convert note C1 to Hz (~32.7 Hz)
n_bins = 60 # Total bins (5 octaves: 60/12 = 5)
# Check Nyquist limit (prevents aliasing)
nyquist_limit = fmin * (2 ** (n_bins / 12))
if nyquist_limit < sr / 2:
    # Compute Constant-Q Transform
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                      fmin=fmin, n_bins=n_bins, bins_per_octave=12)
else:
    raise ValueError("Nyquist limit exceeded! Adjust parameters.")
# Convert CQT magnitude to decibels (normalized to max amplitude)
cqt_db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)
# Generate CQT frequency axis (logarithmic scale)
frequencies = librosa.cqt_frequencies(n_bins=n_bins, fmin=fmin, bins_per_octave=12)
# Plot the spectrogram
plt.figure(figsize=(14, 5))
librosa.display.specshow(cqt_db, sr=sr, hop_length=hop_length,
                         y_axis='cqt_hz', x_axis='time',  # Log-frequency axis
                         fmin=fmin, bins_per_octave=12,
                         vmin=-80, vmax=0)  # dB range; optionally add cmap='viridis'
plt.colorbar(format='%+2.0f dB', label='Amplitude (dB)')
plt.ylim(frequencies[0], frequencies[-1]) # Set frequency axis limits
plt.title('Violin Harmonic-CQT (Fundamental) - Librosa')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()
```
Limitations:
- Tedious manual setup.
- No native harmonic stacking (though a short manual stacking sketch follows below).
- Limited to CPU computation.
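That said, the manual stacking itself only takes a few lines. Here is a minimal sketch (reusing y, sr, hop_length, fmin, n_bins, and harmonics from the block above) that computes one CQT per harmonic and stacks them into a 3D array:
```python
# One CQT per harmonic multiple of fmin; all share the same hop length, so the
# frame axes line up and the results stack into (n_harmonics, n_bins, n_frames)
hcqt_manual = np.stack([
    np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                       fmin=fmin * h, n_bins=n_bins, bins_per_octave=12))
    for h in harmonics
])
hcqt_manual_db = librosa.amplitude_to_db(hcqt_manual, ref=np.max)
print(hcqt_manual_db.shape)  # (3, 60, n_frames)
```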
For efficient HCQT computation, we use `nnAudio`, a PyTorch-based library that leverages GPU acceleration. First, install it:
```python
!pip install nnAudio  # the '!' prefix runs this shell command from a Colab/Jupyter cell
```
Then, run the following code:
```python
import torch
import librosa
from nnAudio.features.cqt import CQT
import matplotlib.pyplot as plt
# Parameters
sr = 22050 # Sample rate
hop_length = 512 # Hop size
n_bins = 60 # Number of frequency bins (reduced to avoid Nyquist issues)
fmin = 32.7 # Minimum frequency (C1 in Hz)
harmonics = [1, 2, 3] # Harmonics to compute
# Load audio (using librosa)
y, _ = librosa.load("/content/drive/My Drive/audio_files/sample.wav", sr=sr)
# Convert to PyTorch tensor
y_tensor = torch.tensor(y).float()
# Compute HCQT for each harmonic
hcqt = []
for h in harmonics:
    cqt = CQT(sr=sr, hop_length=hop_length, n_bins=n_bins,
              fmin=fmin * h, bins_per_octave=12, output_format='Magnitude')
    cqt_output = cqt(y_tensor)  # Shape: (1, n_bins, time)
    cqt_db = 20 * torch.log10(torch.clamp(cqt_output, min=1e-5))  # Avoid log(0)
    hcqt.append(cqt_db)
# Plot the fundamental harmonic
if hcqt:
    plt.figure(figsize=(14, 5))
    plt.imshow(hcqt[0].squeeze().numpy(), aspect='auto', origin='lower',
               cmap='viridis', vmin=-80, vmax=0, interpolation='bilinear')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Violin Harmonic-CQT (Fundamental) - nnAudio')
    plt.xlabel('Time (frames)')
    plt.ylabel('Frequency (bins)')
    plt.show()
```
Advantages:
- GPU Acceleration: Faster computation for large datasets.
- Native Harmonic Support: Streamlined parameter setup.
- PyTorch Integration: Direct compatibility with deep learning pipelines.


The axes are labelled differently, but the basic plotting configuration is the same.
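To recover the 3D representation described earlier, the per-harmonic outputs in the hcqt list can be concatenated into a single tensor. A short sketch, continuing from the nnAudio example, that also adds the batch dimension most CNN front ends expect:
```python
# Each list element has shape (1, n_bins, time); concatenating along dim 0
# gives (n_harmonics, n_bins, time), with harmonics acting like image channels
hcqt_tensor = torch.cat(hcqt, dim=0)
print(hcqt_tensor.shape)              # torch.Size([3, 60, n_frames])

# Add a batch dimension for a 2D CNN: (batch, channels, frequency, time)
batch = hcqt_tensor.unsqueeze(0)
print(batch.shape)                    # torch.Size([1, 3, 60, n_frames])
```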
What Do These Representations Tell Us?
- Mel/Log-Mel: Highlights perceptually significant frequencies (e.g., speech formants or musical timbre).
- CQT: Reveals musical structure (e.g., note transitions in a melody).
- HCQT: Isolates harmonic patterns (e.g., a chord’s overtones), distinguishing pitched sounds from noise.
These features are more targeted than raw STFT spectrograms, making them powerful inputs for machine learning models.
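As a concrete illustration of that last point, here is a minimal sketch (assuming S_db from the Mel example and PyTorch as the modelling framework) of turning a Log-Mel Spectrogram into a model-ready input:
```python
import torch

# Scale the log-mel values to roughly [0, 1], then add batch and channel dimensions,
# giving the (batch, channels, freq, time) layout expected by a 2D CNN classifier
features = (S_db - S_db.min()) / (S_db.max() - S_db.min())
x = torch.tensor(features, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
print(x.shape)  # torch.Size([1, 1, 128, n_frames])
```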
Reflection
Exploring these spectral representations has been a transformative experience for me. Initially, I relied heavily on STFT, but discovering Mel Spectrograms showed me how aligning analysis with human perception could significantly boost classification accuracy—something I’m currently testing with various audio datasets. Implementing CQT was a revelation for its musical precision, though working with HCQT pushed my coding skills to the limit. I spent hours digging into research papers and experimenting with harmonic stacking to get it right. These challenges have deepened my understanding of audio feature extraction and increased my excitement for applying these techniques to machine learning models.
Conclusion
In this post, we’ve explored advanced spectral representations for audio analysis, including Mel Spectrograms, CQT, and HCQT, and seen how they can be used for feature extraction. These tools take us beyond waveforms and basic spectrograms, offering perceptually and musically relevant features that are essential for machine learning tasks.
Additional Resources
- Librosa Documentation: librosa.org/doc
- nnAudio: nnAudio 0.2.0
- Deep Learning 101 for Audio-based MIR, ISMIR 2024 Tutorial by Geoffroy Peeters et al. (2024).
- Z. Rafii, “The Constant-Q Harmonic Coefficients: A timbre feature designed for music signals [Lecture Notes],” in IEEE Signal Processing Magazine, vol. 39, no. 3, pp. 90-96, May 2022, doi: 10.1109/MSP.2021.3138870.
- K. W. Cheuk, H. Anderson, K. Agres and D. Herremans, “nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks,” in IEEE Access, vol. 8, pp. 161981-162003, 2020, doi: 10.1109/ACCESS.2020.3019084.