Title: Low-Latency Audio-Visual Speech Enhancement Using Hybrid Attention-Based Deep Learning Model
Authors: Fahad Khalil Peracha, Mohammad Irfan Khattak, Nasir Saleem, Waqas Tariq Paracha, Mohammad Usman Ali Khan, Atif Jan
Journal: Spectrum of Engineering Sciences
| Category | From | To |
|---|---|---|
| Y | 2024-10-01 | 2025-12-31 |
Publisher: Sociology Educational Nexus Research Institute
Country: Pakistan
Year: 2025
Volume: 3
Issue: 9
Language: en
Keywords: Low-Latency Audio-Visual Speech Enhancement; Hybrid Attention-Based Deep Learning Model
Speech enhancement aims to recover clean speech from noisy signals. In many applications, such as video conferencing, hearing aids, and augmented reality, latency must be low because delays degrade intelligibility and user experience. Recent work shows that combining audio with visual cues (lip movements, facial features) can improve performance under low signal-to-noise ratios (SNR), especially in noisy or reverberant environments. However, many existing audio-visual speech enhancement (AV-SE) methods suffer from high latency, non-causality, or inefficient fusion of modalities. This paper proposes a hybrid attention-based deep learning model designed for real-time, low-latency audio-visual speech enhancement. The model combines temporal, frequency, and cross-modal attention mechanisms to extract features from the noisy audio, align and fuse visual and audio features, and reconstruct enhanced speech with minimal delay.

In the encoder, spectral features of the noisy audio are processed by a convolutional front end followed by frequency-axis attention to capture global spectral dependencies. In parallel, a visual encoder processes lip and face-region motion with convolution and temporal attention to model dynamics in the visual stream. A cross-modal attention module enables selective fusion, letting the model weight visual cues more heavily when the audio is unreliable (e.g., low SNR) and rely more on audio when visual information is less helpful (e.g., occluded or blurred). A decoder network then combines the fused features, using skip connections and attention gates, to output a clean spectrogram, which is converted back to a waveform via an inverse transform. Causality is ensured by using only past and current frames (no future frames). The model also uses lightweight attention blocks and optimized frame sizes to keep computational and algorithmic latency low.

We evaluate the model on standard benchmarks, including AVSpeech and NTCD-TIMIT, under several noise conditions (stationary, non-stationary, low/high SNR) and visual degradations (blur, partial occlusion). Metrics include objective speech quality (PESQ), intelligibility (STOI), and real-time latency. The results show that the hybrid attention model outperforms strong baselines, including audio-only speech enhancement and simpler AV-SE models with naive fusion, achieving PESQ and STOI improvements of roughly 0.5–1.2 points at moderate to low SNR while maintaining total processing latency under 40 ms. In particular, at very low SNRs (e.g., -5 dB), visual cues delivered through cross-modal attention yield significant gains. This work contributes: (1) a hybrid attention framework that fuses audio and visual features adaptively under constrained latency; (2) architectural design choices (lightweight attention blocks, skip connections, causal temporal/frequency attention) optimized for low delay; and (3) experimental validation showing the feasibility of high-quality AV speech enhancement in real time. Potential applications include live communication tools, hearing assistance devices, and any system where delayed feedback harms user perception.
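To make the cross-modal fusion step in the abstract concrete, the following is a minimal PyTorch sketch of causal cross-modal attention in which audio frames attend over time-aligned visual features. All module names, dimensions, and the single-head layout are illustrative assumptions for this sketch; the paper's actual architecture is not reproduced here.

```python
# Hypothetical sketch of causal cross-modal attention fusion (not the paper's code).
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Audio frames attend over time-aligned visual features (single head, causal)."""

    def __init__(self, audio_dim: int = 256, visual_dim: int = 128, fused_dim: int = 256):
        super().__init__()
        self.query = nn.Linear(audio_dim, fused_dim)   # queries come from the audio stream
        self.key = nn.Linear(visual_dim, fused_dim)    # keys/values come from the visual stream
        self.value = nn.Linear(visual_dim, fused_dim)
        self.out = nn.Linear(audio_dim + fused_dim, fused_dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (batch, T, audio_dim)  -- per-frame spectral embeddings
        # visual_feats: (batch, T, visual_dim) -- per-frame lip/face embeddings
        q = self.query(audio_feats)
        k = self.key(visual_feats)
        v = self.value(visual_feats)

        # Scaled dot-product scores between each audio frame and every visual frame.
        scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5  # (B, T, T)

        # Causal mask: frame t may only attend to visual frames <= t, so no future
        # context is needed and algorithmic latency stays bounded.
        T = scores.shape[-1]
        causal_mask = torch.triu(torch.ones(T, T, device=scores.device), diagonal=1).bool()
        scores = scores.masked_fill(causal_mask, float("-inf"))

        attn = torch.softmax(scores, dim=-1)      # weights over past/current visual frames
        visual_context = torch.matmul(attn, v)    # (B, T, fused_dim)

        # Concatenate audio features with the attended visual context and project,
        # letting the network down-weight either modality when it is unreliable.
        return self.out(torch.cat([audio_feats, visual_context], dim=-1))


if __name__ == "__main__":
    fusion = CrossModalAttentionFusion()
    audio = torch.randn(2, 50, 256)    # 2 utterances, 50 frames of audio features
    video = torch.randn(2, 50, 128)    # matching 50 frames of visual features
    print(fusion(audio, video).shape)  # torch.Size([2, 50, 256])
```

In this layout, when the visual stream is degraded (blur, occlusion), the learned attention weights can spread thinly over uninformative visual frames, which approximates the adaptive weighting behavior the abstract describes.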
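The abstract reports a total processing latency under 40 ms but does not state frame sizes; the short calculation below illustrates how such a budget decomposes under an assumed 25 ms analysis window with a 10 ms hop. These frame parameters are hypothetical, not taken from the paper.

```python
# Back-of-the-envelope latency budget for causal, frame-wise enhancement.
# WINDOW_MS and HOP_MS are assumed values for illustration only.
WINDOW_MS = 25.0         # assumed STFT analysis window
HOP_MS = 10.0            # assumed frame hop
TOTAL_BUDGET_MS = 40.0   # total processing latency reported in the abstract

# With past/current-frame (causal) processing, algorithmic latency is roughly one
# analysis window: output for a frame can be emitted once that frame is buffered.
algorithmic_latency_ms = WINDOW_MS

# Whatever remains must cover per-frame inference plus I/O overhead; inference must
# also finish within one hop to keep up with the incoming stream in real time.
compute_budget_ms = TOTAL_BUDGET_MS - algorithmic_latency_ms

print(f"algorithmic latency ~ {algorithmic_latency_ms:.0f} ms, "
      f"compute/overhead budget ~ {compute_budget_ms:.0f} ms per frame, "
      f"throughput bound: inference under {HOP_MS:.0f} ms per hop")
```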