Whisper Streaming
Whisper Streaming is a real-time speech-to-text transcription and translation system built on OpenAI's Whisper model with WebSocket support and optional Google Translate integration. The system is powered exclusively by faster-whisper for optimal performance and speed. This project aims to be robust, scalable, and FAST for production use.
Demos
Real-time Whisper Transcription
This demo showcases the core WebSocket-based transcription functionality with near-instant results.
Please note: In the video, the WebSocket connects to an IP address; this is an AWS EC2 GPU instance where the server was deployed for optimal real-time performance.
Streaming Transcription with Translation
This demo demonstrates the combined transcription and translation capabilities using Google Translate API.
Please note: The processing appears slower in this video because it was running on a laptop with CPU-only processing (no GPU). Even with that limitation, the system delivers near real-time performance with approximately 5-second latency.
Chrome Extension - Tab Audio Capture
The Chrome extension captures audio from any browser tab and provides real-time transcription and translation in a persistent side panel. Perfect for YouTube videos, podcasts, and online meetings; it works with any audio content playing in your browser tabs.
Features
- ✅ Real-time streaming - WebSocket-based audio streaming with low latency processing
- ✅ Voice Activity Detection - Optimized processing with VAD/VAC for better performance
- ✅ Live Translation - Real-time Google Translate integration for multilingual support
- ✅ Chrome Extension - Capture and translate audio from any browser tab
- ✅ Web Interface - Browser-based UI with language switching capabilities
Streaming support: The application processes audio chunks in real-time as they arrive. You don't need to wait for the complete audio to be processed before receiving transcription results.
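To illustrate the idea, here is a minimal sketch of incremental chunking: packets of arbitrary size arrive and are re-cut into fixed-size chunks for processing. The chunk size and sample rate are illustrative assumptions, not the project's actual values.

```python
# Illustration only: incremental chunking of an incoming audio stream.
# Sample rate and chunk duration are assumptions, not the project's values.

SAMPLE_RATE = 16_000          # Hz, a common rate for Whisper pipelines
CHUNK_SECONDS = 1.0           # hypothetical processing granularity
CHUNK_BYTES = int(SAMPLE_RATE * CHUNK_SECONDS) * 2  # 16-bit mono PCM

def iter_chunks(stream_parts):
    """Accumulate arbitrary-size byte packets and yield fixed-size chunks."""
    buffer = bytearray()
    for part in stream_parts:
        buffer.extend(part)
        while len(buffer) >= CHUNK_BYTES:
            yield bytes(buffer[:CHUNK_BYTES])
            del buffer[:CHUNK_BYTES]
    if buffer:                 # flush whatever remains at end of stream
        yield bytes(buffer)
```

Each chunk can be handed to the transcription backend as soon as it is complete, which is what makes results available before the full audio has arrived.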
Faster-Whisper Backend: Built exclusively with faster-whisper for optimal performance and speed. This implementation provides the best balance of accuracy and processing speed for real-time applications.
WebSocket Architecture: Built on WebSocket protocol for real-time bidirectional communication between client and server, enabling live audio streaming and instant results.
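A client for this kind of server might look like the sketch below, built on the `websockets` dependency listed under Installation. The URL reuses the port from the Quick Start section; the assumption that results arrive as JSON text frames is illustrative, not documented server API.

```python
# Hypothetical client sketch: stream raw PCM to the server over a WebSocket
# and print results as they arrive. The message shape (JSON text frames) is
# an assumption about this server, not its documented protocol.
import asyncio
import json

SERVER_URL = "ws://localhost:43007"   # port taken from the Quick Start section

def parse_result(message):
    """Decode a result message; assumes the server sends JSON text frames."""
    return json.loads(message)

async def stream_file(path):
    import websockets  # pip install websockets
    async with websockets.connect(SERVER_URL) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(32_000):        # ~1 s of 16 kHz 16-bit mono
                await ws.send(chunk)              # binary audio frame out
                try:                              # results may arrive any time
                    msg = await asyncio.wait_for(ws.recv(), timeout=0.01)
                    print(parse_result(msg))
                except asyncio.TimeoutError:
                    pass
```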
Voice Activity Detection: Integrated VAD/VAC reduces processing overhead by only transcribing when speech is detected, improving performance and accuracy.
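The gating concept behind VAD can be shown with a toy energy threshold: only windows whose level crosses a threshold are passed to the transcriber. The real system uses a learned, torch-based detector; this sketch only illustrates why skipping silence reduces processing overhead.

```python
# Toy illustration of the VAD gating concept: forward a window only when its
# short-term energy crosses a threshold. NOT the project's actual detector,
# which is a learned torch-based model.
import math

def rms(samples):
    """Root-mean-square level of a window of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def speech_windows(samples, window=160, threshold=0.02):
    """Yield (start_index, window) pairs whose RMS exceeds the threshold."""
    for start in range(0, len(samples) - window + 1, window):
        win = samples[start:start + window]
        if rms(win) > threshold:
            yield start, win
```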
Translation Integration: Optional Google Translate API integration provides real-time translation alongside transcription for multilingual applications.
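A translation helper built on the `requests` dependency might look like this sketch. The exact Google Translate call this project makes is not shown here; the unofficial `translate.googleapis.com` endpoint below is an illustrative assumption and may change without notice.

```python
# Sketch of a translation helper using the unofficial Google Translate
# endpoint. Endpoint and parameter names are assumptions for illustration.
def build_translate_request(text, source="auto", target="en"):
    """Return (url, params) for the unofficial Google Translate endpoint."""
    url = "https://translate.googleapis.com/translate_a/single"
    params = {"client": "gtx", "sl": source, "tl": target, "dt": "t", "q": text}
    return url, params

def translate(text, source="auto", target="en"):
    import requests  # pip install requests
    url, params = build_translate_request(text, source, target)
    resp = requests.get(url, params=params, timeout=5)
    resp.raise_for_status()
    # the response nests translated fragments; join them into one string
    return "".join(part[0] for part in resp.json()[0])
```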
Chrome Extension: Capture audio from any browser tab (YouTube, podcasts, meetings) and get live transcription and translation in a persistent side panel.
Easy Configuration: Simple launcher script with extensive command-line options for quick setup and customization.
Please note: This project supports all Whisper-compatible languages and can be easily extended with additional features. Contributions and feature suggestions are welcome!
Installation
Required Dependencies
Install the core dependencies first:
pip install librosa soundfile websockets
Faster-Whisper Backend (Recommended)
The project uses faster-whisper as its only backend. For optimal performance, GPU acceleration is highly recommended.
CPU Installation
pip install faster-whisper
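Once installed, basic CPU transcription with faster-whisper looks like this sketch. The model size and `int8` compute type are illustrative choices that keep memory usage low on CPU-only machines.

```python
# Minimal faster-whisper usage sketch (CPU). Model size and compute type are
# illustrative choices; "int8" keeps memory low on CPU-only machines.
CPU_MODEL_ARGS = {"device": "cpu", "compute_type": "int8"}

def transcribe_file(path, model_size="small"):
    from faster_whisper import WhisperModel  # pip install faster-whisper
    model = WhisperModel(model_size, **CPU_MODEL_ARGS)
    segments, info = model.transcribe(path, vad_filter=True)
    for seg in segments:  # segments is a generator; decoding happens here
        print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```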
GPU Installation (Recommended)
For GPU support, follow the faster-whisper instructions for installing the required NVIDIA libraries. We succeeded with cuDNN 8.5.0 and CUDA 11.7.
pip install faster-whisper
GPU Configuration: After installation, open whisper_online.py and find the model variable around line 119. Uncomment the GPU model line to enable GPU acceleration.
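The GPU variant of that line typically follows faster-whisper's documented options, along these lines (a sketch, not necessarily the exact line in whisper_online.py):

```python
# GPU: float16 on CUDA is the usual faster-whisper configuration
model = WhisperModel(model_size, device="cuda", compute_type="float16")
```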
Optional Features
- Voice Activity Detection:
pip install torch torchaudio
- Translation Support (for Google Translate integration):
pip install requests
Quick Start
Once installed, start the server with:
python start_whisper.py
Then open index.html in your browser, or navigate to http://localhost:43007