WhisperX Gradio Interface
WhisperX is an advanced audio processing tool that provides high-quality speech recognition, translation, and analysis capabilities through an intuitive Gradio web interface.
Key Features
Speech-to-Text
Convert audio files to accurate text transcripts with high precision and low latency.
Multi-Language Support
Support for 99+ languages with automatic language detection and translation capabilities.
Speaker Diarization
Identify and separate different speakers in audio recordings automatically.
Word-Level Timestamps
Get precise timestamps for each word in the transcription for detailed analysis.
Quick Start
1. Launch GPU Instance
Launch a GPU instance with the WhisperX template from your Comput3 dashboard.
WhisperX Template
Pre-configured instance with WhisperX, Gradio interface, and all dependencies ready to use.
Recommended GPU
RTX 4090 48GB, L40S, or A100 for optimal performance with audio processing.
2. Access Gradio Interface
Connect to your GPU instance and open the WhisperX Gradio interface.
The WhisperX template automatically starts the Gradio service and makes it accessible via web browser.
3. Upload Audio File
Use the Gradio interface to upload your audio file.
Supported formats:
- MP3, WAV, FLAC, M4A
- AAC, OGG, WMA
- Video files with audio tracks (MP4, AVI, MOV)
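Before uploading, it can help to check a file name against the format list above. The following is a minimal sketch, not part of the WhisperX template: the names `AUDIO_EXTENSIONS`, `VIDEO_EXTENSIONS`, and `is_supported_upload` are hypothetical, and the check looks only at the file extension, not at the actual contents.

```python
from pathlib import Path

# Extensions copied from the supported-format list above.
# The constant and function names are illustrative assumptions.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".aac", ".ogg", ".wma"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov"}


def is_supported_upload(filename: str) -> bool:
    """Return True if the file extension appears in the supported-format list."""
    suffix = Path(filename).suffix.lower()
    return suffix in AUDIO_EXTENSIONS or suffix in VIDEO_EXTENSIONS


print(is_supported_upload("interview.MP3"))  # True
print(is_supported_upload("notes.txt"))      # False
```

The comparison is case-insensitive, so `interview.MP3` and `interview.mp3` are treated the same way.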
4. Configure Processing Options
Set your processing preferences:
- Language: Auto-detect or specify language
- Model Size: Choose between base, small, medium, large, or large-v2
- Speaker Diarization: Enable/disable speaker identification
- Translation: Translate to different languages
- Output Format: Choose transcript format (TXT, SRT, VTT, JSON)
5. Process and Download
Click “Process” to start transcription and download results when complete.
Processing Options
Model Selection
Whisper Base
Fast processing with good accuracy
- Size: 74M parameters
- Speed: ~16x real-time
- Accuracy: Good for clear speech
- Best for: Quick transcriptions, real-time processing
Whisper Small
Balanced speed and accuracy
- Size: 244M parameters
- Speed: ~6x real-time
- Accuracy: Better than base model
- Best for: General purpose transcription
Whisper Medium
High accuracy with moderate speed
- Size: 769M parameters
- Speed: ~2x real-time
- Accuracy: High quality results
- Best for: Professional transcription, important content
Whisper Large
Maximum accuracy
- Size: 1550M parameters
- Speed: ~1x real-time
- Accuracy: Highest quality
- Best for: Critical applications, complex audio
Whisper Large-v2
Latest model with improved performance
- Size: 1550M parameters
- Speed: ~1x real-time
- Accuracy: Best available
- Best for: Production use, highest quality requirements
Language Support
Major Languages
Fully supported languages:
- English, Spanish, French, German, Italian
- Portuguese, Russian, Chinese (Mandarin), Japanese, Korean
- Arabic, Hindi, Dutch, Swedish, Norwegian
- Polish, Czech, Hungarian, Romanian, Bulgarian
Regional Languages
Regional and dialect support:
- Chinese (Cantonese, Traditional, Simplified)
- Spanish (Mexico, Argentina, Spain variants)
- English (US, UK, Australian, Canadian)
- Portuguese (Brazil, Portugal)
- French (France, Canada, African variants)
Low-Resource Languages
Emerging language support:
- Swahili, Yoruba, Igbo, Amharic
- Bengali, Tamil, Telugu, Gujarati
- Ukrainian, Belarusian, Kazakh
- Thai, Vietnamese, Indonesian, Malay
Advanced Features
Speaker Diarization
Identify and separate different speakers in your audio.
Translation Capabilities
Translate audio content to different languages:
- Real-time Translation: translate while transcribing
- Batch Translation
Word-Level Timestamps
Get precise timing information for each word.
Timestamp Formats
Multiple output formats:
- JSON: Detailed word-level data with confidence scores
- SRT: SubRip subtitle format with timestamps
- VTT: WebVTT format for web applications
- TXT: Plain text with timestamps
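To illustrate how the JSON and SRT formats listed above relate, here is a minimal sketch that turns timed segments into SRT text. It assumes each segment is a dictionary with `start` and `end` times in seconds and a `text` field; that shape, and the helper names `srt_timestamp` and `segments_to_srt`, are assumptions for illustration rather than the interface's documented export code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments (assumed keys: start, end, text) as SRT cues."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


example = [
    {"start": 0.0, "end": 1.25, "text": " Hello and welcome."},
    {"start": 1.5, "end": 3.0, "text": " Let's begin."},
]
print(segments_to_srt(example))
```

WebVTT cues are similar, but the file starts with a `WEBVTT` header line and timestamps use a period instead of a comma before the milliseconds.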
Confidence Scores
Quality indicators:
- Word-level confidence scores (0-1)
- Segment-level quality metrics
- Speaker identification confidence
- Language detection confidence
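Word-level confidence scores are often used to flag words for manual review. The sketch below assumes each word entry is a dictionary with a `word` field and an optional `score` between 0 and 1; the field names, the 0.5 threshold, and the function `flag_low_confidence` are illustrative assumptions, not a documented part of the JSON output.

```python
def flag_low_confidence(words: list[dict], threshold: float = 0.5) -> list[str]:
    """Return the words whose score is missing or below the threshold."""
    flagged = []
    for entry in words:
        score = entry.get("score")
        # Entries without a score are flagged too, since their quality is unknown.
        if score is None or score < threshold:
            flagged.append(entry["word"])
    return flagged


sample_words = [
    {"word": "quarterly", "start": 0.10, "end": 0.62, "score": 0.94},
    {"word": "EBITDA", "start": 0.70, "end": 1.31, "score": 0.41},
    {"word": "2024", "start": 1.40, "end": 1.90},
]
print(flag_low_confidence(sample_words))  # ['EBITDA', '2024']
```

A suitable threshold depends on the audio and the model, so it is worth tuning against a few transcripts that have been checked by hand.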
Use Cases
Content Creation
Podcast Transcription
Convert podcast episodes to searchable text with speaker identification and timestamps.
Video Subtitles
Generate accurate subtitles for videos with precise timing and multiple language support.
Meeting Notes
Automatically transcribe meetings and generate structured notes with speaker attribution.
Content Analysis
Analyze audio content for keywords, sentiment, and engagement metrics.
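For the speaker-attributed use cases above (podcasts and meeting notes), one common post-processing step is to merge consecutive segments from the same speaker into a single turn. This is a minimal sketch that assumes diarized segments carry `speaker`, `start`, `end`, and `text` fields; those field names and the function `merge_speaker_turns` are assumptions for illustration.

```python
def merge_speaker_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


meeting = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.1, "text": " Let's get started."},
    {"speaker": "SPEAKER_00", "start": 2.1, "end": 4.0, "text": " First item is the budget."},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 6.5, "text": " I have the numbers."},
]
for turn in merge_speaker_turns(meeting):
    print(f"[{turn['start']:.1f}s] {turn['speaker']}: {turn['text']}")
```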
Business Applications
Customer Support
Transcribe customer calls for quality assurance and training purposes.
Legal Documentation
Create accurate transcripts of depositions, hearings, and legal proceedings.
Educational Content
Convert lectures and educational materials to accessible text formats.
Media Production
Generate scripts and captions for media production workflows.
Performance Optimization
GPU Acceleration
CUDA Optimization
GPU-accelerated processing:
- Automatic GPU detection and utilization
- CUDA memory optimization
- Batch processing for multiple files
- Real-time processing capabilities
Memory Management
Efficient resource usage:
- Dynamic model loading and unloading
- Memory-efficient audio processing
- Automatic cleanup after processing
- Support for large audio files
Processing Speed
| Model | GPU Type | Processing Speed | Memory Usage |
|---|---|---|---|
| Base | RTX 4090 48GB | ~32x real-time | 2GB |
| Small | RTX 4090 48GB | ~12x real-time | 4GB |
| Medium | RTX 4090 48GB | ~4x real-time | 6GB |
| Large | RTX 4090 48GB | ~2x real-time | 8GB |
| Large-v2 | RTX 4090 48GB | ~1.5x real-time | 10GB |
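The speed column in the table above can be turned into a rough time estimate by dividing the audio duration by the real-time factor. The sketch below copies the approximate factors from the table; the names `REAL_TIME_FACTOR` and `estimated_processing_seconds` are illustrative, and actual times will vary with audio content and enabled options such as diarization.

```python
# Approximate real-time factors from the table above (RTX 4090 48GB).
REAL_TIME_FACTOR = {
    "base": 32.0,
    "small": 12.0,
    "medium": 4.0,
    "large": 2.0,
    "large-v2": 1.5,
}


def estimated_processing_seconds(audio_seconds: float, model: str) -> float:
    """Estimate processing time as audio duration divided by the real-time factor."""
    return audio_seconds / REAL_TIME_FACTOR[model]


# A one-hour recording with the medium model: 3600 / 4 = 900 seconds (15 minutes).
print(estimated_processing_seconds(3600, "medium"))  # 900.0
```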
API Integration
REST API Endpoints
Process Audio
Main processing endpoint:
Get Status
Check processing status:
List Models
Available models:
Troubleshooting
Common Issues
Audio Quality Issues
Poor transcription quality:
- Use higher-quality audio files (sample rate of 16 kHz or higher)
- Reduce background noise before processing
- Try different model sizes (larger models for better accuracy)
- Specify the correct language if auto-detection fails
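To check the sample-rate recommendation above for an uncompressed WAV file, Python's standard `wave` module can read the header. This is a minimal sketch: the helper names and the `MIN_SAMPLE_RATE_HZ` constant are illustrative, and the approach works only for PCM WAV files, not for MP3, FLAC, or other compressed formats.

```python
import wave

# Threshold taken from the 16 kHz recommendation above.
MIN_SAMPLE_RATE_HZ = 16_000


def wav_sample_rate(path: str) -> int:
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def meets_sample_rate(path: str, minimum: int = MIN_SAMPLE_RATE_HZ) -> bool:
    """Return True if the WAV file's sample rate is at least the minimum."""
    return wav_sample_rate(path) >= minimum
```

For example, `meets_sample_rate("call_recording.wav")` would return `False` for a typical 8 kHz telephone recording; the file name here is a placeholder.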
Processing Errors
File processing failures:
- Check file format compatibility
- Ensure the file size is under the 1 GB limit
- Verify that the audio file is not corrupted
- Check GPU memory availability
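The size check from the list above can be done locally before uploading. The sketch below assumes the 1 GB limit means 1024³ bytes, which the source does not specify; `MAX_UPLOAD_BYTES` and `within_size_limit` are hypothetical names used only for illustration.

```python
import os

# Assumption: "1GB" is interpreted as 1024**3 bytes; the exact limit may differ.
MAX_UPLOAD_BYTES = 1024 ** 3


def within_size_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at `path` is strictly smaller than the limit."""
    return os.path.getsize(path) < limit
```

Files over the limit could be split into shorter recordings or re-encoded at a lower bitrate before processing.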
Performance Issues
Slow processing speeds:
- Use smaller models for faster processing
- Confirm that processing is actually running on the GPU
- Close other applications to free up resources
- Consider using batch processing for multiple files