Process audio files with advanced AI-powered transcription and analysis using WhisperX Gradio Interface on Comput3 Network’s GPU infrastructure.

WhisperX Gradio Interface

WhisperX is an advanced audio processing tool that provides high-quality speech recognition, translation, and analysis capabilities through an intuitive Gradio web interface.

Key Features

Speech-to-Text

Convert audio files to accurate text transcripts with high precision and low latency.

Multi-Language Support

Support for 99+ languages with automatic language detection and translation capabilities.

Speaker Diarization

Identify and separate different speakers in audio recordings automatically.

Word-Level Timestamps

Get precise timestamps for each word in the transcription for detailed analysis.

Quick Start

1. Launch GPU Instance

Launch a GPU instance with the WhisperX template from your Comput3 dashboard.

WhisperX Template

Pre-configured instance with WhisperX, Gradio interface, and all dependencies ready to use.

Recommended GPU

RTX 4090 48GB, L40S, or A100 for optimal performance with audio processing.

2. Access Gradio Interface

Connect to your GPU instance and open the WhisperX Gradio interface.
# WhisperX Gradio runs on port 7860 by default
http://<your-instance-ip>:7860
The WhisperX template automatically starts the Gradio service and makes it accessible via web browser.
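If the page does not load, a quick reachability check from your local machine can confirm whether anything is listening on that port. This is a generic sketch using the Python standard library; the function name is my own and the instance address is a placeholder.

```python
from urllib import error, request

def gradio_is_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the Gradio interface answers at the given URL."""
    try:
        with request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, ValueError):
        # Connection refused, bad hostname, or timeout: not reachable
        return False

# e.g. gradio_is_up("http://<your-instance-ip>:7860")
```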

3. Upload Audio File

Use the Gradio interface to upload your audio file:
Audio file formats:
  • MP3, WAV, FLAC, M4A
  • AAC, OGG, WMA
  • Video files with audio tracks (MP4, AVI, MOV)
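Before uploading, you can screen files against the formats listed above. A small helper (the names are my own, based on the extension list):

```python
from pathlib import Path

# Extensions accepted by the interface, per the list above
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".m4a",
    ".aac", ".ogg", ".wma",
    ".mp4", ".avi", ".mov",  # video containers with audio tracks
}

def is_supported(path: str) -> bool:
    """Check whether a file's extension is in the supported set."""
    # .lower() makes the check case-insensitive (e.g. ".MP3")
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```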

4. Configure Processing Options

Set your processing preferences:
  • Language: Auto-detect or specify language
  • Model Size: Choose between base, small, medium, large, or large-v2
  • Speaker Diarization: Enable/disable speaker identification
  • Translation: Translate to different languages
  • Output Format: Choose transcript format (TXT, SRT, VTT, JSON)
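The options above map naturally onto a request payload. The field names below mirror the API parameters documented later on this page, but treat the exact schema as illustrative rather than authoritative:

```python
VALID_MODELS = {"base", "small", "medium", "large", "large-v2"}
VALID_FORMATS = {"txt", "srt", "vtt", "json"}

def build_options(model="large-v2", language="auto",
                  speaker_diarization=False, translate=False,
                  target_language=None, output_format="txt"):
    """Assemble a processing-options payload, validating choices up front."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model}")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    options = {
        "model": model,
        "language": language,
        "speaker_diarization": speaker_diarization,
        "translate": translate,
        "output_format": output_format,
    }
    # Only attach a target language when translation is requested
    if translate and target_language:
        options["target_language"] = target_language
    return options
```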

5. Process and Download

Click “Process” to start transcription and download results when complete.

Processing Options

Model Selection

base: Fast processing with good accuracy
  • Size: 39 MB
  • Speed: ~16x real-time
  • Accuracy: Good for clear speech
  • Best for: Quick transcriptions, real-time processing

small: Balanced speed and accuracy
  • Size: 244 MB
  • Speed: ~6x real-time
  • Accuracy: Better than base model
  • Best for: General-purpose transcription

medium: High accuracy with moderate speed
  • Size: 769 MB
  • Speed: ~2x real-time
  • Accuracy: High-quality results
  • Best for: Professional transcription, important content

large: Maximum accuracy
  • Size: 1550 MB
  • Speed: ~1x real-time
  • Accuracy: Highest quality
  • Best for: Critical applications, complex audio

large-v2: Latest model with improved performance
  • Size: 1550 MB
  • Speed: ~1x real-time
  • Accuracy: Best available
  • Best for: Production use, highest-quality requirements

Language Support

Fully supported languages:
  • English, Spanish, French, German, Italian
  • Portuguese, Russian, Chinese (Mandarin), Japanese, Korean
  • Arabic, Hindi, Dutch, Swedish, Norwegian
  • Polish, Czech, Hungarian, Romanian, Bulgarian
Regional and dialect support:
  • Chinese (Cantonese, Traditional, Simplified)
  • Spanish (Mexico, Argentina, Spain variants)
  • English (US, UK, Australian, Canadian)
  • Portuguese (Brazil, Portugal)
  • French (France, Canada, African variants)
Emerging language support:
  • Swahili, Yoruba, Igbo, Amharic
  • Bengali, Tamil, Telugu, Gujarati
  • Ukrainian, Belarusian, Kazakh
  • Thai, Vietnamese, Indonesian, Malay

Advanced Features

Speaker Diarization

Identify and separate different speakers in your audio:
import requests

def process_audio_with_speakers(audio_file):
    # Open the file in a context manager so the handle is closed after upload
    with open(audio_file, "rb") as f:
        response = requests.post(
            "http://your-instance-ip:7860/api/process",
            files={"audio": f},
            data={
                "model": "large-v2",
                "language": "auto",
                "speaker_diarization": True,
                "min_speakers": 2,
                "max_speakers": 10
            },
        )
    response.raise_for_status()
    result = response.json()
    return result["transcript_with_speakers"]

# Process a meeting recording
transcript = process_audio_with_speakers("meeting.mp3")
print(transcript)

Translation Capabilities

Translate audio content to different languages:
Translate while transcribing:
import requests

# Transcribe and translate in one step
with open("spanish_audio.mp3", "rb") as f:
    result = requests.post(
        "http://your-instance-ip:7860/api/process",
        files={"audio": f},
        data={
            "model": "large-v2",
            "source_language": "es",  # Spanish
            "target_language": "en",  # English
            "translate": True
        },
    )
result.raise_for_status()

Word-Level Timestamps

Get precise timing information for each word:
Multiple output formats:
  • JSON: Detailed word-level data with confidence scores
  • SRT: SubRip subtitle format with timestamps
  • VTT: WebVTT format for web applications
  • TXT: Plain text with timestamps
Quality indicators:
  • Word-level confidence scores (0-1)
  • Segment-level quality metrics
  • Speaker identification confidence
  • Language detection confidence
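As a concrete illustration of turning timestamped JSON into SRT, here is a minimal converter. The input shape (a list of segments with start, end, and text) is an assumption about the JSON output, not a documented schema:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render [{'start', 'end', 'text'}] segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks)

example = [{"start": 0.0, "end": 2.5, "text": "Hello there."}]
print(segments_to_srt(example))
```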

Use Cases

Content Creation

Podcast Transcription

Convert podcast episodes to searchable text with speaker identification and timestamps.

Video Subtitles

Generate accurate subtitles for videos with precise timing and multiple language support.

Meeting Notes

Automatically transcribe meetings and generate structured notes with speaker attribution.

Content Analysis

Analyze audio content for keywords, sentiment, and engagement metrics.

Business Applications

Customer Support

Transcribe customer calls for quality assurance and training purposes.

Legal Documentation

Create accurate transcripts of depositions, hearings, and legal proceedings.

Educational Content

Convert lectures and educational materials to accessible text formats.

Media Production

Generate scripts and captions for media production workflows.

Performance Optimization

GPU Acceleration

GPU-accelerated processing:
  • Automatic GPU detection and utilization
  • CUDA memory optimization
  • Batch processing for multiple files
  • Real-time processing capabilities
Efficient resource usage:
  • Dynamic model loading and unloading
  • Memory-efficient audio processing
  • Automatic cleanup after processing
  • Support for large audio files

Processing Speed

| Model    | GPU Type      | Processing Speed | Memory Usage |
|----------|---------------|------------------|--------------|
| Base     | RTX 4090 48GB | ~32x real-time   | 2 GB         |
| Small    | RTX 4090 48GB | ~12x real-time   | 4 GB         |
| Medium   | RTX 4090 48GB | ~4x real-time    | 6 GB         |
| Large    | RTX 4090 48GB | ~2x real-time    | 8 GB         |
| Large-v2 | RTX 4090 48GB | ~1.5x real-time  | 10 GB        |
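The speed multipliers translate directly into wall-clock estimates: processing time is roughly audio duration divided by the real-time factor. A small sketch, with the factors copied from the table above:

```python
# Real-time factors from the table above (RTX 4090)
REALTIME_FACTOR = {
    "base": 32.0,
    "small": 12.0,
    "medium": 4.0,
    "large": 2.0,
    "large-v2": 1.5,
}

def estimated_seconds(audio_seconds: float, model: str) -> float:
    """Estimate wall-clock processing time for a clip of given length."""
    return audio_seconds / REALTIME_FACTOR[model]

# A one-hour recording on the fastest vs. slowest model
print(estimated_seconds(3600, "base"))      # 112.5 seconds
print(estimated_seconds(3600, "large-v2"))  # 2400.0 seconds
```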

API Integration

REST API Endpoints

Main processing endpoint:
POST /api/process
Content-Type: multipart/form-data

Parameters:
- audio: Audio file (required)
- model: Model size (base, small, medium, large, large-v2)
- language: Source language (auto-detect if not specified)
- translate: Enable translation (true/false)
- target_language: Target language for translation
- speaker_diarization: Enable speaker identification
- min_speakers: Minimum number of speakers
- max_speakers: Maximum number of speakers
Check processing status:
GET /api/status/{job_id}

Response:
{
  "status": "processing|completed|error",
  "progress": 0.75,
  "estimated_time_remaining": 30,
  "result": {...}
}
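Building on that status response, a client can poll until the job finishes. The sketch below injects the fetch function so the loop is easy to test; in practice it would wrap a requests.get against /api/status/{job_id} (response shape assumed from this page):

```python
import time

def wait_for_job(fetch_status, poll_interval=2.0, max_polls=300):
    """Poll a status callable until it reports 'completed' or 'error'.

    fetch_status() must return a dict shaped like the /api/status response.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "completed":
            return status["result"]
        if status["status"] == "error":
            raise RuntimeError("processing failed")
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")

# Example with a stub that completes on the second poll
responses = iter([
    {"status": "processing", "progress": 0.5},
    {"status": "completed", "progress": 1.0, "result": {"text": "done"}},
])
print(wait_for_job(lambda: next(responses), poll_interval=0))
```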
Available models:
GET /api/models

Response:
{
  "models": [
    {"name": "base", "size": "39MB", "languages": 99},
    {"name": "small", "size": "244MB", "languages": 99},
    {"name": "medium", "size": "769MB", "languages": 99},
    {"name": "large", "size": "1550MB", "languages": 99},
    {"name": "large-v2", "size": "1550MB", "languages": 99}
  ]
}

Troubleshooting

Common Issues

Poor transcription quality:
  • Use higher quality audio files (16kHz+ sample rate)
  • Reduce background noise before processing
  • Try different model sizes (larger models for better accuracy)
  • Specify the correct language if auto-detection fails
File processing failures:
  • Check file format compatibility
  • Ensure file size is under 1GB limit
  • Verify audio file is not corrupted
  • Check GPU memory availability
Slow processing speeds:
  • Use smaller models for faster processing
  • Ensure GPU is properly utilized
  • Close other applications to free up resources
  • Consider using batch processing for multiple files
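One way to automate the "use smaller models" advice is a fallback chain that retries with progressively smaller models when processing fails. The sketch below takes the processing call as a parameter, since the exact failure mode (for example, a GPU out-of-memory error) depends on the deployment:

```python
MODEL_FALLBACK_ORDER = ["large-v2", "large", "medium", "small", "base"]

def transcribe_with_fallback(process, audio_file, models=MODEL_FALLBACK_ORDER):
    """Try each model in order, returning the first successful result.

    `process(audio_file, model)` should raise on failure (e.g. GPU OOM).
    """
    last_error = None
    for model in models:
        try:
            return model, process(audio_file, model)
        except RuntimeError as exc:
            last_error = exc  # fall through to the next, smaller model
    raise RuntimeError(f"all models failed: {last_error}")

# Stub that only has memory for the medium model and below
def fake_process(audio_file, model):
    if model in ("large-v2", "large"):
        raise RuntimeError("CUDA out of memory")
    return f"transcript via {model}"

print(transcribe_with_fallback(fake_process, "call.wav"))
```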

Next Steps