Process audio files with advanced AI-powered transcription and analysis using WhisperX Gradio Interface on Comput3 Network’s GPU infrastructure.

WhisperX Gradio Interface

WhisperX is an advanced audio processing tool that provides high-quality speech recognition, translation, and analysis capabilities through an intuitive Gradio web interface.

Key Features

Speech-to-Text

Convert audio files to accurate text transcripts with high precision and low latency.

Multi-Language Support

Support for 99+ languages with automatic language detection and translation capabilities.

Speaker Diarization

Identify and separate different speakers in audio recordings automatically.

Word-Level Timestamps

Get precise timestamps for each word in the transcription for detailed analysis.

Quick Start

1. Launch GPU Instance

Launch a GPU instance with the WhisperX template from your Comput3 dashboard.

WhisperX Template

Pre-configured instance with WhisperX, Gradio interface, and all dependencies ready to use.

Recommended GPU

RTX 4090 48GB, L40S, or A100 for optimal performance with audio processing.

2. Access Gradio Interface

Connect to your GPU instance and open the WhisperX Gradio interface.
# WhisperX Gradio runs on port 7860 by default
http://<your-instance-ip>:7860
The WhisperX template automatically starts the Gradio service and makes it accessible via web browser.
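If the page does not load, a quick reachability check from your local machine can confirm whether anything is listening on that port. This is a generic sketch using the Python standard library; the function name is my own and the instance address is a placeholder.

```python
from urllib import error, request

def gradio_is_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the Gradio interface answers at the given URL."""
    try:
        with request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, ValueError):
        # Connection refused, bad hostname, or timeout: not reachable
        return False

# e.g. gradio_is_up("http://<your-instance-ip>:7860")
```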

3. Upload Audio File

Use the Gradio interface to upload your audio file:
Audio file formats:
  • MP3, WAV, FLAC, M4A
  • AAC, OGG, WMA
  • Video files with audio tracks (MP4, AVI, MOV)
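Before uploading, you can screen files against the formats listed above. A small helper (the names are my own, based on the extension list):

```python
from pathlib import Path

# Extensions accepted by the interface, per the list above
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".m4a",
    ".aac", ".ogg", ".wma",
    ".mp4", ".avi", ".mov",  # video containers with audio tracks
}

def is_supported(path: str) -> bool:
    """Check whether a file's extension is in the supported set."""
    # .lower() makes the check case-insensitive (e.g. ".MP3")
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```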

4. Configure Processing Options

Set your processing preferences:
  • Language: Auto-detect or specify language
  • Model Size: Choose between base, small, medium, large, or large-v2
  • Speaker Diarization: Enable/disable speaker identification
  • Translation: Translate to different languages
  • Output Format: Choose transcript format (TXT, SRT, VTT, JSON)
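The options above map naturally onto a request payload. The field names below mirror the API parameters documented later on this page, but treat the exact schema as illustrative rather than authoritative:

```python
VALID_MODELS = {"base", "small", "medium", "large", "large-v2"}
VALID_FORMATS = {"txt", "srt", "vtt", "json"}

def build_options(model="large-v2", language="auto",
                  speaker_diarization=False, translate=False,
                  target_language=None, output_format="txt"):
    """Assemble a processing-options payload, validating choices up front."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model}")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    options = {
        "model": model,
        "language": language,
        "speaker_diarization": speaker_diarization,
        "translate": translate,
        "output_format": output_format,
    }
    # Only attach a target language when translation is requested
    if translate and target_language:
        options["target_language"] = target_language
    return options
```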

5. Process and Download

Click “Process” to start transcription and download results when complete.

Processing Options

Model Selection

base: Fast processing with good accuracy
  • Size: 39 MB
  • Speed: ~16x real-time
  • Accuracy: Good for clear speech
  • Best for: Quick transcriptions, real-time processing

small: Balanced speed and accuracy
  • Size: 244 MB
  • Speed: ~6x real-time
  • Accuracy: Better than base model
  • Best for: General-purpose transcription

medium: High accuracy with moderate speed
  • Size: 769 MB
  • Speed: ~2x real-time
  • Accuracy: High-quality results
  • Best for: Professional transcription, important content

large: Maximum accuracy
  • Size: 1550 MB
  • Speed: ~1x real-time
  • Accuracy: Highest quality
  • Best for: Critical applications, complex audio

large-v2: Latest model with improved performance
  • Size: 1550 MB
  • Speed: ~1x real-time
  • Accuracy: Best available
  • Best for: Production use, highest-quality requirements

Language Support

Fully supported languages:
  • English, Spanish, French, German, Italian
  • Portuguese, Russian, Chinese (Mandarin), Japanese, Korean
  • Arabic, Hindi, Dutch, Swedish, Norwegian
  • Polish, Czech, Hungarian, Romanian, Bulgarian
Regional and dialect support:
  • Chinese (Cantonese, Traditional, Simplified)
  • Spanish (Mexico, Argentina, Spain variants)
  • English (US, UK, Australian, Canadian)
  • Portuguese (Brazil, Portugal)
  • French (France, Canada, African variants)
Emerging language support:
  • Swahili, Yoruba, Igbo, Amharic
  • Bengali, Tamil, Telugu, Gujarati
  • Ukrainian, Belarusian, Kazakh
  • Thai, Vietnamese, Indonesian, Malay

Advanced Features

Speaker Diarization

Identify and separate different speakers in your audio:
import requests

def process_audio_with_speakers(audio_file):
    # Open the file in a context manager so the handle is closed after upload
    with open(audio_file, "rb") as f:
        response = requests.post(
            "http://your-instance-ip:7860/api/process",
            files={"audio": f},
            data={
                "model": "large-v2",
                "language": "auto",
                "speaker_diarization": True,
                "min_speakers": 2,
                "max_speakers": 10
            },
        )
    response.raise_for_status()
    result = response.json()
    return result["transcript_with_speakers"]

# Process a meeting recording
transcript = process_audio_with_speakers("meeting.mp3")
print(transcript)

Translation Capabilities

Translate audio content to different languages:
Translate while transcribing:
import requests

# Transcribe and translate in one step
with open("spanish_audio.mp3", "rb") as f:
    result = requests.post(
        "http://your-instance-ip:7860/api/process",
        files={"audio": f},
        data={
            "model": "large-v2",
            "source_language": "es",  # Spanish
            "target_language": "en",  # English
            "translate": True
        },
    )
result.raise_for_status()

Word-Level Timestamps

Get precise timing information for each word:
Multiple output formats:
  • JSON: Detailed word-level data with confidence scores
  • SRT: SubRip subtitle format with timestamps
  • VTT: WebVTT format for web applications
  • TXT: Plain text with timestamps
Quality indicators:
  • Word-level confidence scores (0-1)
  • Segment-level quality metrics
  • Speaker identification confidence
  • Language detection confidence
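As a concrete illustration of turning timestamped JSON into SRT, here is a minimal converter. The input shape (a list of segments with start, end, and text) is an assumption about the JSON output, not a documented schema:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render [{'start', 'end', 'text'}] segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks)

example = [{"start": 0.0, "end": 2.5, "text": "Hello there."}]
print(segments_to_srt(example))
```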

Use Cases

Content Creation

Podcast Transcription

Convert podcast episodes to searchable text with speaker identification and timestamps.

Video Subtitles

Generate accurate subtitles for videos with precise timing and multiple language support.

Meeting Notes

Automatically transcribe meetings and generate structured notes with speaker attribution.

Content Analysis

Analyze audio content for keywords, sentiment, and engagement metrics.

Business Applications

Customer Support

Transcribe customer calls for quality assurance and training purposes.

Legal Documentation

Create accurate transcripts of depositions, hearings, and legal proceedings.

Educational Content

Convert lectures and educational materials to accessible text formats.

Media Production

Generate scripts and captions for media production workflows.

Performance Optimization

GPU Acceleration

GPU-accelerated processing:
  • Automatic GPU detection and utilization
  • CUDA memory optimization
  • Batch processing for multiple files
  • Real-time processing capabilities
Efficient resource usage:
  • Dynamic model loading and unloading
  • Memory-efficient audio processing
  • Automatic cleanup after processing
  • Support for large audio files

Processing Speed

| Model    | GPU Type      | Processing Speed | Memory Usage |
|----------|---------------|------------------|--------------|
| Base     | RTX 4090 48GB | ~32x real-time   | 2 GB         |
| Small    | RTX 4090 48GB | ~12x real-time   | 4 GB         |
| Medium   | RTX 4090 48GB | ~4x real-time    | 6 GB         |
| Large    | RTX 4090 48GB | ~2x real-time    | 8 GB         |
| Large-v2 | RTX 4090 48GB | ~1.5x real-time  | 10 GB        |
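The speed multipliers translate directly into wall-clock estimates: processing time is roughly audio duration divided by the real-time factor. A small sketch, with the factors copied from the table above:

```python
# Real-time factors from the table above (RTX 4090)
REALTIME_FACTOR = {
    "base": 32.0,
    "small": 12.0,
    "medium": 4.0,
    "large": 2.0,
    "large-v2": 1.5,
}

def estimated_seconds(audio_seconds: float, model: str) -> float:
    """Estimate wall-clock processing time for a clip of given length."""
    return audio_seconds / REALTIME_FACTOR[model]

# A one-hour recording on the fastest vs. slowest model
print(estimated_seconds(3600, "base"))      # 112.5 seconds
print(estimated_seconds(3600, "large-v2"))  # 2400.0 seconds
```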

API Integration

REST API Endpoints

Main processing endpoint:
POST /api/process
Content-Type: multipart/form-data

Parameters:
- audio: Audio file (required)
- model: Model size (base, small, medium, large, large-v2)
- language: Source language (auto-detect if not specified)
- translate: Enable translation (true/false)
- target_language: Target language for translation
- speaker_diarization: Enable speaker identification
- min_speakers: Minimum number of speakers
- max_speakers: Maximum number of speakers
Check processing status:
GET /api/status/{job_id}

Response:
{
  "status": "processing|completed|error",
  "progress": 0.75,
  "estimated_time_remaining": 30,
  "result": {...}
}
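Building on that status response, a client can poll until the job finishes. The sketch below injects the fetch function so the loop is easy to test; in practice it would wrap a requests.get against /api/status/{job_id} (response shape assumed from this page):

```python
import time

def wait_for_job(fetch_status, poll_interval=2.0, max_polls=300):
    """Poll a status callable until it reports 'completed' or 'error'.

    fetch_status() must return a dict shaped like the /api/status response.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "completed":
            return status["result"]
        if status["status"] == "error":
            raise RuntimeError("processing failed")
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")

# Example with a stub that completes on the second poll
responses = iter([
    {"status": "processing", "progress": 0.5},
    {"status": "completed", "progress": 1.0, "result": {"text": "done"}},
])
print(wait_for_job(lambda: next(responses), poll_interval=0))
```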
Available models:
GET /api/models

Response:
{
  "models": [
    {"name": "base", "size": "39MB", "languages": 99},
    {"name": "small", "size": "244MB", "languages": 99},
    {"name": "medium", "size": "769MB", "languages": 99},
    {"name": "large", "size": "1550MB", "languages": 99},
    {"name": "large-v2", "size": "1550MB", "languages": 99}
  ]
}

Troubleshooting

Common Issues

Poor transcription quality:
  • Use higher quality audio files (16kHz+ sample rate)
  • Reduce background noise before processing
  • Try different model sizes (larger models for better accuracy)
  • Specify the correct language if auto-detection fails
File processing failures:
  • Check file format compatibility
  • Ensure file size is under 1GB limit
  • Verify audio file is not corrupted
  • Check GPU memory availability
Slow processing speeds:
  • Use smaller models for faster processing
  • Ensure GPU is properly utilized
  • Close other applications to free up resources
  • Consider using batch processing for multiple files
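One way to automate the "use smaller models" advice is a fallback chain that retries with progressively smaller models when processing fails. The sketch below takes the processing call as a parameter, since the exact failure mode (for example, a GPU out-of-memory error) depends on the deployment:

```python
MODEL_FALLBACK_ORDER = ["large-v2", "large", "medium", "small", "base"]

def transcribe_with_fallback(process, audio_file, models=MODEL_FALLBACK_ORDER):
    """Try each model in order, returning the first successful result.

    `process(audio_file, model)` should raise on failure (e.g. GPU OOM).
    """
    last_error = None
    for model in models:
        try:
            return model, process(audio_file, model)
        except RuntimeError as exc:
            last_error = exc  # fall through to the next, smaller model
    raise RuntimeError(f"all models failed: {last_error}")

# Stub that only has memory for the medium model and below
def fake_process(audio_file, model):
    if model in ("large-v2", "large"):
        raise RuntimeError("CUDA out of memory")
    return f"transcript via {model}"

print(transcribe_with_fallback(fake_process, "call.wav"))
```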

Next Steps