Speech to Text API
Audio transcription services using Whisper, Whisper Lite, and T-One models for converting speech to text.
Base URL: https://api.koveh.com/speech-to-text/
Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Service info |
| POST | /whisper-lite | Transcribe with Whisper Lite |
| POST | /whisper | Transcribe with Whisper |
| POST | /t-one | Transcribe with T-One |
| GET | /health | Service health check |
Authentication
All endpoints require Bearer token authentication:
```bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  "https://api.koveh.com/speech-to-text/whisper"
```
Whisper Transcription
Transcribe audio using OpenAI's Whisper model.
Endpoint: POST /whisper
Request Body
```json
{
  "audio_file": "base64_encoded_audio_or_file_path",
  "language": "en",
  "model": "whisper-1",
  "response_format": "json",
  "temperature": 0.0
}
```
Parameters
- audio_file (string, required): Base64-encoded audio data or file path
- language (string, optional): Language code (e.g., "en", "es", "fr"). Default: auto-detect
- model (string, optional): Whisper model to use. Default: "whisper-1"
- response_format (string, optional): Response format. Default: "json"
- temperature (number, optional): Sampling temperature (0-1). Default: 0.0
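The request body above can be assembled from raw audio bytes with a small helper. This is a sketch; the helper name and the convention of omitting "language" to trigger auto-detection are assumptions on top of the documented fields:

```python
import base64

def build_whisper_payload(audio_bytes, language=None, model="whisper-1",
                          response_format="json", temperature=0.0):
    """Assemble the JSON body for POST /whisper from raw audio bytes.

    Leaves out "language" when None so the service can auto-detect it.
    """
    payload = {
        "audio_file": base64.b64encode(audio_bytes).decode("ascii"),
        "model": model,
        "response_format": response_format,
        "temperature": temperature,
    }
    if language is not None:
        payload["language"] = language
    return payload
```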
Response
```json
{
  "text": "Hello, this is a test transcription of audio content.",
  "language": "en",
  "duration": 5.2,
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "text": "Hello, this is a test transcription of audio content."
    }
  ],
  "model": "whisper-1",
  "timestamp": "2025-08-30T09:19:31.245295"
}
```
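The `segments` array, with its start/end times in seconds, is enough to build caption files. A minimal sketch converting segments to SRT subtitles (the function names are illustrative, not part of the API):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render a list of {start, end, text} segments as an SRT document."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates cues
    return "\n".join(lines)
```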
Example Request
```bash
curl -X POST "https://api.koveh.com/speech-to-text/whisper" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "base64_encoded_audio_data",
    "language": "en"
  }'
```
Whisper Lite Transcription
Transcribe audio using the Whisper Lite model, a faster, lighter variant of Whisper.
Endpoint: POST /whisper-lite
Request Body
```json
{
  "audio_file": "base64_encoded_audio_or_file_path",
  "language": "en",
  "model": "whisper-lite",
  "response_format": "json"
}
```
Parameters
- audio_file (string, required): Base64-encoded audio data or file path
- language (string, optional): Language code. Default: auto-detect
- model (string, optional): Model to use. Default: "whisper-lite"
- response_format (string, optional): Response format. Default: "json"
Response
```json
{
  "text": "Hello, this is a test transcription using Whisper Lite.",
  "language": "en",
  "duration": 5.2,
  "model": "whisper-lite",
  "timestamp": "2025-08-30T09:19:31.245295"
}
```
Example Request
```bash
curl -X POST "https://api.koveh.com/speech-to-text/whisper-lite" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "base64_encoded_audio_data",
    "language": "en"
  }'
```
T-One Transcription
Transcribe audio using T-One model (specialized for certain languages/accents).
Endpoint: POST /t-one
Request Body
```json
{
  "audio_file": "base64_encoded_audio_or_file_path",
  "language": "en",
  "model": "t-one",
  "response_format": "json"
}
```
Parameters
- audio_file (string, required): Base64-encoded audio data or file path
- language (string, optional): Language code. Default: auto-detect
- model (string, optional): Model to use. Default: "t-one"
- response_format (string, optional): Response format. Default: "json"
Response
```json
{
  "text": "Hello, this is a test transcription using T-One model.",
  "language": "en",
  "duration": 5.2,
  "model": "t-one",
  "timestamp": "2025-08-30T09:19:31.245295"
}
```
Example Request
```bash
curl -X POST "https://api.koveh.com/speech-to-text/t-one" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "base64_encoded_audio_data",
    "language": "en"
  }'
```
Service Information
Get information about the speech-to-text service.
Endpoint: GET /
Response
```json
{
  "service": "Speech to Text API",
  "version": "1.0.0",
  "models": [
    {
      "name": "whisper",
      "description": "OpenAI Whisper model for high-quality transcription",
      "supported_languages": ["en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", "zh"],
      "max_audio_duration": 300
    },
    {
      "name": "whisper-lite",
      "description": "Lightweight Whisper model for faster transcription",
      "supported_languages": ["en", "es", "fr", "de"],
      "max_audio_duration": 300
    },
    {
      "name": "t-one",
      "description": "T-One model for specialized transcription",
      "supported_languages": ["en", "es", "fr"],
      "max_audio_duration": 300
    }
  ],
  "supported_formats": ["mp3", "wav", "m4a", "flac", "ogg"],
  "max_file_size": "25MB"
}
```
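Since each entry in `models` lists its `supported_languages`, the service-info response can drive model selection at runtime. A sketch (the helper name is illustrative; it assumes the response shape shown above):

```python
def pick_model(service_info, language):
    """Return the name of the first listed model that supports `language`, or None."""
    for model in service_info["models"]:
        if language in model["supported_languages"]:
            return model["name"]
    return None
```

For example, with the response above, asking for "ru" would select "whisper", since neither lighter model lists Russian.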
Example Request
```bash
curl -X GET "https://api.koveh.com/speech-to-text/" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
Health Check
Check service health status.
Endpoint: GET /health
Response
```json
{
  "status": "healthy",
  "timestamp": "2025-08-30T09:19:31.245295",
  "models_loaded": {
    "whisper": true,
    "whisper-lite": true,
    "t-one": true
  },
  "gpu_available": true
}
```
Example Request
```bash
curl -X GET "https://api.koveh.com/speech-to-text/health"
```
Integration Examples
Python Example
```python
import requests
import base64

def transcribe_audio(audio_file_path, model="whisper", language="en"):
    # Read and encode audio file
    with open(audio_file_path, "rb") as audio_file:
        audio_data = base64.b64encode(audio_file.read()).decode("utf-8")

    response = requests.post(
        f"https://api.koveh.com/speech-to-text/{model}",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "audio_file": audio_data,
            "language": language
        },
        timeout=60  # matches the API's 60-second request timeout
    )
    response.raise_for_status()
    return response.json()

# Transcribe audio file
result = transcribe_audio("audio.mp3", "whisper", "en")
print(f"Transcription: {result['text']}")
```
JavaScript Example
```javascript
async function transcribeAudio(audioFile, model = 'whisper', language = 'en') {
  // Convert file to base64
  const base64 = await new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result.split(',')[1]);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(audioFile);
  });

  const response = await fetch(`https://api.koveh.com/speech-to-text/${model}`, {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      audio_file: base64,
      language: language
    })
  });
  if (!response.ok) {
    throw new Error(`Transcription failed with status ${response.status}`);
  }
  return await response.json();
}

// Transcribe audio file
const audioFile = document.getElementById('audioFile').files[0];
transcribeAudio(audioFile, 'whisper', 'en')
  .then(result => console.log(`Transcription: ${result.text}`));
```
Real-time Transcription with WebSocket
```python
import asyncio
import websockets
import json
import base64

async def real_time_transcription():
    uri = "wss://api.koveh.com/speech-to-text/ws"
    async with websockets.connect(uri) as websocket:
        # Send audio chunks
        while True:
            # Get audio chunk (implement your audio capture logic)
            audio_chunk = get_audio_chunk()
            audio_base64 = base64.b64encode(audio_chunk).decode("utf-8")
            await websocket.send(json.dumps({
                "audio": audio_base64,
                "model": "whisper-lite"
            }))

            # Receive transcription
            response = await websocket.recv()
            result = json.loads(response)
            print(f"Real-time: {result['text']}")

# Run real-time transcription
asyncio.run(real_time_transcription())
```
Error Handling
The API returns standard error responses:
```json
{
  "error": "Invalid audio file format",
  "status_code": 400,
  "timestamp": "2025-08-30T09:19:31.245295"
}
```
Common error codes:
- 400: Bad Request (invalid audio format, file too large)
- 401: Unauthorized (missing or invalid API key)
- 404: Not Found (invalid endpoint)
- 413: Payload Too Large (audio file exceeds size limit)
- 500: Internal Server Error (transcription model error)
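Since error payloads carry an "error" message alongside the status code, a thin client-side check can turn them into typed exceptions. A sketch (the class and function names are illustrative, not part of the API):

```python
class TranscriptionError(Exception):
    """Raised when the API returns an error payload."""
    def __init__(self, status_code, message):
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code
        self.message = message

def check_response(status_code, body):
    """Return `body` on success; raise TranscriptionError on an error payload."""
    if 200 <= status_code < 300:
        return body
    raise TranscriptionError(status_code, body.get("error", "unknown error"))
```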
Rate Limiting
- Rate Limit: 100 requests per minute
- Concurrent Requests: 5 simultaneous requests
- Timeout: 60 seconds per request (for longer audio files)
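When the 100-requests-per-minute limit is hit, a capped exponential backoff between retries keeps clients well behaved. A minimal sketch of the delay schedule (the function name and defaults are assumptions, not API behavior):

```python
def retry_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles the base delay each attempt, capped so a long outage
    never produces unbounded waits: 1s, 2s, 4s, 8s, ... up to `cap`.
    """
    return min(cap, base * (2 ** attempt))
```

A client would sleep for `retry_delay(attempt)` after each rate-limited response before retrying the request.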
Best Practices
- Audio Quality: Use high-quality audio (16kHz+ sample rate, clear speech)
- File Size: Keep audio files under 25MB for optimal performance
- Language Detection: Let the model auto-detect language when possible
- Model Selection: Use Whisper Lite for faster processing, Whisper for higher accuracy
- Error Handling: Always check for error responses and handle timeouts
- Caching: Cache transcriptions for repeated audio content
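The caching advice above can be implemented by keying on a hash of the audio content plus the model and language, so identical audio is never transcribed twice. A sketch with an in-memory store (the class is illustrative; a real deployment might back it with Redis or disk):

```python
import hashlib

class TranscriptionCache:
    """In-memory cache keyed by audio content, model, and language."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(audio_bytes, model, language):
        # SHA-256 of the audio so identical content maps to the same entry
        digest = hashlib.sha256(audio_bytes).hexdigest()
        return f"{model}:{language}:{digest}"

    def get_or_transcribe(self, audio_bytes, model, language, transcribe):
        """Return a cached result, calling `transcribe` only on a miss."""
        k = self.key(audio_bytes, model, language)
        if k not in self._store:
            self._store[k] = transcribe(audio_bytes, model, language)
        return self._store[k]
```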
Supported Audio Formats
- MP3: Most common, good compression
- WAV: Uncompressed, high quality
- M4A: Apple format, good compression
- FLAC: Lossless compression
- OGG: Open source format
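Audio in a format outside this list can be converted to 16 kHz mono WAV first, matching the sample-rate guidance above. A sketch that builds the ffmpeg command line (assumes ffmpeg is installed and on PATH; run the returned list with `subprocess.run(cmd, check=True)`):

```python
def ffmpeg_to_wav_cmd(src_path, dst_path):
    """ffmpeg invocation converting any supported input to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", src_path,    # input audio in any format ffmpeg understands
        "-ar", "16000",    # resample to 16 kHz
        "-ac", "1",        # downmix to mono
        dst_path,
    ]
```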
Use Cases
- Meeting Transcription: Convert meeting recordings to text
- Podcast Processing: Transcribe podcast episodes
- Voice Notes: Convert voice memos to text
- Accessibility: Provide captions for video content
- Language Learning: Transcribe speech for language practice
- Content Creation: Generate text from audio interviews
- Customer Service: Transcribe customer calls for analysis