Speech to Text
Convert audio files or streams into text using advanced speech recognition models from multiple providers.
What is the Speech to Text Node?
The Speech to Text (STT) node is a unified interface to multiple leading speech recognition services. It transcribes audio into text with high accuracy using models from OpenAI, AWS (Amazon Transcribe), and ElevenLabs. Each provider offers different capabilities, output formats, and language support, so you can choose the best option for your transcription needs.
Supported Models
OpenAI Models
- GPT-4o Transcribe: Advanced transcription with superior accuracy and understanding
- GPT-4o Mini Transcribe: Fast, cost-effective transcription for high-volume needs
- Whisper: Robust multilingual speech recognition with excellent accuracy
Amazon AWS Models
- AWS Transcribe: Enterprise-grade transcription with extensive language support
ElevenLabs Models
- ElevenLabs Scribe v1: High-quality transcription with speaker diarization and audio event tagging
How to use it?
- Add the Speech to Text node: Drag and drop the Speech to Text node into your workflow from the Speech category.
- Select Your Model: Choose from the available providers and models based on your requirements:
- For accuracy: Use OpenAI GPT-4o Transcribe
- For speed: Use OpenAI GPT-4o Mini Transcribe or ElevenLabs Scribe v1
- For enterprise: Use AWS Transcribe
- For multilingual: All models support multiple languages
- Configure Credentials: Select appropriate credentials based on your chosen model:
- OpenAI models require OpenAI API credentials
- AWS Transcribe requires AWS credentials
- ElevenLabs models require an ElevenLabs API Key
- Provider-Specific Configuration:
- OpenAI Configuration (GPT-4o, GPT-4o Mini, Whisper): a code-level sketch of these options follows this list
- Temperature: Control randomness in transcription (0-2, default: 0.7)
- Lower values (0-0.3): More focused and deterministic output
- Medium values (0.4-1.0): Balanced approach
- Higher values (1.1-2.0): More creative interpretation
- Prompt (Optional): Provide context or specific terminology to improve accuracy
- Output Format:
- Text: Plain text transcription
- JSON: Structured data with timestamps and segments
- Verbose JSON (Whisper only): Detailed metadata including word-level timestamps
- SRT (Whisper only): Subtitle format with timestamps
- VTT (Whisper only): WebVTT subtitle format
- AWS Transcribe Configuration
- Region: Select AWS region for processing
- Temperature: Control transcription behavior (0-2, default: 0.7)
- Output: Plain text transcription
- ElevenLabs Scribe Configuration
- Diarize (Optional): Enable speaker diarization to identify different speakers (default: true)
- Tag Audio Events (Optional): Detect and tag non-speech audio like laughter, applause (default: true)
- Timestamp Granularity: Choose between word-level or character-level timestamps (default: word)
- Language (Optional): Specify the audio language or use automatic language detection (default: auto)
- Supports 80+ languages including major world languages
- Output Format:
- Text: Plain text transcription
- JSON: Structured data with speakers and timestamps
- DOCX: Microsoft Word document
- SRT: Subtitle format
- PDF: Formatted PDF document
- TXT: Plain text file
- HTML: Web-formatted document
- Connect Audio Input: Connect an audio file source to provide the content you want to transcribe:
- Use File Reader to read audio files from storage
- Connect directly from audio-generating nodes
- Supported formats: MP3, WAV, M4A, FLAC, OGG, and more
- Connect Output: The transcription output can be connected to:
- Text Output to display the transcription
- File Writer to save the transcription
- Other nodes for further text processing
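For reference, the OpenAI options above map directly onto OpenAI's audio transcription API. The following is a minimal sketch, assuming the official openai Python SDK and an OPENAI_API_KEY environment variable; the file name and prompt are placeholders:

```python
from openai import OpenAI  # official OpenAI Python SDK (openai>=1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local audio file with the same options the node exposes:
# model, temperature, context prompt, and output format.
with open("meeting.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # or "gpt-4o-transcribe" / "whisper-1"
        file=audio_file,
        temperature=0.3,  # lower values give more deterministic output
        prompt="Team standup about cloud infrastructure",  # optional context
        response_format="text",  # the GPT-4o models return "text" or "json"
    )

print(transcript)  # with response_format="text", the result is a plain string
```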
Example Task: Transcribing Meeting Recordings
Objective: Convert meeting audio recordings into text transcripts with speaker identification.
Step-by-Step Setup
- Add a File Reader:
- Drag and drop a File Reader node into your workflow
- Storage Provider: Select AWS S3
- File Path: Enter meetings/team-meeting-2024-11.mp3
- Bucket Name: Enter your S3 bucket name
- Region: Select your AWS region
- Add and Configure Speech to Text (a code-level sketch follows these steps):
- Drag the Speech to Text node into your workflow
- Select Model: Choose "ElevenLabs Scribe v1" for speaker diarization
- Select Credentials: Use Nocodo Managed Credentials
- Enable Diarize: Set to true to identify different speakers
- Tag Audio Events: Set to true to capture laughter, applause, etc.
- Timestamp Granularity: Select "word" for detailed timestamps
- Language: Set to "auto" for automatic detection or specify (e.g., "eng" for English)
- Output Format: Select "JSON" for structured output with speaker information
- Connect File Reader to STT:
- Connect the file output from the File Reader to the audio input of the STT node
- Add File Writer:
- Drag a File Writer node to save the transcription
- Storage Provider: Select AWS S3
- File Path: Enter transcripts/team-meeting-2024-11.json
- Bucket Name: Enter your S3 bucket name
- Connect STT to File Writer:
- Connect the transcription output from the STT node to the file input of the File Writer
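Outside the workflow builder, the same Scribe transcription can be reproduced with a direct HTTP request. Below is a rough sketch using the requests library; the endpoint path, model identifier, and form field names are assumptions based on the options described above, so verify them against the ElevenLabs API reference:

```python
import json
import requests

ELEVENLABS_API_KEY = "your-elevenlabs-api-key"  # placeholder credential

with open("team-meeting-2024-11.mp3", "rb") as audio_file:
    response = requests.post(
        "https://api.elevenlabs.io/v1/speech-to-text",  # assumed endpoint
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        files={"file": audio_file},
        data={
            "model_id": "scribe_v1",     # assumed Scribe v1 identifier
            "diarize": "true",           # label different speakers
            "tag_audio_events": "true",  # mark laughter, applause, etc.
            "language_code": "eng",      # omit to use automatic detection
        },
    )

response.raise_for_status()
result = response.json()

# Save the structured transcript, mirroring the File Writer step above.
with open("team-meeting-2024-11.json", "w", encoding="utf-8") as out:
    json.dump(result, out, indent=2)
```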
Example Task: Creating Subtitles from Video
Objective: Generate subtitle files from video audio for accessibility.
Step-by-Step Setup
- Extract and Prepare Audio:
- Use a File Reader to load your video's audio track
- Ensure the audio is in a supported format (MP3, WAV, etc.)
- Configure Speech to Text for Subtitles (see the sketch after these steps):
- Add the Speech to Text node
- Select Model: Choose "OpenAI Whisper" for excellent subtitle generation
- Select Credentials: Choose your OpenAI credentials
- Temperature: Set to 0.5 for balanced accuracy
- Prompt: Enter context like "This is a tutorial video about programming"
- Output Format: Select "SRT" or "VTT" for subtitle files
- Save Subtitle File:
- Connect the SRT/VTT output to a File Writer
- Save with appropriate extension (.srt or .vtt)
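For comparison, the same subtitle generation as a direct Whisper API call might look like this sketch (official openai Python SDK assumed; file names and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Whisper can return subtitle formats directly ("srt" or "vtt").
with open("tutorial-audio.mp3", "rb") as audio_file:
    srt_subtitles = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        temperature=0.5,
        prompt="This is a tutorial video about programming",
        response_format="srt",  # use "vtt" for WebVTT output
    )

# The response is the subtitle file contents as plain text.
with open("tutorial.srt", "w", encoding="utf-8") as out:
    out.write(srt_subtitles)
```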
Example Task: Podcast Transcription
Objective: Create searchable text transcripts of podcast episodes.
Step-by-Step Setup
- Load Podcast Audio:
- Use File Reader to load your podcast audio file
- Recommended format: MP3 or M4A
- Configure for Long-Form Content:
- Add the Speech to Text node
- Select Model: Choose "OpenAI GPT-4o Transcribe" for high accuracy
- Temperature: Set to 0.3 for consistent, accurate transcription
- Prompt: Include podcast context like "Podcast about technology and startups"
- Output Format: Select "JSON" for structured output with timestamps
- Process and Format:
- Connect output to text processing nodes if needed
- Use File Writer to save as searchable text or JSON
Cost Optimization Tips
- Model Selection:
- Use GPT-4o Mini Transcribe for high-volume, cost-sensitive applications
- Use ElevenLabs Scribe v1 for features like diarization without switching services
- Reserve GPT-4o Transcribe for content requiring highest accuracy
- Audio Preprocessing (see the sketch after this list):
- Compress audio files to appropriate quality (e.g., 128kbps MP3)
- Remove silence and non-speech portions before transcription
- Split very long recordings into manageable segments
- Prompt Optimization:
- Provide context prompts to reduce errors and re-processing
- Include technical terms, names, or specific vocabulary
- Keep prompts concise but informative
- Output Format:
- Use plain text when timestamps aren't needed
- Choose appropriate granularity for timestamps
- Disable features like diarization if not required
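To illustrate the preprocessing tips above, here is a minimal sketch using the pydub library (which requires ffmpeg) to compress a recording to 128 kbps MP3 and split it into ten-minute segments; file names are placeholders:

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg installed)

audio = AudioSegment.from_file("long_recording.wav")

# Re-encode as 128 kbps MP3 to reduce file size before transcription.
audio.export("long_recording_128k.mp3", format="mp3", bitrate="128k")

# Split very long recordings into ten-minute segments.
segment_ms = 10 * 60 * 1000  # pydub works in milliseconds
for index, start in enumerate(range(0, len(audio), segment_ms)):
    chunk = audio[start:start + segment_ms]
    chunk.export(f"segment_{index:03d}.mp3", format="mp3", bitrate="128k")
```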
Required AWS IAM Roles and Permissions
When using AWS Transcribe, ensure your IAM user has the following permissions:
- transcribe:StartTranscriptionJob
- transcribe:GetTranscriptionJob
- transcribe:DeleteTranscriptionJob
- s3:GetObject (for input audio)
- s3:PutObject (for output transcripts)
When reading from/writing to S3:
- s3:GetObject
- s3:PutObject
- s3:ListBucket
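As a sketch, an IAM policy granting these permissions could look like the following; the bucket name is a placeholder, and you should scope the resources to your own buckets:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "transcribe:StartTranscriptionJob",
        "transcribe:GetTranscriptionJob",
        "transcribe:DeleteTranscriptionJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-audio-bucket",
        "arn:aws:s3:::your-audio-bucket/*"
      ]
    }
  ]
}
```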
Supported Audio Formats
The Speech to Text node supports various audio formats including:
- MP3: Compressed audio, widely compatible
- WAV: Uncompressed audio, highest quality
- M4A: Apple audio format
- FLAC: Lossless compression
- OGG: Open-source audio format
- WebM: Web-optimized format
- MP4: Video container (audio extracted)
Output Format Details
JSON Output Structure
```json
{
  "text": "Full transcription text",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.2,
      "text": "Segment text",
      "speaker": "Speaker 1"
    }
  ],
  "language": "en",
  "duration": 180.5
}
```
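This structure makes downstream processing straightforward. As a small sketch (assuming a transcript saved in this format, for example by the File Writer in the meeting example above), the following prints each segment with its speaker and timestamps:

```python
import json

# Load a transcript saved in the JSON structure shown above.
with open("transcripts/team-meeting-2024-11.json", encoding="utf-8") as f:
    transcript = json.load(f)

print(f"Language: {transcript['language']}, duration: {transcript['duration']}s")

for segment in transcript["segments"]:
    speaker = segment.get("speaker", "Unknown")
    print(f"[{segment['start']:.1f}-{segment['end']:.1f}] {speaker}: {segment['text']}")
```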
SRT Subtitle Format
```
1
00:00:00,000 --> 00:00:05,200
First subtitle text

2
00:00:05,200 --> 00:00:10,400
Second subtitle text
```
Useful Resources
- OpenAI Whisper Documentation: Official OpenAI speech-to-text guide
- AWS Transcribe Documentation: Complete AWS Transcribe service documentation
- ElevenLabs Scribe Documentation: ElevenLabs speech-to-text API guide
- File Reader Node Documentation: Learn how to load audio files
- File Writer Node Documentation: Learn how to save transcriptions
Troubleshooting
Common Issues
- Low Transcription Accuracy:
- Check audio quality and reduce background noise
- Provide context prompts with technical terms
- Ensure correct language is specified
- Try lowering temperature for more focused output
- Missing Words or Segments:
- Verify audio file is not corrupted
- Check for excessive silence or very quiet sections
- Ensure audio format is supported
- Try a different model for comparison
- Speaker Diarization Errors:
- Ensure clear audio separation between speakers
- Avoid overlapping speech when possible
- Review speaker labels in JSON output and correct as needed
- Consider using higher-quality audio input
- Timestamp Synchronization Issues:
- Verify audio file duration matches expected length
- Check for variable playback speed in source audio
- Ensure consistent audio sampling rate
- Test with different timestamp granularity settings
- API Errors or Timeouts:
- Verify credentials are valid and active
- Check API usage limits and quotas
- Ensure proper IAM permissions for AWS services
- Split very long audio files into smaller segments
- Check file size limits for your chosen provider
- Language Detection Failures:
- Specify language explicitly instead of using auto-detection
- Ensure sufficient speech content in audio
- Verify language is supported by chosen model
- Check for mixed-language content that may confuse detection
Advanced Features
Context-Aware Transcription
Provide detailed prompts to improve accuracy for specialized content:
Prompt examples:
- "Medical consultation discussing patient symptoms and treatment"
- "Technical presentation about cloud computing and AWS services"
- "Legal deposition with multiple speakers and formal language"
By following these guidelines and leveraging the appropriate model for your use case, you can create accurate transcriptions for any application, from meeting notes and podcast transcripts to video subtitles and accessibility features.