Speech to Text
Convert audio files or streams into text using advanced speech recognition models from multiple providers.
What is the Speech to Text Node?
The Speech to Text (STT) node is a unified interface to multiple leading speech recognition services. It transcribes audio into text with high accuracy using models from OpenAI, AWS (Amazon Transcribe), and ElevenLabs. Each provider offers different capabilities, output formats, and language support, so you can choose the best option for your transcription needs.
Supported Models
OpenAI Models
- GPT-4o Transcribe: Advanced transcription with superior accuracy and understanding
- GPT-4o Mini Transcribe: Fast, cost-effective transcription for high-volume needs
- Whisper: Robust multilingual speech recognition with excellent accuracy
Amazon AWS Models
- AWS Transcribe: Enterprise-grade transcription with extensive language support
ElevenLabs Models
- ElevenLabs Scribe v1: High-quality transcription with speaker diarization and audio event tagging
How to use it?
- Add the Speech to Text node: Drag and drop the Speech to Text node into your workflow from the Speech category.
- Select Your Model: Choose from the available providers and models based on your requirements:
- For accuracy: Use OpenAI GPT-4o Transcribe
- For speed: Use OpenAI GPT-4o Mini Transcribe or ElevenLabs Scribe v1
- For enterprise: Use AWS Transcribe
- For multilingual: All models support multiple languages
- Configure Credentials: Select appropriate credentials based on your chosen model:
- OpenAI models require OpenAI API credentials
- AWS Transcribe requires AWS credentials
- ElevenLabs models require an ElevenLabs API Key
- Provider-Specific Configuration:
- OpenAI Configuration (GPT-4o, GPT-4o Mini, Whisper): a code-level sketch of these options follows this list
- Temperature: Control randomness in transcription (0-2, default: 0.7)
- Lower values (0-0.3): More focused and deterministic output
- Medium values (0.4-1.0): Balanced approach
- Higher values (1.1-2.0): More creative interpretation
- Prompt (Optional): Provide context or specific terminology to improve accuracy
- Output Format:
- Text: Plain text transcription
- JSON: Structured data with timestamps and segments
- Verbose JSON (Whisper only): Detailed metadata including word-level timestamps
- SRT (Whisper only): Subtitle format with timestamps
- VTT (Whisper only): WebVTT subtitle format
- AWS Transcribe Configuration
- Region: Select AWS region for processing
- Temperature: Control transcription behavior (0-2, default: 0.7)
- Output: Plain text transcription
- ElevenLabs Scribe Configuration
- Diarize (Optional): Enable speaker diarization to identify different speakers (default: true)
- Tag Audio Events (Optional): Detect and tag non-speech audio like laughter, applause (default: true)
- Timestamp Granularity: Choose between word-level or character-level timestamps (default: word)
- Language (Optional): Specify the audio language or use automatic language detection (default: auto)
- Supports 80+ languages including major world languages
- Output Format:
- Text: Plain text transcription
- JSON: Structured data with speakers and timestamps
- DOCX: Microsoft Word document
- SRT: Subtitle format
- PDF: Formatted PDF document
- TXT: Plain text file
- HTML: Web-formatted document
- Connect Audio Input: Connect an audio file source to provide the content you want to transcribe:
- Use File Reader to read audio files from storage
- Connect directly from audio-generating nodes
- Supported formats: MP3, WAV, M4A, FLAC, OGG, and more
- Connect Output: The transcription output can be connected to:
- Text Output to display the transcription
- File Writer to save the transcription
- Other nodes for further text processing
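For reference, the OpenAI options above map directly onto OpenAI's audio transcription API. The following is a minimal sketch, assuming the official openai Python SDK and an OPENAI_API_KEY environment variable; the file name and prompt are placeholders:

```python
from openai import OpenAI  # official OpenAI Python SDK (openai>=1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local audio file with the same options the node exposes:
# model, temperature, context prompt, and output format.
with open("meeting.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # or "gpt-4o-transcribe" / "whisper-1"
        file=audio_file,
        temperature=0.3,  # lower values give more deterministic output
        prompt="Team standup about cloud infrastructure",  # optional context
        response_format="text",  # the GPT-4o models return "text" or "json"
    )

print(transcript)  # with response_format="text", the result is a plain string
```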
Example Task: Transcribing Meeting Recordings
Objective: Convert meeting audio recordings into text transcripts with speaker identification.
Step-by-Step Setup
- Add a File Reader:
- Drag and drop a File Reader node into your workflow
- Storage Provider: Select AWS S3
- File Path: Enter meetings/team-meeting-2024-11.mp3
- Bucket Name: Enter your S3 bucket name
- Region: Select your AWS region
- Add and Configure Speech to Text (a code-level sketch follows these steps):
- Drag the Speech to Text node into your workflow
- Select Model: Choose "ElevenLabs Scribe v1" for speaker diarization
- Select Credentials: Use Nocodo Managed Credentials
- Enable Diarize: Set to true to identify different speakers
- Tag Audio Events: Set to true to capture laughter, applause, etc.
- Timestamp Granularity: Select "word" for detailed timestamps
- Language: Set to "auto" for automatic detection or specify (e.g., "eng" for English)
- Output Format: Select "JSON" for structured output with speaker information
- Connect File Reader to STT:
- Connect the file output from the File Reader to the audio input of the STT node
- Add File Writer:
- Drag a File Writer node to save the transcription
- Storage Provider: Select AWS S3
- File Path: Enter transcripts/team-meeting-2024-11.json
- Bucket Name: Enter your S3 bucket name
- Connect STT to File Writer:
- Connect the transcription output from the STT node to the file input of the File Writer
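Outside the workflow builder, the same Scribe transcription can be reproduced with a direct HTTP request. Below is a rough sketch using the requests library; the endpoint path, model identifier, and form field names are assumptions based on the options described above, so verify them against the ElevenLabs API reference:

```python
import json
import requests

ELEVENLABS_API_KEY = "your-elevenlabs-api-key"  # placeholder credential

with open("team-meeting-2024-11.mp3", "rb") as audio_file:
    response = requests.post(
        "https://api.elevenlabs.io/v1/speech-to-text",  # assumed endpoint
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        files={"file": audio_file},
        data={
            "model_id": "scribe_v1",     # assumed Scribe v1 identifier
            "diarize": "true",           # label different speakers
            "tag_audio_events": "true",  # mark laughter, applause, etc.
            "language_code": "eng",      # omit to use automatic detection
        },
    )

response.raise_for_status()
result = response.json()

# Save the structured transcript, mirroring the File Writer step above.
with open("team-meeting-2024-11.json", "w", encoding="utf-8") as out:
    json.dump(result, out, indent=2)
```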
Example Task: Creating Subtitles from Video
Objective: Generate subtitle files from video audio for accessibility.
Step-by-Step Setup
- Extract and Prepare Audio:
- Use a File Reader to load your video's audio track
- Ensure the audio is in a supported format (MP3, WAV, etc.)
- Configure Speech to Text for Subtitles (see the sketch after these steps):
- Add the Speech to Text node
- Select Model: Choose "OpenAI Whisper" for excellent subtitle generation
- Select Credentials: Choose your OpenAI credentials
- Temperature: Set to 0.5 for balanced accuracy
- Prompt: Enter context like "This is a tutorial video about programming"
- Output Format: Select "SRT" or "VTT" for subtitle files
- Save Subtitle File:
- Connect the SRT/VTT output to a File Writer
- Save with appropriate extension (.srt or .vtt)
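For comparison, the same subtitle generation as a direct Whisper API call might look like this sketch (official openai Python SDK assumed; file names and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Whisper can return subtitle formats directly ("srt" or "vtt").
with open("tutorial-audio.mp3", "rb") as audio_file:
    srt_subtitles = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        temperature=0.5,
        prompt="This is a tutorial video about programming",
        response_format="srt",  # use "vtt" for WebVTT output
    )

# The response is the subtitle file contents as plain text.
with open("tutorial.srt", "w", encoding="utf-8") as out:
    out.write(srt_subtitles)
```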
Example Task: Podcast Transcription
Objective: Create searchable text transcripts of podcast episodes.
Step-by-Step Setup
- Load Podcast Audio:
- Use File Reader to load your podcast audio file
- Recommended format: MP3 or M4A
- Configure for Long-Form Content:
- Add the Speech to Text node
- Select Model: Choose "OpenAI GPT-4o Transcribe" for high accuracy
- Temperature: Set to 0.3 for consistent, accurate transcription
- Prompt: Include podcast context like "Podcast about technology and startups"
- Output Format: Select "JSON" for structured output with timestamps
- Process and Format:
- Connect output to text processing nodes if needed
- Use File Writer to save as searchable text or JSON
Cost Optimization Tips
- Model Selection:
- Use GPT-4o Mini Transcribe for high-volume, cost-sensitive applications
- Use ElevenLabs Scribe v1 for features like diarization without switching services
- Reserve GPT-4o Transcribe for content requiring highest accuracy
- Audio Preprocessing (see the sketch after this list):
- Compress audio files to appropriate quality (e.g., 128kbps MP3)
- Remove silence and non-speech portions before transcription
- Split very long recordings into manageable segments
- Prompt Optimization:
- Provide context prompts to reduce errors and re-processing
- Include technical terms, names, or specific vocabulary
- Keep prompts concise but informative
- Output Format:
- Use plain text when timestamps aren't needed
- Choose appropriate granularity for timestamps
- Disable features like diarization if not required
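To illustrate the preprocessing tips above, here is a minimal sketch using the pydub library (which requires ffmpeg) to compress a recording to 128 kbps MP3 and split it into ten-minute segments; file names are placeholders:

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg installed)

audio = AudioSegment.from_file("long_recording.wav")

# Re-encode as 128 kbps MP3 to reduce file size before transcription.
audio.export("long_recording_128k.mp3", format="mp3", bitrate="128k")

# Split very long recordings into ten-minute segments.
segment_ms = 10 * 60 * 1000  # pydub works in milliseconds
for index, start in enumerate(range(0, len(audio), segment_ms)):
    chunk = audio[start:start + segment_ms]
    chunk.export(f"segment_{index:03d}.mp3", format="mp3", bitrate="128k")
```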
Required AWS IAM Roles and Permissions
When using AWS Transcribe, ensure your IAM user has the following permissions:
- transcribe:StartTranscriptionJob
- transcribe:GetTranscriptionJob
- transcribe:DeleteTranscriptionJob
- s3:GetObject (for input audio)
- s3:PutObject (for output transcripts)
When reading from/writing to S3:
- s3:GetObject
- s3:PutObject
- s3:ListBucket
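As a sketch, an IAM policy granting these permissions could look like the following; the bucket name is a placeholder, and you should scope the resources to your own buckets:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "transcribe:StartTranscriptionJob",
        "transcribe:GetTranscriptionJob",
        "transcribe:DeleteTranscriptionJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-audio-bucket",
        "arn:aws:s3:::your-audio-bucket/*"
      ]
    }
  ]
}
```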
Supported Audio Formats
The Speech to Text node supports various audio formats including:
- MP3: Compressed audio, widely compatible
- WAV: Uncompressed audio, highest quality
- M4A: Apple audio format
- FLAC: Lossless compression
- OGG: Open-source audio format
- WebM: Web-optimized format
- MP4: Video container (audio extracted)
Output Format Details
JSON Output Structure
```json
{
  "text": "Full transcription text",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.2,
      "text": "Segment text",
      "speaker": "Speaker 1"
    }
  ],
  "language": "en",
  "duration": 180.5
}
```
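This structure makes downstream processing straightforward. As a small sketch (assuming a transcript saved in this format, for example by the File Writer in the meeting example above), the following prints each segment with its speaker and timestamps:

```python
import json

# Load a transcript saved in the JSON structure shown above.
with open("transcripts/team-meeting-2024-11.json", encoding="utf-8") as f:
    transcript = json.load(f)

print(f"Language: {transcript['language']}, duration: {transcript['duration']}s")

for segment in transcript["segments"]:
    speaker = segment.get("speaker", "Unknown")
    print(f"[{segment['start']:.1f}-{segment['end']:.1f}] {speaker}: {segment['text']}")
```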
SRT Subtitle Format
```
1
00:00:00,000 --> 00:00:05,200
First subtitle text

2
00:00:05,200 --> 00:00:10,400
Second subtitle text
```
Useful Resources
- OpenAI Whisper Documentation: Official OpenAI speech-to-text guide
- AWS Transcribe Documentation: Complete AWS Transcribe service documentation
- ElevenLabs Scribe Documentation: ElevenLabs speech-to-text API guide
- File Reader Node Documentation: Learn how to load audio files
- File Writer Node Documentation: Learn how to save transcriptions
Troubleshooting
Common Issues
- Low Transcription Accuracy:
- Check audio quality and reduce background noise
- Provide context prompts with technical terms
- Ensure correct language is specified
- Try lowering temperature for more focused output
- Missing Words or Segments:
- Verify audio file is not corrupted
- Check for excessive silence or very quiet sections
- Ensure audio format is supported
- Try a different model for comparison
- Speaker Diarization Errors:
- Ensure clear audio separation between speakers
- Avoid overlapping speech when possible
- Review speaker labels in JSON output and correct as needed
- Consider using higher-quality audio input
- Timestamp Synchronization Issues:
- Verify audio file duration matches expected length
- Check for variable playback speed in source audio
- Ensure consistent audio sampling rate
- Test with different timestamp granularity settings
- API Errors or Timeouts:
- Verify credentials are valid and active
- Check API usage limits and quotas
- Ensure proper IAM permissions for AWS services
- Split very long audio files into smaller segments
- Check file size limits for your chosen provider
- Language Detection Failures:
- Specify language explicitly instead of using auto-detection
- Ensure sufficient speech content in audio
- Verify language is supported by chosen model
- Check for mixed-language content that may confuse detection
Advanced Features
Context-Aware Transcription
Provide detailed prompts to improve accuracy for specialized content:
Prompt examples:
- "Medical consultation discussing patient symptoms and treatment"
- "Technical presentation about cloud computing and AWS services"
- "Legal deposition with multiple speakers and formal language"
By following these guidelines and leveraging the appropriate model for your use case, you can create accurate transcriptions for any application, from meeting notes and podcast transcripts to video subtitles and accessibility features.