What does it do?
The Audio – Transcribe connector converts audio or video files into text using Whisper speech-to-text models. It provides options to tune silence detection, chunking, and model selection, which makes it suitable for transcribing interviews, meetings, lectures, or other spoken content.
How do I use it?
To use the Audio – Transcribe connector in your workflow:
- Add the Audio – Transcribe Connector
  - Insert the node into your pipeline where you want to transcribe audio or video files
- Connect Input
  - Connect the input lane (usually audio or video) to your audio or video source
- Configure Parameters
  - Adjust the transcription model, silence detection, chunking, and VAD level as needed (see tables below)
- Connect Output
  - The connector outputs the transcribed text for further processing, analysis, or storage
Configuration Parameters
Customize the transcription process with the following parameters:
| Parameter | Description | Effect/Usage |
|---|---|---|
| Model | The Whisper model to use for transcription (see model options below) | Controls speed and accuracy of transcription; larger models are slower but more accurate |
| Silence Threshold | The silence threshold (in seconds) used to detect silence in speech | Lower values are more sensitive to silence; higher values may treat more audio as speech |
| Minimum Seconds | The minimum number of seconds of audio to process in a batch and to search for silence | Controls the minimum chunk size for processing |
| Maximum Seconds | The maximum number of seconds of audio to buffer and process at once | Controls the maximum chunk size for processing |
| VAD Level | Voice Activity Detection (VAD) level for silence detection (see VAD options below) | Controls how aggressively the system filters out non-speech and background noise |
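
As a rough illustration of how these parameters relate to one another, the sketch below groups them in a Python dataclass and checks the constraints the table implies. The class name `TranscribeSettings`, the field names, the numeric VAD levels, and the default values are all hypothetical; they are not the connector's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical names and defaults, chosen only for illustration.
VALID_MODELS = ("tiny", "base", "small", "medium", "large")
VALID_VAD_LEVELS = (0, 1, 2, 3)  # assumed: 0 = most permissive, 3 = most aggressive


@dataclass(frozen=True)
class TranscribeSettings:
    model: str = "base"
    silence_threshold: float = 0.5  # seconds
    min_seconds: float = 5.0        # smallest chunk to process
    max_seconds: float = 30.0       # largest chunk to buffer
    vad_level: int = 2

    def __post_init__(self) -> None:
        # Reject combinations that cannot describe a valid chunking setup.
        if self.model not in VALID_MODELS:
            raise ValueError(f"unknown model: {self.model!r}")
        if self.vad_level not in VALID_VAD_LEVELS:
            raise ValueError(f"VAD level must be one of {VALID_VAD_LEVELS}")
        if self.silence_threshold <= 0:
            raise ValueError("silence threshold must be positive")
        if not 0 < self.min_seconds <= self.max_seconds:
            raise ValueError("need 0 < min_seconds <= max_seconds")
```

Validating the minimum against the maximum up front catches the most likely misconfiguration (a minimum chunk longer than the maximum buffer) before any audio is processed.
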
Model Options
Choose the Whisper model that best balances speed and accuracy for your needs:
| Option | Description | Performance |
|---|---|---|
| tiny | Fastest, least accurate | Quickest processing, basic accuracy |
| base | Fast, low accuracy | Good speed, somewhat better accuracy than tiny |
| small | Medium speed and accuracy | Balanced performance |
| medium | Slower, high accuracy | Better accuracy, longer processing |
| large | Slowest, highest accuracy | Best accuracy, longest processing time |
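
To show what the speed and accuracy trade-off looks like in code, here is a minimal sketch that picks a model name from the table and, optionally, runs it with the open-source `openai-whisper` Python package. The helper names `pick_model` and `transcribe_file` are hypothetical, and the connector may invoke Whisper differently; this only illustrates the model choice.

```python
# Model names from the table above, ordered fastest to most accurate.
MODELS_FAST_TO_ACCURATE = ["tiny", "base", "small", "medium", "large"]


def pick_model(accuracy_rank: int) -> str:
    """Return a model name for a rank from 0 (fastest) to 4 (most accurate).

    Out-of-range ranks are clamped to the nearest valid model.
    """
    clamped = max(0, min(accuracy_rank, len(MODELS_FAST_TO_ACCURATE) - 1))
    return MODELS_FAST_TO_ACCURATE[clamped]


def transcribe_file(path: str, model_name: str) -> str:
    """Transcribe one file with the `openai-whisper` package.

    Assumption: the package and ffmpeg are installed. This is not
    necessarily how the connector calls Whisper internally.
    """
    import whisper  # imported lazily so pick_model works without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"]
```

For example, `transcribe_file("meeting.wav", pick_model(0))` would produce a quick draft with `tiny`, while `pick_model(4)` selects `large` for a final transcript (`meeting.wav` is a placeholder path).
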
VAD Level Options
Voice Activity Detection (VAD) controls how the system distinguishes speech from silence and noise:
| Option | Description | Behavior |
|---|---|---|
| Most permissive | Detects the most audio as speech | May include background noise as speech |
| Slightly more aggressive | Skips minor background noise | Filters light background sounds |
| Balanced | Moderate filtering of non-speech | Default in many tools |
| Most aggressive | Filters aggressively | May cut off quiet or short speech |
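
The four levels resemble the aggressiveness modes 0 to 3 of the WebRTC voice activity detector. The sketch below assumes that correspondence, which this document does not confirm, and uses the `webrtcvad` Python package to classify a single audio frame. The enum `VadLevel` and the function `frame_is_speech` are hypothetical names.

```python
from enum import IntEnum


class VadLevel(IntEnum):
    # Assumed mapping of the table's options onto WebRTC VAD modes 0-3.
    MOST_PERMISSIVE = 0
    SLIGHTLY_MORE_AGGRESSIVE = 1
    BALANCED = 2
    MOST_AGGRESSIVE = 3


def frame_is_speech(frame: bytes, level: VadLevel, sample_rate: int = 16000) -> bool:
    """Classify one frame of 16-bit mono PCM audio.

    webrtcvad expects frames of 10, 20, or 30 ms at a sample rate of
    8000, 16000, 32000, or 48000 Hz.
    """
    import webrtcvad  # imported lazily; requires `pip install webrtcvad`

    vad = webrtcvad.Vad(int(level))
    return vad.is_speech(frame, sample_rate)
```

At 16000 Hz, a 30 ms frame is 480 samples, or 960 bytes of 16-bit PCM. Running the same recording through `VadLevel.MOST_PERMISSIVE` and `VadLevel.MOST_AGGRESSIVE` is a simple way to see how much quiet speech the stricter setting drops.
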
Example Use Cases
- Transcribe meeting recordings for searchable notes
- Convert podcasts or interviews into text for analysis
- Generate subtitles or captions for video content
- Create searchable archives of recorded lectures or presentations
- Process customer service calls for quality analysis
- Extract text from voicemails or audio messages
- Prepare audio content for accessibility compliance
Best Practices
- Model Selection: Use smaller models for quick drafts, larger models for final transcripts
- Audio Quality: Higher quality audio produces better transcription results
- Chunking: Adjust minimum/maximum seconds based on your content type (shorter for conversational, longer for lectures)
- VAD Tuning: Start with balanced VAD and adjust based on your audio environment
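
To make the chunking guidance concrete, the following sketch shows one plausible way minimum seconds, maximum seconds, and the silence threshold could interact. It is not the connector's actual algorithm: it assumes a VAD has already labeled each fixed-length frame as speech or non-speech, and the function name `split_into_chunks` is hypothetical.

```python
from typing import List, Tuple


def split_into_chunks(
    speech_flags: List[bool],
    frame_seconds: float,
    silence_threshold: float,
    min_seconds: float,
    max_seconds: float,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, with end exclusive.

    A chunk is closed when it is at least min_seconds long and ends in a
    pause of at least silence_threshold seconds, or when it reaches
    max_seconds regardless of silence.
    """
    chunks: List[Tuple[int, int]] = []
    start = 0
    silent_run = 0  # consecutive non-speech frames seen so far
    for i, is_speech in enumerate(speech_flags):
        silent_run = 0 if is_speech else silent_run + 1
        length = (i + 1 - start) * frame_seconds
        long_pause = silent_run * frame_seconds >= silence_threshold
        if (length >= min_seconds and long_pause) or length >= max_seconds:
            chunks.append((start, i + 1))
            start = i + 1
            silent_run = 0
    if start < len(speech_flags):
        chunks.append((start, len(speech_flags)))  # trailing partial chunk
    return chunks
```

Under this model, a short minimum lets conversational audio be cut at the frequent pauses between speakers, while a long maximum lets a lecture with few pauses run further before a cut is forced mid-speech.
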
In summary:
The Audio – Transcribe connector provides flexible, high-quality speech-to-text transcription for audio and video files, with customizable options for model selection, silence detection, chunking, and VAD level to fit a wide range of transcription needs.
