Scribe v2: AI Speech to Text for 90+ Languages

Scribe v2 converts spoken audio into written text, handling everything from a quick voice memo to a 10-hour conference recording. If you've ever spent hours manually typing out interviews or meeting notes, this model cuts that work down to minutes. It reads MP3, WAV, M4A, video files, and a dozen other formats, so you don't need to convert anything before you start.

The model supports over 90 languages and can automatically detect which one is being spoken, making it practical for multilingual recordings. It separates up to 32 distinct speakers and labels each word by who said it, so transcripts of group interviews or panel discussions stay organized. You can also feed in a list of product names or technical terms to steer the model toward the right spelling when the audio quality is imperfect.

Journalists, researchers, podcast editors, and customer support teams all use speech-to-text tools as the first step in their editing workflow. Scribe v2 fits naturally at that entry point: drop in your file, get a clean transcript back, and move straight into editing, translation, or subtitling. Files up to 3 GB are supported, so full-length films or long podcast episodes are no problem.

Official

ElevenLabs

15.7k runs

Scribe V2

2026-05-05

Commercial Use

Overview

Scribe v2 converts spoken audio into accurate text across more than 90 languages, returning results fast enough to fit into a real editing workflow. The problem it solves is time: transcribing an hour-long interview by hand takes three to four hours even for a fast typist, and the output still needs heavy correction. Scribe v2 does the same job in minutes, producing a structured transcript with speaker labels, word-level timestamps, and inline tags for background sounds like applause or laughter. On Picasso IA, the whole process is a few clicks, no code required.

How It Works

Upload your audio or video file. Supported formats include MP3, WAV, M4A, FLAC, MP4, MOV, MKV, and many others. Files up to 3 GB and 10 hours long are accepted.
Set the language if you know it, or leave detection on automatic. Specifying a language improves accuracy on noisy or heavily accented recordings.
Turn on speaker diarization if your recording has multiple voices. Enter the number of speakers you expect so the model can separate them cleanly.
Add keyterms for any product names, proper nouns, or technical phrases that need to appear correctly in the final text. Up to 1,000 terms are accepted.
Run the model. Your transcript comes back with timestamps, a speaker label on each word or segment, and audio event tags wherever non-speech sounds occur.
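
The steps above can be sketched as a pre-upload check. This is a minimal illustration that only encodes the limits quoted on this page (3 GB, 10 hours, 32 speakers, 1,000 keyterms). The class and function names are hypothetical and do not correspond to any official Picasso IA or ElevenLabs API, and the 3 GB limit is assumed to mean binary gigabytes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits quoted on this page. All names here are illustrative assumptions.
MAX_FILE_BYTES = 3 * 1024**3       # 3 GB, assuming binary gigabytes
MAX_DURATION_SECONDS = 10 * 3600   # 10 hours
MAX_SPEAKERS = 32
MAX_KEYTERMS = 1000


@dataclass
class TranscriptionSettings:
    """Hypothetical container for the choices made in the steps above."""
    file_size_bytes: int
    duration_seconds: float
    language: Optional[str] = None          # None means automatic detection
    diarize: bool = False
    expected_speakers: Optional[int] = None
    keyterms: List[str] = field(default_factory=list)


def check_settings(s: TranscriptionSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    if s.file_size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 3 GB")
    if s.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("recording is longer than 10 hours")
    if s.diarize and s.expected_speakers is not None:
        if not 1 <= s.expected_speakers <= MAX_SPEAKERS:
            problems.append("expected speakers must be between 1 and 32")
    if len(s.keyterms) > MAX_KEYTERMS:
        problems.append("more than 1,000 keyterms supplied")
    return problems
```

Running a check like this before uploading catches an oversized file or an overlong term list early, instead of after a multi-gigabyte transfer.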

Frequently Asked Questions

Do I need programming skills or technical knowledge to use this? No. Open Scribe v2 on Picasso IA, adjust the settings you want, and run the model.

Is it free to try? Yes, you can run Scribe v2 without a paid subscription to get started. Check the current pricing page for credit details and plan options.

How long does it take to get results? A 10-minute clip typically comes back in under a minute. A full hour of audio usually takes two to three minutes. File length and background noise both affect processing time.

What file formats does it support? Scribe v2 accepts MP3, WAV, M4A, FLAC, OGG, OPUS, WebM, AAC, MP4, MOV, MKV, AVI, and several other common audio and video formats. The per-file limit is 3 GB and 10 hours.

Can it tell different speakers apart in a conversation? Yes. Enable speaker diarization before running and each word in the transcript is tagged with a speaker ID. The model handles up to 32 distinct speakers in a single recording.

What if the model transcribes a name or term incorrectly? Add it to the keyterms field before generating. You can list up to 1,000 terms, each up to 50 characters, and the model will weight those words more heavily during transcription.

Where can I use the transcripts I generate? The output is plain text with no watermarks or restrictions. Paste it into a document, feed it into a subtitle editor, or use it however your project requires.

Credit Cost

Each generation consumes 1 credit, so 5 generations cost 5 credits.

Features

Everything this model can do for you

90+ language support

Transcribe audio in over 90 languages, with automatic language detection for mixed or unknown recordings.

Speaker diarization

Identify and label up to 32 individual speakers, giving each word a speaker tag in the output.
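
Because every word carries a speaker tag, per-word output can be folded into readable speaker turns. The sketch below assumes each word arrives as a dictionary with `text`, `start`, `end`, and `speaker` keys. That shape is an assumption for illustration only, so adjust the key names to match the transcript you actually receive.

```python
from typing import Dict, List


def group_by_speaker(words: List[Dict]) -> List[Dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + w["text"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["text"],
            })
    return turns


# Hypothetical sample data, not real model output.
sample = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": "speaker_0"},
    {"text": "there.", "start": 0.4, "end": 0.8, "speaker": "speaker_0"},
    {"text": "Hi!", "start": 1.0, "end": 1.3, "speaker": "speaker_1"},
]

for turn in group_by_speaker(sample):
    print(f'{turn["speaker"]}: {turn["text"]}')
```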

Word-level timestamps

Get precise start and end times for every word, ready to sync with video subtitles or captions.
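
As one way to use word-level timestamps for captions, the sketch below packs words into fixed-size groups and writes them out in the SubRip (SRT) layout. It assumes each word is a dictionary with `text`, `start`, and `end` keys, with times in seconds. That input shape and the seven-word grouping are illustrative choices, not something the model prescribes.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group words into caption blocks of at most max_words words each."""
    blocks = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = i // max_words + 1
        start = srt_time(chunk[0]["start"])   # first word's start time
        end = srt_time(chunk[-1]["end"])      # last word's end time
        text = " ".join(w["text"] for w in chunk)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

A real subtitle workflow would usually also split on pauses and sentence boundaries rather than on word count alone.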

Audio event tagging

Flag non-speech sounds like laughter, applause, or footsteps directly inside the transcript.

Custom term biasing

Supply a list of up to 1,000 preferred spellings so the model favors the correct form of brand names and jargon.
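
A term list is easier to manage if it is cleaned before being pasted into the keyterms field. The sketch below applies the limits stated in the FAQ (up to 1,000 terms, each up to 50 characters). The function name is hypothetical, and dropping blank and duplicate entries is a convenience assumed here, not documented model behavior.

```python
from typing import List, Tuple

MAX_TERMS = 1000        # limit quoted on this page
MAX_TERM_LENGTH = 50    # per-term character limit quoted in the FAQ


def prepare_keyterms(raw_terms: List[str]) -> Tuple[List[str], List[str]]:
    """Return (accepted, rejected) after trimming and de-duplicating terms."""
    accepted: List[str] = []
    rejected: List[str] = []
    seen = set()
    for term in raw_terms:
        cleaned = " ".join(term.split())   # collapse stray whitespace
        if not cleaned or cleaned in seen:
            continue                        # skip blanks and exact duplicates
        if len(cleaned) > MAX_TERM_LENGTH or len(accepted) >= MAX_TERMS:
            rejected.append(cleaned)        # too long, or the list is full
            continue
        seen.add(cleaned)
        accepted.append(cleaned)
    return accepted, rejected
```

Duplicates are matched exactly, including capitalization, because the point of the list is to tell the model which spelling to prefer.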

Large file support

Upload audio or video files up to 3 GB and 10 hours without splitting or compressing them first.

Clean transcript mode

Remove filler words, false starts, and disfluencies to produce a polished, readable output.

Broad format compatibility

Accepts MP3, WAV, M4A, FLAC, OGG, MP4, MOV, MKV, and many other audio and video formats.

Use Cases

Transcribe a recorded interview into a timestamped text document, with each speaker's words labeled separately

Convert a podcast episode into a written transcript for blog posts, show notes, or repurposing into articles

Automatically detect and tag non-speech sounds like applause or laughter in event recordings

Transcribe multilingual meeting recordings and let the model identify the language automatically

Generate clean, readable transcripts by stripping out filler words like 'um' and 'uh' from the output

Bias the transcription toward specific product names or technical jargon by providing a custom list of preferred terms

Extract word-level timestamps from a video file to sync subtitles or closed captions

Transcribe a 10-hour recorded lecture or conference session from a single file upload
