Scribe v2 converts spoken audio into written text, handling everything from a quick voice memo to a 10-hour conference recording. If you've ever spent hours manually typing out interviews or meeting notes, this model cuts that work down to minutes. It reads MP3, WAV, M4A, common video files, and many other audio and video formats, so you don't need to convert anything before you start. Files up to 3 GB are supported, so full-length films or long podcast episodes are no problem.
The model supports over 90 languages and can automatically detect which one is being spoken, making it practical for multilingual recordings. It separates up to 32 distinct speakers and labels each word by who said it, so transcripts of group interviews or panel discussions stay organized. You can also feed in a list of product names or technical terms to steer the model toward the right spelling when the audio quality is imperfect.
Journalists, researchers, podcast editors, and customer support teams all use speech-to-text tools as the first step in their editing workflow. Scribe v2 fits naturally at that entry point: drop in your file, get a clean transcript back, and move straight into editing, translation, or subtitling.
Scribe v2 converts spoken audio into accurate text across more than 90 languages, returning results fast enough to fit into a real editing workflow. The problem it solves is time: transcribing an hour-long interview by hand takes three to four hours even for a fast typist, and the output still needs heavy correction. Scribe v2 does the same job in minutes, producing a structured transcript with speaker labels, word-level timestamps, and inline tags for background sounds like applause or laughter. On Picasso IA, the whole process takes a few clicks and no code.
Do I need programming skills or technical knowledge to use this? No. Just open Scribe v2 on Picasso IA, adjust the settings you want, and hit generate.
Is it free to try? Yes, you can run Scribe v2 without a paid subscription to get started. Check the current pricing page for credit details and plan options.
How long does it take to get results? A 10-minute clip typically comes back in under a minute. A full hour of audio usually takes two to three minutes. File length and background noise both affect processing time.
What file formats does it support? Scribe v2 accepts MP3, WAV, M4A, FLAC, OGG, OPUS, WebM, AAC, MP4, MOV, MKV, AVI, and several other common audio and video formats. Each file can be up to 3 GB in size and up to 10 hours long.
Can it tell different speakers apart in a conversation? Yes. Enable speaker diarization before running and each word in the transcript is tagged with a speaker ID. The model handles up to 32 distinct speakers in a single recording.
What if the model transcribes a name or term incorrectly? Add it to the keyterms field before generating. You can list up to 1,000 terms, each up to 50 characters, and the model will weight those words more heavily during transcription.
Where can I use the transcripts I generate? The output is plain text with no watermarks or restrictions. Paste it into a document, feed it into a subtitle editor, or use it however your project requires.
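No code is needed to use Scribe v2 on Picasso IA, but if you prepare many files in advance, the limits quoted in the answers above can be checked with a small script before you upload. The following is a minimal sketch, not part of the product: the function name check_upload is hypothetical, the extension list covers only the formats named on this page (other formats are accepted too), and 3 GB is assumed to mean 3,000,000,000 bytes. The 10-hour duration limit is not checked because reading duration requires a media library.

```python
from pathlib import Path

# Limits quoted on this page. Treating "3 GB" as decimal gigabytes is an assumption.
MAX_FILE_BYTES = 3 * 1000**3
MAX_KEYTERMS = 1000
MAX_KEYTERM_CHARS = 50

# Only the extensions named on this page; the page says other formats also work.
NAMED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".opus",
    ".webm", ".aac", ".mp4", ".mov", ".mkv", ".avi",
}


def check_upload(filename: str, size_bytes: int, keyterms: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    ext = Path(filename).suffix.lower()
    if ext not in NAMED_EXTENSIONS:
        problems.append(f"extension '{ext}' is not among the formats named on the page")

    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 3 GB")

    if len(keyterms) > MAX_KEYTERMS:
        problems.append(f"{len(keyterms)} keyterms supplied, limit is {MAX_KEYTERMS}")

    too_long = [term for term in keyterms if len(term) > MAX_KEYTERM_CHARS]
    if too_long:
        problems.append(f"{len(too_long)} keyterm(s) exceed {MAX_KEYTERM_CHARS} characters")

    return problems


if __name__ == "__main__":
    # Hypothetical file and terms, for illustration only.
    for problem in check_upload("panel.mp3", 120_000_000, ["Picasso IA", "Scribe v2"]):
        print(problem)
```

An unrecognized extension is reported as a warning-style message and does not mean the file will be rejected, since the page lists only some of the supported formats.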
Everything this model can do for you
Transcribe audio in over 90 languages, with automatic language detection for mixed or unknown recordings.
Identify and label up to 32 individual speakers, giving each word a speaker tag in the output.
Get precise start and end times for every word, ready to sync with video subtitles or captions.
Flag non-speech sounds like laughter, applause, or footsteps directly inside the transcript.
Supply a list of up to 1,000 preferred spellings so the model favors the correct form of brand names and jargon.
Upload audio or video files up to 3 GB and 10 hours without splitting or compressing them first.
Remove filler words, false starts, and disfluencies to produce a polished, readable output.
Work with MP3, WAV, M4A, FLAC, OGG, MP4, MOV, MKV, and many other audio and video formats.
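To show how per-word timestamps and speaker tags can feed a subtitle workflow, here is a minimal sketch that groups words into SubRip (SRT) captions. The input shape is an assumption: each word is a dictionary with text, start, end (in seconds), and speaker keys. This page does not document the actual export format, so map the field names to whatever your transcript contains. The function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_cue_seconds: float = 5.0) -> str:
    """Group words into captions, starting a new caption when the speaker
    changes or when the caption would run longer than max_cue_seconds."""
    cues: list[list[dict]] = []
    current: list[dict] = []
    for word in words:
        if current:
            speaker_changed = word["speaker"] != current[0]["speaker"]
            too_long = word["end"] - current[0]["start"] > max_cue_seconds
            if speaker_changed or too_long:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        text = " ".join(w["text"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{cue[0]['speaker']}: {text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data in the assumed word format.
    sample = [
        {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": "speaker_0"},
        {"text": "there.", "start": 0.5, "end": 0.9, "speaker": "speaker_0"},
        {"text": "Hi.", "start": 1.2, "end": 1.5, "speaker": "speaker_1"},
    ]
    print(words_to_srt(sample))
```

Splitting on speaker changes keeps each caption attributed to one person; the five-second cap is an arbitrary default that you can tune for reading speed.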