• Picasso AI Logo
    Logo Picasso IA
  • Home
  • AI Image
    Nano Banana 2
  • AI Video
    Veo 3.1 Lite
  • AI Chat
    Gemini 3 Pro
  • Edit Images
  • Upscale Image
  • Remove Background
  • Text to Speech
  • Effects
    NEW
  • Generations
  • Billing
  • Support
  • Account
  1. Collection
  2. Speech to Text
  3. Granite Speech 4.1 2b

Granite Speech 4.1 2B: Speech to Text in 6 Languages

Granite Speech 4.1 2B is a compact speech recognition model built for people who need accurate transcription across multiple languages without complex infrastructure. Whether you are a podcaster working with international guests, a researcher handling multilingual interviews, or a developer building a voice-enabled app, it converts spoken audio directly into text you can use immediately. The model handles automatic speech recognition in six languages: English, French, German, Spanish, Portuguese, and Japanese. Beyond transcription, it supports bidirectional speech translation, converting spoken content from one language into written text in another in a single step. At just 2 billion parameters, it runs efficiently and returns results without the delays typical of larger speech models. You can feed it a single short clip or a longer recording, and it returns clean text ready to paste into documents, subtitle files, or databases. It fits naturally into content production workflows, multilingual customer service pipelines, and transcription projects. Give it an audio sample right now and have your transcript in seconds.

Official

Ibm Granite

9 runs

Granite Speech 4.1 2b

2026-04-27

Commercial Use

Granite Speech 4.1 2B: Speech to Text in 6 Languages

Table of contents

  • Overview
  • How It Works
  • Frequently Asked Questions
  • Credit Cost
  • Features
  • Use Cases
Get Nano Banana Pro

Overview

Granite Speech 4.1 2B turns spoken audio into accurate written text across six languages, solving a problem that stops many creators and professionals cold: getting a reliable transcript without spending hours on manual work. Whether you are a journalist working through recorded interviews, a content creator pulling quotes from a podcast episode, or an analyst reviewing meeting recordings, this model handles the conversion quickly. You upload your audio on Picasso IA and receive a clean, readable transcript within seconds, or a translation if you need the content in a different language. It covers English, French, German, Spanish, Portuguese, and Japanese, with bidirectional translation between those languages built in.

How It Works

  • Upload your audio file in one of the six supported languages, or pass in a recording from your device
  • Optionally write a short prompt or system instruction to shape the output, for example requesting a specific format or asking for a translation into a target language
  • Adjust settings like temperature or token limits if you want tighter control over output length and consistency
  • Hit generate and receive a plain-text transcript within seconds, scaled to the length of the recording
  • Copy the result from the output panel and paste it into your document, subtitle file, report, or any other tool in your workflow

Frequently Asked Questions

Do I need programming skills or technical knowledge to use this? No, just open Granite Speech 4.1 2B on Picasso IA, adjust the settings you want, and hit generate.

Is it free to try? Yes, you can run Granite Speech 4.1 2B without any upfront commitment. Check your account page for current credit or plan details.

What languages does the model support? The model covers English, French, German, Spanish, Portuguese, and Japanese. It can transcribe speech within any of those languages and translate audio content between them in both directions.

How long does it take to get a transcript? Most audio clips return a result within a few seconds. Longer recordings take a bit more time depending on file length and audio clarity.

What does the model return? The model returns plain text. You can copy it directly from the results panel and drop it into any document, email, subtitle editor, or publishing tool.

Can I ask the model to translate instead of just transcribing? Yes. Use the prompt or system prompt fields to specify your target language. For example, writing "Translate this audio to English" will return the content in that language rather than the original.

What if the transcript has mistakes? Try lowering the temperature setting for more consistent output, and make sure the recording is as clear as possible. Providing a short context prompt about the topic or speaker can also help the model produce more accurate results.

Credit Cost

Each generation consumes 1 credit

1 credit

or 5 credits for 5 generations

Features

Everything this model can do for you

Multilingual ASR

Recognizes speech in English, French, German, Spanish, Portuguese, and Japanese out of the box.

Bidirectional translation

Converts spoken audio in one language into written text in another without a separate step.

Compact 2B model

Returns accurate transcriptions faster than larger models due to its smaller parameter count.

Real-time streaming

Outputs text as it generates, so you get partial results before the full audio finishes processing.

Seed-based reproducibility

Set a seed value to reproduce identical transcription output across multiple runs.

Sampling controls

Adjust temperature, top-k, and top-p values to tune output precision for your specific audio.

Flexible input modes

Accepts audio alongside chat-style messages or standard completion prompts for different integration styles.

Use Cases

Transcribe a recorded podcast episode or interview into a written transcript you can edit and publish

Convert a voice memo recorded in Spanish or French into an English text document in a single step

Generate text from a Japanese audio recording for archiving, translation, or review

Transcribe customer service calls in multiple languages to analyze for quality and compliance

Extract spoken content from a meeting recording and paste it directly into notes or a summary

Build a voice input feature into an app by connecting audio data to the model's transcription output

Create subtitles for a multilingual video by feeding the audio track and receiving the text back

Switch Category

Effects

Text To Image

Text To Image

Text To Video

Large Language Models

Large Language Models

Text To Speech

Text To Speech

Super Resolution

Super Resolution

Lipsync

AI Music Generation

AI Music Generation

Video Editing

Speech To Text

Speech To Text

AI Enhance Videos

AI Enhance Videos

Remove Backgrounds

Remove Backgrounds