
Realtime TTS 1.5 Mini: 120ms AI Voice Synthesis

Realtime TTS 1.5 Mini converts written text into spoken audio in roughly 120 milliseconds, making it one of the fastest text-to-speech options available. If you have ever waited several seconds for audio to generate before a demo, a customer interaction, or a live product test, this model cuts that wait to a fraction of a second. It works across 15 languages, so one setup handles multilingual content without juggling multiple tools.

You can shape the output in several ways. Emotion tags like [happy] or [sad] shift the speaker's tone without any extra processing step. SSML break tags let you control where pauses fall, giving you the rhythm you need for narration or dialogue. The model accepts sample rates from 8 kHz to 48 kHz and outputs audio as MP3, WAV, OGG Opus, or FLAC, so the file fits whatever platform or pipeline receives it. A temperature setting controls how expressive or consistent the delivery sounds across repeated runs.

For voice-powered apps, interactive phone bots, online course narration, or any project where audio latency is a real constraint, this model slots in without requiring a heavy infrastructure change. Drop in your text, pick a voice and language, and get back a ready-to-use audio file in under a second.

Official · Inworld · 89.6k runs · 2026-03-10 · Commercial Use

Table of contents

  • Overview
  • How It Works
  • Frequently Asked Questions
  • Credit Cost
  • Features
  • Use Cases

Overview

Realtime TTS 1.5 Mini converts written text into natural-sounding speech in roughly 120 milliseconds, making it one of the fastest synthesis models available for live applications. If you're building a customer support bot, a reading assistant, or a voice interface that needs to respond in real time, waiting two or three seconds for audio to render is a dealbreaker. Picasso IA hosts this model so you can test it directly in the browser, with no API setup required. It covers 15 languages out of the box, so a single model handles multilingual projects without switching tools.

How It Works

  • Type or paste your text into the input field, up to 2,000 characters per request
  • Choose a preset voice from the library or supply a custom cloned voice ID
  • Set the speaking rate and temperature to control speed and expressiveness, and pick your output format (MP3, WAV, OGG Opus, or FLAC)
  • Select the sample rate that fits your target environment, from 8 kHz for telephony up to 48 kHz for high-fidelity audio
  • Hit generate and receive your audio file in under a second for most inputs
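
The steps above amount to filling in a handful of settings and checking them against the stated limits. The following is a minimal sketch of that check, not the model's actual API: the `TtsRequest` class, its field names, the format identifiers, and the default values are all hypothetical. Only the 2,000-character limit, the four output formats, and the 8 kHz to 48 kHz range come from this page.

```python
from dataclasses import dataclass

MAX_CHARS = 2000                                # per-request limit stated above
FORMATS = {"mp3", "wav", "ogg_opus", "flac"}    # hypothetical identifiers
MIN_RATE_HZ, MAX_RATE_HZ = 8_000, 48_000        # range stated above


@dataclass(frozen=True)
class TtsRequest:
    """Hypothetical container for the settings listed in the steps."""
    text: str
    voice: str
    speaking_rate: float = 1.0    # assumed multiplier; 1.0 = normal speed
    temperature: float = 1.0      # assumed default
    audio_format: str = "mp3"     # MP3 is described as the default format
    sample_rate_hz: int = 48_000  # assumed default


def validate(req: TtsRequest) -> list[str]:
    """Return a list of problems; an empty list means the request looks fine."""
    problems = []
    if not req.text.strip():
        problems.append("text is empty")
    if len(req.text) > MAX_CHARS:
        problems.append(f"text has {len(req.text)} characters; limit is {MAX_CHARS}")
    if not req.voice:
        problems.append("no voice selected")
    if req.audio_format not in FORMATS:
        problems.append(f"unsupported format: {req.audio_format}")
    if not MIN_RATE_HZ <= req.sample_rate_hz <= MAX_RATE_HZ:
        problems.append(f"sample rate {req.sample_rate_hz} Hz is outside 8-48 kHz")
    if req.speaking_rate <= 0:
        problems.append("speaking rate must be positive")
    return problems


if __name__ == "__main__":
    ok = TtsRequest(text="Your order has shipped.", voice="Ashley")
    too_long = TtsRequest(text="x" * 2500, voice="Ashley", sample_rate_hz=96_000)
    print(validate(ok))
    print(validate(too_long))
```

Returning a list of problems rather than raising on the first one lets a form or script report every issue at once.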

Frequently Asked Questions

Do I need programming skills or technical knowledge to use this? No. Open Realtime TTS 1.5 Mini on Picasso IA, adjust the settings you want, and hit generate.

Is it free to try? Picasso IA lets you run the model without creating an account or entering payment details. You can generate audio and listen to it directly in the browser before downloading anything.

How long does it take to get results? The model targets around 120 milliseconds from input to audio. In practice, most short-to-medium texts render in well under a second, even on a standard internet connection.

What output formats are supported? You can download your audio as MP3, WAV, OGG Opus, or FLAC. MP3 is the default and plays back in virtually every environment. Choose FLAC or WAV if you need lossless audio for post-production editing.

Can I control the voice's tone and speed? Yes. The temperature setting adjusts how expressive or neutral the voice sounds. The speaking rate multiplier lets you speed up or slow down delivery without changing the pitch. You can also insert break tags and emotion markers directly in your text to shape pauses and tone at specific moments.

What languages does the model support? The model covers 15 languages, so you can synthesize speech across multiple locales using the same workflow without switching to a different model for each language.

What happens if I'm not happy with the result? Try adjusting the temperature slider for a different expressiveness level, or switch to a different voice from the preset library. Small changes to phrasing in the source text can also noticeably affect how natural the output sounds.
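
The break tags and emotion markers mentioned in the tone-and-speed answer can be assembled programmatically when a script is generated from data. This is a small sketch under assumptions: the helper names are hypothetical, placing the emotion marker at the start of a sentence is a guess, and the `<break time="…"/>` form follows the standard W3C SSML syntax, which this page does not spell out.

```python
def with_emotion(emotion: str, sentence: str) -> str:
    """Prefix a sentence with an emotion marker such as [happy]."""
    return f"[{emotion}] {sentence}"


def pause(milliseconds: int) -> str:
    """Return an SSML-style break tag for a pause of the given length."""
    if milliseconds < 0:
        raise ValueError("pause length must be non-negative")
    return f'<break time="{milliseconds}ms"/>'


# Build one input string: a cheerful greeting, a short pause, then bad news.
script = " ".join([
    with_emotion("happy", "Welcome back!"),
    pause(400),
    with_emotion("sad", "Unfortunately, your order is delayed."),
])

print(script)
# prints: [happy] Welcome back! <break time="400ms"/> [sad] Unfortunately, your order is delayed.
```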

Credit Cost

Each generation consumes 1 credit, so 5 generations cost 5 credits.

Features

Everything this model can do for you

~120ms latency

Returns audio fast enough for live voice applications and real-time pipelines.

15-language support

Produce speech in fifteen different languages from a single API call.

Emotion markup

Insert [happy], [sad], or similar tags to shift the speaker's emotional tone.

Flexible audio formats

Download output as MP3, WAV, OGG Opus, or FLAC to match any platform.

Custom voices

Use preset names like Ashley or Dennis, or supply your own cloned voice ID.

SSML pause control

Place natural-sounding breaks anywhere in the text with break time tags.

Adjustable sample rate

Choose from 8 kHz to 48 kHz to balance file size against audio fidelity.

Text normalization

Expand numbers, dates, and abbreviations automatically before synthesis.
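
To make the sample-rate trade-off concrete, the size of uncompressed audio grows linearly with the sample rate. The sketch below assumes 16-bit mono PCM (the page does not state bit depth or channel count) and ignores the small WAV header. The intermediate rates of 16 kHz and 24 kHz are illustrative picks within the stated 8 kHz to 48 kHz range, not a confirmed list of supported values.

```python
def pcm_bytes(sample_rate_hz: int, seconds: float,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of raw PCM audio: rate x duration x bytes per sample x channels."""
    return int(sample_rate_hz * seconds * (bits_per_sample // 8) * channels)


if __name__ == "__main__":
    # Compare a 10-second clip at several rates (16-bit mono assumed).
    for rate in (8_000, 16_000, 24_000, 48_000):
        size_kb = pcm_bytes(rate, 10) / 1000
        print(f"{rate // 1000:>2} kHz: {size_kb:,.0f} kB for 10 s")
```

Under these assumptions a 10-second clip is 160 kB at 8 kHz and 960 kB at 48 kHz, a sixfold difference. Compressed formats such as MP3 or OGG Opus will be smaller, and their sizes depend on bitrate rather than on this formula.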

Use Cases

Generate voiced instructions for a mobile app walkthrough in under a second per sentence

Produce multilingual product announcements in up to 15 languages from a single text template

Create voiced customer service responses for a chatbot that needs replies delivered in real time

Add emotion-tagged narration to a video script by inserting [happy] or [sad] markers in the text

Build an audiobook preview by converting a sample chapter to MP3 or WAV with natural pacing

Insert timed pauses into podcast intros using SSML break tags for a scripted, polished feel

Test different speaker voices on the same script to pick the tone that fits your brand before launch
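
For longer material such as the audiobook sample chapter, the text will usually exceed the 2,000-character per-request limit listed under How It Works. One way to handle that is to split the text at sentence boundaries and submit each piece separately. The function below is a sketch of that approach; `split_for_tts` is a hypothetical helper, and the sentence detection is a simple punctuation-based heuristic, not part of the model.

```python
import re

MAX_CHARS = 2000  # per-request limit stated on this page


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking
    after sentence-ending punctuation where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate      # sentence still fits in this chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    chapter = "It was a quiet morning in the harbor. " * 120
    parts = split_for_tts(chapter)
    print(len(parts), [len(p) for p in parts])
```

Breaking at sentence boundaries keeps each request self-contained, which should help the pacing at the joins when the resulting audio files are concatenated.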
