ControlVideo is a text-to-video model that restyles existing footage by following the structure of a source video while applying the look and content you describe in a prompt. If you have a clip of someone walking and want it to look like an oil painting, a sketch, or a scene in a different location, you describe it and the model handles the rest. It extracts depth, edge, or pose data from your original video so the new output stays in sync with the motion. These are its three structure modes: depth maps preserve the three-dimensional relationships between objects, Canny edge detection follows silhouettes and contours, and pose estimation tracks the body positions of human subjects. You control how closely the output follows your prompt versus the original structure using the guidance scale, and you can produce longer clips by enabling the hierarchical sampler. It fits into any video content workflow where you need a different visual style without reshooting. Animators can restyle reference footage, marketers can repurpose clips with new aesthetics, and creators can iterate on a single take until the look is right. Open ControlVideo on Picasso IA, paste your prompt, and run it.
ControlVideo lets you restyle an existing video clip by following its structure and applying the visual content you describe in a text prompt. You upload a short clip, write a description of the look you want, and the model generates a new video that matches the original motion while adopting your specified style. Picasso IA runs ControlVideo directly in the browser with no installation needed. A scene of someone jogging can become a watercolor illustration, a pencil sketch, or a detailed fantasy landscape, all from a single run. It works for animation, product visualization, and creative style tests where you want to change what a video looks like without altering how subjects move through the frame.
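Under the hood, every run reduces to a handful of inputs: the source clip, the text prompt, a structure mode, and a few sampling controls. The sketch below shows that parameterization in Python; `run_controlvideo` is a hypothetical stand-in rather than a real Picasso IA API, and the browser UI exposes the same knobs as form controls, so no code is required to use the service.

```python
# Minimal sketch of the knobs a ControlVideo run exposes. This stub is
# illustrative only; names and defaults are assumptions, not a documented API.

def run_controlvideo(
    video_path: str,                           # source clip supplying the motion
    prompt: str,                               # the look and content you want
    condition: str = "depth",                  # "depth", "canny", or "pose"
    guidance_scale: float = 12.5,              # higher = follow the prompt more closely
    video_length: int = 15,                    # clip length in frames
    smoother_steps: list[int] | None = None,   # timesteps for the de-flicker smoother
    is_long_video: bool = False,               # enable the hierarchical sampler
    seed: int | None = None,                   # fix for reproducible output
) -> str:
    """Illustrative stub: a real run would denoise here and return the clip path."""
    return "outputs/result.mp4"  # placeholder path

clip = run_controlvideo(
    video_path="walking.mp4",
    prompt="a person walking through a misty forest, oil painting style",
    condition="depth",
    seed=42,
)
```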
Do I need programming skills or technical knowledge to use this? No, just open ControlVideo on Picasso IA, adjust the settings you want, and hit generate.
Is it free to try? Yes, you can run ControlVideo without a subscription to test it on your own footage.
How long does generation take? A standard 15-frame clip at 50 denoising steps typically takes between 30 seconds and 2 minutes depending on current server load.
Which condition type should I choose? Depth works best for scenes with clear spatial layers between foreground and background. Canny is better for preserving hard edges and object silhouettes. Pose is designed specifically for clips with visible human figures moving on screen.
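Whichever mode you pick, each source frame is first converted into a structure map. As a rough illustration, the snippet below uses the open-source controlnet_aux annotators (a common choice for ControlNet-style pipelines, and an assumption here rather than a documented Picasso IA dependency) to produce all three map types from a single frame:

```python
from PIL import Image
from controlnet_aux import CannyDetector, MidasDetector, OpenposeDetector

frame = Image.open("frame_000.png")  # one frame extracted from the source clip

# Canny: hard edges and object silhouettes; thresholds tune edge sensitivity.
edge_map = CannyDetector()(frame, low_threshold=100, high_threshold=200)

# Depth: per-pixel distance estimates, preserving foreground/background layers.
depth_map = MidasDetector.from_pretrained("lllyasviel/Annotators")(frame)

# Pose: skeleton keypoints for any visible human figures.
pose_map = OpenposeDetector.from_pretrained("lllyasviel/Annotators")(frame)
```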
Can I generate longer videos? Yes. Turn on the long-video toggle in the settings panel, and the model uses a hierarchical sampler to keep frames consistent across the full clip duration.
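Conceptually, a hierarchical sampler generates a sparse set of keyframes across the whole clip first, then fills each in-between segment conditioned on its flanking keyframes, so the global look is decided once and distant frames cannot drift apart. The toy sketch below illustrates only that scheduling; the two `sample_*` functions are hypothetical stand-ins for the model's actual denoising calls.

```python
def sample_keyframes(frame_ids):
    # Stand-in for the model jointly denoising all keyframes at once.
    return [f"key_{i}" for i in frame_ids]

def sample_between(left, right, index):
    # Stand-in for denoising one frame conditioned on its flanking keyframes.
    return f"tween({left},{right})@{index}"

def hierarchical_sample(num_frames: int, stride: int = 5) -> list[str]:
    """Toy illustration of hierarchical sampling order (no real model calls)."""
    key_ids = list(range(0, num_frames, stride))
    if key_ids[-1] != num_frames - 1:   # ensure the clip's last frame is a keyframe
        key_ids.append(num_frames - 1)
    frames = dict(zip(key_ids, sample_keyframes(key_ids)))
    for left, right in zip(key_ids, key_ids[1:]):
        for i in range(left + 1, right):   # fill the gap between two keyframes
            frames[i] = sample_between(frames[left], frames[right], i)
    return [frames[i] for i in sorted(frames)]

print(hierarchical_sample(30))
```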
How do I fix flickering or frame inconsistencies? Set the smoother steps field to the intermediate denoising timesteps at which the smoothing pass should run. That pass blends adjacent frames during generation, reducing visual drift and flicker across the clip.
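Continuing the hypothetical `run_controlvideo` sketch above, the smoother is just a list of denoising timesteps. The `[19, 20]` value below mirrors the default in the open-source ControlVideo release; how Picasso IA wires this up internally is an assumption.

```python
# Apply the frame smoother at denoising timesteps 19 and 20; adding more
# intermediate timesteps strengthens the de-flicker pass.
clip = run_controlvideo(
    video_path="walking.mp4",
    prompt="a person walking through a misty forest, oil painting style",
    condition="depth",
    smoother_steps=[19, 20],
)
```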
Where can I use the outputs? The exported video file has no watermark and can go directly into a social post, a presentation, a demo reel, or any other project.
Everything this model can do for you
Run the model on any source video without configuring or retraining additional weights.
Choose depth, Canny edge, or pose to control how structure is extracted from the source video.
Adjust how strongly the output follows the text prompt versus the original video structure.
Enable the hierarchical sampler to produce extended clips beyond the default 15 frames.
Reduce flicker and frame inconsistencies by setting smoother steps during generation.
Reuse the same seed to reproduce identical outputs for side-by-side comparison, as in the sketch after this list.
Set the clip duration to match your specific production or publishing requirements.
Randomize the seed to produce varied outputs across repeated runs.
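Putting the list above together, the sketch below reuses the hypothetical `run_controlvideo` stub from earlier to run the same clip twice: identical seeds make the runs reproducible, so changing only the guidance scale isolates its effect in a side-by-side comparison. Names and defaults are illustrative, not a documented API.

```python
# Two runs with the same seed; only guidance_scale differs between them.
baseline = run_controlvideo(
    video_path="jog.mp4",
    prompt="a jogger at sunrise, watercolor illustration",
    condition="canny",       # follow silhouettes and contours
    guidance_scale=10.0,
    video_length=15,         # default clip length in frames
    seed=1234,
)
stronger_prompt = run_controlvideo(
    video_path="jog.mp4",
    prompt="a jogger at sunrise, watercolor illustration",
    condition="canny",
    guidance_scale=15.0,     # lean harder on the prompt
    video_length=15,
    seed=1234,               # same seed, so the comparison is fair
)
```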
A white swan moving on the lake, cartoon style.
James Bond moonwalk on the beach, animation style.
A striking mallard floats effortlessly on the sparkling pond.