Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches remain inefficient: they require large-scale datasets, often with restrictive pairing constraints, incur high computational cost from modeling the joint distribution, and rely on multi-stage training pipelines. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods.
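The inference path described above (encode a source modality into the shared latent with one invertible flow, then decode with another) can be illustrated with a toy sketch. This is not the FlowBind implementation: `ToyFlow`, `integrate`, `translate`, the fixed affine velocity fields, and the convention that t=0 is the latent and t=1 is the modality are all assumptions made for illustration, standing in for learned velocity networks.

```python
import numpy as np


def integrate(velocity, x, t0, t1, steps=200):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with the midpoint rule."""
    h = (t1 - t0) / steps  # negative when integrating backward in time
    t = t0
    for _ in range(steps):
        mid = x + 0.5 * h * velocity(x, t)
        x = x + h * velocity(mid, t + 0.5 * h)
        t += h
    return x


class ToyFlow:
    """Hypothetical per-modality flow: t=0 is the shared latent, t=1 the modality."""

    def __init__(self, matrix, offset):
        self.matrix = np.asarray(matrix, dtype=float)
        self.offset = np.asarray(offset, dtype=float)

    def velocity(self, x, t):
        # Stand-in for a learned velocity network; here a fixed affine field.
        return x @ self.matrix.T + self.offset

    def decode(self, z):
        # Latent -> modality: integrate forward in time.
        return integrate(self.velocity, z, 0.0, 1.0)

    def encode(self, x):
        # Modality -> latent: integrate the same ODE backward in time.
        return integrate(self.velocity, x, 1.0, 0.0)


def translate(x, source, target):
    """Encode with the source flow, then decode with the target flow."""
    return target.decode(source.encode(x))


# Two made-up modalities that share a 2-D latent space.
flow_a = ToyFlow([[0.0, 0.5], [-0.5, 0.0]], [0.2, 0.0])
flow_b = ToyFlow([[0.3, 0.0], [0.0, -0.3]], [0.0, -0.1])

x_a = np.array([[0.5, -1.0]])               # a sample in modality A
x_b = translate(x_a, flow_a, flow_b)        # A -> latent -> B
x_a_again = translate(x_b, flow_b, flow_a)  # B -> latent -> A
```

Because each flow is an ODE that can be integrated in either direction, the same velocity field serves as both encoder and decoder, and translating A to B and back recovers the original sample up to numerical integration error.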
FlowBind’s shared latent space supports smooth, semantically meaningful interpolations across modalities, revealing a well-aligned multimodal representation.
FlowBind can take multiple inputs and generate outputs that are aligned with all provided modalities.
[Examples: each text prompt below is paired with a second input condition; input media and generated results not shown]
"A cute dog playing in the park."
"A car speeding down a highway."
"A misty forest path."
"A living room with a carpet."
"A garden with flowers"
"A rooftop terrace with a view."
FlowBind can take a single input and generate multiple outputs of different modalities.
[Examples: one text input each, with generated results compared across FlowBind, CoDi, and OmniFlow; generated media not shown]
"The train tracks are lined with bluebonnets."
"A cat is sitting on the sofa."
"A dog is sitting on a couch and barking."
| Input Conditions | FlowBind | CoDi | OmniFlow |
|---|---|---|---|
| (media not shown) | The ceiling is painted with paintings on it. | Statue of the marion on the altar of the cathedral and the architect of the cathedral | A wind blows through the microphone as a horse gallops |
| (media not shown) | An older man in a suit and tie on a bouton. | President said that he was a politician who talked about politics tonight. | A man in a suit and tie, speaking |
| (media not shown) | A kitchen sink with a faucet and stainless steel faucet. | sexy she and his big boyfriend with sexy man and she both love with a man with a suit and his pretty dress for them. | A faucet is turned on |
| (media not shown) | The dog is smiling with its mouth open. | dogs are dogs | A dog growls and barks |
| (media not shown) | The Nissan aqua is shown. | ceo de 300 m 2 in a very [unk] number 300 the car is complete in 3d. | A car with a revving engine and a spo |
FlowBind supports all six one-to-one cross-modal generation directions among text, images, and audio.
[Examples: text input conditions; generated media not shown]
"A busy street with several cars and buses driving on it."
"The baseball player is throwing the ball hard."
"There are many teddy bears sitting on the shelves."
"A group of biker turning into a curb."
"A white horse standing next to a tree."
"Meat and broccoli in sauce are in a bowl."
[Examples: text input conditions, with generated results compared across FlowBind, CoDi, and OmniFlow; generated media not shown]
"an engine idling with bells ringing in the background"
"television plays in the distant background and then a sewing machine starts up"
"helicopter blades spinning then fading away"
"a cat meowing and whining"
"several loud burps"
"typing is occurring on a keyboard in a quiet environment"
If you find our work helpful, please cite the following paper.
@misc{cha2025flowbind,
title={FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows},
author={Cha, Yeonwoo and Kim, Semin and Kwon, Jinhyeon and Hong, Seunghoon},
eprint={arXiv:2512.15420},
year={2025}
}