FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

KAIST

* Equal contribution

TL;DR: We propose an efficient flow-based multimodal any-to-any generation model with bidirectional flows.

An overview of FlowBind. During training, we jointly learn the shared latent and per-modality drift networks in a single stage. At inference, the learned drift networks perform flexible any-to-any generation by solving per-modality ODEs forward and backward in time.
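The inference procedure described above, translating between modalities by solving one modality's ODE forward in time into the shared latent and another's backward in time, can be sketched with simple Euler integration. This is not the authors' code: the linear drift functions below are toy stand-ins for the learned per-modality drift networks, chosen only so the flows are invertible in closed form.

```python
def euler_solve(drift, x, t0, t1, steps=1000):
    """Integrate dx/dt = drift(x, t) from t0 to t1 (t1 < t0 runs backward)."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * drift(x, t)
        t += dt
    return x

def make_drift(scale):
    # Toy linear drift dx/dt = scale * x: an invertible exponential flow
    # standing in for a learned per-modality drift network.
    return lambda x, t: scale * x

drift_image = make_drift(0.5)   # hypothetical "image" flow
drift_text = make_drift(-0.3)   # hypothetical "text" flow

# Image -> shared latent: solve the image ODE forward in time (t: 0 -> 1).
x_image = 2.0
z = euler_solve(drift_image, x_image, 0.0, 1.0)

# Shared latent -> text: solve the text ODE backward in time (t: 1 -> 0).
x_text = euler_solve(drift_text, z, 1.0, 0.0)

# Invertibility: running the image ODE backward recovers the input.
x_back = euler_solve(drift_image, z, 1.0, 0.0)
```

Because each flow is an ODE, the same drift network serves as both encoder (forward solve) and decoder (backward solve), which is what makes all six translation directions available from one model.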

Abstract

Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches suffer from inefficiency: they require large-scale datasets, often with restrictive pairing constraints; incur high computational cost from modeling the joint distribution; and rely on multi-stage training pipelines. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods.
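The single flow-matching objective mentioned in the abstract can be illustrated with a minimal single-sample sketch. This is an assumption-laden simplification, not the paper's implementation: it uses a straight (rectified-flow-style) path between an endpoint `x0` and a data point `x1`, so the regression target for the drift is simply `x1 - x0`.

```python
import random

def flow_matching_loss(drift, x0, x1):
    """Single-sample flow-matching loss along the straight path
    x_t = (1 - t) * x0 + t * x1, whose velocity is (x1 - x0)."""
    t = random.random()                 # sample a time uniformly in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1      # point on the interpolation path
    target = x1 - x0                    # velocity of the straight path
    pred = drift(x_t, t)                # drift network's predicted velocity
    return (pred - target) ** 2         # squared regression error
```

In FlowBind's setting, one endpoint of each per-modality path would lie in the shared latent space, so minimizing this loss jointly over all modalities trains both the latent and the drift networks in a single stage.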

FlowBind achieves strong performance across various cross-modal generation settings, including one-to-one, one-to-many, and many-to-one generation.

Shared Latent Space of FlowBind

FlowBind’s shared latent space supports smooth, semantically meaningful interpolations across modalities, revealing a well-aligned multimodal representation.

Latent interpolation results
Data with a blue boundary indicates the input, showing that our shared latent representation unifies information from all input modalities.

Many-to-One Generation

FlowBind can take multiple inputs and generate outputs that are aligned with all provided modalities.

Text + Audio → Image


Image + Audio → Text


One-to-Many Generation

FlowBind can take a single input and generate multiple outputs of different modalities.

Text → Image + Audio


Image → Text + Audio


One-to-One Generation

FlowBind supports all six one-to-one cross-modal generation directions among text, images, and audio.

Text → Image


Image → Text


Text → Audio


Audio → Text


Audio → Image


Image → Audio


References

  1. Any-to-Any Generation via Composable Diffusion
    arXiv:2305.10855
  2. OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
    arXiv:2412.01169

Citation

If you find our work helpful, please cite the following paper.

@misc{cha2025flowbind,
  title={FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows},
  author={Cha, Yeonwoo and Kim, Semin and Kwon, Jinhyeon and Hong, Seunghoon},
  eprint={arXiv:2512.15420},
  year={2025}
}