FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

KAIST

* Equal contribution

TL;DR: We propose an efficient flow-based multimodal any-to-any generation model with bidirectional flows.

An overview of FlowBind. During training, we jointly learn the shared latent and per-modality drift networks in a single stage. At inference, the learned drift networks perform flexible any-to-any generation by solving per-modality ODEs forward and backward in time.
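The inference procedure described above, translating between modalities by solving one modality's ODE forward in time into the shared latent and another's backward in time, can be sketched with simple Euler integration. This is not the authors' code: the linear drift functions below are toy stand-ins for the learned per-modality drift networks, chosen only so the flows are invertible in closed form.

```python
def euler_solve(drift, x, t0, t1, steps=1000):
    """Integrate dx/dt = drift(x, t) from t0 to t1 (t1 < t0 runs backward)."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * drift(x, t)
        t += dt
    return x

def make_drift(scale):
    # Toy linear drift dx/dt = scale * x: an invertible exponential flow
    # standing in for a learned per-modality drift network.
    return lambda x, t: scale * x

drift_image = make_drift(0.5)   # hypothetical "image" flow
drift_text = make_drift(-0.3)   # hypothetical "text" flow

# Image -> shared latent: solve the image ODE forward in time (t: 0 -> 1).
x_image = 2.0
z = euler_solve(drift_image, x_image, 0.0, 1.0)

# Shared latent -> text: solve the text ODE backward in time (t: 1 -> 0).
x_text = euler_solve(drift_text, z, 1.0, 0.0)

# Invertibility: running the image ODE backward recovers the input.
x_back = euler_solve(drift_image, z, 1.0, 0.0)
```

Because each flow is an ODE, the same drift network serves as both encoder (forward solve) and decoder (backward solve), which is what makes all six translation directions available from one model.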

Abstract

Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches suffer from inefficiency: they require large-scale datasets, often with restrictive pairing constraints; incur high computational cost from modeling the joint distribution; and rely on multi-stage training pipelines. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods.
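The single flow-matching objective mentioned in the abstract can be illustrated with a minimal single-sample sketch. This is an assumption-laden simplification, not the paper's implementation: it uses a straight (rectified-flow-style) path between an endpoint `x0` and a data point `x1`, so the regression target for the drift is simply `x1 - x0`.

```python
import random

def flow_matching_loss(drift, x0, x1):
    """Single-sample flow-matching loss along the straight path
    x_t = (1 - t) * x0 + t * x1, whose velocity is (x1 - x0)."""
    t = random.random()                 # sample a time uniformly in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1      # point on the interpolation path
    target = x1 - x0                    # velocity of the straight path
    pred = drift(x_t, t)                # drift network's predicted velocity
    return (pred - target) ** 2         # squared regression error
```

In FlowBind's setting, one endpoint of each per-modality path would lie in the shared latent space, so minimizing this loss jointly over all modalities trains both the latent and the drift networks in a single stage.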

FlowBind achieves strong performance across various cross-modal generation settings, including one-to-one, one-to-many, and many-to-one generation.

Shared Latent Space of FlowBind

FlowBind’s shared latent space supports smooth, semantically meaningful interpolations across modalities, revealing a well-aligned multimodal representation.

Latent interpolation results
Data with a blue boundary indicates the input, showing that our shared latent representation unifies information from all input modalities.

Many-to-One Generation

FlowBind can take multiple inputs and generate outputs that are aligned with all provided modalities.

Text + Audio → Image


Image + Audio → Text


One-to-Many Generation

FlowBind can take a single input and generate multiple outputs of different modalities.

Text → Image + Audio


Image → Text + Audio


One-to-One Generation

FlowBind supports all six one-to-one cross-modal generation directions among text, images, and audio.

Text → Image


Image → Text


Text → Audio


Audio → Text


Audio → Image


Image → Audio


References

  1. Any-to-Any Generation via Composable Diffusion
    arXiv:2305.10855
  2. OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
    arXiv:2412.01169

Citation

If you find our work helpful, please cite the following paper.

@misc{cha2025flowbind,
  title={FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows},
  author={Cha, Yeonwoo and Kim, Semin and Kwon, Jinhyeon and Hong, Seunghoon},
  eprint={arXiv:2512.15420},
  year={2025}
}