nielsr (HF Staff) committed · verified · Commit 99a1d30 · 1 parent: d49646c

Improve model card with abstract, library tag, and code link


This PR enhances the model card for Show-o2 by:

- Adding the detailed paper abstract to provide a better understanding of the model.
- Including `library_name: transformers` metadata, as the model leverages components from the 🤗 Transformers library (e.g., Qwen2.5); see the loading sketch after this list.
- Integrating a direct link to the specific GitHub repository for Show-o2, making it easier for users to access the code.
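
As a rough illustration of what the `library_name: transformers` tag implies, here is a minimal sketch (not the official Show-o2 loading code) that pulls the Qwen2.5 tokenizer the model builds on via 🤗 Transformers. The checkpoint name `Qwen/Qwen2.5-7B-Instruct` is an assumption; the actual model-loading pipeline lives in the linked Show-o2 repository.

```python
# Minimal sketch: load a Qwen2.5 tokenizer via 🤗 Transformers.
# The checkpoint name is an assumption; see the Show-o2 repository
# for the actual model-loading code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(tokenizer("Show-o2 unifies understanding and generation.")["input_ids"][:8])
```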

Files changed (1)
  1. README.md +8 -3
README.md CHANGED
````diff
@@ -1,11 +1,12 @@
 ---
 license: apache-2.0
 pipeline_tag: any-to-any
+library_name: transformers
 ---
+
 <div align="center">
 <br>
 
-
 [//]: # (<h3>Show-o2: Improved Unified Multimodal Models</h3>)
 
 [Jinheng Xie](https://sierkinhane.github.io/)<sup>1</sup>&nbsp;
@@ -14,9 +15,13 @@ pipeline_tag: any-to-any
 
 <sup>1</sup> [Show Lab](https://sites.google.com/view/showlab/home?authuser=0), National University of Singapore&nbsp; <sup>2</sup> Bytedance&nbsp;
 
-[![ArXiv](https://img.shields.io/badge/Arxiv-<2506.15564>-<COLOR>.svg)](https://arxiv.org/abs/2506.15564) [![WeChat badge](https://img.shields.io/badge/微信-加入-green?logo=wechat&amp)](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
+[![ArXiv](https://img.shields.io/badge/Arxiv-<2506.15564>-<COLOR>.svg)](https://arxiv.org/abs/2506.15564) [![Code](https://img.shields.io/badge/Code-<GitHub_Repository>-<COLOR>.svg)](https://github.com/showlab/Show-o/tree/main/show-o2) [![WeChat badge](https://img.shields.io/badge/微信-加入-green?logo=wechat&amp)](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
 </div>
 
+## Abstract
+
+This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o/tree/main/show-o2.
+
 ## What is new about Show-o2?
 We perform the unified learning of multimodal understanding and generation on the text token and **3D Causal VAE space**, which is scalable for **text, image, and video modalities**. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependency of multimodal understanding and generation. We employ specific heads with **autoregressive modeling and flow matching** for the overall unified learning of **multimodal understanding, image/video generation, and mixed-modality generation**.
 <img src="overview.png" width="1000">
@@ -76,4 +81,4 @@ To cite the paper and model, please use the below:
 }
 ```
 ### Acknowledgments
-This work is heavily based on [Show-o](https://github.com/showlab/show-o).
+This work is heavily based on [Show-o](https://github.com/showlab/Show-o).
````
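
Since both the new abstract and the "What is new" section describe two task-specific heads (autoregressive modeling on the language head, flow matching on the flow head), the snippet below is a minimal, hypothetical sketch of how such a dual-head objective can be wired up. It is not the official Show-o2 implementation; module names, shapes, and the simple linear-interpolation flow-matching target are illustrative assumptions.

```python
# Hypothetical sketch of the dual-head objective described above: an autoregressive
# language head trained with cross-entropy and a flow head trained with a
# flow-matching (velocity-regression) loss on 3D-VAE latents. Names, shapes, and
# the linear-interpolation target are assumptions, not the official Show-o2 code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedHeads(nn.Module):
    """Two task-specific heads on top of a shared language-model backbone."""
    def __init__(self, hidden_size=1024, vocab_size=32000, latent_dim=16):
        super().__init__()
        self.language_head = nn.Linear(hidden_size, vocab_size)  # next-token logits
        self.flow_head = nn.Linear(hidden_size, latent_dim)      # velocity over VAE latents

# Toy tensors standing in for backbone hidden states and 3D-VAE latents.
B, T, N, H, V, D = 2, 8, 8, 1024, 32000, 16
heads = UnifiedHeads(H, V, D)
text_hidden = torch.randn(B, T, H)           # hidden states at text positions
text_targets = torch.randint(0, V, (B, T))   # next-token targets
x0, x1 = torch.randn(B, N, D), torch.randn(B, N, D)  # noise and clean latents
t = torch.rand(B, 1, 1)
x_t = (1 - t) * x0 + t * x1                  # point on the linear interpolation path
# In the real model, the backbone would be run on x_t (plus text); here we fake its output.
visual_hidden = torch.randn(B, N, H)

# Autoregressive loss on the language head (text token prediction).
ar_loss = F.cross_entropy(heads.language_head(text_hidden).flatten(0, 1),
                          text_targets.flatten())

# Flow-matching loss on the flow head: regress the velocity x1 - x0 at x_t.
fm_loss = F.mse_loss(heads.flow_head(visual_hidden), x1 - x0)

total_loss = ar_loss + fm_loss
```

The only point of the sketch is the structure: a single backbone's hidden states feed two heads, so text positions are supervised with cross-entropy while latent positions are supervised with a velocity regression, and the two losses are summed for unified training.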