Improve model card with abstract, library tag, and code link
This PR enhances the model card for Show-o2 by:
- Adding the detailed paper abstract to provide a better understanding of the model.
- Including `library_name: transformers` metadata, as the model leverages components from the 🤗 Transformers library (e.g., Qwen2.5); a quick way to verify this metadata is sketched just after this list.
- Integrating a direct link to the specific GitHub repository for Show-o2, making it easier for users to access the code.
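
Once this PR is merged, both metadata fields can be read back through the Hub API. A minimal sketch using `huggingface_hub` follows; the repo id is a placeholder for illustration, not taken from this PR:

```python
from huggingface_hub import model_info

# Placeholder repo id -- substitute the actual Show-o2 repository on the Hub.
info = model_info("showlab/Show-o2")

# After this PR, the card metadata should expose both fields.
print(info.pipeline_tag)   # expected: "any-to-any"
print(info.library_name)   # expected: "transformers"
```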
README.md CHANGED

@@ -1,11 +1,12 @@
 ---
 license: apache-2.0
 pipeline_tag: any-to-any
+library_name: transformers
 ---
+
 <div align="center">
 <br>
 
-
 [//]: # (<h3>Show-o2: Improved Unified Multimodal Models</h3>)
 
 [Jinheng Xie](https://sierkinhane.github.io/)<sup>1</sup>
@@ -14,9 +15,13 @@ pipeline_tag: any-to-any
 
 <sup>1</sup> [Show Lab](https://sites.google.com/view/showlab/home?authuser=0), National University of Singapore <sup>2</sup> Bytedance
 
-[](https://arxiv.org/abs/2506.15564) [](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
+[](https://arxiv.org/abs/2506.15564) [](https://github.com/showlab/Show-o/tree/main/show-o2) [](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
 </div>
 
+## Abstract
+
+This paper presents improved native unified multimodal models, *i.e.*, Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o/tree/main/show-o2.
+
 ## What is the new about Show-o2?
 We perform the unified learning of multimodal understanding and generation on the text token and **3D Causal VAE space**, which is scalable for **text, image, and video modalities**. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependency of multimodal understanding and generation. We employ specific heads with **autoregressive modeling and flow matching** for the overall unified learning of **multimodal understanding, image/video and mixed-modality generation.**
 <img src="overview.png" width="1000">
@@ -76,4 +81,4 @@ To cite the paper and model, please use the below:
 }
 ```
 ### Acknowledgments
-This work is heavily based on [Show-o](https://github.com/showlab/
+This work is heavily based on [Show-o](https://github.com/showlab/Show-o).
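
The abstract and the "What is the new about Show-o2?" paragraph added in this diff describe a shared language-model backbone with a language head trained autoregressively on text tokens and a flow head trained with flow matching over 3D-causal-VAE latents. The sketch below is only a conceptual illustration of that two-head setup: all module names, sizes, and the rectified-flow-style loss are hypothetical, the dual-path spatial(-temporal) fusion is omitted, and nothing here is the released Show-o2 code.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, not taken from the Show-o2 release.
VOCAB, DIM, LATENT = 1000, 256, 64


class TwoHeadSketch(nn.Module):
    """Shared backbone with a language head (autoregressive next-token
    prediction) and a flow head (velocity prediction for flow matching
    over visual latents). Conceptual sketch only."""

    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, DIM)
        self.latent_proj = nn.Linear(LATENT, DIM)   # visual latents -> backbone width
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(DIM, VOCAB)        # autoregressive text head
        self.flow_head = nn.Linear(DIM, LATENT)     # predicts a velocity field

    def forward(self, text_ids, visual_latents):
        # Concatenate text tokens and projected visual latents into one sequence.
        seq = torch.cat(
            [self.text_embed(text_ids), self.latent_proj(visual_latents)], dim=1
        )
        h = self.backbone(seq)
        n_text = text_ids.shape[1]
        return self.lm_head(h[:, :n_text]), self.flow_head(h[:, n_text:])


def flow_matching_loss(model, text_ids, clean_latents):
    """Rectified-flow style target: interpolate noise -> data, regress velocity (x1 - x0)."""
    x1 = clean_latents
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1, 1)
    xt = (1 - t) * x0 + t * x1
    _, v_pred = model(text_ids, xt)
    return ((v_pred - (x1 - x0)) ** 2).mean()


if __name__ == "__main__":
    model = TwoHeadSketch()
    text_ids = torch.randint(0, VOCAB, (2, 8))
    latents = torch.randn(2, 16, LATENT)            # stand-in for 3D-VAE patch latents

    logits, _ = model(text_ids, latents)
    ar_loss = nn.functional.cross_entropy(          # next-token prediction on text
        logits[:, :-1].reshape(-1, VOCAB), text_ids[:, 1:].reshape(-1)
    )
    fm_loss = flow_matching_loss(model, text_ids, latents)
    print(float(ar_loss), float(fm_loss))
```

The point of the sketch is the split the paper's description makes explicit: discrete text positions feed a cross-entropy next-token loss, continuous visual latents feed a velocity-regression loss, and both gradients flow through the same backbone.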