Improve model card: Add pipeline tag, library name, and enrich content

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +55 -19
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
-license: apache-2.0
 language:
 - en
+license: apache-2.0
 tags:
 - reinforcement-learning
 - planning
@@ -11,6 +11,8 @@ tags:
 - llada
 size_categories:
 - 8B
+pipeline_tag: text-generation
+library_name: transformers
 ---
 
 # LLaDA-8B-BGPO-countdown
@@ -24,27 +26,61 @@ size_categories:
 
 ## Model Details
 
-- **Model Type**: Diffusion Large Language Model (dLLM)
-- **Parameters**: 8 billion
-- **Training Method**: Boundary-Guided Policy Optimization (BGPO)
-- **Base Model**: LLaDA-8B-Instruct
-- **Task**: Countdown
-- **Language**: English
+- **Model Type**: Diffusion Large Language Model (dLLM)
+- **Parameters**: 8 billion
+- **Training Method**: Boundary-Guided Policy Optimization (BGPO)
+- **Base Model**: LLaDA-8B-Instruct
+- **Task**: Countdown
+- **Language**: English
 
 ## Training Details
 
-- **Training Steps**: 560 steps
-- **Response Length**: 256 tokens
-- **Train Diffusion Steps**: 128
-- **Eval Diffusion Steps**: 256
-- **Block Size**: 32
-- **Monte Carlo Sample Size ($n_t$)**: 16
-- **Learning Rate**: 5e-7
-- **Batch Size**: 16
-- **Framework**: Built on VeRL (Volcengine Reinforcement Learning)
+- **Training Steps**: 560 steps
+- **Response Length**: 256 tokens
+- **Train Diffusion Steps**: 128
+- **Eval Diffusion Steps**: 256
+- **Block Size**: 32
+- **Monte Carlo Sample Size ($n_t$)**: 16
+- **Learning Rate**: 5e-7
+- **Batch Size**: 16
+- **Framework**: Built on VeRL (Volcengine Reinforcement Learning)
 
 ## Usage & Limitations
 
-- Primarily designed for countdown tasks.
-- Performance may vary on other tasks.
-- Requires appropriate computational resources for inference.
+- Primarily designed for countdown tasks.
+- Performance may vary on other tasks.
+- Requires appropriate computational resources for inference.
+
+## Performance
+
+1. **Overall Performance**: BGPO vs. baselines on mathematics, coding, and planning tasks
+![Main Results](https://github.com/THU-KEG/BGPO/raw/main/assets/main_results.png)
+
+2. **Monte Carlo Analysis**: Performance with different sampling sizes $n_t$
+![MC Results](https://github.com/THU-KEG/BGPO/raw/main/assets/mc_results.png)
+
+3. **Out-of-Domain**: Generalization performance (<span style="color: #939393">gray</span> = in-domain)
+![OOD Results](https://github.com/THU-KEG/BGPO/raw/main/assets/ood_results.png)
+
+## Acknowledgments
+
+We thank the open-source community for their valuable contributions, particularly:
+- [VeRL](https://github.com/volcengine/verl) for the RL framework
+- [HuggingFace](https://huggingface.co/) for model hosting
+- The research community for their feedback and suggestions
+
+## Citation
+
+If you find our work useful, please consider citing our paper:
+
+```bibtex
+@misc{lin2025boundaryguidedpolicyoptimizationmemoryefficient,
+  title={Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models},
+  author={Nianyi Lin and Jiajie Zhang and Lei Hou and Juanzi Li},
+  year={2025},
+  eprint={2510.11683},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2510.11683},
+}
+```
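
Since the card now declares `library_name: transformers` and `pipeline_tag: text-generation`, a loading sketch could round out the Usage section. Below is a minimal sketch, assuming the checkpoint keeps the remote-code setup of its base model LLaDA-8B-Instruct; the repo id and prompt are illustrative placeholders, and the sampling settings echo the eval configuration listed in the card (256 diffusion steps, 256-token responses, block size 32).

```python
# Minimal loading sketch -- the repo id below is a placeholder, and LLaDA
# checkpoints ship custom modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "THU-KEG/LLaDA-8B-BGPO-countdown"  # assumption: actual repo id may differ

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # ~16 GB of GPU memory for 8B params in bf16
).eval().cuda()

# Countdown-style prompt (illustrative): reach the target from the given numbers.
messages = [{
    "role": "user",
    "content": "Using the numbers [19, 36, 55, 7], create an equation that "
               "equals 65. You may use +, -, *, / and each number at most once.",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
).cuda()

# Note: dLLMs are not decoded with transformers' standard generate(); the
# BGPO/LLaDA repositories provide the diffusion sampler (e.g. 256 steps,
# gen_length=256, block_length=32 to match the eval setup above).
```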