---
language:
- en
base_model:
- nvidia/OpenMath-Nemotron-1.5B
pipeline_tag: text-generation
---

<div align="center">
<span style="font-family: default; font-size: 1.5em;">JustRL: Simplicity at Scale</span>
<div>
🚀 Competitive RL Performance Without Complex Techniques 🌟
</div>
</div>

<br>

<div align="center" style="line-height: 1;">
<a href="https://github.com/HBX-hbx/JustRL" style="margin: 2px;">
<img alt="Code" src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/collections/hbx/justrl" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/JustRL-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="[YOUR_BLOG_LINK]" target="_blank" style="margin: 2px;">
<img alt="Notion" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>

## Overview

**JustRL** demonstrates that competitive reinforcement learning performance for small language models doesn't require complex multi-stage pipelines or dynamic schedules. Using a minimal recipe with single-stage training and fixed hyperparameters, we achieve state-of-the-art results on mathematical reasoning tasks.

We release two models:

- [**JustRL-DeepSeek-1.5B**](https://huggingface.co/hbx/JustRL-DeepSeek-1.5B): Trained from DeepSeek-R1-Distill-Qwen-1.5B
- [**JustRL-Nemotron-1.5B**](https://huggingface.co/hbx/JustRL-Nemotron-1.5B): Trained from OpenMath-Nemotron-1.5B

Both models use identical hyperparameters without per-model tuning, demonstrating the robustness of our approach.
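
Both checkpoints can be fetched ahead of time with `huggingface_hub`; this optional snippet assumes nothing beyond the repo ids listed above.

```python
from huggingface_hub import snapshot_download

# Pre-download both released checkpoints to the local HF cache.
for repo_id in ["hbx/JustRL-DeepSeek-1.5B", "hbx/JustRL-Nemotron-1.5B"]:
    snapshot_download(repo_id)
```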

## Performance

### JustRL-DeepSeek-1.5B (Based on DeepSeek-R1-Distill-Qwen-1.5B)

| Model | AIME24 (@32) | AIME25 (@32) | AMC23 (@32) | MATH-500 (@4) | Minerva (@4) | OlympiadBench (@4) | HMMT25 (@32) | BRUMO25 (@32) | CMIMC25 (@32) | Avg |
| ------------------------ | ------------ | ------------ | ----------- | ------------- | ------------ | ------------------ | ------------ | ------------- | ------------- | --------- |
| DeepSeek-R1-Distill-1.5B | 29.90 | 22.40 | 63.82 | 84.90 | 34.65 | 45.95 | 13.44 | 30.94 | 12.89 | 37.65 |
| DeepScaleR-1.5B-Preview | 40.21 | 28.65 | 73.83 | 89.30 | 39.34 | 52.79 | 18.96 | 40.00 | 21.00 | 44.88 |
| ProRL-V2 | 51.87 | 35.73 | 88.75 | 92.00 | 49.03 | 67.84 | 19.38 | 47.29 | **25.86** | 53.08 |
| BroRL | **57.50** | 36.88 | / | **92.14** | 49.08 | 61.54 | / | / | / | / |
| JustRL-DeepSeek-1.5B | 52.60 | **38.75** | **91.02** | 91.65 | **51.47** | **67.99** | **21.98** | **52.71** | 25.63 | **54.87** |
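
Here @k is the number of completions sampled per problem; we read each score as mean accuracy over those k samples, averaged across the benchmark (avg@k, an assumption based on standard practice). A minimal sketch of that reading, with names of our own choosing:

```python
def avg_at_k(correct: list[list[bool]]) -> float:
    """avg@k: mean accuracy over the k samples drawn for each problem,
    then averaged across all problems in the benchmark."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)

# Two problems with 4 samples each: accuracies 0.75 and 0.25 -> 50.0.
print(avg_at_k([[True, True, False, True], [False, True, False, False]]))
```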

The real question is whether this simplicity comes at a computational cost. It doesn't. We use roughly half of ProRL-V2's compute budget with a single-stage recipe and fixed hyperparameters. BroRL requires 4.9× more compute, raising rollouts to 512 per example to exhaustively explore the solution space. Our approach achieves competitive performance without that overhead.

### JustRL-Nemotron-1.5B (Based on OpenMath-Nemotron-1.5B)

| Model | AIME24 (@32) | AIME25 (@32) | AMC23 (@32) | MATH-500 (@4) | Minerva (@4) | OlympiadBench (@4) | HMMT25 (@32) | BRUMO25 (@32) | CMIMC25 (@32) | Avg |
| ---------------------- | ------------ | ------------ | ----------- | ------------- | ------------ | ------------------ | ------------ | ------------- | ------------- | --------- |
| OpenMath-Nemotron-1.5B | 58.75 | 48.44 | 90.55 | 92.40 | 26.93 | 71.70 | 30.10 | 61.67 | 30.08 | 56.74 |
| QUESTA-Nemotron-1.5B | **71.56** | 62.08 | 93.44 | 92.95 | **32.08** | 72.28 | **40.94** | **67.50** | 41.48 | 63.81 |
| JustRL-Nemotron-1.5B | 69.69 | **62.92** | **96.02** | **94.15** | 30.24 | **76.59** | 40.63 | 66.88 | **41.72** | **64.32** |

We achieve a 64.32% average, slightly outperforming QuestA's 63.81% and leading on five of nine benchmarks. The gap is narrow, which makes sense: both approaches are pushing the boundary of what's achievable at 1.5B scale. The key difference is how we get there: we use half the compute and reach a slightly better average without designing a complex curriculum as QuestA does.

## Method

Our approach is deliberately minimal:

**Core Algorithm**: Standard GRPO with binary outcome rewards (sketched below)

- **Reward**: Simple DAPO verifier (string-matching, no SymPy)
- **Training**: Single-stage, no curriculum or stage transitions
- **Hyperparameters**: Fixed throughout (no adaptive schedules)
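
To make the recipe concrete, here is a minimal, illustrative sketch of the two core pieces: a binary string-matching reward and GRPO's group-relative advantage. The normalization rules and function names are our own, not the released verifier's.

```python
import numpy as np

def _norm(s: str) -> str:
    """Light string normalization; the actual DAPO verifier has its own rules."""
    return s.strip().strip("$").replace(" ", "")

def binary_reward(answer: str, gold: str) -> float:
    """Binary outcome reward: 1.0 on a string match, else 0.0 (no SymPy)."""
    return 1.0 if _norm(answer) == _norm(gold) else 0.0

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: center and scale each rollout's reward by
    the mean and std of its group (all rollouts for the same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# One prompt, 8 rollouts, 3 of which match the gold answer "42".
group = np.array([binary_reward(a, "42")
                  for a in ["42", "41", "$42$", "43", "42 ", "7", "13", "0"]])
print(grpo_advantages(group))  # positive for correct rollouts, negative otherwise
```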

We train on [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k).

## Usage

Load a checkpoint with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "hbx/JustRL-Nemotron-1.5B"  # or "hbx/JustRL-DeepSeek-1.5B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # assumed; a minimal completion of the original call
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is 15% of 240?"  # illustrative problem
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

For batched inference, [vLLM](https://github.com/vllm-project/vllm) works as well:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="hbx/JustRL-Nemotron-1.5B",  # or "hbx/JustRL-DeepSeek-1.5B"
    tensor_parallel_size=1,
    max_model_len=32768,
)
# Sampling settings below are illustrative, not the paper's evaluation config.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)
problems = ["What is 15% of 240?"]  # illustrative prompts
responses = llm.generate(problems, sampling_params)
print(responses[0].outputs[0].text)  # each result carries its sampled text
```

## Reproduction

Our evaluation scripts are based on [POLARIS](https://github.com/ChenxinAn-fdu/POLARIS); the scripts are available [here](https://github.com/HBX-hbx/JustRL).

## Citation

```bibtex
@misc{he2025justrl,
  title  = {TODO},
  author = {TODO},
  year   = {2025},
}
```