Commit 05aae61 (verified) · hbx committed · Parent: bcad438

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +19 -17
  2. assets/fig1_aime24_curves_added.png +2 -2
README.md CHANGED
@@ -6,25 +6,26 @@ datasets:
 language:
 - en
 base_model:
-- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+- nvidia/OpenMath-Nemotron-1.5B
 pipeline_tag: text-generation
 ---
 
 <div align="center">
-<span style="font-family: default; font-size: 1.5em;">AscentRL: Simplicity at Scale</span>
+<span style="font-family: default; font-size: 1.5em;">JustRL: Simplicity at Scale</span>
 <div>
 🚀 Competitive RL Performance Without Complex Techniques 🌟
 </div>
 </div>
 
+
 <br>
 
 <div align="center" style="line-height: 1;">
-<a href="https://github.com/HBX-hbx/AscentRL" style="margin: 2px;">
+<a href="https://github.com/HBX-hbx/JustRL" style="margin: 2px;">
 <img alt="Code" src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
 </a>
-<a href="https://huggingface.co/collections/hbx/ascentrl" style="margin: 2px;">
-<img alt="Hugging Face" src="https://img.shields.io/badge/AscentRL-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
+<a href="https://huggingface.co/collections/hbx/justrl" style="margin: 2px;">
+<img alt="Hugging Face" src="https://img.shields.io/badge/JustRL-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
 </a>
 <a href="[YOUR_BLOG_LINK]" target="_blank" style="margin: 2px;">
 <img alt="Notion" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
@@ -38,11 +39,11 @@ pipeline_tag: text-generation
 
 ## Overview
 
-**AscentRL** demonstrates that competitive reinforcement learning performance for small language models doesn't require complex multi-stage pipelines or dynamic schedules. Using a minimal recipe with single-stage training and fixed hyperparameters, we achieve state-of-the-art results on mathematical reasoning tasks.
+**JustRL** demonstrates that competitive reinforcement learning performance for small language models doesn't require complex multi-stage pipelines or dynamic schedules. Using a minimal recipe with single-stage training and fixed hyperparameters, we achieve state-of-the-art results on mathematical reasoning tasks.
 
 We release two models:
-- [**AscentRL-DeepSeek-1.5B**](https://huggingface.co/hbx/AscentRL-DeepSeek-1.5B): Trained from DeepSeek-R1-Distill-Qwen-1.5B
-- [**AscentRL-Nemotron-1.5B**](https://huggingface.co/hbx/AscentRL-Nemotron-1.5B): Trained from OpenMath-Nemotron-1.5B
+- [**JustRL-DeepSeek-1.5B**](https://huggingface.co/hbx/JustRL-DeepSeek-1.5B): Trained from DeepSeek-R1-Distill-Qwen-1.5B
+- [**JustRL-Nemotron-1.5B**](https://huggingface.co/hbx/JustRL-Nemotron-1.5B): Trained from OpenMath-Nemotron-1.5B
 
 Both models use identical hyperparameters without per-model tuning, demonstrating the robustness of our approach.
 
@@ -62,25 +63,25 @@ Both models use identical hyperparameters without per-model tuning, demonstratin
 
 ## Performance
 
-### AscentRL-DeepSeek-1.5B (Based on DeepSeek-R1-Distill-Qwen-1.5B)
+### JustRL-DeepSeek-1.5B (Based on DeepSeek-R1-Distill-Qwen-1.5B)
 
 | Model | AIME24 (@32) | AIME25 (@32) | AMC23 (@32) | MATH-500 (@4) | Minerva (@4) | OlympiadBench (@4) | HMMT25 (@32) | BRUMO25 (@32) | CMIMC25 (@32) | Avg |
 | ------------------------ | ------------ | ------------ | ----------- | ------------- | ------------ | ------------------ | ------------ | ------------- | ------------- | --------- |
 | DeepSeek-R1-Distill-1.5B | 29.90 | 22.40 | 63.82 | 84.90 | 34.65 | 45.95 | 13.44 | 30.94 | 12.89 | 37.65 |
 | DeepScaleR-1.5B-Preview | 40.21 | 28.65 | 73.83 | 89.30 | 39.34 | 52.79 | 18.96 | 40.00 | 21.00 | 44.88 |
-| ProRL-V2 | 51.87 | 35.73 | 88.75 | 92.00 | 49.03 | **67.84** | 19.38 | 47.29 | **25.86** | 53.08 |
+| ProRL-V2 | 51.87 | 35.73 | 88.75 | 92.00 | 49.03 | 67.84 | 19.38 | 47.29 | **25.86** | 53.08 |
 | BroRL | **57.50** | 36.88 | / | **92.14** | 49.08 | 61.54 | / | / | / | / |
-| AscentRL-DeepSeek-1.5B | 52.29 | **37.19** | **91.02** | 91.55 | **51.47** | 66.77 | **21.98** | **52.71** | 25.63 | **54.51** |
+| JustRL-DeepSeek-1.5B | 52.60 | **38.75** | **91.02** | 91.65 | **51.47** | **67.99** | **21.98** | **52.71** | 25.63 | **54.87** |
 
 The real question is whether this simplicity comes at a computational cost. It doesn't. We use roughly half of ProRL-V2's compute budget with a single-stage recipe and fixed hyperparameters, while BroRL requires 4.9× more compute by raising rollouts to 512 per example, essentially exhaustively exploring the solution space. Our approach achieves competitive performance without this overhead.
 
-### AscentRL-Nemotron-1.5B (Based on OpenMath-Nemotron-1.5B)
+### JustRL-Nemotron-1.5B (Based on OpenMath-Nemotron-1.5B)
 
 | Model | AIME24 (@32) | AIME25 (@32) | AMC23 (@32) | MATH-500 (@4) | Minerva (@4) | OlympiadBench (@4) | HMMT25 (@32) | BRUMO25 (@32) | CMIMC25 (@32) | Avg |
 | ---------------------- | ------------ | ------------ | ----------- | ------------- | ------------ | ------------------ | ------------ | ------------- | ------------- | --------- |
 | OpenMath-Nemotron-1.5B | 58.75 | 48.44 | 90.55 | 92.40 | 26.93 | 71.70 | 30.10 | 61.67 | 30.08 | 56.74 |
 | QUESTA-Nemotron-1.5B | **71.56** | 62.08 | 93.44 | 92.95 | **32.08** | 72.28 | **40.94** | **67.50** | 41.48 | 63.81 |
-| AscentRL-Nemotron-1.5B | 69.69 | **62.92** | **96.02** | **94.15** | 30.24 | **76.59** | 40.63 | 66.88 | **41.72** | **64.32** |
+| JustRL-Nemotron-1.5B | 69.69 | **62.92** | **96.02** | **94.15** | 30.24 | **76.59** | 40.63 | 66.88 | **41.72** | **64.32** |
 
 We achieve a 64.32% average, slightly outperforming QuestA's 63.81% and leading on five of nine benchmarks. The gap is narrow, which makes sense: both approaches are pushing the boundaries of what is achievable at 1.5B scale. The key difference is how we get there: we use half the compute and reach a slightly better average without the complex curriculum QuestA relies on.
 
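For context on the tables above: the (@32) and (@4) suffixes appear to denote accuracy averaged over 32 (respectively 4) sampled completions per problem, i.e. avg@k. A minimal sketch of that averaging, with an illustrative helper that is not part of the project's evaluation harness:

```python
from statistics import mean

def avg_at_k(per_problem_correctness: list[list[bool]]) -> float:
    """avg@k sketch: mean accuracy over the k sampled completions of each
    problem, then averaged across problems, reported as a percentage."""
    per_problem = [mean(1.0 if ok else 0.0 for ok in samples)
                   for samples in per_problem_correctness]
    return 100.0 * mean(per_problem)

# Two problems, k=4 samples each: 3/4 and 1/4 correct.
print(avg_at_k([[True, True, True, False], [True, False, False, False]]))  # 50.0
```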
@@ -89,6 +90,7 @@ We achieve 64.32% average, slightly outperforming QuestA's 63.81% and leading on
 Our approach is deliberately minimal:
 
 **Core Algorithm**: Standard GRPO with binary outcome rewards
+
 - **Reward**: Simple DAPO verifier (string-matching, no SymPy)
 - **Training**: Single-stage, no curriculum or stage transitions
 - **Hyperparameters**: Fixed throughout (no adaptive schedules)
@@ -108,7 +110,7 @@ We train on [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DA
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_name = "hbx/AscentRL-Nemotron-1.5B"  # or AscentRL-DeepSeek-1.5B
+model_name = "hbx/JustRL-Nemotron-1.5B"  # or JustRL-DeepSeek-1.5B
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype="auto",
@@ -145,7 +147,7 @@ print(response)
 from vllm import LLM, SamplingParams
 
 llm = LLM(
-    model="hbx/AscentRL-Nemotron-1.5B",
+    model="hbx/JustRL-Nemotron-1.5B",
     tensor_parallel_size=1,
     max_model_len=32768
 )
@@ -162,12 +164,12 @@ responses = llm.generate(problems, sampling_params)
 
 ## Reproduction
 
-We provide evaluation scripts based on [POLARIS](https://github.com/ChenxinAn-fdu/POLARIS), the evaluation script is [TODO](TODO).
+Our evaluation scripts are based on [POLARIS](https://github.com/ChenxinAn-fdu/POLARIS) and are available [here](https://github.com/HBX-hbx/JustRL).
 
 ## Citation
 
 ```bibtex
-@misc{he2025ascentrl,
+@misc{he2025justrl,
   title = {TODO},
   author = {TODO},
   year = {2025},
 
assets/fig1_aime24_curves_added.png CHANGED

Git LFS Details (before)

  • SHA256: 41411783824481587b631388fd128e037bf953b9c025c20e29ca23a7aef72021
  • Pointer size: 131 Bytes
  • Size of remote file: 381 kB

Git LFS Details (after)

  • SHA256: 55e10fc37dbdee38f23dccfa144a6cf4fcadb6b87192927d4561e685ff15d482
  • Pointer size: 131 Bytes
  • Size of remote file: 395 kB