FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale Paper • 2605.14445 • Published 6 days ago • 19
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation Paper • 2605.10912 • Published 9 days ago • 45
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling Paper • 2605.13301 • Published 7 days ago • 152
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models Paper • 2605.08735 • Published 11 days ago • 68
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 11 days ago • 77
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling Paper • 2605.08083 • Published 12 days ago • 66
AcademiClaw: When Students Set Challenges for AI Agents Paper • 2605.02661 • Published 16 days ago • 16
From Context to Skills: Can Language Models Learn from Context Skillfully? Paper • 2604.27660 • Published 17 days ago • 157
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI Paper • 2605.06651 • Published 13 days ago • 15
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key Paper • 2605.06638 • Published 13 days ago • 14
SkillOS: Learning Skill Curation for Self-Evolving Agents Paper • 2605.06614 • Published 13 days ago • 45
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery Paper • 2604.25256 • Published 22 days ago • 29
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation Paper • 2604.24764 • Published 23 days ago • 118
Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL Paper • 2604.17073 • Published Apr 18 • 9
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval Paper • 2604.18584 • Published about 1 month ago • 15
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents Paper • 2604.17308 • Published Apr 19 • 22
When Can LLMs Learn to Reason with Weak Supervision? Paper • 2604.18574 • Published about 1 month ago • 25