After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running well below peak efficiency, the problem isn't your model. It's most likely a misuse of the hardware. 🛠️
Questions that seemed simple but had no clear answers: Why is MoE training slower than dense training? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
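On the checkpointing question, one classical way to reason about the throughput trade-off is the Young/Daly approximation, which balances checkpoint cost against the expected failure rate. A minimal sketch (the function names and the numbers below are illustrative, not taken from our clusters):

```python
import math

def daly_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly first-order estimate of the optimal checkpoint interval."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

def checkpoint_overhead(interval_s: float, checkpoint_cost_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)

# Illustrative: a 60 s checkpoint write and one expected failure per 12 h
interval = daly_interval_s(60, 12 * 3600)     # ~2277 s, i.e. roughly 38 min
overhead = checkpoint_overhead(interval, 60)  # ~2.6% of wall-clock time
```

Checkpointing more often than this wastes throughput on writes; less often, and you lose too much work per failure — which is why the answer shifts as cluster size (and therefore failure rate) grows.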
That's why we built The Smol Training Playbook: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.
We validated real vs. theoretical bandwidth across the entire stack: HBM3 hitting roughly 3 TB/s, with NVLink 4.0 and PCIe Gen5 likewise landing below their spec-sheet numbers. Then we ran collective operations across 128 GPUs (16 nodes, 8x H100s each) and measured how performance degrades at scale: all-reduce bandwidth drops to a fraction of its single-node value by the time you reach 16 nodes.
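For anyone reproducing this kind of benchmark: the standard way to compare all-reduce numbers across node counts (the convention used by nccl-tests) is "bus bandwidth", which normalizes the measured algorithm bandwidth by the data volume a ring all-reduce actually moves. A sketch, assuming a ring-style collective:

```python
def allreduce_bus_bw(bytes_per_rank: int, n_ranks: int, seconds: float) -> float:
    """Bus bandwidth (bytes/s) for an all-reduce, per the nccl-tests convention.

    A ring all-reduce sends and receives 2*(n-1)/n of the buffer per rank,
    so bus_bw = algo_bw * 2*(n-1)/n, where algo_bw = bytes_per_rank / seconds.
    """
    algo_bw = bytes_per_rank / seconds
    return algo_bw * 2 * (n_ranks - 1) / n_ranks

# Illustrative: a 1 GiB buffer all-reduced across 8 GPUs in 5 ms
bus_bw = allreduce_bus_bw(1 << 30, 8, 0.005)  # ~350 GiB/s
```

Without this normalization, raw algorithm bandwidth makes small node counts look artificially fast, since the 2*(n-1)/n factor approaches 2 as n grows.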
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
Tremendous quality-of-life upgrade on the Hugging Face Hub - we now have auto-complete for emojis 🤗 🥳
Get ready for lots more very serious analysis on a whole range of topics from yours truly now that we have unlocked this full range of expression 🤗
With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data!)
The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any insight into, e.g., their handling of personal data or (anti?-)competitive practices in commercial licensing - we'll see how that plays out as new models are released after August 2nd.
In the meantime, I wanted to see how the template works for a fully open-source + commercially viable model, so I filled it out for SmolLM3, which my colleagues at Hugging Face released earlier this month 🤗 ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too!)
Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations. Definitely a step forward for transparency.
This is a fantastic example of large-scale curation of public domain books with intentional governance for AI research and use - definitely recommend checking it out, experimenting with the metadata (institutional/institutional-books-1.0-metadata), and starting to build on top of it 🤗