AI & ML interests

Training DALL-E with volunteers from all over the Internet using hivemind and dalle-pytorch (NeurIPS 2021 demo)

Recent Activity

flozi00 
posted an update 5 days ago
We recently discussed how Tensor Parallelism slices matrices to reduce latency within a single node. But what happens when you need to scale beyond that, where the bandwidth drops?

That is where Pipeline Parallelism (PP) takes over.

Instead of slicing the operation, PP slices the model depth. It turns your GPU cluster into an assembly line: GPU 0 handles layers 1-12, GPU 1 handles 13-24, and so on.
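
To make the depth split concrete, here is a minimal sketch of handing each rank a contiguous slice of layers (the layer count, sizes, and even split are illustrative assumptions, not code from the guide):

```python
import torch.nn as nn

def build_stage(layers: nn.ModuleList, rank: int, world_size: int) -> nn.Sequential:
    """Give each pipeline rank a contiguous chunk of the model's layers."""
    per_stage = len(layers) // world_size            # assumes the depth divides evenly
    start, end = rank * per_stage, (rank + 1) * per_stage
    return nn.Sequential(*layers[start:end])         # rank 0 -> layers 0-11, rank 1 -> 12-23, ...

# Toy example: a 48-layer stack split across 4 pipeline stages
blocks = nn.ModuleList(nn.Linear(1024, 1024) for _ in range(48))
stage = build_stage(blocks, rank=1, world_size=4)    # this rank owns layers 12-23
```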

The hardware challenge here isn't the interconnect speed—it is the "Pipeline Bubble." In a naive setup, expensive H100s sit idle for most of the cycle waiting for data to flow through the chain.

My latest guide breaks down the scheduling strategies used to minimize this idle silicon time.

In this deep dive, we cover:

The Hardware Mechanics: Vertical Slicing
Unlike TP which requires "chatty" All-Reduce operations, PP relies on lightweight Point-to-Point (Send/Recv) communication. This makes it the only viable strategy for crossing node boundaries over Ethernet or InfiniBand.
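
For illustration, the hand-off between adjacent stages can be reduced to a blocking send/recv pair (a minimal sketch that assumes the process group is already initialized and the activation shape is known on both sides):

```python
import torch
import torch.distributed as dist

def stage_forward(stage: torch.nn.Module, rank: int, world_size: int,
                  micro_batch, act_shape, device) -> torch.Tensor:
    """Run one pipeline stage: recv activations from the previous rank,
    compute, and send the result to the next rank via point-to-point ops."""
    if rank > 0:
        # Receive the previous stage's output instead of reading the input batch.
        micro_batch = torch.empty(act_shape, device=device)
        dist.recv(micro_batch, src=rank - 1)
    out = stage(micro_batch)
    if rank < world_size - 1:
        # Lightweight P2P hand-off; no All-Reduce on the critical path.
        dist.send(out.contiguous(), dst=rank + 1)
    return out
```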

Fighting the Bubble: 1F1B vs. GPipe
We analyze the scheduling algorithms that keep the GPUs fed:

GPipe: The "flush and fill" approach. Simple, but memory-intensive.
1F1B (One-Forward-One-Backward): The industry standard. By interleaving forward and backward passes, we aggressively free up memory and reduce the bubble size.
The Math of Efficiency
The "Bubble" is a mathematical inevitability. We look at the efficiency formula
M+N−1
M

to understand why you need massive global batch sizes to make PP worth the effort.
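
To put rough numbers on that (an illustrative calculation, not figures from the article):

```python
def pipeline_efficiency(num_microbatches: int, num_stages: int) -> float:
    """Fraction of time the pipeline does useful work: M / (M + N - 1)."""
    return num_microbatches / (num_microbatches + num_stages - 1)

# With 4 stages, 8 micro-batches only reach ~73% utilization,
# while 64 micro-batches push it to ~95% -- hence the need for large global batches.
print(pipeline_efficiency(8, 4), pipeline_efficiency(64, 4))
```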

The article includes a conceptual PyTorch implementation of the 1F1B state machine to illustrate exactly how the data is handed off between stages.
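
As a rough, standalone approximation of that idea, here is a sketch of the per-stage 1F1B schedule only (my own simplification, not the article's code; the actual send/recv of activations and gradients is omitted):

```python
from collections import deque

def one_f_one_b(rank: int, num_stages: int, num_microbatches: int):
    """Yield ('F', i) / ('B', i) events for one stage under a 1F1B schedule.

    After a short warm-up, the stage alternates forward and backward passes,
    so it never holds more than (num_stages - rank) sets of activations,
    unlike GPipe which keeps all of them until the flush.
    """
    warmup = min(num_stages - rank - 1, num_microbatches)
    pending = deque()                    # micro-batches awaiting their backward pass
    fwd = 0

    for _ in range(warmup):              # warm-up: fill the pipeline
        yield ("F", fwd)
        pending.append(fwd)
        fwd += 1
    while fwd < num_microbatches:        # steady state: one forward, one backward
        yield ("F", fwd)
        pending.append(fwd)
        fwd += 1
        yield ("B", pending.popleft())
    while pending:                       # cool-down: drain the remaining backwards
        yield ("B", pending.popleft())

# Stage 0 of a 4-stage pipeline processing 8 micro-batches:
print(list(one_f_one_b(rank=0, num_stages=4, num_microbatches=8)))
```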

Read the full breakdown here:
https://flozi.net/en/guides/ai/scaling/pipeline_parallel
flozi00 
posted an update 11 days ago
When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes, you need to slice the matrices themselves.

My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP). We look at how to shard individual operations across devices to make a cluster function as one massive accelerator.

This isn't high-level theory—it is a look at the bare metal implementation.

Here is what is covered in the deep dive:

The Strategies: Column vs. Row Parallelism
We analyze how to split weight matrices (W) and inputs (X).

Column-Linear: Splits weights by columns. Requires an All-Gather to reconstruct the output.
Row-Linear: Splits weights by rows. Requires an All-Reduce to sum partial results.
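
As a reference point, here is a bare-bones sketch of both splits using torch.distributed primitives (shard construction and process-group setup are assumed to happen elsewhere):

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, w_col_shard: torch.Tensor,
                           world_size: int) -> torch.Tensor:
    """W split by columns: each rank produces a slice of the output features,
    then an All-Gather stitches the full output back together."""
    local_out = x @ w_col_shard                       # [batch, out_features / world_size]
    chunks = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(chunks, local_out)
    return torch.cat(chunks, dim=-1)                  # full [batch, out_features]

def row_parallel_linear(x_shard: torch.Tensor, w_row_shard: torch.Tensor) -> torch.Tensor:
    """W split by rows (and X by columns): each rank holds a partial sum,
    and an All-Reduce adds the partial results into the final output."""
    partial = x_shard @ w_row_shard                   # partial [batch, out_features]
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```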
The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a Column-Parallel layer and a Row-Parallel layer, we can skip synchronization entirely during the activation phase. This cuts communication events by 50% per block.
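
A compact sketch of that column-then-row sandwich (again assuming the shards and process group are set up elsewhere):

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def megatron_mlp_forward(x: torch.Tensor, w1_col_shard: torch.Tensor,
                         w2_row_shard: torch.Tensor) -> torch.Tensor:
    """One tensor-parallel MLP block in the Megatron style: because GeLU is
    element-wise, it runs directly on each rank's slice of the intermediate
    activations, so the only synchronization in the block is the final All-Reduce."""
    h_local = x @ w1_col_shard                          # column-parallel: local hidden slice
    h_local = F.gelu(h_local)                           # no communication needed here
    out_partial = h_local @ w2_row_shard                # row-parallel: partial output
    dist.all_reduce(out_partial, op=dist.ReduceOp.SUM)  # one sync per block
    return out_partial
```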

The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path. The CUDA cores effectively stall while waiting for the ring-reduce to finish.

Intra-Node: Works well because NVLink provides enough bandwidth to hide this latency.
Inter-Node: Fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs required by TP.
The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/tensor_parallel
flozi00 
posted an update 19 days ago
Running large language models efficiently is more than just raw GPU power. The latest guide breaks down the essential math to determine if your LLM workload is compute-bound or memory-bound.

We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.

In this guide, you will learn how to:

Calculate your GPU's operational intensity (Ops:Byte Ratio)
Determine your model's arithmetic intensity
Identify whether your workload is memory-bound or compute-bound
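
For a flavor of the math, here is a minimal roofline-style check. The hardware numbers are deliberately placeholders, not the RTX PRO 6000's actual specs; substitute the figures the guide works through for the real card.

```python
# Roofline-style check: is the decode step memory-bound or compute-bound?
# The hardware numbers below are placeholders, NOT the RTX PRO 6000's real specs.
PEAK_FLOPS    = 500e12      # assumed peak BF16 throughput, FLOP/s
MEM_BANDWIDTH = 1.5e12      # assumed memory bandwidth, bytes/s
gpu_ops_per_byte = PEAK_FLOPS / MEM_BANDWIDTH        # the GPU's Ops:Byte ratio

# Decode phase of a dense LLM: roughly 2 FLOPs per parameter per generated token,
# while every weight byte must be streamed from memory once per step.
params          = 32e9      # e.g. a 32B-parameter model
bytes_per_param = 2         # BF16 weights
batch_size      = 1
arithmetic_intensity = (2 * params * batch_size) / (params * bytes_per_param)

if arithmetic_intensity < gpu_ops_per_byte:
    print(f"memory-bound ({arithmetic_intensity:.1f} vs {gpu_ops_per_byte:.1f} FLOP/byte)")
else:
    print(f"compute-bound ({arithmetic_intensity:.1f} vs {gpu_ops_per_byte:.1f} FLOP/byte)")
```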

Read the full guide here: https://flozi.net/en/guides/ai/llm-inference-math
flozi00 
posted an update 24 days ago
flozi00 
posted an update about 1 month ago
I just got asked about Blackwell systems versus Grace Blackwell systems: what exactly is the difference, and how big is the performance gap between them?

https://flozi.net/en/hardware/nvidia/benchmarks/b200-vs-gb200-efficiency-comparison

Here's a summary of the key points from the article:

GB200 (Grace Blackwell) is a Superchip: It integrates a Grace CPU and two Blackwell GPUs into a single package.
B200 is a GPU-only module: It's designed to be paired with x86 or ARM CPUs in more traditional server setups.


Performance and Efficiency:

Based on MLPerf Training v5.0 benchmarks, the article concludes:

GB200 systems are approximately 42% more efficient than B200 systems on average. This is especially true in large-scale deployments (100+ GPUs), where the GB200's integrated design and high-speed NVLink interconnect provide a significant advantage.

In smaller, single-node systems (e.g., 8 GPUs), the performance difference is much smaller, around 10-15%.


Use Cases:

Choose GB200 for large-scale AI clusters, training massive models, and when maximum efficiency is the top priority.

Choose B200 for smaller deployments, when you need the flexibility to choose your own CPU, or for mixed AI and HPC workloads.
flozi00 
posted an update about 1 month ago
A few weeks ago, I decided it was time for me to leave LinkedIn.
It had gotten quiet around my open-source activities over the last year, so I thought something had to change.

That's why my focus will shift to sharing experiences and insights about hardware, drivers, kernels, and Linux. I won't post about how to use models, build agents, or do prompting. I want to write about the deeper layers that the current hype is built on.

I will start posting summaries of my articles here on the hub.

English version:
https://flozi.net/en

German translated version:
https://flozi.net/de

Feel free to reach out if you want to read about something specific.
IlyasMoutawwakil 
posted an update 4 months ago
🚀 Optimum: The Last v1 Release 🚀
Optimum v1.27 marks the final major release in the v1 series. As we close this chapter, we're laying the groundwork for a more modular and community-driven future:
- Optimum v2: A lightweight core package for porting Transformers, Diffusers, or Sentence-Transformers to specialized AI hardware/software/accelerators.
- Optimum‑ONNX: A dedicated package where the ONNX/ONNX Runtime ecosystem lives and evolves, faster-moving and decoupled from the Optimum core.

🎯 Why this matters:
- A clearer governance path for ONNX, fostering stronger community collaboration and improved developer experience.
- Faster innovation in a more modular, open-source environment.

💡 What this means:
- More transparency, broader participation, and faster development driven by the community and key actors in the ONNX ecosystem (PyTorch, Microsoft, Joshua Lochner 👀, ...)
- A cleaner, more maintainable core Optimum, focused on extending HF libraries to special AI hardware/software/accelerators tooling and used by our partners (Intel Corporation, Amazon Web Services (AWS), AMD, NVIDIA, FuriosaAI, ...)

🛠️ Major updates I worked on in this release:
✅ Added support for Transformers v4.53 and SmolLM3 in ONNX/ONNXRuntime.
✅ Solved batched inference/generation for all supported decoder model architectures (LLMs).
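
For context, a minimal usage sketch of what those updates enable (the model id and generation settings here are my own illustrative assumptions, not taken from the release notes):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"            # assumed checkpoint name, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"                  # left-pad so batched generation works cleanly

# export=True converts the Transformers checkpoint to ONNX on the fly
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

prompts = ["The capital of France is", "ONNX Runtime is useful because"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```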

✨ Big shoutout to @echarlaix for leading the refactoring work that cleanly separated ONNX exporter logic and enabled the creation of Optimum‑ONNX.

📝 Release Notes: https://lnkd.in/gXtE_qji
📦 Optimum : https://lnkd.in/ecAezNT6
🎁 Optimum-ONNX: https://lnkd.in/gzjyAjSi
#Optimum #ONNX #OpenSource #HuggingFace #Transformers #Diffusers
julien-c 
posted an update 8 months ago
BOOOOM: Today I'm dropping TINY AGENTS

the 50 lines of code Agent in Javascript 🔥

I spent the last few weeks working on this, so I hope you will like it.

I've been diving into MCP (Model Context Protocol) to understand what the hype was all about.

It is fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.

But while doing that came my second realization:

Once you have a MCP Client, an Agent is literally just a while loop on top of it. 🤯

➡️ read it exclusively on the official HF blog: https://huggingface.co/blog/tiny-agents
thomwolf 
posted an update 8 months ago
If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform Reachy 2 on the Pollen website, and the Pollen community and team here on the hub at pollen-robotics

We're so excited to build and share more open-source robots with the world in the coming months!