The Secrets of Distributed Training with Zach Mueller
Practical Strategies for Faster, Cheaper, Smarter Model Training
Your model doesn't fit in memory. Or it fits, but takes days to train.
Distributed training solves this by splitting your model's computation across multiple GPUs or machines, allowing you to train faster and handle larger models. But the options feel overwhelming: DDP, FSDP, ZeRO, pipeline parallelism.
As models explode from millions to billions of parameters, picking the right distributed strategy has become critical. Choose wrong and you'll burn weeks debugging communication bottlenecks or, worse, discover your approach won't scale when you need it most.
Zach Mueller is a Technical Lead at Hugging Face. He's been in the AI field for almost a decade, starting in the fast.ai community, where he quickly learned how modern training pipelines are built and operated. He then moved to Hugging Face, where he currently leads the Accelerate project and manages the Transformers Trainer. He has written numerous blogs, built many courses, and given talks on distributed training and PyTorch throughout his career.
Zach's latest course, "From Scratch to Scale", launches in September, and it's jam-packed with the secrets of distributed training for the modern world. What's even more exciting is that he's got the who's who of the industry giving guest lectures throughout the course.
As a valued subscriber of Gradient Ascent, you get 35% off here:
When the opportunity came to probe him on distributed training, I jumped at it. What follows below are Zach's answers to my questions on distributed training and his career so far.
TL;DR:
Technical Insights:
• Start with 2 GPUs, not 8, scale incrementally to catch issues early
• Your biggest bottleneck is rarely compute. Profile data loading first
• Use FSDP when memory-constrained, DDP when optimizing for speed
• 2D parallelism (data + model sharding) is the sweet spot for most models
• Network bandwidth caps at 25 GB/s on Ethernet, and this becomes your ceiling
Learning Philosophy:
• You don't need deep expertise to start. Zach began Accelerate with zero multi-GPU experience
• Get the "vibe" of how methods work through hands-on practice before diving into papers
• Build in public. Sharing your work creates unexpected opportunities
• Real expertise comes from being pushed beyond your comfort zone, not careful preparation
The Interview
Your path into ML is fascinating, from marine biology to leading Accelerate at Hugging Face. What was your personal 'scratch to scale' moment? When did you first realize a single GPU wasn't enough, and what were the initial, painful lessons?
[Zach]: I actually started on Accelerate with no multi-GPU experience. What ended up happening is I joined Hugging Face when I was still training on Google Colab. Basically, they presented me with, “Hey, we want you to tackle this particular problem with Sylvain. Can you help us out?” That’s what led me to work on Accelerate.
That being said, I was quickly thrown into the world of, “Hey, we’re no longer training ULMFiT, we’re training 70 billion parameter models across eight GPUs,” which was a massive change for someone coming from Google Colab.
You have a clear passion for making complex topics accessible through open source projects. What drives that passion for open source and teaching?
[Zach]: I think a good answer for this is the fact that I got started in the fast.ai community. Without that and Jeremy’s push towards building in the open, I probably would have stayed silent, built everything on private GitHub repos, and never done anything with it.
Jeremy pushing us to build publicly started a chain reaction. I decided to make my teachings public, which led to essentially five iterations of a course, then to somehow landing a job at Novetta and later Hugging Face, and now to running my own courses, where I love teaching people about multi-GPU training.
What advice do you have for skilled engineers who are just beginning to face the challenges of scaling beyond a single machine?
[Zach]: As always, don’t take a big problem and expect that you can one-shot the solution.
Start with two GPUs, then four, then eight. Learn the quirks that come when you scale. Identifying your biggest time sinks is the most important thing.
Make sure that you’re not losing time by pulling down data. Make sure that the longest span of time is the model actually training, and even then, make sure that small part is as optimized as possible. Get that going on two GPUs, then make sure it works on four, then eight, and so on as you scale up.
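Zach's point about finding your biggest time sinks can start with something as simple as two timers, before reaching for a full profiler. Here's a minimal sketch, assuming a standard PyTorch classification loop (the model, DataLoader, and loss here are placeholders, not anything specific to Accelerate), that splits each step into time spent waiting on data versus time spent computing:

```python
import time

import torch
import torch.nn.functional as F

def profile_epoch(model, loader, optimizer, device="cuda"):
    """Split each step's wall-clock time into data loading vs. forward/backward/step."""
    data_time, compute_time = 0.0, 0.0
    t0 = time.perf_counter()
    for inputs, targets in loader:
        t1 = time.perf_counter()
        data_time += t1 - t0                 # time spent waiting on the DataLoader

        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()         # make GPU work visible to the host timer

        t2 = time.perf_counter()
        compute_time += t2 - t1              # everything that isn't waiting on data
        t0 = t2

    total = data_time + compute_time
    print(f"data: {data_time / total:.0%}  compute: {compute_time / total:.0%}")
```

If the data fraction dominates, adding more GPUs won't help until the input pipeline is fixed.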
Engineers often focus on training time when they first hit scaling issues. What are the hidden costs of not having a scaling strategy in place early?
[Zach]: Not having a scaling strategy early means you end up eating time figuring out which strategy works. That’s research, that’s development, and that’s figuring out how to make these scaling strategies actually work on your systems.
So you should always have a general idea of what you want to do and why, even if it’s not applicable yet. For example, should you start with ZeRO-1 to get things spun up quickly? What’s the fastest way to get there? DeepSpeed.
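To make that concrete, here's a hedged sketch of spinning up ZeRO-1 with DeepSpeed's Python API. The config keys are standard DeepSpeed fields, but the toy model, batch size, and learning rate are placeholder assumptions, and it assumes a DeepSpeed version whose `initialize` accepts a config dict:

```python
import deepspeed
import torch.nn as nn

# Placeholder model; swap in whatever you're actually training.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 1},                       # ZeRO-1: shard optimizer states only
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
}

# deepspeed.initialize sets up the distributed engine, the sharded optimizer,
# and mixed precision in one call.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Launched with the deepspeed launcher (or through Accelerate's DeepSpeed integration), each training step then just calls `engine.backward(loss)` and `engine.step()`.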
Moving from a single machine to a distributed system introduces problems like network partitions and nondeterminism. What is the biggest mental shift an engineer needs to make?
[Zach]: You need to think about every stage of the training pipeline as something you can optimize, and more than that, even at the hardware level.
Like, are you using the fastest possible method for getting your data?
Ethernet has a cap of 100 GbE, which is 25 gigs a second, if you're actually managing to hit that. Anything less than that and you're wasting time on things like saving model weights and downloading data. So you really have to think top-down across the entire compute stack: where are your bottlenecks?
Based on those bottlenecks, that is how you determine what distributed strategies you're allowed to use with the amount of time that you want to get things done.
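As a back-of-envelope illustration of why the interconnect becomes the ceiling (the parameter count and bandwidth figures below are my assumptions, not Zach's), here's what a single ring all-reduce of the gradients costs at different link speeds:

```python
def allreduce_seconds(num_params, bytes_per_param, bandwidth_gb_s, world_size):
    """Rough lower bound for a ring all-reduce: each rank moves about
    2 * (N - 1) / N of the gradient buffer over its link."""
    payload_gb = num_params * bytes_per_param / 1e9
    traffic_gb = 2 * (world_size - 1) / world_size * payload_gb
    return traffic_gb / bandwidth_gb_s

# 7B parameters, fp16 gradients (2 bytes each), 8 ranks.
for name, bw_gb_s in [("~25 GB/s Ethernet", 25), ("~300 GB/s NVLink-class", 300)]:
    t = allreduce_seconds(7e9, 2, bw_gb_s, 8)
    print(f"{name}: ~{t:.2f} s per gradient sync")
```

Multiply that by every optimizer step in a month-long run and the interconnect quickly decides how long training actually takes.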
Data Parallelism hits a memory wall, which leads to sharded approaches like ZeRO and FSDP. Can you briefly explain the core innovation of sharding the model's states?
[Zach]: The idea behind model sharding is to take a model that's far too big to fit into memory in some capacity (whether it's weights, its gradients, or its optimizer states) and instead distribute it across a bunch of different GPUs. The idea being that you make use of this memory pool of interconnected GPUs rather than having one.
So, say you take an 80 billion parameter model; it requires roughly 80 gigs of memory at 8-bit precision, for instance. Well, I don't have 80 gigs of memory at home in a single 3090, but I do across four of them.
And so sharding lets you treat your hardware as one pool of resources rather than as individual units of compute for you to do your work in.
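Here's a minimal sketch of what "treat your GPUs as one memory pool" looks like with PyTorch's FSDP. The toy model, sizes, and launch assumptions (torchrun setting the usual rank environment variables) are mine, not Zach's:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this is the model that's too big for one card.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(24)]).cuda()

    # FSDP shards parameters, gradients, and optimizer state across all ranks
    # instead of replicating them on every GPU the way DDP does.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    model(x).sum().backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with something like `torchrun --nproc_per_node=4 train_fsdp.py` so each GPU gets its own process and its own shard of the model.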
When a model is too big for one GPU, teams turn to Model Parallelism, either Pipeline or Tensor. What is the key trade-off when choosing between them?
[Zach]: As with anything distributed, the main cost is communication. Typically, you want pipeline parallelism if you can, because it's particularly fast, especially because you overlap each stage. So, as the first GPU gets done with the first batch and it gets sent to the second GPU, the first GPU has another batch that it's starting.
The key to choosing between pipeline and tensor parallelism is that tensor parallelism is for really big models whose individual layers need to be split across GPUs. For inference, though, teams tend to use pipeline parallelism because it's faster overall: the model you're serving probably fits on each GPU, so you don't need to worry about splitting the actual tensor computations across devices.
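To picture the stage layout, here's a toy sketch of splitting a model across two GPUs by hand; the sizes and two-stage split are illustrative assumptions. It shows placement only, and deliberately omits the micro-batch scheduling that real pipeline parallelism uses to keep both stages busy, which is the overlap Zach describes:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy two-stage split: first half of the layers on cuda:0, second half on cuda:1."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Hand the activations to the next stage. In real pipeline parallelism,
        # cuda:0 would already be working on the next micro-batch at this point.
        return self.stage1(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.device)  # cuda:1
```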
The largest models use 3D Parallelism, combining all these strategies. Is orchestrating this where most teams struggle to bridge the gap between theory and practice?
[Zach]: So we're actually finding that 2D parallelism is typically the sweet spot. Just look at Kimi K2, which essentially used a weird variation of ZeRO-1.
That being said, as you apply more and more of these distributed topologies on top of each other, the codebase gets more complex. Typically, teams reach for frameworks that handle this for you, like Nanotron or TorchTitan, which tend to work very well out of the box. However, you still need to know which default values and overall settings work best for your model and your particular distributed setup.
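For readers curious what a 2D layout looks like in raw PyTorch, here's a sketch of hybrid sharding over a DeviceMesh. It assumes PyTorch 2.2 or newer and an 8-GPU node, and the 2x4 split (replicate across two groups, shard within each group of four) is purely illustrative; frameworks like TorchTitan wire this up for you:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh over 8 GPUs: replicate across the first dimension (plain data parallelism),
# shard parameters within each group of 4 along the second dimension.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

model = nn.Sequential(*[nn.Linear(2048, 2048) for _ in range(16)]).cuda()
model = FSDP(model, device_mesh=mesh, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```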
Let's talk about your course, 'Scratch to Scale.' How does it help an engineer build a robust mental model for navigating the complexities of DDP, FSDP, and model parallelism?
[Zach]: The idea behind Scratch to Scale is that I wanted to make a course that would help me during my first year working on Accelerate. We started out by taking the most basic algorithms for distributed training and implementing them from scratch.
Now it's turned into something closer to a conference. We're making sure that students not only could implement FSDP from scratch if they had to, but also have the latest knowledge of what's being done in the world of distributed training today, so their skills stay relevant and they're learning what's actually in use.
The course includes over 15 guest talks from engineers at companies like Meta and Snowflake. Why was building this 'conference-like' experience important to you?
[Zach]: It was important for me to bring in people who can talk about the areas I don't know well enough myself but that are still important. Most of all, people I want to learn from. I want to hear from the people I follow, the ones I check for the latest news and the latest knowledge.
However, it's quickly snowballing into an all-encompassing idea, and we're at the point where I have to think carefully about which speakers come in for this cohort. I want all of this information for the students, but I also need to make sure we don't overwhelm them with too many speakers, and with the number we're already at, we're edging towards that. Because this course is not cheap, I really wanted the students to feel like they're getting the best value out of it.
By centralizing all of this knowledge in a single location, a single resource they can go back to in perpetuity, the course can act as a guide to understanding the landscape today, even if they don't get through all the material until months after it ends.
For the Senior Engineer in your audience, what's the single most valuable 'I wish I knew this a year ago' skill they will get from the course?
[Zach]: What I hope the senior engineer in the audience gets out of this course, more than anything, is access to the opinions, insights, and guidance of real people who have dealt with the real problems of training at scale. That helps guide their decisions, their learning, and their implementations as they take a model and scale it to however many GPUs they need to, without feeling like they're starting from zero.
By taking the lessons these people bring and applying them directly to their own decisions, the hope is that they won't have to repeat the mistakes thousands of engineers make every single time they decide to scale something.
How does the course help a Team Lead de-risk projects and avoid common scaling pitfalls?
[Zach]: If you understand what each of the distributed practices does and their pros and cons, you're able to weigh in on whether or not the team is veering off course when it comes to decisions like:
How do we scale the model next?
How much hardware do we have to use?
You can help guide them towards using everything as efficiently as possible, and it also lets you build MVPs and iterate as fast as possible, because you can identify the problems they might be hitting and come up with potential solutions.
How does the course equip a Manager or leader to confidently make high-stakes, multi-million dollar decisions about scaling strategy and hardware?
[Zach]: By identifying early the problems that will happen as you scale a method. Right? So if we're not using optimized hardware very early on, you can catch that by asking the critical questions: "Well, okay, are we using InfiniBand? Are we using PCIe?" One of them is much faster than the other.
And if we're training for a month, that is the difference between a training run taking a month and taking a week or two. Scale that up, and it can cost thousands upon thousands of dollars in compute, wasted time, and so forth.
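The arithmetic behind that is simple enough to sketch; the GPU count and hourly rate below are placeholder assumptions for illustration, not real quotes:

```python
gpus = 64
dollars_per_gpu_hour = 2.50        # placeholder cloud rate
hours_per_day = 24

def run_cost(days):
    return gpus * dollars_per_gpu_hour * hours_per_day * days

slow = run_cost(30)   # a month on under-provisioned interconnect
fast = run_cost(10)   # the same job finishing in about a week and a half
print(f"${slow - fast:,.0f} of compute saved")   # $76,800 at these assumptions
```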
You've been called 'one of the key people making distributed machine learning more accessible.' What does it mean to you to bridge the gap between cutting-edge research and the daily reality for engineers?
[Zach]: What I've found is that most material on distributed training is written at a very low level and geared towards an audience that has been reading the papers and digging through the implementations thoroughly. That leaves a barrier to entry: folks who have only been training models for a year or two can't really get into it because they don't have the prior knowledge.
So I've been trying to find ways to surface the unspoken rules of the room: what you actually need to read to understand how this all works. The educational content I'm making tries to help that person onboard faster, so they can not only take Axolotl out of the box, for instance, run a command, and have it magically scale, but also understand how it scales, why it scales, and identify the bottlenecks, without needing to read a thousand papers to get there.
My personal ideology is that if you're a practitioner, you should understand how these methods feel. You could call it a general vibe of how these things work. You don't necessarily at first need to know all the math, all the science, and all the intricacies to get there. You build the foundational understanding first by just playing with it and getting it going. And then when you're ready to bring it to the next level, you go dig through the papers.
I care more about getting people going and getting people started than bogging them down with papers and blockers because that doesn't help breed a new generation of engineers.
Where can people find you online?
[Zach]: You can follow me on X at TheZachMueller and on LinkedIn. I also have a Substack called the Mueller Minute, where I (typically) post a daily tidbit of information related to distributed training that I think more folks should know about.
Check out Zach's course "From Scratch to Scale" if you're looking to build one of the most sought-after skills in the AI age. As a Gradient Ascent subscriber, you get 35% off below:
I hope you enjoyed this chat with Zach. See you next time!