Neural Network Speedster: 5 Practical ways to speed up neural networks

When the hardware you have is poor Ol' faithful...

Feb 11, 2023

5 practical techniques for blazing fast neural nets ⚡️
[Consider Reading] Video game worlds will never be the same again 🎮
[Check out] Cacophony to Grammy worthy 🎶
[MUST watch] The only resource you need to learn distributed training 🏭
[Consider reading] LLMs are coming for other tools folks! 🤖

The Fast and the Neurious: 5 Tricks to make your neural networks go brrr…

Let's face it - ChatGPT, Stable Diffusion, or even Google's new kid on the block, B.A.R.D are massive models. Training a model of similar size is beyond the reach of most of us.

MosaicML, a company that focuses on making the training of these models affordable, reduced the cost of training Stable Diffusion from $160k to $125k. Yes, you read that right.

MosaicML @MosaicML

🚨 Price cut for training Stable Diffusion 🚨 Brand new speedups dropped the cost from $160K to $125K, all made possible with our MosaicML tooling. Want to train your own stable diffusion model? Sign up for a demo today! forms.mosaicml.com/demo

Mihir Patel @mvpatel2000

Two weeks ago, we released a blog showing training Stable Diffusion from scratch only costs $160K. Proud to report that blog is already out of date. It now costs 💸 $125K 💸. Stay tuned for more speedup from @MosaicML, coming soon to a diffusion model near you! https://t.co/GDsQLO1QmX https://t.co/qeyHNtS5Dn

Despite their ongoing efforts, you or I aren't going to train these models on our desktops anytime soon (unless you secretly own a server farm that you haven't told anyone about). But training a model is just one part of the puzzle. What users (and companies) care about is how they can use these models to build cool stuff.

If you took the language model behind ChatGPT or B.A.R.D and tried to run it on your phone (assuming that memory isn't an issue), you'd see speeds that are at most comparable with this sloth:

That's because these models need specialized hardware to run efficiently. Specifically, they need a ton of memory and compute to handle the queries they get in a timely manner. Most consumer hardware can't meet these requirements. If you try using the cloud to run these models instead, you'll rack up a hefty bill really fast since inference costs aren't cheap either.

So, how can you make neural networks faster when it actually counts? I've got you covered.

That isn’t the Future-Flash. That’s your neural net!

The five techniques I'll share below are used in production to power some of the apps you love to use like your camera, your favorite search & recommendation app, and even social media applications. In fact, some of them help run AI models entirely on mobile devices. Without further ado, let's get started.

I'm dividing these techniques into two broad buckets - During training and after training. There are some tricks you can use while training models that not only make training faster but also make the model you train lightweight and nimble when you deploy it. Additionally, these give you fine-grained control over how much quality you sacrifice in exchange for speed. Remember, there's no free lunch!

But if you couldn't wait and already finished training your model, there are techniques that you can use after training to put your model on a diet. But, you won't have the same level of control over how much "goodness" you lose as a consequence.

During Training

Network architecture design

One of the first decisions you make in your machine learning experiment has a huge impact on how feasible your solution will be for all downstream tasks. This is the choice of architecture. The advice "Don't be a hero, use tried and true models" is generally correct. However, there are some edge cases when designing your own network might be preferable. For example, you might want to run your model on a phone or a tablet, or you might have a very high volume of queries that you need to handle quickly. In those cases, consider using the following design elements generously:

Separable convolutions
Pooling layers

These reduce the number of operations you do during inference, making the model faster. The MobileNet model was designed with these sensibilities in mind. As the name suggests, it's great for mobile devices!

Here's a tutorial that walks you through this: https://keras.io/api/applications/mobilenet/

Another approach is to avoid hand-designing the network altogether. We can let AI design an AI that works well for the circumstances you care about.

This is called Neural Architecture Search or NAS for short. In NAS, you let AI choose the best architecture from a candidate pool of architectures. At every step, the candidates are mutated in some shape or form giving rise to newer architectures and the process repeats until the best one is found (according to some criteria you set earlier). While time-consuming, NAS often leads to much better networks compared to hand-designing them. So, you can actually ask a NAS framework to design fast networks for you.

A shameless self-plug here. My amazing team worked tirelessly to open-source a toolkit that efficiently searches for the optimal network based on your hardware and performance constraints. We call it DyNAS-T.

Try it out here: https://github.com/IntelLabs/DyNAS-T

Knowledge Distillation

Using knowledge distillation is another smart way to pack a lot of power into smaller models. In knowledge distillation, you transfer the "knowledge" from a much larger model called the teacher into a smaller model called the "student". Over time the student learns to imitate the teacher and thus performs really well but at a fraction of the compute cost.

If you'd like to learn more about knowledge distillation, I wrote an explainer in plain English here:

Gradient Ascent

The Obi-Wan Skywalker Algorithm

This Week on Gradient Ascent: If knowledge distillation were Star Wars 💫 Why skip connections are awesome - The doodle edition 🎨 [Resource] High-quality image generation: Explained 🖼️ [Resource] Convert code into equations using code ➕ [Paper] The universal optimizer we've been waiting for 🤖…

3 years ago · 6 likes · Sairam Sundaresan

Mixed precision training

The final "during training" approach I'll cover is called mixed precision training. Neural networks are just a bunch of weights and biases (fancy numbers) that you multiply an input with to get amazing results. The beauty is that these weights and biases are learnt during training meaning less guessing for us. But, these numbers are floating point numbers. So, they take up a lot of memory. A floating point number occupies 32 bits (full precision). This allows it to represent a wide range of numbers before and after the decimal point.

A few years ago, researchers found that just 16-bit floating point numbers (half-precision) could do the job just as well. But you couldn't just willy-nilly switch out all the 32-bit floats with 16-bit ones. That had calamitous results.

Instead, you had to carefully choose which ones to keep in full precision and which ones to convert to half precision. As you might imagine, this gets tedious to do manually for larger models (They have millions of floating point numbers!).

Enter Automatic Mixed Precision or AMP which does this in just a few lines of code. What's even better is that oftentimes, this results in minimal degradation in performance.

Here's a lovely slide deck that explains AMP: https://www.cs.toronto.edu/ecosystem/documents/AMP-Tutorial.pdf

After Training

Pruning

In life, we often improve by periodically revisiting things and removing unnecessary stuff that we no longer need or that slows us down. Turns out that that strategy works like a charm for neural nets too!

Pruning is the process of completely dropping neurons that don't contribute much to the output. To achieve this, we set either the weights or the connections to zero. While the architecture looks the same after pruning, it's sparser since a lot of neurons are dropped. If done carefully, you can get pretty good results with pruning. But, you don't quite have the control you did as in the earlier methods.

Here's how you do pruning in Pytorch: https://pytorch.org/tutorials/intermediate/pruning_tutorial.html

Quantization

We previously saw how reducing a 32-bit float to 16-bits helped speed up training and reduced the model size. Well, quantization takes that to a whole other level. Typically in quantization, you store and compute intermediate results in 8-bit integers! That's a 4x reduction in memory bandwidth and model size. Your poor old phone can store 4 numbers where it previously could only store one. All of a sudden computations look like this:

But, it isn't that easy. Quantization requires very careful handling because it's extremely destructive. You've reduced the information you had by 4x as well. Some models lose a lot of quality in exchange for this speed and that might not always suit your use case. Like pruning, you do this post-training, so you don't have as much control over the results.

Here's a nice tutorial on quantization: https://pytorch.org/blog/quantization-in-practice

There you have it! These techniques will help you make your next neural network nifty, nimble, and nice. Which technique do you turn to when your models are too slow? Drop a note below!

Resources To Consider:

Unbounded 3D Scenes from 2D images!

Paper: https://arxiv.org/abs/2302.01330

Project page: https://scene-dreamer.github.io/

This is a really interesting piece of work where the authors "generate" infinite 3D scenes from random noise. Their model is trained on just 2D images which makes the approach pretty incredible and scalable. I can imagine these being the starting point for video game world generation in the near future. Check out the video below to see some really cool results.

That sounds like music to my ears

Paper: https://google-research.github.io/noise2music/noise2music.pdf

Project page: https://google-research.github.io/noise2music/

Google's new model generates high-fidelity music from noise. No surprises here that not one, but a series of diffusion models are applied in succession to generate the final music clip. To generate a training set for this model, the authors pseudo-labeled a large corpus of music using two other deep models. An LLM generated a large set of generic music descriptions that could be used as captions while a music-text joint embedding model assigned the captions to each music clip. The results are simply amazing!

Distributed neural network training explained

Link: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Distributed training is one of the most important skills to learn as a machine learning practitioner. This video from Microsoft is an excellent resource for all practitioners. Definitely check this out.

Nando de Freitas 🏳️‍🌈 @NandoDF

This video on distributed neural net training should be part of every machine learning & AI course in the world. It is brilliantly done. Thanks ⁦@gbarthmaron⁩ for pointing me to it. Has anyone already converted it to a homework exercise in a class?

microsoft.comZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters - Microsoft ResearchThe latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of cost, time, and ease of code integration. Microsoft is releasing an open-source library called DeepSpeed, which vastly advances large model training by impr…

Toolformer: Transformers that can use software!

Paper: https://arxiv.org/abs/2302.04761

Imagine if you had a language model that can teach itself to use external tools. It could call simple APIs, know exactly what arguments to pass to these APIs and then use the results. Guess what, in this paper, that's exactly what the authors did. This LLM can control a range of external tools including a calculator, Q/A system, search engines, and more! It's just February 2023 folks.