A Tree-mendous Doodle: A Visual Explanation of Gradient Boosting
Gradient boosting is all you need. And coffee...
It's been a tough few weeks in the tech industry as layoffs rear their ugly head. I've seen many friends and colleagues impacted, and it's been difficult to process personally. If you are grappling with this and need to talk, or need connections or referrals, please reply to this email. I'll do my best to help you in any way I can.
My friend, an engineering manager turned entrepreneur, wrote about the current predicament more beautifully than I ever could here. There's no better way to be prepared than to invest in ourselves and our personal growth. I hope for better times ahead.
This Week on Gradient Ascent:
Gradient boosting explained - the doodle edition 🎨
[Watch] How transformers behave in training vs test 📽️
[Try] Implement your own object detector from scratch 🧑‍💻
[Use] A nifty tool to build your own LLM applications 💻
[Consider reading] BERT style training for convolutional nets? 📜
[Consider reading] GANs make a comeback 📜
[Consider reading] Watermarking for large language models 📜
Poorly Drawn Machine Learning:
Gradient boosting is a versatile and powerful machine-learning algorithm. In simple terms, it's a way of combining many simple models (usually decision trees) to make a stronger and more accurate overall model. The idea is inspired by the wisdom of the crowd theory. This theory loosely states that the collective intelligence of large crowds outweighs that of individual experts.
How does it work?
In gradient boosting, we repeatedly add new models to the mix (called an ensemble). Each new model (called a weak learner) is trained to correct the mistakes of the previous models. Specifically, each new learner is fit on the residual errors of the ensemble built so far.
The residual error is the difference between the actual ground truth and the predicted value. By repeatedly adding learners to the ensemble to fix the predecessors' errors, we minimize the overall error and thus obtain a strong model.
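Here's a minimal, hand-rolled sketch of that loop for squared-error loss, using scikit-learn decision trees as the weak learners (the data and hyperparameters are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem: y = x^2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.full(len(y), y.mean())  # start from a constant prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                 # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2)  # a shallow weak learner
    tree.fit(X, residuals)                     # fit the new tree to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```

Each round, the ensemble's prediction moves a small step (scaled by the learning rate) toward whatever the residuals say is still missing.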
Useful concepts to learn
Loss function - A way to measure how well the model is doing. Examples: mean-squared error, mean-absolute error, cross-entropy, etc.
Gradient descent - The optimization idea behind gradient boosting. In gradient descent, we improve the model through small updates based on the gradient of the loss function (incidentally, this is where gradient boosting gets its name). See the snippet after this list for how gradients relate to residuals.
Weak learner - A cog in the overall boosting machine. This is a simple model, like a shallow decision tree, that isn't very accurate by itself (hence "weak"). However, it yields a really powerful model when combined with other weak learners.
Boosting - The process of sequentially adding new learners to the ensemble, each one trained to improve on the combined predictions of the learners before it.
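To make the "gradient" connection concrete: for squared-error loss, the negative gradient of the loss with respect to the current prediction is exactly the residual, which is why fitting trees to residuals amounts to gradient descent in function space. A tiny check, with made-up numbers:

```python
import numpy as np

y = np.array([3.0, -1.0, 2.5])     # ground truth (made-up values)
y_hat = np.array([2.0, 0.0, 2.0])  # current ensemble predictions

# Squared-error loss per example: L = 0.5 * (y - y_hat)^2.
# Its gradient w.r.t. the prediction is dL/dy_hat = -(y - y_hat),
# so the negative gradient is exactly the residual the next tree fits.
gradient = -(y - y_hat)
print(np.allclose(-gradient, y - y_hat))  # True
```

For other losses, the same recipe applies; the new learner simply fits the negative gradient instead of the raw residual.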
Applications
Gradient boosting can be used for both regression and classification problems. It particularly shines on tabular datasets and is often a key component of many Kaggle competition-winning solutions.
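In practice, you'd reach for a library implementation rather than a hand-rolled loop. Here's a quick sketch with scikit-learn (XGBoost and LightGBM are popular alternatives):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# A classic tabular classification dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```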
If you'd like to learn more, check out this detailed deep dive.
Resources To Consider:
How transformers work at training vs inference
In this video, Niels Rogge, a machine learning engineer at Hugging Face, walks through how a transformer network behaves at training time versus inference time. This might be slightly advanced if you're not familiar with transformers, but it's really well put together.
Implement an object detector from scratch
Link: https://www.storminthecastle.com/posts/01_classification/
In this blog series, John Robinson walks through building a single-shot object detector (in the style of YOLO/SSD) using PyTorch and the fastai library. He's released three parts so far, and each one is chock-full of details, visuals, and code. I highly recommend working through it yourself, using the blog articles as a guide.
Building applications with LLMs
Link: https://github.com/hwchase17/langchain
We see new apps and solutions built on large language models every day. However, building them is not a trivial process. LangChain aims to address this. Whether you want to build a chatbot, a Q&A agent over a Notion database, or something completely different, check out this repository. It can help you develop these applications. All you need to do is "pip install" it :)
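As a taste, here's roughly what a minimal LangChain pipeline looked like at the time of writing, adapted from the repository's README (you'll need an OpenAI API key, and since the library moves fast, the exact imports may have changed):

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Assumes the OPENAI_API_KEY environment variable is set.
llm = OpenAI(temperature=0.9)

prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)

# Chain the prompt template and the LLM into a single callable.
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("colorful socks"))
```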
Masked pretraining for convolutional networks?
Paper: https://arxiv.org/abs/2301.03580
Code: https://github.com/keyu-tian/SparK
In this paper, the authors propose SparK, the first BERT-style pretraining approach designed for convolutional networks (convnets). This is a really interesting breakthrough because it allows convnets to learn from masked "patches" without any modifications to the network architecture. To achieve this, the authors treat the unmasked patches as a flattened, sparse 3D point cloud, which lets them encode the input with sparse convolutions. Pretraining with SparK improves the results of standard convnets on downstream tasks.
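For intuition, here's a generic sketch of BERT-style masked image modeling (hide random patches, reconstruct the image, score the loss only on the hidden regions). This is not the authors' sparse-convolution implementation, just the general recipe it builds on; the patch size, mask ratio, and model are placeholders:

```python
import torch
import torch.nn.functional as F

def masked_pretrain_step(model, images, patch=32, mask_ratio=0.6):
    """One BERT-style step: mask random patches, reconstruct, loss on masked parts."""
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch

    # Random boolean mask over the patch grid, one grid per image.
    mask = torch.rand(B, 1, gh, gw, device=images.device) < mask_ratio
    # Upsample the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)

    masked_input = images.masked_fill(pixel_mask, 0.0)  # zero out hidden patches
    reconstruction = model(masked_input)                # model maps image -> image

    # Reconstruction loss computed only on the masked pixels.
    pm = pixel_mask.expand_as(images)
    return F.mse_loss(reconstruction[pm], images[pm])
```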
Down but not out - The revenge of the GANs
Paper: https://arxiv.org/abs/2301.09515
Code: https://github.com/autonomousvision/stylegan-t
Diffusion models have surpassed GANs in the generative domain in recent times. However, even the best diffusion models need many iterative steps to generate a single image, whereas GANs can do it in a single pass; the catch is that GANs have lagged far behind in the quality of their results. StyleGAN-T, the model proposed in this paper, addresses this gap and significantly improves over previous GANs. It actually outperforms distilled diffusion models, the previous state of the art in fast text-to-image synthesis, in terms of both sample quality and speed.
I know ChatGPT did your homework
Paper: https://arxiv.org/abs/2301.10226
Large language models can produce really convincing text. Sometimes that output is harmful or hallucinated, and it can be misused. For example, how do you know whether an article was written by a human or by a language model? This paper takes one of the first steps toward tackling this problem by watermarking machine-generated text. The watermark can be embedded with negligible impact on text quality, yet can be easily detected by an open-source algorithm without access to the model or its parameters. This area will be really important in ensuring the safe use of language models for generative work. A very interesting paper worth reading.
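The core idea, sketched very roughly: before each token is sampled, the previous token seeds a pseudorandom split of the vocabulary into a "green" and a "red" list, and the green tokens' logits get a small boost. A detector that knows the seeding scheme then simply counts green tokens. Below is a toy illustration of this soft watermark, not the authors' code; the vocabulary size and constants are made up:

```python
import numpy as np

VOCAB, GAMMA, DELTA = 1000, 0.5, 2.0  # toy vocab size, green fraction, logit boost

def green_list(prev_token):
    # Seed a PRNG with the previous token and pick GAMMA * |V| "green" token ids.
    rng = np.random.default_rng(int(prev_token))
    return rng.choice(VOCAB, size=int(GAMMA * VOCAB), replace=False)

def watermarked_sample(logits, prev_token, rng):
    boosted = logits.copy()
    boosted[green_list(prev_token)] += DELTA  # softly favor green tokens
    p = np.exp(boosted - boosted.max())
    return rng.choice(VOCAB, p=p / p.sum())

def detect(tokens):
    # z-test: are there more green tokens than chance (a fraction GAMMA) allows?
    hits = sum(t in set(green_list(prev)) for prev, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA))

# Generate a short toy sequence from random stand-in "logits", then detect.
rng = np.random.default_rng(42)
tokens = [0]
for _ in range(200):
    tokens.append(watermarked_sample(rng.normal(size=VOCAB), tokens[-1], rng))
print("z-score:", detect(tokens))  # a large z-score flags the watermark
```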
For contrast, GPTZero, an anti-plagiarism detector developed just prior to this work, has already been shown to be unreliable.