This Week on Gradient Ascent:
Building your own old-school panorama generator 👵
[Consider reading] Get up to speed on NeRFs 📸
[Consider reading] Google makes object detection a language task? 💬
[Check out] Find research papers 10x faster 💡
[Be inspired by] Snake charming a diffusion model 🧑💻
Grandma's still got a job in the age of deep learning:
Note: The code for this week's article can be found here.
A couple of weeks ago, a reader asked me, "Is it still necessary to learn traditional algorithms? Isn't deep learning doing everything nowadays?". Before I could respond, my mind wandered. The question triggered an explosion of thoughts and memories.
Two thousand freakin' twelve. The year when everything changed (read in your best impression of Jony Ive). Deep learning was still in its diapers, listening to anecdotes at uncle ImageNet's knee. Being a precocious kid with a voracious appetite for knowledge, it quickly devoured dataset after dataset. In the following decade, it removed all traces of traditional algorithms faster than the unsuspecting victims of the Red Wedding in Westeros.
Before the Big Bang redux, these algorithms were the talk of the town, sporting mullets and texting on flip phones. They're now relics of the past, relegated to the museum of "once upon a time". Boomers in a TikTok generation. You see, deep learning offloaded a lot of work from the practitioner's hands. With traditional algorithms, one had to hand-engineer things and account for many corner cases. Deep learning uses truckloads of data to do this for you. Also, it's usually ridiculously good at solving the task. Naturally, in this age of big data, deep learning took over.
But these museum artifacts haven't given up just yet. Some of these algorithms (read on below) are used even today because, unlike deep learning methods, they are fast and power-efficient - two super important considerations for practical applications. Don't believe me?
I'll prove it to you through an application you use almost every day. Pick up your fancy smartphone and fire up the camera. Switch to the panorama mode and create one.
Did you know that grandma's secret recipe powers this even today? Step aside, deep learning. Can you smell what grandma is cookin'? (Raises proverbial People's Eyebrow)
The big wide stitch - what's in a panorama?
The hallmark of great technology is when we take it for granted; when it is a seamless (pun intended) experience for us users. I'd be hard-pressed to believe that you've ever had trouble taking panoramas with your smartphone. Yes, there'll be that odd pesky tourist who repeatedly photobombs your masterpiece.
But if you're one of the unfortunate few having trouble beyond this, the general process boils down to these steps: One takes out one's phone. One then opens up the camera and frames a grand vista. Moving with the precision of a ballet dancer on a tightrope, one pirouettes from left to right (or vice versa) and captures the scene. One's phone then issues a merry statement to the tune of "processing" and displays a spinning circle. After a brief delay, one is greeted with the sight of a beautiful panorama. That's all there is to it.
Or is there?
Have you ever wondered why you need to take pictures moving from left to right in sequence? Or why that guide-line on your phone's screen gets angry when you move or shake the phone too rapidly? Or why it takes a while to generate a panorama when all your other regular pictures are instantly captured?
A good panorama is challenging to produce.
First, your phone's camera does a few things every time it captures a picture. It corrects for lighting (auto exposure), sharpness (autofocus), and tint (auto white balance). This is called "3A" in image processing parlance.
There's an easy way to see this in action. The next time you have your camera out, try moving it around and watch the live preview. You'll see the video feed become blurry and then sharp again, changing in color and brightness depending on where you point the camera. That's 3A in action!
We take this for granted. Now here's where it gets hairy for panoramas. As you take pictures moving from one side to another, the phone's camera needs to account for changes in light, focus, and tint. Add to this variations like movement in the scene, camera shake (yes, that's you, butterfingers), and more. If you naively combine all these images into a panorama, you'll end up with artifacts like seams, blur, and warped shapes.
It's a miracle that panoramas turn out as well as they do.
Clearly, there must be an uber-fancy algorithm powering your phone's panorama function. After all, with so many complexities, a simple solution won't work. Right?
Wrong.
Many, if not all, phones use a combination of traditional "old" algorithms to create panoramas. These algorithms offer the benefits of speed (how long you're willing to wait) and power efficiency (how fast your phone's battery drains). When used in concert, they become a formidable opponent to any panorama challenge you throw at them. OK, enough with this preamble. Let's build one ourselves.
Grandma's wide-angle wisdom - broaden your horizon (literally)
To build our own panorama generator, we need to look at it from a first principles perspective.
What goes in? What comes out? A bunch of images goes in. A combination of these images comes out.
How are these images taken? Is there a method to the madness? Yes, the images are taken in sequence - from one side to the other.
Why? To stitch images together, you need to know where one image ends and the next one begins.
Well, why can't I take a bunch of pictures randomly and stitch them together? You can. But, are you willing to wait a while for your phone to figure out which picture goes where? You might also end up with some holes in your panorama if you missed catching a spot or two.
Here are three images I took early one morning at Crater Lake in Oregon.
Intuitively, you can "see" how these images would look stitched together. Here's the resulting panorama from Photoshop.
Looks deceptively simple. Let's build a solution that tries to replicate this.
First, we need to figure out which parts of two adjacent images line up. Put another way, what are the common parts between two images? Next, we need to figure out how to combine them.
Grandma, what's feature matching?
Golden Gate Bridge. What pops into your head when you read that? Two majestic red towers balancing a crimson beam over a body of water.
Teenage Mutant Ninja Turtles. What comes to mind? Cowabunga! Four larger-than-life green turtles with mean martial arts skills. Colored bandanas covering their eyes. Pizza.
Seeing a trend here? Well, these attributes are called features. If someone showed you just these attributes, you'd know what you were looking at. Even if these were shown upside down, sideways, flipped, what have you.
Good features are instantly recognizable. Memorable.
So far so good.
When I showed you the three images above, subconsciously, you found these "distinctive" parts in each image and lined them up. You did it without thinking about it. That process of identifying the same distinctive parts in two images is called… You guessed it. Feature matching.
But, how do I get my phone to do this? Good question. To computers, images are just a collection of numbers on a grid. Every point on the grid is called a pixel. The number associated with each pixel represents the color of that pixel.
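If you'd like to see this for yourself, here's a tiny sketch using OpenCV and NumPy (the filename is a made-up placeholder):

```python
import cv2

# Load an image; OpenCV returns a NumPy array of shape (height, width, 3)
img = cv2.imread("crater_lake_left.jpg")  # hypothetical filename
print(img.shape)   # e.g. (3024, 4032, 3): a grid of pixels, 3 color values each
print(img[0, 0])   # the (blue, green, red) values of the top-left pixel
```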
There are traditional algorithms to identify color, texture, and shape information from these numbers (segmentation, edge detection, histograms, and so on). From this, the computer can be taught to identify distinctive regions, just like us. This process is called feature detection.
Here's an example of how that looks. In the image below, I used a feature detection method called ORB (Oriented FAST and Rotated BRIEF) to identify these points. Each colored circle below is a distinctive part of the image that was found by this method. Notice how the algorithm identified edges, areas with interesting textures, and so on.
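In code, this boils down to a few lines. Here's a minimal sketch using OpenCV's ORB implementation; the filename and the number of features are placeholder choices, not necessarily what was used for the figure above.

```python
import cv2

# Hypothetical filename for one of the Crater Lake shots
img = cv2.imread("crater_lake_left.jpg", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and compute their binary descriptors
orb = cv2.ORB_create(nfeatures=2000)
keypoints, descriptors = orb.detectAndCompute(img, None)

# Draw each keypoint as a circle, roughly like the figure above
vis = cv2.drawKeypoints(
    img, keypoints, None, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS
)
cv2.imwrite("orb_keypoints.jpg", vis)
```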
If we extract ORB features from two adjacent images, we can identify identical regions in these images. Remember, ideally, features are unique and distinctive. But this rarely happens in practice. That's why feature matching is a hard task. It's also why deep learning is so popular. It offloads the feature detection and matching part from our hands and finds great features.
In addition to matching similar points in two images, feature matching algorithms also return a score for each match. This indicates how confident the algorithm is of each match. We can use this score to rate how good a match is and throw away poor ones. Here's what feature matching looks like.
On the left is the first image in our panorama, and on the right is the second. The lines across the two images are the "same" points per the feature matching algorithm. Notice that not all the feature points ORB found made it into the final matches. The matching algorithm only keeps the points that have a high confidence score. Cool, right?
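Here's roughly what that looks like in code: a minimal sketch that matches ORB descriptors with a brute-force matcher. The filenames and the "50 best matches" cutoff are illustrative assumptions, not the exact settings behind the figure.

```python
import cv2

# Hypothetical filenames for two adjacent shots
img1 = cv2.imread("crater_lake_left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("crater_lake_middle.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance (ORB descriptors are binary).
# A smaller distance means a more confident match; cross-checking drops
# matches that don't agree in both directions.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Keep only the most confident matches and draw them side by side
good = matches[:50]
vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("orb_matches.jpg", vis)
```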
RANSAC-ing the kitchen: How grandma cooks panoramas
So far, we've found a way to identify distinctive regions in a pair of images and then match them. Next, we need to use this information to line up the overlapping parts of these two images.
You might be tempted to ask why you couldn't scooch over the first image until it lined up with the second. Two problems with that. First, how do you automatically determine how much to shift the first image until it perfectly overlaps the second?
Second, there's this teeny-tiny thing called distortion.
To fight these tricky fiends, we'll turn to two more classic tools: homography and RANSAC.
Ever had a friend or colleague ask you to shift your perspective - look at things from a new point of view? A homography does that for computers. It's a perspective transformation.
All it needs are at least 4 pairs of matching points between two images, and voila! It can compute where any point from the first image would be found in the second. If we transformed all the points on the first image using homography, we'd be able to accurately place it over the second.
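To make that concrete, here's a minimal sketch of a homography computed from exactly four point pairs. The coordinates are made up purely for illustration; in practice the pairs come from feature matching.

```python
import cv2
import numpy as np

# Four made-up correspondences: where each point of image 1 lands in image 2
pts_img1 = np.float32([[10, 10], [400, 20], [390, 300], [15, 310]])
pts_img2 = np.float32([[120, 15], [510, 30], [500, 310], [125, 320]])

# Exactly four pairs pin down the 3x3 perspective transform
H = cv2.getPerspectiveTransform(pts_img1, pts_img2)

# Now any point in image 1 can be mapped into image 2's frame
point = np.float32([[[200, 150]]])
print(cv2.perspectiveTransform(point, H))
```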
Now if only we had matching pairs of points between the two images... Wait a second. Guess what we just did in the previous steps?
We have both matching points and an algorithm that can transform an image into another perspective. Are we done? Not quite!
Homography only does as well as the input it's given. In other words, the four pairs of points you choose greatly influence the quality of the transformation. How do you find the four best points to give to it?
Enter RANSAC.
RANSAC stands for RANdom SAmple Consensus. It's an algorithm that finds the model that best fits the data, even when the data is noisy. In this context, it works hand in hand with homography. RANSAC repeatedly does the following: randomly choose 4 pairs of matching points, compute the homography from them, and check its quality, i.e., count how many of the remaining matched points are consistent with this estimated homography. After enough rounds, it returns the best homography it found.
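In OpenCV, that whole loop is bundled into a single call. Here's a minimal sketch that picks up kp1, kp2, and good from the matching sketch earlier; the 5-pixel reprojection threshold is a common default I'm assuming, not necessarily what produced the figures.

```python
import cv2
import numpy as np

# Pixel coordinates of the matched keypoints from the matching sketch above
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# findHomography with cv2.RANSAC runs the loop described above: sample 4 pairs,
# fit a homography, count how many other matches agree with it (within 5 px
# here), and keep the best one it finds.
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
print(f"{int(inlier_mask.sum())} of {len(good)} matches are RANSAC inliers")
```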
When you take the first two images in our problem and compute the homography between them, you can get the stitched result below.
To complete our panorama, we simply repeat this process, but use the stitched result above, and the last image in our set. That yields the final panorama below.
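If you're curious what that stitching step looks like, here's a deliberately crude sketch that continues from the earlier ones (H maps points in the left image into the middle image's frame). It warps the middle image into the left image's frame on a wide canvas and pastes the left image on top; a real implementation would size the canvas and blend the overlap far more carefully.

```python
import cv2
import numpy as np

# Color versions of the two shots (hypothetical filenames)
left = cv2.imread("crater_lake_left.jpg")
middle = cv2.imread("crater_lake_middle.jpg")

# H (from the RANSAC sketch) maps left-image points to middle-image points;
# invert it to pull the middle image into the left image's frame instead.
H_inv = np.linalg.inv(H)

# Warp the middle image onto a canvas wide enough for both images,
# then paste the left image over the left portion. No blending: crude but quick.
h, w = left.shape[:2]
canvas = cv2.warpPerspective(middle, H_inv, (w + middle.shape[1], h))
canvas[0:h, 0:w] = left
cv2.imwrite("stitched_pair.jpg", canvas)
```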
Missing pages in grandma's cookbook
Yay! We built our own panorama system, but there are still some issues. You can see a couple of vertical lines in the result. Those are seams from the 3A issue I mentioned an eternity ago. There are also some stretched regions and weird boundary artifacts.
These can be fixed by other algorithms like blending, warping, and cropping. But that's for you to go and discover.
So there you go. You built your own panorama system without using a lick of deep learning. These algorithms that we used are easy to run on phones. But that doesn't mean deep learning won't get to them.
Deep learning hasn't realized its full potential yet. Just as the dinosaurs watched that fateful comet hurtle toward the Earth, the ML of yore watches with bated breath as the deep learning supernova grows. As a kid, I wanted to save my favorite dinosaurs from that explosion. I was too late. As an adult, I want to save my favorite algorithms from extinction. I think I can. Traditional algorithms like the ones we saw above still inspire clever solutions, and they continue to shape research in deep learning. Here's one example of that.
Just like some of grandma's secret recipes are passed down for generations, some algorithms are worth passing down too. It's not enough for an algorithm to be awesome at its job. It also needs to be efficient and stand the test of time in practical scenarios.
So yes, it's still vital to learn traditional algorithms. They're the backbone upon which the pantheon of modern AI has been built.
Resources To Consider:
Nerding out over NeRFs
Paper: https://arxiv.org/abs/2210.00379
If you love NeRFs and want to get up to speed quickly on them, check out this excellent survey paper. It covers a lot of ground on NeRF research over the last two years.
Google's new language interface to detect objects
Paper: https://arxiv.org/abs/2109.10852
Blog: https://ai.googleblog.com/2022/04/pix2seq-new-language-interface-for.html
We typically associate object detection with computer vision models. In this paper, researchers at Google look at object detection as a language modeling task. Yep. That isn't a typo. Their idea is simple. If a neural network knows where and what the objects in an image are, it can simply call them out. By practicing this skill, it can learn useful object representations. Read the blog above for more detail.
Finding new ML papers the smart way
Link: https://arxivxplorer.com/
Tired - Googling for papers to read. Wired - Following good newsletters to find them for you. Inspired - Using OpenAI's embeddings to semantically search for papers. Enter arXiv Xplorer. Find all the relevant papers on a topic of interest using semantic search. This will save you a lot of time!
Mid-U Guidance: A powerful diffusion technique
Code: https://api.wandb.ai/report/johnowhitaker/vn6bbeot
Jonathan Whitaker is a wizard at diffusion models. He recently came up with a new technique to make these models more flexible at generating what you want. Check out his experimental report in the link above.