OpenAI's Dev Day, AI on Phones, Token Visualizers, LLM Advancements and More...
A nifty way to go from cloud compute to pocket computers
This week, we've got a newsletter packed with curated resources and a cheeky illustration of an AI concept. A new deep dive will be out next week. Let's get started!
Squeezing AI into Every Device:
Machine learning models come in all shapes and sizes. However, the modern marvels that have captured our attention aren't small in any sense of the word. Naturally, that limits where these models can be deployed: huge servers. You might argue that the ChatGPT app or the nifty generative AI wallpaper feature runs just fine on the tiny computer in your pocket. I present my counterargument: turn on Airplane mode and try using them[1]. Hiccups? Thought so.
Most of the AI-powered apps we use on mobile devices need Wi-Fi or cellular data to send our data over to huge server farms where the actual "AI" lives. Once these behemoths finish computing results (usually done in a jiffy), they send them back to us.
So does that mean you can't run AI models on smaller, much more limited hardware without internet connectivity? Thankfully, that isn't the case.
Mobile devices use a variety of techniques to run ML models entirely on-device. In fact, I spent eight years of my career getting them to run natively on phones. Today, I'll share one of those techniques with you.
Quantization.
Neural networks are composed of millions of parameters called weights and biases. Each of these parameters is a floating-point number, which typically takes up 32 bits. Quantizing these parameters involves packing the information in a 32-bit container into a smaller one, like 16 bits, 8 bits, or even 4 bits.
Thus, quantizing a neural net significantly decreases its memory footprint and computational requirements. This, in turn, makes it suitable to run directly on devices with limited resources like mobile phones.
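To make this concrete, here's a minimal sketch of one common flavor of the idea, affine int8 quantization, written in NumPy. The function names and the toy weight array are my own, purely for illustration; real frameworks do this per-tensor or per-channel with far more care.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto 8-bit integers with a scale and zero point."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0          # 256 representable int8 values
    zero_point = np.round(-128 - w_min / scale)
    q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Recover approximate float32 values from the quantized representation."""
    return scale * (q.astype(np.float32) - zero_point)

# Toy example: a handful of "weights"
w = np.array([-0.92, -0.11, 0.0, 0.34, 0.87], dtype=np.float32)
q, scale, zp = quantize_int8(w)
print(q)                          # int8 values, 4x smaller than float32
print(dequantize(q, scale, zp))   # close to the originals, but not exact
```

Notice that the round trip through int8 doesn't give back the originals exactly: that small rounding error is precisely the accuracy we trade for the smaller footprint.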
Let's imagine a neural network as a painter. Given a scene, the network has to use the colors it has to paint a version that resembles the original scene as closely as possible.
An unquantized network (also called a full-precision network) can use 32-bit numbers for its parameters, like a painter given a wide palette of colors to use. So, it can capture all the rich information in the scene without losing any of it.
Quantization, then, is the process of reproducing the original scene with a reduced palette. By limiting the network to 16-bit parameters, we halve the data size. Yay! But there's no free lunch. In exchange for reduced memory usage and higher speed, we lose accuracy. If the painter has fewer colors to work with, they're less likely to reproduce the scene exactly. The result is still a recognizable image, but with less nuance in color. These differences become starker when we reduce the precision further to 8 bits, demonstrating a clear tradeoff between complexity and efficiency[2].
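To put rough numbers on it (my own back-of-the-envelope math, not figures from any specific model): a 7-billion-parameter model stored as 32-bit floats needs roughly 28 GB just for its weights. At 16 bits that drops to about 14 GB, at 8 bits to about 7 GB, and at 4 bits to about 3.5 GB, which is the difference between "needs a server" and "might fit on your phone."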
There are generally two categories of quantization techniques:
Post-training Quantization: This involves applying quantization after a model has already been trained. While we don't need to retrain from scratch, this approach trades some accuracy for convenience (see the sketch after this list).
Quantization-aware Training: This technique incorporates quantization into the training process itself. This often leads to better performance compared to post-training quantization but needs a careful hand to ensure that quantization doesn't derail the model's performance.
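Here's a minimal sketch of post-training dynamic quantization in PyTorch, assuming a toy model made of Linear layers. The exact API surface has shifted across PyTorch versions, so treat this as illustrative rather than canonical.

```python
import torch
import torch.nn as nn

# A toy full-precision model standing in for something bigger
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantization: the Linear layers' weights are stored
# as int8, and activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)  # same interface, smaller weights
```

Quantization-aware training takes the other route: it simulates the rounding during training so the model learns to tolerate it, at the cost of a more involved training loop.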
Quantization doesn't always mean a trade-off with quality. With the right technique, models can maintain their performance admirably despite the reduced bit-width. It's all about training the model to make the most of each bit.
To learn more, check out these two papers:
Resources To Consider:
OpenAI Dev Day Breakout Sessions
This YouTube playlist has a few interesting breakout sessions from OpenAI's most recent Dev Day. I'd recommend watching the video explaining techniques to maximize LLM performance.
Florence-2: A Unified Representation for Vision Tasks
Paper: https://arxiv.org/abs/2311.06242
Traditional vision models that rely on transfer learning struggle to perform a diverse range of tasks from simple instructions. That's where Florence-2 comes in. It uses a sequence-to-sequence architecture and text prompts as task instructions to handle a variety of tasks, like captioning, detection, grounding, and segmentation. Definitely check this one out!
Instant3D: Fast Text-to-3D Generation
Paper: https://arxiv.org/abs/2311.06214
Instant3D generates high-quality and diverse 3D assets from text prompts. Using a two-stage approach that first generates four structured views from text and then directly regresses the NeRF from those images, it produces 3D assets in under 20 seconds!
ChatGPT Detector Catches AI-Generated Papers
A specialized ML tool can now catch chemistry papers written by ChatGPT with very high accuracy. While the jury is still out as to how "effective" these detectors are, this tool claims to analyze 20 features of writing style to determine whether a paper was written by a human or a bot. Read this article and let me know what you think.
Zero-Shot Adaptive Prompting of LLMs
Link: https://blog.research.google/2023/11/zero-shot-adaptive-prompting-of-large.html
Generating sample prompts for LLMs isn't always easy. Researchers from Google have introduced an automatic zero-shot prompting method for reasoning problems that carefully selects and constructs pseudo-demonstrations for LLMs using only unlabeled samples. This largely closes the gap between zero-shot and few-shot performance while retaining the generality of the former. They have also introduced Universal Self-Adaptive Prompting (USP), which extends this idea to a wide range of general NLU and NLG tasks.
Token Visualizer
Link: https://tokenwiz.rahul.gs/
This nifty tool allows you to visualize the tokens in a piece of text. It also shows token IDs, which is pretty cool. Like OpenAI's tokenizer page, it lets you see how a piece of text might be tokenized by a language model, along with the total token count.
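If you'd rather poke at tokenization from code, here's a minimal sketch using OpenAI's tiktoken library. The encoding name below is the one used by GPT-3.5/GPT-4-era models; adjust it for whichever model you care about.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Quantization squeezes big models onto small devices."

token_ids = enc.encode(text)
print(token_ids)                              # the token IDs the model sees
print(len(token_ids))                         # total token count for this text
print([enc.decode([t]) for t in token_ids])   # each token as a string
```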
How to Use OpenAI's Vision API
Project Page: https://github.com/roboflow/awesome-openai-vision-api-experiments
This repository is an excellent resource for both beginners and experts, showcasing applications ranging from simple image classifications to advanced zero-shot learning models using OpenAI's Vision API. Give it a shot and build something cool using images, videos, or webcam streams!
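As a taste of what those experiments build on, here's a minimal sketch of a Vision API call with the openai Python client. The image URL is a placeholder, and the model identifier has changed over time, so check OpenAI's docs for the current one.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision model name at the time of writing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```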
[1] Not all models need internet connectivity. Some fall back to an on-device option that uses optimization techniques like the one we're discussing.
[2] The drop in performance isn't always proportional to the magnitude of the reduction in bit width. It depends on the model and on the context in which it's being used.