GPT gets Chatty - The Future of Artificial Conversation
This Week on Gradient Ascent:
Let's chat about GPT 🗣️
Your weekly machine learning doodle 🎨
[Use] Language models to search tweets 🐦
[Check out] A singing stable diffusion? 🎶
[Consider reading] What does a Vision transformer see? 📜
GPT 3.5: The chatty Cathy:
Ladies and gentlemen, I must say, I am positively gobsmacked by ChatGPT, the latest iteration of the good old GPT we all know and love.
GPT stands for 'Generative Pre-training Transformer'. It's a machine learning model that generates human-like text, allowing it to perform a variety of tasks such as translation, summarization, and even writing fiction.
Thanks for reading Gradient Ascent! Subscribe for free to receive new posts and support my work.
But ChatGPT takes things to the next level, my dear reader. It's a chatbot with personality! Yes, you heard me right. ChatGPT can hold a conversation with you, cracking jokes and quipping with the wit and charm of the finest butler.
So, what sets ChatGPT apart? Well, it's all in the training. While GPT is trained using supervised learning, ChatGPT takes things a step further by using reinforcement learning with human feedback. This means that ChatGPT can learn from its interactions with humans and improve its responses over time.
But don't just take my word for it, give ChatGPT a try and see for yourself. I have a feeling you'll be as smitten with it as I am.
The keen-eyed reader that you are, you might have sensed something amiss in the paragraphs above. If you did, well then, kudos to you. Every single word in the paragraphs above was generated by ChatGPT.
To understand the full story, look through the pictures in this gallery:
On a whim, I spent some time prompting ChatGPT to write a newsletter draft (This one actually!) and had some interesting insights. I placed some additional constraints on it - 1) write in the style of my favorite author, P.G. Wodehouse, and 2) write with humor, wit, and banter.
The initial draft was decent with a lot of Wodehousian mannerisms sprinkled throughout the text. But, the article lacked technical detail. So, I asked it to include more technicalities but not lose the sense of humor in the first draft. The result was better, but I wasn't satisfied. In my final follow-up prompt, I specified the details that I wanted it to cover like Reinforcement Learning with Human Feedback (RLHF) and supervised training on dialogue. Its response to this prompt is the collection of paragraphs you just read.
Although it has some limitations (True creativity being one among them), the big takeaway here is that ChatGPT is a quantum leap in conversational AI. It learns from conversation by leveraging feedback.
How did ChatGPT get this good? To understand, let's look under the hood.
When it first made waves, GPT-3 (Generative Pre-trained Transformer) showcased two standout skills:
Generating solutions for a new case given a few examples of a task
Generating language or completing partially written sentences
These were possible primarily due to the size of this model (175 billion parameters!) and the data it was trained on (The collective knowledge contained on the Internet).
Within a few months of OpenAI releasing an API powered by this model in 2020, there was a proliferation of apps and services in the wild that used this API to provide "AI-powered" search, conversation, and text completion.
But there was a glaring issue - the model could be coaxed into generating outputs that were untruthful, toxic, or harmful.
Why is that? It partially boils down to how the model was trained. GPT is an auto-regressive model trained to blindly predict the next word in a sentence (for more details, check out Issue #9 on large language models here). It wasn't trained to tell "good or right" from "bad or wrong", and it wasn't aligned with users and their needs. The solution? Learning from feedback.
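To make that next-word idea concrete, here's a minimal sketch of auto-regressive generation. The bigram table and greedy decoding are stand-ins of my own invention; the real model is a transformer predicting over tens of thousands of tokens:

```python
# A toy sketch of auto-regressive generation: at each step, pick the
# most likely next word given the words so far. A hypothetical bigram
# table stands in for the transformer.

BIGRAM_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def generate(prompt_word, steps):
    words = [prompt_word]
    for _ in range(steps):
        options = BIGRAM_PROBS.get(words[-1])
        if not options:
            break  # no known continuation for this word
        # Greedy decoding: always take the highest-probability next word
        words.append(max(options, key=options.get))
    return " ".join(words)

print(generate("the", 3))  # -> "the cat sat down"
```

Notice there's nothing in this loop that knows true from false or kind from cruel - it just continues the text. That's the alignment gap in miniature.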
But how do you give continual feedback to such a model? To address this, researchers at OpenAI turned to another subfield in AI, reinforcement learning, and used the power of Reinforcement Learning with Human Feedback (RLHF).
What in the world is Reinforcement Learning with Human Feedback?
Reinforcement learning is a way for an AI (also called the agent) to learn how to do something by trying out different things and seeing what works best.
Initially, it doesn't know how to solve the task at hand, and so tries random things to see what happens. If it does something that gets it closer to solving the task, it gets a reward. Over time, it learns which actions get more rewards and move it closer to solving the task. Eventually, it figures out how to solve the task (sometimes better than humans!).
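That trial-and-error loop can be sketched as a toy two-armed bandit - nothing ChatGPT-sized, and the arm names and payout probabilities are made up for illustration:

```python
import random

# A minimal trial-and-error loop: a two-armed bandit. The agent doesn't
# know which arm pays off more; it tries both, tracks the average reward
# of each arm, and gradually favors the better one.

random.seed(0)
TRUE_PAYOUT = {"left": 0.2, "right": 0.8}  # hidden from the agent

counts = {"left": 0, "right": 0}
values = {"left": 0.0, "right": 0.0}  # running average reward per arm

for _ in range(500):
    # Explore 10% of the time; otherwise exploit the best-known arm
    if random.random() < 0.1:
        arm = random.choice(["left", "right"])
    else:
        arm = max(values, key=values.get)
    reward = 1.0 if random.random() < TRUE_PAYOUT[arm] else 0.0
    counts[arm] += 1
    # Incremental mean: V <- V + (reward - V) / n
    values[arm] += (reward - values[arm]) / counts[arm]

print(max(values, key=values.get))
```

With this seed, the agent should settle on "right", the arm with the higher payout - no one ever told it which arm was better; it figured that out from rewards alone.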
Have you played the 2007 hit game, Portal? You aren't given much in the way of information and have to figure out how to solve puzzles (frustrating as they might be). You try a bunch of stuff. Some work. Many more fail. Rinse and repeat until you get the cake. That's kind of how reinforcement learning feels.
Reinforcement Learning from Human Feedback (RLHF) is a twist on this that uses small amounts of human feedback to solve problems. So instead of being left to its own devices, the agent periodically shows its work to a human evaluator. The evaluator then chooses which action is closer to solving the task. The AI then uses this feedback to improve itself.
So to quickly recap - GPT-3 blindly followed prompts and researchers needed to figure out how to give it feedback so that it would be more receptive to instructions.
InstructGPT - Better but still lacking
Using RLHF, researchers were able to incorporate feedback to make GPT safer and more helpful. Human annotators took prompts submitted by customers to the OpenAI API and demonstrated the desired behavior for each prompt. They also ranked several outputs from the GPT model from best to worst.
Think of how you'd onboard a new employee. You'd show them some examples and the expected outcome. Your new colleague would learn from your demonstrations, and also from how you rate their work. It's kind of a similar case here.
So, these demonstrations and rankings were used to fine-tune GPT. The result? A new and better model, InstructGPT.
Despite being over 100x smaller than GPT-3, InstructGPT was better at following instructions. It made up facts less often and, more importantly, was far less prone to generating toxic output (see the image below).
But if InstructGPT already leveraged feedback, how is ChatGPT so much better?
Putting all the pieces together - Let's chat, GPT
A few key changes to the training process made a world of difference. First, researchers provided conversations in which they played both sides - the user and the AI assistant. These two-sided dialogues were used to fine-tune the model, whereas prior efforts had fine-tuned only on one-sided prompt-and-response data.
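To picture what that two-sided data might look like, here's a hypothetical sketch; the role tags and the "role: text" serialization are my own stand-ins, not OpenAI's actual training format:

```python
# A hypothetical sketch of two-sided conversation data for supervised
# fine-tuning, with researchers playing both the user and the assistant.

conversation = [
    ("user", "What does GPT stand for?"),
    ("assistant", "Generative Pre-trained Transformer."),
    ("user", "And ChatGPT?"),
    ("assistant", "The same model family, fine-tuned for dialogue."),
]

def to_training_text(turns):
    # Flatten the dialogue into one transcript; the model learns to
    # continue it, i.e. to play the assistant's side of the chat.
    return "\n".join(f"{role}: {text}" for role, text in turns)

print(to_training_text(conversation))
```

Fine-tuning on transcripts like this is how the model learns the back-and-forth rhythm of conversation rather than one-shot completion.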
Next, to create a reward model (to decide how much the agent is to be rewarded) for reinforcement learning, researchers used ranked comparison data. Given a bunch of conversations with the chatbot, they randomly selected a model-written response and sampled several alternate completions. Human labelers then ranked these responses from best to worst. These rankings were used to train the reward model.
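The objective behind such a reward model can be sketched with a pairwise ranking loss, the standard trick for learning from comparisons; the scalar scores below are hypothetical, standing in for the outputs of a GPT-sized model with a scalar head:

```python
import math

# Sketch of a reward-model objective for ranked comparisons: when humans
# prefer response A over response B, the loss -log(sigmoid(rA - rB))
# pushes the model to score A above B.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen, r_rejected):
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss is small when the preferred response already scores higher,
# and large when the ranking is violated:
print(round(pairwise_loss(2.0, 0.0), 4))  # -> 0.1269
print(round(pairwise_loss(0.0, 2.0), 4))  # -> 2.1269
```

A full ranking of K responses just decomposes into all its pairwise comparisons, each trained with this same loss.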
At this point, we have two pieces of the puzzle:
A language model that can generate text (trained on two-sided conversations)
A reward model trained to score how favorably humans would rate any given text
The final piece is making the language model "listen" to feedback. Using a reinforcement learning approach called Proximal Policy Optimization or PPO (nothing to do with health insurance), researchers used the reward model to train the GPT model further.
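Here's a heavily simplified sketch of that final step. Real PPO uses clipped policy-gradient updates over the transformer's parameters (plus a penalty for drifting too far from the supervised model); this toy version just shifts probability toward responses the reward model scores highly, and every name in it is hypothetical:

```python
import math
import random

# A heavily simplified sketch of reward-guided fine-tuning. The canned
# responses, the "policy" (one probability per response), and the
# hard-coded reward model are all toy stand-ins.

random.seed(0)
RESPONSES = ["helpful answer", "rude answer", "off-topic answer"]
policy = {r: 1 / len(RESPONSES) for r in RESPONSES}

def reward_model(response):
    # Stand-in for the learned reward model from the previous step.
    return {"helpful answer": 1.0,
            "rude answer": -1.0,
            "off-topic answer": -0.5}[response]

for _ in range(50):
    # Sample a response from the current policy...
    sampled = random.choices(RESPONSES, weights=list(policy.values()))[0]
    # ...then scale its probability by its reward (multiplicative update)
    policy[sampled] *= math.exp(0.1 * reward_model(sampled))
    total = sum(policy.values())
    policy = {k: v / total for k, v in policy.items()}  # renormalize

print(max(policy, key=policy.get))
```

After a few dozen updates, the probability mass concentrates on the response the reward model likes best - the generator has "listened" to the feedback baked into the reward model.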
Here's the kicker - The reward model was trained to learn how humans give feedback. So in a sense, the reward model acts like a human in the loop to give feedback. This makes ChatGPT much more powerful than its predecessors. Below is a flowchart summarizing the ChatGPT training process:
If you're curious to learn more about the workings of ChatGPT, I highly recommend these two resources:
The evolution of GPT:
This article dives into technical detail and is extremely well-researched. It covers the chronology of the GPT models and how they came to be.
Reinforcement learning with human feedback is a core component in ChatGPT's training. This article covers this in great depth and is illustrated as well.
I know this is a lot to take in, but I hope you found this useful in understanding how this amazing piece of tech works. What's your favorite way to use ChatGPT?
Poorly Drawn Machine Learning:
Learn more about gradient clipping here
Resources To Consider:
Searching tweets with Language Models
Perplexity's Bird SQL is a nifty tool that allows you to search tweets by leveraging the power of large language models. It's a great way to find the tweets that you actually want to read and cut through all the other noise. Check it out and you'll be pleasantly surprised!
Stable Diffusion generates images of music?
Ever had your mind blown by an idea and then had it blown again by the execution of the idea? This is definitely one of those!
A spectrogram is, in oversimplified terms, a visualization of sound. Basically, an image. Now, what's really good at generating images? Stable Diffusion. Can we have it generate spectrograms based on the prompts we give it? Sure, why not.
Now, do those spectrograms actually sound like the music described in the prompt? Heck yes! Definitely play with this amazing piece of research yourself. Your mind will be blown. I guarantee it.
What do Vision Transformers look at?
In this new paper, researchers find that Vision Transformers (ViTs), like their CNN counterparts, progress from detecting edges to using those features to identify objects. Compared to CNNs, though, ViTs make better use of background information and rely less on high-frequency attributes. This is a wonderful paper to read to understand why ViTs work the way they do. The paper and code are linked above.