CLIP Matchmaking, LLM-agents, Longer LLaMAs, MiniGPT-v2 and more...

PDML returns

Oct 17, 2023

Keeping up with AI feels like chasing a bullet train on foot, doesn't it? That's why I'm shaking things up here at Gradient Ascent.

From now on, we're going to alternate between two formats:

Deep Dives: These are comprehensive explorations of pivotal AI research presented in the signature Gradient Ascent style you know and love.
Info-Snack Editions: In these, you'll find a treasure trove packed with quick, bite-sized servings of curated resources and fun visual explainers on hot AI topics.

Why the new setup? Two reasons:

In-Depth Coverage: I'll unravel the Gordian knots of the most important research papers so you don't have to. Each deep dive will distill complex topics into understandable insights, ensuring you deepen your expertise.
Staying Current: These updates are your finely tuned radar for the meteoric comings and goings in the AI cosmos. You'll never miss the launch of a significant paper, tool, or concept.

Curious? Excited? Skeptical? I'd love to hear what you think and what you're eager to see in upcoming editions. There's a lot to cover, so let's get started!

This Week on Gradient Ascent:

Poorly Drawn Machine Learning: CLIP Matchmaking 🎨
[Check out] Long context LLMs ✍️
[Consider reading] Decomposing LLMs into Understandable Components 🤓
[Check out] Kaggle's AI Report 2023 📄
[Check out] AutoGen: Building LLM Applications using Multi-Agents 🕵️🕵️
[Consider reading] Diffusion with Forward Models 🖌️
[Consider reading] Meta-CoT: Chain-of-Thought Prompting with LLMs 💭
[Check out] MiniGPT-v2 🦖

Poorly Drawn Machine Learning:

Contrastive Language–Image Pretraining, or CLIP as it's called in the streets, is a seminal model from OpenAI that elegantly matches text and images. Unlike traditional models specializing in text or images alone, CLIP brings both worlds under one umbrella.

CLIP employs a joint learning strategy, using both text and images as its training set. This enables the model to make intelligent associations between these disparate forms of data. The secret to its success lies in what's known as a contrastive loss function. This function pushes semantically similar text and images closer together in a multi-modal space while separating dissimilar ones. For example, an image of an ice cream cone and the text "ice cream with a cherry" would be pushed together, while the same image and "that dog needs a bath" would be moved farther apart.
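To make the "push together, pull apart" idea concrete, here is a minimal NumPy sketch of a symmetric contrastive loss in the spirit of CLIP. It is an illustration under stated assumptions, not OpenAI's implementation: the embeddings are hand-made toy vectors standing in for encoder outputs, the function names are made up for this example, and the temperature of 0.07 is just an assumed fixed value.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a true pair.
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # pairwise similarities
    diag = np.arange(logits.shape[0])
    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy 2-D embeddings (hypothetical): image 0 ~ "ice cream", image 1 ~ "dog".
images = np.array([[1.0, 0.1], [0.1, 1.0]])
texts = np.array([[0.9, 0.2], [0.2, 0.9]])

loss_matched = clip_style_loss(images, texts)        # captions line up
loss_swapped = clip_style_loss(images, texts[::-1])  # captions swapped
```

With these toy vectors, the loss for the correctly paired batch comes out lower than for the swapped one, which is exactly the signal that training follows.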

What makes this model powerful is a robust training dataset of nearly 400 million image-text pairs, encompassing a wide range of images and associated text descriptions. Through contrastive training, CLIP learns not only to match these pairs correctly but also to weed out incorrect associations.

CLIP is versatile because it doesn't require task-specific adjustments to generalize its learning to new scenarios. While many models require specific tuning for individual tasks, CLIP is flexible enough to interpret abstract terms, recognize objects in images, and pick the best-matching description for a visual from a set of candidates, all without additional fine-tuning.
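In practice, using a CLIP-style model without fine-tuning amounts to embedding an image and a handful of candidate texts, then ranking the texts by similarity. The sketch below shows only that ranking step. The vectors are invented placeholders for what the image and text encoders would return, and `rank_captions` is a hypothetical helper, not part of any library.

```python
import numpy as np

def rank_captions(image_emb, caption_embs, captions):
    # Cosine similarity between one image embedding and each caption embedding.
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = caption_embs @ image_emb
    order = np.argsort(-scores)  # highest similarity first
    return [(captions[i], float(scores[i])) for i in order]

captions = ["ice cream with a cherry", "that dog needs a bath"]

# Placeholder embeddings (assumed); real ones would come from the encoders.
caption_embs = np.array([[0.9, 0.1, 0.2],
                         [0.1, 0.8, 0.3]])
image_emb = np.array([1.0, 0.0, 0.1])  # stands in for the ice cream photo

ranking = rank_captions(image_emb, caption_embs, captions)
best_caption = ranking[0][0]
```

Swapping the caption list for class names such as "a photo of a cat" turns the same ranking into a zero-shot classifier.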

In summary, CLIP transcends traditional barriers between text and visual data, offering a unified, adaptable model that promises broad applicability across diverse tasks and challenges. You can read more about it here.

Resources To Consider:

Long Context LLMs to Play With

LongLLaMA: https://github.com/CStanKonrad/long_llama

LongLLaMA is a large language model designed to handle exceptionally long text contexts—up to 256k tokens. Built on the foundation of OpenLLaMA and fine-tuned using the Focused Transformer method, it also has a code variant based on Code Llama. A smaller 3B version of the model is available under an Apache 2.0 license, and it's compatible with existing implementations for shorter text lengths up to 2048 tokens. The model and its inference code are also available on Hugging Face.

LongLoRA paper: https://arxiv.org/abs/2309.12307

LongLoRA code: https://github.com/dvlab-research/LongLoRA

LongLoRA is an efficient fine-tuning approach that extends the context window of pretrained LLMs without a huge compute budget. It pairs LoRA with a cheaper, shifted form of sparse attention during training, which saves time and memory while holding up well across tasks. The repository even comes with its own dataset, LongQA, designed to help fine-tune models for these longer text lengths.

Decomposing LLMs into Understandable Components

Link: https://www.anthropic.com/index/decomposing-language-models-into-understandable-components

Peer into the black box of neural networks with Anthropic's new approach! The article delves into how "features," not individual neurons, are the key to understanding complex language models. By identifying these features, researchers can dissect and steer models with newfound precision. This marks a seismic shift from a scientific hurdle to an engineering challenge, paving the way for hopefully safer and more reliable AI systems. This is a must-read for anyone invested in the future of AI safety and interpretability.

Kaggle's 2023 AI Report

Link: https://www.kaggle.com/AI-Report-2023

Kaggle's annual AI report is out, and it's chock-full of insights. The report is a collection of essays written and submitted to Kaggle as part of a contest, organized into seven sections covering various areas within AI and ML. This is a fantastic resource for understanding the state of play in the AI landscape, and one I highly recommend reading.

Building LLM Applications using AutoGen

Paper: https://arxiv.org/abs/2308.08155

Code: https://github.com/microsoft/autogen

AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.
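To give a feel for the "agents that converse to solve a task" pattern, here is a toy sketch in plain Python. It deliberately does not use AutoGen's API. `ToyAgent`, `run_chat`, and the two reply functions are hypothetical stand-ins: one plays the role of an LLM-backed assistant, the other a human or tool that checks the answer and ends the chat.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class ToyAgent:
    name: str
    reply_fn: Callable[[str], str]  # stand-in for an LLM, a tool, or a human
    history: List[Tuple[str, str]] = field(default_factory=list)

    def reply(self, sender: str, message: str) -> str:
        # Record what was received, then produce a response.
        self.history.append((sender, message))
        return self.reply_fn(message)

def run_chat(first: ToyAgent, second: ToyAgent, opening: str, max_turns: int = 6):
    # Agents alternate turns until one says TERMINATE or the turn budget runs out.
    transcript = [(first.name, opening)]
    speaker, listener, message = first, second, opening
    for _ in range(max_turns):
        message = listener.reply(speaker.name, message)
        transcript.append((listener.name, message))
        if message.endswith("TERMINATE"):
            break
        speaker, listener = listener, speaker
    return transcript

def solver(message: str) -> str:
    # Pretend LLM: only knows one answer.
    return "The answer is 5." if "add 2 and 3" in message.lower() else "I am not sure."

def checker(message: str) -> str:
    # Pretend human or tool: verifies the result and ends the conversation.
    return "Looks right. TERMINATE" if "5" in message else "Please try again."

user_proxy = ToyAgent("user_proxy", checker)
assistant = ToyAgent("assistant", solver)
transcript = run_chat(user_proxy, assistant, "Please add 2 and 3.")
```

The real framework layers LLM calls, tool execution, and human input on top of this kind of turn-taking loop.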

Diffusion with Forward Models

Project Page (Paper & Code): https://diffusion-with-forward-models.github.io

This paper introduces a new class of conditional denoising diffusion models that can learn from partial observations rather than requiring complete ground-truth samples. This approach is integrated directly into the denoising process and is shown to be effective on challenging computer vision tasks, such as generating 3D scenes from single 2D images. This is a fun read for anyone interested in advancing the capabilities of generative models in complex, data-scarce scenarios.

Meta-CoT: Chain-of-Thought Prompting with LLMs

Paper: https://arxiv.org/abs/2310.06692

Code: https://github.com/Anni-Zou/Meta-CoT

This paper introduces a novel approach to "chain-of-thought" (CoT) prompting, overcoming the limitations of current methods that are either too general or too task-specific. Meta-CoT automatically categorizes scenarios and constructs demonstrations, achieving state-of-the-art results on ten benchmark tasks and showing robust performance on unseen datasets.

MiniGPT-v2: A Unified Interface for Vision-Language Learning

Project Page (Paper, Code, & Demo): https://minigpt-v2.github.io

MiniGPT-v2 is a unified interface capable of handling various tasks, from image description to visual question answering. By using unique identifiers for different tasks during training, it not only distinguishes tasks effortlessly but also learns each task more efficiently. With strong performance on multiple benchmarks, this is a fun read for anyone looking to push the boundaries of what vision-language models can do. Watch the video below to see what it can do!
