Benchmarking LLMs, Mixture-of-Experts explained, Choosing the Right LLM, Gemini, and more...
Plus contest winners announced
This week, I have a treasure trove of curated resources. A special deep dive will be out next week.
Announcing the Winners!
Thanks to everyone who kindly filled out the survey over these past two weeks. I'm delighted to announce the winners today.
Congratulations to M.J Welwood, Michael Dean, Reza Sugiarto, Vibhas Gejji, Arunachalam V, Grayson, Dirk Harms-Merbitz, Amar Rama, Abhimanyu Hans, and C.Walters!
I'll contact each of you later this week with invitations to the 1:1 strategy session using the emails provided in the survey.
Resources To Consider:
Choosing the Right LLM
Link: https://community.aws/posts/how-to-choose-your-llm
This opinion piece provides a nice framework for evaluating and choosing the right LLM for your projects. If you're building something with language models, give it a read to see if you can glean some tips.
A Deep Dive into MoEs
Link: https://huggingface.co/blog/moe
With Mistral's latest model leveraging the mixture-of-experts technique, it's an excellent time to dig into this detailed explainer from Hugging Face. I highly recommend setting aside a few cups of coffee and working through it.
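If you want a feel for the core mechanism before reading, here is a minimal sketch of top-k expert routing, assuming a toy PyTorch setup; the dimensions, layer names, and dense routing loop are my own illustration, not Mistral's or Hugging Face's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy sparse mixture-of-experts layer: a router picks the top-k experts per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        logits = self.router(x)                       # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)    # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)          # normalize their gate scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # dense loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([10, 64])
```

The point to notice is that each token only pays for k experts' worth of compute, even though the layer holds many more parameters.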
Language Models Meet World Models
Link: https://sites.google.com/view/neurips2023law/home
NeurIPS 2023 is underway, and this tutorial focuses on integrating language models (LMs) with world models (WMs). LMs, while successful in many language tasks, often lack robust world knowledge, limiting their ability to perform complex reasoning and planning tasks. Integrating LMs with WMs, traditionally studied in reinforcement learning and robotics, offers opportunities for enhanced reasoning and planning capabilities. The tutorial covers the limitations of LMs, the background of WMs, and how LMs can utilize or learn WMs for improved task performance. I think videos will be up on this link later, but check out the slides for now.
LLMs and Autonomous Driving?
Paper: https://arxiv.org/abs/2312.06351
This paper introduces a novel multimodal LLM architecture that merges vectorized numeric modalities with pre-trained LLMs to enhance context understanding in driving situations. The authors also build a dataset of 160k QA pairs derived from 10k driving scenarios, designed to train LLMs to interpret and make decisions in complex driving conditions. The study addresses challenges in autonomous systems such as generalization and interpretability, and proposes new evaluation methods for Driving QA performance.
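The exact architecture is in the paper; as a rough mental model only (my own simplification, with hypothetical names and dimensions), the vectorized-modality idea can be pictured as projecting a numeric driving-state vector into a few soft tokens in the LLM's embedding space:

```python
import torch
import torch.nn as nn

class VectorPrompt(nn.Module):
    """Illustrative adapter: map a numeric driving-state vector to a few 'soft tokens'
    in the LLM's embedding space (dimensions and names are hypothetical)."""
    def __init__(self, state_dim=32, d_model=512, n_tokens=4):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.proj = nn.Sequential(
            nn.Linear(state_dim, d_model), nn.GELU(), nn.Linear(d_model, n_tokens * d_model)
        )

    def forward(self, state):                    # state: (batch, state_dim)
        soft = self.proj(state)                  # (batch, n_tokens * d_model)
        return soft.view(-1, self.n_tokens, self.d_model)

# Usage sketch: prepend the soft tokens to the question's token embeddings.
adapter = VectorPrompt()
state = torch.randn(2, 32)                       # e.g. ego speed, nearby agents, route info
question_emb = torch.randn(2, 20, 512)           # embeddings of a driving question
llm_input = torch.cat([adapter(state), question_emb], dim=1)
print(llm_input.shape)                           # torch.Size([2, 24, 512])
```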
Photorealistic Video Generation with Diffusion Models
Paper: https://arxiv.org/abs/2312.06662
Researchers have developed W.A.L.T, a transformer-based diffusion model for photorealistic video generation. Using a causal encoder to compress images and videos into a unified latent space and a window attention architecture for efficient training and memory usage, W.A.L.T generates high-resolution videos via a base latent video diffusion model followed by two super-resolution stages. Notably, the model excels on class-conditional video generation benchmarks without classifier-free guidance, showcasing both quality and efficiency.
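The window attention piece is the easiest part to picture in code. Here is a toy sketch, not the paper's implementation, of restricting self-attention to non-overlapping windows of a flattened latent sequence, which is what keeps compute and memory manageable:

```python
import torch
import torch.nn as nn

# Fresh random weights for illustration; in a real model this would be a trained layer.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def windowed_self_attention(x, window=16):
    """Toy local attention: split the sequence into non-overlapping windows and
    attend only within each window instead of across the full sequence."""
    b, t, d = x.shape
    assert t % window == 0, "sequence length must be divisible by the window size"
    xw = x.reshape(b * (t // window), window, d)   # (batch * n_windows, window, d)
    out, _ = attn(xw, xw, xw)                      # full attention, but only inside each window
    return out.reshape(b, t, d)

latents = torch.randn(2, 256, 64)                  # stand-in for flattened spatiotemporal latents
print(windowed_self_attention(latents).shape)      # torch.Size([2, 256, 64])
```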
Nuvo: Neural UV Mapping for Unruly 3D Representations
Project Page: https://pratulsrinivasan.github.io/nuvo/
Nuvo introduces a novel UV mapping method optimized for the complex geometries produced by advanced 3D reconstruction and generation techniques. It overcomes the limitations of traditional UV mapping algorithms by employing a neural field to create continuous, well-behaved UV mappings. The approach operates only on the points visible in a scene, ensuring that the mappings are valid and detailed. Nuvo enables editable UV mappings that accurately represent intricate appearances, making it a significant advancement in 3D rendering and modeling.
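To make "a neural field for UV mapping" concrete, here is a toy version of the general idea, not Nuvo's actual parameterization or objectives: a small MLP that maps visible 3D surface points to 2D texture coordinates, which the method would then optimize to be continuous and well-behaved.

```python
import torch
import torch.nn as nn

class ToyUVField(nn.Module):
    """Illustrative neural field: maps 3D surface points to UV coordinates in [0, 1]^2."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Sigmoid(),    # keep UVs inside the unit square
        )

    def forward(self, points):                     # points: (n, 3) visible surface samples
        return self.mlp(points)

uv_field = ToyUVField()
points = torch.rand(1024, 3)                       # stand-in for points sampled on the scene surface
uv = uv_field(points)                              # (1024, 2) texture coordinates
print(uv.shape, float(uv.min()), float(uv.max()))
```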
🧨 Benchmarking LLMs in the Wild
Project Page: https://chat.lmsys.org/
Chatbot Arena lets you pose questions to two language models side by side and vote for the better response. Using over 100k crowd-sourced votes, it computes an Elo rating for each model, and models are ranked on a public leaderboard using this rating alongside two other benchmarks. It's a great initiative for seeing in real time where various LLMs stand against each other.
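The Elo mechanics behind the ranking fit in a few lines. Here is a minimal sketch of how pairwise votes become ratings; the leaderboard's actual computation differs in details such as the K-factor and how ties and vote ordering are handled:

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, a_wins, k=32):
    """Update both ratings after one vote (a_wins is 1.0, 0.0, or 0.5 for a tie)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (a_wins - e_a), r_b + k * ((1 - a_wins) - (1 - e_a))

# Toy example: two models start at 1000; model A wins three straight votes.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
for _ in range(3):
    ratings["model-a"], ratings["model-b"] = update(ratings["model-a"], ratings["model-b"], 1.0)
print(ratings)  # model-a climbs above 1000, model-b drops symmetrically
```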
Injecting the 3D World into LLMs
Paper: https://arxiv.org/abs/2307.12981
The paper introduces 3D-LLMs, a new model category that integrates the 3D physical world with large language models. Unlike traditional LLMs and VLMs, 3D-LLMs can process 3D point clouds and their features, enabling them to perform a wide range of 3D-related tasks such as captioning, 3D question answering, and 3D-assisted dialogue. The paper also presents a method for collecting over 300k 3D-language data pairs using novel prompting mechanisms. These models are trained using 2D VLMs as backbones and a 3D feature extractor, enhanced by a 3D localization mechanism that better captures spatial information. Experiments show that 3D-LLMs significantly outperform existing models on tasks like 3D captioning and 3D-assisted dialogue, pushing the boundaries of what LLMs can achieve.
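As a rough, hypothetical sketch of the pattern described (my own illustration, not the authors' code): per-point features from a 3D extractor are tagged with a positional encoding of their xyz location and projected into the token space of the VLM backbone, so the language model can attend over them like extra tokens.

```python
import torch
import torch.nn as nn

def xyz_positional_encoding(xyz, n_freqs=4):
    """Sinusoidal encoding of 3D coordinates, a common way to inject location information."""
    freqs = (2.0 ** torch.arange(n_freqs)) * torch.pi       # (n_freqs,)
    angles = xyz[..., None] * freqs                          # (n, 3, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)  # (n, 3 * 2 * n_freqs)

class Toy3DPrefix(nn.Module):
    """Illustrative: fuse point features with their location encoding and project to token space."""
    def __init__(self, feat_dim=256, d_model=512, n_freqs=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 3 * 2 * n_freqs, d_model)

    def forward(self, point_feats, xyz):                     # (n, feat_dim), (n, 3)
        fused = torch.cat([point_feats, xyz_positional_encoding(xyz)], dim=-1)
        return self.proj(fused)                              # (n, d_model): treat as extra tokens

prefix = Toy3DPrefix()
tokens_3d = prefix(torch.randn(100, 256), torch.rand(100, 3))
print(tokens_3d.shape)                                       # torch.Size([100, 512])
```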
LLM360: Towards Fully Transparent Open-Source LLMs
Paper: https://arxiv.org/abs/2312.06550
The authors of this paper advocate for greater transparency in the development of large language models (LLMs) by promoting the open-sourcing of detailed training processes. They address challenges in LLM research such as data provenance, reproducibility, and collaboration barriers. LLM360 introduces fully transparent, open-sourced LLMs like AMBER and CRYSTALCODER, complete with training materials and analyses. This initiative marks an important step towards fostering innovation and replicability in LLM research by emphasizing complete openness, from model weights to training code.
Gemini: Google's Newest Multimodal Model
Technical Report: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
Gemini, Google DeepMind's latest creation, is a multimodal model, meaning it can process and understand various types of information, including text, code, audio, images, and video. It comes in three versions: Gemini Ultra for highly complex tasks, Gemini Pro for a wide range of tasks, and Gemini Nano for on-device tasks. According to the report, Gemini outperforms existing models on many benchmarks, demonstrating advanced capabilities in understanding and reasoning across different domains. Check out the technical report for more details.