AI Distillation #4
How to build LLMs, SAM for 3D points, Observing AI agents for better learning, and more...
Here are some excellent resources I found over the past week, covering various CV, NLP, and optimization topics.
Spann3R: 3D Reconstruction with Spatial Memory
Spann3R is a method for dense 3D reconstruction from ordered or unordered image collections. Building on DUSt3R, it uses a transformer to regress per-image pointmaps directly in a global coordinate system, without known camera parameters and without optimization-based global alignment. The key idea is an external spatial memory that keeps track of previously seen 3D information, which the model queries as each new frame arrives. Check out the project page for more.
CogVLM2-Video: A Temporally-Aware Video Understanding Model
CogVLM2 is a new generation of visual language models for image and video understanding, comprising CogVLM2, CogVLM2-Video, and GLM-4V. CogVLM2-Video, the video model, integrates multi-frame input with temporal grounding, and the authors report state-of-the-art results on multiple benchmarks. The models are open source, so check them out for your work.
ReconX: Reconstruct Any Scene from Sparse Views
ReconX is a method for 3D scene reconstruction from sparse input views. It reframes the ambiguous reconstruction problem as a temporal generation task: a video diffusion model, guided by 3D structure estimated from the input views, synthesizes additional 3D-consistent frames, and the scene is then recovered from the generated video.
SAM2Point: Segment Any 3D as Videos
SAM2Point adapts the Segment Anything Model 2 (SAM 2) for 3D segmentation in a zero-shot and promptable manner. It interprets 3D data as a series of multi-directional videos, allowing SAM 2 to perform 3D segmentation without further training or 2D-3D projection. SAM2Point supports various prompt types and generalizes across different 3D datasets.
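To make the "3D as videos" idea concrete, here is a minimal sketch in plain Python. It is not SAM2Point's code: the occupancy-only voxel grid, the function names (`voxelize`, `slice_at`, `directional_videos`), and the choice of six axis-aligned directions starting from a prompt voxel are all assumptions made for illustration. Each resulting "video" is just an ordered list of 2D slices that a video segmentation model could consume frame by frame.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = List[List[List[int]]]
Frame = List[List[int]]


def voxelize(points: Sequence[Point], resolution: int) -> Grid:
    """Mark every cell of a resolution^3 occupancy grid that contains a point."""
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    grid = [[[0] * resolution for _ in range(resolution)] for _ in range(resolution)]
    for p in points:
        idx = []
        for a in range(3):
            span = (maxs[a] - mins[a]) or 1.0  # avoid dividing by zero on flat axes
            i = int((p[a] - mins[a]) / span * resolution)
            idx.append(min(i, resolution - 1))  # clamp the max coordinate into the grid
        grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


def slice_at(grid: Grid, axis: int, index: int) -> Frame:
    """Return the 2D slice of the grid at a fixed index along one axis."""
    n = len(grid)
    if axis == 0:
        return [[grid[index][j][k] for k in range(n)] for j in range(n)]
    if axis == 1:
        return [[grid[i][index][k] for k in range(n)] for i in range(n)]
    return [[grid[i][j][index] for j in range(n)] for i in range(n)]


def directional_videos(grid: Grid, anchor: Tuple[int, int, int]) -> Dict[str, List[Frame]]:
    """Walk outward from the anchor voxel along +/- each axis; each walk is one 'video'."""
    n = len(grid)
    videos: Dict[str, List[Frame]] = {}
    for axis, name in enumerate("xyz"):
        start = anchor[axis]
        videos["+" + name] = [slice_at(grid, axis, i) for i in range(start, n)]
        videos["-" + name] = [slice_at(grid, axis, i) for i in range(start, -1, -1)]
    return videos


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
    grid = voxelize(cloud, resolution=4)
    videos = directional_videos(grid, anchor=(2, 2, 2))  # hypothetical point prompt
    for direction, frames in videos.items():
        print(direction, len(frames), "frames")
```

Every video starts at the slice containing the prompt, so a 2D prompt on that first frame can be propagated outward in each direction and the per-frame masks stacked back into a 3D mask.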
DSLP: A Data Science Project Management Framework
I found this introduction to the DSLP (Data Science Lifecycle Process) framework quite fun to read. DSLP enhances data science project management by structuring projects into four stages: Define, Develop, Deliver, and Deploy. This framework improves team productivity, ensures business goal alignment, and facilitates stakeholder communication. DSLP provides a clear roadmap that helps data science teams deliver more consistent and impactful results.
Co-STORM: Engaged Learning through Language Model Conversations
The authors introduce Co-STORM, a collaborative system where users learn by observing and steering conversations between language model agents. The system helps users discover unknown information through these discussions, organizing findings into a dynamic mind map and generating comprehensive reports. What's fascinating is that this approach outperforms traditional search engines and chatbots in user evaluations, making it a promising tool for enhancing learning and discovery.
Building Large Language Models
This guest lecture by Yann Dubois provides a concise overview of building a ChatGPT-like model, covering both pretraining (language modeling) and post-training (SFT/RLHF). For each component, it explores standard practices in data collection, algorithms, and evaluation methods.
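As a small illustration of the pretraining objective mentioned above, here is a generic sketch of the next-token cross-entropy loss in plain Python. It is not taken from the lecture: the function names and the toy logits are assumptions, and a real model would produce the logits and compute this over large batches with an autodiff library.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities from raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position: Sequence[Sequence[float]],
                    token_ids: Sequence[int]) -> float:
    """Average cross-entropy of predicting token t+1 from the logits at position t."""
    targets = token_ids[1:]  # position t is scored against the token at t+1
    if not targets:
        raise ValueError("need at least two tokens to form a prediction target")
    if len(logits_per_position) < len(targets):
        raise ValueError("need one row of logits per predicted token")
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


if __name__ == "__main__":
    tokens = [0, 1, 2]  # toy sequence over a 4-token vocabulary
    uniform = [[0.0] * 4, [0.0] * 4]  # a model that knows nothing
    confident = [[0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0]]  # a model that is right
    print(next_token_loss(uniform, tokens))
    print(next_token_loss(confident, tokens))
```

Supervised fine-tuning typically reuses this same loss on curated prompt and response pairs, usually scoring only the response tokens; RLHF instead optimizes against a learned reward signal.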
Foundation Models for Music
This survey highlights underexplored areas in music representation, the limitations of existing methods, and the potential of foundation models in various music-related tasks, including understanding, generation, and medical applications.