AI Distillation #4

How to build LLMs, SAM for 3D points, Observing AI agents for better learning, and more...

Sep 03, 2024

These are some excellent resources I found covering various CV, NLP, and optimization topics over the past week.

Spann3R: 3D Reconstruction with Spatial Memory

Spann3R is a method for dense 3D reconstruction from ordered or unordered image collections. Building on DUSt3R, it uses a transformer to regress a pointmap for each image directly in a global coordinate system, so no optimization-based global alignment is needed. The key idea is an external spatial memory that keeps track of previously seen 3D information, which the model queries to predict the 3D structure of the next frame. Check out the project page for more.

CogVLM2-Video: A Temporally-Aware Video Understanding Model

[Project Page]

CogVLM2 is a new generation of visual language models for image and video understanding: CogVLM2, CogVLM2-Video, and GLM-4V. CogVLM2-Video integrates multi-frame input with temporal grounding, and the authors report state-of-the-art results on multiple benchmarks. These models are open-sourced, so check them out for your work.

ReconX: Reconstruct Any Scene from Sparse Views

[Project Page]

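ReconX tackles 3D scene reconstruction from sparse input views by reframing the ill-posed reconstruction problem as a temporal generation task. It uses the generative prior of a pre-trained video diffusion model, conditioned on 3D structure extracted from the input views, to synthesize detail-preserving, 3D-consistent frames, and then recovers the scene from the generated video with 3D Gaussian Splatting.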

SAM2Point: Segment Any 3D as Videos

[Project Page]

SAM2Point adapts the Segment Anything Model 2 (SAM 2) for 3D segmentation in a zero-shot and promptable manner. It interprets 3D data as a series of multi-directional videos, allowing SAM 2 to perform 3D segmentation without further training or 2D-3D projection. SAM2Point supports various prompt types and generalizes across different 3D datasets.
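
To make the "3D as videos" idea concrete, here is a minimal sketch, not the authors' code. It assumes the 3D data has already been voxelized into a nested-list grid, and that a point prompt splits the volume into six axis-aligned slice sequences that each start at the prompt and move outward. The names `slice_along` and `multi_directional_videos` are hypothetical, and the SAM 2 call itself is left out.

```python
def slice_along(grid, axis, index):
    """Return the 2D slice of a nested-list voxel grid at `index` along `axis`."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    if axis == 0:
        return [[grid[index][y][z] for z in range(nz)] for y in range(ny)]
    if axis == 1:
        return [[grid[x][index][z] for z in range(nz)] for x in range(nx)]
    return [[grid[x][y][index] for y in range(ny)] for x in range(nx)]


def multi_directional_videos(grid, prompt):
    """Split a voxel grid into six frame sequences that start at the prompt.

    Assumption for illustration: each sequence begins with the slice that
    contains the prompt voxel and walks toward one face of the volume, so a
    video segmenter could propagate a mask from the first frame onward.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    videos = {}
    for axis, name in enumerate("xyz"):
        start = prompt[axis]
        # Forward direction: prompt slice up to the last slice on this axis.
        videos["+" + name] = [slice_along(grid, axis, i)
                              for i in range(start, dims[axis])]
        # Backward direction: prompt slice down to slice 0 on this axis.
        videos["-" + name] = [slice_along(grid, axis, i)
                              for i in range(start, -1, -1)]
    return videos


# Toy 4 x 3 x 2 grid whose values encode their own coordinates.
grid = [[[100 * x + 10 * y + z for z in range(2)] for y in range(3)]
        for x in range(4)]
videos = multi_directional_videos(grid, prompt=(1, 2, 0))
```

Each of the six sequences would then be passed to a video segmenter with the prompt placed on its first frame, and the per-frame masks merged back into a single 3D mask.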

DSLP: A Data Science Project Management Framework

[Blog]

I found this introduction to the DSLP (Data Science Lifecycle Process) framework quite fun to read. DSLP enhances data science project management by structuring projects into four stages: Define, Develop, Deliver, and Deploy. This framework improves team productivity, ensures business goal alignment, and facilitates stakeholder communication. DSLP provides a clear roadmap that helps data science teams deliver more consistent and impactful results.

Co-STORM: Engaged Learning through Language Model Conversations

[Paper Page] [Code]

The authors introduce Co-STORM, a collaborative system where users learn by observing and steering conversations between language model agents. The system helps users discover unknown information through these discussions, organizing findings into a dynamic mind map and generating comprehensive reports. What's fascinating is that this approach outperforms traditional search engines and chatbots in user evaluations, making it a promising tool for enhancing learning and discovery.

Building Large Language Models

This guest lecture by Yann Dubois provides a concise overview of building a ChatGPT-like model, covering both pretraining (language modeling) and post-training (SFT/RLHF). For each component, it explores standard practices in data collection, algorithms, and evaluation methods.
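
As a rough illustration of the pretraining objective, here is a minimal sketch of the standard next-token cross-entropy loss in plain Python. It is not taken from the lecture; the function name `next_token_loss` and the toy inputs are assumptions for illustration, and real training code would compute this on batched tensors in a deep learning framework.

```python
import math


def next_token_loss(logits, token_ids):
    """Average cross-entropy for predicting token t+1 from position t.

    logits[t] is a list of unnormalized vocabulary scores at position t;
    token_ids is the observed token sequence.
    """
    steps = len(token_ids) - 1
    total = 0.0
    for t in range(steps):
        row = logits[t]
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        # Negative log-probability of the token that actually came next.
        total += log_z - row[token_ids[t + 1]]
    return total / steps


# A model that scores all 4 vocabulary items equally is maximally unsure.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_loss(uniform_logits, token_ids=[0, 1, 2])
perplexity = math.exp(loss)
```

With uniform scores over a 4-token vocabulary, the loss equals log 4 and the perplexity equals 4, the usual sanity check for an untrained model.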

Foundation Models for Music

[Paper Page]

This survey highlights underexplored areas in music representation, the limitations of existing methods, and the potential of foundation models in various music-related tasks, including understanding, generation, and medical applications.
