Gradient Ascent

Depth Anything, Vision Mamba, Self-Extending LLMs, and more...

A round-up of the most interesting resources from the week gone by

Sairam Sundaresan's avatar
Sairam Sundaresan
Jan 23, 2024

These are some of the most interesting resources I found over the past week covering a range of topics in computer vision and NLP. If you’re a new reader, I alternate between curating resources and long-form writing. As always, your feedback and comments are welcome!

Depth Anything

[Project Page]

Depth Anything is a streamlined approach to monocular depth estimation. Instead of adding new technical modules, it turns to a vast dataset (~62M unlabeled examples) to enhance adaptability and reduce errors. Using robust data augmentation and leveraging pre-trained encoders for semantic insights, it generalizes across diverse datasets and real-world images. This work is worth checking out.

Instant Identity-Preserving Generation

[Project Page]

The authors propose InstantID, a method for identity-preserving image synthesis that needs just a single facial image. They introduce a plug-and-play module that enhances existing text-to-image diffusion models like SD 1.5 and SDXL. By integrating an ID embedding with an Image Adapter and IdentityNet, InstantID can generate personalized images in a variety of styles while maintaining a high level of fidelity.

CamP and Zip-NeRF Code Released!

[Repository]

Researchers from Google have released implementations for both CamP and Zip-NeRF. The implementations are in JAX and worth checking out if you're working on rendering and reconstruction projects.

Text-Driven Object Insertion in 3D Scenes

[Project Page]

InseRF inserts 3D objects into NeRF reconstructions of scenes using textual descriptions and a 2D bounding box. It can generate new objects in 3D scenes with consistency across multiple views without needing explicit 3D information. The method involves grounding 3D object insertion to a 2D object insertion in a reference view, then lifting the 2D edit to 3D using a single-view object reconstruction method. 

DocGraphLM: Documental Graph LM for Information Extraction

[Paper]

Researchers from JP Morgan integrate graph neural networks with pretrained language models to enhance document representation. Combining graph features with language features, together with a novel link prediction objective, leads to consistent improvements on information extraction and question-answering tasks, as well as faster convergence in learning. This framework marks a significant advancement in visually rich document understanding.
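As a hedged sketch of the fusion idea (the module names, dimensions, and scoring head here are illustrative, not the paper's actual architecture): embed each document segment with a language model and its layout neighborhood with a GNN, concatenate the two, and score candidate links between segments.

```python
import torch
import torch.nn as nn

class JointNodeEncoder(nn.Module):
    """Illustrative fusion of LM and graph features for link prediction.

    lm_emb:  language-model embedding of a document segment's text
    gnn_emb: graph-neural embedding of the segment's layout neighborhood
    """
    def __init__(self, lm_dim=768, gnn_dim=128, hidden=256):
        super().__init__()
        # Fuse the concatenated language + graph features into one vector
        self.fuse = nn.Linear(lm_dim + gnn_dim, hidden)
        # Bilinear head scoring whether segment i links to segment j
        self.link_scorer = nn.Bilinear(hidden, hidden, 1)

    def forward(self, lm_emb, gnn_emb):
        return torch.relu(self.fuse(torch.cat([lm_emb, gnn_emb], dim=-1)))

    def link_score(self, h_i, h_j):
        return self.link_scorer(h_i, h_j).squeeze(-1)
```

The fused representations can then be trained jointly on extraction and link-prediction losses.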

Vision Mamba: State Space Models for Vision

[Paper]  [Repository]

Vision Mamba (Vim) introduces a state space model approach for efficient visual representation learning. Vim marks image sequences with positional embeddings and employs bidirectional state space models, eliminating the reliance on self-attention. This leads to notable improvements in computation and memory efficiency, particularly with high-resolution images. Vim demonstrates superior performance in various vision tasks, including ImageNet classification, COCO object detection, and ADE20K semantic segmentation, while offering significant speed and memory advantages.
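As a toy illustration of the bidirectional-scan idea (this is not Vim's actual selective SSM, which uses learned, input-dependent dynamics; the fixed decay here is purely for intuition), each patch position aggregates decayed context from both directions of the sequence:

```python
import torch

def bidirectional_scan(x, decay=0.9):
    """Toy bidirectional recurrence over a patch sequence (seq_len, dim).

    Runs a simple decaying scan forward and backward, then sums them, so
    every position sees context from both ends without self-attention.
    """
    fwd = torch.zeros_like(x)
    bwd = torch.zeros_like(x)

    h = torch.zeros(x.shape[1])
    for t in range(x.shape[0]):            # forward scan: h_t = decay*h_{t-1} + x_t
        h = decay * h + x[t]
        fwd[t] = h

    h = torch.zeros(x.shape[1])
    for t in reversed(range(x.shape[0])):  # backward scan over the same sequence
        h = decay * h + x[t]
        bwd[t] = h

    return fwd + bwd
```

The linear recurrence is what gives scan-based models their efficiency edge over quadratic self-attention at high resolutions.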

LoRA from Scratch

[Link]

Sebastian Raschka has written a comprehensive guide on implementing LoRA from scratch. The tutorial is hands-on, offering practical insights and coding examples. It's an excellent resource for anyone looking to understand and implement LoRA in their machine learning projects.
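To give a flavor of what such a from-scratch implementation involves (this is my own minimal sketch, not Raschka's code; layer names and defaults are illustrative): a frozen pretrained linear layer plus a trainable low-rank update, with the second factor zero-initialized so training starts from the pretrained behavior.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        # A: small random init; B: zeros, so the update is zero at the start
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Only `A` and `B` receive gradients, which is why LoRA fine-tuning needs a tiny fraction of the memory of full fine-tuning.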

Self-Extend LLM Context Window without Tuning

[Paper]  [Repository]

This paper introduces a method for extending the context window of LLMs without the need for fine-tuning. The approach involves a simple mapping function and grouped attention, allowing the LLMs to handle longer contexts naturally, with just a four-line code change.
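The core trick can be sketched in a few lines (a simplified reading of the paper's grouped positions; the function and variable names are mine): relative positions within a local window stay exact, while distant positions are floor-divided into groups and shifted so the two regimes join at the window boundary, keeping all positions inside the range the model saw during pretraining.

```python
def self_extend_rel_pos(q_pos, k_pos, group_size, window):
    """Relative position for Self-Extend-style grouped attention (sketch)."""
    if q_pos - k_pos < window:
        # Neighborhood attention: exact relative positions, as in pretraining
        return q_pos - k_pos
    # Grouped attention: floor-divide positions, shifting the query side so
    # the grouped regime lines up with the exact regime at the boundary
    g_q = q_pos // group_size + window - window // group_size
    g_k = k_pos // group_size
    return g_q - g_k
```

With `group_size = 4`, positions up to roughly `group_size * (pretrain_len - window)` map back into the pretrained range, which is why no tuning is needed.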
