Karpathy's Autonomous ML Lab, Sleeper Cells in LLMs, and Andrew Ng's Context Hub - 📚 The Tokenizer Edition #19

This week's most valuable AI resources

Mar 11, 2026

Hey there! This week’s curation spans from AI systems that run overnight experiments autonomously to backdoors hiding inside your favorite tool-using agents. Whether you’re thinking about building agents, defending them, or just trying to understand what your GPU actually wants from you, there’s something here.

New here? The Tokenizer is my resource-focused newsletter edition where I curate the best AI/ML papers, videos, articles, tools, and learning resources so you don’t have to sift through the noise. Subscribe to Gradient Ascent for the full experience.

TL;DR

What caught my attention this week:

• 📄 Papers: Smarter reasoning graphs, temporal backdoors in tool-using LLMs, spatial reasoning benchmarks, privacy advantages of diffusion language models, and versatile video editing

• 🎥 Videos: Hardware constraints shaping LLM architecture, brand-consistent image generation with Midjourney, a complexity taxonomy for AI agents, and a critical look at vibe coding

• 📰 Reads: Statistical rigor for LLM evaluations, the infrastructure costs of long-context inference, and the current state of open-weight models catching up to closed frontiers

• 🛠 Tools: A 101k-star collection of LLM applications and a versioned API documentation system for coding agents

• 🎓 Learning: Karpathy’s tool that lets AI agents run autonomous ML experiments overnight

📄 5 Papers

1. RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

https://arxiv.org/abs/2603.05818

Graph of Thoughts is powerful but expensive, treating every reasoning node equally regardless of difficulty. RouteGoT fixes this by adaptively routing compute across graph nodes, skipping the heavy lifting where it isn’t needed. The result: 8.1 percentage points more accurate than AGoT while using 79.1% fewer output tokens.

2. Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

https://arxiv.org/abs/2603.03371

Here’s a security scenario worth losing sleep over: backdoors implanted in tool-using LLMs that only activate under specific temporal conditions. The model maintains state-of-the-art performance on benign tasks and evades standard safety evaluations, until a particular time trigger flips the switch. A sobering look at the gap between current evaluation practices and actual deployment safety.

3. SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in LLMs

https://arxiv.org/abs/2603.03002

Do LLMs actually understand spatial relationships, or are they just pattern-matching on word co-occurrences? SpatialText isolates spatial reasoning from visual shortcuts using pure text, and the results aren’t flattering. Current models lean heavily on linguistic heuristics rather than building coherent spatial representations, which matters for anyone building systems that need to reason about physical space.

4. Characterizing Memorization in Diffusion Language Models

https://arxiv.org/abs/2603.02333

Diffusion-based language models turn out to have a meaningful privacy advantage over autoregressive ones. This paper shows they exhibit substantially lower memorization-based leakage of personally identifiable information. If you’re building generative systems where training data sensitivity matters (medical, legal, financial), this distinction between generation architectures is worth understanding.

5. Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

https://arxiv.org/abs/2603.02175 | GitHub

Video editing that combines instruction-based and reference-based approaches in one pipeline. Kiwi-Edit constructs high-fidelity training data using synthetic reference scaffolds, sidestepping the usual bottleneck of paired video editing datasets. The result is a system that handles both “make the sky purple” style instructions and “make it look like this reference” editing in a single model.

🎥 4 Videos

1. How Hardware Constraints Are Shaping Modern LLM Architecture

Julia Turc explores how physical hardware realities (memory bandwidth, compute density, interconnect speeds) are actively driving architectural decisions in modern LLMs. If you’ve wondered why certain design choices keep showing up across different labs, the answer often starts with what the silicon can actually do efficiently.

2. Mastering Midjourney: Consistent Brand Imagery Without Complex Prompts

On How I AI, AI creative director Jamey Gannon walks through her workflow for generating images that maintain consistent brand identity across multiple Midjourney outputs using style references, personalization codes, and mood boards. Useful for anyone who’s gotten great individual images from AI tools but struggled to make a cohesive visual set for a brand, product, or campaign.

3. The 5 Levels of AI Agent Complexity

Dave Ebbelaar breaks down AI agents into five distinct complexity levels, from simple single-tool agents to sophisticated multi-agent orchestration systems. Helpful framing for teams trying to scope what kind of agent they actually need (often simpler than they think) and understanding the engineering effort each level demands.

4. Vibe Coding is a Slot Machine

Machine Learning Street Talk sits down with Jeremy Howard (fast.ai) to examine whether AI coding assistants are genuinely improving developer productivity or creating a false sense of progress. Howard’s central argument: if you outsource all your thinking to computers, you stop building the competence that makes you effective. The kind of critical examination teams need before making investment decisions around AI-assisted development tooling.

📰 3 Curated Reads

1. Applying Statistics to LLM Evaluations

Deep (Learning) Focus

Applying Statistics to LLM Evaluations

Research on large language models (LLMs) is empirically driven. For this reason, model evaluations play a pivotal role in the field’s progress. We improve models by making changes, evaluating them, and iterating. Despite their foundational role, however, evaluations are usually handled in a naive manner. In most cases, we just test a mod…

4 months ago · 66 likes · 2 comments · Cameron R. Wolfe, Ph.D.

Cameron R. Wolfe from Deep (Learning) Focus tackles a problem most benchmarking papers quietly ignore: the statistical validity of LLM evaluation results. Key finding worth internalizing: clustered standard errors can increase reported uncertainty by 3x, and the Central Limit Theorem becomes unreliable with small sample sizes. If you’ve ever looked at a leaderboard and wondered “is this difference actually meaningful?”, this article gives you the tools to answer that.

2. How Long Context Inference Is Rewriting the Future of Transformers

Artificial Intelligence Made Simple

How Long Context Inference Is Rewriting the Future of Transformers

It takes time to create work that’s clear, independent, and genuinely useful. If you’ve found value in this newsletter, consider becoming a paid subscriber. It helps me dive deeper into research, reach more people, stay free from ads/hidden agendas, and supports my crippling chocolate milk addiction…

4 months ago · 56 likes · 2 comments · Devansh

AI Made Simple quantifies what long-context windows actually cost in production. A 70B-parameter model serving 59 concurrent users at 4K context drops to just 1 user at 128K context. The article covers the engineering responses, including Multi-Head Latent Attention (MLA) achieving 93.3% cache reduction. Practical reading for anyone deploying models where context length isn’t just a spec sheet number but a capacity planning constraint.

3. A Dream of Spring for Open-Weight LLMs

Ahead of AI

A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026

If you have struggled a bit to keep up with open-weight model releases this month, this article should catch you up on the main themes…

5 months ago · 183 likes · 9 comments · Sebastian Raschka, PhD

Sebastian Raschka surveys the current open-weight landscape across 10 models and finds the gap to closed frontiers narrowing fast. GLM-5 now benchmarks on par with GPT-5.2, Gemini Pro 3, and Claude 4.6 Opus. On the inference side, Step 3.5 Flash hits 100 tokens/sec compared to DeepSeek’s 33. A well-structured overview for tracking which open models are actually competitive and where the remaining gaps lie.

🛠 2 Tools & Repos

1. Awesome LLM Apps

https://github.com/Shubhamsaboo/awesome-llm-apps

At 101k stars, this collection of LLM application examples covers RAG implementations, AI agents (single and multi-agent teams), MCP integration patterns, voice AI, and fine-tuning guides. More useful as a reference architecture library than a tutorial. When you’re building something new, check here first to see how others have solved similar problems.

2. Context Hub

https://github.com/andrewyng/context-hub

From Andrew Ng’s team, Context Hub gives coding agents access to curated, versioned API documentation instead of letting them hallucinate library APIs. Features search and fetch, language-specific variants, persistent annotations, and feedback loops so the documentation improves over time. At 3.5k stars, it’s gaining traction with teams building custom coding agents that need reliable API knowledge.

🎓 1 Pick of the Week

Autoresearch

https://github.com/karpathy/autoresearch

Andrej Karpathy’s latest side project is a Python tool built around three core files (prepare.py, train.py, program.md) that lets an AI agent run autonomous ML experiments on a single GPU. The agent proposes hypotheses, writes training code, runs experiments with a 5-minute budget each, and iterates based on results. Leave it running overnight and wake up to a stack of completed experiments. At 23.6k stars within days of release, it’s clearly struck a nerve. Sparks of recursive self-improvement, indeed.

Thanks for reading The Tokenizer! If you found this valuable, please share it with your colleagues and consider subscribing to Gradient Ascent for more AI insights.

Discussion about this post

Ready for more?