Reading the Mind of an LLM
Speakers
Gabriele Orlandi
AI scientist at xtream
Luca Baggi
AI engineer at xtream and open source contributor
Abstract
What if you could watch an AI's thoughts take shape? For years, LLMs have been treated as impenetrable "black boxes," but we are finally beginning to find ways to see how the ghost in the machine actually works.
This talk explores mechanistic interpretability, a subfield of AI that aims to understand the internal workings of neural networks. Mapping these internal "circuits" is not just a philosophical curiosity: it is a high-stakes engineering necessity for safety, debugging, and trust.
Description
What if we could step inside an LLM and watch it think in real time?
This talk distills the latest research from Anthropic, DeepMind, and OpenAI to present the current state of the art in LLM interpretability.
We’ll start with the modern interpretation of embeddings as sparse, monosemantic features living in high-dimensional space. From there, we’ll explore emerging techniques such as circuit tracing and attribution graphs, and see how researchers reconstruct the computational pathways behind behaviors like multilingual reasoning, refusals, and hallucinations.
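To make the "sparse, monosemantic features" idea concrete, here is a minimal sketch of the kind of sparse-autoencoder decomposition this line of work builds on: a dense activation vector is rewritten as a sparse, non-negative combination of a much larger dictionary of candidate feature directions. The dimensions, architecture, and loss below are illustrative assumptions, not the exact setup used by any of the labs mentioned above.

```python
# Toy sparse autoencoder: decompose a dense activation into sparse "features".
# All sizes (D_MODEL, N_FEATURES) and the ReLU encoder / linear decoder choice
# are illustrative, not a specific lab's implementation.
import torch
import torch.nn as nn

D_MODEL, N_FEATURES = 512, 4096  # hidden size vs. overcomplete feature dictionary

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature pre-activations
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstructed activation

    def forward(self, activation: torch.Tensor):
        features = torch.relu(self.encoder(activation))  # sparse, non-negative activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(D_MODEL, N_FEATURES)
activation = torch.randn(1, D_MODEL)  # stand-in for a residual-stream activation
features, reconstruction = sae(activation)

# Training minimises reconstruction error plus an L1 penalty that drives most
# feature activations to zero -- the "sparse" part of the decomposition.
loss = nn.functional.mse_loss(reconstruction, activation) + 1e-3 * features.abs().sum()
print(features.shape, float(loss))
```

In practice, each learned feature direction is then inspected (e.g., by finding the inputs that activate it most strongly) to check whether it corresponds to a single, human-interpretable concept.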
We’ll also look at new evidence suggesting that models may have limited forms of introspection—clarifying what they can, and crucially cannot, reliably report about their internal processes.
Finally, we’ll connect these “microscopic” insights to real engineering practice: how feature-level understanding can improve debugging, safety, and robustness in deployed AI systems, and where current methods still fall short.