mechanistic interpretability

The study of how neural networks compute internally, pursued by reverse-engineering the specific circuits and components (such as neurons, attention heads, and weights) responsible for a model's behavior.
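One common technique in this field is ablation: zeroing out a component and observing how the output changes, which reveals whether that component is part of the circuit responsible for a behavior. The sketch below is a hypothetical toy illustration (the network, weights, and `forward` helper are invented for the example) using a tiny hand-wired network that computes XOR.

```python
def relu(x):
    return max(0.0, x)

# Hand-wired 2-2-1 network computing XOR of two binary inputs.
# Hidden unit 0 fires when either input is on; hidden unit 1 fires
# only when both are on, and the output weights subtract it off.
W1 = [[1.0, 1.0], [1.0, 1.0]]   # input -> hidden weights
b1 = [0.0, -1.0]                # hidden biases
W2 = [1.0, -2.0]                # hidden -> output weights

def forward(x, ablate=None):
    """Run the network; optionally zero out hidden unit `ablate`."""
    h = [relu(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    if ablate is not None:
        h[ablate] = 0.0         # the ablation: knock out one unit
    return W2[0] * h[0] + W2[1] * h[1]

# Baseline behavior: the network computes XOR.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(x))        # -> 0.0, 1.0, 1.0, 0.0

# Ablating hidden unit 1 breaks only the (1, 1) case: the output
# becomes 2.0 instead of 0.0, showing that unit 1 implements the
# "suppress when both inputs are on" part of the XOR circuit.
print(forward((1, 1), ablate=1))  # -> 2.0
```

Because ablating unit 1 changes the output only on the (1, 1) input, we can attribute that specific part of the behavior to that specific unit, which is the kind of causal, component-level explanation mechanistic interpretability aims for.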