Explaining Neurons
This page is a summary sheet for this research area: its general goal, reference papers (both mine and external) that give an overview of the topic, and the domains explored so far. We are also interested in extending these techniques beyond their traditional domains. If you have expertise in other areas (e.g., neuroscience, gaming, or audio/speech modeling), we would be happy to explore potential extensions into those fields.
Goal: The goal of this research area is to understand what deep neural networks learn during the training process. Recently, this field has been categorized under the umbrella term Mechanistic Interpretability. My research projects usually focus on analyzing the behavior of individual neurons and groups of neurons, identifying the concepts they learn to recognize, and understanding the relationships between these concepts. They typically combine tools from classical AI (e.g., heuristic search and clustering), statistical analysis, and recent advancements in AI to explore this direction.
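As a toy illustration of what "identifying the concepts a neuron recognizes" can mean in practice, the sketch below scores a single neuron against a set of annotated concept masks using the intersection-over-union (IoU) of its binarized activation maps. All identifiers, the percentile threshold, and the random toy data are illustrative assumptions, not the exact procedure used in the papers referenced below.

```python
import numpy as np

def binarize_activations(activations: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Turn a neuron's activation maps (n_images, H, W) into boolean masks
    by thresholding at a high percentile of its activation distribution."""
    threshold = np.percentile(activations, percentile)
    return activations > threshold

def iou(neuron_mask: np.ndarray, concept_mask: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks of the same shape."""
    union = np.logical_or(neuron_mask, concept_mask).sum()
    return float(np.logical_and(neuron_mask, concept_mask).sum() / union) if union > 0 else 0.0

def explain_neuron(activations: np.ndarray, concept_masks: dict) -> tuple:
    """Return the concept whose annotations best overlap the neuron's
    high-activation regions, together with the corresponding IoU score."""
    neuron_mask = binarize_activations(activations)
    scores = {name: iou(neuron_mask, mask) for name, mask in concept_masks.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy usage with fake data: 10 activation maps of size 7x7 for one neuron,
# and two hypothetical concept annotations aligned with the same images.
rng = np.random.default_rng(0)
activations = rng.random((10, 7, 7))
concepts = {"dog": rng.random((10, 7, 7)) > 0.7, "grass": rng.random((10, 7, 7)) > 0.7}
print(explain_neuron(activations, concepts))
```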
Domains: NLP, Vision.
Reference Papers:
- Logical (Compositional) Explanations (a simplified sketch follows this list): [(La Rosa et al., 2023)], [(Makinwa et al., 2022)]
- Linear Explanations: [Link]
- Circuits (chains of neurons): [Link]
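The logical (compositional) explanations in the first item above extend the single-concept match to short logical formulas over concepts. The function below is a deliberately simplified, greedy version of that idea; the referenced papers rely on more careful search procedures (e.g., beam or heuristic search) and richer evaluation, so treat this as an assumption-based illustration rather than their algorithm. It can be applied to the neuron mask and concept dictionary from the previous sketch.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

def compositional_explanation(neuron_mask: np.ndarray, concepts: dict, max_length: int = 3):
    """Greedily grow a logical formula over concept masks (AND, OR, AND NOT)
    that maximizes IoU with a neuron's binarized activation mask."""
    # Start from the single concept that best matches the neuron.
    formula, mask = max(concepts.items(), key=lambda item: iou(neuron_mask, item[1]))
    score = iou(neuron_mask, mask)
    for _ in range(max_length - 1):
        best = None
        for name, m in concepts.items():
            for op, combined in (("AND", mask & m),
                                 ("OR", mask | m),
                                 ("AND NOT", mask & ~m)):
                s = iou(neuron_mask, combined)
                if best is None or s > best[0]:
                    best = (s, f"({formula} {op} {name})", combined)
        if best[0] <= score:  # stop when no composition improves the fit
            break
        score, formula, mask = best
    return formula, score
```

The greedy loop trades the exhaustiveness of a beam search for brevity: each step keeps only the single best extension, which is enough to show why a longer formula can describe a neuron's behavior better than any single concept.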
References
- La Rosa et al. (2023). Towards a fuller understanding of neurons with Clustered Compositional Explanations. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
- Makinwa et al. (2022). Detection Accuracy for Evaluating Compositional Explanations of Units. In AIxIA 2021 - Advances in Artificial Intelligence, 2022.