New AI Method Finds Hidden Model Interactions
March 23, 2026 · 4 min read
Understanding how complex machine learning systems make decisions remains one of artificial intelligence's most pressing challenges. As models grow larger and more sophisticated, their behavior emerges from intricate interactions between features, training data, and internal components rather than from isolated elements. Traditional interpretability methods struggle with this complexity, particularly because the number of potential interactions grows exponentially with system size, making exhaustive analysis computationally impossible. This limitation has hindered progress toward transparent, trustworthy AI systems that can explain their reasoning to both developers and affected users.
Researchers have developed two new algorithms, SPEX and ProxySPEX, that overcome this scalability barrier by identifying influential interactions with far fewer computational resources than previous approaches. Both algorithms build on the concept of ablation, which measures influence by observing what changes when components are removed from the system. The core innovation lies in exploiting structural properties of machine learning models to transform an intractable search problem into a solvable sparse recovery problem, drawing on techniques from signal processing and coding theory.
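The ablation idea can be shown with a toy model. This is a minimal sketch, not the paper's implementation: `toy_model`, its coefficients, and the zero-masking convention are all assumptions chosen for illustration.

```python
# Ablation-based influence on a hypothetical toy model: measure how the
# output changes when individual components are "removed" (zeroed out).
def toy_model(features):
    # Invented model: output depends on an interaction between features
    # 0 and 2, plus an independent contribution from feature 1.
    return 2.0 * features[0] * features[2] + 0.5 * features[1]

def ablate(features, removed):
    # One common ablation choice: replace removed components with zeros.
    return [0.0 if i in removed else f for i, f in enumerate(features)]

full_input = [1.0, 1.0, 1.0]
baseline = toy_model(full_input)

# Influence of each single feature: output change when it alone is removed.
influence = {
    i: baseline - toy_model(ablate(full_input, {i}))
    for i in range(len(full_input))
}
print(influence)  # {0: 2.0, 1: 0.5, 2: 2.0}
```

Single-feature ablations alone cannot tell that features 0 and 2 matter only jointly; probing every subset would take exponentially many ablations, which is the bottleneck SPEX and ProxySPEX address.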
SPEX operates on two key observations about model behavior: sparsity, meaning relatively few interactions truly drive outputs, and low-degreeness, meaning influential interactions typically involve only small subsets of features. By strategically selecting ablations that combine many candidate interactions, then using efficient decoding algorithms to disentangle these combined signals, SPEX can identify specific interactions responsible for model behavior. ProxySPEX adds another structural observation—hierarchy, where important higher-order interactions imply important lower-order subsets—achieving similar performance with approximately ten times fewer ablations than SPEX.
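The sparsity and low-degree observations can be sketched as a recovery problem. The snippet below is a simplified stand-in, not SPEX itself: it uses an ordinary least-squares fit over a degree-at-most-2 interaction basis in place of the actual coding-theoretic decoder, and the black-box function `f` and sample count are invented for illustration.

```python
# Simplified stand-in for sparse interaction recovery: if the black box is
# sparse (few interaction terms) and low-degree (small subsets), far fewer
# random ablation masks than the 2^n possible subsets suffice to fit it.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 8  # number of features/components

def f(mask):
    # Hypothetical black box: one pair interaction plus one singleton,
    # i.e. sparse (2 terms) and low degree (at most 2).
    return 3.0 * mask[1] * mask[4] + 1.0 * mask[6]

# Low-degree basis: all candidate interactions of size <= 2.
basis = [()] + [(i,) for i in range(n)] + list(combinations(range(n), 2))

# 100 random ablation masks instead of all 2^8 = 256 subsets; the gap
# widens dramatically as n grows.
masks = rng.integers(0, 2, size=(100, n))
A = np.array([[np.prod([m[i] for i in S]) for S in basis] for m in masks])
y = np.array([f(m) for m in masks])

coef, *_ = np.linalg.lstsq(A, y, rcond=None)
recovered = {S: c for S, c in zip(basis, coef) if abs(c) > 1e-6}
print(recovered)  # recovers the (1, 4) pair and the (6,) singleton
```

ProxySPEX's extra hierarchy assumption would shrink this search further: a candidate pair like `(1, 4)` need only be considered if its singleton subsets already look influential.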
The algorithms demonstrate practical effectiveness across three interpretability domains. In feature attribution, SPEX maintained high faithfulness on sentiment analysis tasks as context scaled to thousands of features, unlike marginal approaches that failed to capture complex interactions. On a modified trolley problem where GPT-4o mini failed 92% of the time, SPEX revealed a high-order synergy between specific words that aligned with human intuition about the dilemma's core components, while standard attribution methods identified misleading individual features. When these interacting words were replaced with synonyms, the model's failure rate dropped to near zero.
For data attribution, ProxySPEX applied to a ResNet model trained on CIFAR-10 identified both synergistic interactions, where semantically distinct classes work together to define decision boundaries, and redundant interactions, where visual duplicates reinforce specific concepts. This fine-grained analysis enables new data selection techniques that preserve necessary synergies while safely removing redundancies. In model component attribution, ProxySPEX uncovered interactions between attention heads on an MMLU task, informing a pruning strategy that not only outperformed competing methods but actually improved model performance on the target task.
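The synergy/redundancy distinction suggests a simple data-selection rule. The following sketch uses made-up interaction coefficients and an assumed sign convention (positive for synergy, negative for redundancy); it illustrates the idea only and is not the paper's selection algorithm.

```python
# Hypothetical interaction coefficients over training points (invented for
# illustration; convention assumed here: positive = synergy, negative =
# redundancy). Keys are tuples of training-point ids.
interactions = {
    (0, 1): +0.8,  # synergistic pair: jointly defines a decision boundary
    (2, 3): -0.6,  # redundant pair: visual near-duplicates
    (4,):   +0.3,  # useful point on its own
}

# Keep every point involved in a synergy.
synergistic = {p for pts, c in interactions.items() if c > 0 for p in pts}

# For each redundant group, keep one representative and drop the rest,
# unless a point is also needed by some synergy.
to_drop = set()
for pts, c in interactions.items():
    if c < 0:
        droppable = [p for p in sorted(pts) if p not in synergistic]
        to_drop.update(droppable[1:])  # keep the first as representative

all_points = {p for pts in interactions for p in pts}
keep = all_points - to_drop
print(sorted(keep))  # point 3 is pruned as redundant with point 2
```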
The research reveals structural patterns in how models process information across different layers. Early layers function in a predominantly linear regime where components contribute largely independently, while later layers show more pronounced interactions between attention heads, with most contributions coming from interactions among heads within the same layer. This understanding of structural dependencies is vital for architectural interventions and provides insights into how complex behaviors emerge throughout model depth.
While representing a significant advance in interpretability, extending interaction analysis from dozens to thousands of components, the researchers acknowledge limitations and future directions. Many questions remain about unifying different interpretability perspectives to provide a more holistic understanding of machine learning systems. There is also interest in systematically evaluating interaction methods against existing scientific knowledge in fields like genomics and materials science, which could both ground model interpretations and generate new, testable hypotheses.
The SPEX framework's versatility across the entire model lifecycle—from feature attribution on long-context inputs to identifying synergies among training data points and discovering interactions between internal components—makes it a valuable tool for researchers and practitioners. The code for both SPEX and ProxySPEX is fully integrated and available within the popular SHAP-IQ repository, inviting the research community to build upon these methods and apply them to diverse interpretability problems.