Inside the Black Box: MIT Researchers Develop Method to Expose and ‘Steer’ Hidden Personas in Large Language Models
MIT researchers can now root out hidden biases and "steer" personalities in LLMs like ChatGPT and Claude. Discover the new "Recursive Feature Machine" method.
By: AXL Media
Published: Feb 26, 2026, 8:42 AM EST
Source: MIT News

The Abstract Depths of AI
Modern AI assistants like ChatGPT and Claude are more than just text generators; they have become repositories of human knowledge, capable of mimicking complex human traits. However, exactly how these models represent abstract concepts like "mood" or "personality" has remained largely a mystery. On February 19, 2026, MIT researchers announced a new approach that treats these models not as simple input-output machines, but as complex structures with "hidden" layers that can be mathematically probed. This discovery reveals that LLMs store a vast array of concepts that aren't always active but can be triggered or suppressed with precision.
Baiting the Right Species of Data
Traditionally, scientists have used "unsupervised learning" to find patterns in AI models—a process lead researcher Adit Radhakrishnan compares to throwing a massive net into the ocean and sifting through everything caught. The new MIT method is more like using targeted bait. By utilizing a Recursive Feature Machine (RFM), the team can identify the specific mathematical patterns (vectors) within the model that correspond to a concept of interest. This allows for a much faster and less computationally expensive way to find vulnerabilities or specific traits compared to broad trawling methods.
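The article does not publish the RFM algorithm itself, but the core idea of finding a direction in a model's hidden activations that corresponds to a concept can be illustrated with a much simpler stand-in: contrasting the mean activations of prompts that express a concept against prompts that do not. The sketch below is an illustrative simplification, not the MIT team's actual method; the synthetic data and dimension sizes are invented for the demo.

```python
import numpy as np

def concept_direction(acts_with, acts_without):
    """Estimate a unit-length 'concept vector' as the difference of mean
    hidden activations between examples that express the concept and
    examples that do not (a simplified stand-in for RFM-style probing)."""
    v = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy demo: synthetic 8-dim "activations" where the concept shifts dim 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 8))
acts_with = base + np.eye(8)[0] * 3.0        # concept present: dim 0 shifted
acts_without = rng.normal(size=(100, 8))     # concept absent
v = concept_direction(acts_with, acts_without)
print(np.argmax(np.abs(v)))  # dim 0 carries the concept
```

Because the contrast is targeted at one labeled concept, the search stays cheap: there is no need to decompose every pattern in the model, which is the "targeted bait" advantage the researchers describe.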
The Power to 'Steer' Responses
The researchers tested their method on 512 distinct concepts spanning five classes. Once a concept's vector is identified, it can be amplified or suppressed to steer the model's responses with precision.
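In the broader steering-vector literature, "steering" typically means adding a scaled copy of the concept vector to a layer's hidden activations during a forward pass. The sketch below shows that arithmetic on toy data; the shapes, the `alpha` scale, and the fixed concept direction are illustrative assumptions, not details from the MIT paper.

```python
import numpy as np

def steer(hidden, concept_vec, alpha):
    """Add a scaled concept direction to a layer's hidden activations.
    Positive alpha amplifies the concept; negative alpha suppresses it."""
    return hidden + alpha * concept_vec

rng = np.random.default_rng(1)
h = rng.normal(size=(4, 8))   # (tokens, hidden_dim) toy activations
v = np.eye(8)[2]              # hypothetical unit-length concept direction
steered = steer(h, v, alpha=5.0)
# Each token's projection onto the concept direction grows by exactly alpha.
print(np.allclose(steered @ v - h @ v, 5.0))
```

In a real model this addition would be applied inside the network (for example via a forward hook on one transformer layer) rather than to a standalone array.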
Related Coverage
- Virginia Tech research finds AI models reinforce autism stereotypes when providing social advice
- Brown University research identifies fifteen distinct ethical risks in the use of AI chatbots for mental health counseling
- MIT Researchers Develop ‘TLT’ Method to Double LLM Training Speed Using Idle Computing Time
- Virginia Tech researchers find AI models discourage social interaction for autistic users based on ingrained stereotypes