Research

Current Research Interests

Primary Research Area

My main research focus is on AI safety, with a particular emphasis on developing methods to ensure that advanced AI systems are robust, transparent, and aligned with human values. I am especially interested in the theoretical and practical aspects of AI control and alignment.

Secondary Research Areas

Explainability Methods: Techniques for making AI models more interpretable and their decisions more transparent.
AI Control: Strategies for ensuring that AI systems behave as intended, even in novel or high-stakes situations.
Mechanistic Interpretability: Understanding the internal mechanisms of neural networks to gain insight into how and why they make decisions.
Chain of Thought: Investigating reasoning processes in AI models, including how step-by-step reasoning can improve reliability and safety.

Research Projects

Current Projects

Finding Misinformation Inside Language Models: Comparative XAI Methods on LIAR (May 2025-Present)
Independent
The goal is to evaluate Llama model’s ability to distinguish true from false claims in the LIAR dataset and explain its reasoning in ways humans can understand. By comparing the consistency, depth, and reliability of these explanations, I hope to uncover how such models handle misinformation at inference time — and whether their responses reflect a deeper internal structure or just surface-level fluency.
AI Safety Toolkit (July 2025-Present)
Independent
Small-scale toolkit that demonstrate real-world use of AI governance checks on AI use cases: prompt injection, fairness, goal hijacking, sensitive data leakage etc.

Research Tools & Methods

Mechanistic interpretability techniques
Explainability frameworks (e.g., SHAP, LIME)
Chain-of-thought prompting and analysis, Minicheck technique
AI control and alignment strategies
Python, PyTorch, TensorFlow
Statistical and analytical approaches for model evaluation

Collaboration & Service

I am open to collaborations in AI safety, interpretability, and related areas. Please reach out to anadaria [dot] zahaleanu [at] gmail [dot] com if you are interested in working together or discussing research ideas.