Research
Current Research Interests
Primary Research Area
My main research focus is on AI safety, with a particular emphasis on developing methods to ensure that advanced AI systems are robust, transparent, and aligned with human values. I am especially interested in the theoretical and practical aspects of AI control and alignment.
Secondary Research Areas
- Explainability Methods: Techniques for making AI models more interpretable and their decisions more transparent.
- AI Control: Strategies for ensuring that AI systems behave as intended, even in novel or high-stakes situations.
- Mechanistic Interpretability: Understanding the internal mechanisms of neural networks to gain insight into how and why they make decisions.
- Chain of Thought: Investigating reasoning processes in AI models, including how step-by-step reasoning can improve reliability and safety.
Research Projects
Current Projects
-
Finding Misinformation Inside Language Models: Comparative XAI Methods on LIAR (May 2025-Present)
Independent
The goal is to evaluate Llama model’s ability to distinguish true from false claims in the LIAR dataset and explain its reasoning in ways humans can understand. By comparing the consistency, depth, and reliability of these explanations, I hope to uncover how such models handle misinformation at inference time — and whether their responses reflect a deeper internal structure or just surface-level fluency. -
AI Safety Toolkit (July 2025-Present)
Independent
Small-scale toolkit that demonstrate real-world use of AI governance checks on AI use cases: prompt injection, fairness, goal hijacking, sensitive data leakage etc.
Research Tools & Methods
- Mechanistic interpretability techniques
- Explainability frameworks (e.g., SHAP, LIME)
- Chain-of-thought prompting and analysis, Minicheck technique
- AI control and alignment strategies
- Python, PyTorch, TensorFlow
- Statistical and analytical approaches for model evaluation
Collaboration & Service
I am open to collaborations in AI safety, interpretability, and related areas. Please reach out to anadaria [dot] zahaleanu [at] gmail [dot] com if you are interested in working together or discussing research ideas.