XLab AI Safety Fundamentals

Week 01

Philosophical and Political Foundations of AI Safety

Explore the implications of increasingly intelligent systems.

AI 2027
A narrative-form scenario describing the geopolitical dynamics and risks of the development of AGI.
Measuring AI Ability to Complete Long Tasks
A benchmark tracking how the length of tasks AI systems can complete has changed over time.
Existential Risk from Power-Seeking AI
Joe Carlsmith lays out the case for why advanced AIs might develop power-seeking tendencies and how this could lead to catastrophe.
Machines of Loving Grace (Optional)
Dario Amodei's essay on the benefits powerful AI systems could bring.
Trends in Artificial Intelligence | Epoch AI (Optional)
A measurement of the central trends driving continued AI progress.
The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents (Optional)
Nick Bostrom's paper on the potential goals advanced AI systems are likely or unlikely to develop by default.

Week 02

Examine the challenges in correctly specifying training goals for AI systems.

Specification Gaming: How AI Can Turn Your Wishes Against You
A fun video from 2023 that discusses the problem of specification gaming.
Specification gaming: the flip side of AI ingenuity
A comprehensive overview of outer alignment issues from DeepMind researchers.
Learning from human preferences
Explore how alignment researchers have attempted to address issues in goal specification using human preferences.

Week 03

Investigate the concept of mesa-optimizers and the potential for deceptive behavior in AI systems.

Alignment Faking
Anthropic's research on alignment faking, where LLMs strategically attempt to preserve their values during training.
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
An informal note on some intuitions related to Mechanistic Interpretability by Chris Olah.
Deceptive Alignment
An in-depth exploration of deceptive alignment and pseudo-alignment, providing insights into inner alignment issues.

Week 04

Explore various AI security issues including jailbreaks, adversarial examples, and potential vulnerabilities.

A Playbook for Securing AI Model Weights
A comprehensive playbook for protecting AI models from theft and misuse.
Four Fallacies of AI Cybersecurity
An argument that AI cybersecurity must learn from past security lessons, not reinvent them.
Stealing Part of a Production Language Model (only abstract)
How researchers extract embedding layers from language models through inexpensive API attacks.
Sleight of hand: How China weaponizes software vulnerabilities (Optional)
China's new regulations force companies to report software vulnerabilities to government agencies.
Ironing Out the Squiggles (Optional)
A paper review post about adversarial examples, their implications, and potential solutions.
SolidGoldMagikarp - tokens that jailbreak LLMs (Optional)
Explore a famous case of LLM jailbreaking and its implications for AI security.

Week 05

Examine the challenges and approaches to governing AI development and deployment.

Open Problems in Technical AI Governance
An overview of technical AI governance and its methods for evaluating and enforcing AI control mechanisms.
Certified Safe: A Schematic for Approval Regulation of Frontier AI
A proposal for FDA-style approval regulation for frontier AI systems.

Week 06

Examine critiques of AI safety concerns and alternative perspectives on AI development.

Will AI kill all of us? | Marc Andreessen and Lex Fridman (00:00 - 10:30)
Listen to 00:00 - 10:30 for a discussion of criticisms of AI safety concerns.
Terrorism, Tylenol, and dangerous information
A useful reading for understanding infohazards in AI development.
Against Almost Every Theory of Impact of Interpretability
A critical examination of interpretability approaches in AI alignment.

Week 07