This fellowship reading list for the UChicago AI Safety Club provides a structured exploration of key AI safety topics across seven weeks. It covers a range of crucial subjects from scaling laws and instrumental convergence to AI governance and critical perspectives on AI safety.
Explore the implications of increasingly intelligent systems, focusing on scaling laws, superintelligence, and instrumental convergence.
Watch 0:00 - 11:30 for an accessible introduction to scaling laws in language models.
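For intuition, the sketch below evaluates a power-law loss curve of the kind discussed in scaling-law work; the constants are illustrative placeholders rather than values taken from the video or any particular paper.

```python
# Minimal sketch of a power-law scaling curve for language-model loss.
# The functional form L(N) = E + (N_c / N)**alpha follows common scaling-law
# write-ups; the constants below are illustrative, not fitted values.

def loss_from_params(n_params: float,
                     irreducible: float = 1.7,
                     n_c: float = 8.8e13,
                     alpha: float = 0.076) -> float:
    """Predicted loss for a model with n_params parameters."""
    return irreducible + (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

The qualitative takeaway is the same one the video emphasizes: loss falls smoothly and predictably as parameter count grows, with diminishing returns governed by the exponent.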
Nick Bostrom's influential work on power-seeking and instrumental convergence in AI systems.
Examine the challenges in correctly specifying training goals for AI systems.
A comprehensive overview of outer alignment issues from DeepMind researchers.
Explore how alignment researchers have attempted to address issues in goal specification using human preferences.
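For a concrete sense of how human preferences enter training, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to fit reward models from comparison data; the reward scores are made up, and this illustrates the general technique rather than the specific method in the reading.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the pairwise (Bradley-Terry) loss used to train reward
# models from human preference comparisons. In practice the scores come from
# a learned reward model over (prompt, response) pairs; here they are dummies.

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over a batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward scores for three comparison pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(r_chosen, r_rejected))  # lower when chosen > rejected
```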
Investigate the concept of mesa-optimizers and the potential for deceptive behavior in AI systems.
An in-depth exploration of deceptive alignment and pseudo-alignment, providing insights into inner alignment issues.
Explore various AI security issues including jailbreaks, adversarial examples, and potential vulnerabilities.
A comprehensive playbook for protecting AI models from theft and misuse.
An argument that AI cybersecurity must learn from past security lessons rather than reinvent them.
How researchers extract embedding layers from language models through inexpensive API attacks.
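To make the underlying idea concrete, the toy simulation below shows why full logit outputs can leak a model's hidden dimension (logit vectors lie in a subspace whose rank equals the hidden size); it is a simplified illustration, not the actual attack described in the article.

```python
import numpy as np

# Simplified simulation of the core idea behind embedding-layer extraction:
# if an API returns full logit vectors, each one lies in a subspace of
# dimension equal to the model's hidden size (logits = W_out @ h), so the
# rank of a stack of logit vectors leaks the hidden dimension.

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 1000, 64, 256

W_out = rng.normal(size=(vocab_size, hidden_dim))         # final projection layer
hidden_states = rng.normal(size=(hidden_dim, n_queries))  # one per API query
logits = W_out @ hidden_states                            # what the API returns

singular_values = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int(np.sum(singular_values > 1e-6 * singular_values[0]))
print(f"recovered hidden dimension: {recovered_dim}")     # -> 64
```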
China's new regulations force companies to report software vulnerabilities to government agencies.
A paper review post about adversarial examples, their implications, and potential solutions.
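As a concrete illustration of how adversarial examples can be constructed, here is a minimal sketch of the fast gradient sign method (FGSM) with a toy model and input; it demonstrates the general technique rather than anything specific to the post.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fast gradient sign method (FGSM): perturb the input
# in the direction that increases the loss, with step size epsilon.
# The model and data here are toy stand-ins.

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10, requires_grad=True)  # clean input
y = torch.tensor([1])                       # true label
epsilon = 0.1                               # perturbation budget

loss = loss_fn(model(x), y)
loss.backward()
x_adv = (x + epsilon * x.grad.sign()).detach()  # adversarial input

print("clean prediction:", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```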
Explore a famous case of LLM jailbreaking and its implications for AI security.
Examine the challenges and approaches to governing AI development and deployment.
An overview of technical AI governance and its methods for evaluating and enforcing AI control mechanisms.
A proposal for FDA-style approval regulation for frontier AI systems.
Examine critiques of AI safety concerns and alternative perspectives on AI development.
Listen to 00:00 - 10:30 for a discussion on criticisms of AI safety concerns.
A useful reading for understanding infohazards in AI development.
A critical examination of interpretability approaches in AI alignment.
Explore various AI alignment approaches and dive deeper into specific areas of interest. Fellows will choose one of the optional readings to focus on for the week.
An overview of various AI alignment approaches, providing a foundation for further exploration.
A deeper dive into the concept of agent foundations in AI alignment.
An in-depth exploration of inner alignment issues and goal misgeneralization.
A technical exploration of interpretability in neural networks.
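As one example of the kind of technique this area studies, the sketch below trains a linear probe on synthetic activations to test whether a property is linearly decodable; it illustrates probing in general, not the specific methods covered in the reading.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal sketch of a linear probe, a common interpretability technique:
# train a simple classifier on a model's hidden activations to test whether
# some property is linearly decodable from them. The activations here are
# synthetic stand-ins for real hidden states.

rng = np.random.default_rng(0)
n_examples, hidden_dim = 500, 64

labels = rng.integers(0, 2, size=n_examples)             # property of interest
direction = rng.normal(size=hidden_dim)                  # "feature direction"
activations = rng.normal(size=(n_examples, hidden_dim))  # base activations
activations += np.outer(labels, direction)               # encode the property

probe = LogisticRegression(max_iter=1000).fit(activations[:400], labels[:400])
print("probe accuracy:", probe.score(activations[400:], labels[400:]))
```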
An examination of techniques for controlling large language models.