This fellowship reading list for the UChicago AI Safety Club provides a structured exploration of key AI safety topics across seven weeks, ranging from scaling laws and instrumental convergence to AI governance and critical perspectives on AI safety.
Explore the implications of increasingly intelligent systems, focusing on scaling laws, superintelligence, and instrumental convergence.
Watch 0:00 - 11:30 for an accessible introduction to scaling laws in language models; a brief illustrative sketch of the power-law form these laws take follows this week's readings.
Nick Bostrom's influential work on power-seeking and instrumental convergence in AI systems.
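For concreteness, here is a minimal sketch of the power-law relationship that scaling-law analyses typically fit, in which predicted loss falls smoothly as parameter count grows. The constant and exponent below are illustrative placeholders rather than values taken from the video.

```python
# Illustrative power-law scaling curve: L(N) = (N_c / N) ** alpha.
# N_c and alpha are placeholder constants, not fits from the video.

def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted loss as a power law in parameter count."""
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e} parameters -> predicted loss ~ {loss_from_params(n):.3f}")
```

Under a curve like this, each order-of-magnitude increase in parameters buys a roughly constant drop in loss, which is the intuition behind scaling-based forecasts of model capability.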
Examine the challenges in correctly specifying training goals for AI systems.
A comprehensive overview of outer alignment issues from DeepMind researchers.
Explore how alignment researchers have attempted to address goal-specification issues by learning from human preferences.
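As a concrete illustration of learning from human preferences (a sketch added here, not drawn from the readings), a reward model can be trained on pairwise comparisons between responses using a Bradley-Terry style loss; this is the standard reward-modeling step behind approaches such as RLHF.

```python
# Minimal sketch of pairwise preference learning for a reward model.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards a reward model assigned to preferred vs. rejected responses.
chosen = torch.tensor([1.2, 0.3, 0.9])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(preference_loss(chosen, rejected))  # smaller when preferred responses score higher
```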
Investigate the concept of mesa-optimizers and the potential for deceptive behavior in AI systems.
An in-depth exploration of deceptive alignment and pseudo-alignment, providing insights into inner alignment issues.
Explore various AI security issues including jailbreaks, adversarial examples, and potential vulnerabilities.
Watch 42:15 - 59:23 for an overview of LLM security concerns, framed by analogy to operating-system security.
A paper review post about adversarial examples, their implications, and potential solutions; a short illustrative sketch of one common attack follows this week's readings.
Read the abstract and page 6 for an introduction to the concept of AI sleeper agents.
Explore a famous case of LLM jailbreaking and its implications for AI security.
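To make the adversarial-examples reading concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one common way such inputs are crafted; the reviewed paper may focus on different attacks or defenses, and the model and epsilon below are illustrative.

```python
# Minimal FGSM sketch: perturb an input within an L-infinity ball to increase the loss.
import torch
import torch.nn.functional as F

def fgsm_perturb(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                 epsilon: float = 0.03) -> torch.Tensor:
    """Return a perturbed copy of x; epsilon is an illustrative budget, not from the reading."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the sign of the input gradient (the direction that increases the loss),
    # then clamp back to a valid pixel range.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```

The perturbation is small in L-infinity norm yet can flip a classifier's prediction, which is what makes these attacks security-relevant.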
Examine the challenges and approaches to governing AI development and deployment.
An overview of technical AI governance and its methods for evaluating and enforcing AI control mechanisms.
A proposal for FDA-style approval regulation for frontier AI systems.
Examine critiques of AI safety concerns and alternative perspectives on AI development.
Listen to 00:00 - 10:30 for a discussion on criticisms of AI safety concerns.
A useful reading for understanding infohazards in AI development.
A critical examination of interpretability approaches in AI alignment.
Explore various AI alignment approaches and dive deeper into specific areas of interest. Fellows will choose one of the optional readings to focus on for the week.
An overview of various AI alignment approaches, providing a foundation for further exploration.
A deeper dive into the concept of agent foundations in AI alignment.
An in-depth exploration of inner alignment issues and goal misgeneralization.
A technical exploration of interpretability in neural networks; a minimal activation-capture sketch follows this week's readings.
An examination of techniques for controlling large language models.
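As a small companion to the interpretability reading (a sketch added here, not taken from it), one basic primitive is capturing a layer's activations with a forward hook so they can be inspected or probed; the toy model and layer choice are illustrative.

```python
# Capture intermediate activations from a toy network using a forward hook.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # placeholder model
captured = {}

def save_activation(module, inputs, output):
    captured["hidden"] = output.detach()

hook = model[1].register_forward_hook(save_activation)  # hook the hidden ReLU layer
model(torch.randn(2, 16))
hook.remove()
print(captured["hidden"].shape)  # torch.Size([2, 32]): activations ready to inspect or probe
```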