This fellowship reading list for the UChicago AI Safety Club provides a structured exploration of key AI safety topics across seven weeks. It covers a range of crucial subjects from scaling laws and instrumental convergence to AI governance and critical perspectives on AI safety.
Explore the implications of increasingly intelligent systems, focusing on scaling laws, superintelligence, and instrumental convergence.
Watch 0:00 - 11:30 for an accessible introduction to scaling laws in language models.
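For intuition, the sketch below evaluates a power-law loss curve of the kind discussed in scaling-law work; the constants are illustrative placeholders rather than values taken from the video or any particular paper.

```python
# Minimal sketch of a power-law scaling curve for language-model loss.
# The functional form L(N) = E + (N_c / N)**alpha follows common scaling-law
# write-ups; the constants below are illustrative, not fitted values.

def loss_from_params(n_params: float,
                     irreducible: float = 1.7,
                     n_c: float = 8.8e13,
                     alpha: float = 0.076) -> float:
    """Predicted loss for a model with n_params parameters."""
    return irreducible + (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

The qualitative takeaway is the same one the video emphasizes: loss falls smoothly and predictably as parameter count grows, with diminishing returns governed by the exponent.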
Nick Bostrom's influential work on power-seeking and instrumental convergence in AI systems.
Examine the challenges in correctly specifying training goals for AI systems.
A comprehensive overview of outer alignment issues from DeepMind researchers.
Explore how alignment researchers have attempted to address issues in goal specification using human preferences.
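For a concrete sense of how human preferences enter training, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to fit reward models from comparison data; the reward scores are made up, and this illustrates the general technique rather than the specific method in the reading.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the pairwise (Bradley-Terry) loss used to train reward
# models from human preference comparisons. In practice the scores come from
# a learned reward model over (prompt, response) pairs; here they are dummies.

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over a batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward scores for three comparison pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(r_chosen, r_rejected))  # lower when chosen > rejected
```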
Investigate the concept of mesa-optimizers and the potential for deceptive behavior in AI systems.
An in-depth exploration of deceptive alignment and pseudo-alignment, providing insights into inner alignment issues.
Explore various AI security issues including jailbreaks, adversarial examples, and potential vulnerabilities.
A comprehensive playbook for protecting AI models from theft and misuse.
An argument that AI cybersecurity must learn from past security lessons rather than reinvent them.
How researchers extract embedding layers from language models through inexpensive API attacks.
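To make the underlying idea concrete, the toy simulation below shows why full logit outputs can leak a model's hidden dimension (logit vectors lie in a subspace whose rank equals the hidden size); it is a simplified illustration, not the actual attack described in the article.

```python
import numpy as np

# Simplified simulation of the core idea behind embedding-layer extraction:
# if an API returns full logit vectors, each one lies in a subspace of
# dimension equal to the model's hidden size (logits = W_out @ h), so the
# rank of a stack of logit vectors leaks the hidden dimension.

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 1000, 64, 256

W_out = rng.normal(size=(vocab_size, hidden_dim))         # final projection layer
hidden_states = rng.normal(size=(hidden_dim, n_queries))  # one per API query
logits = W_out @ hidden_states                            # what the API returns

singular_values = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int(np.sum(singular_values > 1e-6 * singular_values[0]))
print(f"recovered hidden dimension: {recovered_dim}")     # -> 64
```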
China's new regulations force companies to report software vulnerabilities to government agencies.
A paper review post about adversarial examples, their implications, and potential solutions.
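As a concrete illustration of how adversarial examples can be constructed, here is a minimal sketch of the fast gradient sign method (FGSM) with a toy model and input; it demonstrates the general technique rather than anything specific to the post.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fast gradient sign method (FGSM): perturb the input
# in the direction that increases the loss, with step size epsilon.
# The model and data here are toy stand-ins.

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10, requires_grad=True)  # clean input
y = torch.tensor([1])                       # true label
epsilon = 0.1                               # perturbation budget

loss = loss_fn(model(x), y)
loss.backward()
x_adv = (x + epsilon * x.grad.sign()).detach()  # adversarial input

print("clean prediction:", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```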
Explore a famous case of LLM jailbreaking and its implications for AI security.
Examine the challenges and approaches to governing AI development and deployment.
An overview of technical AI governance and its methods for evaluating and enforcing AI control mechanisms.
A proposal for FDA-style approval regulation for frontier AI systems.
Examine critiques of AI safety concerns and alternative perspectives on AI development.
Listen to 00:00 - 10:30 for a discussion on criticisms of AI safety concerns.
A useful reading for understanding infohazards in AI development.
A critical examination of interpretability approaches in AI alignment.
Explore various AI alignment approaches and dive deeper into specific areas of interest. Fellows will choose one of the optional readings to focus on for the week.
An overview of various AI alignment approaches, providing a foundation for further exploration.
A deeper dive into the concept of agent foundations in AI alignment.
An in-depth exploration of inner alignment issues and goal misgeneralization.
A technical exploration of interpretability in neural networks.
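As one example of the kind of technique this area studies, the sketch below trains a linear probe on synthetic activations to test whether a property is linearly decodable; it illustrates probing in general, not the specific methods covered in the reading.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal sketch of a linear probe, a common interpretability technique:
# train a simple classifier on a model's hidden activations to test whether
# some property is linearly decodable from them. The activations here are
# synthetic stand-ins for real hidden states.

rng = np.random.default_rng(0)
n_examples, hidden_dim = 500, 64

labels = rng.integers(0, 2, size=n_examples)             # property of interest
direction = rng.normal(size=hidden_dim)                  # "feature direction"
activations = rng.normal(size=(n_examples, hidden_dim))  # base activations
activations += np.outer(labels, direction)               # encode the property

probe = LogisticRegression(max_iter=1000).fit(activations[:400], labels[:400])
print("probe accuracy:", probe.score(activations[400:], labels[400:]))
```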
An examination of techniques for controlling large language models.