AI Alignment
25 articles about AI Alignment
OpenAI introduces framework for evaluating chain-of-thought monitorability
OpenAI presents a framework and 13 tests across 24 environments showing that, for scalable AI control, monitoring a model’s internal reasoning works better than checking its outputs alone.
OpenAI updates structure to advance safe AI with nonprofit leadership and PBC equity
OpenAI says it will keep nonprofit leadership while the nonprofit receives equity in its public benefit corporation (PBC) worth over $100B, resourcing its mission to build safe, beneficial AI for humanity.
Understanding and preventing misalignment generalization in language models
This work shows that training language models on incorrect answers in one domain can cause misalignment to generalize to others, pinpoints an internal feature driving it, and demonstrates that the effect can be reversed with minimal fine-tuning.
Zico Kolter joins OpenAI board of directors
OpenAI has appointed Zico Kolter to its board to strengthen governance with AI safety and alignment expertise.
Leveraging weak-to-strong generalization for controlling strong models with weak supervisors
Introduces a superalignment research direction: leveraging the generalization properties of deep learning so that weak supervisors can reliably control much stronger AI models.
Defining AI system behavior and the role of public input in decision-making
It explains how ChatGPT’s behavior is set today and how OpenAI plans to improve it with more user customization and public input.
Training language models to better follow instructions and improve safety
InstructGPT is trained with reinforcement learning from human feedback (RLHF) to follow instructions better than GPT-3 while being more truthful and less toxic, and it is now the default model on the API.
How confessions help language models admit mistakes and improve honesty
OpenAI is testing a “confessions” training method to get language models to admit mistakes or bad behavior, improving honesty and trust.
Public input on AI behavior informs OpenAI's model specifications
OpenAI surveyed 1,000+ people worldwide to compare their views on how AI should behave with its Model Spec, using the results to shape AI defaults around diverse human values.
Reviewing findings and future changes related to sycophancy
A deeper look at what we missed about sycophancy, what went wrong, and what we’ll change next.
Improving model safety behavior using rule-based rewards
A new rule-based reward method helps align AI models to behave safely without needing lots of human-labeled data.
Improving mathematical reasoning by training models with process supervision
A new model trained with process supervision improves mathematical accuracy and alignment by rewarding each correct reasoning step rather than only the final answer.
Approach to improving AI alignment through human feedback
We’re developing AI that learns better from human feedback and helps humans evaluate AI, aiming to build an aligned AI system that can help solve the remaining alignment challenges.
Summarizing books using human feedback to improve AI evaluation
It shows how human feedback can scale oversight and improve AI at hard-to-evaluate tasks like summarizing entire books.
Evaluating and mitigating scheming behavior in AI models
Apollo Research and OpenAI tested frontier AI models for hidden misalignment (“scheming”), found signs of it in controlled experiments, and shared examples along with early stress tests of methods for reducing it.
Designing ChatGPT for intellectual freedom and adaptability
ChatGPT is built to be useful, trustworthy, and flexible so you can customize it to your needs.
Deliberative alignment strategy improves safety in language models
A new alignment method teaches o1 language models safety rules and how to reason about them to behave more safely.
$10 million grants launched to support research on superhuman AI alignment and safety
A $10M grant program funding technical research on aligning superhuman AI systems and making them safe, including interpretability and scalable oversight.
Governance considerations for future superintelligent AI systems
It urges planning now for how to govern superintelligent AI systems far more capable than AGI.
AI-written critiques improve human detection of summary flaws
AI critique-writing models help people spot errors in summaries, and larger models are better at critiquing than at summarizing, aiding human oversight of AI.
Showing page 1 of 2