AI Alignment

25 articles about AI Alignment

OpenAI introduces framework for evaluating chain-of-thought monitorability

OpenAI
Report
AI & Machine Learning

OpenAI presents a framework and 13 tests across 24 environments showing that monitoring a model’s internal reasoning works better than checking outputs alone for scalable AI control.

OpenAI updates structure to advance safe AI with nonprofit leadership and PBC equity

OpenAI
News
Tech News & Trends

OpenAI says it will keep nonprofit leadership while giving its PBC equity to unlock over $100B to build safe, beneficial AI for humanity.

Understanding and preventing misalignment generalization in language models

OpenAI
Insight
AI & Machine Learning

This work shows that training language models on wrong answers can spread misalignment, pinpoints an internal feature causing it, and demonstrates it can be reversed with minimal fine-tuning.

Zico Kolter joins OpenAI board of directors

OpenAI
News
AI & Machine Learning

OpenAI has appointed Zico Kolter to its board to strengthen governance with AI safety and alignment expertise.

Leveraging weak-to-strong generalization for controlling strong models with weak supervisors

OpenAI
Insight
AI & Machine Learning

This introduces research on using deep learning generalization to let weak supervisors reliably control much stronger AI models for superalignment.

Defining AI system behavior and the role of public input in decision-making

OpenAI
Insight
AI Tools & Prompts

It explains how ChatGPT’s behavior is set today and how OpenAI plans to improve it with more user customization and public input.

Training language models to better follow instructions and improve safety

OpenAI
Article
AI & Machine Learning

InstructGPT uses human-in-the-loop alignment training to better follow instructions than GPT-3 while being more truthful and less toxic, and it’s now the default model on the API.

How confessions help language models admit mistakes and improve honesty

OpenAI
Article
AI & Machine Learning

OpenAI is testing a “confessions” training method to get language models to admit mistakes or bad behavior, improving honesty and trust.

Public input on AI behavior informs OpenAI's model specifications

OpenAI
Report
AI & Machine Learning

OpenAI surveyed 1,000+ people worldwide to compare their views on how AI should behave with its Model Spec, using the results to shape AI defaults around diverse human values.

Reviewing findings and future changes related to sycophancy

OpenAI
Insight
Motivation & Inspiration

A deeper look at what we missed about sycophancy, what went wrong, and what we’ll change next.

Improving model safety behavior using rule-based rewards

OpenAI
Article
AI & Machine Learning

A new rule-based reward method helps align AI models to behave safely without needing lots of human-labeled data.

Improving mathematical reasoning by training models with process supervision

OpenAI
Article
AI & Machine Learning

A new math-solving model improves accuracy and alignment by rewarding each correct reasoning step instead of only the final answer.

Approach to improving AI alignment through human feedback

OpenAI
Insight
AI & Machine Learning

We’re developing AI that better learns from human feedback and helps humans evaluate AI, aiming to build an aligned system that can solve remaining alignment challenges.

Summarizing books using human feedback to improve AI evaluation

OpenAI
Article
AI & Machine Learning

It explains how to use human feedback to scale oversight and improve AI at hard-to-judge tasks like summarizing books.

Evaluating and mitigating scheming behavior in AI models

OpenAI
Report
AI & Machine Learning

Apollo Research and OpenAI tested frontier AI models for hidden misalignment (“scheming”), found signs of it in controlled experiments, and shared examples plus early stress tests for reducing it.

Designing ChatGPT for intellectual freedom and adaptability

OpenAI
Article
AI Tools & Prompts

ChatGPT is built to be useful, trustworthy, and flexible so you can customize it to your needs.

Deliberative alignment strategy improves safety in language models

OpenAI
Article
AI & Machine Learning

A new alignment method teaches o1 language models safety rules and how to reason about them to behave more safely.

$10 million grants launched to support research on superhuman AI alignment and safety

OpenAI
News
AI & Machine Learning

A $10M grant program funding technical research to align and make superhuman AI systems safe, including interpretability and scalable oversight.

Governance considerations for future superintelligent AI systems

OpenAI
Insight
Tech Policy & Startups Regulation

It urges starting now to plan how to govern superintelligent AI systems far more capable than AGI.

AI-written critiques improve human detection of summary flaws

OpenAI
Analysis
AI & Machine Learning

AI critique-writing models help people spot errors in summaries, and bigger models critique better than they summarize, aiding human oversight of AI.

Showing page 1 of 2