Language Model (LM)
13 articles about Language Model (LM)
Gemma Scope 2 expands interpretability tools for language models
Gemma Scope 2 provides open interpretability tools for the full Gemma 3 model family, helping the AI safety community better understand complex language model behavior.
SimpleQA: a benchmark for evaluating factual question answering
SimpleQA is a factuality benchmark that tests how well language models answer short, fact-seeking questions.
ChatGPT introduces initial support for plugins to enhance functionality
ChatGPT now supports plugins, designed with safety as a core principle, that let it fetch up-to-date information, run computations, and use third-party services.
Training language models for summarization using human feedback
Using human feedback and reinforcement learning, we trained language models to produce better summaries.
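The core ingredient of this approach is a reward model trained on human comparisons between summaries, which reinforcement learning then optimizes against. A minimal sketch of that pairwise preference loss; the RewardModel class, feature shapes, and sizes are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for a reward head on top of a language model.

    All names and shapes here are illustrative, not the paper's code.
    """
    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # scalar reward per summary

    def forward(self, summary_features: torch.Tensor) -> torch.Tensor:
        return self.score(summary_features).squeeze(-1)

def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry style) loss: push the reward of the
    # human-preferred summary above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()

# Usage, with random features standing in for pooled LM activations.
model = RewardModel()
good, bad = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(model(good), model(bad))
loss.backward()
```

The trained reward model then serves as the objective for an RL step that tunes the summarization policy.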
Unsupervised system learns sentiment representation from Amazon reviews
An unsupervised model learns strong sentiment understanding from Amazon reviews by training only to predict the next character in the text.
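The notable part is that sentiment emerges from nothing but next-character prediction. A minimal sketch of that training signal; the sizes and the plain LSTM are illustrative assumptions (the original work used a multiplicative LSTM over byte-level Amazon review text):

```python
import torch
import torch.nn as nn

# Toy character-level next-character predictor: the only training
# signal in this line of work. Everything here is illustrative.
vocab_size, hidden = 256, 64
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

text = torch.randint(0, vocab_size, (2, 32))  # stand-in byte sequences
x, y = text[:, :-1], text[:, 1:]              # target: the next character
out, _ = lstm(embed(x))
loss = nn.functional.cross_entropy(head(out).transpose(1, 2), y)
loss.backward()
```

After training at scale, individual hidden units can end up tracking high-level properties such as sentiment, which is what the article reports.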
Understanding and preventing misalignment generalization in language models
This work shows that fine-tuning language models on incorrect answers in a narrow domain can cause misalignment to generalize more broadly, pinpoints an internal feature driving the effect, and demonstrates it can be reversed with minimal additional fine-tuning.
Gemma Scope: an open suite of sparse autoencoders for language model interpretability
Gemma Scope is an open suite of sparse autoencoders that helps the safety community interpret how language models work.
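A sparse autoencoder decomposes a model's internal activations into a larger dictionary of sparsely active features. A minimal sketch of the idea; the dimensions, ReLU encoder, and L1 coefficient are illustrative assumptions, not Gemma Scope's actual architecture or scale:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over language model activations.

    d_model and d_features are illustrative; real suites use far
    larger dictionaries (d_features >> d_model).
    """
    def __init__(self, d_model: int = 64, d_features: int = 512):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each
    # activation to be explained by only a few features.
    return (recon - acts).pow(2).mean() + l1_coeff * features.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(8, 64)  # stand-in for captured model activations
recon, feats = sae(acts)
sae_loss(acts, recon, feats).backward()
```

The learned features, being sparse and overcomplete, are often more human-interpretable than raw activation dimensions, which is what makes the suite useful for interpretability work.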
Training language models to better follow instructions and improve safety
InstructGPT uses reinforcement learning from human feedback (RLHF) to follow instructions better than GPT-3 while being more truthful and less toxic, and it’s now the default model on the API.
Fine-tuning GPT-2 using human feedback for improved task performance
Researchers fine-tuned the 774M-parameter GPT-2 with human feedback across several tasks, finding it can learn to match labeler preferences (in summarization, sometimes simply by copying from the source), with summarization requiring about 60k human labels versus 5k for simpler stylistic-continuation tasks, a step toward safer human-facing AI.
Deliberative alignment strategy improves safety in language models
A new alignment method teaches o1-series language models safety specifications and how to reason explicitly about them, making their behavior safer.
Prover-verifier games enhance legibility of language model outputs
Prover-verifier games make language model outputs clearer and easier to check, improving trust for humans and machines.
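The intuition is that a strong prover is trained to produce solutions a much weaker verifier can still check, which pushes outputs toward small, locally checkable steps. A toy illustration of that checkability property; the arithmetic-chain format and verify helper are hypothetical, not the paper's setup:

```python
# Toy check on arithmetic chains: a "legible" solution is one whose
# steps a weak verifier can validate independently. Illustrative only.

def verify(steps: list[tuple[str, int]]) -> bool:
    """Weak verifier: re-evaluate each claimed step on its own."""
    return all(eval(expr) == claimed for expr, claimed in steps)

# A legible proof: small, locally checkable steps.
proof = [("2 + 3", 5), ("5 * 4", 20), ("20 - 7", 13)]
assert verify(proof)

# An incorrect chain fails the check at the bad step.
bad_proof = [("2 + 3", 5), ("5 * 4", 21)]
assert not verify(bad_proof)
```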
Improving language model behavior through fine-tuning on a curated dataset
Research shows language models can better follow specific behavioral values by fine-tuning on a small, carefully curated dataset.
Six-month update on the release and research of the 774M-parameter GPT-2 model
OpenAI is releasing the 774M-parameter GPT-2 following earlier staged releases, alongside a legal agreement for model sharing between organizations and a report on coordinating with the AI community on misuse and publication norms.