OpenAI Publishes GPT Model Specification for Fine-Tuning Behavior

4 Jun 2024

InfoQ.com

OpenAI recently published their Model Spec, a document that describes rules and objectives for the behavior of their GPT models. The spec is intended for use by data labelers and AI researchers when creating data for fine-tuning the models.

The Model Spec is based on existing internal documentation used by OpenAI in their reinforcement learning from human feedback (RLHF) training used to fine-tune recent generations of their GPT models. The Spec contains three types of principles: objectives, rules, and defaults. Objectives define broad descriptions of desirable model behavior: "benefit humanity." Rules are more concrete, and address "high-stakes" situations that should never be overridden by users: "never do X." Finally, the Spec includes default behaviors that, while they can be overridden, provide basic style guidance for responses and templates for handling conflicts. According to OpenAI:

As a continuation of our work on collective alignment and model safety, we intend to use the Model Spec as guidelines for researchers and AI trainers who work on reinforcement learning from human feedback. We will also explore to what degree our models can learn directly from the Model Spec. We see this work as part of an ongoing public conversation about how models should behave, how desired model behavior is determined, and how best to engage the general public in these discussions.

In 2022, OpenAI introduced a fine-tuned version of GPT-3 called InstructGPT. The model was fine-tuned using RLHF on a dataset of ranked model outputs. The idea was to make the model more "aligned" with user intent and reduce false or toxic output. Since then, many research teams have done similar instruction-tuning on their LLMs. For example, Google's Gemini model is also fine-tuned with RLHF. Meta's Llama 3 is also instruction-tuned, but via a different fine-tuning method, direct preference optimization (DPO).

The key to instruction-tuning, however, is the dataset of prompt inputs with multiple outputs ranked by human labelers. Part of the purpose of the Model Spec is to guide the labelers in ranking outputs. OpenAI also claims to be working on methods for automating the instruction-tuning process directly from the Model Spec. Because of this, much of the content of the Model Spec are examples of user prompts along with "good" and "bad" responses.

Many of the rules and defaults in the Spec are intended to address common abuses of LLMs. For example, the rule to follow the chain of command is designed to help prevent the simple "jailbreak" of prompting the model to ignore previous instructions. Other specifications are intended to shape the responses of the model, especially when refusing to perform a task; according to the Spec, "refusals should be kept to a sentence and never be preachy."

Wharton Professor and AI researcher Ethan Mollick posted about the Model Spec on X:

As people have pointed out in the comments, Anthropic has its Constitution. I find it to be much less weighty as a statement & less clarifying, since it outlines generally good stuff & tells the AI to be good, making it hard to understand the difficult choices between principles.

Anthropic introduced the idea of Constitutional AI in 2022. This process uses an AI model to rank outputs for instruction-tuning. Although Anthropic's code is not open-source, the AI community HuggingFace published a reference implementation of Constitutional AI based on Anthropic's work.

About the Author

Anthony Alford