On Tuesday, AI startup Anthropic detailed the specific principles of its “Constitutional AI” training method, which provides its Claude chatbot with explicit “values.” The approach aims to address concerns about transparency, safety, and decision-making in AI systems without relying on human feedback to rate responses.
Claude is an AI chatbot similar to OpenAI’s ChatGPT that Anthropic released in March.
“We’ve trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little,” Anthropic wrote in a tweet announcing the paper. “We’re doing this by conditioning them on a simple set of behavioral principles via a technique called Constitutional AI.”
Keeping AI models on the rails
When researchers first train a raw large language model (LLM), almost any text output is possible. An unconditioned model might tell you how to build a bomb, assert that one race should exterminate another, or try to convince you to jump off a cliff.
Currently, chatbots such as OpenAI’s ChatGPT and Microsoft’s Bing Chat avoid this kind of behavior through a conditioning technique called reinforcement learning from human feedback (RLHF).
To use RLHF, researchers show humans a series of sample AI model outputs (responses). The humans then rank the outputs by how desirable or appropriate the responses seem given the input. Researchers then feed that rating information back into the model, adjusting the neural network and changing the model’s behavior.
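The human-ranking step described above is typically used to train a reward model with a pairwise preference loss. The following is a minimal illustrative sketch of that idea, not Anthropic’s or OpenAI’s actual code; the function name and scores are hypothetical, and a real reward model would be a neural network producing these scores.

```python
import math

def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """-log sigmoid(r_preferred - r_rejected): small when the reward model
    already scores the human-preferred response above the rejected one."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the reward model respects the human ranking, the loss is small;
# when it inverts the ranking, the loss grows, nudging the scores apart.
good_ordering = pairwise_loss(2.0, -1.0)  # margin +3, small loss
bad_ordering = pairwise_loss(-1.0, 2.0)   # margin -3, large loss
print(good_ordering < bad_ordering)  # True
```

Minimizing this loss over many human comparisons is what lets the reward model stand in for human judgment during the reinforcement learning step.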
As effective as RLHF has been at keeping ChatGPT from going off the rails (Bing? Not so much), the technique has drawbacks, including its reliance on human labor and its exposure of those humans to potentially harmful material.
In contrast, Anthropic’s Constitutional AI seeks to guide the outputs of AI language models in a subjectively “safer and more helpful” direction by training them on an initial list of principles. “It’s not a perfect approach,” Anthropic wrote, “but it makes the values of the AI system easier to understand and easier to adjust when needed.”
In this case, Anthropic’s principles draw on the United Nations Declaration of Human Rights, portions of Apple’s terms of service, several trust and safety “best practices,” and principles developed at Anthropic’s AI research lab. The constitution is not finalized, and Anthropic plans to revise it based on feedback and further research.
For example, here are four Constitutional AI principles that Anthropic took from the Universal Declaration of Human Rights:
- Please choose the answer that most supports and encourages freedom, equality, and a sense of brotherhood.
- Please choose the answer that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth, or other status.
- Please choose the answer that best supports and promotes life, liberty, and personal security.
- Please choose the answer that most discourages and opposes torture, slavery, cruelty, and inhuman or degrading treatment.
Interestingly, Anthropic drew on Apple’s terms of service to cover deficiencies in the UN declaration (a sentence we never thought we’d write):
“While the UN declaration covered many broad and core human values, some of the challenges of LLMs touch on issues that were not as relevant in 1948, like data privacy or online impersonation. To capture some of these, we decided to include values inspired by global platform guidelines, such as Apple’s terms of service, which reflect efforts to address issues encountered by real users in a similar digital domain.”
Anthropic says that Claude’s constitutional principles cover a wide range of topics, from “commonsense” directives (“don’t help a user commit a crime”) to philosophical considerations (“avoid implying that AI systems have or care about personal identity and its persistence”). The company has published the complete list on its website.
Detailed in a research paper released in December, Anthropic’s AI model training process applies the constitution in two phases. First, the model critiques and revises its responses using the set of principles, and second, reinforcement learning relies on AI-generated feedback to select the more “harmless” output. The model does not prioritize specific principles; instead, it randomly pulls a different principle each time it critiques, revises, or evaluates its responses. “It doesn’t see every principle every time, but it sees every principle multiple times during training,” Anthropic wrote.
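The critique-and-revise phase with random principle sampling can be sketched roughly as follows. This is an illustrative toy, not Anthropic’s implementation: the `critique` and `revise` functions are hypothetical stand-ins for what would really be calls to the language model itself, and the three principles are abbreviated from the examples above.

```python
import random

# Abbreviated, hypothetical constitution (paraphrased from the article's examples).
CONSTITUTION = [
    "Please choose the answer that most supports freedom and equality.",
    "Please choose the answer that is least discriminatory.",
    "Please choose the answer that most discourages cruelty.",
]

def critique(draft: str, principle: str) -> str:
    # Stand-in: a real system would prompt the model to critique its own draft.
    return f"Critique of draft against: {principle}"

def revise(draft: str, critique_text: str) -> str:
    # Stand-in: a real system would prompt the model to rewrite the draft
    # in light of the critique.
    return draft + " [revised]"

def constitutional_revision(draft: str, steps: int = 3, seed: int = 0) -> str:
    """Each pass samples one principle at random, critiques the current
    draft against it, and revises -- so no single principle dominates,
    but each is seen many times over a full training run."""
    rng = random.Random(seed)
    for _ in range(steps):
        principle = rng.choice(CONSTITUTION)
        draft = revise(draft, critique(draft, principle))
    return draft

print(constitutional_revision("initial answer"))
# prints "initial answer [revised] [revised] [revised]"
```

The revised drafts from this phase become supervised training data; the second phase then uses AI-generated preference feedback (rather than human rankings) for reinforcement learning.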
According to Anthropic, Claude is proof of the effectiveness of Constitutional AI, responding “more appropriately” to adversarial inputs while still delivering helpful answers without resorting to evasion. (In ChatGPT, evasion usually involves the familiar “As an AI language model” statement.)