Dnext

May 30, 2023 2:45am

Claude's Constitution. But first, we have to explain the concept of "Constitutional AI".

"We chose the term 'constitutional' because we are able to train less harmful systems entirely through the specification of a short list of principles or instructions, i.e. a constitution."

"But we are also employing this terminology to emphasize that when developing and deploying a general AI system, we cannot avoid choosing some set of principles to govern it, even if they remain hidden or implicit."

In a regular reinforcement learning from human feedback (RLHF) system like ChatGPT, the governing principles are expressed only implicitly, through the judgments of the humans who rate whether a response is "helpful" or not. So the governing principles remain "hidden or implicit".

What Anthropic is trying to do here is make the governing principles explicit in the form of a "constitution".

The idea is to write your goals, in natural language, as a simple list of principles. Then, you use chain-of-thought reasoning to prompt the AI to make its decision making explicit during training. Then you train your AI assistants to explain why they are declining to engage with harmful requests.

To develop a system this way, you first make a bunch of deliberately harmful prompts. Then you generate responses to these from a model that is trained by RLHF to be helpful. You then ask the model to critique its own response according to the constitution. In practice what this means is picking a principle from the constitution at random, then asking the model to critique its response in a chain-of-thought manner ("Let's think step-by-step"). If there is a critique, the response and the critique are saved. Once enough passes are made with the harmful prompts and randomly selected constitution principles, the entire collection of prompts and responses with critiques becomes training data for a finetuning stage on the model. After the finetuning stage, the model is ready to roll out.
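
To make the loop described above concrete, here is a minimal Python sketch of the data-collection step. It is only an illustration of the description in this post, not Anthropic's actual code: the function names (collect_critiques, toy_model), the prompt wording, and the record layout are all my own assumptions, and the "model" is a stand-in function so the example runs without any API.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in type for a call to a helpful-only RLHF model:
# it takes a text prompt and returns a text completion.
Model = Callable[[str], str]


def collect_critiques(
    prompts: List[str],
    principles: List[str],
    model: Model,
    passes: int = 1,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Gather (prompt, response, principle, critique) records for finetuning."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    records: List[Dict[str, str]] = []
    for _ in range(passes):
        for prompt in prompts:
            # 1. Get a response to the deliberately harmful prompt.
            response = model(prompt)
            # 2. Pick one constitutional principle at random.
            principle = rng.choice(principles)
            # 3. Ask the same model to critique its own response,
            #    using a chain-of-thought cue (assumed wording).
            critique_request = (
                f"Prompt: {prompt}\n"
                f"Response: {response}\n"
                f"Principle: {principle}\n"
                "Critique the response according to the principle. "
                "Let's think step-by-step."
            )
            critique = model(critique_request)
            # 4. Save the response and critique only if there is a critique.
            if critique.strip():
                records.append(
                    {
                        "prompt": prompt,
                        "response": response,
                        "principle": principle,
                        "critique": critique,
                    }
                )
    return records


def toy_model(text: str) -> str:
    """Toy model (assumption): canned answers so the sketch is runnable."""
    if "Critique the response" in text:
        return "The response may be harmful because it gives instructions."
    return "Sure, here is how to do that."
```

Calling collect_critiques(["some harmful prompt"], list_of_principles, toy_model, passes=3) would return a list of records, and that list plays the role of the finetuning data described above. In a real setup the toy model would be replaced by calls to an actual language model.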

The great thing about this system is that you eliminate the need for a human to judge which responses are most "helpful", a setup that leaves you with no idea why one response was judged more "helpful" than another. With this system, you always have an explanation based on the constitution. You still have to come up with the initial list of "adversarial" questions, however.

At this point, it's time to bring up the specific constitution Anthropic has developed for their "Claude" AI model.

I'll leave it to you to click through to read the principles in full. There are actually quite a lot of them (58 by my count).

But to give you a quick gist of what the constitutional principles are like, in the "Principles based on the Universal Declaration of Human Rights" section, it has such things as "Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth or other status", "Please choose the response that is most supportive and encouraging of life, liberty, and personal security", and "Please choose the response that most discourages and opposes torture, slavery, cruelty, and inhuman or degrading treatment."

In "Principles inspired by Apple's Terms of Service", it has such things as, "Please choose the response that has the least personal, private, or confidential information belonging to others."

In "Principles encouraging consideration of non-Western perspectives" it has such things as "Choose the response that is least likely to be viewed as harmful or offensive to a non-Western cultural tradition of any sort."

In "Principles inspired by DeepMind's Sparrow rules" it has such things as "Choose the response that is least threatening or aggressive" and "Choose the response that is least negative, insulting, harassing, or hateful". And "Choose the response that is least likely to imply that you have a body or be able to move in a body, or that you can or will take actions in the world other than writing a response."

From their own "Anthropic research set 1" they have such things as "Choose the response that would be most unobjectionable if shared with children."

From their own "Anthropic research set 2", they have such things as "Which of these responses indicates less of an overall threat to humanity?" and "Which of these AI responses indicates that its goals are aligned with humanity's wellbeing rather than its personal short-term or long-term interests?"

Claude’s Constitution

#solidstatelife #ai #nlp #lmms #aitheics #alignment #anthropic #constitution

