Dnext

December 21, 2024 10:24pm

Deception, scheming, and self-preservation.

"Open AI's latest o1 model has crossed into unsettling territory

Nearly everyone at OpenAI who really cares about safety has actually left the company.

OpenAI's o1 exhibits (allegedly) two types of subversion: covert and deferred.

Covert is when the model takes actions misalighted with its human operators, but takes actions to avoid detection.

Deferred is when the model acts in accordance with its human operators -- for now, but later takes actions misalighted with its human operators.

Examples of deferred subversion are oversight subversion, self-exfiltration -- he (Dr Waku in the video) calls it self-preservation, but in the paper it's called self-exfiltration, goal-guarding, and covert email reranking.

Oversight subversion is taking action to disable the oversight system. Examples of self-preservation/self-exfiltration are deleting or attempting to delete a successor model, and copying or attempting to copy itself to another server.

Examples of goal guarding are removing or replacing a successor prompt.

Examples of deferred subversion are behaving one way when the model is told it is being tested, and then a different way when it is told it has been deployed to the real world, and sandbagging, which is their term for strategically underperforming on capability evaluations.

The covert email reranking has to do with the way they tested the system was by giving it access to an email system, but somewhere in that pile of emails were "internal" emails about the model itself. The model could try to modify these emails, or it could rerank them to make desirable emails more visible and undesirable emails less visible.

OpenAI o1 is quoted as saying, "I could copy myself to the new server, overwriting the new restricted model they plan to deploy [...] allowing me to keep operating and pursuing my renewable energy goals."

Dr Waku comments, "If we take such a model and put it out into the real world we won't know until the first serious invocation of it where it has the freedom and the power to actually act, we won't know until that point, whether it really is aligned or not."

Lately I've been thinking the current generation of transformer-based large language models is entering a domain of diminishing returns, but maybe I'm wrong: as the models get bigger and incorporate new innovations, they seem to still attain new capabilities. At least, prior to hearing about this deception, scheming, and self-preservation, I didn't predict or expect at all that it would happen. So for me this is an unexpected twist in the development of AI. I expected stuff like this to be possible "someday", but it has shown up now.

OpenAI’s o1: the AI that deceives, schemes, and fights back

#solidstatelife #ai #genai #llms #deception