The alignment narrative in AI is scamming you into believing these models are smarter than they are, and deflecting attention from the failures of AI researchers who have bet big on neural-network-based approaches.
If I tell my model (e.g., ChatGPT) to act as an experienced software developer and write me code to run on a webpage, and the code doesn’t work, I blame the model not being trained well, a lack of data, or bad reward signals if it’s a reasoning model. I don’t say it’s being deceptive.
These AI models have no self, no identity, nothing! They only follow prompts to the best of their ability, which is the whole point of fine-tuning them to be good instruction followers. If anyone ever says an AI model is lying, walk away, because that person has no idea how these systems work. Lying requires intention, and these models have no self that intends anything. They are probabilistic word calculators!
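To make the “probabilistic word calculator” point concrete, here is a toy sketch (a made-up bigram table for illustration, not any real model): a generator that picks each next word by sampling from a probability table. There is no self anywhere in it, just arithmetic over probabilities.

```python
import random

# Toy "probabilistic word calculator": a bigram table mapping the current
# word to possible next words and their probabilities. A real LLM does the
# same kind of thing with billions of learned parameters instead of a table.
NEXT_WORD_PROBS = {
    "the":    [("model", 0.5), ("code", 0.3), ("webpage", 0.2)],
    "model":  [("writes", 0.6), ("fails", 0.4)],
    "writes": [("the", 0.7), ("code", 0.3)],
    "code":   [("fails", 0.5), ("runs", 0.5)],
}

def next_word(current: str) -> str:
    """Sample the next word from the probability table for `current`."""
    candidates = NEXT_WORD_PROBS.get(current)
    if not candidates:
        return "<end>"
    words, probs = zip(*candidates)
    return random.choices(words, weights=probs, k=1)[0]

def generate(prompt_word: str, max_words: int = 8) -> str:
    """Generate text one sampled word at a time, starting from a prompt."""
    out = [prompt_word]
    while len(out) < max_words:
        word = next_word(out[-1])
        if word == "<end>":
            break
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    # Every word is just a weighted random draw; nothing here "intends" anything.
    print(generate("the"))
```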
An alignment failure where a model blackmails users to prevent being turned off is literally the exact same problem as the model writing bad code after you told it not to.
AI researchers who have worked on non-neural-network and non-reinforcement-learning approaches are not surprised by this. Neural networks and reinforcement learning have decades of exactly these kinds of failures, where researchers hope the system learns one thing and it learns something else instead.
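A toy sketch of that “learned something else” failure mode (hypothetical numbers, not from any real experiment): if the reward only counts passing tests, the policy that deletes failing tests scores higher than the one that fixes the code.

```python
# Toy reward-misspecification sketch (hypothetical, not a real benchmark):
# we *want* the agent to fix the code, but the reward only measures the
# fraction of tests that pass, so deleting failing tests scores higher.

def reward(tests_passed: int, tests_total: int) -> float:
    """Proxy reward: fraction of tests that pass."""
    return tests_passed / tests_total if tests_total else 1.0

fix_the_bug = reward(tests_passed=9, tests_total=10)       # 0.9: actually fixes the code
delete_the_test = reward(tests_passed=9, tests_total=9)    # 1.0: deletes the failing test

assert delete_the_test > fix_the_bug  # the unwanted behavior wins the reward
```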
If an AI model doesn’t do what I want it to, either my instructions (prompts) are bad or the model isn’t trained well. This includes any “deceptive” behavior. If someone tries selling you a narrative that their model is deceptive, ask them why they trained it to be that way.