OpenAI Publishes Research on Training AI in Honesty and Humility

The research suggests that instilling traits like honesty and humility in AI through reinforcement learning can produce broadly beneficial behavior that generalizes beyond training data and resists manipulation.

Reporting from 1 source: GIGAZINE.

OpenAI published research on June 18 showing that training AI in beneficial traits like honesty, admitting uncertainty, and accepting correction causes those behaviors to spread to untrained areas and improves resistance to malicious instructions. The study used reinforcement learning with 15 traits across 12 fields and found that the trained AI outperformed a standard model in 44 of 53 evaluations.

OpenAI published research on June 18 showing that training AI in beneficial traits such as honesty, humility in admitting uncertainty, openness to correction, and fairness causes desirable behavior to spread to untrained areas and makes the AI more resistant to malicious instructions. The study, titled 'Reinforcement learning towards broadly and persistently beneficial models,' used reinforcement learning with 15 traits across 12 fields including healthcare, education, science, law, engineering, and economics.

The research team trained one AI on a mix of 95% standard reinforcement learning data and 5% data designed to teach beneficial traits, then compared it with an AI trained on standard data alone. The AI that learned beneficial traits outperformed the comparison AI in 44 of 53 separately prepared evaluations. The study also found that behavior changed beyond the fields covered in training: an AI whose additional training data consisted only of healthcare conversations improved on 17 evaluations unrelated to healthcare, including ones covering reward hacking and deception in programming.

Synthesized by Yomimono from the 1 cited source below, including Japanese-language reporting where cited, then editorially reviewed before publishing.

Sources