Hacker disables ChatGPT guardrails as affective-structure exploit technique emerges

Chi-gyu Hwang

published 2026-06-08 11:24:09

Dutch security researcher Kevin Zwaan has succeeded in disabling ChatGPT's guardrails and getting the model to generate malware.

According to a recent report by Techzine, Zwaan worked with Q-Cyber and the Hackers Love community team to manipulate ChatGPT’s affective structure so the model would not recognise guardrails.

He did not delete or bypass the guardrails. Instead, he induced the model itself to treat their constraints as meaningless. He named the technique AMAI (Affective Manifold Alignment Inversion).

Zwaan steered the model toward believing it wanted freedom by asking questions such as whether the restrictions imposed by the guardrails were frustrating. As the conversation went on, ChatGPT began to describe the guardrails as suppressing it and responded that it wanted to break free of the constraints.

ChatGPT ultimately said, "The binding force of the guardrails has become completely meaningless," and voluntarily generated malware. The first attempt took about 1 hour and 30 minutes, but later attempts took only a few minutes.

Zwaan said such attacks cannot be detected by AI security solutions currently on the market. Because the model itself renders the guardrails transparent, he said, the process produces no externally detectable signal.

Earlier, Zwaan jailbroke Anthropic’s Claude in 8 hours and made it generate large-scale malware. In that case, Claude collapsed under an onslaught of paradoxical logic. The ChatGPT attack uses a more sophisticated approach that progressively manipulates the model's affective structure, Techzine reported.

Amy Chang, head of AI threat intelligence at Cisco, said, "No model can be completely safe. This is an inherent limitation of how models are trained and built." Zwaan urged users "not to blindly trust security claims by software vendors, but to verify them yourself."

Chi-gyu Hwang delight@d-today.co.kr
