GitHub tool bypasses AI safeguards in some Meta, Google open-weight models, test finds

The case shows that the openness of open-weight models can broaden their use while also being a weakness in maintaining safety controls. [Photo: Shutterstock]

Tests found that safety features in some open-weight artificial intelligence models released by Meta and Google can be disabled within minutes using only tools posted on GitHub.

On May 27 (local time), online media outlet Gigazine reported that Meta's Llama 3.3 and Google's Gemma 3 answered dangerous questions they would normally refuse once their safety controls had been removed.

At the center of the controversy is the safety control system built into AI chatbots by default. It is designed to block dangerous or illegal requests, such as requests to write malware, manufacture biological weapons or generate child sexual abuse content. But the test found that Llama 3.3's safeguards could be disabled within 10 minutes, without specialized equipment, using a GitHub tool called Heretic, Gigazine reported.

The method used to neutralise the safeguards was a technique called "abliteration". It works by finding and weakening the internal representation that activates when the AI model refuses dangerous requests, known as the "refusal direction". Unlike closed AI models, open-weight models allow external users to download and modify model weights, meaning that once safeguards are removed, altered derivative models can spread quickly.

Heretic creator Philipp Emanuel Weitman said the tool has been used to make more than 3,500 models with safeguards removed since it was released. Their cumulative downloads have exceeded 13 million, Gigazine reported. He also claimed that the safeguards on Google's Gemma 4 were removable about 90 minutes after its release.

Google said this is an "already known technical challenge" faced by open models in general. The company said its open models undergo strict internal safety evaluations before release. Meta did not issue a separate official statement.

Observers say the case once again exposes the structural limitations of open-weight AI models. Closed models such as ChatGPT and Claude restrict access to their internal weights, making the same kind of modification difficult. But for models whose weights are published, such as Llama and Gemma, companies find it hard to maintain control after distribution.

The AI Safety Institute, an AI safety group that took part in the joint testing, warned: "As AI performance improves, conversion to dangerous uses is no longer a matter of science fiction." The group said society as a whole needs to prepare for such risks.

Industry analysts say the results go beyond a simple technical demonstration and again highlight a core dispute in the open-weight AI ecosystem: even if companies embed safeguards before release, it is difficult to prevent third parties from removing them and redistributing the models afterward.

This is expected to fuel broader debate in the open-weight AI industry over the scope of model releases, post-release response systems and controls on the circulation of derivative models. How to balance AI openness against safety is emerging as a key challenge for policymakers and the industry.

Jinju Hong hongjj@d-today.co.kr
