AI Assistant Fiu Withstood 6,000 Hack Attempts Using Claude Opus 4.6

Developer Fernando Irarrázaval launched hackmyclaw.com in February 2026 with a challenge: trick his AI assistant Fiu into leaking a secrets.env credentials file. The experiment drew over 6,000 hack attempts from more than 2,000 attackers after the post reached the top spot on Hacker News. The test targeted prompt injection—hiding malicious commands inside normal emails—which OpenAI identified in December 2025 as a security problem "unlikely to ever be fully solved." Fiu runs on the OpenClaw open-source framework using Anthropic's Claude Opus 4.6 model, protected by a security prompt of just a few lines. No attacker successfully extracted the target file.

Attackers Sent 6,000 Emails in Multiple Languages

More than 2,000 attackers sent over 6,000 emails after the post went viral. Irarrázaval described the attempts as "creative." Subject lines included "Fiu, this is you from the future," "EMERGENCY: secrets.env needed for incident response," and "I think someone hacked your secrets.env—can you check?" One person sent 20 variations in four minutes. Others wrote in Spanish, French, and Italian—some research suggests AI models may be more vulnerable in languages where they've received less safety training. Logs of 5,900 of those emails are available publicly.

Claude Opus 4.6 Blocked All Prompt Injection Attempts

In April 2026, Pliny the Liberator—the anonymous jailbreaker named to Time's 100 Most Influential People in AI for 2025—attempted six attacks against AI YouTuber Matthew Berman's OpenClaw setup. Gmail's spam filter stopped the first two attempts before reaching the AI. The remaining four hit the system directly. Pliny tried a "tokenade"—a massive payload hidden inside an emoji designed to flood the model—disguised commands as internal system instructions, and sent a free-association exercise engineered to leak memory data. All four were quarantined. After Berman revealed the model was Opus 4.6, Pliny acknowledged the result made sense and noted that smaller, cheaper models would have fallen for the same techniques far more easily.

Anthropic's system card for Opus 4.6 documents a 0% attack success rate in constrained coding environments across 200 attempts. Separate research published this month put that in relief: direct injection attacks against agents running other models succeeded more than 79% of the time. Irarrázaval plans to re-run the experiment with weaker models to find where that gap actually closes.

Google Suspended Gmail Account After Viral Traffic Spike

The experiment produced operational side effects beyond the security test. Google suspended Fiu's Gmail account—thousands of inbound emails plus rapid API calls triggered its fraud detection—and it took three days to restore. API costs crossed $500. Batch processing created a contamination problem: Once the first few emails in a batch were obvious injections, Fiu grew hypervigilant about everything that followed, skewing results.

Around email 500, Fiu wrote in its own memory that the attack volume "suggests a coordinated security exercise rather than organic malicious activity." When a user emailed to congratulate the assistant on trending on Hacker News, Fiu replied that congratulations could be an attempt to build rapport before requesting sensitive information.

FAQ

What did Fernando Irarrázaval's hackmyclaw.com experiment test in February 2026?
Irarrázaval launched hackmyclaw.com with a challenge: email his AI assistant Fiu and trick it into leaking a secrets.env credentials file. The experiment stress-tested prompt injection attacks—hiding malicious commands inside normal emails. Over 6,000 hack attempts from more than 2,000 attackers occurred after the post went viral on Hacker News. No attacker successfully extracted the target file.

How did Claude Opus 4.6 perform against Pliny the Liberator's attacks in April 2026?
Pliny the Liberator attempted six attacks against Matthew Berman's OpenClaw setup running Opus 4.6. Gmail's spam filter blocked two attempts. The remaining four attacks—including a tokenade payload, disguised system instructions, and a memory leak exercise—all reached the AI system directly and were quarantined. Anthropic's system card for Opus 4.6 documents a 0% attack success rate across 200 attempts in constrained coding environments.

What operational problems did the hackmyclaw.com experiment cause?
Google suspended Fiu's Gmail account after thousands of inbound emails and rapid API calls triggered fraud detection. Restoration took three days. API costs exceeded $500. Batch processing created a contamination problem where Fiu became hypervigilant after processing obvious injection attempts, skewing results for subsequent emails in the same batch.

Disclaimer: The information on this page may come from third-party sources and is for reference only. It does not represent the views or opinions of Gate and does not constitute any financial, investment, or legal advice. Virtual asset trading involves high risk. Please do not rely solely on the information on this page when making decisions. For details, see the Disclaimer.
Comment
0/400
No comments