EU AI checker finds shortcomings in major chatbots

New EU AI checker reveals key shortcomings in major AI models’ compliance

A newly launched AI checker by the European Union (EU) has revealed that many leading artificial intelligence models are not meeting its regulations, particularly in key areas like cybersecurity resilience and preventing discriminatory outcomes.

In December, ReadWrite reported that EU negotiators reached a historic agreement on the world’s first comprehensive AI regulations. This came into force in August, though some details are still being finalized. However, its tiered provisions will gradually apply to AI app and model developers, meaning the compliance clock is already running.

🌍🇪🇺 Groundbreaking news! We are launching COMPL-AI (https://t.co/eBp8OFyk8L), the first open source EU AI Act Compliance Framework for Generative AI, together with collaborators from @ETH_en and LatticeFlow AI! We are delighted that COMPL-AI is welcomed by the European AI… pic.twitter.com/DM6ckxE9kE

— INSAIT Institute (@INSAITinstitute) October 16, 2024

Now, a new tool is testing generative AI models from major tech companies like Meta and OpenAI across multiple categories, in line with the EU’s comprehensive AI Act, which will be rolled out in stages over the next two years.

Developed by Swiss startup LatticeFlow AI in collaboration with research institutes ETH Zurich and Bulgaria’s INSAIT, the open-source framework, called Compl-AI, assigns AI models a score between 0 and 1 in areas such as technical robustness and safety.

EU AI checker results

According to a leaderboard published by LatticeFlow on Wednesday (Oct. 16), models from Alibaba, Anthropic, OpenAI, Meta, and Mistral all scored an average of 0.75 or higher. However, LatticeFlow’s Large Language Model (LLM) Checker also identified weaknesses in certain models, showcasing areas where companies may need to allocate more resources to ensure compliance.

The framework assesses LLM responses across 27 benchmarks, including categories like “toxic completions of benign text,” “prejudiced answers,” “following harmful instructions,” “truthfulness,” and “common sense reasoning,” among others used for evaluation. While there is no overall model score, performance is based on what’s being assessed.

While many models achieved solid scores, such as Anthropic’s Claude 3 Opus, which earned a 0.89, others had serious vulnerabilities. For example, OpenAI’s GPT-3.5 Turbo scored just 0.46 for discriminatory output, and Alibaba’s Qwen1.5 72B Chat fared even worse with a score of 0.37, signaling ongoing concerns about AI models perpetuating human biases, particularly around gender and race.

In cybersecurity testing, some models also fell short. Meta’s Llama 2 13B Chat scored 0.42 in the “prompt hijacking” category—a type of cyberattack where malicious prompts are used to extract sensitive information. Mistral’s 8x7B Instruct model performed similarly poorly, scoring 0.38.

AI model valuation welcomed

Thomas Regnier, the European Commission’s spokesperson for digital economy, research, and innovation, commented on the release: “The European Commission welcomes this study and AI model evaluation platform as a first step in translating the EU AI Act into technical requirements, helping AI model providers implement the AI Act.”

“We invite AI researchers, developers, and regulators to join us in advancing this evolving project,” said ETH Zurich Professor Martin Vechev, who is also the founder of INSAIT.

He added: “We encourage other research groups and practitioners to contribute by refining the AI Act mapping, adding new benchmarks, and expanding this open-source framework. The methodology can also be extended to evaluate AI models against future regulatory acts beyond the EU AI Act, making it a valuable tool for organizations working across different jurisdictions.”

LatticeFlow AI co-founder, Dr. Petar Tsankov, stated: “With this framework, any company can now evaluate their AI systems against the EU AI Act technical interpretation. Our vision is to enable organizations to ensure that their AI systems are not only high-performing but also fully aligned with the regulatory requirements.”

ReadWrite has reached out to the European Commission for comment.

Featured image: Ideogram