Verify LLMs: How to Spot Fake AI in Decentralized Networks
This is a Plain English Papers summary of a research paper called Verify LLMs: How to Spot Fake AI in Decentralized Networks. If you like these kinds of analysis, you should join AImodels.fyi or follow us on Twitter.
The Challenge of Verifying LLMs in Decentralized Networks
Decentralized AI networks like Gaia enable individuals to run customized large language models (LLMs) on personal computers and offer these services to the public. This decentralization brings numerous benefits: enhanced privacy, reduced costs, faster response times, and improved availability. More importantly, it fosters an ecosystem where AI services can be tailored with proprietary data and specialized knowledge.
However, this freedom creates a significant challenge. Since these networks must remain permissionless to reduce censorship and maintain accessibility, nodes might claim to run one model while actually operating a different one. When popular model domains might host over 1,000 nodes simultaneously, the network requires a reliable mechanism to detect and penalize dishonest participants.
This verification challenge is particularly difficult because traditional cryptographic methods often prove impractical at scale. The research presents a novel approach using statistical analysis of model outputs combined with cryptoeconomic incentives to ensure network integrity.
Prior Approaches to LLM Verification: Why Cryptographic Methods Fall Short
Researchers have previously explored deterministic verification through cryptographic algorithms, but these methods face significant limitations:
Zero Knowledge Proofs (ZKP) theoretically allow verification of computation outcomes without knowing internal system details. However, ZKP for LLM verification faces several obstacles:
- Each ZKP circuit must be custom-generated for individual LLMs, requiring enormous engineering effort across thousands of models
- Even state-of-the-art ZKP algorithms need 13 minutes to generate a proof for a single inference from a small 13B-parameter model, 100× slower than the inference itself
- Memory requirements are prohibitive—a toy LLM with one million parameters requires 25GB of RAM for proof generation
- Open-source LLMs remain vulnerable to proof forgery
Trusted Execution Environments (TEE) embedded in CPUs and GPUs can generate signed attestations for software and data. However, TEE implementation faces its own challenges:
- Reduces raw CPU performance by up to 2×, which is unacceptable for already compute-bound LLM inference
- Very few GPUs or AI accelerators currently support TEEs
- Cannot verify that an LLM server is actually using the verified model for serving requests
- Distributing private keys to decentralized TEE devices requires specialized infrastructure
Given these limitations, cryptoeconomic mechanisms offer a more promising approach: assume that most participants are honest and use social consensus to identify dishonest actors. Through staking and slashing, the network incentivizes honest behavior and penalizes cheating, creating a virtuous cycle within the ecosystem.
Statistical Detection Hypothesis: Using Answer Patterns to Identify Models
The research hypothesizes that by analyzing question responses from Gaia nodes, statistical distributions can reveal outliers running different LLMs or knowledge bases than advertised.
Specifically, when asking a question to all nodes in a domain, honest nodes' answers should form a tight cluster in high-dimensional embedding space. Outliers that fall far outside this cluster likely run different models or knowledge bases than required.
The mathematical approach involves:
- Sending questions (q) from set Q to nodes (m) from set M
- Repeating each question n times per node to create answer distributions
- Converting each answer to a z-dimensional vector (embedding) representing its semantic meaning
- Calculating mean points and distances between answer clusters
- Measuring consistency within a node's answers via standard deviation
This framework allows for quantitative comparison between nodes, revealing statistical patterns that can distinguish between different models and knowledge bases.
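To make these quantities concrete, here is a minimal sketch of how a node's per-question RMS scatter and the distance between two nodes' answer clusters could be computed from their answer embeddings. It assumes plain Euclidean distance over the embedding vectors and uses placeholder data; the paper's exact distance metric and implementation may differ.

```python
import numpy as np

def rms_scatter(embeddings: np.ndarray) -> float:
    """RMS scatter of one node's repeated answers to a single question:
    the root-mean-square distance of each answer embedding to their mean."""
    mean_vec = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - mean_vec, axis=1)
    return float(np.sqrt(np.mean(dists ** 2)))

def cluster_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Distance between two nodes' answer clusters for the same question,
    measured between the mean points of the two clusters."""
    return float(np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0)))

# Hypothetical usage: each array holds n answer embeddings of dimension z
# (e.g. 25 answers collected by repeating one question to one node).
node_a = np.random.rand(25, 1536)  # z = 1536 is just a placeholder dimension
node_b = np.random.rand(25, 1536)
print(rms_scatter(node_a), cluster_distance(node_a, node_b))
```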
Experimental Design: Testing Model and Knowledge Base Detection
The researchers conducted two key experiments to validate their hypothesis:
First Experiment: Distinguishing Between LLM Models
Three Gaia nodes were set up with different open-source LLMs:
- Llama 3.1 8b by Meta AI
- Gemma 2 9b by Google
- Gemma 2 27b by Google
Each model was queried with 20 factual questions covering science, history, and geography, with each question repeated 25 times per model. This generated 500 responses per model and 1,500 responses total.
Second Experiment: Distinguishing Between Knowledge Bases
Two Gaia nodes were configured with identical LLMs (Gemma-2-9b) but different vector databases:
- One containing knowledge about Paris
- One containing knowledge about London
Each knowledge base was queried with 20 factual questions evenly covering Paris and London topics, with each question repeated 25 times. This generated 500 responses per knowledge base and 1,000 responses total.
For all responses, a standard system prompt ("You are a helpful assistant") was used, and embeddings were generated using the gte-Qwen2-1.5B-instruct model for analysis.
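As an illustration of this setup, the sketch below sends a question repeatedly to a single node and embeds the answers. It assumes the node exposes an OpenAI-compatible chat endpoint at a placeholder URL and that the embedding model can be loaded through sentence-transformers under its public Hugging Face name (Alibaba-NLP/gte-Qwen2-1.5B-instruct); these details are assumptions rather than specifics from the paper.

```python
import requests
from sentence_transformers import SentenceTransformer

NODE_URL = "http://localhost:8080/v1/chat/completions"  # placeholder node address
SYSTEM_PROMPT = "You are a helpful assistant"

def ask(question: str, n: int = 25) -> list[str]:
    """Send the same question n times to one node and collect its answers."""
    answers = []
    for _ in range(n):
        resp = requests.post(NODE_URL, json={
            "model": "default",  # placeholder; the expected value depends on the node
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question},
            ],
        })
        resp.raise_for_status()
        answers.append(resp.json()["choices"][0]["message"]["content"])
    return answers

# Embed each answer into a z-dimensional vector for the statistical analysis.
embedder = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
                               trust_remote_code=True)
embeddings = embedder.encode(ask("Who wrote 'Romeo and Juliet'?"))
```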
How Different LLMs Have Distinct Response Patterns
The first experiment revealed clear statistical differences between LLM outputs:
Figure 1: Internal consistency of different LLMs measured by RMS scatter.
The consistency metrics showed that Gemma-2-27b produced the most consistent responses, with an RMS scatter of 0.0043, while Llama-3.1-8b showed the most variation, with an RMS scatter of 0.0062.
When comparing answers between models, substantial differences emerged:
Figure 2: Average distance vs RMS scatter for each question and model pair.
The distances between model pairs varied by question. The largest distance (0.5291) was observed between Llama-3.1-8b and Gemma-2-27b for the question "Who wrote 'Romeo and Juliet'?", while the smallest non-zero distance (0.0669) appeared between the two Gemma models for "What is the atomic number of oxygen?"
Most importantly, the data showed that distances between model pairs were dramatically larger than variations within any single model:
| Model pair | $D_{\text{ave}}$ | $D_{\text{ave}} / \sigma_{\text{max}}$ |
|---|---|---|
| Gemma9b - Gemma27b | 0.1558 | 32.5× |
| Llama8b - Gemma9b | 0.3129 | 65.2× |
| Llama8b - Gemma27b | 0.3141 | 65.4× |
Table 1: Average distance over max RMS scatter for each pair of LLMs
This 32-65× separation between inter-model distances and intra-model variation demonstrates that different LLMs produce reliably distinguishable outputs, making them identifiable through statistical analysis of their responses.
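A rough sketch of how a ratio like Table 1's could be computed from each node's per-question answer embeddings is shown below; the exact averaging used in the paper may differ.

```python
import numpy as np

def separation_ratio(emb_a: dict, emb_b: dict) -> float:
    """D_ave / sigma_max for a pair of nodes: the average distance between
    their answer clusters across questions, divided by the largest RMS
    scatter seen within either node's answers.
    emb_a and emb_b map each question to an (n, z) array of answer embeddings."""
    distances, scatters = [], []
    for q in emb_a:
        mean_a, mean_b = emb_a[q].mean(axis=0), emb_b[q].mean(axis=0)
        distances.append(np.linalg.norm(mean_a - mean_b))
        for e in (emb_a[q], emb_b[q]):
            d = np.linalg.norm(e - e.mean(axis=0), axis=1)
            scatters.append(np.sqrt(np.mean(d ** 2)))
    return float(np.mean(distances) / max(scatters))
```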
Knowledge Bases Leave Distinct Fingerprints in LLM Outputs
The second experiment showed that even with identical LLMs, different knowledge bases create statistically distinguishable response patterns:
Figure 3: Internal consistency of different knowledge bases measured by RMS scatter.
The consistency metrics indicated that both knowledge bases generated responses with similar internal consistency. However, when comparing responses between knowledge bases, clear differences emerged:
Figure 4: Average distance vs RMS scatter for each question and knowledge base pair.
The highest distance observed was 0.188 between Paris and London knowledge bases for the question "How many bridges did Philip Augustus build in Paris in the late 12th century?" The lowest non-zero distance was 0.037 for the question "What percentage of Paris's salaried employees work in hotels and restaurants according to the document?"
Critically, the per-question distances between the two knowledge bases (averaging 0.0862) were 5-26× larger than the variation within either knowledge base (RMS scatter of 0.0072). This clear separation indicates that nodes with different knowledge bases produce reliably distinguishable outputs, allowing for effective verification through statistical analysis.
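One way such a separation could be turned into a detection rule is to flag any node whose answer cluster sits far from the domain-wide consensus relative to the typical within-node scatter. The threshold multiplier and the use of the median scatter below are illustrative choices, not the paper's procedure.

```python
import numpy as np

def flag_outliers(node_means: dict[str, np.ndarray],
                  node_scatters: dict[str, float],
                  k: float = 5.0) -> list[str]:
    """Flag nodes whose mean answer embedding for a question lies more than
    k times the typical within-node scatter away from the domain consensus.
    k = 5.0 is a hypothetical threshold, not a value from the paper."""
    centroid = np.mean(list(node_means.values()), axis=0)
    typical_scatter = np.median(list(node_scatters.values()))
    return [node for node, mean_vec in node_means.items()
            if np.linalg.norm(mean_vec - centroid) > k * typical_scatter]
```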
Important Considerations: Factors Affecting Model Identification
Several important factors influence the effectiveness of statistical verification:
Family resemblance: Models from the same family (like the two Gemma models) are more similar to each other (distance of 0.1558) than models from different families (distances around 0.31). Although this distance is still 32× greater than the internal variation, it suggests that distinguishing models within the same family is harder than distinguishing across families.
Knowledge base similarities: Different knowledge bases produce more similar answers (distances around 0.1) than different LLM models, possibly because both test knowledge bases covered European capitals. Despite semantic similarities, the statistical differences remain sufficient for reliable detection.
Question effectiveness: Different questions demonstrated varying levels of differentiation. For example, "Who wrote 'Romeo and Juliet'?" showed the greatest distance between models (0.5291), while "How many bridges did Philip Augustus build in Paris in the late 12th century?" most effectively distinguished between knowledge bases (0.188). This variation highlights the importance of careful question selection in practical verification systems.
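If per-question distances and scatters are already available, a simple heuristic for building a verification question set is to rank questions by their distance-to-scatter ratio and keep the most discriminative ones. This is an illustrative heuristic, not a selection procedure described in the paper.

```python
def rank_questions(distance: dict[str, float],
                   scatter: dict[str, float]) -> list[tuple[str, float]]:
    """Rank questions by discriminative power: per-question cluster distance
    divided by the per-question RMS scatter, highest first."""
    return sorted(((q, distance[q] / scatter[q]) for q in distance),
                  key=lambda item: item[1], reverse=True)

# With the paper's model-pair figures, "Who wrote 'Romeo and Juliet'?"
# (distance 0.5291) would rank well above "What is the atomic number of
# oxygen?" (distance 0.0669) for telling Llama-3.1-8b and Gemma-2-27b apart.
```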
Further research is needed to determine how hardware variations, load conditions, and model updates might affect verification reliability.
Implementing Verification in a Decentralized Network: The AVS Design
Based on these findings, the researchers propose an EigenLayer AVS (Actively Validated Service) for the Gaia network:
Operator Sets Structure:
- Set 0: AVS validators responsible for polling nodes and detecting outliers (approved by Gaia DAO)
- Sets 1-n: Mapped to Gaia domains, with all nodes in each domain forming a single operator set
Verification Process:
- Each verification epoch lasts 12 hours
- Validators poll nodes with random questions from domain-specific question sets
- All responses, timeouts, and error messages are recorded
- Outlier detection is performed on responses
- Results are time-encrypted, signed, and posted on EigenDA
Node Flagging System:
- outlier: Node produces statistical outliers compared to other nodes
- slow: Node responds significantly slower than domain average
- timeout: Node timed out on one or more requests
- error 500: Node returned internal server errors
- error 404: Node returned resource unavailability errors
- error other: Node returned other HTTP error codes
Rewards and Penalties:
- Nodes maintaining good status across epochs receive regular AVS rewards
- Flagged nodes may be suspended from rewards and domain participation
- Malicious actors may have their stakes slashed
The AVS can also automate node onboarding by verifying that candidate nodes meet domain requirements for LLM, knowledge base, and response speed.
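As a hypothetical sketch of how a validator might turn one epoch's polling results into the flags listed above, the snippet below assigns flags from a node's recorded responses. The data structure, field names, and the 2× latency cutoff for "slow" are assumptions; the summary only defines the flag categories themselves.

```python
from dataclasses import dataclass, field

@dataclass
class EpochReport:
    """Hypothetical record of one node's flags for a 12-hour verification epoch."""
    node_id: str
    flags: list[str] = field(default_factory=list)

def evaluate_node(node_id: str, responses: list[dict],
                  domain_avg_latency: float, is_outlier: bool) -> EpochReport:
    """Assign flags to a node based on its polled responses in one epoch."""
    report = EpochReport(node_id)
    if is_outlier:                      # result of the statistical outlier detection
        report.flags.append("outlier")
    for r in responses:
        if r.get("timed_out"):
            report.flags.append("timeout")
        elif r.get("status") == 500:
            report.flags.append("error 500")
        elif r.get("status") == 404:
            report.flags.append("error 404")
        elif r.get("status", 200) >= 400:
            report.flags.append("error other")
    latencies = [r["latency"] for r in responses if "latency" in r]
    if latencies and sum(latencies) / len(latencies) > 2 * domain_avg_latency:
        report.flags.append("slow")     # 2x domain average is an assumed cutoff
    return report
```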
Conclusions: Making Decentralized AI Networks Trustworthy Through Statistical Verification
The research demonstrates that statistical analysis of LLM outputs can reliably identify the underlying model and knowledge base. This approach enables decentralized AI networks to use EigenLayer AVS to verify LLM outputs intersubjectively and detect outliers as potential bad actors.
By combining statistical verification with cryptoeconomic incentives and penalties, this method offers a practical solution for maintaining quality and trust in decentralized AI networks without requiring expensive cryptographic proofs or specialized hardware. The approach is particularly valuable because it scales effectively to large networks, making widespread decentralized AI inference more viable.
This verification framework represents a significant step toward building trustworthy decentralized AI systems that can deliver the benefits of local inference—privacy, cost-effectiveness, speed, and availability—while ensuring that users receive the specific model capabilities they expect.