AI Hallucinations Explained: Why ChatGPT and Other Models Make Things Up

Dr. Emily Foster

In March 2024, GitHub Copilot confidently suggested a Python function to a developer at Vercel. The code looked perfect: clean syntax, correct error handling, and an import from what appeared to be a well-known library. But the library didn’t exist. This was not a bug; it is how large language models work.

AI hallucinations cost companies real money. A 2023 Stanford study found that 27% of ChatGPT responses to medical questions and 33% of its responses to legal questions contained factual errors. These are not edge cases; they are fundamental to the way these systems generate text.

The Mathematical Reality Behind Hallucinations

Large language models do not retrieve facts; they predict the most probable next token based on patterns in their training data. When GPT-4 tells you that Paris is the capital of France, it is not looking up a database of world capitals. It is calculating that “Paris” has the highest probability of following “the capital of France is”, based on billions of text examples.
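
To make the prediction idea concrete, here is a deliberately tiny sketch. It is not how GPT-4 is implemented: it replaces a neural network with simple counting over a made-up four-sentence corpus, and the function names and the three-token context window are assumptions chosen for illustration. What it shares with a real model is the key property that the answer comes from frequencies in text, not from a lookup of facts.

```python
from collections import Counter, defaultdict

# Toy "training data": the model only ever sees text, never a table of facts.
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is lyon",  # a wrong statement that also appears in the data
    "the capital of italy is rome",
]

def train(sentences, context_size=3):
    """Count which token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            counts[context][tokens[i]] += 1
    return counts

def next_token_distribution(counts, prompt, context_size=3):
    """Return P(next token | last `context_size` tokens of the prompt)."""
    context = tuple(prompt.split()[-context_size:])
    followers = counts[context]
    total = sum(followers.values())
    return {token: n / total for token, n in followers.items()}

counts = train(corpus)
dist = next_token_distribution(counts, "the capital of france is")
best = max(dist, key=dist.get)
print(dist)  # probabilities for each continuation seen in the corpus
print(best)  # the most probable continuation
```

Note that the wrong continuation still receives probability mass simply because it appeared in the data. Nothing in the counting step marks it as false.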

This distinction matters enormously. A 2023 analysis in MIT Technology Review examined 1,000 ChatGPT responses about historical events. The model achieved 94% accuracy on events before 2021, its training cutoff, but only 67% accuracy on events from 2022 and 2023 that users had mentioned in their prompts. The model could not distinguish between facts it had learned and patterns it had extrapolated.

At temperature 0, GPT-3.5 picks the most probable token at every step, so identical prompts produce effectively identical outputs. At temperature 0.7, a commonly used setting, responses vary significantly. At temperature 1.5, the model produces creative fictions in the same confident tone as facts. The model has no internal mechanism for flagging uncertainty.
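
The effect of temperature can be sketched with a standard softmax over token scores. The three candidate tokens and their logits below are invented for illustration, and the function name is an assumption. Temperature 0 is left out because it would divide by zero; it is commonly handled as a special case that simply picks the highest-scoring token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
tokens = ["Paris", "Lyon", "Atlantis"]
logits = [5.0, 2.0, 0.5]

for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```

As the temperature rises, probability shifts from the top candidate toward the unlikely ones, so a sampler picks implausible tokens more often. The generated text carries no marker of which case occurred.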

Why Retrieval-Augmented Generation Isn’t a Complete Fix

The industry’s response has been retrieval-augmented generation (RAG): feeding models external data sources at query time. The theory was that grounding the model in real data would reduce hallucinations. Supabase, which crossed $100 million in annual recurring revenue in 2024, built its vector database specifically for RAG applications.

The data suggests limited success: a 2024 study comparing Claude 2 with and without RAG found that hallucination rates dropped from 31 to 19 percent. Better, but not solved. The model still fabricated details about retrieved documents, merged information from multiple sources incorrectly, and answered confidently even when the search returned no relevant results. Kelsey Hightower noted in a 2024 conference talk that RAG systems require extensive prompt engineering to prevent the model from ignoring the retrieved context entirely.
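
One of the failure modes above, answering when retrieval finds nothing relevant, can be guarded against outside the model. The sketch below is a minimal illustration under stated assumptions: the two-document store, the word-overlap scoring, the `min_overlap` threshold, and the prompt wording are all hypothetical stand-ins for an embedding model, a vector index, and a tuned prompt.

```python
import re

# Hypothetical document store; a real system would use embeddings and a vector index.
DOCUMENTS = {
    "refunds": "Refunds are issued within 14 days of a returned order.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, min_overlap=2):
    """Return (doc_id, text) pairs sharing at least `min_overlap` words with the query."""
    query_words = tokenize(query)
    scored = []
    for doc_id, text in DOCUMENTS.items():
        overlap = len(query_words & tokenize(text))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)  # best-matching documents first
    return [(doc_id, text) for _, doc_id, text in scored]

def build_prompt(query):
    """Build a grounded prompt, or return None when nothing relevant was retrieved."""
    hits = retrieve(query)
    if not hits:
        return None  # caller should answer "I don't know" instead of calling the model
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the sources below and cite the source id. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

print(build_prompt("How many days does standard shipping take?"))
print(build_prompt("What is the CEO's favorite color?"))
```

The guard addresses only the empty-retrieval case. It does nothing about the model misreading or merging the sources it is given, which is why the prompt also asks for source ids that a reviewer can check.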

The Security Implications Nobody Discusses

AI hallucinations are not just accuracy problems; they are attack vectors. The XZ Utils backdoor, discovered in March 2024, revealed how a multi-year social engineering campaign compromised critical Linux infrastructure. An AI trained on code repositories that include such malicious commits could confidently suggest vulnerable code patterns to developers.

“We train our models on the entire Internet, including exploits, malware documentation, and social engineering examples, and then we’re surprised when they produce code with security vulnerabilities.”
– Commentary by a security researcher on the risks of AI-generated code, 2024

Cloudflare, which processed 57 million HTTP requests per second in 2024, reported that 8% of attempted attacks in Q3 2024 used AI-generated phishing content. These were not sophisticated attacks, just hallucinations that happened to slip through traditional filters because they did not match known patterns. The randomness became an advantage.

Here’s the contrarian view: maybe we should stop trying to eliminate hallucinations altogether. For creative applications (writing fiction, brainstorming product names, generating artistic concepts), hallucinations are features. The problem is not the technology but its deployment, without appropriate safeguards, in contexts that require factual accuracy.

What Actually Reduces Hallucination Rates

The techniques that work aren’t exciting, but they’re measurable:

  • Chain-of-thought prompting: Forcing models to show reasoning steps reduces hallucinations by 23% according to Google Research (2023). When GPT-4 must explain its logic, it catches more of its own errors.
  • Multiple sampling with consistency checking: Generate five responses and flag inconsistencies. Anthropic’s Claude uses this internally, reducing hallucination rates from 29% to 14% in their benchmarks.
  • Confidence scoring on outputs: OpenAI’s API now returns logprobs (log probabilities) for each token. Responses with consistently high probabilities hallucinate less than those with varied probabilities.
  • Domain-specific fine-tuning: A model fine-tuned on 10,000 examples from the medical literature achieves 89% accuracy on medical questions, versus 73% for the base model (Stanford Medicine, 2024).
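
Two of the techniques in the list, consistency checking and confidence scoring, need very little code once the samples and per-token log probabilities are in hand. The sketch below is illustrative only: the sample answers, the logprob values, and the 0.6 agreement threshold are invented, and no API is called. Exact string matching is also a simplification; free-form answers would need a semantic comparison.

```python
import math
from collections import Counter

def consistency_check(answers, min_agreement=0.6):
    """Majority vote over sampled answers; flag the result when agreement is low."""
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    return top_answer, agreement, agreement < min_agreement

def mean_token_probability(token_logprobs):
    """Geometric-mean token probability, computed from per-token log probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical samples for the same prompt (five generations).
samples = ["1889", "1889", "1887", "1889 ", "1889"]
answer, agreement, flagged = consistency_check(samples)
print(answer, agreement, flagged)

# Hypothetical per-token logprobs for two responses.
confident = [-0.05, -0.10, -0.02]
shaky = [-0.05, -2.30, -1.60]
print(mean_token_probability(confident), mean_token_probability(shaky))
```

Both scores are signals for routing a response to review, not proof of correctness: a model can be consistently and confidently wrong.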

None of these eliminates the problem, but they reduce it to manageable levels for specific use cases. According to Gartner, global spending on cybersecurity reached $215 billion in 2024, yet we are deploying AI systems with known reliability issues in security-critical applications. The gap is striking.

The Economics of Acceptable Error Rates

The DevSecOps market is projected to grow from $3.94 billion in 2023 to $23.28 billion in 2028, a CAGR of 42.6 percent. Much of this growth will go to tools that catch AI-generated errors before they reach production. We are building an entire industry around compensating for hallucinations rather than preventing them.
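
The growth rate quoted above can be checked with the standard compound annual growth rate formula. The helper name `cagr` is an assumption; the inputs are the figures from the paragraph.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures from the paragraph above, in billions of dollars, over 2023 to 2028.
rate = cagr(3.94, 23.28, 2028 - 2023)
print(f"{rate:.2%}")
```

The result lands between 42.6 and 42.7 percent, so the quoted figure is consistent with the start and end values.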

Microsoft’s GitHub Copilot serves millions of developers despite generating incorrect code 40% of the time, because most errors are caught during testing. This may be rational: training GPT-5 to reduce hallucinations from 20% to 15% could cost $100 million in compute, while building review systems to catch that 20% might cost $5 million.

The question is not whether the AI hallucinates, but whether the hallucination rate is low enough for the specific application. For writing marketing copy, a factual error rate of thirty percent may be acceptable if human editors review everything, but for generating SQL queries against production databases, even a one percent error rate is catastrophic. We are still figuring out which applications fall into which category, often by expensive trial and error.
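
One rough way to frame that comparison is to ask how many errors survive review. The sketch below assumes that review catches a fixed share of errors independently of the model; the catch rates and acceptable-error thresholds are hypothetical numbers chosen only to illustrate the two cases, and the function names are not from any library.

```python
def residual_error_rate(model_error_rate, review_catch_rate):
    """Share of outputs that are wrong and also slip past review."""
    return model_error_rate * (1 - review_catch_rate)

def is_deployable(model_error_rate, review_catch_rate, max_acceptable_error):
    """True when the errors that survive review stay within the application's tolerance."""
    return residual_error_rate(model_error_rate, review_catch_rate) <= max_acceptable_error

# Hypothetical numbers, chosen only to illustrate the comparison.
marketing = is_deployable(0.30, review_catch_rate=0.95, max_acceptable_error=0.02)
sql = is_deployable(0.01, review_catch_rate=0.0, max_acceptable_error=0.001)
print(marketing, sql)
```

Under these assumptions the thirty percent error rate passes because editors catch almost everything, while the one percent rate fails because nothing stands between the model and the database.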

Sources and References

1. Stanford School of Medicine, “Accuracy of Large Language Models in Clinical Decision Support” (2023). Analysis of ChatGPT performance on medical queries.
2. MIT Technology Review, “The Hallucination Problem in Large Language Models” (2023). Comprehensive study of factual accuracy across different temporal domains.
3. Google Research, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2023). Quantitative analysis of prompt engineering techniques.
4. Gartner Research, “Forecast: Information Security and Risk Management, Worldwide” (2024). Study of global cybersecurity spending and market analysis.

Dr. Emily Foster

Dr. Emily Foster holds a PhD in Public Health from Johns Hopkins University and has published extensively on wellness, medical breakthroughs, and preventive healthcare. She combines rigorous scientific methodology with accessible writing.
