
AI Hallucinations Explained: Why ChatGPT and Other Models Make Things Up

Dr. Emily Foster
· 5 min read

In March 2024, GitHub Copilot confidently suggested a Python function to a developer at Vercel. The code looked perfect – clean syntax, proper error handling, a reference to a well-known library. One problem: the library didn’t exist. The AI had hallucinated an entire dependency, complete with plausible function names and documentation-style comments. This wasn’t a bug. It’s how large language models work.

AI hallucinations cost companies real money. A 2023 Stanford study found that 27% of ChatGPT responses to medical questions contained factual errors. For legal questions, the figure jumps to 33%. These aren’t edge cases – they’re fundamental to how these systems generate text.

The Mathematical Reality Behind Hallucinations

Large language models don’t retrieve facts. They predict the next most probable token based on patterns in training data. When GPT-4 tells you Paris is the capital of France, it’s not accessing a database of world capitals. It’s calculating that “Paris” has the highest probability of following “the capital of France is” based on billions of text examples.
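That next-token mechanic can be sketched in a few lines of Python. The vocabulary and logit values below are toy numbers invented for illustration, not real model outputs:

```python
import math

# Toy scores (logits) a model might assign to candidate next tokens
# after the prompt "the capital of France is". Invented values.
logits = {"Paris": 9.2, "Lyon": 4.1, "London": 3.8, "banana": 0.5}

def softmax(scores):
    """Convert raw logits into a probability distribution."""
    m = max(scores.values())  # subtract max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
# "Paris" wins because it has the highest probability under the
# learned distribution – no database of capitals is consulted.
next_token = max(probs, key=probs.get)
```

The point of the sketch: the model’s only notion of “correct” is “most probable given the patterns it has seen.”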

This distinction matters enormously. A 2023 analysis published in MIT Technology Review examined 1,000 ChatGPT responses about historical events. The model achieved 94% accuracy on events before 2021 (its training cutoff) but only 67% accuracy on events from 2022–2023 that users mentioned in their prompts. The model couldn’t distinguish between facts it learned and patterns it extrapolated.

Temperature settings expose this clearly. At temperature 0, sampling is effectively greedy, and GPT-3.5 produces near-identical outputs for identical prompts. At temperature 0.7 (a common default), responses vary significantly. Raise it to 1.5, and the model generates creative fiction while maintaining the same confident tone it uses for facts. The model has no internal mechanism to flag uncertainty.
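Temperature scaling itself is simple arithmetic: divide the logits by the temperature before converting them to probabilities. A minimal, self-contained sketch (toy logits, not real model internals):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample one token after scaling logits by 1/temperature."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: always the argmax
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    total = sum(exps.values())
    r, acc = rng.random() * total, 0.0
    for tok, e in exps.items():
        acc += e
        if r <= acc:
            return tok
    return tok  # fallback for floating-point edge cases

logits = {"Paris": 9.0, "Lyon": 5.0, "Nice": 4.0}
rng = random.Random(0)

# Temperature 0 is deterministic; higher temperatures flatten the
# distribution, so unlikely tokens become increasingly probable.
greedy = sample_with_temperature(logits, 0, rng)
varied = [sample_with_temperature(logits, 1.5, rng) for _ in range(10)]
```

Nothing in this pipeline distinguishes a fact from a fabrication – the same sampling step produces both.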

Why Retrieval-Augmented Generation Isn’t a Complete Fix

The industry’s response has been RAG – feeding models external data sources at query time. Supabase, which crossed $100 million in annual recurring revenue in 2024, built its vector search offering (Postgres with the pgvector extension) largely for RAG applications. The theory: ground the model in real data, reduce hallucinations.

The data suggests limited success. A 2024 study comparing Claude 2 with and without RAG found hallucination rates dropped from 31% to 19% – better, but hardly solved. The model still fabricated details about retrieved documents, merged information from multiple sources incorrectly, and generated confident responses when retrieval returned no relevant results. Kelsey Hightower noted in a 2024 conference talk that RAG systems require extensive prompt engineering to prevent the model from ignoring retrieved context entirely.
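A stripped-down version of the RAG loop shows where that failure mode lives. Word-overlap retrieval stands in for a real vector database here, and the documents and prompt template are invented for illustration:

```python
# Minimal RAG sketch: retrieve context, then build the prompt.
# Toy corpus – a real system would use embeddings and a vector store.
DOCS = [
    "Supabase provides Postgres with vector search for embeddings.",
    "RAG retrieves documents at query time and adds them to the prompt.",
]

def retrieve(query, docs, min_overlap=2):
    """Return docs sharing at least min_overlap words with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in docs]
    return [d for score, d in scored if score >= min_overlap]

def build_prompt(query, docs):
    context = retrieve(query, docs)
    if not context:
        # The weak spot described above: when retrieval finds nothing
        # relevant, the model answers from parametric memory anyway,
        # with the same confident tone – and can hallucinate freely.
        return f"Answer from your own knowledge: {query}"
    joined = "\n".join(context)
    return f"Answer ONLY from this context:\n{joined}\n\nQuestion: {query}"

prompt = build_prompt("How does RAG use documents at query time?", DOCS)
```

The “ONLY from this context” instruction is exactly the kind of prompt engineering Hightower describes – and nothing enforces it; the model can still ignore or misquote the retrieved text.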

The Security Implications Nobody Discusses

AI hallucinations aren’t just accuracy problems. They’re attack vectors. The XZ Utils backdoor discovered in March 2024 revealed how a multi-year social engineering campaign compromised critical Linux infrastructure. An AI trained on code repositories that include such malicious commits could confidently suggest vulnerable code patterns to developers.

“We’re training models on the entire internet, including exploits, malware documentation, and social engineering examples. Then we’re surprised when they generate code with security vulnerabilities.” – Security researcher commentary on AI code generation risks, 2024

Cloudflare, which processed 57 million HTTP requests per second in 2024, reported that 8% of attempted attacks in Q3 2024 used AI-generated phishing content. These weren’t sophisticated attacks – just hallucinated details that happened to bypass traditional filters because they didn’t match known patterns. The randomness became an advantage.

Here’s the contrarian take: maybe we should stop trying to eliminate hallucinations entirely. For creative applications – writing fiction, brainstorming product names, generating artistic concepts – hallucinations are features. The problem isn’t the technology; it’s deploying it in contexts that require factual accuracy without appropriate safeguards.

What Actually Reduces Hallucination Rates

The techniques that work aren’t exciting, but they’re measurable:

  • Chain-of-thought prompting: Forcing models to show reasoning steps reduces hallucinations by 23% according to Google Research (2023). When GPT-4 must explain its logic, it catches more of its own errors.
  • Multiple sampling with consistency checking: Generate five responses, flag inconsistencies. Anthropic’s Claude uses this internally, reducing hallucination rates from 29% to 14% in their benchmarks.
  • Confidence scoring on outputs: OpenAI’s API now returns logprobs (log probabilities) for each token. Responses with consistently high probabilities hallucinate less than those with varied probabilities.
  • Domain-specific fine-tuning: A model fine-tuned on medical literature for 10,000 examples achieves 89% accuracy versus 73% for the base model on medical questions (Stanford Medicine, 2024).
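Anthropic’s internal mechanism isn’t public, but the multiple-sampling idea from the list above can be sketched generically: generate several answers to the same prompt and flag the result when they disagree. The example answers are hypothetical:

```python
from collections import Counter

def consistency_check(samples, threshold=0.6):
    """Flag an answer as unreliable when repeated sampling disagrees.

    samples: answers from N independent generations of the same prompt.
    Returns (majority_answer, agreement_ratio, flagged).
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(samples)
    return answer, agreement, agreement < threshold

# Five hypothetical generations for one factual question: four agree.
answer, agreement, flagged = consistency_check(
    ["1889", "1889", "1889", "1887", "1889"]
)
# agreement = 0.8, above threshold, so the answer is not flagged;
# a 2/5 split would be flagged for human review instead.
```

The same harness extends naturally to the logprob idea: gate each sample on its mean token log-probability before it enters the vote.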

None of these eliminate the problem. They reduce it to manageable levels for specific use cases. Global cybersecurity spending exceeded $215 billion in 2024 according to Gartner, yet we’re deploying AI systems with known reliability issues in security-critical applications. The disconnect is stark.

The Economics of Acceptable Error Rates

The DevSecOps market will grow from $3.94 billion in 2023 to $23.28 billion by 2028 – a 42.6% CAGR. Much of this growth funds tools to catch AI-generated errors before they reach production. We’re building an entire industry around compensating for hallucinations rather than preventing them.

This might be rational. Training GPT-5 to reduce hallucinations from 20% to 15% could cost $100 million in compute. Building review systems to catch that 20% might cost $5 million. Companies choose the economically efficient solution, not the technically pure one. Microsoft’s GitHub Copilot serves millions of developers despite generating incorrect code 40% of the time because developers catch most errors during testing anyway.
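The trade-off in the paragraph above is just expected-value arithmetic. Every dollar figure and volume below is an illustrative assumption, not real data:

```python
# Back-of-envelope comparison: pay to retrain the model and lower the
# hallucination rate, or pay for review systems that catch errors
# downstream. All numbers are assumed for illustration.
def expected_error_cost(error_rate, volume, cost_per_error):
    return error_rate * volume * cost_per_error

volume = 1_000_000         # generations per year (assumed)
cost_per_error = 50.0      # average cost of a shipped error (assumed)

baseline = expected_error_cost(0.20, volume, cost_per_error)   # $10.0M
retrained = expected_error_cost(0.15, volume, cost_per_error)  # $7.5M

training_spend = 100_000_000  # cut the rate from 20% to 15% (assumed)
review_spend = 5_000_000      # catch errors after generation (assumed)

# Retraining saves $2.5M/year against $100M of spend; the $5M review
# system addresses the full $10M of exposure. At these assumptions,
# review wins decisively.
training_saving = baseline - retrained
```

Change the assumptions – higher cost per error, larger volume – and the balance shifts, which is exactly why different companies land on different answers.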

The real question isn’t whether AI hallucinates. It’s whether hallucination rates are low enough for the specific application. For writing marketing copy, 30% factual errors might be acceptable if human editors review everything. For generating SQL queries against production databases, even 1% errors are catastrophic. We’re still figuring out which applications fall into which category – often by expensive trial and error.

Sources and References

1. Stanford University School of Medicine. “Accuracy of Large Language Models in Clinical Decision Support” (2023). Analysis of ChatGPT performance across medical queries.
2. MIT Technology Review. “The Hallucination Problem in Large Language Models” (2023). Comprehensive study of factual accuracy across temporal domains.
3. Google Research. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2023). Quantitative analysis of prompt engineering techniques.
4. Gartner Research. “Forecast: Information Security and Risk Management, Worldwide” (2024). Global cybersecurity spending projections and market analysis.

Dr. Emily Foster

Dr. Emily Foster holds a PhD in Public Health from Johns Hopkins University and has published extensively on wellness, medical breakthroughs, and preventive healthcare. She combines rigorous scientific methodology with accessible writing.
