May 13, 2024

GPT-4 Getting Worse

Stephen M. Walker II · Co-Founder / CEO

Is GPT-4 Getting Worse?

The question of whether GPT-4 is getting worse is complex and multifaceted.

But, yes, it is.

MODEL          RELEASE     BENCHMARKS  RETRIEVAL  COMP SCORE
GPT-4          0314        3.50        3.94       3.59
GPT-4 32k      0314        3.50        3.94       3.58
GPT-4          0613        3.50        3.90       3.58
GPT-4 32k      0613        3.50        3.90       3.56
GPT-4 Turbo    2024-04-09  3.50        3.98       3.48
GPT-4o         2024-05-13  3.42        3.96       3.44
GPT-3.5 Turbo  0301        3.02        3.89       3.42
GPT-3.5 Turbo  0613        3.01        3.88       3.33

GPT-4's speed, vision capabilities, and ability to follow complex ideas are improving. However, when examining raw capabilities across all known benchmarks, the original GPT-4 release (0314) still holds the top score in the table above (3.59), ahead of every later release.

While some users have reported perceived declines in performance, particularly in specific tasks or contexts, it's important to consider several factors.

Model updates, changes in training data, and evolving user expectations can all influence perceptions of performance.

Additionally, OpenAI continues to improve and fine-tune its models, as evidenced by the release of GPT-4 Omni (GPT-4o) on May 13, 2024.

This latest model showcases advanced multimodal capabilities, accepting text, audio, image, and video inputs and generating text, audio, and image outputs, which may address some of the concerns raised by users.

Therefore, while there may be instances where GPT-4's performance appears to fluctuate, ongoing advancements and updates aim to enhance its overall capabilities and user experience.

Reported Issues

Many users have reported experiencing issues with GPT-4 that weren't present in earlier versions:

  • Decreased reasoning capabilities and logical errors
  • Difficulty maintaining context and following instructions
  • Reduced accuracy in specialized tasks like coding
  • Less nuanced understanding of prompts

Quantitative Study

A study by researchers from Stanford and UC Berkeley attempted to quantify these changes by comparing the March 2023 and June 2023 versions of GPT-4:

  • Code generation: The percentage of directly executable code dropped from 52.0% in March to 10.0% in June for GPT-4
  • Prime number identification: GPT-4's accuracy decreased from 97.6% in March to 2.4% in June

Two causes are likely: efficiency optimizations such as model pruning, quantization, and other techniques, and additional RLHF to handle edge cases, address safety issues, and reduce extra token output.
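The two measurements above can be approximated with a few lines of Python. This is a minimal sketch, not the study's harness: the helper names (`is_prime`, `prime_accuracy`, `is_directly_executable`) are hypothetical, answers are assumed to begin with "yes" or "no", and "directly executable" is reduced here to "the raw response parses as Python", which is a weaker check than actually running the code against tests.

```python
import math


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def prime_accuracy(numbers, answers) -> float:
    """Fraction of yes/no answers that agree with the ground truth.

    Assumption: each answer string starts with "yes" or "no".
    """
    correct = sum(
        ans.strip().lower().startswith("yes") == is_prime(n)
        for n, ans in zip(numbers, answers)
    )
    return correct / len(numbers)


def is_directly_executable(response: str) -> bool:
    """True if the raw response parses as Python with no cleanup applied."""
    try:
        compile(response, "<response>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


# Two of the four answers below are wrong (9 is not prime, 11 is prime).
score = prime_accuracy([7, 8, 9, 11], ["Yes", "No", "Yes", "no"])

# Wrapping otherwise valid code in a Markdown fence makes the raw text fail to parse.
fence = "`" * 3
wrapped = fence + "python\nprint(1)\n" + fence
runs_raw = is_directly_executable("print(1)")
runs_wrapped = is_directly_executable(wrapped)
```

The last example shows why a formatting change alone can move an executability metric: the code inside the fence is unchanged, but the response as a whole no longer runs unmodified.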

Possible Explanations

Several factors might contribute to these perceived and measured changes:

  1. Increased user base straining the system
  2. Modifications to improve speed at the cost of accuracy
  3. Changes in content moderation and safety measures
  4. Ongoing model updates and fine-tuning
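To make the speed-versus-accuracy trade-off in item 2 concrete, here is a toy sketch of symmetric int8 quantization in plain Python. It is only an illustration of the general technique: the weight values are made up, the function names are hypothetical, and nothing here reflects how OpenAI actually serves GPT-4.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.8123, -0.0031, 0.2504, -0.6127]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight is stored in 8 bits instead of 32, at the cost of a rounding
# error of at most half a quantization step per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Small per-weight errors like this are cheap individually; whether they add up to a measurable quality change depends on the model and the task, which is why before-and-after evaluation matters.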

Differing Opinions

It's important to note that experiences vary, and not all users report a decline in quality. Some find GPT-4 to be faster and more human-like in its responses, even if less accurate in certain areas.

Lack of Official Communication

OpenAI has not provided detailed explanations for these changes, leading to speculation and frustration among users. This lack of transparency has made it difficult for users to understand the reasons behind the perceived changes in performance.

Implications

The reported decline in quality has several implications:

  • Users may need to verify GPT-4's outputs more carefully
  • Businesses relying on GPT-4 APIs may need to reassess their strategies
  • There's an increased interest in open-source alternatives

While GPT-4 remains a powerful tool, users should be aware of its limitations and potential inconsistencies. As with any AI technology, it's crucial to approach its outputs critically and verify important information from authoritative sources.
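One way to act on this is to keep a small fixed test set and compare a known-good model snapshot against a newer one before switching. The sketch below is a hypothetical harness under stated assumptions: a model is any function from prompt to response, the two stub models stand in for real API calls pinned to specific releases, and a case passes if the response contains an expected substring.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a model is any callable that maps a prompt to a response string.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, substring expected in the response)


def pass_rate(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose response contains the expected substring."""
    hits = sum(expected.lower() in model(prompt).lower() for prompt, expected in cases)
    return hits / len(cases)


def compare_snapshots(
    baseline: Model, candidate: Model, cases: List[Case], tolerance: float = 0.05
) -> Dict[str, object]:
    """Flag a regression when the candidate trails the baseline by more than tolerance."""
    base = pass_rate(baseline, cases)
    cand = pass_rate(candidate, cases)
    return {"baseline": base, "candidate": cand, "regressed": cand < base - tolerance}


cases = [
    ("Is 17 prime? Answer yes or no.", "yes"),
    ("What is 12 * 12?", "144"),
]


# Stub models for illustration; replace with calls to pinned model versions.
def baseline_stub(prompt: str) -> str:
    return {"Is 17 prime? Answer yes or no.": "Yes", "What is 12 * 12?": "144"}[prompt]


def candidate_stub(prompt: str) -> str:
    return {"Is 17 prime? Answer yes or no.": "No", "What is 12 * 12?": "144"}[prompt]


report = compare_snapshots(baseline_stub, candidate_stub, cases)
```

Substring matching is deliberately crude; for real workloads the pass criterion should reflect the task, such as running unit tests for generated code.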
