FrontierNews.ai

Meta's Watermelon AI Claims to Match GPT-5.5, But There's a Catch

Meta's Chief AI Officer Alexandr Wang announced that the company's next frontier model, codenamed Watermelon, has matched OpenAI's GPT-5.5 on closely watched AI benchmarks, even while still in training. If verified, this would suggest Meta's multi-hundred-billion-dollar investment in AI infrastructure is beginning to produce competitive results. However, the claim comes with significant caveats that experts say make it impossible to evaluate without more transparency.

Why Are Unnamed Benchmarks a Red Flag?

The core issue with Wang's announcement is not the claim itself, but how it was made. The evaluation was conducted internally by Meta, used benchmarks that were never named, and included no independent verification. In frontier AI development, this combination is the least reliable form of evidence available. The same AI model can produce vastly different scores depending on which evaluation method, test set version, and configuration are used.

According to the analysis of the announcement, scores can swing by 10 to 20 points depending on these factors alone. Until Meta publishes a full benchmark table and submits Watermelon to independent evaluation, Wang's claim therefore functions as a directional indicator of Meta's strategy, not a verified statement about where Watermelon actually ranks relative to GPT-5.5.

What Makes This Announcement Significant?

Watermelon represents a dramatic scaling up of Meta's AI efforts. Wang described the model as using roughly 10 times more computing power than Meta's previous model, Muse Spark, which launched in April 2026. The training is running on Meta's Prometheus cluster, a computing facility under construction in New Albany, Ohio, that houses approximately 500,000 graphics processing units (GPUs) and draws more than a gigawatt of power. This makes it one of the largest AI training installations ever attempted by a single company.

Meta has guided investors to expect between $125 billion and $145 billion in AI capital expenditure this year, which provides context for what "more compute" means at the company's scale. The previous model, Muse Spark, scored 52 on the Artificial Analysis Intelligence Index, placing it in the top five globally but below OpenAI, Anthropic, and Google at the frontier tier.

How Does Compute Scaling Actually Work in AI?

The engineering logic behind Watermelon's training run is governed by neural scaling laws, which are empirical relationships between training compute and model performance. These laws, first formalized by OpenAI researchers in 2020 and refined by DeepMind in 2022, establish that model performance improves predictably as compute, data, and parameters increase. However, the improvement follows a power law with diminishing returns, not a linear relationship.

This means a tenfold increase in computing power does not produce a tenfold better model. Under these scaling laws, a 10-fold compute increase typically yields a 30 to 40 percent reduction in training loss, depending on the model's starting point and data quality. In addition, for compute-optimal training, the number of training tokens must scale roughly in proportion to the number of parameters, which helps explain why Meta has reportedly incorporated proprietary data from Facebook, Instagram, WhatsApp, and Threads into Watermelon's training corpus.
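
The diminishing returns described above can be made concrete with a few lines of arithmetic. The Python sketch below assumes a simplified scaling law in which loss is proportional to compute raised to a negative exponent, with no irreducible-loss floor. The exponent `alpha` and the helper names `loss_reduction` and `implied_exponent` are hypothetical: `alpha` is back-solved from the 30 to 40 percent range quoted above and is not a value published by Meta or measured for any specific model.

```python
import math


def loss_reduction(compute_multiplier: float, alpha: float) -> float:
    """Fractional drop in loss if loss is proportional to compute ** -alpha."""
    return 1.0 - compute_multiplier ** (-alpha)


def implied_exponent(compute_multiplier: float, reduction: float) -> float:
    """Exponent alpha that would produce the given fractional loss reduction."""
    return -math.log(1.0 - reduction) / math.log(compute_multiplier)


# Back-solve alpha from the article's 30 to 40 percent range for 10x compute.
# These exponents are illustrative assumptions, not measured values.
for reduction in (0.30, 0.40):
    alpha = implied_exponent(10.0, reduction)
    after_10x = loss_reduction(10.0, alpha)
    after_100x = loss_reduction(100.0, alpha)
    print(
        f"alpha={alpha:.3f}: 10x compute -> {after_10x:.0%} lower loss, "
        f"100x compute -> {after_100x:.0%} lower loss"
    )
```

Under this simplified model, each tenfold increase removes the same fraction of the remaining loss. If the first 10x removes 30 percent, a second 10x brings the total to 51 percent rather than 60 percent, which is the sense in which returns diminish.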

What Are the Key Factors Behind Meta's AI Strategy?

  • Proprietary Training Data: Meta has incorporated data from its own platforms, representing billions of social interactions and conversational exchanges that competitors cannot replicate from public web data alone.
  • Benchmark Saturation Concerns: Research from 2025 and 2026 shows that MMLU, one of the most commonly cited AI knowledge benchmarks, is functionally saturated at the frontier, with over 45 percent overlap between popular training corpora and test questions.
  • Evaluation Transparency Gap: Wang did not identify which benchmarks Watermelon was evaluated against, making it impossible to determine whether claimed parity reflects genuine capability or an artifact of training data overlap.
  • Competitive Timing: The comparison Wang chose, matching GPT-5.5, is an intentional framing decision, as OpenAI has already previewed GPT-5.6, a more advanced model currently restricted to approximately 20 government-approved partner organizations.

How to Evaluate AI Model Claims Like Watermelon's

  • Check for Named Benchmarks: Legitimate AI model claims should specify exactly which benchmarks were used, such as MMLU, HumanEval, or other publicly recognized tests.
  • Verify Independent Evaluation: Look for scores produced by third-party evaluation organizations like Artificial Analysis or the Scale AI SEAL leaderboard, which use public methodologies and produce reproducible results.
  • Examine Evaluation Methodology: Credible claims include details about the evaluation harness, configuration, and test set version used, allowing other researchers to reproduce the results.
  • Wait for Public Release: Before accepting claims about a model's capabilities, wait to see whether the company submits the model to named independent evaluation organizations with public harnesses.

In the 2026 frontier AI evaluation landscape, who runs a benchmark, and how openly, is the single best predictor of how much a published score can mislead. Scores from independent, third-party organizations carry the most weight, while scores from the company whose model is being evaluated, using internal test setups with no public methodology, carry the least.
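
The checklist and ranking above can be sketched as a small classifier. This is an illustrative Python sketch only: the `BenchmarkClaim` fields, the `reliability_tier` function, and the tier labels are hypothetical names chosen to mirror the four checks in this article, not an established scoring standard.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkClaim:
    """The four checks from the list above, recorded as yes/no answers."""
    benchmarks_named: bool
    independent_evaluator: bool
    methodology_public: bool
    model_publicly_released: bool


def reliability_tier(claim: BenchmarkClaim) -> str:
    """Map a claim to a coarse, hypothetical reliability tier."""
    checks = [
        claim.benchmarks_named,
        claim.independent_evaluator,
        claim.methodology_public,
        claim.model_publicly_released,
    ]
    # Independent scores on named benchmarks with a public method carry the most weight.
    if claim.benchmarks_named and claim.independent_evaluator and claim.methodology_public:
        return "most reliable"
    # Internal scores with nothing disclosed carry the least.
    if not any(checks):
        return "least reliable"
    return "partially verifiable"


# The Watermelon claim as described in this article: internal evaluation,
# unnamed benchmarks, no public harness, model still in training.
watermelon = BenchmarkClaim(
    benchmarks_named=False,
    independent_evaluator=False,
    methodology_public=False,
    model_publicly_released=False,
)
print(reliability_tier(watermelon))
```

Flipping any single field to `True` moves the claim out of the bottom tier, which reflects the point that even naming the benchmarks would make the claim easier to evaluate.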

"Wang's claim belongs to the least reliable category. Meta has not confirmed which benchmarks Watermelon was evaluated on, nor made the evaluation harness or configuration public," according to TechTimes' technical analysis of the announcement.

Meta declined to comment when approached by multiple outlets following the initial report, and OpenAI did not respond to requests for comment. This does not mean Wang's claim is false, only that it cannot be independently evaluated until Meta publishes the full benchmark table and submits Watermelon to named independent evaluation organizations.

For now, the real benchmark to watch is not the one Wang cited in the town hall announcement. It is whether, when Watermelon ships, Meta submits the model to a named independent evaluation organization with a public harness, and whether the numbers hold up under that scrutiny.