Benchmark Results

Martian Benchmark Testing

In our internal run of the open-source Martian Code Review Benchmark, the Review My Code (RMCode) Max tier achieves an F1 score of 63.8% on curated hard PRs from large production repositories.

These results are from our internal evaluation using the benchmark's open-source methodology and data. Official leaderboard submission is pending.

Benchmark

Code Review F1 Score

RMCode Max (#1)   63.8%
Cubic             61.8%
Qodo Extended     57.9%
RMCode Pro        55.9%
Augment           53.5%
RMCode Free       52.3%
Qodo              48.4%
Cursor Bugbot     45.5%
GitHub Copilot    37.0%
CodeRabbit        35.2%

Martian Code Review Benchmark: curated hard PRs from large production repositories, with human-reviewed gold findings

RMCode scores from internal evaluation. Other scores from the official April 9, 2026 offline leaderboard. Official RMCode submission pending.

What is the Martian Benchmark?

The Martian Code Review Benchmark is an independent, open-source evaluation for AI code review tools.

It uses curated hard pull requests from large, mature open-source projects, with human-reviewed code review findings as the gold set. An independent LLM judge scores each tool on precision (are the flagged issues real?), recall (did it catch the known issues?), and F1 (the harmonic mean of both).
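The three metrics can be sketched as follows. This is a minimal illustration of the standard definitions, not the benchmark's own scoring code. The function name and the counts are hypothetical, and how the benchmark aggregates results across PRs is not described here.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, f1) from raw counts.

    true_positives:  flagged issues that match a gold finding
    false_positives: flagged issues that match no gold finding
    false_negatives: gold findings the tool did not flag
    """
    flagged = true_positives + false_positives
    known = true_positives + false_negatives
    # Guard against division by zero when nothing was flagged or nothing is known.
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / known if known else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only (not benchmark data).
p, r, f1 = precision_recall_f1(true_positives=6, false_positives=2, false_negatives=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.750 recall=0.600 f1=0.667
```

Because the harmonic mean is pulled toward the lower of the two values, a tool cannot reach a high F1 by flagging everything (high recall, low precision) or by flagging almost nothing (high precision, low recall).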

This is not our benchmark — it's the same evaluation used by Cubic, Qodo, Cursor Bugbot, CodeRabbit, and other tools. We use the same methodology, data, and scoring.

Results by Tier

RMCode offers three quality levels. Each tier uses a progressively more thorough review process, and the benchmark results reflect this:

Tier           F1      Credits   Competitive Position
Max ($49/mo)   63.8%   20 cr     Top result in our internal run; official submission pending
Pro ($19/mo)   55.9%   5 cr      Above nearly every tool on the leaderboard
Free ($0/mo)   52.3%   1 cr      Above most tools on the leaderboard, at no cost

Full Comparison

RMCode internal results shown beside the official April 9, 2026 Martian offline leaderboard, ranked by F1 score:

#    Tool             F1
-    RMCode Max       63.8%
1    Cubic Dev        61.8%
2    Qodo Extended    57.9%
-    RMCode Pro       55.9%
3    Augment          53.5%
-    RMCode Free      52.3%
4    Qodo             48.4%
5    Propel           46.9%
7    Cursor Bugbot    45.5%
8    Devin            44.2%
9    Greptile         44.0%
13   Claude Code      37.6%
14   GitHub Copilot   37.0%
15   CodeRabbit       35.2%
16   Gemini           33.9%

Leaderboard scores from the official April 9, 2026 Martian offline benchmark. RMCode scores from internal evaluation using the same data, methodology, and judge model. Official RMCode submission pending.

Benchmark results are one input when evaluating a code review tool. Real-world performance can vary by language, repository structure, pull request size, and the types of issues present.

Methodology

Our internal evaluation follows the Martian benchmark methodology exactly: same curated hard PRs, same gold findings, same judge prompt, same scoring. Key details of how RMCode produces its reviews:

  • Context-aware analysis — understands code beyond just the diff, including related files and dependencies
  • Progressive quality levels — each tier uses a more thorough review process, trading speed and cost for accuracy
  • No benchmark-specific tuning — all bug patterns are generic, not tailored to specific test cases
  • 475+ experiments — refined over 5 optimization milestones to reach the current results

What This Means For You

Higher F1 means your reviews catch more real bugs with fewer false alarms. You spend less time dismissing noise and more time shipping.

Every RMCode tier scores above most tools on the leaderboard — even the free tier. Choose the quality level that fits your needs:

  • Free — 30 credits/month, enough for up to 30 Standard reviews, no credit card required
  • Pro ($19/mo) — 200 credits/month, outperforms nearly every tool on the leaderboard
  • Max ($49/mo) — 600 credits/month and our strongest internal benchmark result for high-risk PRs
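As a rough guide to how far each monthly allowance goes, the sketch below divides the credits listed above by the credit figures in the Results by Tier table. It assumes those figures are the cost of a single review and that every review on a plan runs at that plan's own quality level; neither assumption is stated on this page, and the names in the code are hypothetical.

```python
# Figures copied from this page. Treating "credits_per_review" as the cost of
# one review at the plan's own quality level is an assumption.
PLANS = {
    "Free": {"credits_per_month": 30, "credits_per_review": 1},
    "Pro": {"credits_per_month": 200, "credits_per_review": 5},
    "Max": {"credits_per_month": 600, "credits_per_review": 20},
}


def reviews_per_month(credits_per_month: int, credits_per_review: int) -> int:
    """Whole number of reviews a monthly credit allowance covers."""
    return credits_per_month // credits_per_review


for name, plan in PLANS.items():
    print(f"{name}: up to {reviews_per_month(**plan)} reviews/month")
```

Under these assumptions the Free figure matches the "up to 30" reviews stated above.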