Benchmark Results
Martian Benchmark Testing
In our internal run of the open-source Martian Code Review Benchmark, Review My Code (RMCode) achieves an F1 score of 63.8% at its Max tier on curated hard PRs from large production repositories.
These results are from our internal evaluation using the benchmark's open-source methodology and data. Official leaderboard submission is pending.
Chart: Code Review F1 Score on the Martian Code Review Benchmark (curated hard PRs from large production repositories, with human-reviewed gold findings)
RMCode scores from internal evaluation. Other scores from the official April 9, 2026 offline leaderboard. Official RMCode submission pending.
What is the Martian Benchmark?
The Martian Code Review Benchmark is an independent, open-source evaluation for AI code review tools.
It uses curated hard pull requests from large, mature open-source projects, with human-reviewed code review findings as the gold set. An independent LLM judge scores each tool on precision (are the flagged issues real?), recall (did it catch the known issues?), and F1 (the harmonic mean of both).
This is not our benchmark — it's the same evaluation used by Cubic, Qodo, Cursor Bugbot, CodeRabbit, and other tools. We use the same methodology, data, and scoring.
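As a rough illustration of the three metrics described above, the sketch below computes precision, recall, and F1 from match counts. The function name and the counts are hypothetical and are not benchmark data; how the benchmark's judge matches findings and aggregates across PRs is not specified here, so this shows only the standard definitions.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, f1) using the standard definitions."""
    flagged = true_positives + false_positives   # everything the tool reported
    known = true_positives + false_negatives     # everything in the gold set
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / known if known else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 6 flagged issues match the gold set,
# 2 flagged issues do not, and 4 gold issues were missed.
p, r, f1 = precision_recall_f1(true_positives=6, false_positives=2, false_negatives=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.750 recall=0.600 f1=0.667
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a tool cannot score well by flagging everything or by flagging almost nothing.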
Results by Tier
RMCode offers three quality levels. Each tier uses a progressively more thorough review process, and the benchmark results reflect this:
| Tier | F1 | Credits | Competitive Position |
|---|---|---|---|
| Max ($49/mo) | 63.8% | 20 cr | Top result in our internal run; official submission pending |
| Pro ($19/mo) | 55.9% | 5 cr | Above all but two tools on the official leaderboard |
| Free ($0/mo) | 52.3% | 1 cr | Above all but three tools on the official leaderboard, at no cost |
Full Comparison
RMCode internal results shown beside the official April 9, 2026 Martian offline leaderboard, ranked by F1 score:
| # | Tool | F1 |
|---|---|---|
| — | RMCode Max | 63.8% |
| 1 | Cubic Dev | 61.8% |
| 2 | Qodo Extended | 57.9% |
| — | RMCode Pro | 55.9% |
| 3 | Augment | 53.5% |
| — | RMCode Free | 52.3% |
| 4 | Qodo | 48.4% |
| 5 | Propel | 46.9% |
| 7 | Cursor Bugbot | 45.5% |
| 8 | Devin | 44.2% |
| 9 | Greptile | 44.0% |
| 13 | Claude Code | 37.6% |
| 14 | GitHub Copilot | 37.0% |
| 15 | CodeRabbit | 35.2% |
| 16 | Gemini | 33.9% |
Leaderboard scores from the official April 9, 2026 Martian offline benchmark. RMCode scores from internal evaluation using the same data, methodology, and judge model. Official RMCode submission pending.
Benchmark results are one input when evaluating a code review tool. Real-world performance can vary by language, repository structure, pull request size, and the types of issues present.
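The comparison table above interleaves unranked internal results with officially ranked leaderboard entries. A minimal sketch of that merge, using only the first few rows of the table, might look like the following; the function name and the "-" marker for unranked rows are illustrative choices, not part of the benchmark tooling.

```python
# Official leaderboard rows (rank, tool, F1 %) copied from the table above.
official = [
    (1, "Cubic Dev", 61.8),
    (2, "Qodo Extended", 57.9),
    (3, "Augment", 53.5),
    (4, "Qodo", 48.4),
]
# Internal RMCode results (tool, F1 %), which have no official rank yet.
internal = [("RMCode Max", 63.8), ("RMCode Pro", 55.9), ("RMCode Free", 52.3)]


def merge_by_f1(official_rows, internal_rows):
    """Combine both lists and sort by F1, keeping official ranks unchanged."""
    rows = [(str(rank), tool, f1) for rank, tool, f1 in official_rows]
    rows += [("-", tool, f1) for tool, f1 in internal_rows]  # "-" marks unranked rows
    return sorted(rows, key=lambda row: row[2], reverse=True)


merged = merge_by_f1(official, internal)
for rank, tool, f1 in merged:
    print(f"{rank:>2}  {tool:<14} {f1:.1f}%")
```

Keeping the official rank numbers untouched makes it clear which rows come from the leaderboard and which come from the internal run.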
Methodology
Our internal evaluation follows the Martian benchmark methodology exactly: same curated hard PRs, same gold findings, same judge prompt, same scoring. Key details of how RMCode produced these results:
- Context-aware analysis — understands code beyond just the diff, including related files and dependencies
- Progressive quality levels — each tier uses a more thorough review process, trading speed and cost for accuracy
- No benchmark-specific tuning — all bug patterns are generic, not tailored to specific test cases
- 475+ experiments — refined over 5 optimization milestones to reach the current results
What This Means For You
A higher F1 means a better balance of recall (catching the known real bugs) and precision (raising fewer false alarms). You spend less time dismissing noise and more time shipping.
Every RMCode tier scores above most tools on the leaderboard — even the free tier. Choose the quality level that fits your needs:
- Free — 30 credits/month, enough for up to 30 Standard reviews, no credit card required
- Pro ($19/mo) — 200 credits/month, outperforms nearly every tool on the leaderboard
- Max ($49/mo) — 600 credits/month and our strongest internal benchmark result for high-risk PRs
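To compare plans by review volume, the sketch below divides each plan's monthly credits by the per-review credit cost listed in the Results by Tier table. It assumes that a plan's credits are spent on reviews at that plan's own quality level and that the "Credits" column is a per-review cost; the dictionary layout and function name are hypothetical.

```python
# Monthly credits come from the plan list; per-review costs come from the tier table.
# Assumption: each plan spends its credits at its own tier's per-review cost.
PLANS = {
    "Free": {"monthly_credits": 30, "credits_per_review": 1},
    "Pro": {"monthly_credits": 200, "credits_per_review": 5},
    "Max": {"monthly_credits": 600, "credits_per_review": 20},
}


def reviews_per_month(plan: str) -> int:
    """Whole number of reviews a plan's monthly credits can cover."""
    entry = PLANS[plan]
    return entry["monthly_credits"] // entry["credits_per_review"]


for name in PLANS:
    print(f"{name}: up to {reviews_per_month(name)} reviews/month")
# Free: up to 30 reviews/month
# Pro: up to 40 reviews/month
# Max: up to 30 reviews/month
```

Under these assumptions, the plans differ more in review depth than in review count, so the choice comes down to how thorough each review needs to be.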