Benchmark Results

Martian Benchmark Testing

Last updated July 6, 2026 · re-measured periodically

RMCode Max scored F1 = 74.7%, Pro scored F1 = 58.9%, and Free scored F1 = 53.0% on the Martian-50 code-review benchmark — measured end-to-end on the deployed production service and judged with the canonical Martian methodology.

These numbers come from a typical run: each finding was posted by the RMCode GitHub App on a public forked PR (see the 50 PRs below). Because language-model output is not deterministic, results can vary between runs. We re-measure periodically — after model refreshes and product updates — so the figures here may change over time. Official Martian leaderboard submission is pending.

Benchmark: Code Review F1 Score

Tool	F1
RMCode Max	74.7%
Cubic	61.8%
RMCode Pro	58.9%
Qodo Extended	57.9%
Augment	53.5%
RMCode Free	53.0%
Qodo	48.4%
Cursor Bugbot	45.5%
GitHub Copilot	37.0%
CodeRabbit	35.2%

Martian Code Review Benchmark — curated hard PRs from large production repositories, with human-reviewed gold findings

RMCode results are self-evaluated; an official submission has not yet been made.

What is the Martian Benchmark?

The Martian Code Review Benchmark is an independent, open-source evaluation for AI code review tools.

It uses curated hard pull requests from large, mature open-source projects, with human-reviewed code review findings as the gold set. An independent LLM judge scores each tool on precision (are the flagged issues real?), recall (did it catch the known issues?), and F1 (the harmonic mean of both).
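For reference, the three metrics have the standard definitions below. TP is the number of flagged issues accepted as real, FP the number of false alarms, and FN the number of gold findings the tool missed. This assumes the benchmark applies the textbook formulas; the judge's matching rules are not described on this page.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```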

This is not our benchmark — it's the same evaluation used by Cubic, Qodo, Cursor Bugbot, CodeRabbit, and other tools. We use the same methodology, data, and scoring.

Results by Tier

RMCode offers three quality levels. All three have been measured end-to-end on the deployed service; Max runs our deepest, most thorough review pipeline:

Tier	F1	Credits	Competitive Position
Max	74.7%	20 cr	Our highest-scoring tier: the deepest review pipeline, measured end-to-end on the deployed production service
Pro	58.9%	5 cr	Multiple scout passes with a precision filter — measured end-to-end on the deployed service
Free	53.0%	1 cr	Above most tools on the leaderboard — at no cost

Full Comparison

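The deployed RMCode results are shown beside the official April 9, 2026 Martian offline leaderboard, ranked by F1 score. Rows with a dash in the rank column are RMCode's self-measured results and carry no official rank: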

#	Tool	F1
—	RMCode Max	74.7%
1	Cubic Dev	61.8%
—	RMCode Pro	58.9%
2	Qodo Extended	57.9%
3	Augment	53.5%
—	RMCode Free	53.0%
4	Qodo	48.4%
5	Propel	46.9%
7	Cursor Bugbot	45.5%
8	Devin	44.2%
9	Greptile	44.0%
13	Claude Code	37.6%
14	GitHub Copilot	37.0%
15	CodeRabbit	35.2%
16	Gemini	33.9%

Leaderboard scores from the official April 9, 2026 Martian offline benchmark. The RMCode Free, Pro, and Max scores were measured end-to-end on the deployed service using the same data, methodology, and judge model. Official RMCode submission pending.

Benchmark results are one input when evaluating a code review tool. Real-world performance can vary by language, repository structure, pull request size, and the types of issues present.

Why isn't RMCode on the official leaderboard yet?

We're still launching, and we're deliberately not tuning to a single benchmark snapshot. Our focus is long-term, real-world effectiveness — better review results with fewer tokens through better system design, and the freedom to run on the most cost-effective model, including open-source options and your own keys (BYOK on Enterprise).

Our principle is simple: results per token, not tokens burned. That means the most effective model for the job, the fewest tokens to get there, better results at lower cost, and no vendor lock-in. We are building a broader benchmark to measure that sustained effectiveness over time, and we'll submit to Martian once that groundwork is in place. Until then, every number here is measured on the live product and can be checked against the public reviews listed below.

The 50 PRs — Live Proof

Every score below was produced by the deployed production RMCode service reviewing a real pull request. Each PR was reconstructed as a public repository, opened as a PR, and reviewed automatically by the rmcode-ai GitHub App — the same app anyone can install. Open any “view review” link to see the bot's inline comments on the actual diff.

Scoring uses the canonical Martian pairwise judge (Claude Opus 4.5). Per PR: TP = real issues caught, FP = false alarms, FN = known issues missed.
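To make the per-PR counts concrete, here is a minimal sketch of how TP, FP, and FN turn into precision, recall, and F1 for a single PR. The function name `score_pr` is hypothetical, and returning 0.0 when a denominator is empty is an assumption; the benchmark's own convention for that case is not stated here.

```python
def score_pr(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a single PR.

    Assumption: an empty denominator scores 0.0; the benchmark's own
    convention for this case is not stated in the text.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# keycloak#37429 in the table: TP=3, FP=2, FN=1
p, r, f = score_pr(3, 2, 1)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.60 0.75 0.67

# grafana#90939: nothing flagged, two known issues missed
print(score_pr(0, 0, 2))  # (0.0, 0.0, 0.0)
```

A PR with no findings at all, such as the second example, contributes only false negatives, so it lowers recall without affecting precision.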

#	Repo · PR	RMCode Review	Findings	TP / FP / FN
1	keycloak#37429	view review →	5	3 / 2 / 1
2	keycloak#37634	view review →	3	3 / 0 / 1
3	keycloak#38446	view review →	3	2 / 0 / 0
4	keycloak#36882	view review →	2	0 / 2 / 1
5	keycloak#36880	view review →	3	2 / 1 / 1
6	keycloak#37038	view review →	2	2 / 0 / 0
7	keycloak#33832	view review →	2	2 / 0 / 0
8	keycloak#40940	view review →	2	2 / 0 / 0
9	keycloak-greptile#1	view review →	2	2 / 0 / 0
10	sentry#93824	view review →	5	5 / 0 / 0
11	sentry-greptile#5	view review →	1	1 / 0 / 2
12	sentry-greptile#1	view review →	4	3 / 0 / 1
13	grafana#97529	view review →	3	1 / 1 / 1
14	sentry#80168	view review →	1	1 / 0 / 1
15	sentry#80528	view review →	1	1 / 0 / 1
16	sentry#77754	view review →	5	3 / 0 / 1
17	sentry#95633	view review →	3	3 / 0 / 0
18	sentry-greptile#2	view review →	3	3 / 1 / 0
19	sentry-greptile#3	view review →	2	2 / 0 / 1
20	grafana#103633	view review →	2	0 / 2 / 2
21	sentry#67876	view review →	3	3 / 0 / 0
22	keycloak#32918	view review →	3	2 / 1 / 0
23	grafana#94942	view review →	1	1 / 0 / 1
24	grafana#90939	view review →	0	0 / 0 / 2
25	grafana#80329	view review →	2	1 / 1 / 0
26	grafana#90045	view review →	3	3 / 0 / 0
27	grafana#106778	view review →	3	1 / 2 / 1
28	grafana#107534	view review →	1	0 / 1 / 1
29	grafana#79265	view review →	4	4 / 0 / 1
30	discourse-graphite#9	view review →	0	0 / 0 / 2
31	grafana#76186	view review →	2	1 / 1 / 1
32	discourse-graphite#10	view review →	3	4 / 0 / 0
33	discourse-graphite#7	view review →	4	2 / 2 / 1
34	discourse-graphite#8	view review →	4	2 / 1 / 1
35	discourse-graphite#3	view review →	5	2 / 1 / 0
36	discourse-graphite#5	view review →	2	2 / 0 / 0
37	discourse-graphite#6	view review →	2	1 / 1 / 0
38	discourse-graphite#4	view review →	7	4 / 3 / 2
39	discourse-graphite#1	view review →	2	2 / 0 / 1
40	discourse-graphite#2	view review →	3	2 / 1 / 0
41	cal.com#22532	view review →	3	2 / 0 / 0
42	cal.com#8330	view review →	1	1 / 0 / 1
43	cal.com#14943	view review →	1	1 / 0 / 1
44	cal.com#22345	view review →	2	1 / 1 / 1
45	cal.com#11059	view review →	5	5 / 1 / 0
46	cal.com#7232	view review →	2	2 / 0 / 0
47	cal.com#14740	view review →	3	1 / 2 / 4
48	cal.com#10600	view review →	3	2 / 1 / 2
49	cal.com#10967	view review →	4	4 / 0 / 1
50	cal.com#8087	view review →	3	2 / 0 / 0
Aggregate — 50 PRs			135	99 / 29 / 38

Aggregate: 99 TP, 29 FP, 38 FN → precision 77.3%, recall 72.3%, F1 = 74.7%.
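The aggregate line can be re-derived from the per-PR rows. The sketch below copies the TP / FP / FN column from the table, sums it, and computes the metrics from the pooled counts. That the published figures come from pooled counts rather than from averaging per-PR scores is an inference from the arithmetic, not something the page states; the name `ROWS` is illustrative.

```python
# (TP, FP, FN) per PR, copied from the table above in row order 1..50.
ROWS = [
    (3, 2, 1), (3, 0, 1), (2, 0, 0), (0, 2, 1), (2, 1, 1),  # 1-5
    (2, 0, 0), (2, 0, 0), (2, 0, 0), (2, 0, 0), (5, 0, 0),  # 6-10
    (1, 0, 2), (3, 0, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1),  # 11-15
    (3, 0, 1), (3, 0, 0), (3, 1, 0), (2, 0, 1), (0, 2, 2),  # 16-20
    (3, 0, 0), (2, 1, 0), (1, 0, 1), (0, 0, 2), (1, 1, 0),  # 21-25
    (3, 0, 0), (1, 2, 1), (0, 1, 1), (4, 0, 1), (0, 0, 2),  # 26-30
    (1, 1, 1), (4, 0, 0), (2, 2, 1), (2, 1, 1), (2, 1, 0),  # 31-35
    (2, 0, 0), (1, 1, 0), (4, 3, 2), (2, 0, 1), (2, 1, 0),  # 36-40
    (2, 0, 0), (1, 0, 1), (1, 0, 1), (1, 1, 1), (5, 1, 0),  # 41-45
    (2, 0, 0), (1, 2, 4), (2, 1, 2), (4, 0, 1), (2, 0, 0),  # 46-50
]

# Pool the counts across all PRs before computing any ratio.
tp = sum(row[0] for row in ROWS)
fp = sum(row[1] for row in ROWS)
fn = sum(row[2] for row in ROWS)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn)                                   # 99 29 38
print(f"{precision:.1%} {recall:.1%} {f1:.1%}")     # 77.3% 72.3% 74.7%
```

Pooling weights every finding equally, so PRs with many gold findings influence the result more than PRs with few.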

Measured on the deployed production service (Max tier) with the canonical Martian methodology and judge. Repositories are hosted under the public rmcode-benchmark organization. Official Martian leaderboard submission is pending.

Methodology

Our internal evaluation follows the Martian benchmark methodology exactly: same curated hard PRs, same golden findings, same judge prompt, same scoring. Key details:

Context-aware analysis — understands code beyond just the diff, including related files and dependencies
Progressive quality levels — each tier uses a more thorough review process, trading speed and cost for accuracy
No benchmark-specific tuning — all bug patterns are generic, not tailored to specific test cases

What This Means For You

Higher F1 means your reviews catch more real bugs with fewer false alarms. You spend less time dismissing noise and more time shipping.

RMCode Free scores above most tools on the leaderboard — at no cost. Choose the quality level that fits your needs:

Free — 30 credits/month, enough for up to 30 Standard reviews, no credit card required
Pro — 200 credits/month, F1 = 58.9% on the deployed Martian-50 run, runs multiple scout passes with a precision filter
Max — 600 credits/month, F1 = 74.7% on the deployed Martian-50 run, our deepest, most thorough review pipeline
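As a rough guide to how far each monthly allowance goes, the sketch below divides the allowances listed above by the per-review credit costs from the Results by Tier table. It assumes a plan's credits are all spent on reviews at a single tier and that unused credits are ignored; which tiers each plan can actually access is not stated here, and the names in the code are illustrative.

```python
# Per-review cost in credits, from the Results by Tier table.
CREDITS_PER_REVIEW = {"Free": 1, "Pro": 5, "Max": 20}

# Monthly credit allowance per plan, from the list above.
MONTHLY_CREDITS = {"Free": 30, "Pro": 200, "Max": 600}


def max_reviews(plan: str, tier: str) -> int:
    """Whole reviews at `tier` that `plan`'s monthly credits cover.

    Assumption: all credits go to one tier; leftover credits are dropped.
    """
    return MONTHLY_CREDITS[plan] // CREDITS_PER_REVIEW[tier]


for name in ("Free", "Pro", "Max"):
    print(name, max_reviews(name, name))
# Free 30
# Pro 40
# Max 30
```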

Try it free — Install GitHub App

30 credits/month free. No credit card required.