
How Bayesian Confidence Scoring Works: The Math Behind Trader Grades

Why a trader who made 500% on 3 markets might not be as good as you think — and how Bayesian shrinkage separates skill from noise.

The 500% Trader Who Lost Everything

You find a trader on Polymarket with a 500% return across 3 markets. The numbers look incredible. You start mirroring their positions. Over the next month, they enter 10 more markets and lose on 7 of them. Your account drops 30%.

What happened? You mistook a lucky streak for skill. With only 3 resolved markets, a 500% return tells you almost nothing about skill. Those three wins could have been coin flips that happened to land right. Meanwhile, the trader with a steady 45% return across 150 markets, the one you scrolled past, has been quietly compounding gains all year.

This is not a hypothetical. Leaderboards that rank by raw returns are dominated by small-sample outliers who regress to the mean within weeks. If you are using raw stats to decide who to follow, you are making decisions based on noise. 0xInsider uses Bayesian confidence scoring to fix this — a mathematical framework that separates genuine skill from lucky streaks before you put money on the line.

The Small Sample Problem

Flip a coin three times and get heads every time. Would you bet your savings the next flip is heads? Of course not. But this is exactly what happens when people judge traders by a handful of markets. Three wins in a row feel meaningful. Statistically, they are nearly worthless.

Here is the dangerous part: at 3 markets, luck and skill look identical. At 30 markets, patterns start to emerge. At 300 markets, the picture is clear. Small samples do not just lack information — they actively mislead. They create vivid narratives ("this trader never loses") that dissolve the moment more data arrives.
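To put rough numbers on this, here is a minimal sketch that treats every market as an independent 50/50 bet for a trader with no skill. That is a simplifying assumption (real prediction markets are rarely priced at exactly 50%), and the function name `prob_at_least` is made up for illustration. It computes how often pure luck produces a given record.

```python
from math import comb

def prob_at_least(wins: int, markets: int, p: float = 0.5) -> float:
    """Chance of `wins` or more wins in `markets` independent bets,
    each won with probability p (binomial tail probability)."""
    return sum(
        comb(markets, k) * p**k * (1 - p) ** (markets - k)
        for k in range(wins, markets + 1)
    )

# A perfect 3-for-3 record happens by luck alone 1 time in 8.
print(prob_at_least(3, 3))  # 0.125

# A 70% win rate gets much harder to reach by luck as markets pile up.
for markets in (30, 300):
    wins = round(0.7 * markets)
    print(markets, prob_at_least(wins, markets))
```

Under these assumptions, one in eight no-skill traders shows a perfect record after three markets, while a 70% win rate sustained over hundreds of markets is very unlikely to be luck.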

Baseball figured this out decades ago. A batter who goes 3-for-3 in April does not lead the league in batting average all season. Analysts wait for a meaningful sample before drawing conclusions. The same discipline applies to prediction markets — and the same math. Our confidence formula uses a square root function because that is how statistical reliability actually scales with sample size.
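The square root comes from a standard statistical result: the standard error of an average shrinks with the square root of the number of independent observations. The symbols below are the usual textbook ones, not names taken from the 0xInsider system, and the link to the confidence formula is the general motivation rather than a published derivation.

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
```

Here \sigma is the spread of individual results and n is the number of observations. Quadrupling the sample, say from 25 markets to 100, only halves the uncertainty, which is why trust should build gradually rather than linearly.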

How the System Learns to Trust You

Think of it like hiring. A candidate walks in claiming a 95% success rate on 4 projects. Another claims 72% across 150 projects. Who do you trust more? Most people pick the second candidate, because a long track record is harder to fake. This is the core idea behind Bayesian shrinkage.

The system starts with one assumption: every trader is average until the data says otherwise. When a new trader posts amazing numbers from a few markets, the system responds with "interesting, show me more." It blends raw performance with a neutral baseline of 50 out of 100. With only a few markets of evidence, the blend leans toward 50. As markets pile up — 25, 50, 100 — the blend shifts toward the raw numbers.

This is not a penalty. A new trader with strong performance always scores above 50 — just not as far above as the raw stats suggest. Think of it as earned trust. The system says: "Promising, but I need more evidence before I stake my reputation on it." As the evidence arrives, the score converges to reality. No one gets punished. Judgment is simply reserved until the data warrants it.
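The blend described above can be written as a single line. This is a sketch: the symbol c for the confidence weight (between 0 and 1) is an assumption for illustration, and only the baseline of 50 comes from the text.

```latex
\text{final} = c \cdot \text{raw} + (1 - c) \cdot 50 = 50 + c \,(\text{raw} - 50), \qquad 0 \le c \le 1
```

The second form shows why this is not a penalty: whenever raw is above 50 and c is greater than zero, the final score is also above 50, just closer to it than the raw score is.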

The Confidence Curve

The system assigns a confidence value between 0 and 1 based on how many markets a trader has resolved. With just a handful of markets, confidence is low — the system is mostly skeptical. As the market count grows, confidence rises steadily. Once a trader has resolved enough markets, confidence reaches 1.0 and raw performance speaks for itself.

The curve is designed so that confidence builds gradually — it does not hand out trust too quickly to newcomers with thin track records, but it does give real credit for early performance. The shape mirrors how statistical reliability actually scales with sample size — the same principles used in clinical trials, election polling, and quantitative finance.

The practical result: a trader with strong numbers and a thin history will score above average, but not as high as their raw stats suggest. As they trade more markets, their score converges toward their true performance. The system rewards evidence, not first impressions.
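A minimal sketch of such a curve, assuming a square-root shape capped at 1.0. The constant of 100 markets for full confidence is inferred from the examples in this article (25 markets reads as 0.50 on the slider, and the worked example treats 100 markets as a full track record). It is not a published parameter, and the function name is hypothetical.

```python
import math

# Assumption: full confidence at 100 resolved markets, inferred from the
# article's examples rather than taken from published parameters.
FULL_CONFIDENCE_MARKETS = 100

def confidence(markets: int, full_at: int = FULL_CONFIDENCE_MARKETS) -> float:
    """Square-root confidence curve, clamped to the range 0..1."""
    if markets <= 0:
        return 0.0
    return min(1.0, math.sqrt(markets / full_at))

for n in (1, 10, 25, 50, 100, 200):
    print(n, round(confidence(n), 2))
```

Under this assumption, a quarter of the full track record already earns half the confidence, which matches the idea of giving real credit for early performance without handing out full trust.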

[Interactive chart: Confidence Curve. Confidence versus number of markets traded, on an axis from 0 to 200 markets. Example reading: 25 markets gives a confidence of 0.50.]

Shrinkage in Action

The core idea is simple: every trader's final score is a blend of their raw performance and a neutral baseline. When confidence is low (few markets), the score stays close to the baseline — no data, no strong opinion. When confidence is high (many markets), the raw performance shines through untouched. Everything in between is a proportional blend that shifts as evidence accumulates.

Consider a trader with genuinely elite raw numbers but only a handful of markets. The system acknowledges the strong start but holds back — "Great start, but I have seen hot streaks before." As the market count grows, the score rises steadily. Once they have built a full track record, their raw performance comes through completely. Same skill level, but the grade only reflects it once the data is strong enough.

This works in reverse too. A trader with poor numbers but few markets gets pulled toward the baseline rather than punished immediately — a few bad results could just be bad luck. With a full track record, poor performance comes through clearly. The system protects you on both sides: it will not overrate lucky beginners or prematurely dismiss unlucky ones.
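The blend can be sketched in a few lines. This assumes a square-root confidence curve that reaches 1.0 at 100 markets, which is inferred from the article's examples rather than stated outright; the function name `shrunk_score` is hypothetical. With those assumptions, the sketch reproduces the article's example of a raw score of 85 at 10 markets landing at 61.1.

```python
import math

BASELINE = 50.0  # neutral prior stated in the article

def shrunk_score(raw: float, markets: int, full_at: int = 100) -> float:
    """Blend a raw score with the baseline, weighted by confidence.
    Assumption: confidence = min(1, sqrt(markets / full_at))."""
    c = min(1.0, math.sqrt(max(markets, 0) / full_at))
    return BASELINE + c * (raw - BASELINE)

print(round(shrunk_score(85, 10), 1))   # 61.1  strong start, held back
print(round(shrunk_score(15, 10), 1))   # 38.9  weak start, pulled up
print(round(shrunk_score(85, 100), 1))  # 85.0  full track record
```

The pull is symmetric: the same 10-market confidence moves a raw 85 down by about 24 points and a raw 15 up by about 24 points.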

[Interactive example: Shrinkage in Action. A raw score of 85 with 10 markets gives a final score of 61.1 after Bayesian shrinkage.]

The Six Scoring Components

The raw score comes from six metrics used across quantitative finance to separate skill from luck. Profit Magnitude measures total profit on a log scale — bigger profits rank higher. A $100K trader scores higher than a $1K trader, but the log scale prevents outlier windfalls from dominating. It carries significant weight because sustained, large-scale profitability is the clearest signal of genuine edge.
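To see what a log scale does to profit, here is an illustrative sketch. The article does not say how Profit Magnitude is normalized, so the base-10 logarithm, the $1M cap, and the zero score for non-positive profit are all assumptions chosen only to show the effect.

```python
import math

def profit_magnitude(total_profit: float, cap: float = 1_000_000.0) -> float:
    """Map total profit to a 0..1 score on a log10 scale.
    The cap and the handling of profits at or below $1 are hypothetical."""
    if total_profit <= 1.0:
        return 0.0
    return min(1.0, math.log10(total_profit) / math.log10(cap))

for profit in (1_000, 100_000, 10_000_000):
    print(profit, round(profit_magnitude(profit), 2))
```

In this sketch a $100K trader earns 100 times the profit of a $1K trader but scores only about 0.83 against 0.5, and a windfall beyond the cap adds nothing further.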

Sharpe Ratio is the gold standard of risk-adjusted return — it measures profit earned per unit of risk. Capital Efficiency evaluates return on volume traded (PnL ÷ Volume), rewarding traders who extract more profit per dollar deployed. Profit Factor is the simplest metric: total money won divided by total money lost. A profit factor of 2.0 means the trader earns $2 for every $1 they lose.
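The three ratios in this paragraph can be sketched with textbook definitions. The function names and the conventions below (a zero risk-free rate, no annualization, and `None` for an undefined metric) are assumptions for illustration and may differ from what 0xInsider computes.

```python
import statistics

def sharpe_ratio(returns):
    """Mean return per unit of volatility (sample standard deviation).
    Assumes a zero risk-free rate and no annualization."""
    if len(returns) < 2:
        return None
    spread = statistics.stdev(returns)
    return statistics.mean(returns) / spread if spread > 0 else None

def capital_efficiency(pnl, volume):
    """Profit per dollar of volume traded (PnL / Volume)."""
    return pnl / volume if volume > 0 else None

def profit_factor(pnls):
    """Total money won divided by total money lost."""
    won = sum(p for p in pnls if p > 0)
    lost = -sum(p for p in pnls if p < 0)
    return won / lost if lost > 0 else None

print(profit_factor([200, -100, 300, -150]))  # 2.0
print(capital_efficiency(50, 1000))           # 0.05
```

Returning `None` when a ratio has no meaningful value (for example, a profit factor with no losing trades) keeps an undefined metric from being mistaken for a real score.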

Max Drawdown measures the worst peak-to-trough drop — the scariest moment in a trader's equity curve. Traders who avoid deep holes demonstrate the discipline that separates professionals from gamblers. Consistency tracks what percentage of active trading days are profitable. Each metric is weighted according to its predictive value, and when a metric is undefined, it gets excluded and the remaining weights are renormalized. No one receives an inflated score from a data edge case.
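Max Drawdown and Consistency can be sketched the same way. The definitions below are the common textbook ones; treating a day with non-zero PnL as an "active trading day" is an assumption, as are the function names.

```python
def max_drawdown(equity):
    """Worst peak-to-trough drop, as a fraction of the running peak."""
    peak = None
    worst = 0.0
    for value in equity:
        if peak is None or value > peak:
            peak = value
        if peak > 0:
            worst = max(worst, (peak - value) / peak)
    return worst

def consistency(daily_pnls):
    """Share of active days that were profitable.
    Assumption: a day counts as active when its PnL is non-zero."""
    active = [p for p in daily_pnls if p != 0]
    if not active:
        return None
    return sum(1 for p in active if p > 0) / len(active)

print(max_drawdown([100, 120, 90, 130, 104]))  # 0.25
print(consistency([50, -20, 0, 30, 10]))       # 0.75
```

In the equity curve above, the drop from 120 to 90 (25%) is deeper than the later drop from 130 to 104 (20%), so the first one sets the max drawdown.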

Scoring Components

Six metrics combined into a single raw score. Each is weighted by its predictive value. Undefined metrics are excluded and remaining weights renormalized.

Profit Magnitude: Total profit on a log scale
Sharpe Ratio: Risk-adjusted return
Capital Efficiency: Return on volume traded
Profit Factor: Gross wins ÷ gross losses
Max Drawdown: Worst peak-to-trough decline
Consistency: Percentage of profitable days
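Excluding undefined metrics and renormalizing the remaining weights can be sketched as follows. The weights here are placeholders invented for illustration, since the article does not publish the real ones. Each input is assumed to be already normalized to a 0-to-1 scale where higher is better, with `None` marking an undefined metric.

```python
from typing import Dict, Optional

# Hypothetical weights for illustration only; the real values are not published.
WEIGHTS = {
    "profit_magnitude": 0.25,
    "sharpe_ratio": 0.20,
    "capital_efficiency": 0.15,
    "profit_factor": 0.15,
    "max_drawdown": 0.15,
    "consistency": 0.10,
}

def raw_score(normalized: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted average of the defined metrics, scaled to 0..100.
    Undefined (None or missing) metrics are dropped and the remaining
    weights are renormalized so they still sum to 1."""
    usable = {
        name: normalized[name]
        for name in WEIGHTS
        if normalized.get(name) is not None
    }
    weight_total = sum(WEIGHTS[name] for name in usable)
    if weight_total == 0:
        return None
    weighted = sum(WEIGHTS[name] * value for name, value in usable.items())
    return 100.0 * weighted / weight_total

metrics = {name: 0.8 for name in WEIGHTS}
metrics["sharpe_ratio"] = None  # e.g. too few returns to compute it
print(raw_score(metrics))       # about 80: the missing metric neither helps nor hurts
```

Dividing by the sum of the weights actually used is what keeps a missing metric from inflating or deflating the result.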

From Score to Grade

Final scores map to letter grades you can read at a glance. Grade S (85+) is elite — fewer than 5% of tracked traders reach it. When you see an S-grade trader, their risk-adjusted returns have been tested against sample size, volatility, and consistency. These are the traders worth studying. Grade A (70+) marks the top quartile — consistent, skilled, and backed by real evidence.

Grade B (55+) is where promising traders live. Many are genuinely skilled but have not built enough track record for the system to fully trust their numbers. If you spot a B-grade trader with strong raw stats and a growing market count, pay attention: their grade is likely climbing. You may be seeing skill before the system has fully confirmed it. Grade C (40+) is the most common grade for newer traders, because its band (40 to 54) contains the Bayesian prior of 50, the value that thin track records are pulled toward. It means either mediocre performance or simply not enough data to decide.

Grade D (25–39) seems like it should be easy to get, but it actually requires evidence. The system needs enough data to confirm performance is genuinely below average before pulling the score below 40. New traders almost never receive a D — shrinkage protects them by pulling scores toward 50. Grade F (below 25) is the lowest tier and the hardest to reach — it takes sustained, confirmed poor performance across enough markets for the system to push the score that low. Both D and F grades are confident statements that a track record is poor, not snap judgments from a few unlucky markets.
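The grade bands translate directly into a lookup. The thresholds below are the ones stated in this section; the function and constant names are hypothetical.

```python
# Lower bound of each grade, highest first (thresholds from this section).
GRADE_FLOORS = [(85, "S"), (70, "A"), (55, "B"), (40, "C"), (25, "D")]

def grade(final_score: float) -> str:
    """Map a final (post-shrinkage) score on 0..100 to a letter grade."""
    for floor, letter in GRADE_FLOORS:
        if final_score >= floor:
            return letter
    return "F"

for score in (92, 78, 61.1, 50, 30, 10):
    print(score, grade(score))
```

A score of exactly 50, the prior that thin track records are pulled toward, maps to C.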

[Chart: Grade Bands. Final scores map to letter grades, with band boundaries at 0, 25, 40, 55, 70, 85, and 100. S is elite. F means confirmed poor performance.]

Worked Example: Meet Trader Alex

Picture a trader you follow on 0xInsider; call them Alex. Alex has 25 resolved markets with strong metrics: a solid Sharpe ratio, a healthy profit factor, a controlled max drawdown, and above-average consistency. Each metric is normalized to a 0-to-1 scale. Combined, Alex's raw score lands solidly in A-grade territory on pure merit.

But Alex has only 25 markets. The system's confidence is moderate: strong enough to push the score above average, but not strong enough to let the raw score through in full. After shrinkage, Alex lands at a B grade. The system says: "Alex looks good, but 25 markets is not enough for full conviction."

Fast-forward. Alex trades 75 more markets at the same quality. With a full track record, confidence reaches maximum. The final score now reflects Alex's true performance: Grade A. Nothing changed about Alex's skill — only the evidence caught up. This is exactly the trajectory to watch for on the leaderboard: traders whose grade is climbing not because they got lucky recently, but because growing data is confirming what the early numbers suggested.
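As a rough check of this trajectory, suppose Alex's raw score is 78. That number is hypothetical, since the article gives no exact value. The confidence curve is assumed to be the square root of markets divided by 100, capped at 1, which is inferred from the examples in this article rather than stated as the exact formula.

```latex
\begin{aligned}
n = 25:  \quad & c = \sqrt{25/100} = 0.5, \quad \text{final} = 50 + 0.5\,(78 - 50) = 64 \;\Rightarrow\; \text{Grade B} \\
n = 100: \quad & c = \sqrt{100/100} = 1, \quad \text{final} = 50 + 1\,(78 - 50) = 78 \;\Rightarrow\; \text{Grade A}
\end{aligned}
```

The raw score is identical in both rows. Only the confidence weight changes, and that alone moves Alex from B to A.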

What This Means for Your Next Decision

Next time you open the 0xInsider leaderboard, you will read it differently. An S-grade trader is not just someone with big returns — they are someone whose returns have been stress-tested against sample size, risk, and consistency. A B-grade trader with 15 markets and strong raw stats is not mediocre — they are early in the process of proving themselves. And a high P&L number paired with a C grade is the system telling you: not enough evidence yet.

Without Bayesian shrinkage, the top of the leaderboard would be a revolving door of small-sample outliers — impressive for a moment, gone the next. With it, the rankings reward the one thing that matters in prediction markets: repeatable, risk-adjusted performance backed by real evidence.

The grades, confidence scores, and component breakdowns are all visible on every trader profile. Now that you understand the math behind them, you know exactly what those numbers mean — and why they are the most reliable signal available for finding genuine skill in prediction markets.
