How 538's pollster ratings work

Our new methodology considers each pollster's accuracy and transparency.

January 25, 2024, 10:01 AM

538's pollster ratings quantify the empirical track record and methodological transparency of each polling firm and are an important ingredient in our polling averages and election forecasts. In January 2024, after doing a lot of thinking about how best to communicate "trust" in a pollster, we debuted a new way of calculating pollster ratings to give readers a comprehensive portrait of each pollster's quality. Our new pollster ratings are based on two fundamental criteria:

  1. Accuracy, as measured by the average error and average bias of a pollster's polls. We quantify error by calculating how close a pollster's surveys land to actual election results, adjusting for how difficult each contest is to poll. Bias is error with direction taken into account: whether a pollster systematically overestimates Republicans or Democrats.
  2. Methodological transparency, which we measure directly by tracking how much information each pollster releases about the way it conducts its polls. In essence, we now measure trust in a pollster not just based on the accuracy of its work but also based on how much of its work it shows us.

Our pollster-ratings dashboard shows how each pollster scores on both of these fronts. It also lists a star rating for each pollster based on how well it does on both dimensions. Below are all the steps we take to calculate all of these numbers.

1. Gather polls

Our first step is to meticulously gather a comprehensive database of polls to analyze. Our ratings are based on all national and state-level polls that meet our methodological and ethical standards and:

  • Were conducted in 1998 or later.
  • Surveyed presidential primaries (or caucuses);* general elections for president, U.S. Senate, U.S. House (including generic congressional ballot polls) or governor; "jungle primaries" for Senate, House or governor; or runoff elections for Senate, House or governor. This includes special elections.
  • Have a median field date within 31 days of the date of the election, except for presidential primaries, for which we use a 14-day window instead.

There are also a few special cases we need to account for:

  • If a poll includes results among multiple populations, we use the narrowest one (i.e., likely voters over registered voters and registered voters over all adults).
  • When a poll asks about a given election multiple times with different response options (for instance, a head-to-head race and a three-way race with an independent candidate), we pick the version with the largest number of response options.
  • In contests that use ranked-choice voting to decide the winner, polls often publish results at each stage of reallocation. We include each set of results as separate polls.
  • If a pollster releases daily or frequent "tracking polls" based on a rolling sample of interviews, we filter its data so that we are only using surveys that don't overlap.
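
To make the tracking-poll rule concrete, here is a minimal Python sketch of one way to keep only non-overlapping waves of a tracking poll. It is an illustration rather than 538's actual filter, and the start_date/end_date columns are hypothetical.

```python
import pandas as pd

def non_overlapping_waves(tracker: pd.DataFrame) -> pd.DataFrame:
    """Greedily keep tracking-poll waves whose field periods don't overlap.

    Walks the waves in order of their end dates and skips any wave whose
    field period begins before the last kept wave ended."""
    kept, last_end = [], pd.Timestamp.min
    for row in tracker.sort_values("end_date").itertuples():
        if row.start_date > last_end:
            kept.append(row.Index)
            last_end = row.end_date
    return tracker.loc[kept]
```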

We then assign each poll to a pollster to be rated. For standalone polling firms, this is easy, but other times, it can be harder than you might think. In general, we assign polls to the organization that was most instrumental in collecting and adjusting their data. Here's how we handle some edge cases:

  • First, polls sponsored by a media organization that did not have any role in the survey’s data-generating process are assigned to the company that actually generated the data. For example, polls conducted by SurveyMonkey for The New York Times are assigned to SurveyMonkey. Polls conducted jointly, on the other hand, are a different matter: For example, polls conducted in-house by The New York Times and its polling partners at Siena College are called The New York Times/Siena College polls.
  • Second, when pollsters team up with each other, we treat the partnership as a separate "pollster" than if the firms did the polls on their own. For instance, polls conducted jointly by Beacon Research (a Democratic firm) and Shaw & Company Research (a Republican pollster) are assigned to the pollster "Beacon Research/Shaw & Company Research" and do not factor into the ratings of either Beacon or Shaw & Company on their own.
  • Finally, when the same people are behind multiple firms that use the same polling methodology, those firms get rolled together into one pollster for rating purposes.

2. Calculate Excess Error and Excess Bias

Once we have assembled our dataset of polls, we can begin evaluating how accurate they were. The first step is to calculate a Raw Error and Raw Bias metric for each poll. Raw Bias is the margin between the top two candidates in the poll minus the margin between those same candidates in the actual election result. Directionality matters here: A positive Raw Bias means the poll overestimated support for Democrats, and a negative Raw Bias means the poll overestimated Republicans. Raw Error is the absolute value of Raw Bias (in other words, it's the same thing, except directionality doesn't matter). For example, if a poll showed the Democratic candidate leading by 2 percentage points but she actually won by 5 points, that poll's Raw Bias is -3 points (overestimating Republicans), and its Raw Error is 3 points.
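
As a concrete illustration (not 538's actual code), here is a short Python sketch of the Raw Bias and Raw Error arithmetic, using hypothetical vote shares that reproduce the 2-point poll margin and 5-point result from the example above.

```python
def raw_bias(poll_dem, poll_rep, result_dem, result_rep):
    """Poll margin minus actual margin, measured as (Dem - Rep).
    Positive: the poll overestimated the Democrat.
    Negative: the poll overestimated the Republican."""
    return (poll_dem - poll_rep) - (result_dem - result_rep)

def raw_error(poll_dem, poll_rep, result_dem, result_rep):
    """The absolute value of Raw Bias."""
    return abs(raw_bias(poll_dem, poll_rep, result_dem, result_rep))

# The example from the text: the poll had the Democrat up 2; she won by 5.
print(raw_bias(48, 46, 51, 46))   # -3, i.e., the poll overestimated Republicans
print(raw_error(48, 46, 51, 46))  # 3
```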

We calculate these values only for polls that identified the correct first- and second-place candidates. Further, we calculate Raw Bias only for polls with both Democratic and Republican candidates (we cannot detect the partisan bias of a survey if there are only two Democrats in the race).

Of course, we should expect that some surveys will have higher errors and biases than others. A poll with a sample size of 500 people, for example, has a larger margin of error than a poll of 5,000 people and should usually be less accurate. So the next thing we have to do is to calculate Excess Error and Bias.

For this, we need a benchmark — to know what the raw readings are in excess of. This is easy for bias: Since we expect polls to have no partisan bias on average, we use 0 as our benchmark, and a poll's Excess Bias is simply the same as its Raw Bias.**

But for Excess Error, we run a multilevel regression model on every nonpartisan poll in our dataset to calculate how much error we would expect it to have, based on the implied standard deviation from its sample size, the square root of the number of days between the median date of the poll and the election, plus variables for the cycle in which the poll was conducted and the type of election it sampled (e.g., presidential primary, presidential general election, gubernatorial general election, Senate general election, House general election or House generic ballot). Each poll's Excess Error, then, is simply its Raw Error minus that expected error. Our regression weights polls by the square root of their sample size (capped at 5,000 to avoid giving any one poll too much weight), the number of polls in each race and the number of pollsters surveying each race.
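
Here is a simplified sketch of that calculation in Python. Plain weighted least squares (via statsmodels) stands in for 538's multilevel model, the column names are hypothetical, and multiplying the three weighting factors together is an assumption about how they are combined.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def add_excess_error(polls: pd.DataFrame) -> pd.DataFrame:
    """Fit expected error for nonpartisan polls, then subtract it from Raw Error.

    Assumed columns: raw_error, sample_size, days_to_election, cycle,
    election_type, n_polls_in_race, n_pollsters_in_race."""
    df = polls.copy()
    # One common way to get the implied standard deviation of a 50/50 share
    # (in percentage points) from the sample size.
    df["implied_sd"] = 100 * np.sqrt(0.25 / df["sample_size"])
    df["sqrt_days"] = np.sqrt(df["days_to_election"])
    weights = (np.sqrt(df["sample_size"].clip(upper=5000))  # capped at 5,000
               * df["n_polls_in_race"]
               * df["n_pollsters_in_race"])
    model = smf.wls("raw_error ~ implied_sd + sqrt_days + C(cycle) + C(election_type)",
                    data=df, weights=weights).fit()
    df["expected_error"] = model.predict(df)
    df["excess_error"] = df["raw_error"] - df["expected_error"]
    return df
```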

Finally, we calculate an Excess Error and Excess Bias score for each pollster by taking a weighted average of the Excess Error and Bias of each of its polls, with older polls given less weight. The precise amount of decay changes every time we add more polls to the database, but it's currently about 14 percent a year. For example, a 1-year-old poll would be weighted 86 percent as much as a brand-new poll; a 2-year-old poll would be weighted about 74 percent (86 percent times 86 percent) as much; and so on.
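
The decay itself is simple to express in code. Below is a minimal sketch of the time-weighted average, with the roughly 14 percent annual decay from the text as the default.

```python
import numpy as np

def time_weighted_average(values, ages_in_years, annual_decay=0.14):
    """Average poll-level statistics with older polls weighted down by
    roughly 14 percent per year of age (the current decay in the text;
    it changes as the database grows)."""
    weights = (1 - annual_decay) ** np.asarray(ages_in_years, dtype=float)
    return np.average(np.asarray(values, dtype=float), weights=weights)

# A fresh poll gets weight 1.0, a 1-year-old poll 0.86, a 2-year-old ~0.74.
print(time_weighted_average([2.0, 4.0, 6.0], [0, 1, 2]))
```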

You can find each pollster's Excess Error and Excess Bias scores in our pollster-ratings database. However, while these scores give us a decent first approximation of pollster accuracy, they can be misleading for pollsters that survey races that are particularly hard to poll. That's why it's important to …

3. Adjust for race difficulty

In the 2010 Senate race in Hawaii, a poll from Public Policy Polling found Democrat Daniel Inouye leading Republican Cam Cavasso by 36 points. In the actual election, Inouye won by 53 points. At first glance, that looks like a pretty huge polling error — and, strictly speaking, it is. A miss of 17 points is in the top 2.5 percent of all polling errors in our database.

However, the other polls of the race were even worse. One, from Rasmussen Reports, found Inouye leading by just 13 points (an error of 40 points). Another, from MRG Research, had him up 32 (a 21-point whiff). So in comparison with other polls of the race, the PPP poll looks pretty good. The other surveys have an average error of 31 points — 14 points higher than the PPP poll's.

That Hawaii Senate race is a good example of why Raw, or even Excess, Error and Bias can be misleading on their own. So the next step in our process is to adjust each poll's Excess Error and Bias based on how difficult of a race it polled. First, for every poll in our database, we calculate the weighted average Excess Error and Bias for all other polls in the race, with polls with larger sample sizes given higher weight. Then, we subtract those values from the Excess Error and Bias of the poll. That gives us the Relative Excess Error and Bias of every poll.

Finally, we calculate statistics called Adjusted Error and Adjusted Bias, which are a weighted combination of each poll's Excess Error/Bias and Relative Excess Error/Bias, where the Relative Excess statistics get more weight when more pollsters release more surveys in a given race. We make this adjustment to reflect the fact that we are more confident in a pollster's relative performance in a race when the benchmark we're comparing them against (all the other pollsters in a race) is based on a larger sample of data.

Let's explore how this works for the PPP poll. Although the poll was off by 17.2 points, given its sample size and time before the election, we expected it to have an error of 5.6 points. So the poll's Excess Error was 11.6 points. The Excess Error of other polls in the race was 25.7 points, making the Relative Excess Error of the PPP poll -14.2 points. But since there weren't that many polls and pollsters in this race (three), we put less weight on the Relative Excess Error of the poll. Our final calculation of this poll's Adjusted Error, then, turns out to be 1.2 points. That's about halfway between its Excess Error (+11.6) and Relative Excess Error (-14.2).
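
Numerically, the blend works like the sketch below. The 40 percent weight on the Relative Excess Error is purely illustrative (538 does not publish the exact weighting), but it lands close to the Adjusted Error reported above.

```python
def adjusted_error(excess, relative_excess, weight_on_relative):
    """Blend a poll's Excess Error with its Relative Excess Error; the weight
    on the relative figure grows as more polls and pollsters cover the race."""
    return (1 - weight_on_relative) * excess + weight_on_relative * relative_excess

# Hawaii example: Excess Error +11.6, Relative Excess Error -14.2. With an
# illustrative 40 percent weight on the relative figure (a lightly polled
# race), the blend comes out to about 1.3 points.
print(adjusted_error(11.6, -14.2, 0.4))
```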

Once we repeat this process for every poll, we can create summary Adjusted Error and Adjusted Bias scores for every pollster. We compute those using the same time-weighted average formula as for their (unadjusted) Excess Error and Excess Bias.

A word on our bias calculation: While we use a directional bias at the poll level, we convert this number back to an absolute reading at the pollster level. (Don't worry, though — we publish the signed version in our database so readers can look at it.) We do this because, when it comes to putting more or less weight on a poll in a polling average, we don't really care whether its pollster tends to systematically overestimate Republicans or Democrats — we care about whether it is systematically biased at all.

All this adjusting gives us a reasonable estimate of the empirical quality of a pollster today. But Adjusted Error and Bias have both mathematical and theoretical weaknesses. The mathematical difficulty with Adjusted Error and Bias is that they do not account for luck. A pollster that releases one really, really accurate survey could have an Adjusted Error of -10, but we wouldn't necessarily expect its future polls to be that good. And in theory, what's ultimately important to us at 538 is not how a pollster has performed up to now, but how it will perform in the near future (i.e., on the next election day). So our final modeling step is to turn Adjusted Error and Adjusted Bias into predictions.

4. Calculate Predictive Error and Predictive Bias

The most straightforward way to turn a pollster's Adjusted Error and Adjusted Bias into a prediction is to combine them with some proxy for the firm's future quality. For this, we use pollsters' membership in a transparency-oriented polling organization and ties to partisan groups. Our research has found that pollsters who share their data with the Roper Center for Public Opinion Research or are part of the American Association for Public Opinion Research's Transparency Initiative tend to have lower error than pollsters that don't. We've also found that pollsters that conduct at least 10 percent of their (publicly released) polls for partisan clients tend to have higher error and bias than pollsters who work with partisan clients less than 10 percent of the time. (This percentage is calculated with a weighted mean, where older polls are weighted down according to the aforementioned exponential decay.)

Using four separate regression models that look at the relationship between these variables and pollster accuracy, we can estimate what each firm's Adjusted Error and Adjusted Bias should be based only on its affiliation with Roper or the AAPOR Transparency Initiative and whether at least 10 percent of its polls are conducted with partisan sponsors. Our prior for each pollster's error is the average of the predictions from the two error models; we do the same for bias.

Now that we have both a prior and actual reading for each pollster's Adjusted Error and Bias, we can calculate its Predictive Error and Predictive Bias using the following formula:

Predictive Error = Adjusted Error * (n / (n + n_shrinkage)) + group_error_prior * (n_shrinkage / (n + n_shrinkage))

And Predictive Bias is derived similarly:

Predictive Bias = Adjusted Bias * (n / (n + n_shrinkage)) + group_bias_prior * (n_shrinkage / (n + n_shrinkage))

Here, n is the time-weighted number of polls the pollster has released, and n_shrinkage is an integer that represents the effective number of polls' worth of weight we should put on the prior. (Details about how we picked this number are in the conclusion section below.)
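
In code, the shrinkage looks like this minimal sketch; the n_shrinkage value of 20 is purely illustrative (the real value is tuned, as described in the conclusion).

```python
def shrink_toward_prior(adjusted_value, group_prior, n_effective, n_shrinkage=20):
    """Predictive Error/Bias: shrink a pollster's Adjusted value toward its
    group prior, with the pollster's own record weighted by its time-weighted
    poll count n_effective."""
    w = n_effective / (n_effective + n_shrinkage)
    return adjusted_value * w + group_prior * (1 - w)

# A pollster with only a few polls stays close to its group prior ...
print(shrink_toward_prior(adjusted_value=-1.5, group_prior=0.5, n_effective=4))
# ... while a prolific pollster's own record dominates.
print(shrink_toward_prior(adjusted_value=-1.5, group_prior=0.5, n_effective=400))
```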

We also add a penalty to Predictive Error and Predictive Bias for pollsters that we suspect are putting their thumb on the scale. Specifically, we are concerned with pollster "herding" — i.e., whether a firm is changing its methods to make its numbers more similar to results from other polls. We have found that some pollsters hew suspiciously close to polling averages, especially when other high-quality pollsters have released surveys in a given race.

We measure herding with a two-step process. The first step is to calculate how far away each poll landed from a benchmark polling average in each race — a statistic for each poll that we call the Absolute Deviation from Average (or ADA). For each poll in our dataset, we calculate an exponential moving average of all nonpartisan polls with a median field date at least two days before the target poll's median field date. We calculate this average only if the contest has at least five preexisting nonpartisan polls from at least three pollsters. This average includes only the latest poll from each pollster. Polls receive a daily decay of about 7 percent, close to the setting for our polling averages. From here, we can compute a pollster's Average Absolute Deviation from Average (AADA for short) by averaging together the ADAs for all its polls, weighted by the number of polls for the average and the sample size of the polls.
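
Here is a rough Python sketch of the benchmark average and ADA calculation, assuming the polls have already been filtered to nonpartisan surveys of a single race and carry hypothetical pollster, median_date and margin columns.

```python
import numpy as np
import pandas as pd

def benchmark_average(prior_polls: pd.DataFrame, target_date, daily_decay=0.07):
    """Exponentially decayed average of earlier polls in the same race.
    Returns None when the benchmark would rest on too little data."""
    eligible = prior_polls[prior_polls["median_date"]
                           <= target_date - pd.Timedelta(days=2)]
    # Keep only the most recent poll from each pollster.
    latest = eligible.sort_values("median_date").groupby("pollster").tail(1)
    if len(eligible) < 5 or latest["pollster"].nunique() < 3:
        return None
    age_in_days = (target_date - latest["median_date"]).dt.days
    weights = (1 - daily_decay) ** age_in_days
    return np.average(latest["margin"], weights=weights)

def absolute_deviation_from_average(poll_margin, benchmark):
    """ADA: how far a poll landed from the preexisting average."""
    return abs(poll_margin - benchmark)
```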

Generally speaking, we are suspicious of pollsters that have AADAs that are very low. For example, if a pollster conducted 100 polls of 500 people each, we would expect it to have an AADA of about 1.8 percentage points.*** But most pollsters do not conduct hundreds of polls, and since some types of contests and individual races are harder to poll, the precise benchmark AADA for a given pollster can vary by a lot.
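
That 1.8-point benchmark comes from the simulation described in the footnote, which can be reproduced in a few lines (the seed is arbitrary):

```python
import numpy as np

# 100 unbiased fake polls whose deviations from the average have a standard
# deviation of 2.2 points; the mean absolute deviation comes out close to
# 1.8 on average (the exact expectation is 2.2 * sqrt(2 / pi), about 1.76).
rng = np.random.default_rng(538)
fake_deviations = rng.normal(loc=0.0, scale=2.2, size=100)
print(np.mean(np.abs(fake_deviations)))
```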

So the second step is to quantify the pollster's influence on its polls' distance from the average. We made a model to predict this based on a few important factors. One is, of course, the poll's sample size; polls with lower sample sizes have larger margins of error and, therefore, should have larger deviations from the average. Another factor is the uncertainty in the underlying average; an average that wiggles around more would also produce larger deviations. Then there are a host of other contextual variables we adjust for: the time before the election, how competitive the election was, the number of polls used in calculating the average, the specific race the poll measured, the type of contest (House, Senate, or presidential general or primary election) it polled and the cycle in which the poll occurred.

The herding model also incorporates two variables at the pollster level. The first is a variable for the pollster responsible for the survey. That ought to capture any impacts the pollster may have on its AADA after controlling for other factors. The second variable tests whether that pollster behaved differently in contests in which a live-caller phone pollster that's a member of AAPOR's Transparency Initiative or shares its data with the Roper Center had already released a poll**** versus contests that did not have any such polling. Each pollster's herding penalty is the model's estimate for how much it behaved differently in these two types of races.

The final step in calculating Predictive Error and Predictive Bias is to add the herding penalty on top of each. For nonpartisan pollsters that associate with AAPOR or the Roper Center and conduct less than 50 percent of their polls using interactive voice response (IVR, an automated interviewer rather than a live human), as well as for any pollster with a Transparency Score of at least 8 out of 10 and less than 50 percent IVR polls (see next section), the herding penalty is reduced according to the pollster's Transparency Score. For all other pollsters, we add the full penalty.

5. Calculate pollster transparency

So far, we've been dealing strictly with the accuracy of pollsters' polls. But you'll recall that our pollster ratings are also based on how transparent pollsters are. So the next step is to calculate a Transparency Score for each pollster.

To do this, we first ask ourselves 10 yes-or-no questions about every poll in our pollster-ratings database going back to 2016.***** We developed these questions in partnership with Mark Blumenthal, a pollster, past 538 contributor and co-founder of the now-defunct poll-aggregation website Pollster.com, and by consulting the AAPOR transparency guidelines.

  1. Did the pollster publish the exact trial-heat question wording used in this poll?
  2. Did the pollster publish the exact question wording and response options for every question mentioned in the poll release?
  3. Did the pollster release both weighted and unweighted sample sizes for any demographic groups, or acknowledge the existence of a design effect in its data?
  4. Did the pollster publish crosstabs for every subgroup mentioned in the poll release?
  5. Did the pollster disclose the sponsor of the poll (if there was a sponsor)?
  6. Did the poll specify how the sample was selected (e.g., via a probability-based or non-probability method)? If the sample was probability-based, was the sampling frame disclosed? If it was non-probability, did the pollster disclose which marketplace or online panels were used to recruit respondents, or its model for respondent selection?
  7. Did the pollster list at least three of the variables the poll is weighted on?
  8. Did the pollster disclose the source of its weighting targets (e.g., "the 2022 American Community Survey")?
  9. Did the poll report a margin of error or sample size for a "critical mass" of subgroups? We do not mandate this be a complete count, but if it looks like groups are intentionally missing (e.g., they are referenced in the press release but are missing in the crosstab documents), we withhold the point.
  10. Did the poll methodology or release include a general statement acknowledging a source of non-sampling error, such as question wording bias, coverage error, etc., in addition to the normal margin of sampling error inherent to surveying?

We award each poll a 0, 0.5 or 1 on each question; a perfect score is 10, while the worst is 0 (although, in practice, almost every poll gets at least a few points). We then calculate a Transparency Score for each pollster by taking a weighted average of the Transparency Scores of all its polls, with the Transparency Scores of older polls getting less weight.

Finally, we adjust the Transparency Score so that pollsters that are members of AAPOR's Transparency Initiative or share data with the Roper Center get a slight boost. We make this adjustment because some pollsters that share their data with Roper — which we consider a signal of maximum transparency — do not disclose some information we track (such as sample sizes or design effects), but this information could be derived by someone analyzing their shared data.

The final Transparency Score for a given pollster is a weighted average of all of the above, with 70 percent of the weight going to its directly measured transparency and 30 percent of the weight on its implied transparency — a value of 10 if it is a member of the AAPOR/Roper group, and 0 if it is not, with one exception: If a pollster's directly measured transparency is 8 or higher and it has released more than 15 polls, we treat it as an honorary member of the AAPOR/Roper group.
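
Read literally, that weighting can be sketched as follows, where measured_score stands for the decay-weighted average of a pollster's per-poll Transparency Scores described above:

```python
def final_transparency(measured_score, is_aapor_or_roper, n_polls):
    """70 percent directly measured transparency, 30 percent implied
    transparency, with an honorary-membership exception for highly
    transparent, prolific pollsters."""
    honorary = measured_score >= 8 and n_polls > 15
    implied = 10.0 if (is_aapor_or_roper or honorary) else 0.0
    return 0.7 * measured_score + 0.3 * implied

print(final_transparency(9.0, is_aapor_or_roper=False, n_polls=40))  # 9.3 (honorary)
print(final_transparency(6.0, is_aapor_or_roper=True, n_polls=12))   # 7.2
print(final_transparency(6.0, is_aapor_or_roper=False, n_polls=12))  # 4.2
```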

6. Combine error, bias and transparency into a single pollster rating

All this data work is useless without a good way to package it for readers. So we simplify the accuracy and transparency calculations to make the results more intuitive.

First, we combine the error and bias calculations into one aggregate statistic, which we've given the backronym POLLSCORE (Predictive Optimization of Latent skill Level in Surveys, Considering Overall Record, Empirically). To calculate POLLSCORE, we average Predictive Error and Predictive Bias together. This has the effect of rewarding pollsters that are both accurate and unbiased.

This leaves us with two numbers for each pollster: POLLSCORE and Transparency Score. Ranking pollsters on either metric is straightforward enough (just sort them!) — but how to combine them into a single rating? We use something called Pareto Optimality. In layman's terms, the "best" pollster in America would theoretically be one with the lowest POLLSCORE value and a Transparency Score of 10 out of 10. But in practice, the pollster with the lowest POLLSCORE and the pollster with the best Transparency Score are different pollsters. So our algorithm determines the "best" pollster to be the one that is closest to ideal on both metrics, even if it is not the best on any one metric.
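
One simple way to operationalize "closest to ideal on both metrics" is a distance-to-ideal ranking like the sketch below. 538 does not spell out its exact distance or normalization choices, so treat those details as assumptions.

```python
import numpy as np
import pandas as pd

def rank_pollsters(ratings: pd.DataFrame) -> pd.DataFrame:
    """Rank pollsters by distance to the ideal point: the field's lowest
    POLLSCORE paired with a Transparency Score of 10."""
    out = ratings.copy()
    # Put both metrics on a 0-1 scale so neither dominates the distance.
    score_span = out["pollscore"].max() - out["pollscore"].min()
    norm_score = (out["pollscore"] - out["pollscore"].min()) / score_span
    norm_transparency = out["transparency"] / 10.0
    out["distance_to_ideal"] = np.hypot(norm_score, 1.0 - norm_transparency)
    return out.sort_values("distance_to_ideal")

ratings = pd.DataFrame({
    "pollster": ["A", "B", "C"],
    "pollscore": [-1.1, -0.4, 0.8],  # lower (more negative) is better
    "transparency": [7.0, 10.0, 9.0],
})
print(rank_pollsters(ratings)[["pollster", "distance_to_ideal"]])
```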

This ranking procedure yields a single, combined score that summarizes each pollster's performance along both dimensions. For interpretability, we convert this value to a star rating, with 3.0 stars being the highest-quality pollsters and 0.5 stars representing the worst. Pollsters receive their rank ordering based on this star rating. One note here: While we calculate POLLSCORE and our Transparency Score for all pollsters, we rank only organizations or partnerships that appear to be actively publishing new surveys. A pollster is inactive either if we know the firm/partnership has shut down or if it has not published a new poll in at least 10 years.

7. Account for luck

There is one final thing for us to do, and that's account for luck. Despite all the bells and whistles of our models, there is simply no straightforward quantitative way to punish or reward pollsters that end up with good (or bad) ratings because one of their surveys was much more (or less) accurate than their surveys usually are. All of our models control for a poll's sampling error, for instance, but there are other, unmeasurable factors that can push a poll closer to or away from an election result (pollsters call this "nonsampling" error). And while Predictive Bias and Predictive Error account for some pollsters having small sample sizes (and thus a more uncertain rating), the underlying Adjusted Bias/Error calculations are based on averaging, which is subject to being pulled around by skewed data. (We tried using a median instead, but it gave us slightly worse results.)

Thus, what we really want from our ratings are not single point predictions of each pollster's POLLSCORE, Transparency Score and final rank, but distributions of them. We need a way to calculate how much each pollster's scores change if you ignore certain good (or bad) polls it has released. The method we turn to for this is called "bootstrapping." To bootstrap our model essentially means re-running all the steps we've described so far 1,000 times. Each time, we grade pollsters based on a random sample of their polls in our database. As is standard in bootstrapping, we sample the polls with replacement, meaning individual polls can be included multiple times in the same simulation. We do this to keep the number of polls we have for each pollster constant across simulations. In the end, this procedure yields 1,000 different plausible pollster scores for each organization.
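
Schematically, the bootstrap looks like the sketch below, where rate_fn is a hypothetical stand-in for the entire rating pipeline described in the preceding steps.

```python
import numpy as np
import pandas as pd

def bootstrap_scores(polls: pd.DataFrame, rate_fn, n_sims=1000, seed=538):
    """Re-rate pollsters n_sims times on resampled versions of their own polls.

    Each simulation samples every pollster's polls with replacement, keeping
    each pollster's poll count constant; rate_fn(polls) is assumed to return
    one score per pollster."""
    rng = np.random.default_rng(seed)
    sims = []
    for sim in range(n_sims):
        resampled = (polls.groupby("pollster", group_keys=False)
                          .apply(lambda g: g.sample(n=len(g), replace=True,
                                                    random_state=int(rng.integers(2**32)))))
        sims.append(rate_fn(resampled).rename(sim))
    # n_sims plausible scores per pollster; summarize with the median.
    return pd.concat(sims, axis=1).median(axis=1)
```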

Finally, we calculate the median of these simulated POLLSCOREs and Transparency Scores. Pollsters are re-ranked according to the algorithm described in Step 6. Compared to our point predictions, the bootstrapped results change rather little for most pollsters, but they punish those that score well only because they got lucky once or twice and reward pollsters that more reliably have lower bias and error than other firms. As a final hedge against modeling error here, we average the bootstrapped results and point predictions together. These numbers are the final ratings you see on our dashboard.

Conclusion

We'll end on something of a meta note about how we developed this methodology. Whenever we had a big methodological decision to make,****** we tested various options and used the one that yielded the most accurate results. This was our procedure for doing so:

  • We produced a POLLSCORE for each pollster in each methodological scenario we were entertaining (for example, one scenario in which we used polls conducted within a month of the election; one scenario in which we used polls conducted within three weeks of the election; etc.).
  • Then, for each race that (1) was not a presidential primary or caucus and (2) had at least 15 total polls, we calculated two final, election-day polling averages for each scenario. One weighted polls by sample size and recency (using an exponential decay of about 7 percent per day) alone; the other also weighted polls by their POLLSCORE. The lower the POLLSCORE, the higher the weight.
  • We then averaged the biases of each type of average across every contest in our database, putting more weight on races where more polls were published.
  • The "best" method is the one where the weighted average bias of the average with POLLSCORE is the lowest relative to the average without POLLSCORE.

For context, our averages without POLLSCORE have a weighted average bias of 4.19 percentage points (for either party) across all general-election contests where at least 15 polls were released. But with POLLSCORE, the weighted average bias falls to 4.07 points. That may not sound like a lot, but given all the noise in polling and the various idiosyncrasies of polling certain contests, it's actually a sizable decrease.

The result of all this fine-tuning is a set of model settings and methodological decisions that we are confident does the best job of producing polling averages that have the lowest possible level of statistical bias toward either political party in any given race. You can find our pollster ratings on our interactive pollster-ratings dashboard, and the underlying data is available on our GitHub page.

Footnotes

*Except polls of the New Hampshire primary conducted before the Iowa caucus; polls conducted before New Hampshire of states that hold primaries after New Hampshire; and polls whose leader or runner-up dropped out before that primary was held. We also exclude surveys if a candidate polled at 15 percent or more in them but dropped out before that primary was held, or if multiple candidates who dropped out before the primary polled at a combined 25 percent or more. This is because early-state results can significantly change the trajectory of primaries, as can candidate withdrawals.

**We experimented with our calculations for Excess Bias. First, we adjusted Raw Bias for a prediction from a multilevel model, too, but this ended up making our eventual election forecasts worse. We found that comparing Excess Bias to a benchmark of 0, all the time, gave us the best results.

***We derived this number by using the R programming language to simulate a random set of 100 fake polls that would be unbiased on average and each have a standard deviation of 2.2 points. We then took the absolute value of these fake polls and averaged them together.

****Not all phone polls are high-quality, and some members of the AAPOR Transparency Initiative that don't use phone polls are also great firms. But we found that this schema was the best one at detecting herding historically.

*****Due to issues retrieving archived poll releases and news stories, we were unable to comprehensively review transparency for polls released before 2016. We infer Transparency Scores for polls with missing data based on the pollster that conducted the poll, its methodology and partisanship, and the cycle and type of race it surveyed. We accomplish this using something called multivariate imputation with chained equations. To calculate poll-level Transparency Scores, 538 relied on codes from three coders with nearly unanimous agreement on each item. We did not calculate inter-coder reliability statistics.

******Those decisions were: (1) whether to incorporate Predictive Bias into our POLLSCORE values instead of just using Predictive Error; (2) the amount of yearly decay to apply to each poll when running regressions or aggregating statistics up to the pollster level; (3) the number of polls and number of pollsters per race to use in the weighting calculation for Adjusted Error and Bias; (4) the eligibility window for polls to be factored into our ratings (60 days before the election, 31 days before, etc.); and (5) the default number of polls we use as the shrinkage for our prior when calculating Predictive Error and Bias.