How our polling averages work
An explanation of the methodology for 538's current polling averages.
This is an old version of our polling average methodology. For the most up-to-date version, please click here.
Almost since its founding, 538 has published comprehensive averages of polls for a wide variety of questions related to U.S. politics. In June 2023, we debuted a new set of models for these averages that aims to improve both the accuracy of the underlying models and how the results are visually conveyed to our readers.
The most important differences from our old polling-average model are:
Here are all the steps we take to calculate our averages:
Which polls we include
538's philosophy is to collect as many polls as possible for every topic or race we're actively tracking — so long as they are publicly available and meet our basic criteria for inclusion. After determining that a poll meets our standards, we have to answer a few more questions about it before sending it off to the various computer programs that power our models.
How we weight and adjust polls
After all this data is in our database, we compute three weights for each survey that control how much influence it has in our average, based on the following factors:
For horse-race averages, we also decrease the sample-size weight on polls of adults or registered voters. Polls of adults receive a weight equal to half of whatever is implied by their sample size, and polls of registered voters receive 90 percent of the weight implied by their sample size. This is because surveys of adults and registered voters aren't as useful as surveys of likely voters when it comes to horse-race polling.
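As an illustration, a minimal sketch of this weighting might look like the following. The square-root form and the reference sample size of 600 are assumptions for the example, not our published formula; the 0.5 and 0.9 multipliers are the ones described above.

```python
import math

# Population multipliers from the text: likely voters get full weight,
# registered voters 90 percent, adults 50 percent (horse-race averages).
POPULATION_MULTIPLIER = {"lv": 1.0, "rv": 0.9, "a": 0.5}

def sample_size_weight(n: int, population: str, reference_n: int = 600) -> float:
    """Weight a poll by its sample size, downweighting adult/RV samples.

    The sqrt(n / reference_n) form is an assumption for this sketch; it
    simply gives diminishing returns to very large samples.
    """
    base = math.sqrt(n / reference_n)
    return base * POPULATION_MULTIPLIER[population]

# Example: a 1,000-person registered-voter poll in a horse-race average.
print(round(sample_size_weight(1000, "rv"), 2))
```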
Second, we calculate a weight that controls how much influence we give to a potential outlier poll, using a kernel weighting function. Specifically, we calculate the difference between each poll's result and an iteratively calculated benchmark polling average on the day the pollster finished its fieldwork, and input those values into the kernel weighting function. (We run the aggregation functions described in the next section three separate times: once to establish a baseline, once to adjust that baseline for outliers and once more to detect systematic population and house effects.) The kernel weighting function transforms those values into new numbers that represent the weight we should give each survey in our average: Polls that land farther from the average get lower weights. The precise way these values are transformed depends on which kernel we pick for the function, but the resulting weights range from 0 to 1. The kernel weighting function also accounts for how much these values vary for different types of averages (approval ratings, horse-race averages, etc.), which we calculate directly from the data and then scale by a positive number that gets fed into our model. This gives our model the ability to make the outlier adjustment more or less aggressive depending on the type of data it is aggregating.
We then assign an outlier downweight for each poll based on its distance from the polling average and the confidence we have that it is an outlier. For polls that we are at least 95 percent sure are outliers (using a k-nearest-neighbors algorithm), we set their weight to the value we computed with the distance-based kernel weighting function. We make sure the downweight is never less than 0.05 for any given outlier poll, which is equivalent to removing at most 95 percent of its weight in our model. All other polls get a weight of one, which is equivalent to not applying an outlier downweight at all.
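As an illustration, here is a minimal sketch of the outlier downweight. The Gaussian kernel is an assumption for the example, the per-average-type scale is taken as a precomputed input, and the outlier probability is taken as given from the separate outlier classification mentioned above; the 0.05 floor and the 95 percent threshold are the values described in the text.

```python
import math

def outlier_downweight(poll_value: float, benchmark_avg: float,
                       type_scale: float, outlier_prob: float) -> float:
    """Kernel-based outlier downweight (sketch).

    type_scale: how much poll-vs-average differences vary for this kind
    of average, times a fitted scalar (both supplied elsewhere).
    outlier_prob: confidence that this poll is an outlier.
    """
    distance = poll_value - benchmark_avg
    kernel_weight = math.exp(-0.5 * (distance / type_scale) ** 2)  # in (0, 1]
    if outlier_prob >= 0.95:
        return max(kernel_weight, 0.05)  # never remove more than 95% of the weight
    return 1.0                           # not confidently an outlier: no downweight

# Example: a poll 6 points from the benchmark in a series with scale 2.5.
print(round(outlier_downweight(52.0, 46.0, type_scale=2.5, outlier_prob=0.99), 3))
```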
Once we have these weights, we calculate a cumulative weight for each poll by multiplying the three component weights together.
The next step is to test and adjust for any factors that could be systematically shifting groups of polls in one direction. We consider three main adjustments here:
Then, we use tools from Bayesian statistics to make sure the adjustments for a given pollster are not reacting to noise in the data. That's because what looks like a house effect in an individual poll could just be abnormal amounts of random sampling (or other) error. For most polls, this shrinks our model's initial estimate of a pollster's house effect back toward 0. However, for partisan polls in horse-race averages, we assume they overestimate support for their party by about 2.4 percentage points. We arrived at this number by testing many different values between 0 and 5 points and picking the one that made our average the most accurate. (More on how we determine accuracy below.)
House effects can look large at the beginning of a series when we have few polls but tend to diminish over time as firms release more surveys. Our model adjusts for this: Specifically, the house-effects regression model gives us both an estimated mean and a standard error for each pollster's house effect, which we use to update a normal distribution with a mean of 0 and a standard deviation that our model determines using optimization (usually it's around 3 points). For national averages, we estimate house effects using only national polls. For state-level averages, we estimate house effects using the polls in that state.
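As an illustration, a minimal sketch of that shrinkage step, assuming a standard normal-normal update with a prior centered at 0 and the optimized prior standard deviation of roughly 3 points:

```python
def shrink_house_effect(estimate: float, std_error: float,
                        prior_sd: float = 3.0) -> float:
    """Shrink a pollster's estimated house effect toward 0 (sketch).

    `estimate` and `std_error` come from the house-effects regression;
    the N(0, prior_sd^2) prior reflects the "usually around 3 points"
    value mentioned above. Noisier estimates (larger std_error) are
    pulled harder toward zero.
    """
    prior_var = prior_sd ** 2
    est_var = std_error ** 2
    shrinkage = prior_var / (prior_var + est_var)
    return shrinkage * estimate

# A pollster with only a couple of polls: a +4-point lean, 3-point standard error.
print(round(shrink_house_effect(4.0, 3.0), 2))  # pulled roughly halfway back to 0
```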
We apply these adjustments as needed. For example, if we want to incorporate a likely-voter poll into our presidential approval average (which prefers polls of adults), we apply the likely-voter-to-adults adjustment. If a poll surveys multiple populations but not the one we want, we apply the adjustment to the version of the poll that's closest to what we want. For instance, if a poll surveyed both likely and registered voters, we'd adjust the registered-voter results for use in our presidential approval average. Additionally, in the event that we are unable to calculate a likely-voter-to-adults adjustment because there are no polls in our data set that survey them both, we adjust between likely voters and adults by adding together the likely-voter-to-registered-voter adjustment and the registered-voter-to-adults adjustment.
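A minimal sketch of that fallback logic (the function and argument names are illustrative, not our actual code):

```python
from typing import Optional

def likely_voter_to_adults(lv_to_adults: Optional[float],
                           lv_to_rv: Optional[float],
                           rv_to_adults: Optional[float]) -> Optional[float]:
    """If no polls let us estimate a direct likely-voter-to-adults
    adjustment, compose the two partial adjustments instead.
    All values are in percentage points."""
    if lv_to_adults is not None:
        return lv_to_adults
    if lv_to_rv is not None and rv_to_adults is not None:
        return lv_to_rv + rv_to_adults
    return None  # not enough overlapping polls to estimate an adjustment
```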
How we average polls together
Once we have collected our polls and adjusted them, we can finally calculate a polling average. Our final polling average is actually an average of two different methods for calculating a trend over time.
The first is an exponentially weighted moving average, or EWMA (a popular tool in financial analysis). The EWMA calculates an average for any given day by computing a weight for each poll based on how old it is, multiplying the poll result by that weight and adding the values together. We select the value for a parameter called decay, which determines the rate at which older data points are phased out of the average according to an exponential function. The value of decay is capped at -0.2 to prevent the model from becoming overly aggressive in responding to new data (that's what the next method is for), and it can vary depending on how quickly the average is changing: When the model detects more movement in public opinion in the recent past, it anticipates slightly more movement on its next update. We also select a value for a parameter called hard_stop, which excludes any surveys conducted more than a certain number of days in the past. We make sure this value is between 30 and 60 days, although if there are fewer than 10 polls within that window, the model extends the window one day at a time until it contains 10 polls.
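As an illustration, a minimal sketch of the EWMA step, assuming weights of the form exp(decay * age); the window-stretching rule for sparse data follows the description above, and the default parameter values are placeholders:

```python
import numpy as np

def ewma_average(values: np.ndarray, ages_in_days: np.ndarray,
                 decay: float = -0.1, hard_stop: int = 45,
                 min_polls: int = 10) -> float:
    """Exponentially weighted moving average of polls (sketch).

    decay is negative (bounded at -0.2 in the text), so exp(decay * age)
    shrinks as polls get older. hard_stop drops polls older than the
    window, but the window is stretched one day at a time until at
    least min_polls surveys remain.
    """
    window = hard_stop
    keep = ages_in_days <= window
    while keep.sum() < min_polls and window < ages_in_days.max():
        window += 1
        keep = ages_in_days <= window
    w = np.exp(decay * ages_in_days[keep])
    return float(np.sum(w * values[keep]) / np.sum(w))

# Example: ten approval polls, 1 to 70 days old.
ages = np.array([1, 3, 5, 8, 13, 21, 34, 45, 60, 70])
vals = np.array([44, 43, 45, 42, 44, 43, 46, 42, 41, 40], dtype=float)
print(round(ewma_average(vals, ages), 1))
```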
The second is a trend through points, calculated using a methodology similar to that of the now-defunct Huffington Post Pollster website and the forecasting methodology used by The Economist. We fit this trend using our custom implementation of a kernel-weighted local polynomial regression, which is similar to a lowess regression — a common tool for calculating trends through points on a scatterplot. The trendline and weight on any given poll in this regression depend on two parameters that we also have to set: the bandwidth of the kernel and the degree of the polynomial. (We allow our model to pick between a polynomial degree of either 1 or 2. When there are fewer than five polls in our average, we use a polynomial degree of 0 — which is roughly equivalent to just taking a rolling average of the polls. When there are fewer than 10 polls, we cap the degree at 1.) To guard against overfitting, we take a weighted average of the resulting trendline and a separate trendline calculated with a degree of 0 and a bandwidth equal to the bandwidth of the other trendline times 0.67, giving the better trendline (according to the Akaike information criterion) more weight.
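As an illustration, a minimal sketch of a kernel-weighted local polynomial regression, assuming a Gaussian kernel in time; the AIC-weighted blend with the degree-0 trendline is omitted for brevity, and the bandwidth default is a placeholder:

```python
import numpy as np

def local_poly_trend(days: np.ndarray, values: np.ndarray, weights: np.ndarray,
                     grid: np.ndarray, bandwidth: float = 30.0,
                     degree: int = 1) -> np.ndarray:
    """Kernel-weighted local polynomial trend through poll results (sketch).

    For each day in `grid`, fit a weighted polynomial to the polls,
    weighting by a Gaussian kernel in time multiplied by each poll's
    cumulative weight, then evaluate the fit at that day. Degree 0
    reduces to a kernel-weighted rolling average.
    """
    trend = np.empty(len(grid))
    for i, t in enumerate(grid):
        k = np.exp(-0.5 * ((days - t) / bandwidth) ** 2) * weights
        coefs = np.polyfit(days - t, values, deg=degree, w=np.sqrt(k))
        trend[i] = coefs[-1]  # intercept = fitted value at day t
    return trend

# Usage (with days, values and weights as arrays of the same length):
# grid = np.arange(0, 60)
# trend = local_poly_trend(days, values, weights, grid)
```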
Once the EWMA and polynomial trendlines are calculated, we calculate a mixing parameter that serves as a baseline for how much weight to give each trendline in our final average. This weight depends on the number of polls conducted over the last month: We put more weight on the polynomial regression when there is more data available to estimate it. This gives us less noisy averages when polls are sparse, while still reacting quickly when they are plentiful; the local polynomial regression detects movement more quickly than the EWMA, which is useful when news events that move public opinion coincide with a big batch of new polls.
Next, we average the weights calculated with the mixing parameter with the weights that would have produced the most accurate blended average over the past five days. We determine "most accurate" using a variant of maximum likelihood estimation: Each day, we compute two values, EWMA_likelihood and polynomial_likelihood, that represent how accurately the EWMA and polynomial trendlines predicted the polls over the last five days. The weight we assign to the polynomial is equal to polynomial_likelihood / (polynomial_likelihood + EWMA_likelihood), while the weight on the EWMA is equal to EWMA_likelihood / (polynomial_likelihood + EWMA_likelihood).
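As an illustration, a minimal sketch of that blend; averaging the baseline mixing weight and the likelihood-based weight in equal parts is an assumption for the example, and the numbers are made up:

```python
def blend_weight(ewma_likelihood: float, poly_likelihood: float,
                 baseline_poly_weight: float) -> float:
    """Final weight on the polynomial trendline (sketch).

    The likelihood-based weight from the last five days is averaged
    with the baseline mixing parameter, which depends on how many
    polls came in over the past month.
    """
    likelihood_poly_weight = poly_likelihood / (poly_likelihood + ewma_likelihood)
    return 0.5 * (baseline_poly_weight + likelihood_poly_weight)

# Example: the polynomial predicted recent polls better, and polling is dense.
w_poly = blend_weight(ewma_likelihood=0.8, poly_likelihood=1.2,
                      baseline_poly_weight=0.7)
final_average = w_poly * 46.3 + (1 - w_poly) * 45.8  # polynomial vs. EWMA estimates
print(round(w_poly, 2), round(final_average, 2))
```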
Finally, we use a technique called optimization to test the calibration of our model by calculating thousands of different averages for each politician and race in our historical database using different values for each of our seven hyperparameters (parameters that govern the behavior of a model): decay, hard_stop, bandwidth, degree, the mixing parameter, the standard deviation of the house-effects distribution and the scalar for outlier detection. For each type of polling average, our model picks the set of parameters that generate the optimal values for three measures of accuracy:
In 2023, we started calculating these hyperparameter values separately for each type of polling average (that is, presidential approval ratings, favorability ratings, Supreme Court approval ratings, congressional approval ratings, national horse-race presidential primary polls and state-level presidential primary polls). That means we are always using the version of the aggregation model that minimizes these three measures of error for that type of polling average. This results in averages that are more reactive to changes in the horse race, which tend to happen as a result of real campaign events, and less reactive to changes in favorability rating polls, which are often due to noise in the data.
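As an illustration, a minimal sketch of that search. The grid values are placeholders, and the scoring function is a stand-in for re-running the full averaging pipeline over the historical database and computing the three accuracy measures referenced above:

```python
import itertools

# Placeholder grid over the seven hyperparameters named in the text.
GRID = {
    "decay": [-0.05, -0.1, -0.2],
    "hard_stop": [30, 45, 60],
    "bandwidth": [15.0, 30.0, 45.0],
    "degree": [1, 2],
    "mixing": [0.3, 0.5, 0.7],
    "house_effect_sd": [2.0, 3.0, 4.0],
    "outlier_scalar": [0.5, 1.0, 2.0],
}

def tune(historical_races, backtest_error):
    """Pick the hyperparameter set with the lowest backtest error (sketch).

    `backtest_error(params, races)` is a hypothetical stand-in that
    rebuilds the averages with the given parameters and scores them;
    it is not defined here.
    """
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*GRID.values()):
        params = dict(zip(GRID.keys(), combo))
        score = backtest_error(params, historical_races)
        if score < best_score:
            best_params, best_score = params, score
    return best_params
```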
And that's basically it! 538's polling averages can really be thought of as two different models: one that measures any biases resulting from the polls' underlying data-generating process, and another to aggregate polls after adjusting for those biases.
There is one last feature of note. As with any model we run, polling averages contain uncertainty. There is error in the individual polls, error in our adjustments and error in selecting the hyperparameters that produce the optimal trendlines. Starting in 2023, all our polling averages convey this uncertainty by calculating and displaying the 95th-percentile difference between the polling average on every day and the polls published those days. (Previously, this was only the case for our presidential-approval averages.) This "error band" represents the uncertainty in that average. (Importantly, this measures our uncertainty when it comes to predicting future polls, but it does not measure our uncertainty at predicting future election results. That step comes later, in our forecasting models.)
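As an illustration, a minimal sketch of how such an error band could be computed from the published polls and the value of the average on the days they landed; the example data are made up:

```python
import numpy as np

def error_band(poll_values: np.ndarray, average_on_poll_days: np.ndarray) -> float:
    """95th-percentile absolute gap between polls and the same-day average (sketch)."""
    residuals = np.abs(poll_values - average_on_poll_days)
    return float(np.percentile(residuals, 95))

# Example: the band could then be drawn as average plus or minus this value.
polls = np.array([44.0, 46.5, 43.2, 47.1, 45.0, 41.8])
avg = np.array([44.8, 44.9, 44.7, 45.1, 45.0, 44.6])
print(round(error_band(polls, avg), 2))
```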
Finally, while we have tried to squish all of the obvious bugs in our programs, we are always on the lookout for anything we might've missed. If you spot something that you think is a bug, drop us a line.
Version History
1.1 | Improved outlier detection, more stable averages, better optimization. | Oct. 9, 2023
1.0 | New favorability, approval and horse-race averages debuted. | June 28, 2023