Toxic Data?

May 21, 2009 10:12am

5/29 update: I'm now adding my reply to Prof. Lichter at the end of this chain.

5/27 update: Below my entry I'm posting a reply from Prof. S. Robert Lichter of STATS, with my appreciation for his response. I'll let him have the floor for a couple of days, then post a few brief closing comments.

I don’t write regularly about the stacks of opinion surveys we vet each week and reject as unsuitable for reporting. But the latest drips with just a little too much irony.

It’s “a new survey of scientists,” being released at the National Press Club in Washington today, that “calls into question” reporting of the risks associated with chemicals in household products. It comes from an outfit called the Statistical Assessment Service at George Mason University, which says it exists “to improve the quality of scientific and statistical information in public discourse and to act as a resource for journalists and policy makers on scientific issues and controversies.”

Noble stuff. But since STATS, as it calls itself, also is in the business of producing data, we can see if it walks the walk.

This “survey of scientists,” it turns out, is not a survey in the common meaning of the word at all, since no sampling whatsoever was applied. STATS simply invited all 3,695 members of the Society of Toxicology to fill out an online questionnaire, and compiled results from the 937 who did.

I’ve got more to say about the methodology, but there are yet bigger problems. A variety of questions in the survey produced very large “no opinion” responses. But, in reporting some of its results, STATS removed those undecideds from the base, thus vastly inflating the attitudes it purports to measure.

For instance, STATS reports that “79 percent say the Environmental Working Group, Natural Resources Defense Council, and Center for Science in the Public Interest overstate the risks” of toxic chemicals. What it doesn’t say is that when no-opinion answers are included (specifically, “not sure” on the questionnaire), these drop sharply – to 40 percent for the EWG (49 percent not sure), 48 percent for the NRDC (39 percent not sure) and 43 percent for the CSPI (46 percent not sure).

In another example, per STATS, “56 percent say WebMD accurately portrays chemical risks” and “45 percent say Wikipedia accurately portrays chemical risks” – again leaving out “not sure” responses, in these cases of 50 percent and 54 percent. Including those, the figures for “accurately portrays” drop to 28 percent and 21 percent, respectively.
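The arithmetic linking the two presentations is simple, and worth spelling out. Here's a quick sketch in Python (the function and variable names are mine, using the WebMD and Wikipedia figures above):

```python
# Repercentaging demo: the same raw responses yield very different
# headline numbers depending on whether "not sure" stays in the base.
def pct_of_all(accurate_pct_of_opinion, not_sure_pct):
    """Convert a percentage computed among opinion-holders back to a
    percentage of all respondents."""
    opinion_base = 100 - not_sure_pct        # share who gave a rating
    return accurate_pct_of_opinion * opinion_base / 100

# Figures reported above: "accurately portrays" among opinion-holders,
# and the share answering "not sure".
webmd = pct_of_all(56, 50)      # -> 28.0 percent of all respondents
wikipedia = pct_of_all(45, 54)  # -> 20.7, i.e. about 21 percent

print(round(webmd), round(wikipedia))
```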

When we asked, STATS said it left out the no-opinion answers “to facilitate comparisons” and control what it calls “the recognition factor” – apparently meaning it thinks those respondents don’t have sufficient knowledge about the issue to form an opinion, and therefore should be left aside. I’d suggest two flaws in this thinking: One, these answers could just as well mean that the respondents are fully informed but truly undecided; if STATS wanted to ask their level of awareness, it could have. Two, the allocation approach it uses assumes that people with no opinion, if they had one, would divide on exactly the same lines as those who do have an opinion. That is a leap.

Any way you slice it, when an attitude is reported as representing the view of a "majority," the implication is that this means more than half of all respondents – not just more than half of those who had an opinion. Any other approach should, at minimum, be clearly explained and justified.

Far less egregious, but not fully descriptive, is the way in which STATS describes some of its results without noting the intensity of sentiment. Its news release reports that “54 percent say U.S. regulators are not doing a good job explaining chemical risks,” but doesn’t mention how many feel that way strongly – 17 percent. Or, while “90 percent say media coverage of risk lacks balance and diversity,” fewer, 34 percent, say it doesn’t provide balance at all.

Many of its questions, moreover, use the long-discredited (but still all-too-common) agree/disagree format, an approach that’s clearly been demonstrated to create substantial acquiescence bias. This is not a good way to measure attitudes.

It’s interesting, too, how some of the agree/disagree statements are posed positively – “U.S. government regulators do a balanced job of explaining chemical risk to the general public,” while others are posed negatively, such as, “The news media do not do a balanced job of explaining chemical risk to the general public.” Hmmm.

But back to the methodology for a moment: There was no back-checking of a subset of results for verification, a best-practice in self-administered surveys. No weighting was done to adjust for differences between participants and the overall membership (not huge in what we see, but the participants were older, a bit more apt to be in academia and a bit less apt to be in private industry than the overall membership).

We don’t know whether self-selection in this incomplete census may have produced bias in the results – whether people motivated to participate were different attitudinally from others. We also don’t know the coverage – that is, out of all the nation’s toxicologists, how many are members of the Society of Toxicology (the society was unable to tell us); or whether, again, society members are different attitudinally from non-members.

STATS’ promotional efforts this morning certainly are compelling: “Are chemicals killing us? Find out what the experts really think,” its website says. “A groundbreaking study… shows how experts view the risks of common chemicals – and that the media are overstating risk.”

This is from an organization that often critiques the quality of statistical studies and the way news organizations report them, and offers helpful advice for journalists. “PR plays on laziness – your laziness,” one column on its site says. “Thinking is such hard work. That's the secret of PR. Odds are, journalists will reprint the press release on the new study or poll results rather than thinking about what could go wrong.”

Having attempted a little thinking here, I can promise – it isn’t that hard.

I’ve called STATS, and invite their reply.

With thanks to ABC Senior Polling Analyst Pat Moynihan for the legwork.


5/27 Prof. Lichter replies:

By S. Robert Lichter

We thank Mr. Langer for offering us the opportunity to respond to his critique. Let us begin at the beginning: His first criticism is that this was not really a survey at all, “in the common meaning of the word,” because there was no sampling involved.  This is a puzzling distinction, since it means that interviewing an entire population does not qualify as a survey, while interviewing some part of it does. For example, the NIH’s list of controlled vocabulary descriptors, which is used to index journal articles, identifies a survey as a “systematic gathering of data for a particular purpose from various sources, including questionnaires, interviews, observation, existing records, and electronic devices.”

In fact, we are rapidly approaching the 23rd administration of a full population survey,  the U.S. Census, which the Council of American Survey Research Organizations’ (CASRO) website calls, “the first known survey done in the United States.” In 2010 the census forms sent to all households will be supplemented by a lengthier questionnaire sent to a sample of households. Surely this doesn’t  mean the sample data come from a survey while the population data do not.

Also surprising is the claim that the instrument was compromised by using “the long-discredited” agree-disagree format. There is a literature arguing that this format produces acquiescence bias, although to my knowledge this has not been demonstrated to operate in online surveys of elite populations.  However, this remains a very common format in academic as well as commercial survey research, and the notion that it is “long-discredited” is still a minority opinion rather than the expression of a collegial consensus.

Ironically, this expression of concern over acquiescence bias is immediately followed by the complaint that some agree-disagree statements in the questionnaire are posed positively and others negatively. Of course, this procedure is used for the express purpose of mitigating acquiescence bias. But Mr. Langer’s main complaint is apparently about which items were phrased positively and which negatively. The only examples he gives are "US government regulators do a balanced job of explaining chemical risk to the general public," and "the news media do not do a balanced job of explaining chemical risk to the general public."

These examples are followed by a one-word comment  –  “Hmmm.” This seems to be a shorthand way of implying that the researchers were trying to nudge respondents toward these positions. But Mr. Langer does not mention another item on government regulation, which was negatively phrased in the questionnaire: “The U.S. system of chemical management and regulation is inferior to the European system.” Only 23 percent of respondents agreed with this negatively-worded statement, while 54 percent disagreed with the positively-worded statement that regulators explain risk in a balanced manner. It would seem difficult to characterize these responses as examples of acquiescence bias.

It is equally difficult to find evidence that the item wording led respondents into agreeing that the media are not balanced. Large majorities questioned the media’s credibility across a variety of response formats and question wordings, making this the most robust finding in the study. For example, one item asked “how well the media as a whole … seek out diverse scientific views to balance stories on potential chemical risk.” Ninety percent responded “not very well” or “not at all well,” while eight percent chose “well” or “very well.”  That rather closely replicates the finding cited by Mr. Langer that 87 percent agree and 11 percent disagree that "the news media do not do a balanced job of explaining chemical risk to the general public."

Mr. Langer also complains that “some of” the results are described “without noting the intensity of sentiment" in the press release.  Readers will recognize the two examples he gives from the preceding paragraphs. First, the release notes that "54% say US regulators are not doing a good job explaining chemical risks," without stating that only 17% indicate “strong agreement.” Second, the release states that "90% say media coverage of risk lacks balance and diversity," while only 34% “say it doesn't provide balance at all.” (As noted above, the actual response category for this item is that the media performs "not at all well.")

We have always considered it standard procedure to provide a brief overview of survey results in a press release and to follow up with a more detailed statement of findings on request. In this case, in response to his assistant’s queries, we promptly provided Mr. Langer with frequency distributions for all questionnaire items, in addition to answering all questions put to us about the survey methodology. Based on this material, he was able to differentiate the proportions who agree “somewhat” and “strongly” in time to post his blog entry even before we held our press conference.

But the issue is not really a matter of brevity so much as consistency. It might be misleading to mention intensity of sentiment when it cuts in one direction but not in another. But the release does not do this — results for every variable are presented in terms of there being two sides to an issue (i.e., collapsing categories such as “somewhat” and “strongly”). Alternatively, it might be considered misleading to describe a group as holding a set of beliefs without noting that the members consistently express only weak agreement with them. But this is not the case here; if anything, the opposite is true.

Thus, out of seven items that ask respondents how well the media performs “when reporting issues related to toxicology,” majorities chose the “not at all well” option on six. The sole exception is the item on media balance and diversity cited by Mr. Langer. For example, 61% said the media performs “not at all well” in distinguishing between correlation and causation, and 67% rated the media “not at all well” in distinguishing studies that are statistically rigorous from those that are not. Since an average of 96 percent rated the media’s performance as either “not very well” or “not at all well” across all seven items, it seems fair to conclude that negative beliefs about the media’s performance are held widely as well as strongly among SOT members.

Another criticism of the presentation of findings by Mr. Langer concerns the exclusion of no-opinion responses when describing how respondents rate the quality of information from various sources.  In his lengthiest critique of the survey, his main argument is that the numbers change sharply when no-opinion responses are included, because the percentage willing to rate specific organizations varies widely.

In fact this is precisely the point, as the report clearly explained at the outset of the discussion: “There were considerable variations in the number of respondents who were familiar enough with the various organizations to rate their accuracy. To insure that the comparisons are commensurable, the percentages exclude “don’t know” [erratum - this should read “not sure”] responses. We added a column indicating the percentage who rated each organization.” In addition, we directed readers to another table which presents the same information, but with “not sure” responses included in the percentages.

Thus, readers were not only clearly informed of this procedure, they were given additional information that allowed them to compare the two ways of presenting the data and decide for themselves which they preferred. The question then becomes whether there was a good reason for us to focus on the subset of those willing to rate each organization in reporting the findings. In the case of environmental organizations, Mr. Langer notes that the percentages choosing “overstate” rose sharply when no-opinion responses were excluded. He does not note that differences among the ratings of many organizations also rose sharply.

For example, 70% rated PETA as overstating risk, compared to only 40% for the Environmental Working Group (EWG). So it would appear that SOT members regard EWG as far more credible than PETA. But this conclusion is an artifact of differences in the percent of respondents rating each group. With no-opinion responses excluded, the findings for PETA and EWG were almost identical – 80% rated PETA as overstating risk, compared to 79% who rated EWG as overstating risk. A difference of 30 percentage points dropped to a single percentage point.

Across all environmental groups, when no-opinion responses were included in the calculation, a spread as large as forty percentage points opened up between those rated most  and least likely to overstate chemical risk — 83% for Greenpeace compared to only 43% for the Center for Science in the Public Interest (CSPI).  That spread dropped to 17 percentage points with no-opinion responses excluded.

Similar effects show up in comparing other organizations that were not cited. For example, the prestigious National Science Foundation’s information on risk was rated as accurate by 59% of all respondents, only slightly better than the more controversial Food and Drug Administration’s 51% accuracy rating. After excluding no-opinion responses, however, the NSF’s accuracy rating rose to 85%, compared to 55% for the FDA. This is what we meant by the need “to insure that the comparisons are commensurable.”
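The relationship between the two sets of numbers can be checked directly: the share of all respondents calling a source accurate equals the excluded-base figure multiplied by the share willing to rate that source at all. A sketch in Python (my own function name, using the NSF and FDA figures above) recovers the implied rating bases:

```python
def implied_rating_base(pct_of_all_respondents, pct_of_raters):
    """Back out the share of respondents who rated the organization,
    implied by the two ways of percentaging the same responses."""
    return 100 * pct_of_all_respondents / pct_of_raters

# NSF: 59% accurate among all respondents, 85% among those who rated it.
nsf = implied_rating_base(59, 85)   # about 69% ventured a rating
# FDA: 51% among all respondents, 55% among raters.
fda = implied_rating_base(51, 55)   # about 93% ventured a rating

print(round(nsf), round(fda))
```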

Finally, in discussing media organizations, Mr. Langer notes that the proportions rating WebMD and Wikipedia as accurate drop when no-opinion responses are included. But the distorting effects of reporting the data this way are much greater in comparisons with mainstream media outlets. For example, with no-opinion responses included, the national newspapers (e.g. New York Times, Washington Post and Wall Street Journal) were rated over three times as likely to overstate chemical risk as was Wikipedia (71% v. 23% of SOT members). Similarly, national health magazines (e.g. Prevention and Modern Health) emerged as by far the least likely outlets to overstate risk (48%). However, this was mainly due to the low recognition rate for health magazines. When no-opinion responses were deleted from the comparison, the proportion rating them as overstating risks rose to 86%, a poorer showing than those of the national newspapers cited above, national newsmagazines, and public broadcasting.

Of course, as Mr. Langer suggests, it is possible that many of those not venturing an opinion are “fully informed but truly undecided.” That is, many toxicologists might be as familiar with the Environmental Working Group (rated by 51% of respondents) and Center for Science in the Public Interest (rated by 54%) as they are with Greenpeace (rated by 87%) and PETA (rated by 81%). They might just be less certain of their opinions about CSPI and EWG.

On the other hand, a Google search brings up 245,000 listings for CSPI and 342,000 for EWG, compared to 11,600,000 for Greenpeace and 26,700,000 for PETA. One need not invoke Occam’s razor to conclude that differential recognition rates are sufficient to account for more respondents expressing opinions about the latter pair of organizations, which are nearly 50 times as visible as the former pair in Web searches.

Alternatively, no one can be certain whether respondents who expressed no opinion would divide along the same lines as those with an opinion. But it is suggestive that the differences in accuracy ratings across similar organizations (e.g., the differences among various environmental or among different media organizations) narrowed sharply when no-opinion responses were excluded. For example, the base of those rating environmental groups varied from 51% for EWG to 87% for Greenpeace. As noted above, however, the spread in the proportion rating each group as overstating risks dropped sharply when no-opinion responses were excluded. If those not venturing an opinion differed dramatically from those who did, we would expect the differences in accuracy ratings to increase when comparing groups whose ratings are based on greatly differing proportions of respondents.

At the conclusion of his critique Mr. Langer returns to sampling issues. He correctly states that there was no weighting of results and “no back-checking of a subset of results for verification, a best-practice in self-administered surveys.” Given the high educational attainments and knowledge base of this group, we preferred to put our limited resources into increasing the response rate rather than back-checking for respondent reliability by re-administering survey items to a subsample.

Again, given that this is an elite survey, and after determining that the available demographic comparisons closely tracked SOT membership data, we chose to report non-weighted responses to let readers know the actual opinions that we measured. Weighting can easily be done at any point, although given the representativeness of the respondents on all variables available for weighting, it would produce little substantive change in the findings.
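For readers unfamiliar with the mechanics, the cell weighting at issue is a simple ratio adjustment: each respondent is weighted by the population share of his or her demographic cell divided by that cell's share of the sample. A minimal sketch in Python (the affiliation categories and all figures below are invented for illustration, not the actual SOT data):

```python
# Post-stratification (cell) weighting sketch: a respondent's weight is
# population_share / sample_share for that respondent's cell, so the
# weighted sample matches the known membership profile.
# All numbers below are invented for illustration.
population_share = {"academia": 0.35, "industry": 0.45, "government": 0.20}
sample_share     = {"academia": 0.45, "industry": 0.35, "government": 0.20}

weights = {cell: population_share[cell] / sample_share[cell]
           for cell in population_share}

# A respondent from an over-represented cell counts for less than 1;
# one from an under-represented cell counts for more than 1.
print({cell: round(w, 3) for cell, w in weights.items()})
```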

Finally, Mr. Langer notes that the SOT could provide no information on how many of the nation’s toxicologists were members, and he raises the possibility that members may have differed from non-members in their attitudes. But the question of how many toxicologists there are may not just be unknown but unknowable. As an area of expertise, toxicology does not represent a single academic field such as chemistry, biology, or physics. Toxicologists come from a variety of academic disciplines, possess a wide range of certifications, and belong to various other scientific professional associations. As a result, individuals engaged in the same professional activities may or may not consider themselves (or be considered by others) as toxicologists.

For the purposes of the survey, however, SOT is an appropriate group to represent toxicological expertise. It is the only professional scientific association devoted solely to toxicology, and membership requires proof of several years of relevant professional experience. The required time period is increased in the absence of published post-doctoral peer-reviewed journal articles. STATS’ goal was to obtain a representative portrait of toxicological expertise on chemical risk; the SOT provides the best representation of a community of experts in this field.

The choice of SOT to represent toxicological expertise illustrates the basis of many of our methodological decisions: It may not be a theoretically perfect choice, but in the real world it provides a reasonable approximation as a source of relatively reliable and valid information. And having responded at length to Mr. Langer’s critique, it is important to note that we appreciate the methodological concerns he raised and the ways in which he suggests that our research methodology might be improved or our findings better presented.

Any survey is subject to numerous valid questions ranging from item construction and formatting to the appropriate presentation and representativeness of findings. We are under no illusion that we have attained perfection in all these areas. But after this review of our methodology, we strongly believe that the basic findings are valid and our inferences from them reasonable, and that this survey fulfills its function of advancing our understanding of this contentious but crucial area of inquiry.

5/29: OK, here are my closing comments in reply to Prof. Lichter, again with my thanks for his participation in this discussion.

- The professor calls it “standard procedure to provide a brief overview of survey results in a press release and to follow up with a more detailed statement of findings on request.” This is discordant with the statement on STATS’ own website saying that journalists rarely go beyond the news release. Given that awareness (sad but true), surely the release should include disclosure of essential elements, including, in this case, the decision (unjustified in my view) to percentage out undecideds. (Disclosure of the absence of strong sentiment also belongs in the news release, even if briefly.)

On repercentaging, Lichter also says readers were "clearly informed of this procedure." In fact it was not disclosed in the news release, and is explained solely in a footnote on p.5 of the full report; compare that to the very prominent play the repercentaged data received in the handout. The footnote, moreover, says the differences in "not sure" responses were caused by unfamiliarity. As per my original post, that is pure conjecture.

- The fact that agree/disagrees are a “common format” does not make them an acceptable one. A wealth of literature rejects the approach; the absence of an alternative proposition encourages satisficing. Consider the opening line of Prof. Willem E. Saris and Prof. Jon Krosnick’s 2001 paper, “Comparing the Quality of Agree/Disagree Questions and Balanced Forced Choice Questions via a Split Ballot MTMM Experiment” (n.b.: subsequently updated as “Comparing Questions with Agree/Disagree Response Options to Questions with Construct-Specific Response Options”): "A huge body of research conducted during more than five decades has documented the role that acquiescence response bias plays in distorting answers to agree/disagree questions.”

They go on to identify “remarkably sizable differences in data quality” in balanced forced-choice vs. agree/disagree formats, and they conclude with no fewer than 48 reference books, chapters and papers. (For a quicker review, try the “Acquiescence Bias” entry in the Encyclopedia of Survey Research Methods, Sage 2008).

Some users of agree/disagree lists switch between positively and negatively phrased statements. This, however, does not address acquiescence bias; it simply shifts it around. In this specific case, for an organization with an interest in media critiques to have employed a negative proposition in its agree/disagree measurement of media performance is, at the very best, an unfortunate choice.

- “Survey” in common usage indicates a sample, not a census, but semantics aren’t the point. A census interviews an entire population (or darn close), not 25 percent of a population. A census needs to be worked like one. (Surveys need to be worked too; see for example the standard practices for optimizing mail and internet participation rates developed by Prof. Don Dillman at Washington State University.)

- If the true population of toxicologists is “unknowable,” this study should not claim to be representative of their views (e.g., “Majorities of toxicologists…”)

- I take embargoes very seriously. Prof. Lichter says I posted my piece "even before we held our press conference." My comments went up at 10:12 a.m., 42 minutes after the 9:30 a.m. embargo on this study.

- “Limited resources” do not excuse departure from sound methodology, nor need back-checking a random subset break the bank.

Survey research is, or should be, serious business. It needs to be done properly, disclosed appropriately and reported precisely. On that I’ve got to think STATS, devoted to the worthy task of educating journalists about the niceties of data analysis, would agree. Whether it's hit the mark in this report, after hearing my views and Prof. Lichter's, is your call.
