For example, HealthMap can differentiate between a news story about an outbreak and a story about a government vaccination program or public health in general. By identifying the language reporters use in known outbreak stories, it can find similar new ones that use the same wording. For instance, keywords like "mysterious" tend to pop up in outbreak stories, but not, say, in coverage of vaccine programs. Another common feature of outbreak stories is a small number in the headline, usually denoting the number of people infected or killed.
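To make those two headline cues concrete, here is a minimal Python sketch of how they might be combined into a score. It is an illustration only, not HealthMap's actual classifier, which the article does not describe in detail. Apart from "mysterious," every cue word, the function names, and the cutoff of 1,000 for a "small" number are assumptions made for the example.

```python
import re

# Hypothetical cue lists. Only "mysterious" comes from the article;
# the other words are assumptions chosen for illustration.
OUTBREAK_CUES = {"mysterious", "outbreak", "sickened"}
PROGRAM_CUES = {"vaccination", "vaccine", "campaign", "program"}

# Matches whole numbers, including comma-grouped ones such as "2,000,000".
NUMBER = re.compile(r"\d+(?:,\d{3})*")


def has_small_number(headline: str, limit: int = 1000) -> bool:
    """Return True if the headline contains a number below `limit` (assumed cutoff)."""
    for token in NUMBER.findall(headline):
        if int(token.replace(",", "")) < limit:
            return True
    return False


def outbreak_score(headline: str) -> int:
    """Add a point per outbreak cue, subtract one per program cue,
    and add one more if a small number appears in the headline."""
    words = set(re.findall(r"[a-z]+", headline.lower()))
    score = len(words & OUTBREAK_CUES) - len(words & PROGRAM_CUES)
    if has_small_number(headline):
        score += 1
    return score


def looks_like_outbreak(headline: str) -> bool:
    """Treat any positive score as a likely outbreak story."""
    return outbreak_score(headline) > 0


if __name__ == "__main__":
    for h in (
        "Mysterious illness kills 12 in village",
        "Government launches vaccination program for 2,000,000 children",
    ):
        print(h, "->", looks_like_outbreak(h))
```

A real system would presumably learn such cues from labeled stories rather than from a hand-written list, but the sketch shows why a word like "mysterious" and a small casualty count pull a headline toward the outbreak category.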
"We're looking at data visualization rather than just producing a stream of e-mails," Freifeld said, referencing how some fast-breaking public health data has been distributed.
The main HealthMap interface is a map of the globe that lets any user see outbreaks around the world, but the power of the site is really revealed when a user drills down into an individual outbreak. The researchers' algorithms sort stories into one of six categories, so that the breaking news is presented on a dashboard with background context and related outbreaks, as shown at right.
In a study published this March in the Journal of the American Medical Informatics Association, the researchers found that their automated classification system was accurate 84 percent of the time. Algorithm improvements have pushed accuracy close to 90 percent now, according to the researchers.
While the site has only about 20,000 unique users, many of them work in public health and research, directly on preventing and controlling disease.
Hannah Gould, an epidemiologist with the Centers for Disease Control and Prevention, used the site to quickly trace reports of a recent E. coli outbreak at a major supermarket chain.
"It's a timely synthesis from many different sources into one place," Gould said. "It allows for a quick visualization of health events that you wouldn't otherwise have in other formats."
Right now, the researchers are focused on adding more sources, particularly in other languages, as well as improving their methodologies.
Freifeld and Brownstein are looking into using more social media sources, but they've encountered a problem that most internet users are already familiar with: There's too much noise.
"We have certainly explored looking at more free and noisier sources like blogs and things like Twitter," Freifeld said. "But they pose the problem of capturing a good quality signature from all that stuff."