Culturomics: Mining 'Genome' of Words for Cultural Trends

Harvard and Google discover cultural trends with database of digitized books.

Dec. 17, 2010 -- Ever wonder how many new words enter the English language each year? Or what fame is like for celebrities today versus during the 1800s? Or which words are most deeply ingrained in our collective subconscious?

The billions of words in all the books ever published can provide an interesting window into how the winds of culture change direction. But plowing through centuries of primers, paperbacks and other publications is hard work.

To ease the burden, a team of Harvard researchers worked with Google to devise a way to quantitatively analyze human culture.

In a paper published Thursday in the journal Science, the researchers unveiled their insights on "culturomics," along with some fascinating findings about what we've stored in our collective memory.

"People have been wondering about human culture and human society and about what makes the fabric of our society way before the sciences were there," said Jean-Baptiste Michel, a key researcher on the study and a postdoctoral fellow in Harvard's department of psychology. "But interestingly enough, I don't think there was ever a quantitative way to get at those questions. We felt that it was very interesting that the tool that would enable one to ask questions about human culture quantitatively and in a reproducible manner [would] deliver interesting insights on culture."

Dataset of Words Is Based on 5.2 Million Books

Realizing that Google, by digitizing books, was already handling the data acquisition that such a quantitative tool would require, Michel and his Harvard colleague Erez Lieberman Aiden approached the search giant for help.

Soon enough, the academics and the engineers joined forces to create and analyze a dataset of words that would resemble a "fossil record" of human culture.

The dataset, which can be downloaded and searched with a Google tool launched Thursday, is the largest data release in the history of the humanities, the researchers said. If the words were arranged in a straight line, they would extend to the moon and back 10 times.

The database includes more than 500 billion words, based on the full text of about 5.2 million books (which is about 4 percent of all books ever published).

Aiden said the earliest books date back to the early 1500s, but the vast majority of the data are from the last 200 years.

Culturomics Website Includes Tool for Public to Search Database

The online Google tool was released to coincide with the publication of the paper, so that the public can participate in exploring humanity's culture genome along with the researchers.

"We hope they'll learn some history," he said. "We hope that it will enable us to create this space in which you can quantify things that are relevant to the humanities and that might lead to an interesting dialogue in and of itself."

Anyone can visit the team's Culturomics website to explore the Google tool and learn more about the research, but for a few of the study's most interesting tidbits, check out the list below.

The English language welcomes about 8,500 new words annually, leading to a 70 percent growth of the lexicon between 1950 and 2000. But you'll never find many of these words in the dictionary.
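As a rough consistency check on those two figures, the arithmetic can be sketched in a few lines of Python. The sketch assumes the 8,500-words-a-year rate held steady across the whole half-century, which the article does not state, and the variable names are purely illustrative.

```python
# Back-of-envelope check of the lexicon figures quoted above.
# Assumption (not from the study): new words arrive at a constant yearly rate.
NEW_WORDS_PER_YEAR = 8_500
YEARS = 2000 - 1950
GROWTH = 0.70  # 70 percent growth of the lexicon over the period

words_added = NEW_WORDS_PER_YEAR * YEARS
implied_1950_lexicon = words_added / GROWTH
implied_2000_lexicon = implied_1950_lexicon + words_added

print(f"Words added 1950-2000: {words_added:,}")
print(f"Implied 1950 lexicon:  {implied_1950_lexicon:,.0f}")
print(f"Implied 2000 lexicon:  {implied_2000_lexicon:,.0f}")
```

Under that assumption, the two figures together imply a lexicon of roughly 600,000 words in 1950 and a little over 1 million in 2000.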

Humanity is losing its grasp on the past. The researchers tracked how often each year from 1875 to 1975 was mentioned, finding that references to the past decrease much more rapidly now than in the 19th century. References to 1880 didn't fall by half until 1912 -- a lag of 32 years -- but references to 1973 reached half their peak a decade later, in 1983.
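The "half their peak" measure can be illustrated with a short Python sketch. This is not the researchers' code. The function name `half_life_lag` is hypothetical, and the input is made-up exponential decay data that peaks in the named year itself, chosen only to mirror the 32-year and 10-year lags quoted above. Real frequencies would come from the downloadable dataset.

```python
def half_life_lag(freq_by_year):
    """Return (peak_year, half_year, lag) for a {year: frequency} mapping.

    half_year is the first year after the peak in which the frequency is
    at or below half of the peak value; None if that never happens.
    """
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half_peak = freq_by_year[peak_year] / 2
    for year in sorted(y for y in freq_by_year if y > peak_year):
        if freq_by_year[year] <= half_peak:
            return peak_year, year, year - peak_year
    return peak_year, None, None


def synthetic_series(peak_year, half_life, span=60):
    """Made-up exponential decay, used only to exercise the function."""
    return {peak_year + t: 0.5 ** (t / half_life) for t in range(span)}


# Illustrative inputs only: decay rates picked to match the article's lags.
result_1880 = half_life_lag(synthetic_series(1880, 32))
result_1973 = half_life_lag(synthetic_series(1973, 10))
print(result_1880)
print(result_1973)
```

Each result is a tuple of the peak year, the year the frequency first drops to half of the peak, and the gap between them. A shorter gap means a year fades from print more quickly.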

Cultural adoption of technology is speeding up. Inventions from the end of the 19th century spread more than twice as fast as those from the early 1800s.

Fame vanishes faster than ever. Modern-day celebrities are younger and more famous (in books, at least) than celebrities of yesteryear, but their heyday is shorter-lived. Celebrities born in 1950 initially achieved fame at an average age of 29, compared with 43 for celebrities born in 1800.

The most-famous actors become famous earlier (at age 30) than the most-famous writers (at age 40) or politicians (at age 50). But apparently, it's not so bad being a late bloomer. Well-known politicians end up with more fame than top actors.

Censorship and propaganda can be easily spotted with culturomics. Jewish artist Marc Chagall was mentioned just once in the entire German dataset from 1936 to 1944, even as references to him grew about fivefold in English-language books. Similar patterns emerged for Russian mentions of Leon Trotsky, Chinese mentions of Tiananmen Square and U.S. mentions of the Hollywood 10, a group of entertainers blacklisted in 1947.

Galileo, Darwin and Einstein may be among history's most brilliant scientists, but Freud is more deeply ingrained in our collective subconscious.