Culturomics: Mining 'Genome' of Words for Cultural Trends


Ever wonder how many new words enter the English language each year? Or what fame is like for celebrities today versus during the 1800s? Or which words are most deeply ingrained in our collective subconscious?

The billions of words in all the books ever published can provide an interesting window into how the winds of culture change direction. But plowing through centuries of primers, paperbacks and other publications is hard work.

To ease the burden, a team of Harvard researchers worked with Google to devise a way to quantitatively analyze human culture.

In a paper published Thursday in the journal Science, the researchers unveiled their insights on "culturomics," along with some fascinating findings about what we've stored in our collective memory.

"People have been wondering about human culture and human society and about what makes the fabric of our society way before the sciences were there," said Jean-Baptiste Michel, a key researcher on the study and a postdoctoral fellow in Harvard's department of psychology. "But interestingly enough, I don't think there was ever a quantitative way to get at those questions. We felt that it was very interesting that the tool that would enable one to ask questions about human culture quantitatively and in a reproducible manner [would] deliver interesting insights on culture."

Dataset of Words Is Based on 5.2 Million Books

Realizing that Google, by digitizing books, was already tackling the data-acquisition half of the problem, Michel and his Harvard colleague Erez Lieberman Aiden approached the search giant for help.

Soon enough, the academics and the engineers joined forces to create and analyze a dataset of words that would resemble a "fossil record" of human culture.

The dataset, which can be downloaded and searched with a Google tool launched Thursday, is the largest data release in the history of the humanities, the researchers said. If the words were arranged in a straight line, they would extend to the moon and back 10 times.

The database includes more than 500 billion words, based on the full text of about 5.2 million books (which is about 4 percent of all books ever published).
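For a sense of scale, those figures can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative: it uses the rounded numbers quoted above, treats "more than 500 billion" as exactly 500 billion, and the variable names are made up for this example.

```python
# Rounded figures quoted in the article (not exact counts).
books_in_dataset = 5_200_000
words_in_dataset = 500_000_000_000  # "more than 500 billion", so a lower bound
share_of_all_books = 0.04           # "about 4 percent"

# If 5.2 million books are about 4 percent of everything published,
# the implied total number of books is the dataset size divided by that share.
implied_total_books = books_in_dataset / share_of_all_books

# Average number of words per digitized book in the dataset.
avg_words_per_book = words_in_dataset / books_in_dataset

print(f"Implied books ever published: {implied_total_books:,.0f}")
print(f"Average words per book: {avg_words_per_book:,.0f}")
```

Under those assumptions, the numbers imply roughly 130 million books ever published and a little under 100,000 words per digitized book.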

Aiden said the earliest books date back to the early 1500s, but the vast majority of the data are from the last 200 years.

Culturomics Website Includes Tool for Public to Search Database

The online Google tool was released to coincide with the publication of the paper, so that the public can join the researchers in exploring humanity's cultural genome.

"We hope they'll learn some history," he said. "We hope that it will enable us to create this space in which you can quantify things that are relevant to the humanities and that might lead to an interesting dialogue in and of itself."

Anyone can visit the team's Culturomics website to explore the Google tool and learn more about the research, but for a few of the study's most interesting tidbits, check out the list below.

The English language welcomes about 8,500 new words annually, leading to a 70 percent growth of the lexicon between 1950 and 2000. But you'll never find many of these words in the dictionary.
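The two figures in that finding can be checked against each other with simple arithmetic. This is a rough sketch, not the researchers' method: it assumes a constant rate of 8,500 words a year across the whole period and that no words dropped out of use, and the variable names are invented for illustration.

```python
# Figures quoted in the article.
new_words_per_year = 8_500
years = 2000 - 1950
growth = 0.70  # 70 percent growth of the lexicon over the period

# Assumption: a constant yearly rate and no words falling out of use.
words_added = new_words_per_year * years

# If the added words account for 70 percent growth, the starting
# lexicon is the number of added words divided by the growth rate.
implied_lexicon_1950 = words_added / growth
implied_lexicon_2000 = implied_lexicon_1950 + words_added

print(f"Words added, 1950 to 2000: {words_added:,}")
print(f"Implied lexicon in 1950: {implied_lexicon_1950:,.0f}")
print(f"Implied lexicon in 2000: {implied_lexicon_2000:,.0f}")
```

On those assumptions, the article's figures imply a lexicon of roughly 600,000 words in 1950 growing to about one million by 2000.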
