Back in December 2010, Google unveiled an online tool for analyzing the history of language and culture as reflected in the gargantuan corpus of historical texts that have been scanned and digitized as part of the Google Books project. They called the interface the Ngram Viewer, and it was launched in conjunction with a blockbuster paper in the journal Science that baptized this Big Data approach to historical analysis with the label "culturomics." The appeal of the Ngram Viewer was immediately obvious to scholars in the digital humanities, linguistics, and lexicography, but it wasn't just specialists who got pleasure out of generating graphs showing how key words and phrases have waxed and waned over the past few centuries. Here at The Atlantic, Alexis Madrigal collected a raft of great examples submitted by readers, some of whom pitted "vampire" against "zombie," "liberty" against "freedom," and "apocalypse" against "utopia." A Tumblr feed brought together dozens more telling graphs. If nothing else, playing with Ngrams became a time suck of epic proportions.

As of today, the Ngram Viewer just got a whole lot better. For starters, the text corpus, already mind-bogglingly big, has become much bigger: The new edition extracts data from more than eight million of the 20 million books that Google has scanned. That represents about six percent of all books ever published, according to Google's estimate. The English portion alone contains about half a trillion words, and seven other languages are represented: Spanish, French, German, Russian, Italian, Chinese, and Hebrew. The Google team, led by engineering manager Jon Orwant, has also fixed a great deal of the faulty metadata that marred the original release. For instance, searching for modern-day brand names - like Microsoft or, well, Google - previously revealed weird, spurious bumps of usage around the turn of the 20th century, but those bumps have now been smoothed over thanks to more reliable dating of books. While these improvements in quantity and quality are welcome, the most exciting change for the linguistically inclined is that all the words in the Ngram Corpus have now been tagged according to their parts of speech, and these tags can also be searched for in the interface. This kind of grammatical annotation greatly enhances the utility of the corpus for language researchers. Slav Petrov and Yuri Lin of Google's NLP group worked with a universal tagset of twelve parts of speech that could work across different languages, and then applied those tags to parse the entire corpus. Doing part-of-speech tagging on hundreds of billions of words in eight different languages is an impressive achievement in the field of natural language processing, and it's hard to imagine such a Herculean task being undertaken anywhere other than Google.