SGI (NASDAQ:SGI), the trusted leader in technical computing, has partnered
with Kalev H. Leetaru of the University of Illinois to create the
first-ever historical mapping and exploration of the full text contents
of the English-language edition of Wikipedia, in time and space. The
results include visualizations of modern history captured in under a day
utilizing in-memory data-mining techniques. Loading the entire
English-language edition of Wikipedia into SGI® UV™ 2000, Mr. Leetaru
was able to show how Wikipedia’s view of the world unfolded over the
past two centuries. Location, year, and positive or negative sentiment
have been tied to each reference in the text.
While several previous projects have mapped Wikipedia entries using
location metadata manually assigned by editors, those attempts
accounted for only a tiny fraction of Wikipedia’s location
information. This project unlocked the contents of the articles
themselves, identifying every location and date in all four million
pages and the connections among them to create a massive network.
“Seeing” Wikipedia in a brand new way
“This analysis allows the world to take a step back from the individual
articles and text to gain a forest view of the tremendous knowledge
captured in Wikipedia, not just a page by page tree view. We can watch
how one of the largest collections of human knowledge has evolved and
see what we could never see before, such as global sentiment at a
certain time and place, or where there might be blind spots in the
knowledge coverage,”
said Franz Aman, chief marketing officer and head of strategy, SGI. “We
love to use Google Earth because we can zoom out and get the big picture
view. With SGI UV 2, we can apply the same concept to Big Data to get
the big picture on our Big Data.”
From this analysis, Wikipedia is seen to have four periods of growth in
its historical coverage: 1001-1500 (Middle Ages), 1501-1729 (Early
Modern Period), 1730-2003 (Age of Enlightenment), and 2004-2011
(Wikipedia Era). Its continued growth appears to be focused on enhancing
its coverage of historical events rather than on documenting more of the
present. The average tone of Wikipedia’s coverage of each year closely
matches major global events, with the most negative period in the last
1,000 years being the American Civil War, followed by World War II. The
analysis also shows that the “copyright gap” that blanks out most of the
twentieth century in digitized print collections is not a problem with
Wikipedia, where there is steady exponential growth in its coverage from
1924 to today.
Enabling researchers to data-mine Big Data at the speed of Big Data
“The one-way nature of connections in Wikipedia, the lack of links, and
the uneven distribution of Infoboxes, all point to the limitations of
metadata-based data mining of collections like Wikipedia,” said Mr.
Leetaru. “With SGI UV 2, the large shared memory available allowed me to
ask questions of the entire dataset in near-real time. With a huge
amount of cache-coherent shared memory at my fingertips, I could simply
write a few lines of code and run it across the entire dataset, asking
whatever questions came to mind. This isn’t possible with a scale-out
computing approach. It’s very similar to using a word processor instead
of using a typewriter – I can conduct my research in a completely
different way, focusing on the outcomes, not the algorithms.”
The analytical approach
Loaded into SGI® UV™ 2000, the Big Brain computer, this massive dataset
underwent full-text geocoding and complete date-coding, using algorithms that
identified every mention of every location and every date across the
text of every entry on Wikipedia. More than 80 million locations and 42
million dates between 1000 AD and 2012 were extracted, averaging 19
locations and 11 dates per article (every 44 words and every 75 words,
respectively). The connections between every date and every location
were captured into a massive network representing Wikipedia’s view of
history. With this instrumentation, Mr. Leetaru was able to perform
near-real-time analysis over the entire dataset on the SGI UV 2,
creating visual maps throughout space and time that show not only how
history unfolded but also the overall tone of the world throughout the
last thousand years, and interactively testing a wide array of theories
and research questions, all in less than a day’s work.
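As a rough illustration of the idea described above, the following Python sketch extracts year and place mentions from article text and counts which years and places appear together. It is a toy under stated assumptions, not the algorithms used in the project: the gazetteer, the regular expression, the function names, and the sample sentences are all hypothetical, and real full-text geocoding needs a far larger place-name database plus disambiguation.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy gazetteer (assumption); a real geocoder would use a
# much larger place-name database and resolve ambiguous names.
GAZETTEER = {"Paris", "London", "Gettysburg"}

# Four-digit years from 1000 through 2012, the range named in the text.
YEAR_RE = re.compile(r"\b(1\d{3}|200\d|201[0-2])\b")


def extract_years(text):
    """Return every year mention between 1000 and 2012 as an int."""
    return [int(year) for year in YEAR_RE.findall(text)]


def extract_locations(text):
    """Return every capitalized word that appears in the toy gazetteer."""
    words = re.findall(r"[A-Z][a-z]+", text)
    return [word for word in words if word in GAZETTEER]


def build_network(articles):
    """Count (year, location) pairs that co-occur in the same article."""
    edges = Counter()
    for text in articles:
        years = set(extract_years(text))
        places = set(extract_locations(text))
        edges.update(product(years, places))
    return edges


# Hypothetical sample articles, invented for this illustration only.
articles = [
    "The Battle of Gettysburg was fought in 1863.",
    "In 1889 the Eiffel Tower opened in Paris; London followed events closely.",
]
network = build_network(articles)
for (year, place), count in sorted(network.items()):
    print(year, place, count)
```

Holding the whole edge table in one in-memory structure, as the Counter does here, is what makes ad hoc follow-up questions cheap to ask; at Wikipedia scale that is the role the text assigns to the large shared memory.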
The New SGI UV: The Big Brain computer
The SGI UV 2 product family enables users to find answers to the world’s
most difficult problems on a system as easy to administer as a
workstation. Built with the Intel® Xeon® processor E5 family, running
standard Linux, and supporting a wide range of storage options, SGI UV 2
offers a complete, industry-standard solution for no-limit computing.
With as little as 16 cores and 32 gigabytes of memory, SGI UV 2 can
start small and seamlessly expand. This next generation platform doubles
the number of cores (up to 4096 cores) and quadruples the amount of
coherent main memory (up to 64 terabytes) from the previous generation,
available for in-memory computing in a single-image system. SGI UV 2 can
scale to eight petabytes of shared memory, and at a peak I/O rate of four
terabytes per second (14 PB/hour), it could ingest the entire contents of
the U.S. Library of Congress print collection in less than three seconds.
SGI UV 2000 is available immediately. SGI UV 20 can be ordered today and
will start shipping in August 2012. Pricing starts at $30,000 USD.
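The peak I/O figures quoted above can be checked with a few lines of Python arithmetic. The sketch assumes decimal units (1 PB = 1,000 TB), which the text does not state. It converts four terabytes per second into petabytes per hour and computes the largest collection that could be ingested in three seconds at that rate.

```python
TB_PER_SECOND = 4        # peak I/O rate quoted in the text
SECONDS_PER_HOUR = 3600
TB_PER_PB = 1000         # assumption: decimal units, not stated in the text

# Hourly throughput in petabytes; the text rounds this down to 14 PB/hour.
pb_per_hour = TB_PER_SECOND * SECONDS_PER_HOUR / TB_PER_PB

# Upper bound on data moved in three seconds at the peak rate.
max_tb_in_three_seconds = TB_PER_SECOND * 3

print(pb_per_hour)              # 14.4
print(max_tb_in_three_seconds)  # 12
```

Under these assumptions, the "less than three seconds" claim implies a print collection of under 12 terabytes.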
About SGI
SGI, the trusted leader in technical computing, is focused on helping
customers solve their most demanding business and technology challenges.
Visit sgi.com for more information.
Connect with SGI on Twitter (@sgi_corp), Facebook
(facebook.com/sgiglobal), YouTube (youtube.com/sgicorp), and LinkedIn.
For photos and videos go to: http://www.sgi.com/go/wikipedia
© 2012 Silicon Graphics International Corporation. SGI and the SGI logo
are trademarks or registered trademarks of Silicon Graphics
International Corp. or its subsidiaries in the United States and/or
other countries. Intel and Xeon are registered trademarks of Intel
Corporation. All other trade names and marks are the property of their
respective owners.
Images provided courtesy of Kalev Leetaru
Photos/Multimedia Gallery Available: http://www.businesswire.com/cgi-bin/mmg.cgi?eid=50313303&lang=en
