Big data has already proved its
importance at companies such as
JPMorgan Chase and eBay (see
“Hadoop Is Ready for Corporate IT,”
page 6). The New York Times has
used big data tools for text analysis
and Web mining, while Disney uses
them to correlate and understand
customer behavior across its stores,
theme parks and Web properties.
Perhaps the biggest challenge
facing those pursuing big data is
building a platform that can store
and access all current and future
information and make it available
online for cost-effective analysis.
That requires a highly scalable
platform made up of many moving
parts, including storage technologies, query languages, analytics
tools, content analysis tools and
transport infrastructures.
Lots of proprietary and open-source options are available for
these various components, often
from startups but also from established cloud services providers
such as Amazon.com and Google.
However, big data does not necessarily have to be a “roll your own”
type of deployment. Large vendors
such as IBM and EMC offer tools for
big data projects, though their costs
can be high and hard to justify.
Hadoop: The Core of Most Big Data Efforts
In the open-source realm, the big
name is Hadoop, a project administered by the Apache Software
Foundation that consists of Google-derived technologies for building
a platform to consolidate, combine
and understand data.
Technically, Hadoop consists
of two key services: reliable data
storage using the Hadoop Distributed File System (HDFS), and
high-performance parallel data
processing using a technique called
MapReduce. Together, these
services provide a foundation
for fast, reliable analysis
of both structured and complex
data. In many
cases, enterprises deploy Hadoop
alongside their legacy IT systems,
allowing them to combine old and