[...] all sorts of analyses that just weren’t practical
before. You can start to look at patterns over years,
over seasons, across [...] have enough data to fill in
patterns and [...] and decide, “How should we price
things?” and “What should we be selling now?” and “How
should we advertise?” It is not only about having data
for [...] but also richer data about any given [...]
What are Hive and Pig? Hive gives you [a way] to query
data that is stored in Hadoop. A lot of people are used to
using SQL and so, for [them], it’s a very useful
tool. Pig is a different language. It is not SQL. It is an
imperative data flow language. It is an alternate way
to do higher-level programming of Hadoop clusters.
There is also HBase, if you want to have real-time
[analysis] as opposed to batch. There is a whole
ecosystem of projects that have grown up around
Hadoop and that are continuing to grow. Hadoop is
the kernel of a distributed operating system, and all
the other components around the kernel are now
arriving on the stage.
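To make the contrast between a SQL-style query and an imperative data-flow program concrete, here is a minimal sketch. It is plain Python, with the standard-library `sqlite3` module standing in for a query engine; it is neither Hive nor Pig syntax and does not touch Hadoop. The sample records and the names `SALES`, `totals_declarative` and `totals_dataflow` are assumptions made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records (store, amount); not taken from the interview.
SALES = [("north", 10.0), ("south", 5.0), ("north", 2.5), ("south", 1.0)]


def totals_declarative(rows):
    """SQL style: describe the result you want and let the engine plan the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
    conn.close()
    return result


def totals_dataflow(rows):
    """Data-flow style: spell out each step in order (group, sum, sort)."""
    grouped = defaultdict(list)
    for store, amount in rows:  # step 1: group the records by store
        grouped[store].append(amount)
    # step 2: sum the amounts within each group
    summed = [(store, sum(amounts)) for store, amounts in grouped.items()]
    return sorted(summed)  # step 3: order the output by store


if __name__ == "__main__":
    print(totals_declarative(SALES))
    print(totals_dataflow(SALES))
```

Both functions compute the same per-store totals. The difference is in who decides the order of operations: the query engine in the first case, the programmer in the second.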
Why do you think there’s so much interest in Hadoop
right now? It is a relatively new technology. People
are discovering just how useful it is. I think it is still
in a period of growth where people are finding more
and more uses for it. To some degree, software has
lagged hardware for some years, and now we are
starting to catch up. We’ve got software that lets companies really exploit the hardware they can afford.
What is it about relational database technologies
that makes them unsuitable for some of the tasks
that Hadoop is used for? Some of it comes down to
technological challenges. If you want to write a SQL query
that has a “join” over tables that are petabytes [in size],
nobody knows how to do that. The standard way
you do things in a database tops out at a certain level.
[Relational databases] weren’t designed to support
distributed parallelism to the degree that people now
find affordable. You can buy a Hadoop-based solution
for a tenth of the price [of conventional relational
database technology]. So there is the affordability.
Hadoop is a fairly crude tool, but it does let you really
use thousands of processors at once running over all
of your data in a very direct way.
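The interview does not say how a join is carried out on a Hadoop cluster. As one illustration of running over all of the data "in a very direct way," the sketch below simulates a reduce-side join, a common MapReduce pattern, in a single Python process. The function names and the customer/order records are hypothetical, and nothing here calls a Hadoop API; on a real cluster the map and reduce steps would be spread across many machines.

```python
from collections import defaultdict
from itertools import product


def map_phase(customers, orders):
    """Tag every record with its source and emit it under the join key.

    Each record is handled independently, which is what lets this step
    be split across many processors.
    """
    for cust_id, name in customers:
        yield cust_id, ("customer", name)
    for cust_id, item in orders:
        yield cust_id, ("order", item)


def shuffle(pairs):
    """Group all emitted values by key (the framework does this on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every customer record with every order record sharing the key."""
    names = [v for tag, v in values if tag == "customer"]
    items = [v for tag, v in values if tag == "order"]
    return [(key, name, item) for name, item in product(names, items)]


def join(customers, orders):
    """Run map, shuffle and reduce in sequence and collect the joined rows."""
    groups = shuffle(map_phase(customers, orders))
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_phase(key, groups[key]))
    return joined


if __name__ == "__main__":
    customers = [(1, "Ann"), (2, "Bob")]
    orders = [(1, "book"), (1, "pen"), (3, "cup")]
    print(join(customers, orders))
```

Keys that appear on only one side (customer 2 and order key 3 in the sample data) produce no output, so the result behaves like an inner join.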
What are enterprises using Hadoop for? Well, we
see a lot of different things, industry by industry. In
the financial industry, people are looking at fraud
detection. Credit card companies are looking to see
which transactions are fraudulent; banks are looking
at creditworthiness, deciding if they should give
someone a loan or not. Retailers are looking at
long-term trends, analyzing promotions, analyzing
inventory. The intelligence community uses this a lot
for analyzing intelligence.
Are those users replacing relational databases, or
just supplementing them? They are augmenting and
not replacing. There are a lot of things I don’t think
Hadoop is ever going to replace, things like doing
payroll, the real nuts-and-bolts things that people
have been using relational databases for forever. It’s
not really a sweet spot for Hadoop.
Microsoft, Oracle, IBM and other big vendors have
all begun doing things with Hadoop these days. What
do you think about that trend? It’s a validation that
this is real, that this is a real need that people have. I
think this is good news.
What advice would you give to enterprises considering Hadoop? I think they should identify a
problem that, for whatever reason, they are not able
to address currently, and do a sort of pilot. Build a
cluster, evaluate that particular application and then
see how much more affordable, how much better it
is [on Hadoop]. I think you can do bakeoffs, at least
for some initial applications. There is a real synergy
when you get more data into a Hadoop cluster.
Hadoop lets you get all of your data in one place so
you can do an analysis of it together and combine it.
Where do you see Hadoop five years from now? It is
going to start to be a real established part of IT infrastructure. Right now, these things from Oracle and
Microsoft are experiments. I think they are trying
to tinker with it. I think in five years those won’t be
experiments. [Hadoop] will be the incumbent.
My hope is to build something that is loosely
coupled enough that it can evolve and change and
we can replace component by component [so] there
doesn’t need to be a revolution again anytime soon.