and Mazda, the task requires innovative approaches for
handling billions of objects and petabyte-scale storage media,
tagging data for quick retrieval and rooting out errors.
1. Library of Congress

The Library of Congress processes 2.5 petabytes of data each year, which amounts to around 40TB each week. And Thomas Youkel, group chief of enterprise systems engineering at the library, estimates that
the data load will quadruple in the next few years, thanks to
the library’s dual mandates to serve up data for historians and
to preserve information in all its forms.
The library stores information on 15,000 to 18,000 spinning
disks attached to 600 servers in two data centers. More than
90% of the data, or over 3PB, is stored on a fiber-attached SAN,
and the rest is stored on network-attached storage drives.
The Library of Congress has an “interesting model” in that
part of the information stored is metadata — or data about the
data that’s stored — while the other is the actual content, says
Greg Schulz, an analyst at consulting firm StorageIO. Plenty
of organizations use metadata, but what makes the library
unique is the sheer size of its data store and the fact that it tags
absolutely everything in its collection, including vintage audio
recordings, videos, photos and other media, Schulz explains.
The actual content — which is seldom accessed — is ideally
kept offline and on tape, Schulz says, with perhaps a thumbnail
or low-resolution copy on disk.
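A rough sketch of that model in Python is below; the field names and paths are illustrative assumptions, not the library's actual catalog schema. The searchable metadata and a small preview stay on disk, while the record merely points at the master copy held offline on tape.

    from dataclasses import dataclass, field

    @dataclass
    class CatalogRecord:
        # Searchable metadata lives on always-online disk.
        object_id: str
        title: str
        media_type: str                      # e.g. "audio", "video", "photo"
        tags: list = field(default_factory=list)
        # Small derivative kept on disk for quick preview.
        thumbnail_path: str = ""
        # Pointer to the seldom-accessed master copy held offline on tape.
        tape_barcode: str = ""

    record = CatalogRecord(
        object_id="loc-0000001",
        title="Field recording, 1938",
        media_type="audio",
        tags=["folk", "78rpm", "digitized"],
        thumbnail_path="/previews/loc-0000001.mp3",  # low-bitrate proxy on disk
        tape_barcode="LTO-004217",                   # master stays offline
    )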
Today, the library holds around 500 million objects per database, but Youkel expects that number to grow to as many as
5 billion. To prepare, Youkel’s team has started rethinking the
library’s namespace system. “We’re looking at new file systems
that can handle that many objects,” he says.
Gene Ruth, a storage analyst at Gartner, says that scaling up
and out correctly is critical. When a data store grows beyond
10PB, the time and expense of backing up and otherwise
handling that much data go quickly skyward. One approach,
he says, is to have infrastructure in a primary location that
handles most of the data and another facility for secondary,
long-term archival storage.
2. Amazon.com

E-commerce giant Amazon.com is quickly becoming one of the largest holders of data in the world, with around 450 billion objects stored in its cloud for its customers’ and its own storage needs. Alyssa
Henry, vice president of storage services at Amazon Web Services, says that translates into about 1,500 objects for every person
in the U.S. and one object for every star in the Milky Way galaxy.
Some of the objects in the database are fairly massive — up
to 5TB each — and could be databases in their own right.
Henry expects single-object size to get as high as 500TB by
2016. The secret to dealing with massive data, she says, is to
split the objects into chunks that can be handled in parallel, a process called parallelization.
In its S3 storage service, Amazon uses its own custom code to
split files into 1,000MB pieces. This is a common practice, but
what makes Amazon’s approach unique is how the file-splitting
process occurs in real time. “This always-available storage ar-
chitecture is a contrast with some storage systems which move
data between what are known as ‘archived’ and ‘live’ states,
creating a potential delay for data retrieval,” Henry explains.
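Amazon's file-splitting code is proprietary, but the general chunk-and-upload-in-parallel pattern it describes can be sketched in a few lines of standard-library Python. The upload_part function here is a hypothetical stand-in for whatever call actually ships a chunk to the storage service; only the 1,000MB chunk size comes from the article.

    import concurrent.futures

    CHUNK_SIZE = 1_000 * 1024 * 1024   # 1,000MB pieces, as described in the article

    def read_chunks(path):
        """Yield (index, bytes) pairs without loading the whole object at once."""
        with open(path, "rb") as f:
            index = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                yield index, chunk
                index += 1

    def upload_part(object_key, index, data):
        # Hypothetical placeholder: a real system would ship the chunk to the
        # storage service here and return an identifier used for reassembly.
        return f"{object_key}-part-{index}"

    def parallel_upload(path, object_key, workers=8):
        """Split a large object into chunks and push them concurrently.

        A production version would also bound how many chunks are in flight
        so a multi-terabyte object never sits in memory all at once.
        """
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(upload_part, object_key, i, chunk)
                       for i, chunk in read_chunks(path)]
            return [f.result() for f in futures]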
3. Mazda

Mazda Motor Corp., with 900 dealers and 800 employees in the U.S., manages around 90TB of data. Barry Blakeley, infrastructure architect at Mazda’s North American operations, says business
units and dealers are generating ever-increasing amounts of
data: analytics files, marketing materials, business intelligence
databases, Microsoft SharePoint data and more. “We have
virtualized everything, including storage,” says Blakeley. The
company uses tools from Compellent, now part of Dell, for
storage virtualization and Dell PowerVault NX3100 as its SAN,
along with VMware systems to host the virtual servers.
The key, says Blakeley, is to migrate “stale” data quickly onto
tape. He says 80% of Mazda’s stored data becomes stale within
months, meaning those blocks of data are no longer accessed at all.
To accommodate these usage patterns, the virtual storage is
set up in a tiered structure. Fast solid-state disks connected by
Fibre Channel switches make up the first tier, which handles
20% of the company’s data needs. The rest of the data is archived to slower disks running at 15,000 rpm on Fibre Channel
in a second tier and to 7,200-rpm disks connected by serial-attached SCSI in a third tier.
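Mazda's Compellent tooling handles that placement automatically at the block level, but the policy can be approximated with a short Python sketch that buckets files by how recently they were read. The thresholds and tier names below are assumptions made for illustration.

    import os
    import time

    # Illustrative thresholds only: hot data stays on SSD, cooler data moves down.
    TIERS = [
        (30,   "tier1-ssd"),       # read within the last 30 days
        (90,   "tier2-15k-fc"),    # read within the last 90 days
        (None, "tier3-7200-sas"),  # stale: untouched for months
    ]

    def pick_tier(path, now=None):
        """Choose a tier for a file based on its last-access time."""
        now = now or time.time()
        age_days = (now - os.stat(path).st_atime) / 86400
        for limit, tier in TIERS:
            if limit is None or age_days <= limit:
                return tier

    def classify(root):
        """Group every file under a share by the tier it should land on."""
        plan = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                plan.setdefault(pick_tier(full), []).append(full)
        return plan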
Blakeley says Mazda is putting less and less data on tape —
about 17TB today — as it continues to virtualize storage.
Overall, the company is moving to a “business continuance
model” as opposed to a pure disaster recovery model, he explains. Instead of having backup and offsite storage that would
be available to retrieve and restore data in a disaster recovery
scenario, “we will instead replicate both live and backed-up
data to a colocation facility.” In this scenario, Tier 1 applications
will be brought online almost immediately in the event of a
primary site failure. Other tiers will be restored from backup
data that has been replicated to the colocation facility.
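The difference between the two models boils down to a per-tier policy. The sketch below captures that idea in Python; the tier names and replication schedules are illustrative assumptions, not Mazda's actual configuration.

    # Business continuance: replicate both live and backed-up data to a
    # colocation site so the top tier needs no restore step at all.
    RECOVERY_POLICY = {
        "tier1": {"replication": "continuous", "on_failure": "bring online at colo immediately"},
        "tier2": {"replication": "scheduled",  "on_failure": "restore from replicated backups"},
        "tier3": {"replication": "scheduled",  "on_failure": "restore from replicated backups"},
    }

    def failover_action(tier):
        """Look up what happens to an application tier if the primary site fails."""
        return RECOVERY_POLICY[tier]["on_failure"]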
Adapting the Techniques
These organizations are a proving ground for handling a tremendous amount of data. StorageIO’s Schulz says other
companies can mimic some of their processes, including
running checksums against files, monitoring disk failures
by using an alert system for IT staff, incorporating metadata
and using replication to make sure data is always available.
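Of those practices, running checksums is the easiest to reproduce in-house. The standard-library Python sketch below records a SHA-256 digest for every file under a directory and later flags anything whose contents no longer match; the manifest filename and algorithm choice are assumptions for illustration.

    import hashlib
    import json
    import os

    def sha256_of(path, block_size=1024 * 1024):
        """Stream a file through SHA-256 so large objects never sit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                digest.update(block)
        return digest.hexdigest()

    def build_manifest(root, manifest_path="checksums.json"):
        """Record a digest for every file under root."""
        manifest = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                manifest[full] = sha256_of(full)
        with open(manifest_path, "w") as f:
            json.dump(manifest, f, indent=2)
        return manifest

    def verify(manifest_path="checksums.json"):
        """Return files whose current digest no longer matches the manifest."""
        with open(manifest_path) as f:
            manifest = json.load(f)
        return [path for path, saved in manifest.items()
                if not os.path.exists(path) or sha256_of(path) != saved]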
However, the critical decision about massive data is to choose
the technology that matches the needs of the organization, not
the system that is cheapest or just happens to be popular at the
moment, he says.
In the end, the biggest lesson may be that while big data poses
many challenges, there are also many avenues to success. ◆
Brandon is a former IT manager at a Fortune 100 company who now
writes about technology. Follow him on Twitter (@jmbrandonbb).