Big Data Now: 2012 Edition
by O’Reilly Media, Inc.
Copyright © 2012 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://my.safaribooksonline.com). For
more information, contact our corporate/institutional sales department: (800)
998-9938 or email@example.com.
Cover Designer: Karen Montgomery Interior Designer: David Futato
October 2012: First Edition
Revision History for the First Edition:
2012-10-24 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449356712 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered
trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their prod‐
ucts are claimed as trademarks. Where those designations appear in this book, and
O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed
in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher
and authors assume no responsibility for errors or omissions, or for damages resulting
from the use of the information contained herein.
Table of Contents
1. Introduction
2. Getting Up to Speed with Big Data
What Is Big Data?
What Does Big Data Look Like?
In Practice
What Is Apache Hadoop?
The Core of Hadoop: MapReduce
Hadoop’s Lower Levels: HDFS and MapReduce
Improving Programmability: Pig and Hive
Improving Data Access: HBase, Sqoop, and Flume
Coordination and Workflow: Zookeeper and Oozie
Management and Deployment: Ambari and Whirr
Machine Learning: Mahout
Using Hadoop
Why Big Data Is Big: The Digital Nervous System
From Exoskeleton to Nervous System
Charting the Transition
Coming, Ready or Not
3. Big Data Tools, Techniques, and Strategies
Designing Great Data Products
Objective-based Data Products
The Model Assembly Line: A Case Study of Optimal Decisions Group
Drivetrain Approach to Recommender Systems
Optimizing Lifetime Customer Value
Best Practices from Physical Data Products
The Future for Data Products
What It Takes to Build Great Machine Learning Products
Progress in Machine Learning
Interesting Problems Are Never Off the Shelf
Defining the Problem
4. The Application of Big Data
Stories over Spreadsheets
A Thought on Dashboards
Full Interview
Mining the Astronomical Literature
Interview with Robert Simpson: Behind the Project and What Lies Ahead
Science between the Cracks
The Dark Side of Data
The Digital Publishing Landscape
Privacy by Design
5. What to Watch for in Big Data
Big Data Is Our Generation’s Civil Rights Issue, and We Don’t Know It
Three Kinds of Big Data
Enterprise BI 2.0
Civil Engineering
Customer Relationship Optimization
Headlong into the Trough
Automated Science, Deep Data, and the Paradox of Information
(Semi)Automated Science
Deep Data
The Paradox of Information
The Chicken and Egg of Big Data Solutions
Walking the Tightrope of Visualization Criticism
The Visualization Ecosystem
The Irrationality of Needs: Fast Food to Fine Dining
Grown-up Criticism
Final Thoughts
6. Big Data and Health Care
Solving the Wanamaker Problem for Health Care
Making Health Care More Effective
More Data, More Sources
Paying for Results
Enabling Data
Building the Health Care System We Want
Recommended Reading
Dr. Farzad Mostashari on Building the Health Information Infrastructure for the Modern ePatient
John Wilbanks Discusses the Risks and Rewards of a Health Data Commons
Esther Dyson on Health Data, “Preemptive Healthcare,” and the Next Big Thing
A Marriage of Data and Caregivers Gives Dr. Atul Gawande Hope for Health Care
Five Elements of Reform that Health Providers Would Rather Not Hear About
Introduction
In the first edition of Big Data Now, the O’Reilly team tracked the birth
and early development of data tools and data science. Now, with this
second edition, we’re seeing what happens when big data grows up:
how it’s being applied, where it’s playing a role, and the conse‐
quences — good and bad alike — of data’s ascendance.
We’ve organized the 2012 edition of Big Data Now into five areas:
Getting Up to Speed With Big Data — Essential information on the
structures and definitions of big data.
Big Data Tools, Techniques, and Strategies — Expert guidance for
turning big data theories into big data products.
The Application of Big Data — Examples of big data in action, in‐
cluding a look at the downside of data.
What to Watch for in Big Data — Thoughts on how big data will
evolve and the role it will play across industries and domains.
Big Data and Health Care — A special section exploring the possi‐
bilities that arise when data and health care come together.
In addition to Big Data Now, you can stay on top of the latest data
developments with our ongoing analysis on O’Reilly Radar and
through our Strata coverage and events series.
Getting Up to Speed with Big Data
What Is Big Data?
By Edd Dumbill
Big data is data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain value from this data,
you must choose an alternative way to process it.
The hot IT buzzword of 2012, big data has become viable as cost-
effective approaches have emerged to tame the volume, velocity, and
variability of massive data. Within this data lie valuable patterns and
information, previously hidden because of the amount of work re‐
quired to extract them. To leading corporations, such as Walmart or
Google, this power has been in reach for some time, but at fantastic
cost. Today’s commodity hardware, cloud architectures and open
source software bring big data processing into the reach of the less
well-resourced. Big data processing is eminently feasible for even the
small garage startups, who can cheaply rent server time in the cloud.
The value of big data to an organization falls into two categories: an‐
alytical use and enabling new products. Big data analytics can reveal
insights hidden previously by data too costly to process, such as peer
influence among customers, revealed by analyzing shoppers’ transac‐
tions and social and geographical data. Being able to process every
item of data in reasonable time removes the troublesome need for
sampling and promotes an investigative approach to data, in contrast
to the somewhat static nature of running predetermined reports.
The past decade’s successful web startups are prime examples of big
data used as an enabler of new products and services. For example, by
combining a large number of signals from a user’s actions and those
of their friends, Facebook has been able to craft a highly personalized
user experience and create a new kind of advertising business. It’s no
coincidence that the lion’s share of ideas and tools underpinning big
data have emerged from Google, Yahoo, Amazon, and Facebook.
The emergence of big data into the enterprise brings with it a necessary
counterpart: agility. Successfully exploiting the value in big data re‐
quires experimentation and exploration. Whether creating new prod‐
ucts or looking for ways to gain competitive advantage, the job calls
for curiosity and an entrepreneurial outlook.
What Does Big Data Look Like?
As a catch-all term, “big data” can be pretty nebulous, in the same way
that the term “cloud” covers diverse technologies. Input data to big
data systems could be chatter from social networks, web server logs,
traffic flow sensors, satellite imagery, broadcast audio streams, bank‐
ing transactions, MP3s of rock music, the content of web pages, scans
of government documents, GPS trails, telemetry from automobiles,
financial market data: the list goes on. Are these all really the same
thing?
To clarify matters, the three Vs of volume, velocity, and variety are
commonly used to characterize different aspects of big data. They’re
a helpful lens through which to view and understand the nature of the
data and the software platforms available to exploit them. You will most
probably contend with each of the Vs to one degree or another.
Volume
The benefit gained from the ability to process large amounts of information
is the main attraction of big data analytics. Having more data
beats out having better models: simple bits of math can be unreason‐
ably effective given large amounts of data. If you could run that forecast
taking into account 300 factors rather than 6, could you predict de‐
mand better? This volume presents the most immediate challenge to
conventional IT structures. It calls for scalable storage, and a distribut‐
ed approach to querying. Many companies already have large amounts
of archived data, perhaps in the form of logs, but not the capacity to
process it.
Assuming that the volumes of data are larger than those conventional
relational database infrastructures can cope with, processing options
break down broadly into a choice between massively parallel process‐
ing architectures — data warehouses or databases such as Green‐
plum — and Apache Hadoop-based solutions. This choice is often in‐
formed by the degree to which one of the other “Vs” — variety —
comes into play. Typically, data warehousing approaches involve pre‐
determined schemas, suiting a regular and slowly evolving dataset.
Apache Hadoop, on the other hand, places no conditions on the struc‐
ture of the data it can process.
At its core, Hadoop is a platform for distributing computing problems
across a number of servers. First developed and released as open source
by Yahoo, it implements the MapReduce approach pioneered by Goo‐
gle in compiling its search indexes. Hadoop’s MapReduce involves
distributing a dataset among multiple servers and operating on the
data: the “map” stage. The partial results are then recombined: the
“reduce” stage.
To store data, Hadoop utilizes its own distributed filesystem, HDFS,
which makes data available to multiple computing nodes. A typical
Hadoop usage pattern involves three stages:
• loading data into HDFS,
• MapReduce operations, and
• retrieving results from HDFS.
This process is by nature a batch operation, suited for analytical or
non-interactive computing tasks. Because of this, Hadoop is not itself
a database or data warehouse solution, but can act as an analytical
adjunct to one.
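The three-stage pattern can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the function names (split_into_chunks, map_phase, shuffle, reduce_phase) are hypothetical, a list of lists stands in for data spread across HDFS nodes, and word counting is used simply because it is a familiar example.

```python
from collections import defaultdict
from itertools import chain

def split_into_chunks(records, n_chunks):
    """Loading stand-in: deal the records round-robin to n_chunks 'nodes'."""
    return [records[i::n_chunks] for i in range(n_chunks)]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_pairs):
    """Group the emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: recombine the partial results into one total per word."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(records, n_chunks=3):
    chunks = split_into_chunks(records, n_chunks)                # load
    mapped = chain.from_iterable(map_phase(c) for c in chunks)   # map
    return reduce_phase(shuffle(mapped))                         # reduce

logs = [
    "big data is big",
    "data moves fast",
    "big data is messy",
]
counts = word_count(logs)
print(counts["big"], counts["data"])
```

Because each chunk is mapped independently, the result does not depend on how many chunks the data is split into; that independence is what allows the map stage to run on separate servers.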
One of the most well-known Hadoop users is Facebook, whose model
follows this pattern. A MySQL database stores the core data. This is
then reflected into Hadoop, where computations occur, such as cre‐
ating recommendations for you based on your friends’ interests. Face‐
book then transfers the results back into MySQL, for use in pages
served to users.
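A rough sketch of that round trip follows, under loud assumptions: an in-memory SQLite database stands in for MySQL, an ordinary Python function stands in for the Hadoop job, and the table names, column names, and scoring rule are all invented for illustration. Nothing here reflects Facebook's actual schema or algorithms.

```python
import sqlite3
from collections import Counter

# In-memory SQLite stands in for the serving database (assumption).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE friends (user_id TEXT, friend_id TEXT);
    CREATE TABLE interests (user_id TEXT, topic TEXT);
    CREATE TABLE recommendations (user_id TEXT, topic TEXT, score INTEGER);
""")
db.executemany("INSERT INTO friends VALUES (?, ?)",
               [("ann", "bob"), ("ann", "cat"), ("bob", "ann")])
db.executemany("INSERT INTO interests VALUES (?, ?)",
               [("ann", "jazz"), ("bob", "hiking"), ("bob", "jazz"),
                ("cat", "hiking"), ("cat", "chess")])

def export_core_data(conn):
    """Step 1: copy the core data out of the serving database."""
    friends = conn.execute("SELECT user_id, friend_id FROM friends").fetchall()
    interests = conn.execute("SELECT user_id, topic FROM interests").fetchall()
    return friends, interests

def compute_recommendations(friends, interests):
    """Step 2: the batch job. Count, per user, how many friends like
    each topic that the user has not already listed."""
    topics_by_user = {}
    for user, topic in interests:
        topics_by_user.setdefault(user, set()).add(topic)
    scores = Counter()
    for user, friend in friends:
        own = topics_by_user.get(user, set())
        for topic in topics_by_user.get(friend, set()) - own:
            scores[(user, topic)] += 1
    return [(user, topic, score) for (user, topic), score in scores.items()]

def load_results(conn, rows):
    """Step 3: write the results back for use in pages served to users."""
    conn.execute("DELETE FROM recommendations")
    conn.executemany("INSERT INTO recommendations VALUES (?, ?, ?)", rows)
    conn.commit()

load_results(db, compute_recommendations(*export_core_data(db)))
top = db.execute(
    "SELECT topic, score FROM recommendations WHERE user_id = ? "
    "ORDER BY score DESC, topic", ("ann",)).fetchall()
print(top)
```

The serving path only ever reads the recommendations table; the expensive computation happens offline and can be rerun on whatever schedule the batch system allows.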
Velocity
The importance of data’s velocity (the increasing rate at which data
flows into an organization) has followed a similar pattern to that of
volume. Problems previously restricted to segments of industry are
now presenting themselves in a much broader setting. Specialized
companies such as financial traders have long turned systems that cope
with fast moving data to their advantage. Now it’s our turn.
Why is that so? The Internet and mobile era means that the way we
deliver and consume products and services is increasingly instrumen‐
ted, generating a data flow back to the provider. Online retailers are
able to compile large histories of customers’ every click and interaction:
not just the final sales. Those who are able to quickly utilize that in‐
formation, by recommending additional purchases, for instance, gain
competitive advantage. The smartphone era increases again the rate
of data inflow, as consumers carry with them a streaming source of
geolocated imagery and audio data.
It’s not just the velocity of the incoming data that’s the issue: it’s possible
to stream fast-moving data into bulk storage for later batch processing,
for example. The importance lies in the speed of the feedback loop,
taking data from input through to decision. A commercial from
IBM makes the point that you wouldn’t cross the road if all you had
was a five-minute old snapshot of traffic location. There are times
when you simply won’t be able to wait for a report to run or a Hadoop
job to complete.
Industry terminology for such fast-moving data tends to be either
“streaming data” or “complex event processing.” The latter term was
more established in product categories before the processing of streaming
data gained widespread relevance, and it seems likely to diminish
in favor of streaming.
There are two main reasons to consider streaming processing. The first
is when the input data are too fast to store in their entirety: in order to
keep storage requirements practical, some level of analysis must occur
as the data streams in. At the extreme end of the scale, the Large Ha‐
dron Collider at CERN generates so much data that scientists must
discard the overwhelming majority of it — hoping hard they’ve not
thrown away anything useful. The second reason to consider stream‐
ing is where the application mandates immediate response to the data.
Thanks to the rise of mobile applications and online gaming this is an
increasingly common situation.
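Both reasons can be illustrated with a small Python generator. This is a minimal sketch, not the API of Storm, S4, or InfoSphere Streams: the RunningStats class, the process_stream function, the sensor_feed source, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """Summary kept in constant memory while readings stream past."""
    count: int = 0
    mean: float = 0.0
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental mean: earlier readings do not need to be stored.
        self.mean += (value - self.mean) / self.count
        self.maximum = max(self.maximum, value)

def process_stream(readings, alert_threshold):
    """Consume readings one at a time. Yield an alert as soon as a
    reading exceeds the threshold, and a summary once the stream ends."""
    stats = RunningStats()
    for value in readings:
        stats.update(value)
        if value > alert_threshold:
            yield ("alert", value, stats.count)
    yield ("summary", stats.mean, stats.count)

def sensor_feed():
    """Hypothetical feed; a real one would read from a socket or queue."""
    yield from [10.0, 12.0, 11.0, 40.0, 9.0]

events = list(process_stream(sensor_feed(), alert_threshold=30.0))
print(events)
```

The running summary addresses the storage problem, since only three numbers are retained however long the stream runs. The alert addresses the response problem, since it is emitted when the reading arrives rather than when a later batch job completes.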
Product categories for handling streaming data divide into established
proprietary products such as IBM’s InfoSphere Streams and the less-
polished and still emergent open source frameworks originating in the
web industry: Twitter’s Storm and Yahoo S4.
As mentioned above, it’s not just about input data. The velocity of a
system’s outputs can matter too. The tighter the feedback loop, the
greater the competitive advantage. The results might go directly into
a product, such as Facebook’s recommendations, or into dashboards
used to drive decision-making. It’s this need for speed, particularly on
the Web, that has driven the development of key-value stores and col‐
umnar databases, optimized for the fast retrieval of precomputed in‐
formation. These databases form part of an umbrella category known
as NoSQL, used when relational models aren’t the right fit.
Variety
Rarely does data present itself in a form perfectly ordered and ready
for processing. A common theme in big data systems is that the source
data is diverse, and doesn’t fall into neat relational structures. It could
be text from social networks, image data, a raw feed directly from a
sensor source. None of these things come ready for integration into an
application.
Even on the Web, where computer-to-computer communication
ought to bring some guarantees, the reality of data is messy. Different
browsers send different data, users withhold information, and they may be
using differing software versions or vendors to communicate with you.
And you can bet that if part of the process involves a human, there will
be error and inconsistency.
A common use of big data processing is to take unstructured data and
extract ordered meaning, for consumption either by humans or as a
structured input to an application. One such example is entity reso‐
lution, the process of determining exactly what a name refers to. Is this
city London, England, or London, Texas? By the time your business
logic gets to it, you don’t want to be guessing.
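As a toy illustration of entity resolution, the sketch below picks between candidate places by counting context words that appear near the name. The gazetteer, its context words, and the resolve function are invented for this example; a production system would draw on much richer evidence.

```python
# Hypothetical mini-gazetteer: each candidate entity lists words that
# tend to appear near it.
GAZETTEER = {
    "london": [
        {"id": "London, England",
         "context": {"uk", "thames", "england", "britain"}},
        {"id": "London, Texas",
         "context": {"texas", "tx", "usa"}},
    ],
}

def resolve(name, surrounding_text):
    """Return (entity_id, score) for the best-supported candidate, or
    (None, 0) when the context offers no evidence either way."""
    candidates = GAZETTEER.get(name.lower(), [])
    words = set(surrounding_text.lower().replace(",", " ").split())
    best_id, best_score = None, 0
    for candidate in candidates:
        score = len(candidate["context"] & words)
        if score > best_score:
            best_id, best_score = candidate["id"], score
    return best_id, best_score

print(resolve("London", "Flooding along the Thames in London, UK"))
print(resolve("London", "A ranch near London, TX"))
print(resolve("London", "Flights to London"))
```

Returning None when there is no supporting context, rather than defaulting to the better-known city, keeps the guess out of downstream business logic and makes the ambiguity visible.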
The process of moving from source data to processed application data
involves the loss of information. When you tidy up, you end up throw‐
ing stuff away. This underlines a principle of big data: when you can,
keep everything. There may well be useful signals in the bits you throw
away. If you lose the source data, there’s no going back.
Despite the popularity and well understood nature of relational data‐
bases, it is not the case that they should always be the destination for
data, even when tidied up. Certain data types suit certain classes of
database better. For instance, documents encoded as XML are most
versatile when stored in a dedicated XML store such as MarkLogic.
Social network relations are graphs by nature, and graph databases
such as Neo4J make operations on them simpler and more efficient.
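To see why a graph shape suits social data, consider finding friends of friends. The sketch below uses plain Python dictionaries and a breadth-first search; it does not use Neo4J or any graph database API, and the FRIENDS data and function names are made up for illustration.

```python
from collections import deque

# Adjacency sets: the natural in-memory shape of a social graph.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: map each person reachable within max_hops
    to their distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance

def friends_of_friends(graph, person):
    """People exactly two hops away from person."""
    return {p for p, d in within_hops(graph, person, 2).items() if d == 2}

print(sorted(friends_of_friends(FRIENDS, "ann")))
```

In a relational table of friendship pairs, each additional hop typically requires another self-join, whereas a traversal simply follows edges outward from the starting node.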
Even where there’s not a radical data type mismatch, a disadvantage
of the relational database is the static nature of its schemas. In an agile,
exploratory environment, the results of computations will evolve with
the detection and extraction of more signals. Semi-structured NoSQL
databases meet this need for flexibility: they provide enough structure
to organize data, but do not require the exact schema of the data before
storing it.
In Practice
We have explored the nature of big data and surveyed the landscape
of big data from a high level. As usual, when it comes to deployment
there are dimensions to consider over and above tool selection.
Cloud or in-house?
The majority of big data solutions are now provided in three forms:
software-only, as an appliance or cloud-based. The decision about
which route to take will depend, among other things, on issues of data
locality, privacy and regulation, human resources and project require‐
ments. Many organizations opt for a hybrid solution: using on-
demand cloud resources to supplement in-house deployments.
Big data is big
It is a fundamental fact that data that is too big to process conven‐
tionally is also too big to transport anywhere. IT is undergoing an
inversion of priorities: it’s the program that needs to move, not the
data. If you want to analyze data from the U.S. Census, it’s a lot easier
to run your code on Amazon’s web services platform, which hosts such
data locally, and won’t cost you time or money to transfer it.
Even if the data isn’t too big to move, locality can still be an issue,
especially with rapidly updating data. Financial trading systems crowd
into data centers to get the fastest connection to source data, because
that millisecond difference in processing time equates to competitive
advantage.
Big data is messy
It’s not all about infrastructure. Big data practitioners consistently re‐
port that 80% of the effort involved in dealing with data is cleaning it
up in the first place, as Pete Warden observes in his Big Data Glossary:
“I probably spend more time turning messy source data into something
usable than I do on the rest of the data analysis process combined.”
Because of the high cost of data acquisition and cleaning, it’s worth
considering what you actually need to source yourself. Data market‐
places are a means of obtaining common data, and you are often able
to contribute improvements back. Quality can of course be variable,
but will increasingly be a benchmark on which data marketplaces
compete.
The phenomenon of big data is closely tied to the emergence of data
science, a discipline that combines math, programming, and scientific
instinct. Benefiting from big data means investing in teams with this
skillset, and surrounding them with an organizational willingness to
understand and use data for advantage.
In his report, “Building Data Science Teams,” D.J. Patil characterizes
data scientists as having the following qualities:
• Technical expertise: the best data scientists typically have deep
expertise in some scientific discipline.
• Curiosity: a desire to go beneath the surface and discover and
distill a problem down into a very clear set of hypotheses that can
be tested.
• Storytelling: the ability to use data to tell a story and to be able to
communicate it effectively.
• Cleverness: the ability to look at a problem in different, creative
ways.
The far-reaching nature of big data analytics projects can have un‐
comfortable aspects: data must be broken out of silos in order to be
mined, and the organization must learn how to communicate and interpret
the results of analysis.
Those skills of storytelling and cleverness are the gateway factors that
ultimately dictate whether the benefits of analytical labors are absor‐
bed by an organization. The art and practice of visualizing data is be‐
coming ever more important in bridging the human-computer gap to
mediate analytical insight in a meaningful way.
Know where you want to go
Finally, remember that big data is no panacea. You can find patterns
and clues in your data, but then what? Christer Johnson, IBM’s leader
for advanced analytics in North America, gives this advice to businesses
starting out with big data: first, decide what problem you want
to solve.
If you pick a real business problem, such as how you can change your
advertising strategy to increase spend per customer, it will guide your
implementation. While big data work benefits from an enterprising
spirit, it also benefits strongly from a concrete goal.
What Is Apache Hadoop?
By Edd Dumbill
Apache Hadoop has been the driving force behind the growth of the
big data industry. You’ll hear it mentioned often, along with associated
technologies such as Hive and Pig. But what does it do, and why do
you need all its strangely named friends, such as Oozie and Zookeeper?
Hadoop brings the ability to cheaply process large amounts of data,
regardless of its structure. By large, we mean from 10-100 gigabytes
and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel at
processing structured data and can store massive amounts of data,
though at a cost: This requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes data
warehouses unsuited for agile exploration of massive heterogeneous
data. The amount of effort required to warehouse data often means
that valuable data sources in organizations are never mined. This is
where Hadoop can make a big difference.
This article examines the components of the Hadoop ecosystem and
explains the functions of each.
The Core of Hadoop: MapReduce
Created at Google in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most
of today’s big data processing. In addition to Hadoop, you’ll find MapReduce
inside MPP and NoSQL databases, such as Vertica or MongoDB.
The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple nodes.
Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive computing
At its core, Hadoop is an open …