Friday, June 3, 2011

What Makes Big Data Storage Different?

At EMC World, I was fortunate enough to facilitate our first-ever Big Data Storage Summit.  Imagine a room with 20 or so people, each facing their own unique flavor of stupendous storage requirements.
Our working premise going in was that big data storage requirements were fundamentally different from the more familiar enterprise requirements.  Not only the technology, but also the operational environment, funding model and other contextual factors.
We were directionally correct, but we fortunately ended up getting surprised in several regards.  That's what these sessions are really all about -- opening yourself up to being repeatedly surprised.

Not All Big Data Is Analytics
Toss around the phrase "big data", and many people will immediately gravitate to the uber-data-warehouse-on-steroids mental picture.  That's fascinating enough in its own right, but there's another side to big data that is more about dealing with big files vs. big databases.

The analytics side was well explored during EMC World's first-ever Data Scientist Summit.  And the non-analytics side was the topic of the Big Data Storage Summit.  Think medical research, energy, video, repositories, satellite imagery, service providers -- anytime you're the proud owner of petabyte-class file systems coupled with alarming growth rates.

How Big Is Big?
Most people tend to focus on the absolute size of these environments.  While the total capacity numbers are certainly impressive, what's more interesting are the explosive growth rates, and that's where we started to focus.

When asked "how fast are you growing?" the responses ranged from "dozens of terabytes per month" to "dozens of terabytes per week".  A few were in the "terabytes per day" growth club.  Digging a little further, it wasn't hard to make the case that -- in some environments -- the growth rate itself was accelerating, leading to exponential growth on top of existing massive repositories.
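
To make that arithmetic concrete, here's a minimal Python sketch of how an accelerating growth rate compounds into exponential demand; the starting capacity, monthly growth and acceleration figures are my own hypothetical assumptions, not anything the attendees shared.

def project_capacity(start_pb, monthly_growth_pb, acceleration, months):
    # Project total capacity when the monthly growth itself grows each month.
    capacity, growth, rows = start_pb, monthly_growth_pb, []
    for month in range(1, months + 1):
        capacity += growth              # add this month's new data
        growth *= (1 + acceleration)    # the growth rate keeps accelerating
        rows.append((month, capacity))
    return rows

# Hypothetical example: a 2 PB repository adding 100 TB/month, accelerating 5% per month
for month, pb in project_capacity(2.0, 0.1, 0.05, 36):
    if month % 12 == 0:
        print("Month %2d: ~%.1f PB" % (month, pb))

Even that modest acceleration grows the footprint nearly sixfold over three years in this toy example.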

Indeed, an interesting subset of the room painted a picture of effectively infinite storage demand: one where capacities were dictated more by the limits of resources and technology than by demand itself.  As the storage operational environments improved, they immediately tended to balloon to the next order of magnitude.
Yikes!


The Haves and The Have-Nots
We mixed the room up with two types of storage users: those that were meeting the challenge using purpose-built scale-out NAS (e.g. Isilon) vs. those that were attempting to use more traditional NAS platforms (e.g. EMC Celerra and VNX, NetApp, BlueArc, et al.).  We wanted to understand whether there was a meaningful and significant advantage to using purpose-built storage products over more traditional NAS offerings.

The differences couldn't have been more pronounced.  


Although it's considered exceedingly bad form to turn these research events into a blatant product pitch, at several points the Isilon customers were openly sharing how much better their worlds had become once they moved off of more traditional NAS products.

Gone was the endless treadmill of rebalancing storage and workloads across multiple filers.  Gone were the lengthy and repetitive installation, configuration and integration exercises.  Gone was the need to detect and respond to individual performance spikes.


This wasn't glossy marketing-speak; these were real live IT administrators who now couldn't imagine any other way to get things done.  The people using a purpose-built scale-out approach (e.g. Isilon) had other challenges they were facing, but they were of a different class entirely than those using traditional NAS filers.

Surprising to me was the discussion around downtime -- I had sort of assumed that downtime or performance degradation wasn't particularly a huge issue in these environments.  I was very wrong.
As part of the endless rebalancing that the more traditional NAS users faced, they often had to take frequent and lengthy downtime to shuffle hundreds of terabytes around.  Cranky and irritated users appeared to be the norm here, not to mention cranky and irritated IT administrators.

One customer shared how a relatively normal filer disk failure and subsequent lengthy rebuild put a smoking performance hole in the middle of a dozen-filer farm -- because the user data sets spanned multiple filers!  As a result, every user was significantly impacted; and of course the issue rose to very high levels indeed.

Yikes again.


Big Data Storage = Internal Storage Service Provider?
About an hour or so into the session, it became clear to me that we would end up focusing more on the people who were already using purpose-built scale-out NAS.  The folks who weren't were mostly so consumed by day-to-day firefighting that it was difficult for them to articulate requirements beyond their current situation.

I then started to probe the folks who were using purpose-built products.  We wanted to know more about their operational model (how they're organized to do what they do), and the associated funding models.

Before long, it was clear to me that their operational models had edged over to look very much like an internal storage service provider: here are my service offerings, here is how I make them very easy to consume, here is how I give you visibility into what you're using, and how well it's performing.

And -- behind that -- the processes, roles, skills and organizational alignment that are the hallmarks of IT-as-a-service vs. traditional enterprise IT silos.


Not everyone in this subgroup was 100% there, but it started looking awfully familiar to me.  And, as a result, their concerns started sounding familiar as well.

For example, they all were pretty good at provisioning storage services on demand.  That being said, there was recognition that they were really providing infrastructure resources, hence the need to associate server, network, image, etc. with the fundamental provisioning activity.  I'd describe it as a desired Vblock-ish model, but with entirely different compute-to-capacity ratios.

There was also a desire to give their power users more visibility into the resources they were using, and how well they were performing.  Most of that information flows to the storage administrator today, vs. a federated view where "subdomain administrators" can get their specific context.

Notions of chargeback and metering came up frequently as well.  Some of these larger environments were well-funded and thus weren't overly concerned with showing resource usage in a precise and granular fashion.  Others were coming from government-funded research or educational settings; for them, justifying each and every dollar spent was a pressing need.

Features and Functionality
We did some fishing to see if some of the more popular features found in traditional NAS platforms had an equally desirable role in purpose-built scale-out environments.  And there were more than a few surprises here as well.

For example, when it came to space reduction technologies (e.g. single-instancing, compression and data deduplication), there wasn't the overwhelming demand from the purpose-built NAS crowd that you might have expected.  I think they weren't exactly clear whether it would be worth the trouble in their environments, especially considering their data types and usage models usually aren't great candidates for these technologies.

Replication and data movement technologies were an area of growing interest.  Perhaps less so in a data protection sense, and more in a get-the-right-information-in-the-right-place-at-the-right-time information logistics sort of way.

Producers and consumers of these large information stores were increasingly separated by distance, and the associated latency was no one's friend.


When we did finally wander into data protection topics (backup, continuous replication, etc.) there was a strange and rather awkward silence in the room.  No one came out and openly admitted it, but I was left with the suspicion that much of this big data isn't getting adequately protected for one reason or another.
When I asked "would anyone be interested in considering some newer approaches to this topic?", there was very strong interest.  Stay tuned here ...

Feature, Feature, Feature -- Hey, Wait A Minute!
As we went through a laundry list of other specific storage features (e.g. encryption, auto-tiering, hypervisor integration, etc.) the purpose-built crowd said something very important: we're willing to consider all these new features, but not at the expense of the utter simplicity and predictability we have in our existing environments.


Complexity -- in any form -- was the bane of their existence.  Better to have a less-functional solution that scaled and retained its core simplicity vs. a more feature-rich environment that was even a tiny bit less elegant to use.  That came across loud and clear.


For me, this was one of the essential defining elements of what makes big data storage fundamentally different: simplicity and predictability above all else.  Take any seemingly minor inefficiency or iota of complexity, multiply it by a very large number, and you inherently have a major issue.
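
A trivial back-of-the-envelope illustrates the point; the file count and per-file overhead below are my own hypothetical numbers, purely for illustration.

files = 10**9                    # one billion files (assumed)
overhead_seconds_per_file = 0.5  # a "minor" half-second of extra handling per file (assumed)
total_years = files * overhead_seconds_per_file / (3600 * 24 * 365)
print("Cumulative overhead: ~%.0f years" % total_years)   # roughly 16 years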

There's More, Of Course ...
We ended up with pages and pages of incredibly detailed notes from the session.  We learned a lot from this group.  And, in some cases, they learned a lot from each other :)

When I run one of these sessions, I sometimes feel a bit guilty that we're taking a lot without giving something back in return.  Time is valuable, and having these people come all the way out to EMC World so we can ask hit-and-miss questions about their world -- well, that's a huge ask from a vendor to a customer.

That being said, when I asked them if they would want to repeat this sort of session in the future, just about everyone raised their hands.  


I think that's because -- when it comes to big data storage -- it's a time for intense dialogue between both sides of the vendor/customer community.  Beware of vendors bearing "total solutions" :)


Instead, I think there's a clear opportunity for vendors to partner with these fascinating big data storage users, and build unique capabilities that help them do what they do even better than today.

A huge thank you to all of you who participated!


By: Chuck Hollis