Wednesday, December 21, 2011

The Squabble Over Single File Systems

Many of you are endlessly entertained by the back-and-forth bickering between us storage vendors over things like benchmarks.  



Sometimes the disagreement is over how the test was conducted, or the use of "lab queen" configurations that would never be found in a customer environment.


And, occasionally, there's very strong disagreement around comparing two very unlike things using a common standard.   That's what this post is about.

Why should you care?

In a world of exploding data growth, massive scale and limited resources, how you do things may end up being more important than what you do.  

I believe many IT architects will want to take note of this particular debate, because you'll be seeing ever-more variations of this same theme in the near future.

How This Came About

Nothing brings out the competitive nature of IT vendors more than benchmarks. 

While I'm a general skeptic of many benchmarks, the SPEC tests (specifically the SPECsfs2008 NFS and CIFS workloads) are notable in that the SPEC organization has an ongoing process to ensure the workloads match those of the member organizations. 

Put differently, you can't seriously claim the SPEC isn't "real world".

The testing methodology is difficult to game, although if you scrutinize some of the submissions you can see obvious signs of creativity here and there.   For example, you'll sometimes see some vendors export only a small amount of the total capacity configured in an effort to goose the numbers.  Or occasionally turn down the flush rate from write cache to persistent storage. 

You know, stuff real users wouldn't do.

And, unlike most other benchmarks, most of us bigger vendors routinely make submissions.

There is no cost element defined for the equipment used in submitting SPEC tests, however.  Another form of vendor creativity can result from assigning inflated prices to the other guy's gear, and then showing various per-unit comparisons in an effort to put their own results in a favorable light.

However, this sort of comparison is neither sanctioned nor condoned by the SPEC organization.  The SPECsfs test is sheer performance, plain and simple.

The Core Of The Current Debate

Simply put, there are two approaches to getting really good numbers from the SPEC tests.

One approach is to architect a single, scalable file system that goes really fast and scales linearly.  You won't find many of these submissions.

A more common method is to aggregate multiple, independent file systems (using a global name space) so that they appear -- at least in some respects -- as a single entity, even though the result clearly doesn't behave as one, as we'll see in a moment.

My point of view (as well as EMC's) is simple: since these two approaches are radically different in terms of user experience and administrative effort, they shouldn't be directly compared.  Apples and oranges.  At a very minimum, their inherent differences should be well understood by all.

I'll make my arguments here; you can draw your own conclusions.

Let's Start With A Traditional Single File System

Imagine, say, a single 16TB file system, sitting on a filer. 

People start to use it, and -- eventually -- it either fills up, gets slow, or both.  Before long, it's time for more performance and/or more capacity.  That usually means another controller or NAS head, in addition to more capacity.

You then acquire a separate device (array, NAS head, etc.) and add it to the configuration.
But you've got a new problem -- you now have to allocate the new capacity and/or performance amongst the people who need it.  How many users and their data go to the first NAS device, and how many to the second?

You sit down, and do a static rationalization of what might go where in an ideal world.  You copy a bunch of data around, and set up new mappings.  Hopefully, you can do all of  this without disrupting users.

But you're working with imprecise information; and of course there's absolutely no guarantee that all your users will continue to be nicely behaved in the future.  For example, one set of users might grow faster in terms of capacity or performance than expected.

In a dynamic environment that's growing fast, that means you'll find yourself sitting down to perform this "analyze, recommend, migrate" loop more often.  Fast forward: more independent filers get added over time.  More capacity needs to be shoveled around from place to place, and it's now taking days instead of hours.  
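If you want to see how crude this planning step really is, here's a rough sketch in Python.  The share names, sizes and the greedy placement rule are all made up for illustration; a real exercise would also weigh performance, growth projections and which users belong to which share.

# Illustrative only: a naive "analyze, recommend" pass for spreading
# existing shares across two independent filers by capacity alone.
# Share names and sizes are invented; real planning also considers
# performance and growth, not just space.

shares = {"eng": 6.2, "finance": 2.1, "media": 5.5, "home": 3.8}  # TB used
filers = {"filer1": 16.0, "filer2": 16.0}                         # TB usable

placement = {name: [] for name in filers}
free = dict(filers)

# Greedy: biggest shares first, each onto whichever filer has the most room.
for share, size in sorted(shares.items(), key=lambda kv: -kv[1]):
    target = max(free, key=free.get)
    placement[target].append(share)
    free[target] -= size

for filer, assigned in placement.items():
    print(filer, assigned, f"{filers[filer] - free[filer]:.1f} TB used")

# Every share that lands on a different filer than the one it lives on
# today still has to be copied, re-exported and re-mapped by hand.

And that's the easy part; the copying, the cutover and the user communication are where the nights and weekends go.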

Users now start to notice that they can't use their data predictably.  Storage admins find themselves pulling late nights and weekends to keep up with the growth.  Over-provisioning performance and capacity quickly becomes a defense mechanism against having to move things around so often.  Overall utilization of resources goes way down as a result.


What might have made sense at 10TB becomes painful at 100TB and downright unworkable at 1000TB.

To give users a simplified logical view, the filers will often aggregate their name spaces (a global name space) so that the combination of multiple, independent file systems appears as a single entity.  But this ends up being nothing more than a layer of shrink-wrap film over a pallet of multiple containers.  

You can call it one container, but it's patently obvious to all it's just an aggregation of much smaller containers.
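To make that concrete, here's a minimal sketch of a global name space as nothing more than a lookup table in front of those independent containers.  The paths, filer names and exports are invented for illustration.

# Illustrative only: a global name space as a static lookup table.
# Each top-level directory is really a separate, independent file
# system on a particular filer; hosts and paths here are made up.

namespace = {
    "/global/eng":     ("filer1", "/vol/eng"),      # one 16TB file system
    "/global/finance": ("filer1", "/vol/finance"),  # another one
    "/global/media":   ("filer2", "/vol/media"),    # and another
}

def resolve(path):
    """Map a 'global' path to the single container that actually holds it."""
    for prefix, (host, export) in namespace.items():
        if path == prefix or path.startswith(prefix + "/"):
            return host, export + path[len(prefix):]
    raise LookupError(f"{path} is not in the global name space")

print(resolve("/global/media/renders/shot42.exr"))
# -> ('filer2', '/vol/media/renders/shot42.exr')

# Note what the table does NOT do: /global/eng can fill up or slow down
# while /global/media sits idle, and only an administrator editing this
# mapping (and moving the data) can change that.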



Administrators still have to continually juggle what's in each file system container -- both from a performance and capacity perspective.  And power users will often get involved in where their data physically resides -- simply because these capacity, performance and availability issues start to impact them as well.

Not a pretty sight.  But it doesn't have to happen that way ...

Let's Start Again With A Scalable Single File System

Now let's go through this same scenario, but using a scalable single file system approach vs. valiantly attempting to aggregate multiple, independent file systems.

Our first 16TB file system goes in like before.  But when the second one is needed for either capacity or performance reasons, the story changes considerably.


The additional unit is quickly configured, and the scalable file system software does any required balancing and/or data migration: transparently and in the background.
The administrator can stick around and watch this magic happen if they like; but once you've seen it, it's about as exciting as watching a washing machine go through its cycles.

A third unit gets added, and a fourth, and so on up to potentially very large numbers indeed.

Each time, the new resources are automatically integrated, and all available performance and capacity is rebalanced across the pool.  Data protection (locating portions of the data on multiple nodes) adapts to the new resources as well.
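For the curious, here's a generic sketch of the underlying idea using simple rendezvous hashing.  To be clear, this is not how Isilon's OneFS actually lays out data; it's just an illustration of why adding a node to a scale-out pool moves only a proportional slice of the data, automatically, instead of triggering a manual re-planning exercise.

# Illustrative only: NOT Isilon's actual placement algorithm, just a
# rendezvous-hashing sketch of the scale-out idea.  Each block is owned
# by whichever node "wins" a hash; adding a node changes the winner for
# roughly 1/N of the blocks, and only those blocks need to move.

import hashlib

def owner(block_id, nodes):
    """Pick the node with the highest hash score for this block."""
    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)
    return max(nodes, key=lambda n: h(f"{n}:{block_id}"))

blocks = [f"block-{i}" for i in range(10000)]
before = {b: owner(b, ["node1", "node2", "node3"]) for b in blocks}
after  = {b: owner(b, ["node1", "node2", "node3", "node4"]) for b in blocks}

moved = sum(1 for b in blocks if before[b] != after[b])
print(f"about {moved / len(blocks):.0%} of blocks move, all onto the new node")

# Roughly 25% with a fourth node: only the data the new node should own
# gets migrated, and the file system does it in the background.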

Users see a single giant file system that's essentially "flat" in terms of performance and capacity.  Administrators get to see one giant pool of self-administering, self-balancing and self-protecting resources.  

No downtime, no drama.  

And no need to over-provision as a defensive mechanism.

The level of effort -- and usability -- remains largely constant whether we're talking 10TB, 100TB, 1000TB or more.  Capacity and performance scale; hassle doesn't.

You'll have to admit -- there is a meaningful difference between the two approaches. 
This glaring and obvious difference has been validated in customer forums I've attended.  On one side of the room sit the large environments that use a single scalable file system; on the other, those using the more traditional approach of aggregating many, many smaller file systems.

Their worlds are very different indeed :)

In All Fairness

Competitors who only offer the traditional approach of aggregating smaller file systems using a global name space will claim that there are multiple ways of solving customer problems, and that every customer is different.

While it's hard to disagree with that sort of platitude, it's difficult to imagine a scenario where aggregating separate file systems would have any sort of decided advantage.  I mean, how many use cases are there where user demands align precisely with the capacity and performance of a traditional file system?

And, in all fairness, EMC's higher-end VNX products (such as the VG8) have long used this aggregated independent file system approach. 

But, as many of you know, EMC's Isilon is different -- it creates a single, scalable file system over many nodes.

For those of us who are now familiar with both, the differences couldn't be more stark -- especially at scale.

The Magic Of Scale-Out

Compared to our competitors, I think EMC is quite fortunate to now have multiple scale-out technologies in our portfolio.  


In addition to Isilon for scale-out file systems (NFS and CIFS), Greenplum (now augmented with Hadoop!) uses the same architectural style to achieve blazing performance coupled with cost-efficiency and administrative ease.

If you're into distributed object storage (e.g. cloud storage), Atmos uses a scale-out design to achieve the same results.  And, if you're familiar with enterprise block storage at scale, well, that's a VMAX.

And, of course, VMware's products create scale-out clusters using cool technologies such as VMotion. 

As just about any server admin will tell you, a shared pool of server resources that auto-balance is vastly preferable to isolated ones that don't :)

Many years back, we recognized that riding Intel's curve and building products that scaled out as well as up was going to be the architecture of the future: storage, database, servers and so on.

We've invested literally many billions of dollars in this one concept, and will continue to invest many more.  

By this standard, many of our traditional competitors have some very serious work ahead of them.

All Is Fair In Benchmarks, Or Is It?



Perusing the various SPECsfs2008 NFS and CIFS submissions, you have to look carefully to determine whether the competing product simply aggregates multiple, independent file systems to achieve their results -- or creates a single, scalable file system to get the job done.

You won't see it in the inventory of the parts list.  Nor can you spot it from configuration diagrams.   Nor will the submitting vendors likely come forward at the outset and clearly state "hey, we achieved this result by aggregating 24 smaller file systems".  

Your only clue is the subtle "file system type" entry, which is intended to be descriptive only.

Many will say "global name space".  A few may say "single scalable file system".
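If you'd like to do the sorting yourself, a quick sketch along these lines works.  It assumes you've already copied the submission summaries into a local CSV with vendor and file_system_type columns; that file and its layout are my own hypothetical convenience, not something SPEC publishes.

# Illustrative only: assumes a hand-built CSV of submission summaries
# with "vendor" and "file_system_type" columns.  SPEC publishes each
# result as its own disclosure page, not in this form.

import csv

aggregated, single, unclear = set(), set(), set()
with open("specsfs2008_submissions.csv", newline="") as f:
    for row in csv.DictReader(f):
        fstype = row["file_system_type"].lower()
        if "name space" in fstype or "namespace" in fstype:
            aggregated.add(row["vendor"])     # many independent file systems
        elif "single" in fstype:
            single.add(row["vendor"])         # one scalable file system
        else:
            unclear.add(row["vendor"])        # worth a closer read

print("Aggregated via a global name space:", sorted(aggregated))
print("Single scalable file system:", sorted(single))
print("Needs a closer look:", sorted(unclear))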



Trust me, there is a difference.


By: Chuck Hollis