Tuesday, September 27, 2011

A Big Effort To Support Big Data

Industry opinions around the topic of big data analytics range from wild-eyed enthusiasm to hardened cynicism.
My personal take?
The cynics should move on and find something better to grumble about; there's absolutely stunning potential being driven by a perfect storm of exploding data sources, nose-diving infrastructure costs and new toolsets to make data dance and sing in ways we haven't seriously considered before.
Lately, one of these toolsets -- Hadoop -- has been enjoying more than its fair share of attention.
One of its primary strengths is supporting efficient batch processing of enormous unstructured data sets.
Yes, you read that right -- batch processing is sexy again :)
The core technologies are, thankfully, open source, with the Apache Hadoop project at the heart of the effort.  And yesterday, EMC and Greenplum announced a massive donation to the cause.
In A Nutshell
Imagine gigantic data sets coming from everywhere: web servers, social feeds, metering, etc.  The first step in extracting value is an ingest/filter/correlate pass that's roughly analogous to a mining process -- raw ore in; useful minerals out.  Maybe that's why they call it data mining?
That demands big, scale-out commodity infrastructure -- the bigger/cheaper/faster, the better.  On top of the plumbing, you need a set of tools to manage the data sets, schedule jobs and workflows, and so on.  That's where Hadoop comes in.
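To make that a bit more concrete, here's a tiny, self-contained Python sketch of the map/reduce pattern Hadoop applies to this kind of batch grinding.  The log layout, field names and filter rule are all invented for illustration, and a real Hadoop Streaming job would split the map and reduce steps into separate programs driven by the framework:

    from collections import defaultdict

    # A hypothetical, tab-delimited slice of "raw ore": timestamp, url, http_status.
    raw_log = [
        "2011-09-27T10:00:01\t/checkout\t500",
        "2011-09-27T10:00:02\t/home\t200",
        "garbled line",                           # malformed record that gets filtered out
        "2011-09-27T10:00:03\t/checkout\t503",
    ]

    def map_phase(lines):
        # Ingest + filter: keep only well-formed records that are server errors.
        for line in lines:
            parts = line.split("\t")
            if len(parts) != 3:
                continue
            timestamp, url, status = parts
            if status.startswith("5"):
                yield url, 1                      # emit (key, value) pairs

    def reduce_phase(pairs):
        # Correlate: sum error counts per URL (Hadoop shuffles/sorts by key before this step).
        totals = defaultdict(int)
        for url, count in pairs:
            totals[url] += count
        return dict(totals)

    print(reduce_phase(map_phase(raw_log)))       # -> {'/checkout': 2}

The point isn't the toy example itself; it's that the same pattern keeps working when the log is petabytes instead of four lines, which is exactly what the framework exists to manage.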
Hadoop's roots are a story unto themselves; the present state of play is a core open source project (Apache) and several derivative variants being commercialized by various vendors, including EMC's Greenplum division via the Greenplum HD offering.
When EMC announced its intention to offer an enterprise version of Hadoop, there was the predictable concern about EMC's ability to give back to the open source community that not only created the technology, but continues to drive its evolution.
Well, I think we found an important and useful way to give back.
The Greenplum Analytics Workbench
One of the great aspects of the open source model is you get the best-of-the-best intellectual contributions from key stakeholders who are actually using the technology.  Some of the best code on the planet arises from open source models.
One of the drawbacks of the open source model is that there's not a lot of money around to fund expensive stuff, like massive computing infrastructure.
When it comes to open source big data efforts, that's a special problem: unless the code is tested at reasonable scale, it's a work unfinished -- and less-than-useful to people who want to use it in large-scale production environments.
So EMC and Greenplum are leading an effort -- along with a great list of other vendor participants -- to create a 1000-node, 24-petabyte lab on behalf of the Apache project.  The project couldn't afford a scale-out test environment like that, so we're building one for them.  And donating the equipment, facility costs and supporting labor.  That's not an inconsequential investment.
It should be up and running this January.  1000 physical nodes can easily become 10,000 or more logical nodes (thanks to VMware!), which allows some serious scaling of compute, network and data.  The team can find the problems that only happen at scale *before* the code gets into the distribution.
That -- in effect -- greatly accelerates the maturation of the Hadoop code in a significant and meaningful way that can't readily be achieved by other means.  There's just no substitute for a big lab full of equipment :)
If you bother to read the quotes from the press release, you can almost feel the enthusiasm from the team.  My inner geek can relate.
My personal hope is that we can do more of this: there's an entire cadre of data scientists and data engineers that need to learn the skills to wrangle data sets at massive scale.  I can imagine us teaming up with educational institutions at some point to do exactly that.
And On To The Product News
Greenplum is essentially a software company.  Part of their compelling "secret sauce" is a modern database that is the essence of shared-nothing scale-out architecture.
Want to go faster?  Just rack up more commodity hardware, and you're off to the races.  Nothing could be simpler -- or more efficient.  At the end of the day, scale-out wins when it comes to big data.
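To illustrate the shared-nothing idea -- and this is a toy Python sketch, not Greenplum's actual mechanism; the segment count and key column are invented -- each row hashes to exactly one segment, so every segment owns its own slice of the data and can scan it in parallel:

    import hashlib

    SEGMENTS = 8  # hypothetical number of segment hosts

    def segment_for(distribution_key: str) -> int:
        # Hash the distribution key so rows with the same key always land on the same segment.
        digest = hashlib.md5(distribution_key.encode()).hexdigest()
        return int(digest, 16) % SEGMENTS

    # Invented rows; in a real system the distribution key is chosen per table.
    rows = [{"customer_id": f"C{i:05d}", "amount": i * 1.5} for i in range(1000)]

    placement = {}
    for row in rows:
        placement.setdefault(segment_for(row["customer_id"]), []).append(row)

    for seg, seg_rows in sorted(placement.items()):
        print(f"segment {seg}: {len(seg_rows)} rows")  # each segment scans only its own slice

Add more segments and the same hashing spreads the work across them -- which is why "rack up more commodity hardware" translates so directly into more throughput.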
Since the EMC acquisition, their software stack has moved beyond the initial GP database to include an enterprise-grade Hadoop distribution (Greenplum HD), which acts as a front end for data loading and first-level grinding, and Greenplum Chorus, which provides the portal for driving workflows and collaboration in the environment.
Of particular interest is the Greenplum DCA -- data computing appliance.  Yes, it's nothing more than an optimized set of commercial technologies (servers, storage, interconnect, etc.), but it's pre-configured, pre-tested and supported as a whole using EMC's enterprise support model.
I know, many of you reading this would love the opportunity to design, assemble and support your own creation, but for a lot of folks that just isn't an attractive option.  They want to use the technology, not invest in hand-crafting it.
The important announcement here was around a more-unified DCA.  In addition to the original modules that support the Greenplum database (available in both high-capacity and high-performance configurations), there are now Greenplum HD modules to support those workloads, and an interesting new Data Integration Accelerator module that supports a variety of third-party analytics tools from the community and our ISV partners.
Customers can add various modules as their needs change, and as the underlying technologies go through their predictable tick-tock of performance increases, price decreases and expanding capacities.
In essence, the Greenplum DCA has now become a single infrastructure that can support the big data analytics process: from raw information ingestion to advanced analytics built on industry-standard hardware and supported by a single vendor.
And I'm guessing it's going to be rather popular :)
Big Data Analytics And Core Business Processes
At the recent EMEA Analysts Summit, Jeetu Patel ran a fascinating session on how the insights gained from big data analytics were causing many enterprises to re-think how they built their core business processes, and how Documentum's new xCP environment was playing a key role.
The classic example is loan scoring.  Traditionally, that might have involved such things as credit history, income, employment status and so on.
But when that is complemented by analytics that include house-price predictions for the local market, the local employment picture, macroeconomic forecasting and more, loan-scoring accuracy enters an entirely new realm.
Score loans better and you can price them better.  Price them better, and you make a lot more money.
It doesn't take a rocket scientist to grasp the impact.
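As a deliberately oversimplified sketch -- hypothetical feature names and weights, and certainly not anyone's real underwriting model -- you can think of it as blending the traditional applicant score with analytics-derived market signals:

    def traditional_score(applicant):
        # Hypothetical weights over the classic inputs: credit history, income, employment.
        return (0.5 * applicant["credit_score"] / 850
                + 0.3 * min(applicant["income"] / 100_000, 1.0)
                + 0.2 * (1.0 if applicant["employed"] else 0.0))

    def augmented_score(applicant, market):
        # Blend the traditional score with analytics-derived local-market signals
        # (e.g. forecast trends scaled to the range -1..1); weights are invented.
        base = traditional_score(applicant)
        market_adj = (0.6 * market["house_price_trend"]
                      + 0.4 * market["local_employment_trend"])
        return 0.8 * base + 0.2 * (0.5 + 0.5 * market_adj)

    applicant = {"credit_score": 720, "income": 85_000, "employed": True}
    market = {"house_price_trend": -0.3, "local_employment_trend": 0.1}
    print(f"traditional: {traditional_score(applicant):.2f}")        # ~0.88
    print(f"augmented:   {augmented_score(applicant, market):.2f}")  # ~0.79, softened by the local market

The interesting part is the second function's inputs: they come from exactly the kind of large-scale analytics discussed above, not from anything on the loan application itself.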
He gave another example of how using external social feeds greatly changed a core process everyone uses: hiring and recruitment.  And -- without too much effort -- you can come up with hundreds of core business processes across industry after industry that fundamentally change in the face of advanced analytical insight.
Jeetu flatly stated that most core business processes would be re-engineered along these lines over the next five years.  I have to agree -- it's inevitable given this perspective.  And I'm rather glad that an important division of EMC (IIG) is creating the enabling technology (xCP) to exploit the business value gleaned from big data analytics.
Stepping Back A Bit
Many of us see big data as the next important frontier for creating new value from information.  Yes, there will be plenty of cool technologies (at massive scale!), but the real challenge will be creating end-to-end environments that help organizations move from raw, unfiltered data feeds to critical insights and the ability to react as part of their core operations.
Exciting times indeed.  
And I feel privileged to work for a company that's investing in this brave new world.

By: Chuck Hollis