Monday, May 16, 2011

Big Science Means Big Data

I don't know about you, but modern astronomy fascinates me: just about everything I learned about the subject when younger has since been re-written.  Like most of the sciences, things are moving fast indeed.

Increasingly, pushing the boundaries of human understanding now requires big data.  The bigger the science, the bigger the data, it seems.

As a case in point, I'd like to discuss the proposed Square Kilometre Array (SKA), and how -- from a storage perspective -- it pushes the boundaries for all of us.

To Begin With
The bigger the telescope, the more signal it can capture and correlate.  The more signal, the farther away it can probe, and -- correspondingly -- the farther back in time we can look.


Not only that, but there's increasing evidence that many of the biological precursors for life are routinely manufactured in interstellar space.  I don't know about you, but that gets my curiosity going :)

When we move from shorter-wavelength optical to longer-wavelength radio, telescopes can get very big indeed.  Big radio telescopes need big land, and -- ideally -- a location with a minimum of background interference.

As I understand it, the contest to host this monster is now down to either South Africa or Australia/New Zealand (ANZ).  Either way, there's going to be a *lot* of data.
Fun With Numbers
Consider this quote from a recent article:

"... each of the 3000 dishes will be collecting data continuously and when combined, the SKA will produce nine million signals at once, enough to fill five thousand 160-gigabyte mp3 players every minute."
The storage geek in me finds this fascinating.
Let's see ... five thousand times 160 GB is 800 TB per minute.  At ten hours per day (600 minutes), that's about 480 petabytes per day.  Now, assume it runs for a few years, say three, to get a good survey baseline for researchers.  We can easily be at 500,000 petabytes (roughly 500 exabytes) without too much effort.
On behalf of EMC and the entire storage industry, sign me up!
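
For anyone who wants to check the back-of-the-envelope math, here is a minimal Python sketch of it.  The player count and capacity come from the quoted article; the ten-hour day is my own assumption from above, and I've taken "a few years" to mean three (also an assumption), using decimal units throughout (1 PB = 1,000 TB = 1,000,000 GB).

```python
# Back-of-the-envelope check of the SKA storage numbers.
# Decimal units are assumed: 1 TB = 1,000 GB and 1 PB = 1,000 TB.

PLAYERS_PER_MINUTE = 5_000  # from the quoted article
GB_PER_PLAYER = 160         # from the quoted article
HOURS_PER_DAY = 10          # assumption made in this post
YEARS = 3                   # assumption: "a few years" taken as three


def tb_per_minute():
    """Raw data rate in terabytes per minute."""
    return PLAYERS_PER_MINUTE * GB_PER_PLAYER / 1_000


def pb_per_day(hours=HOURS_PER_DAY):
    """Daily volume in petabytes for the given hours of operation."""
    return tb_per_minute() * 60 * hours / 1_000


def pb_total(years=YEARS, hours=HOURS_PER_DAY):
    """Cumulative volume in petabytes, ignoring leap days."""
    return pb_per_day(hours) * 365 * years


if __name__ == "__main__":
    print(f"{tb_per_minute():,.0f} TB per minute")
    print(f"{pb_per_day():,.0f} PB per day")
    print(f"{pb_total():,.0f} PB over {YEARS} years")
```

Change HOURS_PER_DAY or YEARS to taste; the total scales linearly with both, so the order of magnitude doesn't move much.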

Seriously, though, given the current state (and economics!) of storage technologies, no one's going to be standing up that amount of capacity anytime soon.  But it does serve to illustrate the insatiable demand for ever more capacity -- especially in many areas of advanced scientific research.

The Bad News
Unfortunately -- like most initiatives -- data captured will be limited solely by the funding model and current technological limits.  There will be X amount of money for storage capabilities.  No matter how efficient or how compressed -- most of the data will inevitably be thrown away.
Indeed, building high-speed computers to decide in real time what to keep and what to discard is itself a major undertaking.
... To deal with the problem, Gaensler and colleagues are working on new intelligent computer algorithms to process the torrent of data.  "We need computers that can do the job of humans, but make decisions on a timescale of microseconds. It would decide if something is interesting or should be thrown away," he says.  "Undoubtedly we'll occasionally be throwing out important data."
That's unfortunate -- having to deploy enormous amounts of computing resource to simply figure out what's not worth saving, and frequently getting it wrong.

Using The Data
Most of the article focuses on the primary challenges associated with simply capturing the signal streams.  Step back a minute, and consider the related challenge of making all these data sets freely available to researchers around the world.  Even more storage -- and considerable bandwidth as well.   In some respects, this aspect becomes even more important than primary data capture.
Much of the interesting work in modern astronomy involves comparing time series over very long periods of time -- years quickly become decades or longer.  So it's safe to imagine this data being around a very long time indeed.
Big data becomes even bigger data.

A Quick Plug
At EMC, we've become intensely interested in these researchers -- who are they, what are they doing, and what do they need from us?

As a result, EMC is hosting what we believe to be an industry first: a summit for data scientists, held at EMC World next week.
Take a quick look at the agenda -- this isn't about technology; it's about what's now possible that simply couldn't be considered before.
And that's cool.

Cutting Edge -- Or Merely A Preview Of Things To Come?
It's tempting to look at initiatives such as this and immediately classify them as exotic outliers -- certainly not anything any of us would ever encounter from an IT perspective.
Really?  

The energy industry is now contemplating what to do with all the metering data that's starting to be available from smart grids and intelligent appliances.  Some law enforcement agencies are starting to drown in all the video that's now available.  Health researchers have realized that more data means better outcomes for patients.  Investment firms are now starting to monitor Twitter streams to gauge consumer sentiment.

And all of that is even before we settle the matter of smartphones tracking our every location.
Look around a bit, and you'll see signs of big data showing up just about everywhere.
Are you ready?

By: Chuck Hollis
VP -- Global Marketing CTO
EMC Corporation