Indiana University (IU) is looking to widen lanes and raise the speed limit on the information superhighway — it recently introduced the Data Capacitor, a file system designed to store and manipulate large data sets.
The Data Capacitor has a single-client transfer rate of 977 megabytes per second across the TeraGrid network. TeraGrid is an open scientific discovery infrastructure that combines large computing resources at nine sites, such as supercomputer centers and universities, that have partnered with the National Science Foundation to create a geographically diverse computational resource.
Work on the Data Capacitor began with a grant of $1.72 million from the National Science Foundation in late 2005. Steven Simms, Data Capacitor project leader, said the idea behind the system was to create a facility that would do three things: provide large storage, provide fast storage and provide researchers with a way to find large data sets after transfer.
“The premise is that digital instruments these days, including machines that are producing simulation data, produce that data at an alarming rate,” Simms said. “I like to call it the ‘data fire hose.’ If you’re going to capture the data, that means you’ve got to be able to ingest that data quickly, and if your simulations are running for a long time, you’ve got to have hundreds of terabytes of space to accommodate multiple streams of this kind of data from different departments. So, we set down this path: We started mounting the file system on multiple locations, tying local resources together.”
This is where the TeraGrid comes into play.
“The idea is that by having this across a wide area, it frees you up from the overhead of data transfer, so you don’t necessarily have to at every step of your work flow control the speed of someone else’s schedule,” Simms said. “You can’t necessarily control how fast your network is going to go, so you have this vast data reservoir or ‘data parking lot’ where you push your data temporarily so that you can inject it someplace else. Your services don’t have to push the data from resource to resource. It saves you time — the data’s already there.”
The system’s speed and efficiency have the potential to change how scientists collaborate across great distances, greatly improving that process.
Simms described a hypothetical scenario in which a scientific instrument is in one location, and a researcher far from the instrument wants to use it to harvest data.
“[Using the Data Capacitor through the TeraGrid network] it would be possible for them to send small messages to that instrument, and then the instrument could blow data to a shared file space,” Simms said. “The data would then leave the instrument and be able to be written quickly to this central file system. Then, the researcher at a remote point could mount that file system and pull that data, analyze that data or compute against that data seamlessly.”
Simms said the system could affect a broad range of scientific disciplines, citing archeological image preservation and astronomy as examples. He relayed a news story about an astronomer who complained that pushing a terabyte of data for a particular project was going to take him 30 days.
“I thought, ‘That’s a crime,’” Simms said. “We can do it across the TeraGrid network in 20 minutes.”
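As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not a description of how the Data Capacitor measures throughput: it assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes) and a transfer that sustains the quoted 977 megabytes per second from start to finish, neither of which the article specifies. The function names are illustrative.

```python
# Back-of-the-envelope check of the transfer figures quoted in the article.
# Assumptions (not stated in the article): decimal units and a sustained rate.

TERABYTE = 10**12        # bytes, decimal convention (assumed)
MEGABYTE = 10**6         # bytes, decimal convention (assumed)
SECONDS_PER_DAY = 86_400


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time in seconds to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


def implied_rate(size_bytes: float, seconds: float) -> float:
    """Average rate in bytes per second implied by a transfer lasting `seconds`."""
    return size_bytes / seconds


if __name__ == "__main__":
    fast = transfer_seconds(TERABYTE, 977 * MEGABYTE)
    slow = implied_rate(TERABYTE, 30 * SECONDS_PER_DAY)
    print(f"1 TB at 977 MB/s: about {fast / 60:.0f} minutes")
    print(f"1 TB in 30 days: about {slow / MEGABYTE:.2f} MB/s")
```

Under these assumptions, a terabyte at 977 megabytes per second takes roughly 17 minutes, which is consistent with the 20-minute figure Simms cites. The astronomer’s 30-day estimate works out to an average of about 0.39 megabytes per second.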