Cluster Databases and an Alternative


We all know that a growing number of Web sites attract an ever-increasing volume of users. As the social networks have extended their invitations to all demographic groups (Facebook has become more than a Saturday-night venue for college students checking out their crushes) and Google's brand name has earned colloquial status as a verb, the databases behind popular Web sites and applications noticeably groan under the stress of success.

Aster Data Systems saw a gap in the market: a dearth of databases that let companies quickly analyze data in the tens or hundreds of terabytes near its source while keeping costs low.

“This was increasingly apparent as a need in large-scale Internet companies such as Google, Yahoo or advertising networks. [They] had built their own systems to manage data volumes for analysis of this size on clusters of commodity hardware,” said Mayank Bawa, CEO and co-founder of Aster.

Aster’s nCluster customers, including MySpace (which processes, analyzes and stores upwards of 200 terabytes of active data), receive combined processing, networking and data recovery in Internet-scale analytic databases. The nCluster software’s architecture separates data workloads into three layers: one that grows or shrinks with the number of data users, one dedicated to data analysis and one for data loading.
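The article does not show nCluster's internal tier names or APIs, so the following is a hypothetical plain-Java sketch of the idea: incoming work is routed by type to one of three independently sized tiers, so a bulk load can never compete with interactive queries. The class and tier names are my own illustration, not Aster's.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of three-tier workload separation: each tier is
// scaled independently so querying, analysis and loading don't compete.
public class TierRouter {
    enum Tier { QUERY, ANALYSIS, LOADING }

    // node counts per tier; the query tier grows/shrinks with user load
    private final Map<Tier, Integer> nodes = new EnumMap<>(Tier.class);

    public TierRouter(int queryNodes, int analysisNodes, int loadingNodes) {
        nodes.put(Tier.QUERY, queryNodes);
        nodes.put(Tier.ANALYSIS, analysisNodes);
        nodes.put(Tier.LOADING, loadingNodes);
    }

    // classify a job by its type, not by which node happens to be free
    public Tier route(String jobType) {
        switch (jobType) {
            case "load":  return Tier.LOADING;
            case "query": return Tier.QUERY;
            default:      return Tier.ANALYSIS;
        }
    }

    // elastically resize one tier without touching the others
    public void resize(Tier tier, int newCount) { nodes.put(tier, newCount); }

    public int size(Tier tier) { return nodes.get(tier); }
}
```

The point of the separation is that resizing one tier (say, adding query nodes for a traffic spike) leaves the loading and analysis tiers untouched.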

MySpace uses nCluster to study Web traffic. It has a 100-node cluster that can juggle the functional demands on 360 terabytes of data. As soon as new nodes are plugged into this or any other network, Aster incorporates them into the cluster, sans user intervention, while the rest of the system remains functioning and open for business.

“This speeds up performance and efficiency considerably. There is no competition between processors and memory,” Bawa said. “Processes run independently and with as much processing power as needed. Loading, querying, backup, export and more can occur concurrently.”

“In other systems, as you scale, the utilization of other system resources drops,” he said. “If the pipe between the processors and the data is choked, you can’t get your data fast enough to the processors, even if you have multiple processors.”

nCluster runs on algorithms that control data placement, partitioning, balancing, replication and querying across everyday hardware. “Effectively, we remove the network bottleneck, which enables us to exploit the full power of a cluster,” Bawa said.
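Aster's actual placement algorithms are proprietary, but the general technique Bawa describes (spreading and replicating partitions across commodity nodes so data sits near its processors) can be sketched in a few lines. This is a generic hash-partitioning illustration under my own assumptions, not nCluster's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hash-based data placement: each row key maps to
// a primary node, and replicas land on the next nodes in the ring so a
// single node failure loses no data.
public class HashPlacement {
    private final int nodeCount;
    private final int replicas;

    public HashPlacement(int nodeCount, int replicas) {
        this.nodeCount = nodeCount;
        this.replicas = replicas;
    }

    // primary node for a row key; floorMod keeps negative hashes in range
    public int primary(String key) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    // primary plus (replicas - 1) successor nodes
    public List<Integer> placement(String key) {
        List<Integer> nodes = new ArrayList<>();
        int p = primary(key);
        for (int i = 0; i < replicas; i++) {
            nodes.add((p + i) % nodeCount);
        }
        return nodes;
    }
}
```

Because every node can compute a key's location locally, queries go straight to the data instead of crossing a shared network pipe, which is the bottleneck Bawa says this design removes.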

Java infrastructure software company Terracotta has taken a different approach to easing the strain on relational databases. Its answer to the physics of inherently limited hardware is a middleman cluster that manages transient data.

“Our founders came from one of the first large retail sites,” explained Amit Pandey, CEO of Terracotta. “Almost 10 years ago, they started to realize that this was going to be a very large business.”

By 2005, the site’s traffic rivaled that of any other commerce site, and it was selling more than $1 billion in consumer goods.

As a result, the database took a hit with the constant queries and writing of data back and forth during transactions. So the team — the same team that later formed Terracotta — developed a software layer for transient data such as credit card information. Many of these pieces of transient data reached the database at some point, but many had no place there and did not need to be communicated to the database during a purchase.

The Terracotta software batches transient data in memory until the necessary parts are parsed and committed to the relational database.
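Terracotta's actual API isn't shown in the article, so the class below is a hypothetical stand-in for the batching idea: intermediate transaction state stays in memory, and at commit time only the durable fields are parsed out and written to the database in one batch. All names here are my own illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: buffer short-lived transaction state in memory,
// then flush only the fields the system of record needs, in one batch.
public class TransientBuffer {
    private final Map<String, Map<String, String>> sessions = new HashMap<>();
    private final Consumer<List<Map<String, String>>> databaseWriter;

    public TransientBuffer(Consumer<List<Map<String, String>>> databaseWriter) {
        this.databaseWriter = databaseWriter;
    }

    // every intermediate update lands in memory, not in the database
    public void put(String sessionId, String field, String value) {
        sessions.computeIfAbsent(sessionId, k -> new HashMap<>()).put(field, value);
    }

    // at checkout, keep only the durable fields and commit them together;
    // purely transient fields (e.g. card data held mid-purchase) never reach the DB
    public void commit(List<String> durableFields) {
        List<Map<String, String>> batch = new ArrayList<>();
        for (Map<String, String> session : sessions.values()) {
            Map<String, String> row = new HashMap<>();
            for (String f : durableFields) {
                if (session.containsKey(f)) row.put(f, session.get(f));
            }
            batch.add(row);
        }
        databaseWriter.accept(batch);   // one batched write instead of many round trips
        sessions.clear();
    }
}
```

The win is twofold: the database sees one write per batch rather than one per keystroke, and data that has no place in the database never touches it at all.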

What eases adoption of this cluster-database alternative is its foundation in the object format, which makes it compatible with consumer-facing, Java-based applications.

Run on a server infrastructure, Terracotta joins multiple Java virtual machines into one large memory pool that can be used like a database, Pandey said.
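Terracotta's clustering configuration isn't reproduced in the article, so here is a hypothetical single-process analogue of the concept: worker threads play the role of separate JVMs, and a shared `ConcurrentHashMap` stands in for the clustered memory pool they all see as one heap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in: threads act as separate JVMs, and a
// ConcurrentHashMap acts as the one shared memory pool they all see.
public class SharedHeapDemo {
    public static final Map<String, Integer> sharedHeap = new ConcurrentHashMap<>();

    // "jvm1" writes a cart size; "jvm2" then sees and updates the same entry
    public static int demo() {
        Thread jvm1 = new Thread(() -> sharedHeap.put("cart:alice", 3));
        Thread jvm2 = new Thread(() -> sharedHeap.merge("cart:alice", 2, Integer::sum));
        try {
            jvm1.start(); jvm1.join();
            jvm2.start(); jvm2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sharedHeap.get("cart:alice");
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints 5
    }
}
```

In the real product the writers and readers are separate processes on separate machines, but the programming model is the same: application code manipulates ordinary Java objects and the middleware keeps every JVM's view consistent.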

“Many of our customers come to us instead of going to a cluster database,” he said.

Cluster databases, because they require new hardware each time they expand, cost more as a site scales upward. Terracotta, by contrast, sits in front of whatever databases are already in place.

And while Terracotta software could be used as a database, the company does not want its clients to ditch their databases; it wants them to feel the financial relief of avoiding additional CPU investments.

“Many customers ask for search and query functions so they can get rid of their databases, but that is not something we are interested in doing in the short term,” Pandey said.

Kelly Shermach is a freelance writer based in Chicago, Ill., who frequently writes about technology and data security. She can be reached at editor (at) certmag (dot) com.


