Spark: Open Source Superstar Rewrites Future of Big Data


RAM SRIHARSHA WORKS in the engine room powering one of Silicon Valley’s most influential companies. He’s an engineer at Yahoo.

Even after naming ex-Google star Marissa Mayer chief exec, Yahoo is often derided as a thing of the past, a fallen giant struggling to keep pace with the likes of Google, Facebook, and Twitter. Behind the scenes, though, thanks to people like Sriharsha, Yahoo is in many respects a step ahead of its much flashier competition, and has been for years.

Yahoo’s Sunnyvale, California, headquarters is ground zero for Hadoop, an open source software platform that underpins a Who’s Who of the internet, including Facebook and Twitter. Having reinvented not only the web but the world of business software, this sweeping platform, a means of crunching vast amounts of data across thousands of computer servers, is one of the great open source success stories of the past decade, and its influence is only expanding. But Yahoo, its founding father, is moving on.

Teaming with a particularly ambitious group of computer scientists from the University of California at Berkeley, Sriharsha is installing a new data-crunching platform inside the massive data centers that drive Yahoo’s still-enormous online empire. This software platform is called Spark, and according to those who built it and use it, it’s about 100 times faster than the mighty Hadoop, and could very well replace Hadoop as the stuff that fuels the modern web.