MapR's Google deal marks second Big Data Cloud win
By Chris Bongard
4th July 2012

Just two weeks after inking a deal with Amazon Web Services, MapR gets an exclusive to run Hadoop services on the Google Compute Engine.

MapR's latest deal is tied to Google's big June 28 announcement of the Google Compute Engine, new infrastructure-as-a-service (IaaS) that sets up the search giant as a public-cloud rival to Amazon Web Services (AWS). MapR is one of at least six partners debuting services on the Google infrastructure, which is currently in limited beta release. MapR and Google are currently signing up customers to join a private preview of the Hadoop services that will run on Google Compute Engine.

News of the Google partnership came just two weeks after MapR and Amazon announced that services based on MapR's M3 and M5 Hadoop software distributions would be available on AWS. Where Amazon's own Elastic MapReduce service runs on Apache Hadoop, the MapR-based services add high-availability features not yet supported on the standard open source software.

A key appeal of the AWS and Google services will likely be the ability to process and analyze data that already resides in the cloud. The MapR-based services on AWS, for example, are integrated with Amazon's Simple Storage Service (S3) and DynamoDB NoSQL database. Google AdWords and Google (Web) Analytics are both potentially rich, high-volume sources of search and click-stream data that Google Compute Engine customers could presumably tap without costly and time-consuming data-integration and data-movement steps.

"The big challenges in media are figuring out who to target, when to target, appropriate price points, and appropriate keyword bids, so you could easily see related digital media and advertising analyses performed on Google's cloud," MapR VP of marketing Jack Norris told InformationWeek.

By tapping compute capacity on demand, customers could potentially save money if they experience peaks and valleys in capacity utilization. To gauge Google Compute Engine performance, Norris said, MapR recently tested its beta Hadoop service by setting up a 1,256-node cluster and running the industry-standard TeraSort benchmark. The cloud-based system completed the job in one minute and 20 seconds, according to Norris, whereas the world record is one minute and two seconds.

"The record was set on a system that had twice as many cores, four times the number of disks, 200 more servers than the system we put together on the Compute Engine, and the cost of the infrastructure was in the neighborhood of $5 million," Norris said. "For the test that we ran on the Google Compute Engine, the cost would be about $16."
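The figures Norris quotes can be put side by side with a little arithmetic. The short Python sketch below uses only the numbers cited above; the variable names are illustrative, and the comparison sets the approximate purchase price of the record-setting system against the approximate rental cost of a single cloud run, so it is a rough illustration rather than a like-for-like cost analysis.

```python
# Figures as quoted by Norris in the article; variable names are illustrative.
record_seconds = 1 * 60 + 2    # world-record TeraSort time: 1 min 2 s
cloud_seconds = 1 * 60 + 20    # MapR run on Google Compute Engine: 1 min 20 s

# Fraction by which the cloud run was slower than the record run.
slowdown = cloud_seconds / record_seconds - 1

record_infra_cost_usd = 5_000_000  # approximate infrastructure cost of the record system
cloud_run_cost_usd = 16            # approximate cost of the single cloud test run

# Assumption: treat the record system's purchase price as a budget and ask
# how many cloud runs of this size it would cover (ignores operating costs).
runs_for_same_spend = record_infra_cost_usd // cloud_run_cost_usd

print(f"Cloud run was {slowdown:.0%} slower than the record")
print(f"${record_infra_cost_usd:,} would cover about {runs_for_same_spend:,} such runs")
```

On these numbers the cloud run was roughly 29 percent slower than the record, while the quoted infrastructure cost would cover about 312,500 runs of the same size.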

Comparable tests of MapR-based Hadoop clusters have not been performed on Amazon's infrastructure, Norris said. In the case of AWS, companies use the S3 services for everything from Web logs and click-through data to genomics data, and they use Amazon Elastic MapReduce and MapR-based Hadoop for analytics.

"The cloud is also an excellent target for business continuity, so instead of having a complete second data center, you can run Hadoop clusters in the cloud, with mirroring synchronized between your on-premises and cloud-based targets," Norris said.

Some analysts say cloud-based services will be prohibitively expensive for long-term storage at high scale, making them most attractive for pilot tests, brief projects, and cases where the data already exists in the cloud (as in the case of Google AdWords, Google Analytics, AWS S3, and DynamoDB). Norris took exception to that analysis.

"I think we're going to see generations of cloud services, and [costs at scale] are not going to be as much of a factor in the future," Norris said.

MapR distinguishes itself from Hadoop software distribution and support competitors Cloudera and Hortonworks by providing high-performance options not supported on standard Apache open source Hadoop software. MapR's M5 distribution, for example, replaces the Hadoop Distributed File System (HDFS) with a derivative of the Unix-based Network File System. M5 includes snapshotting, mirroring, and other high-availability features that aren't supported on the current (1.0) Hadoop code line.

MapR describes the AWS and Google services based on its distributions as an endorsement of its architecture, but there are plenty of options to run Cloudera and Hortonworks in the cloud. Hortonworks is the developer of the software used to run Hadoop on Microsoft's Azure public cloud. And multiple providers run Hadoop services on AWS and other public clouds using Cloudera's CDH Hadoop software distribution.

Responding to requests for comment on MapR's recent deals, Cloudera VP of product Charles Zedlewski said in a statement, "Cloudera has led the industry in support for Apache Hadoop on public clouds, supporting Rackspace, AWS, and Softlayer dating back to 2009. Every month, tens of thousands of CDH instances are created on top of various public cloud providers."

Zedlewski also noted that Cloudera developed Apache Whirr, software now used by Cloudera and its competitors to run Hadoop distributions on public clouds.

The entire Hadoop movement was actually inspired by Google, which was a pioneer in the use of MapReduce processing and published the white paper that guided the creators of Hadoop. Google still uses MapReduce processing extensively internally, but its software is not distributed and its approach to MapReduce is not made available as a service on the Google Compute Engine.

Read full article at source - InformationWeek

