Big Data: Part 2 - Technology
By Mark Dexter
10th September 2011

Apache Hadoop is seen by many experts as a driving force behind big data analytics. Hadoop is an open-source framework that combines a distributed file system (HDFS) with parallel processing of large-scale unstructured data spread across multiple connected systems. It incorporates various open-source software elements, including the Chukwa data collection and monitoring system, the HBase database, the Hive data warehouse, the Pig query tool and the ZooKeeper configuration and synchronisation software.

Hadoop also relies on the MapReduce programming model, first described by Google in 2004 and originally used to analyse its web indexes, which supports distributed computing on large structured or unstructured data sets spread across clusters or grids of connected computers. Other developers have taken a different route, using the Enterprise Control Language (ECL) on the HPCC (high-performance computing cluster) platform to build their own distributed file systems and data-mining grids.
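
To make the model concrete, the sketch below shows the classic word-count example written for Hadoop Streaming, which lets the map and reduce steps be supplied as ordinary scripts that read from standard input and write to standard output. It is an illustrative example rather than anything drawn from the products above; Python is used purely for brevity.

# mapper.py: emit one "word<TAB>1" pair for every word found in the input split.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py: Hadoop sorts the mapper output by key before it reaches the reducer,
# so identical words arrive as a contiguous run and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The same pair of scripts can be tested on a single machine by piping a text file through the mapper, a sort and then the reducer, before being submitted to a cluster with the Hadoop Streaming jar; the framework takes care of splitting the input, scheduling the work across nodes and re-running failed tasks.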

Because it can be used to process and analyse data that resides outside, as well as inside, the corporate firewall, some multinational organisations use Hadoop to collect data from thousands of sensors in warehouses and factories located in different parts of the world. The software is batch-oriented, however: it can be complicated to configure and results can be slow to arrive. While any IT department can deploy it on its own, a number of companies, most notably Cloudera, offer support and management packages around the Hadoop platform.

IBM has also looked to simplify Hadoop-based analytics, using the technology as the base for its InfoSphere BigInsights and Streams analytics applications, which process text, audio, video, social media, stock prices and data captured by sensors.

Business intelligence and visualisation software

While extracting and analysing big data is complex in itself, presenting the findings in a meaningful way can be just as hard. For years, business intelligence and analytics software has been used to output results into Microsoft Excel or specialist reporting tools, such as Crystal Reports, but fresh approaches to visualising that data can help business departments interpret predictive analytics more easily.

Data visualisation tools pull information from other BI applications or directly from the underlying data sets and present it graphically rather than as numbers and text alone. Tableau Software is a good example, and similar tools, both proprietary and open-source, include Tibco Spotfire, IBM OpenDX, Tom Sawyer Software, Mondrian and Avizo, the last of which specialises in manipulating and understanding scientific and industrial data.
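
As a rough illustration of what these tools automate, the short Python script below reads a small tabular extract and renders it as a chart rather than a table of numbers, using the pandas and matplotlib libraries; the file name and column names are invented for the example and are not taken from any of the products above.

# plot_sales.py: turn a tabular extract into a simple chart.
# "monthly_sales.csv" and its "month"/"revenue" columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("monthly_sales.csv")

# Plot revenue by month as a bar chart instead of presenting the raw figures.
sales.plot(kind="bar", x="month", y="revenue", legend=False)
plt.ylabel("Revenue")
plt.title("Monthly revenue")
plt.tight_layout()
plt.savefig("monthly_revenue.png")

Commercial visualisation tools layer interactivity, drill-down and live connections to BI systems on top of this basic step, but the underlying idea of mapping rows and columns onto visual encodings is the same.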

MPP appliances

Hardware appliances specifically designed to support massively parallel processing (MPP) of large datasets have recently surfaced following a round of software acquisitions by hardware vendors.

Storage giant EMC offers a data computing appliance built on the Greenplum Database 4.0, delivering data loading performance of up to 10TB an hour and aimed primarily at large telecommunications companies and big retailers. The Oracle Exadata, Netezza TwinFin and Teradata 2580 are other MPP appliances built on multiple servers and CPUs, offering data storage capacities of 20TB to 128TB and loading rates of 2TB to 5TB per hour (terabytes per hour, or TB/hr, is simply a measure of data loading throughput; at 2TB/hr, for example, filling a 20TB appliance from empty would take around ten hours).

Dell, for example, has partnered with Aster Data, optimising its nCluster MPP data warehouse platform to run on Dell PowerEdge C-Series servers for large-scale data warehousing and advanced analytics. However, it is unclear whether that partnership will continue now that Aster Data is wholly owned by Teradata.

The latest version of HP Vertica uses a mix of cloud infrastructure-as-a-service (IaaS), virtual and physical resources to run analytics on SQL databases. Though the software has yet to make it onto a specialised hardware appliance, HP promises one is imminent, and Vertica 5.0 includes a software development kit so that customers can adapt or add APIs to existing analytics applications to pull data out of the MPP platform.

IBM has also developed a pay-as-you-go data storage system, dubbed Scale Out Network Attached Storage (SONAS), which is built on its own clustered file system and is capable of hosting up to 14.4PB of information.

Read at source: Computing.co.uk
