Big Data: Part 2 - Technology
By Mark Dexter
10th September 2011

Many experts see Apache Hadoop as a driving force behind big data analytics. Hadoop is an open-source framework, built around a distributed file system (HDFS), that supports parallel processing of large-scale unstructured data spread across multiple connected systems. Its wider ecosystem includes other open-source projects, among them the Chukwa data collection and monitoring system, the HBase database, the Hive data warehouse, the Pig query tool and the ZooKeeper configuration and synchronisation service.

Hadoop also relies on the MapReduce programming model, which Google described publicly in 2004 and used to build its web indexes. MapReduce supports distributed computing on large structured or unstructured data sets held on clusters or grids of connected computers. Other developers have used the Enterprise Control Language (ECL) on the High-Performance Computing Cluster (HPCC) platform to build their own distributed file systems or data mining grids.
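To make the MapReduce model described above more concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would spread the map and reduce steps across many machines rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (illustrative only)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big clusters", "data on clusters"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle be farmed out to separate machines, which is the property that lets the model scale across a cluster.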

Because it can process and analyse data that resides outside, as well as inside, the corporate firewall, some multinational organisations use Hadoop to collect data from thousands of sensors in warehouses and factories around the world. The software is batch-oriented, can be complicated to configure, and is slow to return results. While any IT department can deploy it itself, some companies offer support and management packages around the Hadoop platform, most notably Cloudera.

IBM has also looked to simplify Hadoop-based analytics, using the technology as the base for its InfoSphere BigInsights and Streams analytics applications, which process text, audio, video, social media, stock prices and data captured by sensors.

Business Intelligence and visualisation software

While extracting and analysing big data is complex in itself, presenting the findings in a meaningful way can be just as hard. For years, business intelligence and analytics software has been used to output results into Microsoft Excel or specialist reporting tools, such as Crystal Reports, but fresh approaches to visualising that data can help business departments interpret predictive analytics more easily.

Data visualisation tools pull information from other BI applications or directly from underlying data sets, then present it graphically rather than as numbers and text alone. A good example is Tableau Software, but similar tools, both proprietary and open-source, include Tibco Spotfire, IBM OpenDX, Tom Sawyer Software, Mondrian and Avizo (the last of these specialising in manipulating and understanding scientific and industrial data).

MPP appliances

Hardware appliances specifically designed to support massively parallel processing (MPP) of large datasets have recently surfaced following a round of software acquisitions by hardware vendors.

Storage giant EMC offers a data computing appliance built on the Greenplum database 4.0, delivering data loading performance of up to 10TB an hour and aimed primarily at large telecommunications companies and big retailers. The Oracle Exadata, Netezza TwinFin and Teradata 2580 are other MPP appliances built on multiple servers and CPUs, offering data storage capacities ranging from 20TB to 128TB and load rates of 2TB to 5TB per hour (terabytes per hour, or TB/hr, is a measure of data throughput).
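As a rough guide to what those load rates mean in practice, the short Python sketch below estimates how long a full load would take at each quoted rate. It assumes the quoted figure is sustained for the whole load, which real systems may not achieve; the function name load_hours is hypothetical.

```python
def load_hours(dataset_tb, rate_tb_per_hour):
    """Hours needed to load dataset_tb terabytes at a constant TB/hr rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return dataset_tb / rate_tb_per_hour


if __name__ == "__main__":
    # 128TB is the largest capacity quoted above; 2, 5 and 10 TB/hr are
    # the load rates quoted above, assumed here to be sustained throughout.
    for rate in (2, 5, 10):
        print(f"128TB at {rate}TB/hr: {load_hours(128, rate):.1f} hours")
```

On these assumptions, filling a 128TB appliance takes 64 hours at 2TB/hr, 25.6 hours at 5TB/hr and 12.8 hours at 10TB/hr.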

Dell has hooked up with Aster Data, optimising its nCluster MPP data warehouse platform to run on Dell PowerEdge C-Series servers for large-scale data warehousing and advanced analytics. However, it is unclear whether that partnership will continue now that Aster Data is wholly owned by Teradata.

The latest version of HP Vertica uses a mix of cloud computing infrastructure-as-a-service (IaaS), virtual and physical resources to run analytics on SQL databases. Though the software has yet to make it onto a specialised hardware appliance, HP promises this is imminent and is including a software development kit for Vertica 5.0 so that customers can adapt or add APIs to existing analytics applications to pull data out of the MPP platform.

IBM has also developed a pay-as-you-go data storage system, dubbed Scale Out Network Attached Storage (SONAS), which uses its own clustered file system and can host up to 14.4PB of information.
