Big Data: Part 2 - Technology
By Mark Dexter
10th September 2011

Apache Hadoop is seen by many experts as a driving force behind big data analytics. Hadoop is an open-source framework that combines a distributed file system (HDFS) with parallel processing of large-scale unstructured data spread across multiple connected systems. Around it sits a family of open-source components, including the Chukwa data collection and monitoring system, the HBase database, the Hive data warehouse, the Pig query tool and the ZooKeeper configuration and synchronisation service.
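As a rough sketch of how an application talks to Hadoop's storage layer, the following Java snippet writes and then reads a file through the HDFS FileSystem API. The namenode address, file path and sample record are placeholders, not details from any particular deployment.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's namenode (placeholder address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks
        // and replicates them across the cluster's data nodes.
        Path path = new Path("/data/sensors/readings.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("warehouse-7,2011-09-10T12:00:00,21.4\n");
        }

        // Read the record back from the cluster.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}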

Hadoop also relies on the MapReduce programming model, first developed by Google in 2004 to analyse web indexes, which supports distributed computing on large structured or unstructured data sets sitting on clusters or grids of connected computers. Other developers have instead used the Enterprise Control Language (ECL) on the high-performance computing cluster (HPCC) platform to build their own distributed file systems and data-mining grids.
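To make the model concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API: the map step emits a count of one for each word it sees, and the reduce step sums those counts per word. The class names and paths are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}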

Because it can be used to process and analyse data that resides outside, as well as inside, the corporate firewall, some multinational organisations use Hadoop to collect data from thousands of sensors in warehouses and factories around the world. The software is batch-oriented, can be complicated to configure, and is slow to return results. While any IT department can deploy it in-house, a number of companies offer support and management packages around the Hadoop platform, most notably Cloudera.

IBM has also looked to simplify Hadoop-based analytics, using the technology as the base for its InfoSphere BigInsights and Streams analytics applications, which process text, audio, video, social media, stock prices and data captured by sensors.

Business intelligence and visualisation software

While extracting and analysing big data is complex in itself, presenting the findings in a meaningful way can be just as hard. For years, business intelligence and analytics software has been used to output results into Microsoft Excel or specialist reporting tools, such as Crystal Reports, but fresh approaches to visualising that data can help business departments interpret predictive analytics more easily.

Data visualisation tools pull information from other BI applications or directly from the underlying data sets, then present it graphically rather than as numbers and text alone. Tableau Software is a good example; similar tools, both proprietary and open-source, include Tibco Spotfire, IBM OpenDX, Tom Sawyer Software, Mondrian and Avizo (the last specialising in manipulating and understanding scientific and industrial data).
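As a minimal sketch of the underlying idea, rather than of any particular product above, the snippet below uses the open-source JFreeChart library (1.0.x API) to render a handful of figures as a bar chart image instead of a table of numbers. The sales figures and file name are invented.

import java.io.File;

import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

public class SalesChart {
    public static void main(String[] args) throws Exception {
        // Illustrative figures that would normally come from a BI
        // application or the underlying data set.
        DefaultCategoryDataset data = new DefaultCategoryDataset();
        data.addValue(3.2, "Sales (GBPm)", "Q1");
        data.addValue(4.1, "Sales (GBPm)", "Q2");
        data.addValue(2.8, "Sales (GBPm)", "Q3");
        data.addValue(5.0, "Sales (GBPm)", "Q4");

        // Render the numbers as a bar chart instead of a table.
        JFreeChart chart = ChartFactory.createBarChart(
                "Quarterly sales", "Quarter", "Sales (GBPm)", data,
                PlotOrientation.VERTICAL, true, false, false);

        // Save the chart as a PNG image for reports or dashboards.
        ChartUtilities.saveChartAsPNG(new File("sales.png"), chart, 640, 480);
    }
}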

MPP appliances

Hardware appliances specifically designed to support massively parallel processing (MPP) of large datasets have recently surfaced following a round of software acquisitions by hardware vendors.

Storage giant EMC offers a data computing appliance built on the Greenplum Database 4.0 that delivers data-loading performance of up to 10TB an hour, aimed primarily at large telecommunications companies and big retailers. Oracle Exadata, Netezza TwinFin and the Teradata 2580 are other MPP appliances built on multiple servers and CPUs, offering storage capacities of 20TB to 128TB and load rates of 2-5TB per hour (terabytes per hour, or TB/hr, is simply a measure of throughput: at 5TB/hr, for instance, loading a 50TB warehouse takes about ten hours).

Dell has hooked up with Aster Data, optimising its nCluster MPP data warehouse platform to run on Dell PowerEdge C-Series servers for large-scale data warehousing and advanced analytics. However, it is unclear whether that partnership will continue now that Teradata owns Aster Data outright.

The latest version of HP Vertica uses a mix of cloud infrastructure-as-a-service (IaaS), virtual and physical resources to run SQL-based analytics. Though the software has yet to make it onto a specialised hardware appliance, HP promises one is imminent, and Vertica 5.0 includes a software development kit so that customers can adapt existing analytics applications, or add APIs to them, to pull data out of the MPP platform.
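Since Vertica exposes a standard SQL interface, one straightforward way for an application to pull data out of the platform is plain JDBC. The sketch below assumes Vertica's JDBC driver is on the classpath; the host name, credentials and sales table are placeholders rather than details from the article.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerticaPull {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; 5433 is Vertica's default port.
        String url = "jdbc:vertica://analytics-host:5433/warehouse";

        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
             Statement stmt = conn.createStatement();
             // Illustrative query against a hypothetical sales table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(revenue) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}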

IBM has also developed a pay-as-you-go data storage system, dubbed Scale Out Network Attached Storage (SONAS), which runs on IBM's own clustered file system and can host up to 14.4PB of information.
