Data Science Maturity

I had the opportunity to participate in two pretty good conferences in 2011. The first was the data science-oriented Strata by O’Reilly in February, the other the more traditional BI/Netezza-focused Enzee Universe in June.

One striking difference between the two was the median age of participants: it seems baby-faced “Stratans” might well have been the progeny of graybeard “Enzeens.” I believe that age divide between the nascent DS and the now-mature BI is most telling.

In the earlier posts of this series, I’ve proposed that both data science and BI share underpinnings of business, technology and statistical science. According to data scientist Drew Conway: “First, one must have hacking skills …(which) in this context mean proficiency working with large, unstructured chunks of electronic data … Second, one needs a basic understanding of mathematics and statistics … Finally, and perhaps most importantly, a data scientist must have some substantive expertise in the data being analyzed.”

Data science and BI differ in the foci of their investigations. DS is consumed with supporting the development of data products. As Monica Rogati of LinkedIn notes, “On one side, I’ve been working on building products … The other side is finding interesting stories in the data.” BI, on the other hand, is all about measuring and managing business performance. At their best, though, both disciplines rest on an evidence-based “science of business” foundation, which is why I reject the contention by some that data science has a higher calling and is more scientifically sophisticated than BI.

DS and BI relate differently to the critical data that feeds them. Over time, BI has become obsessed with absolute answers using complete, precise, high-quality information, while data science often bludgeons solutions, settling for approximate responses from incomplete but massive data sets.

The maturity divide between data science and BI carries with it a number of cultural differences. At present, BI is probably more methodical and bureaucratic than DS, a gap that impatient-with-IT DS’ers consider a good thing. I suspect with maturity comes a governance “advantage” for BI as well. DS seems unencumbered by these “shackles,” but will probably start to look more like BI organizationally in time. Indeed, I believe that BI’s methodical, governed approach will positively impact DS, just as DS’s get-it-done intolerance of sloth and bureaucracy will rattle BI for the better.

With maturity comes an age divide that shows in platform software choices. Young DS’ers arrive at commerce from academia armed with the open source tools they learned in school: Perl/Python/Ruby for data integration, MySQL and Postgres for database management, R for analytics and graphics and, increasingly, cloud computing and the Hadoop ecosystem for big data handling.

BI’ers, in contrast, are more likely to have settled in over the years on proprietary offerings from big technology vendors for their BI tasks – e.g. Informatica or DataStage for integration, Oracle or IBM-Netezza for database management, BusinessObjects or Cognos for query and reporting, and SAS or SPSS for analytics.

With maturity also comes a work group size difference that promotes a wider division of labor in BI than in DS. Large BI shops now have business analysts, data analysts, DBAs, infrastructure specialists, developers, user experience experts, analytics experts, statisticians, et al. While the more sophisticated DS shops are rapidly growing and diversifying, many are still relatively small, with jack-of-all-trades contributors.

My take is that over time DS and BI will start to look more alike as areas of intersection between the disciplines grow. Indeed, OpenBI’s seeing that now with several of our current big data customers, where the database group and the Hadoop guys are already starting to align. We expect the BI/ETL-OLAP teams and the stats geeks to start joining forces soon as well. For these customers, the organizational distinctions between DS and BI may soon vanish. New data products and business performance evaluation will both be driven from a common analytics infrastructure. After all, is marketing attribution a data product or BI performance evaluation?

Current BI and DS vendors will play a key role in expediting this confluence as new versions of their platforms combine BI and big data capabilities. Anyone who’s coded MapReduce in Java understands the productivity and maintenance benefits of using a higher-order language to program big data jobs in Hadoop. Hive and Pig are already being used in that capacity. Now, BI ETL software such as Pentaho Data Integration (PDI) is making MapReduce programming even more accessible to developers. Pentaho Business Analytics with Hadoop will promote “Integration of big data tasks into the overall IT/ETL/BI solutions.”
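To make the productivity point a bit more concrete, here is a minimal sketch, in plain Python rather than actual Hadoop, Hive, Pig or PDI code, of the map, shuffle and reduce steps a developer has to spell out for something as simple as a word count. The function names and the sample records are my own illustrative assumptions and come from no vendor’s API; a real Java MapReduce job adds job configuration and serialization plumbing on top of this logic.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Chain the three steps into one illustrative in-memory 'job'."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file in HDFS.
    sample = ["big data meets BI", "BI meets data science"]
    print(word_count(sample))
```

A higher-level tool expresses the same intent as a grouped count and leaves the three steps to the platform, which is roughly the kind of saving I have in mind.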

Commercial R purveyor Revolution Analytics has jumped into the big analytics fray head first with enhanced support for large data and distributed computing, as well as integrations with both Netezza and Hadoop. The company recently announced a partnership with Apache Hadoop provider Cloudera to introduce “RevoConnectR for Apache Hadoop, a collection of open-source packages that allows R programmers to access Hadoop HDFS and HBASE data stores in Apache Hadoop directly from R and write MapReduce jobs with R.”

And let’s not overlook cagey Oracle, which just announced Oracle R Enterprise, its own integration of RDBMS and R statistics. The software will allow “analysts and statisticians to run existing R applications and use the R client directly against data stored in Oracle Database 11g – vastly increasing scalability, performance and security.”

Look for 2012 to be the year that BI and DS start to get on the same page. That’s great news for both business intelligence and data science.

Read at source: Information Management

Mark Dexter

January 17th, 2012