Data Science Skepticism

I don’t think you’d get much argument from the data science community that the emerging field involves components of business, technology and statistical science. “Veteran” DS’ers will also note both inquisitive and skeptical dispositions as keys to success in the discipline.

LinkedIn’s Monica Rogati observes that data scientists are at the intersection of Columbus and Columbo – “starry eyed explorers and skeptical detectives.” Amazon’s John Rauser opines “A healthy dose of skepticism comprises the fourth dimension of the data scientist. If you have a healthy skepticism, you will look as hard for evidence that refutes your thesis as you will for evidence that confirms it.”

In a terrific article “Top Holiday Gifts For Data Scientists,” Cloudera co-founder and chief scientist Jeff Hammerbacher recommends a multitude of books, websites and software tools for the budding data scientist. Among his choices are the texts “Statistics as Principled Argument” and “Bias and Causation,” both of which encourage healthy skepticism in interpreting relationships from observational or non-experimental investigations. The latter details a “taxonomy of bias and its potential sources. It is a must read and constant reference for those designing survey studies and a reminder of cautions for those who must contend with study results and conclusions.”

Why the obsession with bias? Because data scientists generally work with messy observational data from which it can be difficult to prove that factor A caused outcome B. Does a high correlation between A and B indicate that A caused B? Or maybe that both A and B are caused by a confounding factor C? Or perhaps that A and B are spuriously related? In the absence of random sampling or random assignment to experimental groups, these questions can be nearly impossible to answer with certainty – hence the skepticism of good data scientists.
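To make the confounding scenario concrete, here is a minimal simulation sketch. All of it is an illustrative assumption rather than anything drawn from the studies discussed: a hidden factor C drives both A and B, A has no effect on B at all, and the noise is standard normal. The two variables still come out clearly correlated, and the association largely disappears once C is held fixed via a partial correlation.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data-generating process: C causes A and C causes B,
# but A never influences B.
rng = random.Random(0)
n = 20000
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

r_ab = pearson(a, b)
r_ab_given_c = partial_corr(a, b, c)

print(f"correlation of A and B:         {r_ab:.3f}")
print(f"partial correlation given C:    {r_ab_given_c:.3f}")
```

The catch, of course, is that in real observational data you rarely get to measure C, or even know it exists.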

I often put on my cynic’s hat when I review the results and interpretations of surveys in BI/analytics. And so it was with the recently published “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field,” a survey of business intelligence and data science professionals conducted by EMC. It’s not that I think the DSR study was poorly done; rather, I believe there are significant weaknesses in an online survey methodology that might bias the findings. Are the survey findings valid?

The first question a skeptical data scientist would ask is how representative the DSR data is of the population of BI and DS professionals it purports to describe. I’d love, for example, to know the demographics of the DSR sample. If it includes many more data scientists than business intelligence practitioners even though BI professionals currently dwarf data scientists in the work world, does that introduce bias?
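A small arithmetic sketch shows why sample composition matters for any pooled figure. Every number below is hypothetical, chosen only for illustration: I assume a trait held by 5% of BI professionals and 20% of data scientists, a work world that is 90% BI, and a sample that is 60% data scientists. None of these shares comes from the DSR study.

```python
def pooled_rate(group_shares, group_rates):
    """Overall rate of a trait given each group's share and within-group rate."""
    return sum(group_shares[g] * group_rates[g] for g in group_shares)


# Hypothetical within-group rates of some trait (say, using a given tool).
rates = {"BI": 0.05, "DS": 0.20}

# Hypothetical makeup of the real population versus the survey sample.
population_shares = {"BI": 0.90, "DS": 0.10}
sample_shares = {"BI": 0.40, "DS": 0.60}

population_rate = pooled_rate(population_shares, rates)
sample_rate = pooled_rate(sample_shares, rates)

print(f"rate in the population:      {population_rate:.3f}")
print(f"rate suggested by the sample: {sample_rate:.3f}")
```

Under these assumptions the sample more than doubles the apparent prevalence of the trait. Comparisons made within each group separately are not distorted by this particular mechanism, though they remain exposed to who chose to respond within each group.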

Is the sample of 497 respondents large enough to detect small percentages? Does the fact that respondents choose whether or not to participate in any way bias the results? Might it be the case, for example, that those who consider themselves data scientists were more likely to complete the survey than those who identified as BI professionals? And could it be that the sexier title of data scientist is now the self-reported professional designation of choice for many BI professionals – regardless of the work they do? The data scientist must ask questions like these.
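On the sample size question, a rough margin-of-error calculation is a useful first check. The sketch below uses the textbook normal approximation for a proportion and assumes a simple random sample, which a self-selected online survey is not, so treat the results as a best case. The example proportions are my own illustrative choices.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a
    simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


n_respondents = 497  # sample size reported for the survey

# Illustrative proportions: an even split and a small percentage.
for p in (0.50, 0.05):
    moe = margin_of_error(p, n_respondents)
    print(f"p = {p:.2f}: +/- {moe * 100:.1f} percentage points")
```

For a small percentage such as 5%, an uncertainty of roughly two points either way is large relative to the estimate itself. The margin widens further for any subgroup, since a BI-only or DS-only breakdown uses fewer than 497 respondents.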

I agree with DSR’s declaration that data science is a young field, much as business intelligence was 20 years ago. About BI, the study notes: “As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition.”

This disparity in maturity levels may explain some of DSR’s findings. As an illustration, the observation “that data science professionals were over 2.5 times more likely to have a master’s degree, and over 9 times more likely to have a doctoral degree as business intelligence professionals” is probably an artifact of the relative maturity and size of BI in contrast to DS. Think back 20 years, when BI was in its infancy. There was then a high percentage of advanced degrees among the small population of BI professionals as well. Recall the seminal work of Bill Inmon, Ralph Kimball and Claudia Imhoff – Ph.D.s all.

I don’t buy DSR’s assertion that “the data science toolkit is more varied and more technically sophisticated than the BI toolkit. While most BI professionals do their analysis and data processing in Excel, data science professionals are using SQL, advanced statistical packages, and NoSQL databases.” Huh? Excel as the primary BI data processing tool? SQL for DS but not BI? Not.

And don’t tell Tableau founder Pat Hanrahan that while “advanced visualization tools like Tableau are just starting to emerge in the data science world, they are almost unseen in the business intelligence world.” On the contrary, Tableau and kin Spotfire, Omniscope and QlikView are now inundating self-service BI, as Tableau’s startup screen greeting “Fast analytics and rapid-fire business intelligence” attests.

That BI is more mature than DS probably suggests that BI professionals are, on the average, older than their DS counterparts, many of whom started their data science careers just out of school. That could explain why “Open Source tools, like the R statistics package, Python, and Perl, are each used by one in five data science professionals, but around one in twenty BI professionals.” R, Python and Perl are languages many DS’ers learned in graduate school and brought with them to the work world. And while I’m a big fan of all three, I find it curious that the Data Management tool section doesn’t include ETL stalwarts Informatica, DataStage, and Pentaho PDI. I don’t think I’d choose to use Perl for a big data integration initiative in 2012.

While some of the findings of Data Science Revealed contrasting DS and BI give me heartburn, I’m pretty much in agreement with the survey’s organizational implications. The admonition that DS professionals must be built rather than bought is spot on. Companies should “find practitioners with the intellectual curiosity and technical depth to solve big data problems, with academic concentrations in the hard sciences, statistics, and mathematics … Rather than hiring for experience with a certain toolkit, companies should invest in on-the-job training with their chosen set of emerging technologies.” This aligns with OpenBI’s strategy of hiring scientifically-inclined graduates for BI consulting, most of whom are not CS majors.

As DS matures, look for additional division of labor in the discipline, with sub-specialties evolving in the science of business, big data integration, statistical learning, visualization, user experience and so on. “Once companies have brought in the right talent, they need to create an environment conducive to effective data science. That means building high-performing, cross-functional teams that include a variety of roles, including programmers, statisticians, and graphic designers, and aligning them to directly support interested business decision makers.”

Source: Steve Miller – Information Management

Chris Bongard

January 4th, 2012
