So says Dhiraj Rajaram, founder and CEO of Mu Sigma, a Chicago-based startup providing analytics (or “decision sciences” as the company calls it) as a service to a large pool of Fortune 500 customers. He’s probably right, and that’s a problem.
Organizations, even large ones, might be masters of the fields in which they do business, but they’re not masters of applied mathematics, which is at the core of the growing data science trend. When it comes time to undertake a big data strategy that requires turning advanced algorithms on potentially massive data sets, many quickly realize they have none, or not nearly enough, of the necessary skills internally. Attempts to hire these skills might prove largely fruitless, as the small population of employees with the requisite acumen in both business and calculus is quickly snatched up by an equally small number of companies.
Analyzing traditional business data held in a data warehouse is one thing, but doing big data and, more specifically, data science is quite another. McKinsey & Co. predicts that by 2018, the United States will have a shortage of “1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions,” and a shortage of almost 200,000 people with the deep analytical skills necessary for data science.
But enough about big business data. In an era of webscale computing and large clusters running big-data workloads, we’ll also need more people who can apply mathematics to data with the goal of automating and troubleshooting distributed systems. Sure, predictive analytics can be great for determining how consumers are likely to react to changes to their favorite products, but it can also help ensure that complex systems such as Google’s run smoothly.
Case in point: Software bugs at Google
On Wednesday, two members of the Google engineering team wrote a blog post explaining their surprisingly simple new algorithm for detecting particularly troublesome code. The problem, as authors Chris Lewis and Rong Ou present it, is that with such a large, growing and increasingly complex code base — and thousands of developers working on it — it becomes nearly impossible for code reviewers to identify “hot spots.”
The authors define hot spots as code that “creates issues again and again, as developers try to wrestle with the problem,” as opposed to code that is necessarily difficult because it carries out a complex function. If it’s the former, reviewers need to be alerted to its status up front so they know to give it their utmost attention, or perhaps even hand it off to someone with more experience.
Based on some research on how to best predict if there are bugs within particular code, the team decided on a simple method for flagging files: “files are flagged if they have attracted a large number of bug-fixing commits, no more and no less.” How the algorithm goes about filtering commits down to only the valid bug fixes is a little more complex, of course. After discussing the results of early experiments with developers, the authors’ team also decided to work in a time variable so that newer bug-fixing commits score higher than old ones that might have already been dealt with.
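As a rough illustration of the idea described above, here is a minimal Python sketch that scores files by their bug-fixing commits and weights recent fixes more heavily. It is a sketch under stated assumptions, not the authors’ implementation: the function names, the input format, the logistic weighting curve and its constant of 12 are all assumptions made for this example, and the step that filters commits down to valid bug fixes is assumed to have happened already.

```python
import math
from collections import defaultdict


def commit_weight(t: float) -> float:
    """Weight for one bug-fixing commit at normalized time t in [0, 1].

    Assumption: a logistic curve, so the newest commits (t near 1)
    count for about 0.5 and the oldest (t near 0) for almost nothing.
    """
    return 1.0 / (1.0 + math.exp(-12.0 * t + 12.0))


def hot_spot_scores(bug_fix_commits, start, end):
    """Sum time-weighted bug-fixing commits per file.

    bug_fix_commits: iterable of (path, timestamp) pairs, already
    filtered to bug fixes (hypothetical input format).
    start, end: timestamps bounding the history being scored.
    """
    span = end - start
    if span <= 0:
        raise ValueError("end must be later than start")
    scores = defaultdict(float)
    for path, timestamp in bug_fix_commits:
        t = (timestamp - start) / span  # normalize to [0, 1]
        scores[path] += commit_weight(t)
    return dict(scores)


def flag_hot_spots(scores, top_n=10):
    """Return the top_n file paths, highest score first."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

With this weighting, a file with two very recent bug fixes can outrank a file with three old ones, which matches the stated goal of scoring newer bug-fixing commits above older ones that might already have been dealt with.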
Their algorithm looks like this: [image of the scoring formula not reproduced]
Here’s what it looks like plotted: [image of the plot not reproduced]
It’s relatively simple, but it’s not as if one can always just choose the simplest possible algorithm and run with it. As author Chris Lewis noted in the Hacker News thread on his and Ou’s post, this algorithm came to be only after much experimentation with far more complex algorithms to solve the same problem. He had “spent a lot of time trying to implement FixCache, a pretty complicated algorithm that looks at things such as spacial locality of bugs (bugs appear close to each other), temporal locality of bugs (bugs are introduced at the same time) and files that change together (bugs will cross cut these)” before coming across the research that led to the ultimate strategy.
In another comment, Lewis suggested the future might involve some much more difficult concepts, such as machine learning. “[W]e don’t have good tools (yet) to have a computer properly check the semantic meaning of our code,” he wrote. “Bug prediction sits as a sort of baby step. It’s the computer making a best-effort guess of where issues will be.”
We’re just getting started
Companies such as Google and Facebook are doing all right in solving some of their problems, but what they’re doing now is just the tip of the iceberg. As Lewis indicated, there’s real value in evolving their current efforts further, to the point where machine learning and other techniques will let computers do everything from reviewing code to, perhaps, predicting problems with overall system health. And as the next generation of web companies start scaling up, they’ll run into their own unique systems issues that they’ll have to solve.
Data science as it relates to business decisions is an obviously valuable area, and all the talk about big data probably ensures a fair investment in learning those skills. Organizations without internal skills can always outsource data science to companies such as Mu Sigma and Opera Solutions that exist to provide just such services. New, higher-level software products from startups such as Odiago, Platfora and others promise to alleviate some business-oriented analytic pain, as well.
But applying data science to data about software code or webscale system activity doesn’t always have a direct connection to income, which means it doesn’t get talked about as much. Those skills, however, are arguably as important to our growing Internet economy as big business data is to companies of all types. Hopefully, the message gets out and teenagers start to realize that if they want high-paying jobs with the coolest companies around, they’d better get a lot more interested in math.
If big data does indeed write the book about the future of business, Mu Sigma’s Rajaram says the climax will be “that mathematicians take the prom queen home.”
Feature image courtesy of Flickr user jenny8lee; The Simpsons image courtesy of FOX.