Wednesday 13 August 2014

Data Science Venn Diagram (I)

While facilitating this week's session, I went through Drew Conway's Data Science Venn Diagram. He shared the diagram on his blog, along with his own explanation of it.

As I said during the session, my version of the Venn diagram would differ from his. Here is how I would label my circles.

1) Data & IT management
2) Mathematical Models
3) Domain Expertise (Business)

Data & IT Management

Many times, a data scientist building models faces at least two major constraints. The first is the IT infrastructure, both hardware and software. The second is the business environment the company operates in, for example whether it is subject to many regulations or business policies. It is therefore important for the data scientist to understand computer science: which tools, both hardware and software, are available to implement the model and to capture the 'right' data.

Because data error investigation and rectification fall on the data scientist's shoulders, the data scientist needs to know how to achieve good-quality data. Beyond that, the data scientist also needs to suggest ways of getting good-quality data and to recommend potentially useful data the company could use. Making such recommendations requires familiarity with IT and the relevant technology.

Programming lies in this circle as well, since we are talking about Data & IT management in general. Often the data scientist has to prepare the data so that it conforms to the structure a particular mathematical model needs. The data scientist therefore needs to know how to code in order to get the data into the right structure for modelling.
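To make the data preparation point concrete, here is a minimal sketch in Python of reshaping "long" records (one row per customer and month) into the "wide" layout many models expect (one row per customer). The records, the field names and the helper `to_wide` are all hypothetical, made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical raw data: one record per (customer, month).
long_rows = [
    {"customer": "A", "month": "Jan", "spend": 120.0},
    {"customer": "A", "month": "Feb", "spend": 80.0},
    {"customer": "B", "month": "Jan", "spend": 45.0},
    # Customer B has no February record.
]

def to_wide(rows, key, column, value, fill=0.0):
    """Pivot long records into one row per key, one field per column value."""
    columns = sorted({r[column] for r in rows})
    # Every key starts with all columns set to the fill value.
    table = defaultdict(lambda: {c: fill for c in columns})
    for r in rows:
        table[r[key]][r[column]] += r[value]
    return columns, dict(table)

columns, wide = to_wide(long_rows, "customer", "month", "spend")
for customer in sorted(wide):
    print(customer, [wide[customer][m] for m in columns])
```

Note that filling the missing month with 0.0 is itself an assumption: whether a missing record means "no spend" or "unknown" is a data quality question that should be settled before modelling.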

Mathematical Models

A data scientist definitely needs to know the most commonly used statistical and operations research models: for instance, how a coefficient is calculated, how cluster analysis is done and so on. With this knowledge, the data scientist knows the assumptions of each model, can check whether any assumption is violated before proceeding to modelling and, if one is violated, knows what can be done to rectify it. Understanding the mathematics also lets the data scientist see the flaws of model evaluation metrics (such as R-squared and misclassification error). One cannot run away from the maths running behind each model. I strongly encourage those embarking on the Data Science journey to strengthen their basic linear algebra, calculus and statistics, and to work through the mathematics behind these models, so that they understand the pitfalls and advantages of each model.
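As one illustration of how an evaluation metric can mislead, the sketch below uses made-up labels where only 5 of 100 cases are positive. A "model" that always predicts the negative class looks good on misclassification error yet finds none of the positive cases. The data and function names are hypothetical, chosen only for this example.

```python
def misclassification_error(actual, predicted):
    """Share of cases where the prediction differs from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def recall(actual, predicted, positive=1):
    """Share of actual positive cases that were predicted as positive."""
    preds_for_positives = [p for a, p in zip(actual, predicted) if a == positive]
    hits = sum(1 for p in preds_for_positives if p == positive)
    return hits / len(preds_for_positives)

# Hypothetical imbalanced data: 5 positives, 95 negatives.
actual = [1] * 5 + [0] * 95
# A trivial "model" that always predicts the majority (negative) class.
always_negative = [0] * 100

error = misclassification_error(actual, always_negative)
hit_rate = recall(actual, always_negative)
print(error, hit_rate)  # 0.05 0.0
```

A 5% error rate sounds impressive until one notices that the model never identifies a single positive case, which is why the metric has to be read together with the class balance.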

Domain Expertise (Business)

As mentioned in my previous facilitation, domain expertise can be split into two parts. One is general business knowledge, such as accounting rules, strategic management models and so on. The other is business knowledge that is specific to an industry or company. General business knowledge can be picked up in any MBA class, but industry- or company-specific knowledge requires the data scientist to be in the field and learn from it.

Business domain knowledge can be another constraint on the mathematical models built. Besides the IT systems, regulations and business policies add another layer of constraint on the models. For instance, a logistics firm cannot increase its fleet of lorries indiscriminately just to reach the required sales revenue. Another example is that race cannot be used as a variable in a statistical model, because that can amount to a form of racial discrimination.
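The lorry example can be sketched as a simple capped calculation. All the numbers, and the function `plan_fleet`, are hypothetical; a real model would also account for costs, utilisation and demand.

```python
import math

def plan_fleet(target_revenue, revenue_per_lorry, max_fleet):
    """Size the fleet for a revenue target, capped by a business limit."""
    # Unconstrained answer: lorries needed if the fleet could grow freely.
    needed = math.ceil(target_revenue / revenue_per_lorry)
    # Business constraint: the fleet cannot exceed max_fleet.
    fleet = min(needed, max_fleet)
    return {
        "needed": needed,
        "fleet": fleet,
        "revenue": fleet * revenue_per_lorry,
        "meets_target": needed <= max_fleet,
    }

# Hypothetical figures: 1,000,000 target, 40,000 revenue per lorry, cap of 20.
plan = plan_fleet(1_000_000, 40_000, 20)
print(plan)
```

Here the unconstrained model asks for 25 lorries, but the cap of 20 means the target cannot be met by fleet size alone, so the model has to look for another lever.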

In conclusion, data science is found in the intersection of these three circles, and a good data scientist needs knowledge in all three areas. It can be overwhelming at first, but do start somewhere. Go with your interest, since people learn much faster when they are passionate about a subject. In the end, though, try to gain knowledge in all three areas, with the emphasis depending on your job role.

Good luck and have fun while learning more about Data Science!

Drew Conway's Data Science Venn Diagram

Slides I shared at the second facilitation workshop
