Wednesday, 27 August 2014

Data Science Venn Diagram - From Good to Great

During one of the facilitation sessions for the IDA Data Science MOOC Initiative, I came across Drew Conway's Venn diagram on the meaning of Data Science. I have a different version, adapted from his, which I have written about in my blog post here.

After much thought, and after looking at a recent post from Andrew Ng about setting up his Data Science team at Baidu Research, I asked myself how a good data scientist can move to being a great data scientist. One thing that came to mind is that anyone working in Data Science cannot run away from being part of a team. They have to learn how to be a good leader and also what is required to be a good follower. To be both a good leader and a good follower, one needs soft skills, or what are commonly known as people skills. Everyone plays a part in making the team effective and efficient: the leader cannot be effective without the cooperation of the team members, and the team members cannot be effective without the leader to give direction, keep the focus, manage the timelines and motivate the team.

Besides that, to be a great data scientist, in my opinion, communication skills are very important. Being able to communicate the relevant insights in a manner that is digestible by management requires much thought. The great data scientist needs to learn how people learn, what kind of communication medium is effective in bringing across the messages/insights, and even how to convince/influence management to take the necessary actions.

So to be a good data scientist, I would recommend that the person be trained and have knowledge in the following:

1) Data & IT management
2) Mathematical Modelling
3) Domain Expertise (Business)

And to move from being a good to a great data scientist, the fourth skill that is needed would be

4) People skills - Team Management, Communication Skills & Empathy

To conclude, the data scientist should not only have the 'hard' skills, the knowledge that can be picked up through books and other mediums, but also the 'soft' skills that can only be picked up through experience. The best thing is that both can be picked up in parallel. So go forth and pick them up!

Wednesday, 20 August 2014

Stages in the Analytics/Data Science Process

This blog post is about the Data Science Process, which I went through at the 3rd and last facilitation workshop done at IDA for the IDA Data Science MOOC initiative.

Setting Business Objectives/Questions

The value of Data Science comes from the business questions first. Thus it is important that companies choose the business questions that, in the current circumstances, provide the highest value; value need not be monetary alone but could be an advantage over competitors. After setting the right business question to answer, the data scientist then needs to know how to convert the question into one of modelling, i.e. what kind of mathematical model(s) to use.

Collecting & Preparing Data

With an understanding of both the business question and the modelling question at hand, the data scientist can proceed to collect the data needed. Questions like how many months of data to collect and which variables to collect would be asked and answered.

After the collection of data, the next step is to start preparing the data for modelling. This is because each type of modelling requires a different structure of data.

Exploratory Data Analysis (EDA)

This is to get familiar with the data and understand the patterns in it. At this stage we usually do missing data analysis, correlations, distribution analysis, scatterplots, frequency analysis and so on. Through the EDA, we also look out for data errors. For instance, if we know that the values of Gender are "M" and "F" but we see the values "f" and "m" as well, there might be some errors in the way the gender data is captured, and if that is the case, it should be flagged.

The EDA results form the basis for the modelling part later. Through the EDA we can find out the potential pitfalls we may hit when we start working on the mathematical models.
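As a minimal sketch of such checks in pandas (the column names and data here are hypothetical, just for illustration), the frequency, missing-data and data-error analysis might look like:

```python
import pandas as pd

# Hypothetical customer data with an inconsistently coded Gender column
df = pd.DataFrame({
    "Gender": ["M", "F", "m", "F", "f", "M"],
    "Income": [4200, 3800, None, 5100, 2900, 6100],
})

# Frequency analysis: unexpected levels like "m" and "f" show up here
print(df["Gender"].value_counts())

# Missing data analysis: count of nulls per column
print(df.isnull().sum())

# Flag rows whose Gender value falls outside the expected set
expected = {"M", "F"}
flagged = df[~df["Gender"].isin(expected)]
print(flagged)
```

A simple `value_counts()` per categorical column already surfaces most coding inconsistencies before any modelling starts.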

Building Mathematical Models

At this point, we would try to build the model that can be implemented and has the highest possible predictive power (if we are building a statistical model). Usually a few models are built with varying numbers of independent variables. We would also check the results from the models against those of the EDA to make sure the models make sense.

Selecting the Mathematical Model for Implementation

With several models built, we would start looking at which models can be implemented and what the pros and cons of each model are. Predictive power would definitely be one factor for consideration, but other factors include the costs of implementation and of maintenance during deployment.

Deployment of Models

At this stage, there is the preparation of the different tests needed to implement the models. Tests such as the System Integration Test (SIT) and User Acceptance Test (UAT) would need to be done to ensure the smooth deployment of the models. Setting up the test cases is important here so that the many different types of scenarios are tested and working fine.

Continuous Model Validation

As most models suffer from model decay due to environment changes, there is a constant need to ensure that the model is functioning well (i.e. maintains an acceptable level of predictive power). Thus models are validated frequently enough, and when a model consistently falls below the accepted level of predictive power, processes have to be activated to either re-calibrate the model or re-build it with more current data.
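The monitoring logic behind continuous validation can be sketched as follows (the threshold, weekly scores and patience window here are made up purely for illustration):

```python
# Hypothetical weekly accuracy scores for a deployed model
weekly_accuracy = [0.88, 0.86, 0.85, 0.79, 0.78, 0.77, 0.76]

threshold = 0.80   # accepted level of predictive power
patience = 3       # consecutive breaches tolerated before acting

breaches = 0
action = None
for week, acc in enumerate(weekly_accuracy, start=1):
    # Reset the counter whenever performance recovers above the threshold
    breaches = breaches + 1 if acc < threshold else 0
    if breaches >= patience:
        action = f"week {week}: model consistently below {threshold}, trigger re-calibration"
        break

print(action)
```

Requiring several consecutive breaches rather than a single bad week avoids re-building the model on noise alone.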

Conclusion

Throughout the whole process of answering data science questions, the quality of the data is a constant question that needs to be answered, so as to ensure that the model built and/or implemented is of value. Thus, to have a good start in data science, data management is of utmost importance, and a suitable data management strategy needs to be devised to ensure the data is of the highest possible quality.

The slides I have shared are here.

Wednesday, 13 August 2014

Data Science Venn Diagram (I)

During the facilitation for this week, I went through the Venn diagram on Data Science by Drew Conway. The Venn diagram was shared on his blog, and he even has his own explanation of it.

As mentioned, my Venn diagram is different from his. The following is how I label my circles.

1) Data & IT management
2) Mathematical Models
3) Domain Expertise (Business)

Data & IT Management

Many a time, the data scientist, when building models, faces at least two major constraints. The first is the IT infrastructure, both hardware and software; the other is the business environment the company is in, for instance whether there are a lot of regulations or business policies. Thus it is important for the data scientist to have an understanding of computer science: the potential tools, both hardware and software, to implement the model, and also how to capture the 'right' data.

As data error investigation and rectification fall on the shoulders of the data scientist, the data scientist needs to know how to achieve good data quality. Not only that, the data scientist also needs to make suggestions on getting good quality data and to recommend potentially useful data that the company can use; in order to make such recommendations, the data scientist needs to be familiar with IT and the relevant technology.

Of course the programming part lies in this circle as well, since we are talking about Data & IT management in general here. Many a time, the data scientist has to prepare the data to conform to the structure that is needed for a particular mathematical model. As such, the data scientist needs to know how to code so as to get the data into the right structure for mathematical modelling purposes.
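A small hypothetical example of that kind of restructuring: sales captured in wide format (one column per month) often has to be reshaped into long format (one row per observation) before modelling, which in pandas is a one-liner.

```python
import pandas as pd

# Hypothetical wide-format data: one column per month
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [100, 80],
    "feb": [120, 90],
})

# Many models expect long format: one row per store-month observation
long = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long)
```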

Mathematical Models

A data scientist would definitely need to know the most commonly used statistical/operations research models: for instance, understand how a regression coefficient is calculated, how cluster analysis is done and so on. Through such knowledge, the data scientist knows the assumptions of each model and can check whether any assumption is broken before proceeding to modelling, and, if the assumptions are broken, what can be done to rectify it. Also, by understanding the mathematics behind it, the data scientist understands the flaws of the model evaluation metrics (such as R-squared and misclassification error). Thus one cannot run away from the maths that is running behind each model. Those embarking on the Data Science journey are strongly encouraged to strengthen their basic linear algebra, calculus and statistics and to go through the mathematics behind these models, so that they understand the pitfalls and advantages of using each model.
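One such flaw can be checked directly: in-sample R-squared never decreases when an independent variable is added, even one that is pure noise. A small sketch with synthetic data (not any real dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
noise_var = rng.normal(size=n)        # unrelated to y by construction
y = 1.5 * x + rng.normal(size=n)

def r_squared(columns):
    """In-sample R-squared of an OLS fit on the given columns."""
    A = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

r1 = r_squared([x])
r2 = r_squared([x, noise_var])
print(f"R^2 with x only:      {r1:.4f}")
print(f"R^2 with noise added: {r2:.4f}")  # never lower, despite no real signal
```

Knowing the mathematics behind the metric is what tells you that a higher R-squared does not by itself mean a better model.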

Domain Expertise (Business)

As mentioned in my previous facilitation, domain expertise can be split into two parts. One is general business knowledge, such as accounting rules, strategy management models and so on; the other is industry/company-specific business knowledge. The general business knowledge could be picked up in any MBA class, but the industry/company-specific knowledge requires the data scientist to be in the field to learn it.

Business domain knowledge can be another constraint on the mathematical models built. Besides the IT systems that constrain the models, regulations and business policies can add another layer of constraint. For instance, a logistics firm cannot increase its fleet of lorries indiscriminately so as to fulfil the required sales revenue. Another example is that race cannot be used in a statistical model, because that can amount to a form of racial discrimination.

In conclusion, data science is found in the intersection of these three circles, and to be a good data scientist, he/she has to have knowledge in all three areas. It could be overwhelming at first, but do start somewhere. Go with your interest, as one learns much faster when passionate about something, but in the end try to gain knowledge from all three areas, with different emphasis depending on your job role.

Good luck and have fun while learning more about Data Science!

Drew Conway's Data Science Venn Diagram

Slides I shared at the second facilitation workshop

Thursday, 7 August 2014

Resources on Data Visualization (I)

Yesterday, during the facilitation of the inaugural workshop for IDA's Data Science MOOC, I mentioned Data Visualization and how important it is to the work of the data scientist. So in this blog post, I will put down more of my thoughts on it and also some of the very popular resources on Data Visualization.


What is my definition of data visualization? My definition is: using visual aids such as graphs and maps to make important actionable insights easily comprehended by the target audience. In this day and age, target audiences have very short attention spans, and being able to squeeze so much information into such a short span gets ever more challenging. Thus it is important that data scientists understand the importance of visualization and also know the pros and cons of each visual aid.

Many a time, data visualization is sold as visual analytics, but the two should be totally different in my opinion. Data visualization is much closer to descriptive statistics (what has happened), whereas if you look at the definition of 'analytics' alone, besides the descriptive it should also contain a predictive component. So visual analytics should be using visual methods to make certain predictions (at least, that is my opinion of what deserves to be called "Visual Analytics"). While doing research for this blog post, I came across the definition of Visual Analytics on Wikipedia, but alas, it would take me a few hours to digest it. Hopefully someone can go through it and come up with a simpler definition.
  
There are also some cognitive blind spots when we use visual aids to present information, as compared to predictive statistical models. If I plot the outbreak of a disease on a map over time and it seems to move from the east towards the centre, our brain starts to link the points up and extrapolate, concluding that the disease will continue moving through the centre of the map towards the west. Really? Is there data to support that the outbreak will not 'suddenly' appear on the west side of the map and move towards the centre? A statistical model does not suffer from this in the same way: with a statistical model, we know from training data how well the independent variables predict the outcome, rather than relying on a simple 'extrapolation' from visual aids.

Now, I am not saying that Data Visualization is not important. It is definitely important. After generating so many insights from data, the very last step is to share the insights so that buy-in can be gained and actions can be taken to generate value from the Analytics done. If, at this stage, there is no planning of the visual aids to show the actionable insights, it would be like a striker who has the ball in front of a goalkeeper-less goal post and fails to score. ALL efforts wasted.

So what are the resources available to learn more about visualization? Well, there are two masters of visualization, namely Edward Tufte and Stephen Few.

Their websites are as follows:

Their books:
Stephen Few

Edward Tufte

Have fun learning about Data Visualization in your Data Science Journey!

Monday, 7 July 2014

Similarity Measures on Categorical Data Part 1

Last Saturday, I started my research work on Cluster Analysis. The book I was reading discussed a common problem in Cluster Analysis: handling categorical data. The method that I usually adopt is to create a string of dummy variables to represent each level of the categorical variable. For example, if the categorical variable has 4 values, 4 dummy variables would be created.

Apparently, using this method, there is a possibility of introducing spurious 'similarity' into the model: a '0-0' pairing is counted as a match just like a '1-1' pairing. Thus there are other measures for calculating the 'similarity' of categorical data, such as Goodall and Gambaryan, that take into account the frequency and occurrence of the attributes.
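The problem is easy to see on a toy example (hypothetical records, not from any dataset). Two records taking different levels of a 4-level variable share no level at all, yet the simple matching coefficient over the dummy variables still reports them as half similar, because the '0-0' pairs count as matches:

```python
# Two records of a 4-level categorical variable, dummy-coded
a = [1, 0, 0, 0]   # record 1 takes level 1
b = [0, 1, 0, 0]   # record 2 takes level 2

matches    = sum(x == y for x, y in zip(a, b))           # counts 0-0 pairs too
both_ones  = sum(x == 1 and y == 1 for x, y in zip(a, b))
either_one = sum(x == 1 or y == 1 for x, y in zip(a, b))

simple_matching = matches / len(a)                       # 0-0 pairs inflate this
jaccard = both_ones / either_one if either_one else 0.0  # ignores 0-0 pairs

print(simple_matching)  # 0.5 despite the records sharing no level
print(jaccard)          # 0.0, reflecting that nothing is shared
```

Jaccard fixes this particular artefact; the measures mentioned above (Goodall, Gambaryan) go further by weighting matches by how frequent each attribute value is.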

Let's see if I can digest the papers that I have found.

http://www-users.cs.umn.edu/~sboriah/PDFs/BoriahBCK2008.pdf

Saturday, 7 June 2014

Where to start learning Analytics/Data Science?

Welcome to my first post, which is to document what I have learnt about Analytics/Data Science. I am glad that throughout my career, I have been able to work on Analytics. The reason why I love Analytics/Data Science is that data can tell a lot of stories about businesses, or to be more specific, about the organization providing the data. To use and implement Analytics requires all-round skills; to name a few: communication, psychology, cognitive science, mathematics (statistics), IT and business. It develops me all round.

In this blog post, I want to share a very useful resource that I came across for those who are starting out in Analytics. Below is a link where you can download a book that introduces the field of Analytics as "Data Science".

http://jsresearch.net/index.html

This is a very interesting book, as it hits the nail on the head about what Analytics truly is. It is not totally about statistics or operations research in itself. It is not about business either. It is also not about IT management; the field encompasses too many disciplines for any one of them, which is why, in my opinion, no one had written a book on Analytics successfully until I came across this one.

The book is pretty well structured, so I encourage anyone who wants to start learning about Analytics to read it. The other good point about this book is that it introduces the software R, an open-source software that most data scientists, especially academics, use.

I strongly encourage those who want to learn Analytics to read it well and thoroughly.