Wednesday 20 August 2014

Stages in the Analytics/Data Science Process

This blog post is about the Data Science process that I went through for the third and final facilitation workshop conducted at IDA for the IDA Data Science MOOC initiative.

Setting Business Objectives/Questions

The value of Data Science comes from the business questions first. It is therefore important that companies choose the business questions that, in their current circumstances, provide the highest value; value need not be monetary alone, but could also be an advantage over competitors. Having set the right business question to answer, the data scientist then needs to translate it into a modelling question and know what kind of mathematical model(s) to use.

Collecting & Preparing Data

With an understanding of both the business question and the modelling question at hand, the data scientist can proceed to collect the data needed. Questions such as how many months of data to collect and which variables to collect would be asked and answered.

After the data is collected, the next step is to prepare it for modelling, because each type of modelling requires a different structure of data.
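As a rough illustration, here is a minimal sketch of what this restructuring can look like, assuming pandas and a small hypothetical table of raw transactions that gets reshaped into one row per customer, the kind of structure a typical model would expect.

```python
import pandas as pd

# Hypothetical raw transaction data: one row per transaction.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "month":       ["2014-01", "2014-02", "2014-01", "2014-01", "2014-02", "2014-02"],
    "amount":      [120.0, 80.0, 30.0, 45.0, 60.0, 200.0],
})

# Reshape into one row per customer, the structure a typical
# regression or classification model would expect.
features = (
    transactions
    .groupby("customer_id")
    .agg(total_spend=("amount", "sum"),
         n_transactions=("amount", "count"),
         avg_spend=("amount", "mean"))
    .reset_index()
)
print(features)
```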

Exploratory Data Analysis (EDA)

The aim here is to get familiar with the data and understand the patterns in it. At this stage we usually do missing data analysis, correlations, distribution analysis, scatterplots, frequency analysis and so on. Through the EDA, we also look out for data errors. For instance, if we know that the valid values of Gender are "M" and "F" but we also see "f" and "m", there may be errors in the way the gender data is captured, and if so it should be flagged.
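As an example of what this can look like in practice, here is a minimal sketch, assuming pandas and a small hypothetical customer table, covering missing data analysis, frequency analysis and the flagging of unexpected gender values.

```python
import pandas as pd

# Hypothetical customer data with a deliberately inconsistent Gender column.
df = pd.DataFrame({
    "age":    [25, 31, None, 40, 29],
    "gender": ["M", "f", "F", "m", "M"],
    "income": [3200, 4100, 3800, None, 5200],
})

# Missing data analysis: proportion of missing values per column.
print(df.isna().mean())

# Frequency analysis: spot unexpected categories such as "m" and "f".
print(df["gender"].value_counts())

# Flag rows whose gender value is outside the expected set {"M", "F"}.
expected = {"M", "F"}
flagged = df[~df["gender"].isin(expected)]
print(flagged)
```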

The EDA results would form the basis for the modelling part later. Through the EDA we can also find out the potential pitfalls we might face once we start working on the mathematical models.

Building Mathematical Models

At this point, we would try to build a model that can be implemented and has the highest possible predictive power (if we are building a statistical model). Usually a few models are built with varying numbers of independent variables. We would also check the model results against the EDA to make sure the models make sense.
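A minimal sketch of this idea, assuming scikit-learn, synthetic data and AUC as the measure of predictive power, could look like the following: candidate models with an increasing number of independent variables compared via cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 200 customers, 4 candidate independent variables,
# and a binary outcome driven mainly by the first two variables.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Build a few candidate models with an increasing number of variables
# and compare their cross-validated predictive power.
for n_vars in (1, 2, 3, 4):
    model = LogisticRegression()
    scores = cross_val_score(model, X[:, :n_vars], y, cv=5, scoring="roc_auc")
    print(f"{n_vars} variable(s): mean AUC = {scores.mean():.3f}")
```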

Selecting the Mathematical Model for Implementation

With several models built, we would start looking at which models can be implemented and at the pros and cons of each. Predictive power would definitely be one factor for consideration, but other factors include the costs of implementation and of maintenance during deployment.
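One possible, entirely illustrative way to weigh these factors is a simple scorecard; the candidate figures and the cost weight below are assumptions, not recommendations.

```python
# Hypothetical scorecard: AUC from the modelling step plus rough,
# assumed costs of implementing and maintaining each candidate.
candidates = [
    {"name": "2-variable logistic regression", "auc": 0.86, "cost": 1},
    {"name": "4-variable logistic regression", "auc": 0.88, "cost": 2},
    {"name": "Gradient boosted trees",         "auc": 0.90, "cost": 4},
]

# One simple (and assumption-driven) trade-off between
# predictive power and deployment cost.
def score(c, cost_weight=0.02):
    return c["auc"] - cost_weight * c["cost"]

best = max(candidates, key=score)
print("Selected model:", best["name"])
```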

Deployment of Models

At this stage, the different tests needed to implement the models are prepared. Tests such as the System Integration Test (SIT) and User Acceptance Test (UAT) would need to be done to ensure the smooth deployment of the models. Setting up the test cases is important here so that the many different types of scenarios are tested and shown to work.
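As an illustration of such scenario tests, here is a minimal, hypothetical sketch using Python's unittest, with a placeholder scoring function standing in for the real deployed model.

```python
import unittest

def score_customer(record):
    """Hypothetical deployed scoring function: returns a probability in [0, 1]."""
    # Placeholder logic standing in for the real model call.
    age = record.get("age")
    if age is None:
        return 0.5          # assumed fallback for missing inputs
    return min(max(age / 100.0, 0.0), 1.0)

class DeploymentTestCases(unittest.TestCase):
    """UAT-style scenario tests run before the model goes live."""

    def test_typical_customer(self):
        self.assertTrue(0.0 <= score_customer({"age": 35}) <= 1.0)

    def test_missing_age_uses_fallback(self):
        self.assertEqual(score_customer({"age": None}), 0.5)

    def test_out_of_range_age_is_clipped(self):
        self.assertLessEqual(score_customer({"age": 250}), 1.0)

if __name__ == "__main__":
    unittest.main()
```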

Continuous Model Validation

As most models suffer from model decay due to changes in the environment, there is a constant need to ensure that the model is still functioning well (i.e. maintaining an acceptable level of predictive power). Thus models are validated at a sufficient frequency, and when a model consistently falls below the accepted level of predictive power, processes will have to be activated to either re-calibrate the model or re-build it with more current data.
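A minimal sketch of such a validation check, assuming made-up monthly AUC figures and an agreed threshold, might look like this.

```python
# Hypothetical monthly AUC measurements for a deployed model.
monthly_auc = [0.86, 0.85, 0.84, 0.79, 0.78, 0.77]

THRESHOLD = 0.80            # assumed minimum acceptable predictive power
CONSECUTIVE_BREACHES = 3    # how many bad months before acting

breaches = 0
for month, auc in enumerate(monthly_auc, start=1):
    breaches = breaches + 1 if auc < THRESHOLD else 0
    if breaches >= CONSECUTIVE_BREACHES:
        print(f"Month {month}: AUC below {THRESHOLD} for "
              f"{CONSECUTIVE_BREACHES} months - trigger re-calibration/rebuild")
        break
else:
    print("Model still within the accepted level of predictive power")
```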

Conclusion

Throughout the whole process of answering data science questions, the quality of the data is a constant question that needs to be answered, so as to ensure that the model built and/or implemented is of value. Thus, to have a good start in data science, data management is of utmost importance, and a suitable data management strategy needs to be devised to ensure the data is of the highest possible quality.

The slides I have shared are here.
