Analyze
- Stage 3: Analyze - To Explore Data
Once your data is prepared, and before you jump into AI and machine learning, you will have to examine the data. This is the Analyze stage. It involves selecting the analytical techniques to use, building a model of the data, and analyzing the results. This stage can take a couple of iterations on its own, or it might require data scientists to go back to stages one and two to get more data or package the data in a different way.
- Data Analysis and Analysis Techniques
Data analysis involves building a model from your data. The data you give to the analysis technique is called the input data, and the technique uses it to build the model. What your model generates is the output data. There are different types of problems, and so there are different types of analysis techniques. The main categories of analysis techniques are classification, regression, clustering, association analysis, and graph analysis.
- Classification. In classification, the goal is to predict the category of the input data. An example of this is predicting the weather as being sunny, rainy, windy, or cloudy. Another example is to classify a tumor as either benign or malignant. In this case, the classification is referred to as binary classification, since there are only two categories. But you can have many categories as well, as in the weather prediction problem, which has four categories. Another example is to identify handwritten digits as being in one of the ten categories from zero to nine.
- Regression. When your model has to predict a numeric value instead of a category, then the task becomes a regression problem. An example of regression is to predict the price of a stock. The stock price is a numeric value, not a category. So this is a regression task instead of a classification task. Other examples of regression are estimating the weekly sales of a new product and predicting the score on a test.
- Clustering. In clustering, the goal is to organize similar items into groups. An example is grouping a company's customer base into distinct segments, such as seniors, adults, and teenagers, for more effective targeted marketing. Another example is identifying areas of similar topography, such as mountains, deserts, and plains, for land-use applications. Yet another example is determining different groups of weather patterns, such as rainy, cold, or snowy.
- Association Analysis. The goal in association analysis is to come up with a set of rules that capture associations between items or events. The rules are used to determine when items or events occur together. A common application of association analysis is known as market basket analysis, which is used to understand customer purchasing behavior. For example, association analysis can reveal that banking customers who have certificate of deposit accounts (CDs) also tend to be interested in other investment vehicles, such as money market accounts. This information can be used for cross-selling. If you advertise money market accounts to your customers with CDs, they are more likely than other customers to open such an account.
- Graph Analysis. When your data can be transformed into a graph representation with nodes and links, you want to use graph analytics to analyze your data. This kind of data comes about when you have a lot of entities and connections between those entities, as in social networks. Some examples where graph analytics can be useful are exploring the spread of a disease or epidemic by analyzing hospitals' and doctors' records; identifying security threats by monitoring social media, email, and text data; and optimizing mobile telecommunications network traffic to ensure call quality and reduce dropped calls.
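To make the association analysis item above more concrete, here is a minimal sketch of how the strength of one rule, "customers with a CD also hold a money market account", could be measured. The transactions, the product names, and the `support` and `confidence` helpers are all hypothetical, and support and confidence are common rule measures chosen for illustration rather than something these notes prescribe.

```python
# Hypothetical data: each set lists the products held by one banking customer.
transactions = [
    {"cd", "money_market"},
    {"cd", "money_market", "checking"},
    {"cd", "checking"},
    {"checking", "savings"},
    {"cd", "money_market", "savings"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the fraction that also contain consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule under test: {cd} -> {money_market}
rule_support = support({"cd", "money_market"}, transactions)
rule_confidence = confidence({"cd"}, {"money_market"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

On this toy data, three of the five customers hold both products (support 0.6), and three of the four CD holders also hold a money market account (confidence 0.75). A real analysis would search over many candidate rules instead of checking a single one.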
- Constructing the Model
Modeling starts with selecting one of the techniques listed above as the appropriate analysis technique, depending on the type of problem you have. Then you construct the model using the data you've prepared. To validate the model, you apply it to new data samples. This evaluates how well the model does on data that was not used to construct it. The common practice is to divide the prepared data into one set for constructing the model and a second, reserved set for evaluating the model after it has been constructed. You can also use new data, prepared the same way as the data that was used to construct the model.
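The practice of holding back part of the prepared data can be sketched as follows. This is an illustration under stated assumptions: the `train_test_split` and `predict_nearest` helpers, the 25% test fraction, and the toy tumor data are all hypothetical, and the nearest-neighbor rule simply stands in for whichever technique you selected.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of samples and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_nearest(train, x):
    """Return the label of the training sample whose feature value is closest to x."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]

# Hypothetical prepared data: (feature value, label). Toy numbers, not medical facts.
data = [(v, "benign") for v in range(1, 11)] + [(v, "malignant") for v in range(21, 31)]

# Construct the "model" from the training set only; keep the test set aside.
train, test = train_test_split(data)

# Evaluate only on samples that were not used to construct the model.
correct = sum(1 for x, label in test if predict_nearest(train, x) == label)
accuracy = correct / len(test)
print(f"train={len(train)}, test={len(test)}, accuracy={accuracy:.2f}")
```

Because the two toy classes are widely separated, the held-out samples are all classified correctly here. Real data rarely separates this cleanly, which is exactly why the reserved set is useful.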
- Evaluating the Model
How you evaluate the model depends on the type of analysis technique you used. Let's briefly look at how to evaluate each technique. For classification and regression, you will have the correct output for each sample in your input data. Comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see if they make sense for your application. For example, do the customer segments reflect your customer base? Are they helpful for use in your targeted marketing campaigns? For association analysis and graph analysis, some investigation will be needed to see if the results are correct. For example, network traffic delays need to be investigated to see whether what your model predicts is actually happening, and whether the sources of the delays are where they are predicted to be in the real system.
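For classification and regression, comparing correct and predicted outputs might look like the sketch below. The two metrics shown, accuracy and mean absolute error, are common choices picked for illustration, and all of the sample values are made up.

```python
def accuracy(actual, predicted):
    """Fraction of samples whose predicted category matches the correct one."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

def mean_absolute_error(actual, predicted):
    """Average absolute difference between correct and predicted numeric values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Classification: hypothetical weather categories.
weather_actual = ["sunny", "rainy", "sunny", "cloudy", "windy"]
weather_predicted = ["sunny", "sunny", "sunny", "cloudy", "windy"]
weather_accuracy = accuracy(weather_actual, weather_predicted)

# Regression: hypothetical stock prices.
price_actual = [10.0, 12.0, 11.0, 13.0]
price_predicted = [11.0, 12.0, 10.0, 15.0]
price_error = mean_absolute_error(price_actual, price_predicted)

print(f"accuracy={weather_accuracy:.2f}, mean absolute error={price_error:.2f}")
```

Higher accuracy is better for the classifier, while a lower error is better for the regression model. Which metric fits best depends on the success criteria defined for the project.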
- Transforming Business Questions into Data Science Questions
After you have evaluated your model to get a sense of its performance on your data, you will be able to determine the next steps. Some questions to consider: Should the analysis be performed with more data in order to get better model performance? Would using different data types help? For example, in your clustering results, is it difficult to distinguish customers from distinct regions? Would adding zip code to your input data help to generate finer-grained customer segments? Do the analysis results suggest a more detailed look at some aspect of the problem? For example, predicting sunny weather gives very good results, but rainy weather predictions are just so-so. This means that you should take a closer look at your examples for rainy weather. Perhaps you just need more samples of rainy weather, or perhaps there are some anomalies in those samples. Or maybe some missing data needs to be included in order to completely capture rainy weather.
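One way to spot a weak category such as rainy weather is to break the results down per category rather than relying on a single overall score. The sketch below assumes made-up labels and a hypothetical `per_class_accuracy` helper.

```python
from collections import defaultdict

def per_class_accuracy(actual, predicted):
    """For each correct category, the fraction of its samples predicted correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a == p:
            correct[a] += 1
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical evaluation results: four sunny days and four rainy days.
actual = ["sunny"] * 4 + ["rainy"] * 4
predicted = ["sunny", "sunny", "sunny", "sunny", "rainy", "sunny", "rainy", "sunny"]

scores = per_class_accuracy(actual, predicted)
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```

Here every sunny day is predicted correctly, but only half of the rainy days are. A breakdown like this points to the category whose samples deserve a closer look.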
- Summary
The ideal situation is that your model performs very well with respect to the success criteria that were determined when you defined the problem at the beginning of the project. In that case, you're ready to move on to communicating and acting on the results that you obtained from your analysis. In summary, data analysis involves selecting the appropriate technique for your problem, building the model, and then evaluating the results. As there are different types of problems, there are also different types of analysis techniques.
[More to come ...]