Prepare
- Stage 2: Prepare - To Scrub Data
The second stage is to prepare the data. We divide the data preparation activity into two steps based on the nature of the activity: explore data and pre-process data.
The first step in data preparation involves literally looking at the data to understand its nature, what it means, its quality, and its format. It often takes a preliminary analysis of the data, or of samples of the data, to understand it. This step is called "explore". Once we know more about the data through exploratory analysis, the next step is pre-processing the data for analysis.
Pre-processing includes cleaning data, sub-setting or filtering data, and creating data that programs can read and understand, such as modeling raw data into a more defined data model or packaging it in a specific data format. If multiple data sets are involved, this step also includes integrating multiple data sources or streams.
- Exploring Data
The first step after getting your data is to explore it. Exploring data is a part of the two-step data preparation process. You want to do some preliminary investigation in order to gain a better understanding of the specific characteristics of your data. In this step, you'll be looking for things like correlations, general trends, and outliers. Without this step, you will not be able to use the data effectively.
Correlation graphs can be used to explore the dependencies between different variables in the data. Graphing the general trends of variables will show you whether their values are moving in a consistent direction, such as sales prices going up or down. In statistics, an outlier is a data point that is distant from other data points. Plotting outliers will help you double-check for measurement errors in the data. In some cases, outliers that are not errors may point you to a rare event.
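As a minimal sketch of this kind of exploration, here is one way to look at correlations, a trend, and outliers with pandas and matplotlib; the data (prices with an upward trend and one injected outlier) is invented purely for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: 100 days of sale prices with an upward trend and one outlier.
dates = pd.date_range("2023-01-01", periods=100, freq="D")
prices = 100 + np.arange(100) * 0.5 + np.random.normal(0, 2, 100)
prices[50] = 500  # injected outlier
df = pd.DataFrame({"date": dates, "price": prices, "units_sold": 200 - np.arange(100)})

# Correlations: pairwise dependencies between the numeric variables.
print(df[["price", "units_sold"]].corr())

# General trend: plot price over time to see its direction.
df.plot(x="date", y="price", title="Sale price over time")

# Outliers: flag points more than 3 standard deviations from the mean.
z = (df["price"] - df["price"].mean()) / df["price"].std()
print(df[z.abs() > 3])

plt.show()
```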
Additionally, summary statistics provide numerical values that describe your data. Summary statistics are quantities that capture various characteristics of a set of values with a single number, or a small set of numbers. Some basic summary statistics you should compute for your data set are the mean, median, mode, range, and standard deviation.
Mean and median are measures of the location of a set of values. Mode is the value that occurs most frequently in your data set. Range and standard deviation are measures of spread in your data. Looking at these measures will give you an idea of the nature of your data, and they can tell you if there is something wrong with it.
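As a quick sketch, these summary statistics can be computed with pandas; the values below are made up for illustration:

```python
import pandas as pd

# Hypothetical numeric column from the data set being explored.
prices = pd.Series([12.5, 13.0, 12.8, 13.0, 55.0, 12.9])

print("mean:  ", prices.mean())                 # measure of location
print("median:", prices.median())               # measure of location, robust to outliers
print("mode:  ", prices.mode().iloc[0])         # most frequent value
print("range: ", prices.max() - prices.min())   # measure of spread
print("std:   ", prices.std())                  # measure of spread

# describe() bundles several of these measures in one call.
print(prices.describe())
```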
In summary, what you get by exploring your data is a better understanding of the complexity of the data you have to work with. This, in turn, will guide the rest of your process.
- Pre-processing Data
- Scaling involves mapping values into a specified range, such as from zero to one. This is done to prevent features with large values from dominating the results. For example, when analyzing data with height and weight, the magnitude of the weight values is much greater than that of the height values, so scaling all values to be between zero and one equalizes the contributions from the height and weight features (see the scaling sketch after this list).
- Various transformations can be performed on the data to reduce noise and variability. One such transformation is aggregation. Aggregation generally produces data with less variability, which may help with your analysis. For example, daily sales figures may have many sharp fluctuations; aggregating them to weekly or monthly sales figures will result in smoother data (see the aggregation sketch after this list). Other filtering techniques can also be used to reduce variability in the data. Of course, this comes at the cost of less detailed data, so these trade-offs must be weighed for the specific application.
- Feature selection can involve removing redundant or irrelevant features, combining features, and creating new features. During the explore data step, you might have discovered that two features are correlated. In that case, one of these features can be removed without negatively affecting the analysis results. For example, the purchase price of a product and the amount of sales tax paid are likely to be correlated, so eliminating the sales tax amount will be beneficial (see the feature selection sketch after this list). Removing redundant or irrelevant features makes the subsequent analysis much simpler. In other cases, you may want to combine features or create new ones. For example, adding the applicant's education level as a feature to a loan approval application would make sense. There are also algorithms that automatically determine the most relevant features, based on various mathematical properties.
- Dimensionality reduction is useful when the data set has a large number of dimensions. It involves finding a smaller set of dimensions that captures most of the variation in the data. This reduces the dimensionality of the data while eliminating irrelevant features, which makes analysis simpler. A technique commonly used for dimensionality reduction is Principal Component Analysis, or PCA (see the PCA sketch after this list).
- Raw data often has to be manipulated into the correct format for analysis. For example, from samples recording daily changes in stock prices, we may want to capture price changes for particular market segments, like real estate or health care. This would require determining which stocks belong to which market segment, grouping them together, and perhaps computing the mean, range, and standard deviation for each group (see the grouping sketch after this list).
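Scaling sketch. A minimal illustration of min-max scaling for the height/weight example, using scikit-learn's MinMaxScaler (the underlying formula is simply (x - min) / (max - min)); the measurement values are made up:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical measurements: weight values are much larger than height values.
df = pd.DataFrame({"height_m": [1.60, 1.75, 1.82, 1.68],
                   "weight_kg": [55.0, 80.0, 95.0, 62.0]})

# Rescale each column to the range [0, 1] so both features contribute comparably.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled)
```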
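Aggregation sketch. A rough illustration of aggregating noisy daily sales into smoother weekly totals with pandas; the daily figures are randomly generated for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with a lot of day-to-day variability.
days = pd.date_range("2023-01-01", periods=180, freq="D")
daily_sales = pd.Series(np.random.normal(1000, 250, size=len(days)), index=days)

# Aggregate to weekly totals: less variability, but also less detail.
weekly_sales = daily_sales.resample("W").sum()
print(weekly_sales.head())
print("daily std: ", daily_sales.std())
print("weekly std (per day):", (weekly_sales / 7).std())
```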
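Feature selection sketch. An illustration of the correlated-feature example: if sales tax is a fixed fraction of purchase price (the 8% rate here is an assumption), the two columns carry the same information, so one can be dropped:

```python
import pandas as pd

# Hypothetical purchases: sales_tax is proportional to purchase_price,
# so the two features are perfectly correlated.
df = pd.DataFrame({"purchase_price": [10.0, 25.0, 40.0, 99.0]})
df["sales_tax"] = df["purchase_price"] * 0.08

print(df.corr())  # correlation of 1.0 between the two columns

# Drop the redundant feature before further analysis.
df = df.drop(columns=["sales_tax"])
```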
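PCA sketch. An illustration of dimensionality reduction with scikit-learn's PCA, projecting random 10-dimensional data onto 2 principal components; the data shape is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 samples with 10 features (dimensions).
X = np.random.rand(100, 10)

# Project onto the 2 principal components that capture the most variation.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variation captured by each component
```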
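Grouping sketch. An illustration of the stock example, with made-up tickers, segments, and daily price changes: map each stock to a segment, then compute per-segment summary statistics with a group-by:

```python
import pandas as pd

# Hypothetical daily price changes and a made-up ticker-to-segment mapping.
changes = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "pct_change": [0.5, -1.2, 0.3, 0.7, -0.8, 0.1],
})
segment = {"AAA": "real_estate", "BBB": "health_care", "CCC": "real_estate"}
changes["segment"] = changes["ticker"].map(segment)

# Group by segment and compute mean, range, and standard deviation.
summary = changes.groupby("segment")["pct_change"].agg(
    mean="mean", rng=lambda s: s.max() - s.min(), std="std")
print(summary)
```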
[More to come ...]