Feature Engineering

 
[Dartmouth College]


- Overview

Feature engineering, often performed as part of data preprocessing, is the process of transforming raw data into features that can be used to develop machine learning (ML) models.

Feature engineering involves: 

  • Extracting and transforming variables from raw data
  • Selecting, combining, and crafting attributes that capture the relationships between variables
  • Adding, deleting, combining, or mutating data to improve ML model training


Feature engineering helps:

  • Increase the model's accuracy on new, unseen data
  • Enhance the predictive power of machine learning models, leading to better overall performance


Feature engineering can start from raw data such as price lists, product descriptions, and sales volumes.

Some examples of features in a dataset include:

  • Numerical features: continuous or discrete numeric values, such as height, weight, and so on
  • Categorical features: values drawn from a set of classes/categories, such as gender, color, and so on
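As a minimal sketch of how these two feature types are handled in practice: numerical features can be used (or scaled) directly, while categorical features must first be encoded as numbers, for example with one-hot indicator columns. The column names and data below are hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector,
    one column per distinct category (sorted for determinism)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

heights = [170.0, 182.5, 165.2]      # numerical feature: usable as-is
colors = ["red", "blue", "red"]      # categorical feature: must be encoded

encoded = one_hot(colors)            # columns: [blue, red]
rows = [[h] + e for h, e in zip(heights, encoded)]
print(rows)
```

Each row now contains only numbers and can be fed to an ML model; libraries such as scikit-learn and pandas provide more robust versions of this encoding.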


Principal component analysis (PCA) is a feature engineering technique for dimensionality reduction. It involves:

  • Standardizing data
  • Computing the covariance matrix
  • Performing eigenvalue decomposition

 

- Principal Component Analysis

Principal component analysis (PCA) is a statistical method that summarizes large data tables into a smaller set of "summary indices". These indices can be more easily visualized and analyzed. 

PCA is a dimensionality reduction method that transforms a large set of variables into a smaller one. The smaller set still contains most of the information in the large set. 

The new axes that capture this variation are called principal components: PC1 represents the most variation in the data, and PC2 the second most.

The outcome of PCA can be visualized on low-dimensional scatterplots, ideally with only a minimal loss of information.

Here are some steps for solving PCA problems: 

  • Standardize the dataset
  • Compute the eigenvalues and eigenvectors of the covariance matrix
  • Arrange the eigenvalues in descending order
  • Form a feature vector from the top eigenvectors
  • Transform the original dataset onto the new axes
  • Reconstruct the data (optional)
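The steps above can be sketched with NumPy on a small, made-up dataset (the data and the choice of two retained components are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # hypothetical data: 100 samples, 3 features

# 1. Standardize the dataset (zero mean, unit variance per feature)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute eigenvalues/eigenvectors of the covariance matrix
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance matrices are symmetric

# 3. Arrange eigenvalues in descending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Form a feature vector from the top-k eigenvectors
k = 2
W = eigvecs[:, :k]

# 5. Transform the original dataset onto the principal components
Z = Xs @ W                              # shape (100, 2): PC1 and PC2 scores

# 6. Reconstruct the data (approximately) from the reduced representation
X_rec = Z @ W.T                         # back in the original 3-feature space
```

In practice a library implementation such as scikit-learn's `PCA` wraps these steps, but the eigendecomposition view makes explicit what the summary indices are.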

 
PCA can be based on either the covariance matrix or the correlation matrix. The new variables (the PCs) depend on the dataset, rather than being pre-defined basis functions.
PCA is a popular unsupervised algorithm that has been used across applications such as data analysis, data compression, de-noising, and dimensionality reduction.

 

- Feature Extraction

Feature extraction refers to the process of converting raw data into numerical features that models can process, while retaining the information in the original dataset. It often produces better results than applying machine learning directly to raw data.

There are two main methods of performing feature extraction: manual and automatic. 

  • Manual feature extraction: It involves applying domain knowledge and human intuition to select or design features suited to the problem. For example, you can use image processing techniques to extract edges, corners, or regions of interest from images. Manual feature extraction can be effective and tailored to the task, but it can also be labor-intensive and subjective.
  • Automatic feature extraction: It involves using ML algorithms to learn features from data without human intervention. For example, you can use principal component analysis (PCA) to reduce the dimensionality of your data by finding the directions of maximum variation. Automatic feature extraction can be efficient and objective, but it can also be complex and opaque.
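To make the manual side concrete, here is a hypothetical sketch of hand-crafted feature extraction: a raw time-series window (a made-up sensor reading) is reduced to a few summary features chosen by domain intuition rather than learned from data.

```python
import math

# Made-up raw sensor window; in practice this would come from a device.
signal = [0.0, 0.8, 1.0, 0.6, -0.2, -0.9, -1.0, -0.4]

n = len(signal)
mean = sum(signal) / n                                   # average level
energy = sum(x * x for x in signal) / n                  # mean squared amplitude
std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)  # spread
zero_crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)

# The hand-picked feature vector that would replace the raw window.
features = [mean, std, energy, zero_crossings]
```

The choice of these four statistics is itself the "domain knowledge" step; an automatic method such as PCA would instead learn its own projections from many such windows.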

 

[More to come ...]

 

 

 