Dimensionality Reduction
- Overview
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
As the number of features (dimensions) in a data set increases, the amount of data needed to cover the feature space, and therefore to obtain statistically reliable results, can grow exponentially. This problem is known as the curse of dimensionality. When dealing with high-dimensional data, it can lead to overfitting of machine learning (ML) models, increased computation time, and reduced accuracy.
As the number of dimensions increases, the number of possible feature combinations grows exponentially. Obtaining a representative sample of the data therefore becomes difficult, and tasks such as clustering or classification become computationally expensive. In addition, some ML algorithms are sensitive to the number of dimensions and need more data to reach the same accuracy they achieve on lower-dimensional data.
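The exponential growth described above can be illustrated with a small experiment. The following is a minimal sketch, not a formal measure: it assumes data drawn uniformly from the unit hypercube, splits each axis into 10 bins, and reports what fraction of the resulting grid cells a fixed sample of 1000 points manages to reach. The function name `occupied_fraction` and all parameter values are illustrative choices.

```python
import random


def occupied_fraction(n_samples, n_dims, bins_per_axis=10, seed=0):
    """Fraction of grid cells in the unit hypercube hit by at least one sample.

    Assumption for illustration: samples are drawn uniformly at random.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    occupied = set()
    for _ in range(n_samples):
        # Map each coordinate in [0, 1) to a bin index in 0..bins_per_axis-1.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n_dims))
        occupied.add(cell)
    total_cells = bins_per_axis ** n_dims  # grows exponentially with n_dims
    return len(occupied) / total_cells


if __name__ == "__main__":
    for d in (1, 2, 3, 4, 6):
        frac = occupied_fraction(n_samples=1000, n_dims=d)
        print(f"dims={d}: {10 ** d} cells, {frac:.4%} of cells contain a sample")
```

With the sample size held fixed, the number of cells is multiplied by 10 for every added dimension, so the sample covers a rapidly shrinking share of the space. In six dimensions, 1000 points can reach at most 0.1% of the one million cells.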
To mitigate the curse of dimensionality, feature engineering techniques are used, including feature selection and feature extraction. Dimensionality reduction aims to reduce the number of input features while retaining as much of the original information as possible, either by keeping a subset of the original features (feature selection) or by deriving a smaller set of new features from them (feature extraction).
Please refer to the following for more information:
- Wikipedia: Dimension Reduction
- Dimensionality Reduction Techniques
Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.
Here are some dimensionality reduction techniques:
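- Principal Component Analysis (PCA): A popular linear technique that projects the data onto a small number of orthogonal directions (the principal components) that capture the most variance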
- Linear Discriminant Analysis (LDA): A supervised predictive modeling algorithm for multi-class classification. LDA can also be used as a dimensionality reduction technique, because it projects the data onto the directions that best separate the classes
- Factor analysis: An unsupervised technique that models the observed variables in terms of a smaller number of unobserved factors representing their common variance. Factor analysis is closely related to PCA, but its main focus is on finding latent variables rather than on capturing total variance
- Backward feature elimination: Starts with all the features and removes the least significant feature at each iteration
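- Low variance filter: Calculates the variance of each column and removes the columns whose variance falls below a given threshold. This method applies only to numerical columns, and because variance depends on scale, the columns should be on comparable scales before they are compared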
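Two of the techniques listed above can be sketched in a few lines: the low variance filter (feature selection) and PCA (feature extraction). This is a minimal illustration using only the Python standard library, not a reference implementation. The data set, the threshold of 0.01, and the helper names `low_variance_filter`, `first_principal_component`, and `project` are all hypothetical. The PCA part finds only the first principal component, using power iteration on the covariance matrix, and assumes the data has nonzero variance.

```python
import math
import random
import statistics


def low_variance_filter(rows, threshold):
    """Feature selection: return indices of columns with variance above threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns)
            if statistics.pvariance(col) > threshold]


def first_principal_component(rows, n_iter=200, seed=0):
    """Feature extraction: leading eigenvector of the sample covariance matrix.

    Uses plain power iteration; assumes the data is not constant.
    Returns the column means and a unit-length direction vector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]  # arbitrary starting direction
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Project centered rows onto one direction, giving one value per row."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]


# Hypothetical data: columns 0 and 1 are strongly correlated, column 2 barely varies.
rows = [
    [2.0, 1.9, 5.0],
    [4.0, 4.1, 5.0],
    [6.0, 6.2, 5.1],
    [8.0, 7.8, 5.0],
]

kept = low_variance_filter(rows, threshold=0.01)   # 3 features -> 2 features
reduced = [[r[j] for j in kept] for r in rows]
means, pc = first_principal_component(reduced)     # 2 features -> 1 direction
scores = project(reduced, means, pc)               # one number per observation

if __name__ == "__main__":
    print("kept columns:", kept)
    print("first principal component:", pc)
    print("1-D scores:", scores)
```

The filter keeps a subset of the original columns unchanged, while PCA replaces the remaining two correlated columns with a single new feature that is a weighted combination of both. In practice a numerical library would typically handle the linear algebra and return several components at once.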