Resampling Methods
- Overview
Resampling is a statistical approach that uses empirical analysis to gather more information about a sample. The goal of resampling is to support inferential decisions: resampling methods draw repeated samples from the observed data in order to draw conclusions about the population of interest.
Resampling methods are statistical techniques that generate new data points by repeatedly drawing, often at random, from an existing data set. They help create synthetic data sets for training machine learning (ML) models, and they help estimate properties of a population when its distribution is unknown, difficult to derive analytically, or when the available sample is small.
Resampling methods are a natural extension of simulation. The analyst uses a computer to generate a large number of simulated samples, then analyzes and summarizes patterns in those samples. The key difference is that the analyst begins with the observed data instead of a theoretical probability distribution.
Two common methods of resampling are:
- Cross Validation Resampling
- Bootstrapping Resampling
Please refer to the following for more information:
- Wikipedia: Resampling
- Cross Validation Resampling
Cross-validation resampling is a technique for estimating how well a machine learning model will perform on unseen data. It involves holding back a test data set, splitting the remaining data into training and test sets, and, most commonly, using k-fold cross-validation.
Cross-validation involves repeatedly drawing samples from a training set and refitting the model of interest on each sample. This yields an estimate of the test error rate and assists in model selection.
Here's how k-fold cross-validation works:
- The whole data set is divided into k subsets (folds) of nearly equal size.
- One fold is held out as the test set.
- The model is trained on the remaining k-1 folds.
- The test error rate is calculated by fitting the trained model to the held-out fold.
- The process is repeated k times, so that each fold serves as the test set exactly once, and the k error estimates are averaged.
K-fold cross-validation can help avoid overfitting or underfitting by providing a more reliable estimate of the model's performance on unseen data.
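The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the `fit` and `predict` callables, the toy data, and the mean-predictor model are all hypothetical stand-ins chosen so the example runs without external libraries.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5):
    """Average the mean squared test error over k folds."""
    folds = k_fold_splits(len(xs), k)
    fold_errors = []
    for i in range(k):
        test_idx = set(folds[i])
        # Train on the k-1 folds that are NOT held out.
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        train_y = [y for j, y in enumerate(ys) if j not in test_idx]
        model = fit(train_x, train_y)
        # Evaluate on the held-out fold.
        errs = [(predict(model, xs[j]) - ys[j]) ** 2 for j in folds[i]]
        fold_errors.append(sum(errs) / len(errs))
    # Average the k per-fold error estimates.
    return sum(fold_errors) / k

# Hypothetical usage: a trivial "model" that always predicts the training mean.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
fit = lambda X, Y: sum(Y) / len(Y)
predict = lambda model, x: model
cv_error = cross_validate(xs, ys, fit, predict, k=5)
```

Swapping in a real learner only requires changing the `fit` and `predict` callables; the fold logic stays the same.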
- Bootstrapping Resampling
Bootstrapping is a resampling method that repeatedly takes random samples, with replacement, from a single observed sample. Each resample mimics the variation in the original sampling process, so the spread of a statistic across the resamples approximates its sampling distribution. This allows the calculation of standard errors and confidence intervals, and supports hypothesis testing, without strong distributional assumptions.
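As an illustration, the bootstrap standard error of a statistic can be computed in a few lines of plain Python. The data values below are hypothetical, and the resample count and seed are arbitrary choices for the sketch.

```python
import random

def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw a resample the same size as the original, with replacement.
        resample = rng.choices(sample, k=len(sample))
        stats.append(statistic(resample))
    # The standard deviation of the resampled statistics estimates the SE.
    mean = sum(stats) / n_boot
    var = sum((s - mean) ** 2 for s in stats) / (n_boot - 1)
    return var ** 0.5

# Hypothetical measurements; estimate the standard error of the sample mean.
data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
se_mean = bootstrap_se(data, lambda s: sum(s) / len(s))
```

For the sample mean, the bootstrap estimate should land close to the analytic standard error, s/sqrt(n); for statistics with no closed-form standard error, such as the median, the same code applies unchanged.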
[More to come ...]