Data Science Life Cycle
- Overview
Data science is quickly becoming one of the hottest fields in the tech industry. With rapid advances in computing power now enabling the analysis of massive data sets, we can uncover patterns and insights about user behavior and world trends to an unprecedented degree.
The data science life cycle revolves around using machine learning and other analytical methods to generate insights and predictions from data in support of business goals.
The complete process includes multiple steps, such as data cleaning, preparation, modeling, and model evaluation. It is therefore important to have a common structure for approaching each problem.
- CRISP-DM Framework
The most widely cited framework for solving analytical problems is the Cross-Industry Standard Process for Data Mining (CRISP-DM). The CRISP-DM methodology provides a structured approach to planning a data mining project. It is a robust and well-proven methodology.
This model is an idealised sequence of events. In practice many of the tasks can be performed in a different order and it will often be necessary to backtrack to previous tasks and repeat certain actions. The model does not try to capture all possible routes through the data mining process.
The process consists of six phases:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Published in 1999 to standardize data mining processes across industries, CRISP-DM has since become the most common methodology for data mining, analytics, and data science projects.
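The six phases and the backtracking described above can be sketched in a few lines of Python. This is only an illustration: the `Phase` enum and the `next_phase` helper are hypothetical names invented for this example and are not part of any CRISP-DM specification or library. The rule that a project may return only to an earlier phase is a simplifying assumption.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """The six CRISP-DM phases, numbered in their idealised order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(current: Phase, backtrack_to: Optional[Phase] = None) -> Phase:
    """Return the phase to work on next (hypothetical helper).

    By default the project advances to the following phase. Passing
    `backtrack_to` models a return to an earlier phase to repeat work.
    """
    if backtrack_to is not None:
        if backtrack_to.value >= current.value:
            raise ValueError("can only backtrack to an earlier phase")
        return backtrack_to
    if current is Phase.DEPLOYMENT:
        return current  # last phase; nothing follows it in this sketch
    return Phase(current.value + 1)


# Walk forward from business understanding to evaluation.
path = [Phase.BUSINESS_UNDERSTANDING]
for _ in range(4):
    path.append(next_phase(path[-1]))

# Suppose evaluation shows the model falls short: return to data preparation.
path.append(next_phase(path[-1], backtrack_to=Phase.DATA_PREPARATION))
print([phase.name for phase in path])
```

A real project rarely follows one fixed path, so the sketch leaves the decision to backtrack to the caller rather than encoding it.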
- A Data Science Project's Life Cycle
The life cycle of a data science project follows the data from its creation to its destruction. It involves many stages, including problem definition, data collection, preprocessing, exploratory analysis, model building, and deployment.
Other stages of a data science project's life cycle include:
- Business problem understanding
- Data cleaning and processing
- Model communication
- Model evaluation and monitoring
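To make the model building and evaluation stages concrete, here is a minimal sketch in plain Python. It assumes a made-up data set and a simple straight-line model fitted by ordinary least squares; the helper names `fit_line` and `mean_absolute_error` are defined here for illustration only. Real projects would typically use a dedicated library and a randomized train/test split.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data for illustration: roughly y = 2x + 1 with small noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 17.2]

# Model building: fit on the first six points only.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]
slope, intercept = fit_line(train_x, train_y)

# Model evaluation: measure the error on the two held-out points.
predictions = [slope * x + intercept for x in test_x]
mae = mean_absolute_error(test_y, predictions)
print(f"slope={slope:.2f} intercept={intercept:.2f} MAE={mae:.2f}")
```

Evaluating on data the model did not see during fitting is what separates an honest error estimate from an optimistic one. Monitoring repeats the same measurement on new data after deployment.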
The time required to complete a data science project varies widely and depends on the data set. It can take months or even years for a model to start showing results.
The data processing phase is usually the longest and most important phase of a data science project. This is because the quality of the input data determines the quality of the output.
Data preparation turns raw data into a form suitable for further processing and analysis. It involves:
- Collecting data from various sources
- Cleaning and labeling data
- Handling missing data
- Exploring and visualizing data
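The data preparation steps listed above can be sketched with the Python standard library alone. The records, the field names (`user_id`, `age`, `country`), the `prepare` function, and the choice of mean imputation are all assumptions made for this example. In practice the right cleaning and imputation strategy depends on the data.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw records collected from two sources; None marks a missing value.
raw_records = [
    {"user_id": 1, "age": 34, "country": " us "},
    {"user_id": 2, "age": None, "country": "DE"},
    {"user_id": 2, "age": None, "country": "DE"},  # duplicate row
    {"user_id": 3, "age": 28, "country": None},
]


def prepare(records):
    """Drop duplicates, normalize text fields, and fill missing values."""
    # Clean: drop exact duplicates while preserving the original order.
    seen, unique = set(), []
    for rec in records:
        key = (rec["user_id"], rec["age"], rec["country"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Handle missing data: impute ages with the mean of the observed ages.
    fill_age = mean(r["age"] for r in unique if r["age"] is not None)

    cleaned = []
    for rec in unique:
        country = rec["country"]
        cleaned.append({
            "user_id": rec["user_id"],
            "age": rec["age"] if rec["age"] is not None else fill_age,
            # Clean: trim whitespace and upper-case the country code.
            "country": country.strip().upper() if country else "UNKNOWN",
        })
    return cleaned


cleaned = prepare(raw_records)

# Explore: a quick summary of the cleaned data.
print(len(cleaned), "rows after cleaning")
print(Counter(rec["country"] for rec in cleaned))
```

Mean imputation is used here only because it is easy to read. It can distort the distribution of a column, so alternatives such as dropping rows or using a median are worth considering.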
[More to come ...]