Once the data has been collected, it is in a raw form. Raw data is collected from various sources and is usually unsuitable for analysis by itself. For example:
There might be many duplicate entries.
The data may have typographical errors.
There may be missing data.
The data may be available in different formats.
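As a rough illustration of how these problems can be spotted, the sketch below audits a handful of records using only the Python standard library. The records, the column names (id, city, joined, age) and the audit helper are all invented for this example; a real data set would need checks suited to its own columns.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw records, invented for illustration only.
raw_rows = [
    {"id": "1", "city": "Boston", "joined": "2021-03-04", "age": "34"},
    {"id": "1", "city": "Boston", "joined": "2021-03-04", "age": "34"},  # exact duplicate
    {"id": "2", "city": "Bostn", "joined": "04/03/2021", "age": ""},     # typo, other date format, missing age
    {"id": "3", "city": "Chicago", "joined": "2021-05-19", "age": "41"},
]

def audit(rows):
    """Count duplicate rows, missing values, and dates that are not ISO formatted."""
    # Duplicates: identical rows beyond the first occurrence.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in seen.values())

    # Missing data: empty or whitespace-only fields.
    missing = sum(1 for r in rows for v in r.values() if v.strip() == "")

    # Mixed formats: dates that do not parse as YYYY-MM-DD.
    non_iso_dates = 0
    for r in rows:
        try:
            datetime.strptime(r["joined"], "%Y-%m-%d")
        except ValueError:
            non_iso_dates += 1

    # Possible typos: category values seen only once are worth a manual look.
    city_counts = Counter(r["city"] for r in rows)
    rare_cities = sorted(c for c, n in city_counts.items() if n == 1)

    return {
        "duplicates": duplicates,
        "missing": missing,
        "non_iso_dates": non_iso_dates,
        "rare_cities": rare_cities,
    }

report = audit(raw_rows)
print(report)
```

A value that appears only once is not necessarily a typo ("Chicago" here is legitimate), so the rare-value list is a starting point for review rather than a verdict.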
With an uncleaned data set, no algorithm is likely to give accurate results, whichever one you try. That is why data scientists spend a considerable amount of time on data cleaning; the data also has to be labeled. This work is called preprocessing. The steps and techniques for data preprocessing vary from one data set to another, but the following steps can serve as a standard approach for most types of data set:
Identifying relevant data and removing irrelevant data
Fixing irregular cardinality and structural errors
Handling outliers
Treating missing data
Transforming the data
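The first four steps above can be sketched in plain Python as follows. This is a minimal example under stated assumptions, not a prescribed recipe: the records and column names are invented, the 1.5 × IQR rule is just one common way to flag outliers, and median imputation is just one common way to fill gaps.

```python
import statistics

# Hypothetical records; column names and values are invented for illustration.
rows = [
    {"row_id": 1, "state": "NY",  "income": 52000.0},
    {"row_id": 2, "state": "ny ", "income": 48000.0},
    {"row_id": 3, "state": "CA",  "income": None},
    {"row_id": 4, "state": "ca",  "income": 51000.0},
    {"row_id": 5, "state": "CA",  "income": 950000.0},
    {"row_id": 6, "state": "NY",  "income": 50000.0},
]

# 1. Remove irrelevant data: row_id is assumed to carry no predictive signal.
cleaned = [{k: v for k, v in r.items() if k != "row_id"} for r in rows]

# 2. Fix irregular cardinality: "NY" and "ny " should be a single category.
for r in cleaned:
    r["state"] = r["state"].strip().upper()

# 3. Outliers: drop incomes outside 1.5 * IQR of the observed values.
known = [r["income"] for r in cleaned if r["income"] is not None]
q1, _, q3 = statistics.quantiles(known, n=4, method="inclusive")
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
cleaned = [
    r for r in cleaned
    if r["income"] is None or low <= r["income"] <= high
]

# 4. Missing data: fill gaps with the median of the remaining incomes.
median_income = statistics.median(
    r["income"] for r in cleaned if r["income"] is not None
)
for r in cleaned:
    if r["income"] is None:
        r["income"] = median_income

for r in cleaned:
    print(r)
```

The order matters here: the outlier is removed before imputation so that it does not influence the value used to fill the gap. Whether to drop, cap, or keep an outlier depends on the data set.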
After data preprocessing comes the final phase of data preparation: data splitting.
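The text does not say how the split is done. A common approach, assumed here, is to shuffle the records and hold out a fraction for testing. The helper name split_rows, the 20% test fraction and the fixed seed are illustrative choices, not part of the original material.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test) lists.

    split_rows is a hypothetical helper written for this example.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-in for ten preprocessed records
train, test = split_rows(records, test_fraction=0.2)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which makes results easier to compare while the preprocessing steps are still changing.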
In summary, data preparation can be organized as a three-step framework, with tactics for each step:
Step 1: Data Selection. Consider what data is available, what data is missing, and what data can be removed.
Step 2: Data Preprocessing. Organize your selected data by formatting it, cleaning it, and sampling from it.
Step 3: Data Transformation. Transform the preprocessed data so that it is ready for machine learning by engineering features through scaling, attribute decomposition, and attribute aggregation.
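The three transformation tactics named in Step 3 might look like this in plain Python. The records and field names are invented, and min-max scaling is only one of several scaling methods; treat this as a sketch rather than the definitive way to engineer features.

```python
from datetime import date

# Hypothetical preprocessed records, invented for illustration.
rows = [
    {"signup": date(2021, 3, 4),   "logins": [3, 5, 2], "income": 48000.0},
    {"signup": date(2021, 11, 20), "logins": [1],       "income": 52000.0},
    {"signup": date(2022, 1, 9),   "logins": [4, 4],    "income": 50000.0},
]

# Range of the column to scale; this sketch assumes hi > lo.
lo = min(r["income"] for r in rows)
hi = max(r["income"] for r in rows)

features = []
for r in rows:
    features.append({
        # Scaling: min-max rescale income into the range [0, 1].
        "income_scaled": (r["income"] - lo) / (hi - lo),
        # Attribute decomposition: split one date into simpler parts.
        "signup_year": r["signup"].year,
        "signup_month": r["signup"].month,
        # Attribute aggregation: collapse many values into one summary.
        "total_logins": sum(r["logins"]),
    })

for f in features:
    print(f)
```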
Data preparation is a large subject that can involve many rounds of iteration, exploration, and analysis. Getting good at data preparation will make you considerably more effective at machine learning.