Table of Contents
What is Data Preprocessing?
Data preprocessing is the process of transforming raw data into a form that algorithms can understand. Real-world (raw) data is often, though not always, incomplete and inconsistent. Data preprocessing makes raw data useful.
Data Preprocessing Step By Step
Step 1 : Import the libraries
Step 2 : Import the data set
Step 3 : Data Cleaning
Step 4 : Data Transformation
Step 5 : Data Reduction
Step 6 : Feature Scaling
Import the libraries in Python
First, import the pandas and NumPy libraries and give them aliases.
import pandas as pd
import numpy as np
Import the data set
Import data using pandas. You can import data from CSV, Excel, SAS, delimited text files, SQL databases, and URLs.
# Import a CSV file
Data = pd.read_csv("Train.csv")

# Import a CSV file from a URL
Data = pd.read_csv("https://quickinsights.org/wp-content/uploads/2020/03/train.csv")

# Import a TXT file
Data = pd.read_table("train1.txt")

# Import an Excel file
Data = pd.read_excel("train.xls", sheet_name="June", skiprows=2)

# Import from a SQLite3 database
import sqlite3
conn = sqlite3.connect('forecasting.db')
query = "SELECT * FROM forecasting"
results = pd.read_sql(query, con=conn)
print(results.head())
Data Cleaning
Find Missing Data
First, check whether the data set has missing values or not.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Data = pd.read_csv("~/Downloads/Data Science/data set/train.csv")
Data.isnull()

For a huge data set, isnull() is hard to read; use isnull().sum() instead to count the missing values per column.
Data.isnull().sum()
Sometimes null values are encoded as other values, such as '?' or blanks. Convert them to NaN so that later steps can handle them.
# Convert '?' placeholders to NaN
for col in Data.columns:
    Data.loc[Data[col] == '?', col] = np.nan
Visualizing Null Values
The seaborn library is used for statistical graphics visualisation.
import seaborn as sns

sns.heatmap(Data.isnull(), cbar=False)
Drop Null Values
Using dropna()
Data.dropna()
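Note that dropna() returns a new DataFrame by default. A minimal sketch of its common options (the thresh value here is an illustrative choice):

# Drop every row that has at least one null value (the default)
Data.dropna(axis=0, how='any')

# Drop a row only when all of its values are null
Data.dropna(how='all')

# Keep only rows with at least 3 non-null values (thresh is illustrative)
Data.dropna(thresh=3)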
Using the column name
When the graph shows a column is no longer useful, or around 75% of its values are null and imputation would not help, the better option is to remove the column, as shown below.
Data.drop(['Unnamed: 0'], axis=1, inplace=True)
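As a sketch of the 75% rule above, you can compute the null fraction per column and drop the columns that exceed it (the 0.75 threshold is taken from the text):

# Fraction of null values in each column
null_fraction = Data.isnull().sum() / len(Data)

# Columns where more than 75% of the values are null
cols_to_drop = null_fraction[null_fraction > 0.75].index

Data.drop(cols_to_drop, axis=1, inplace=True)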
Filling null values using the mode
# Find the most frequent value (mode) of the column
result = Data["category_ID"].mode()
print(result)

# Then fill the null values with the mode, here 'Category 26'
Data["category_ID"].fillna("Category 26", inplace=True)
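For a numeric column, the same idea works with the mean instead of the mode. A minimal sketch, assuming a hypothetical numeric column named 'price':

# Fill null values in a numeric column with its mean
# (the column name 'price' is hypothetical)
Data["price"] = Data["price"].fillna(Data["price"].mean())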
Fill null values with the previous or next ones
Data.fillna(method='bfill')  # backward fill: use the next value
Data.fillna(method='pad')    # forward fill: use the previous value
Fill null values using replace()
Data['category_ID'].replace(to_replace = np.nan, value = 'Category 26')
Data Transformation
When your data mixes attributes with different units and scales, such as currency, kilograms, and sales volume, it usually needs transformation, though not always.
Normalization
Normalization rescales real numeric values into the range 0 to 1.
Use it when you don't know the data distribution, or you know the distribution is not Gaussian (a bell curve). Examples: k-nearest neighbors and artificial neural networks.
# Import the sklearn preprocessing module for normalization
from sklearn import preprocessing

# All values must be numeric; non-numeric columns are not converted automatically
normalized_Data = preprocessing.normalize(Data)
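Note that preprocessing.normalize() rescales each row to unit norm. To map each feature into the 0 to 1 range described above, scikit-learn's MinMaxScaler is the usual tool; a minimal sketch:

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature into the range [0, 1]
min_max_scaler = MinMaxScaler()
minmax_Data = min_max_scaler.fit_transform(Data)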
Standardization
Standardization shifts the distribution of every attribute to a mean of zero and a standard deviation of one.
Use it when your data follows a Gaussian distribution (bell curve). It is not compulsory, but the technique is more effective when your attributes are Gaussian and on varying scales. Examples: linear regression and logistic regression.
standardized_Data = preprocessing.scale(Data)
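You can verify the zero-mean, unit-standard-deviation claim directly; a minimal sketch:

# Each column should now have mean ~0 and standard deviation ~1
print(standardized_Data.mean(axis=0))
print(standardized_Data.std(axis=0))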
Data Discretization
When your data is continuous and you need to convert it into discrete values, use discretization.
Data['type_contact'] = pd.cut(Data['type_contact'], 3, labels=['email', 'phone', 'fax'])
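pd.cut splits the range into equal-width bins. If you need explicit bin edges, or equal-frequency bins instead, pandas supports both; a minimal sketch, where the 'age' column, the edges, and the labels are all illustrative:

# Explicit bin edges (the column name and values are illustrative)
Data['age_group'] = pd.cut(Data['age'], bins=[0, 18, 65, 120],
                           labels=['child', 'adult', 'senior'])

# Equal-frequency bins: each bin gets roughly the same number of rows
Data['age_quartile'] = pd.qcut(Data['age'], 4,
                               labels=['q1', 'q2', 'q3', 'q4'])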
Data Reduction
Data reduction (dimensionality reduction) techniques include: the filter method using Pearson correlation, the wrapper method using p-values, the embedded method using Lasso regularization, and PCA (Principal Component Analysis). Lasso regularization is an iterative method: each iteration checks which features contribute the most to training, and if a feature is irrelevant, Lasso penalizes its coefficient to zero and removes it.
# Filter method: drop columns that are weakly correlated with the others
corr = Data.corr()
drop_cols = []
for col in corr.columns:
    # Keep a column only if more than 4 of its correlations exceed 0.1 in absolute value
    if sum(corr[col].map(lambda x: abs(x) > 0.1)) <= 4:
        drop_cols.append(col)
Data.drop(drop_cols, axis=1, inplace=True)
print(drop_cols)
display(Data)
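The embedded (Lasso) and PCA methods mentioned above can be sketched with scikit-learn. This assumes the data has already been split into a numeric feature matrix X and a target y, and the alpha and n_components values are illustrative choices:

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA

# Embedded method: Lasso shrinks irrelevant coefficients to zero,
# and SelectFromModel keeps only the features with non-zero weights
lasso = Lasso(alpha=0.1)          # alpha is an illustrative choice
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X, y)

# PCA: project the features onto the top 2 principal components
pca = PCA(n_components=2)         # n_components is an illustrative choice
X_reduced = pca.fit_transform(X)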

Feature Scaling
Feature scaling is also known as data transformation. It is applied to independent variables to bring the data into a particular range, and it also helps speed up an algorithm's calculations.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(Data)
scaled_Data = scaler.transform(Data)  # apply the learned scaling
When to Use Feature Scaling
Use feature scaling when the data is on large or mismatched scales, which can be misleading, and your algorithm is distance based. Examples: K-Means, K-Nearest Neighbours, and PCA.
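A minimal sketch of scaling before a distance-based algorithm, using a scikit-learn pipeline so the scaler and K-Means run together (the cluster count is an illustrative choice):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Scale the features first, then cluster; assumes Data is fully numeric
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3))
model.fit(Data)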