Data Preprocessing

Imagine a treasure chest overflowing with unrefined gems. Extracting their true value requires careful preparation, a meticulous process known as data preprocessing. This comprehensive guide delves into the world of data preprocessing, equipping you to transform raw, unpolished data into a state that empowers machine learning algorithms to shine.

Why Preprocess Data?

Data, in its raw form, can be messy, inconsistent, and incomplete. It might contain errors, missing values, inconsistencies in format, or irrelevant information. Feeding such data directly into a machine learning model would be akin to expecting a chef to create a culinary masterpiece from unwashed ingredients and a dull knife. Here’s why data preprocessing is crucial:

Improved Model Performance: Clean and consistent data allows machine learning algorithms to focus on learning the underlying patterns and relationships within the data, ultimately leading to better model performance (accuracy, precision, recall, etc.).

Reduced Training Time: By eliminating irrelevant information and inconsistencies, data preprocessing can significantly reduce the time required to train a machine learning model.

Enhanced Algorithm Stability: Preprocessing techniques like normalization or scaling can improve the stability of machine learning algorithms, making them less susceptible to the influence of outliers or features with vastly different scales.

Facilitates Feature Engineering: Cleaned and prepared data serves as a solid foundation for feature engineering, where new features are created to capture specific relationships within the data, potentially boosting model performance further.

Data preprocessing is an essential step in the machine learning workflow. It lays the groundwork for successful model development, paving the way for accurate predictions and valuable insights.


What is Data Preprocessing?

Data preprocessing is the process of transforming raw data into a form that machine learning algorithms can understand. Real-world (raw) data is often incomplete and inconsistent, and data preprocessing is what makes it useful. Imagine a chef preparing ingredients before cooking – data preprocessing performs similar tasks on your data, ensuring it’s clean, consistent, and ready for analysis.

Data Preprocessing Step By Step

Step 1: Import the libraries

Step 2: Import the data set

Step 3: Data Cleaning

Step 4: Data Transformation

Step 5: Data Reduction

Step 6: Feature Scaling

Import the Libraries in Python

First, import the pandas and NumPy libraries and give them aliases.

import pandas as pd
import numpy as np

Import the data set

Import data using pandas. You can import data from many sources: CSV, Excel, SAS, delimited text, SQL databases, and URLs.

# Import CSV file
Data = pd.read_csv("Train.csv")
# Import CSV file from a URL
Data = pd.read_csv("https://quickinsights.org/wp-content/uploads/2020/03/train.csv")
# Import TXT file
Data = pd.read_table("train1.txt")
# Import Excel file (a specific sheet, skipping header rows)
Data = pd.read_excel("train.xls", sheet_name="June", skiprows=2)
# Import from a SQLite3 database
import sqlite3
conn = sqlite3.connect('forecasting.db')
query = "SELECT * FROM forecasting"
results = pd.read_sql(query, con=conn)
print(results.head())
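The format list above also mentions SAS; here is a minimal sketch using pandas’ read_sas (the file name is hypothetical):

# Import a SAS data set (file name is hypothetical)
Data = pd.read_sas("train.sas7bdat")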

Data Cleaning

Find Missing Data

First, check whether the data set has missing values or not.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Data = pd.read_csv("~/Downloads/Data Science/data set/train.csv")
# Returns a boolean DataFrame: True wherever a value is missing
Data.isnull()

For a large data set, use isnull().sum() to count the missing values per column instead of inspecting the full boolean frame.

Data.isnull().sum()

Sometimes null values are encoded as other values, such as "?" or blanks. These need to be converted to NaN format so that later algorithms can handle them.

# Convert '?' placeholders to NaN
for col in Data.columns:
    Data.loc[Data[col] == '?', col] = np.nan

Visualizing Null Values

The seaborn library is used for statistical graphics visualization; a heatmap gives a quick overview of where null values occur.

import seaborn as sns
# Each light cell in the heatmap marks a missing value
sns.heatmap(Data.isnull(), cbar=False)

Drop Null Value

Using dropna()

Data = Data.dropna()  # dropna() returns a new DataFrame; assign it back (or pass inplace=True)

Using the column name

When the graph shows a column is no longer useful, or around 75% of its values are null and there is no sensible value for imputation, the better option is to remove the column.

# Drop a column that is no longer needed
Data.drop(['Unnamed: 0'], axis=1, inplace=True)

Filling null values using the mean or mode

For a numeric column, fill with the mean or median; for a categorical column such as category_ID, use the mode (the most frequent value).

# Find the most frequent value (mode) of the column
result = Data.category_ID.mode()
print(result)
# Then fill null values with the mode
Data["category_ID"].fillna(result[0], inplace=True)

Fill null values with the previous or next ones

# Both return a new DataFrame; assign the result to keep it
Data.fillna(method='bfill')  # backward fill: use the next valid value
Data.fillna(method='pad')    # forward fill: use the previous valid value

Fill NA values using replace()

Data['category_ID'] = Data['category_ID'].replace(to_replace=np.nan, value='Category 26')

Data Transformation

When your data mixes attributes measured in different units, you may need to transform them (though not always). Example: currency, kilograms, and sales volume.


Normalization

Normalization rescales real numeric values into the range 0 to 1.

Use it when you don’t know the distribution of your data, or when you know the distribution is not Gaussian (a bell curve). Examples: k-nearest neighbors and artificial neural networks.

# Import sklearn's preprocessing module
from sklearn import preprocessing
# All values must be numeric; encode non-numeric columns first.
# MinMaxScaler rescales each feature to the 0-1 range described above
# (preprocessing.normalize(), by contrast, scales each row to unit norm).
min_max_scaler = preprocessing.MinMaxScaler()
normalized_Data = min_max_scaler.fit_transform(Data)

Standardization

Standardization shifts the distribution of each attribute to a mean of zero and a standard deviation of one.

Use it when your data follows a Gaussian distribution (bell curve). It is not compulsory, but the technique is more effective when your attributes are Gaussian-distributed and on varying scales. Examples: linear regression, logistic regression.

# Scale each column to zero mean and unit variance
standardized_Data = preprocessing.scale(Data)

Data Discretization

When your data is continuous and you need to convert it to discrete values, use discretization.

# Bin a continuous column into 3 labeled categories
Data['type_contact'] = pd.cut(Data['type_contact'], 3, labels=['email', 'phone', 'fax'])

Data Reduction

Data reduction (dimensionality reduction) techniques include, for example: the filter method using Pearson correlation, the wrapper method using p-values, and the embedded method using Lasso regularization. Lasso regularization is an iterative method: each iteration checks which features contribute the most to training, and if a feature is irrelevant, Lasso penalizes its coefficient to zero and removes it. PCA (Principal Component Analysis) is another widely used technique.

# Filter method: drop columns that barely correlate with the rest
corr = Data.corr()
drop_cols = []
for col in corr.columns:  # iterate over numeric columns present in the correlation matrix
    # Drop a column if it has |correlation| > 0.1 with at most 4 columns
    if sum(corr[col].map(lambda x: abs(x) > 0.1)) <= 4:
        drop_cols.append(col)
Data.drop(drop_cols, axis=1, inplace=True)
print(drop_cols)
display(Data)  # display() is available in Jupyter/IPython notebooks
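The filter method above covers Pearson correlation; here is a minimal sketch of the embedded (Lasso) method and PCA, assuming Data is fully numeric and contains a hypothetical label column named "target":

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA

X = Data.drop(columns=["target"])  # "target" is a hypothetical label column
y = Data["target"]

# Embedded method: Lasso shrinks irrelevant coefficients to zero
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)
print(X.columns[selector.get_support()])  # features Lasso kept

# PCA: project the features onto the top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)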

Feature Scaling

Feature scaling is a form of data transformation. It is applied to independent variables to bring the data into a particular range, and it can also speed up an algorithm’s calculations.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(Data)
# fit() only learns the parameters; transform() applies the scaling
scaled_Data = scaler.transform(Data)

When to Use Feature Scaling

Use feature scaling when your features are on very different scales, which can be irrelevant or misleading to the model, and your algorithm is distance-based. Examples: K-Means, K-Nearest-Neighbours, PCA.


Common Data Preprocessing Methods

Let’s delve into some commonly used data preprocessing techniques; a short combined sketch follows the list:

  • Handling Missing Values:
    • Imputation: Fill in missing values with statistical methods like mean/median imputation or more sophisticated techniques like K-Nearest Neighbors (KNN) imputation.
    • Deletion: Remove data points with missing values, especially if the number of missing values is high or if imputation is not feasible.
  • Data Cleaning and Correction:
    • Error Correction: Identify and fix typos, inconsistencies in formatting, or invalid entries within the data.
    • Outlier Detection and Treatment: Identify data points that deviate significantly from the majority. These outliers can be removed, winsorized (capped to a specific value), or transformed depending on the context.
  • Data Transformation:
    • Scaling: Techniques like standardization (z-score normalization) or min-max scaling can ensure features are on a similar scale, preventing features with larger ranges from dominating the model.
    • Encoding Categorical Variables: Convert categorical features into numerical representations suitable for machine learning algorithms. Techniques like one-hot encoding or label encoding can be used.
    • Handling Skewed Distributions: Transform features with skewed distributions (e.g., applying a log transformation) to create a more normal distribution, potentially improving model performance.
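Here is a short, self-contained sketch of several of these techniques; the DataFrame and column names are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with missing values and an outlier
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, 120],   # 120 looks like an outlier
    "income": [40, 42, 80, 70, 75],
    "city":   ["NY", "LA", "NY", None, "SF"],
})

# Imputation: KNN for numeric columns, mode for the categorical one
imputer = KNNImputer(n_neighbors=2)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Outlier treatment: winsorize (cap) at the 1st and 99th percentiles
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(low, high)

# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

# Skewed distributions: log transform (log1p also handles zeros)
df["age_log"] = np.log1p(df["age"])
print(df)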

Beyond these core techniques, data preprocessing might involve:

  • Feature Selection: Choosing a subset of relevant features to improve model efficiency and potentially reduce overfitting.
  • Normalization: Rescaling features to a specific range (e.g., 0 to 1) for specific algorithms.

Python Libraries for Data Preprocessing

Several Python libraries can streamline your data preprocessing workflow:

  • pandas: Provides powerful tools for data manipulation, cleaning, and feature engineering.
  • NumPy: Offers efficient array-based operations for data cleaning and transformation.
  • Scikit-learn: Includes functionalities for data preprocessing tasks like scaling, encoding categorical variables, and handling missing values.

Utilizing these libraries can significantly reduce the time and effort required for data preprocessing, allowing you to focus on analysis and model building.
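To see how these libraries fit together, here is a minimal sketch of a combined scikit-learn preprocessing pipeline; the column names are hypothetical:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ["age", "income"]   # hypothetical numeric columns
categorical_features = ["city"]        # hypothetical categorical column

# Numeric columns: impute with the median, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: impute with the mode, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])

# X_processed = preprocessor.fit_transform(Data)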

Conclusion

Data preprocessing, often referred to as the “unsung hero” of machine learning, plays a pivotal role in transforming raw data into a powerful tool for discovery and decision-making. By understanding the core principles, implementing common techniques, and adhering to best practices, you can unlock the true potential within your data. Remember, effective data preprocessing sets the stage for success in your machine learning journey – a journey that empowers you to extract valuable insights and drive innovation across various domains.

By Jay Patel

I completed my data science studies in 2018 at innodatatics. I have 5 years of experience in Data Science, Python, and R.