2. Data preprocessing using scikit-learn | California Housing Prices dataset

5 min read · Aug 18, 2021

Data preprocessing is a data mining technique that transforms raw data into an understandable format. There are a number of data preprocessing methods, but our main focus here is on:

(1) Data Encoding

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Dataset Description

Here I have used the ‘California Housing Prices’ dataset. It contains information such as longitude, latitude, ocean proximity, population, number of bedrooms, number of rooms, and house price.

The dataset contains both numeric and categorical data. Its columns are on different scales, and it has missing values, which makes it a good dataset for practicing preprocessing.

Dataset: California Housing Prices dataset

Data Encoding

Data encoding is the transformation of categorical variables into binary or numerical counterparts. We assign a unique value to each category of a categorical attribute. An example is encoding male and female for gender as 1 and 0. There are two types of data encoding: (1) label encoding and (2) one-hot encoding.

Column 1 represents the house price and column 2 represents the number of houses with that price.

(1) Label encoding

If a column has more than one category, we can use a Label Encoder to convert those categories into numerical features. The Label Encoder assigns a unique number to each category.

As you can see, the ‘median_house_value’ column has 3842 categories, which are nothing but the distinct house price values. After applying the Label Encoder, the data is labeled: the 500001 value is converted to 3841, 137500 is converted to 959, 162500 is converted to 1209, and so on.

The classes_ attribute helps us identify which original category each numerical label stands for (index 0: 14999, index 1: 17500, and so on).
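The label encoding step described above can be sketched roughly as follows. The tiny DataFrame and the `house_value_label` column name are made up for illustration; in the article the encoder is applied to the full ‘median_house_value’ column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'median_house_value' column (illustrative values only).
df = pd.DataFrame({"median_house_value": [500001, 137500, 162500, 137500, 14999]})

encoder = LabelEncoder()
# fit_transform sorts the unique values and replaces each value with its index.
df["house_value_label"] = encoder.fit_transform(df["median_house_value"])

print(df)
# classes_[i] is the original value that was encoded as integer i.
print(encoder.classes_)
```

Because the labels are just positions in the sorted list of unique values, the largest value in the column always receives the highest label.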

(2) One-hot encoder

The one-hot encoder does the same job in a different way. The Label Encoder assigns a number to each category, whereas the one-hot encoder assigns a whole new binary column to each category. So if a column has 3 categories, the one-hot encoder will produce 3 new columns for it.

Here the ocean_proximity attribute is divided into 5 categories.

We can also confirm this by comparing the values of any row in both frames (the original data and the new transformed_data).
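As a minimal sketch of this step, assuming a small hand-made frame in place of the real dataset, one-hot encoding of ocean_proximity could look like this. Only the `transformed_data` name comes from the text above; the rows themselves are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the 'ocean_proximity' column (illustrative rows only).
df = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR OCEAN", "ISLAND"]
})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default, so convert it to a dense array.
encoded = encoder.fit_transform(df[["ocean_proximity"]]).toarray()

# One new column per category; categories_[0] holds the category names.
transformed_data = pd.DataFrame(encoded, columns=encoder.categories_[0], index=df.index)

# Compare one row in the original and the transformed frame.
print(df.loc[0, "ocean_proximity"])
print(transformed_data.loc[0])
```

Each row of `transformed_data` has exactly one 1, in the column of the category that the original row held.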

Which encoder to use depends on the dataset and its behavior. One-hot encoding increases the dimensionality, but it is useful most of the time, because with label encoding a model may treat the numerical labels as ordered, comparable quantities and make wrong assumptions. That is why one-hot encoding is used more often in the real world. Still, I advise you to experiment with both.

Normalization

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. Real-world data is rarely available on a single scale, and columns usually have different ranges. To bring all the columns onto one scale, we can use normalization methods.

MinMax scaler type normalization

MinMaxScaler : For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum.
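Here is a small sketch of that calculation with MinMaxScaler. The two columns and their values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Two toy columns on very different scales (illustrative values only).
df = pd.DataFrame({
    "population": [100.0, 300.0, 500.0],
    "median_income": [2.0, 4.0, 10.0],
})

scaler = MinMaxScaler()
# For each column: x_scaled = (x - min) / (max - min)
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled)
```

For example, the middle population value becomes (300 - 100) / (500 - 100) = 0.5, and both columns now run from 0 to 1.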

Standardization

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation (i.e. standard deviation = 1).
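A minimal sketch of standardization with scikit-learn's StandardScaler, again on made-up values rather than the real dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy columns on different scales (illustrative values only).
df = pd.DataFrame({
    "population": [100.0, 300.0, 500.0, 700.0],
    "median_income": [2.0, 4.0, 6.0, 12.0],
})

scaler = StandardScaler()
# For each column: z = (x - mean) / standard deviation
scaled = scaler.fit_transform(df)  # returns a NumPy array

# After scaling, every column has mean ~0 and standard deviation ~1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Unlike normalization, the result is not limited to the 0 to 1 range; values can be negative or greater than 1.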

Imputing Missing Values

Missing data are values that are not recorded in a dataset. A single value can be missing in a single cell, or an entire observation (row) can be missing. Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population).

We can handle missing values in two ways: (1) remove the data (the whole row) that has missing values, or (2) fill in the values using some strategy, for example with an imputer.

We can see that the total_bedrooms attribute has 207 null/blank values.

Simple Imputer
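Both ways of handling missing values can be sketched on a toy total_bedrooms column as below. The values are illustrative, and the median strategy is my assumption; SimpleImputer also supports "mean", "most_frequent" and "constant".

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for 'total_bedrooms' with two missing entries (illustrative values).
df = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0, np.nan]})

# Count the missing values in the column.
missing_before = int(df["total_bedrooms"].isnull().sum())
print(missing_before)

# Option 1: remove every row that has a missing value in this column.
dropped = df.dropna(subset=["total_bedrooms"])

# Option 2: fill the gaps with the column median (strategy chosen as an assumption).
imputer = SimpleImputer(strategy="median")
df["total_bedrooms_imputed"] = imputer.fit_transform(df[["total_bedrooms"]]).ravel()

print(df)
```

Dropping rows loses the other information stored in those rows, while imputing keeps every row at the cost of filling in estimated values.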

Discretization

Data discretization is the process of converting continuous data into discrete buckets by grouping it. By doing this we limit the number of possible states. Basically, we convert numerical features into categorical columns.

There are 3 types of discretization available in scikit-learn: (1) Quantile Discretization Transform, (2) Uniform Discretization Transform, and (3) KMeans Discretization Transform.
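In scikit-learn, all three are available through the `strategy` parameter of KBinsDiscretizer. The sketch below compares them on a made-up one-column array; the choice of 3 bins and the values themselves are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy continuous feature with three loose groups of values (illustrative only).
values = np.array(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 55, 60, 95, 100], dtype=float
).reshape(-1, 1)

binned = {}
for strategy in ["quantile", "uniform", "kmeans"]:
    # encode="ordinal" returns the bin index (0, 1, 2) instead of one-hot columns.
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned[strategy] = discretizer.fit_transform(values).ravel().astype(int).tolist()
    print(strategy, binned[strategy])
```

The quantile strategy puts roughly the same number of samples in each bin, the uniform strategy uses bins of equal width, and the kmeans strategy places bin edges between cluster centers found by 1D k-means.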