Linear Regression With Categorical and Continuous Variables in Python
Welcome to this tutorial on Multiple Linear Regression. We will look into the concept of Multiple Linear Regression and its usage in Machine learning.
Before we dive into multiple linear regression, let me introduce you to the concept of simple linear regression.
What is Simple Linear Regression?
Regression is a Machine Learning technique for predicting values from given data.
For example, consider a dataset of employee details and their salaries.
This dataset contains attributes such as "Years of Experience" and "Salary". Here, we can use regression to predict the salary of a person who has, say, 8 years of experience in the industry.
With simple linear regression, we find the best-fit line for the data, and values are predicted based on this line. The equation of this line looks as follows:
y = b0 + b1 * x1
In the above equation, y is the dependent variable, which is predicted using the independent variable x1. Here, b0 and b1 are constants.
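To make this concrete, here is a minimal sketch of simple linear regression using scikit-learn. The salary figures below are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up experience/salary data, purely for illustration
years = np.array([[1], [2], [3], [5], [7], [10]])
salary = np.array([40000, 45000, 52000, 61000, 73000, 90000])

model = LinearRegression()
model.fit(years, salary)

print(model.intercept_, model.coef_)  # the constants b0 and b1
print(model.predict([[8]]))           # predicted salary at 8 years of experience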
What is Multiple Linear Regression?
Multiple Linear Regression is an extension of Simple Linear Regression where the model depends on more than 1 independent variable for its predictions. The equation for multiple linear regression looks as follows:
y = b0 + b1 *x1 + b2 * x2 + .... + bn * xn
Here, y is the dependent variable, and x1, x2, ..., xn are the independent variables used to predict the value of y. The values b0, b1, ..., bn act as constants.
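Written in code, a multiple linear prediction is just an intercept plus a dot product. The coefficients and feature values below are arbitrary placeholders.

import numpy as np

b0 = 5.0                          # intercept
b = np.array([2.0, -1.5, 0.3])    # b1, b2, b3
x = np.array([10.0, 4.0, 7.0])    # one observation: x1, x2, x3

y = b0 + np.dot(b, x)  # y = b0 + b1*x1 + b2*x2 + b3*x3
print(y)               # 21.1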
Steps to Build a Multiple Linear Regression Model
There are 5 steps to perform when building the model. These steps are explained below:
Step 1: Identify variables
Before you start building your model, it is important to understand the dependent and independent variables, as these are the prime attributes that affect your results.
Without correctly identified variables, the model you build would be of little use, so make sure you spend enough time identifying them.
Step 2: Check the caveats/assumptions
It is very important to note that there are 5 assumptions behind multiple linear regression (a quick way to probe some of them in code is sketched after this list). These are as follows:
- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Lack of Multicollinearity
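Fully testing these assumptions is beyond this tutorial, but as an illustrative sketch on synthetic data, here is one common way to probe two of them: a residuals-versus-fitted plot for homoscedasticity and variance inflation factors (VIF) for multicollinearity.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Homoscedasticity: residuals should scatter evenly around zero
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color='gray')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Multicollinearity: a VIF well above ~10 is a common warning sign
for i in range(X.shape[1]):
    print(f'VIF for column {i}: {variance_inflation_factor(X, i):.2f}')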
Step 3: Creating dummy variables
When we want to model the relationship between the dependent variable and a categorical independent variable, dummy variables come into the picture.
We create dummy variables wherever there are categorical variables. For this, we create a column of 0s and 1s. For example, suppose our dataset has a column of state names containing just 2 states, namely New York and California. We can represent New York as 1 and California as 0. These 0s and 1s are our dummy variables.
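In pandas, one common way to create these dummy variables is pd.get_dummies; the tiny DataFrame below is a made-up example.

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'New York']})
dummies = pd.get_dummies(df['State'], dtype=int)
print(dummies)
#    California  New York
# 0           0         1
# 1           1         0
# 2           0         1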
Step 4: Avoiding the dummy variable trap
After you create the dummy variables, it is necessary to ensure that you do not fall into the scenario known as the dummy variable trap.
The phenomenon where one or more variables in a linear regression can be predicted from the others is referred to as multicollinearity. As a result, there may be scenarios where the model fails to differentiate the effects of the dummy variables D1 and D2. This situation is the dummy variable trap.
The solution is to omit one of the dummy variables. In the above example of New York and California, instead of having 2 columns, one for New York and one for California, we can denote the state as 0 or 1 in a single column, as the sketch below shows.
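Continuing the pandas example above, passing drop_first=True keeps a single column and avoids the trap.

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'New York']})
# drop_first=True drops the first category (California), leaving one 0/1 column
dummies = pd.get_dummies(df['State'], drop_first=True, dtype=int)
print(dummies)
#    New York
# 0         1
# 1         0
# 2         1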
Step 5: Finally, building the model
We have many independent variables available to determine the output variable, but one policy we need to keep in mind is garbage in, garbage out. This means that we must input only the necessary variables into the model, not all of them. Inputting all the variables may lead to error-prone models.
Also, keep in mind that when you build a model, you will need to present it to users, and too many variables are relatively difficult to explain.
There are 5 methods you can follow while building models. These are the stepwise regression techniques:
- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score comparison
Discussing each of these methods in detail is beyond the scope of this article. However, we will work through an example below.
Implementing Multiple-Linear Regression in Python
Let's consider a dataset that shows the profits made by 50 startups. We'll be working with the NumPy, pandas, and scikit-learn libraries.
The link to the dataset is – https://github.com/content-anu/dataset-multiple-regression
Importing the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('50_Startups.csv')
dataset.head()
Thus, in the sample of the dataset shown above, we notice that there are 3 numeric independent variables: R&D Spend, Administration, and Marketing Spend.
They contribute to the calculation of the dependent variable – Profit.
The role of the data scientist is to analyze which of these fields an investment should be made in to increase the company's profit.
Data-preprocessing
First, we build the matrix of features and the dependent variable vector.
Here, the matrix of features is the matrix of independent variables.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Encoding the categorical variables
We have a categorical variable in this dataset: 'State'. We will be using LabelEncoder and OneHotEncoder.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelEncoder_X = LabelEncoder()
X[:, 3] = labelEncoder_X.fit_transform(X[:, 3])

ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=float)  # np.float was removed in NumPy 1.24; use float
We performed label encoding first because, historically, one-hot encoding could only be applied after converting the categories to numbers. (Recent versions of scikit-learn's OneHotEncoder can encode string categories directly, so the LabelEncoder step is kept here mainly for illustration.)
Avoiding the dummy variable trap
In the code below, we keep all rows of X but drop its first column (index 0), removing one dummy variable. This avoids the dummy variable trap.
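X = X[:, 1:]  # keep all rows, drop the first dummy column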
Splitting the test and train set
Generally, we take 20% of the dataset as the test set and 80% as the training set. We train our model on the training set, then make predictions on the test set and check whether they match the actual outputs recorded in the dataset.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Fitting the model
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
The output of the above code snippet is just the representation of the fitted estimator, LinearRegression().
Predicting the test set results
We create a vector containing the predicted profits for every observation in the test set and store it in y_pred.
The predict method makes the predictions for the test set, so its input is X_test; its argument must be an array or sparse matrix.
y_pred = regressor.predict(X_test)
y_test
The model fit so far need not be the optimal model for the dataset, because we used all the independent variables when we built it.
But what if some of these independent variables are statistically significant (have a great impact on the prediction),
while other variables are not significant at all?
Hence we need an optimal set of independent variables, so that each independent variable is powerful, statistically significant, and definitely has an effect.
This effect can be positive (a 1-unit increase in the independent variable increases profit) or negative (a 1-unit increase in the independent variable decreases profit).
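As an illustration (using the regressor fitted above), you can read the direction of each estimated effect off the sign of the corresponding coefficient:

# Assumes the regressor fitted in the steps above
for i, coef in enumerate(regressor.coef_):
    direction = 'increases' if coef > 0 else 'decreases'
    print(f'Column {i}: a 1-unit increase {direction} predicted profit by {abs(coef):.2f}')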
We can perform backward elimination using the statsmodels library. A full treatment is beyond the scope of this article, but a minimal sketch follows.
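As a minimal sketch only (not the full backward-elimination procedure), fitting an ordinary least squares model with statsmodels exposes the p-values you would use to eliminate insignificant variables one at a time. This assumes the X and y arrays from the preprocessing steps above.

import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add a constant column
X_opt = sm.add_constant(X)
model = sm.OLS(y, X_opt).fit()
print(model.summary())  # the P>|t| column guides which variable to drop next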
Complete Code for Multiple Linear Regression in Python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('50_Startups.csv')
dataset.head()

# Data preprocessing
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding the categorical 'State' column
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelEncoder_X = LabelEncoder()
X[:, 3] = labelEncoder_X.fit_transform(X[:, 3])
ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=float)  # np.float was removed in NumPy 1.24

# Avoiding the dummy variable trap
X = X[:, 1:]

# Splitting into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)
y_test
y_pred
Running the script prints the predicted profits for the test set, which you can compare against the actual values in y_test.
Conclusion
To quickly conclude, the advantages of using linear regression are that it works on any size of dataset and gives information about the relevance of features. However, these models rely on certain assumptions, which can be seen as a disadvantage.
Source: https://www.askpython.com/python/examples/multiple-linear-regression