Let's continue the ML journey!
As you remember, we covered a lot of ground on linear regression and worked through real-life projects to reinforce what we learned.
Let's continue our journey and turn up the tempo a little with "Logistic Regression"!
What in the world is logistic regression?
By the way, I want you to keep an example dataset in mind and refer back to it whenever we discuss a challenging concept; it will make things easier to understand.
Up to this point, you have been predicting numerical quantities with your models. But what about categorical variables?
In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. More on Wikipedia.
Categorical data is divided into classes, and the task of predicting those classes is known as classification, which can be performed using logistic regression. In classification problems, the predicted variable is categorical. The simplest case is binary classification, where the predicted variable has only two classes, e.g., yes/no or male/female.
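For instance, a binary label is usually encoded as 1/0 before modeling. Here is a minimal sketch with pandas (the label values are made up for illustration):

import pandas as pd

# Made-up yes/no labels encoded as 1/0 for a classifier
labels = pd.Series(['yes', 'no', 'yes', 'yes'])
encoded = labels.map({'yes': 1, 'no': 0})
print(encoded.tolist())  # [1, 0, 1, 1]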
Logistic regression takes the linear combination of the different variables plus the intercept term (just like linear regression), but it then passes that result through a logistic function, also known as the sigmoid, defined as:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
The logistic function maps any real number into the range from 0 to 1, so only numbers between 0 and 1 are produced as output.
In logistic regression, the output is interpreted as the probability that the observation belongs to the second class. In binary classification, when the result is greater than 0.5, the observation is assigned to the second class, and when it is less than 0.5, to the first class.
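To make this concrete, here is a minimal NumPy sketch of the sigmoid and the 0.5 decision rule described above (the input values are made-up numbers):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up linear-combination values (intercept plus weighted features)
z = np.array([-2.0, 0.0, 1.5])
probs = sigmoid(z)
print(probs)  # approx. [0.12 0.5 0.82]

# Apply the 0.5 threshold: 1 means second class, 0 means first class
preds = (probs >= 0.5).astype(int)
print(preds)  # [0 1 1]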
Cost function
The following is the math-heavy part, and I recommend spending more time on it than on the other parts. YOU CAN DO IT!
The cost function used instead of the mean squared error is the cross-entropy function:

J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

where m is the number of observations and \hat{y}_i is the predicted probability for observation i.
Here, y_i denotes the label of observation i; it is 1 or 0 for binary classification. The expression inside the square brackets is the loss for one observation, and the loss is summed over all observations. The function is minimized using gradient descent, just as can be done for linear regression. However, we will not go further into the math of how gradient descent optimizes this function at this point.
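To make the formula concrete, here is a minimal sketch that computes the average cross-entropy loss for a few made-up labels and predicted probabilities:

import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    # Average binary cross-entropy; eps guards against log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1])        # made-up labels
y_prob = np.array([0.9, 0.2, 0.6])  # made-up predicted probabilities
print(cross_entropy(y_true, y_prob))  # approx. 0.28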
Logistic Regression in Python
We will not reinvent the wheel here; instead, we will use the LogisticRegression class available in sklearn.linear_model.
To evaluate the performance, we will use the accuracy_score function from sklearn.metrics, which tells us the fraction of predictions that are correct.
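As a quick sanity check on made-up values, accuracy_score simply compares the two sequences element by element:

from sklearn.metrics import accuracy_score

# 3 of the 4 made-up predictions match the true labels
print(accuracy_score(y_true=[1, 0, 1, 1], y_pred=[1, 0, 0, 1]))  # 0.75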
We will be predicting whether a credit card client defaults or not using the Credit Card Clients Default Dataset. The binary prediction variable is default.payment.next.month.
1  import pandas as pd
2  from sklearn.linear_model import LogisticRegression
3  from sklearn.metrics import accuracy_score
4
5  df = pd.read_csv('credit_card_cleaned.csv')
6
7  # Make data
8  X = df.drop(columns=['default.payment.next.month', 'MARRIAGE', 'GENDER'])
9  Y = df[['default.payment.next.month']]
10
11 # Fit model
12 lr = LogisticRegression()
13 lr.fit(X, Y)
14
15 # Print parameters
16 print(lr.coef_)
17 print(lr.intercept_)
18
19 # Get predictions and accuracy
20 preds = lr.predict(X)
21 acc = accuracy_score(y_true=Y, y_pred=preds)
22
23 print('accuracy = ', acc)
Output:

[[-2.57824834e-05 -3.89563030e-06 -3.62189243e-05 -4.66762409e-04
   8.12276762e-05  6.51340113e-05  5.46246420e-05  5.07101652e-05
   4.57317929e-05  4.16325749e-05 -9.45615078e-06  5.15966173e-06
   2.05599697e-06  2.85369917e-06  1.79368256e-06  2.03851811e-06
  -3.21462059e-05 -2.15839860e-05 -8.56968053e-06 -8.44288217e-06
  -6.25973049e-06 -1.84750686e-06]]
[-1.73101261e-05]
accuracy =  0.778792322047454
We load the data in line 5. Then we separate the features from the target variable in lines 8 and 9. We initialize the class in line 12 and use the fit function to fit the model and obtain the best parameters in line 13. Then we print the model coefficients and the intercept in lines 16 and 17. Afterward, we obtain predictions with the predict function and calculate the accuracy in line 21; accuracy_score expects the actual values and the predicted values. Finally, we print the accuracy in the last line.
From the output, we can see that our model gives the correct prediction about 78% of the time (on the training data).
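If you are curious about the probabilities behind these predictions, scikit-learn exposes them through predict_proba. A small sketch, appended to the script above and assuming the fitted lr model (the exact values will depend on your data):

# Probabilities for the first 3 clients:
# column 0 is P(no default), column 1 is P(default)
probs = lr.predict_proba(X[:3])
print(probs)

# predict() effectively applies the 0.5 threshold to the second column
print((probs[:, 1] >= 0.5).astype(int))
print(lr.predict(X[:3]))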
After working through all of this successfully, you will also be given an explanation of how to evaluate your logistic regression models. Until then, spend your time on this material and try to solidify your understanding by coming up with your own examples.
Let's do it!
Credits to:
Educative, Kaggle, and Wikipedia.
The contents, concepts, and especially the code are taken from Educative.
Explanations of some difficult concepts are from Wikipedia.
The dataset used in the example is taken from Kaggle.