May 26, 2022
Taking a leap in ML
Introduction
We all know about dependent and independent variables, right? Well, in linear regression we focus only on linear relationships between them. But what about nonlinear relationships?
Let me introduce you to decision trees!
Into the decision trees
Decision trees are made to capture nonlinear relationships, and they model data as a tree of hierarchical branches. Their structure is flowchart-like, in which:
- each internal node represents a test on an attribute (e.g. whether a coin flip comes up heads or tails)
- each branch represents the outcome of the test
- each leaf node represents a class label (decision taken after computing all attributes).
The paths from the root to a leaf represent classification rules. BTW, decision trees can adapt to both regression and classification tasks.
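In scikit-learn (which we'll use later in this post), that flexibility shows up as two sibling classes; a minimal sketch:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier()  # classification: leaves hold class labels
reg = DecisionTreeRegressor()   # regression: leaves hold continuous values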
Common terms
Root node
- It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
Splitting
- The process of dividing a node into two or more sub-nodes.
Decision node
- When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal node
- Nodes that do not split any further are called leaf or terminal nodes.
Pruning
- When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting (see the sketch after this list).
Branch/Sub-tree
- A subsection of the entire tree is called a branch or sub-tree.
Parent and Child node
- A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
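As a side note on pruning: scikit-learn (used below) exposes it through cost-complexity pruning. A minimal sketch on a toy dataset, where a larger ccp_alpha prunes away more sub-nodes:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)               # unpruned
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(full.get_depth(), pruned.get_depth())  # the pruned tree is typically shallower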
Any examples?
In this example, we want to classify a person as fit or unfit based on a few yes/no attributes. A decision tree for this poses a yes/no question at every node; we keep moving down the tree until we reach a leaf node, where the observation is assigned to a class.
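The original diagram isn't reproduced here, but such a tree boils down to nested yes/no tests. Here's a hypothetical sketch (the attribute names are made up for illustration, not taken from the original diagram):

# A hypothetical fit/unfit tree written out as plain Python: each `if`
# is an internal node, each branch a test outcome, each `return` a leaf.
def classify(person):
    if person["age"] < 30:                        # root node: test on age
        return "fit" if person["exercises"] else "unfit"
    return "unfit" if person["eats_junk_food"] else "fit"

print(classify({"age": 25, "exercises": True}))   # -> fit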
Let's Python
First of all, download the dataset: the Audit Risk Dataset of different firms. We will be performing a binary classification task: predicting whether a company is fraudulent or not.
Let's look at the dataset's attributes:
Sector_score, LOCATION_ID, PARA_A, SCORE_A, PARA_B, SCORE_B, TOTAL, numbers, Marks, Money_Value, MONEY_Marks, District, Loss, LOSS_SCORE, History, History_score, Score, and Risk (the binary target variable).
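Before modeling, it's worth a quick look at the data yourself. A small sketch, assuming the CSV is saved as audit_data.csv (the same file name used in the code below):

import pandas as pd

df = pd.read_csv('audit_data.csv')
print(df.shape)                   # number of rows and columns
print(df['Risk'].value_counts())  # class balance of the binary target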
Fortunately, the sklearn package has a lot of machine learning models implemented as classes, and we will be importing the DecisionTreeClassifier class from sklearn.tree. We will also be importing a function named train_test_split from sklearn.model_selection that divides our dataset into train and test sets.
1  import pandas as pd
2  from sklearn.tree import DecisionTreeClassifier
3  from sklearn.model_selection import train_test_split
4  from sklearn.metrics import accuracy_score, classification_report
5
6  df = pd.read_csv('audit_data.csv')
7
8  # MAKE DATA
9  X = df.drop(columns=['Risk', 'LOCATION_ID'])
10 Y = df[['Risk']]
11
12 X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
13
14 # MAKE MODEL
15 d_tree = DecisionTreeClassifier()
16 d_tree.fit(X_train, Y_train)
17
18 # CALCULATE AND PRINT RESULTS
19 preds = d_tree.predict(X_test)
20 acc = accuracy_score(y_true=Y_test, y_pred=preds)
21 print(acc)
22 print(classification_report(y_true=Y_test, y_pred=preds))
After we read the data in line 6, we separate our target variable as Y. We drop the LOCATION_ID column since it would not provide any useful information to the model.
To split the data into training and test sets, we use the train_test_split function. We provide our inputs, X, and the labels, Y, to the function in line 12. We also provide the test set size as test_size: 0.2 implies that 20% of the data goes into the testing set, while the remaining 80% forms the training set. The function outputs 4 items that we can retrieve directly into 4 variables (see the quick size check after this list):
- Inputs for the training data, which we store in X_train
- Inputs for the testing data, which we store in X_test
- Labels for the training data, which we store in Y_train
- Labels for the testing data, which we store in Y_test
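As a quick sanity check on the split (a small sketch reusing the variables above):

print(len(X_train), len(X_test))  # roughly 80% vs 20% of the rows
print(len(X_train) / len(df))     # ~0.8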
In line 15, we make our decision tree model by calling DecisionTreeClassifier without any arguments. Then, in the next line, we call the model's fit function, providing it the training examples and labels.
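If you're curious what the fitted tree actually learned, scikit-learn can print its rules as text; a small sketch using the d_tree fitted above:

from sklearn.tree import export_text

# Print the learned yes/no tests, one line per node.
print(export_text(d_tree, feature_names=list(X_train.columns)))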
Now we need to evaluate our model, so in line 19 we use the model's predict function and store the predictions in preds. We pass the testing inputs, X_test, to predict as an argument.
We use the accuracy_score function to measure the accuracy of the predictions. In the last two lines, we print the accuracy along with the classification report obtained from the classification_report function.
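Beyond plain accuracy, a confusion matrix shows where any errors fall; a minimal sketch reusing Y_test and preds from above:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true=Y_test, y_pred=preds))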
From the outputs, we can see that the model performs very well on the testing data, classifying 100% of the test examples correctly.
We might not get 100% accuracy on bigger and more complex datasets, though.
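In fact, 100% test accuracy is often a cue to double-check for overfitting (or for features that directly encode the target). A minimal sketch of one guard: cap the tree's depth and cross-validate. This reuses the X and Y defined above; max_depth=3 and cv=5 are arbitrary choices, not tuned values:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Cap the depth so the tree cannot simply memorize the training data,
# then estimate performance with 5-fold cross-validation.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=3)
print(cross_val_score(small_tree, X, Y.values.ravel(), cv=5).mean())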
That's the end of this post dedicated to decision trees.
Thanks for your time and effort.
Keep learning and exploring!
Credits to:
The content of this article is inspired by and taken from Educative.
The dataset used in the example above is taken from Kaggle.