May 26, 2022
Taking a leap in ML
Introduction
We all know about dependent and independent variables, right? Well, in linear regression we focus only on linear relationships between them. But what about nonlinear relationships?
Let me introduce you to decision trees!
Into the decision trees
Decision trees are made to capture nonlinear relationships, and they model data as a tree of hierarchical branches. Their structure is flowchart-like, in which:
- each internal node represents a test on an attribute (e.g. whether a coin flip comes up heads or tails)
- each branch represents the outcome of the test
- each leaf node represents a class label (decision taken after computing all attributes).
The paths from the root to a leaf represent classification rules. BTW, decision trees can adapt to both regression and classification tasks.
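In scikit-learn (which we'll use later in this post), that flexibility shows up as two sibling classes; a minimal sketch:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier()  # classification: leaves hold class labels
reg = DecisionTreeRegressor()   # regression: leaves hold continuous values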
Common terms
Root node
- It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
Splitting
- The process of dividing a node into two or more sub-nodes.
Decision node
- When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal node
- Nodes that do not split any further are called leaf or terminal nodes.
Pruning
- When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting (see the sketch after this list).
Branch/Sub-tree
- A subsection of the entire tree is called a branch or sub-tree.
Parent and Child node
- A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
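As a side note on pruning: scikit-learn (used below) exposes it through cost-complexity pruning. A minimal sketch on a toy dataset, where a larger ccp_alpha prunes away more sub-nodes:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)               # unpruned
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(full.get_depth(), pruned.get_depth())  # the pruned tree is typically shallower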
Any examples?
In this example, we want to classify a person as fit or unfit based on a few yes/no attributes. A decision tree for this poses a yes/no question at every node; we keep moving down the tree until we reach a leaf node, where the observation is assigned to a class.
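The original diagram isn't reproduced here, but such a tree boils down to nested yes/no tests. Here's a hypothetical sketch (the attribute names are made up for illustration, not taken from the original diagram):

# A hypothetical fit/unfit tree written out as plain Python: each `if`
# is an internal node, each branch a test outcome, each `return` a leaf.
def classify(person):
    if person["age"] < 30:                        # root node: test on age
        return "fit" if person["exercises"] else "unfit"
    return "unfit" if person["eats_junk_food"] else "fit"

print(classify({"age": 25, "exercises": True}))   # -> fit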
Let's Python
First of all, download the dataset: the Audit Risk Dataset of different firms. We will be performing a binary classification task: predicting whether a company is fraudulent or not.
Let's look at the dataset's attributes:
Sector_score, LOCATION_ID, PARA_A, SCORE_A, PARA_B, SCORE_B, TOTAL, numbers, Marks, Money_Value, MONEY_Marks, District, Loss, LOSS_SCORE, History, History_score, Score, and Risk (the binary target variable).
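Before modeling, it's worth a quick look at the data yourself. A small sketch, assuming the CSV is saved as audit_data.csv (the same file name used in the code below):

import pandas as pd

df = pd.read_csv('audit_data.csv')
print(df.shape)                   # number of rows and columns
print(df['Risk'].value_counts())  # class balance of the binary target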
Fortunately, the sklearn package has a lot of machine learning models implemented as classes, and we will be importing the DecisionTreeClassifier class from sklearn.tree. We will also be importing a function named train_test_split from sklearn.model_selection that divides our dataset into train and test sets.
1  import pandas as pd
2  from sklearn.tree import DecisionTreeClassifier
3  from sklearn.model_selection import train_test_split
4  from sklearn.metrics import accuracy_score, classification_report
5
6  df = pd.read_csv('audit_data.csv')
7
8  # MAKE DATA
9  X = df.drop(columns=['Risk', 'LOCATION_ID'])
10 Y = df[['Risk']]
11
12 X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
13
14 # MAKE MODEL
15 d_tree = DecisionTreeClassifier()
16 d_tree.fit(X_train, Y_train)
17
18 # CALCULATE AND PRINT RESULTS
19 preds = d_tree.predict(X_test)
20 acc = accuracy_score(y_true=Y_test, y_pred=preds)
21 print(acc)
22 print(classification_report(y_true=Y_test, y_pred=preds))
After we read the data in line 6, we separate our target variable as Y. We drop the LOCATION_ID column since it would not provide any useful information to the model.
To split the data into training and test sets, we use the train_test_split function. We provide our inputs, X, and the labels, Y, to the function in line 12. We also provide the test set size as test_size: 0.2 implies that 20% of the data goes into the testing set, while the remaining 80% forms the training set. The function outputs 4 items that we can retrieve directly into 4 variables (see the quick size check after this list):
- Inputs for the training data, which we store in X_train
- Inputs for the testing data, which we store in X_test
- Labels for the training data, which we store in Y_train
- Labels for the testing data, which we store in Y_test
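As a quick sanity check on the split (a small sketch reusing the variables above):

print(len(X_train), len(X_test))  # roughly 80% vs 20% of the rows
print(len(X_train) / len(df))     # ~0.8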
In line 15, we make our decision tree model by calling DecisionTreeClassifier without any arguments. Then, in the next line, we call the model's fit function, providing it the training examples and labels.
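If you're curious what the fitted tree actually learned, scikit-learn can print its rules as text; a small sketch using the d_tree fitted above:

from sklearn.tree import export_text

# Print the learned yes/no tests, one line per node.
print(export_text(d_tree, feature_names=list(X_train.columns)))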
Now we need to evaluate our model, so in line 19 we use the model's predict function and store the predictions in preds. We pass the testing inputs, X_test, to predict as an argument.
We use the accuracy_score function to measure the accuracy of the predictions. In the last two lines, we print the accuracy along with the classification report obtained from the classification_report function.
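Beyond plain accuracy, a confusion matrix shows where any errors fall; a minimal sketch reusing Y_test and preds from above:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true=Y_test, y_pred=preds))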
From the outputs, we can see that the model performs very well on the testing data, classifying 100% of the test examples correctly.
We might not get 100% accuracy on bigger and more complex datasets, though.
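In fact, 100% test accuracy is often a cue to double-check for overfitting (or for features that directly encode the target). A minimal sketch of one guard: cap the tree's depth and cross-validate. This reuses the X and Y defined above; max_depth=3 and cv=5 are arbitrary choices, not tuned values:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Cap the depth so the tree cannot simply memorize the training data,
# then estimate performance with 5-fold cross-validation.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=3)
print(cross_val_score(small_tree, X, Y.values.ravel(), cv=5).mean())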
That's the end of this post dedicated to decision trees.
Thanks for your time and effort.
Keep learning and exploring!
Credits to:
The content of this article is inspired by and taken from Educative.
The dataset used in the example above is taken from Kaggle.