<?xml version="1.0" encoding="utf-8" ?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:tt="http://teletype.in/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"><title>@asadbek_universe</title><author><name>@asadbek_universe</name></author><id>https://teletype.in/atom/asadbek_universe</id><link rel="self" type="application/atom+xml" href="https://teletype.in/atom/asadbek_universe?offset=0"></link><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><link rel="next" type="application/rss+xml" href="https://teletype.in/atom/asadbek_universe?offset=10"></link><link rel="search" type="application/opensearchdescription+xml" title="Teletype" href="https://teletype.in/opensearch.xml"></link><updated>2026-04-20T06:36:37.815Z</updated><entry><id>asadbek_universe:AucRoc</id><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe/AucRoc?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><title>Learning curves (AUC ROC curve)</title><published>2023-01-04T08:57:57.495Z</published><updated>2023-01-04T08:58:17.522Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img1.teletype.in/files/0e/44/0e446079-4640-4a9b-b1b5-964ce7ff2304.png"></media:thumbnail><summary type="html">&lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/auc-roc-curve-in-machine-learning.png&quot;&gt;Learning curves are plots used to show a model's performance as the training set size increases. Another way it can be used is to show the model's performance over a defined period of time. We typically used them to diagnose algorithms that learn incrementally from data. It works by evaluating a model on the training and validation datasets, then plotting the measured performance.</summary><content type="html">
  &lt;p id=&quot;Qews&quot;&gt;Learning curves are plots that show a model&amp;#x27;s performance as the training set size increases; they can also show the model&amp;#x27;s performance over a defined period of time. We typically use them to diagnose algorithms that learn incrementally from data. They work by evaluating a model on the training and validation datasets and plotting the measured performance.&lt;/p&gt;
  &lt;p id=&quot;ZPsa&quot;&gt;In Machine Learning, developing a model is not sufficient on its own: we also need to check whether it performs well. After building an ML model, we evaluate and validate it, and for this we use different evaluation metrics. &lt;em&gt;The AUC-ROC curve is one such evaluation metric, used to visualize the performance of a classification model&lt;/em&gt;.&lt;/p&gt;
  &lt;h3 id=&quot;mikr&quot;&gt;ROC Curve&lt;/h3&gt;
  &lt;p id=&quot;c9kw&quot;&gt;&lt;strong&gt;&lt;em&gt;ROC or Receiver Operating Characteristic curve represents a probability graph to show the performance of a classification model at different threshold levels&lt;/em&gt;&lt;/strong&gt;. The curve is plotted between two parameters, which are:&lt;/p&gt;
  &lt;ul id=&quot;oWeO&quot;&gt;
    &lt;li id=&quot;Ywne&quot;&gt;&lt;strong&gt;True Positive Rate or TPR&lt;/strong&gt;&lt;/li&gt;
    &lt;li id=&quot;9NbD&quot;&gt;&lt;strong&gt;False Positive Rate or FPR&lt;/strong&gt;&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;S4aj&quot;&gt;In the curve, TPR is plotted on the Y-axis, whereas FPR is on the X-axis.&lt;/p&gt;
  &lt;h3 id=&quot;dicV&quot;&gt;TPR:&lt;/h3&gt;
  &lt;p id=&quot;7mA5&quot;&gt;TPR, or True Positive Rate, is a synonym for Recall and can be calculated as:&lt;/p&gt;
  &lt;figure id=&quot;kKYj&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/auc-roc-curve-in-machine-learning.png&quot; width=&quot;503&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;ABll&quot;&gt;FPR or False Positive Rate can be calculated as:&lt;/p&gt;
  &lt;figure id=&quot;Vj4d&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/auc-roc-curve-in-machine-learning2.png&quot; width=&quot;401&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;f0hX&quot;&gt;Here, &lt;strong&gt;TP&lt;/strong&gt;: True Positive&lt;/p&gt;
  &lt;p id=&quot;kmXr&quot;&gt;&lt;strong&gt;FP&lt;/strong&gt;: False Positive&lt;/p&gt;
  &lt;p id=&quot;47FT&quot;&gt;&lt;strong&gt;TN&lt;/strong&gt;: True Negative&lt;/p&gt;
  &lt;p id=&quot;gjy7&quot;&gt;&lt;strong&gt;FN&lt;/strong&gt;: False Negative&lt;/p&gt;
  &lt;p id=&quot;fetc&quot;&gt;To summarize the model&amp;#x27;s performance across all threshold levels in a single number, we use AUC.&lt;/p&gt;
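The two rates above follow directly from the four confusion-matrix counts. A minimal Python sketch (not from the original article; the counts are hypothetical):

```python
def tpr(tp, fn):
    # True Positive Rate (Recall): TP / (TP + FN)
    return tp / (tp + fn)

def fpr(fp, tn):
    # False Positive Rate: FP / (FP + TN)
    return fp / (fp + tn)

# Hypothetical counts at a single threshold
tp, fp, tn, fn = 80, 10, 90, 20
print(tpr(tp, fn))  # 0.8
print(fpr(fp, tn))  # 0.1
```

Sweeping the threshold and recomputing these two rates traces out the ROC curve.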
  &lt;h3 id=&quot;gjob&quot;&gt;AUC: Area Under the ROC curve&lt;/h3&gt;
  &lt;p id=&quot;4gRS&quot;&gt;AUC stands for &lt;strong&gt;Area Under the ROC curve&lt;/strong&gt;. As its name suggests, AUC measures the two-dimensional area under the entire ROC curve, ranging from (0,0) to (1,1), as shown in the image below:&lt;/p&gt;
  &lt;figure id=&quot;KQLV&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/auc-roc-curve-in-machine-learning3.png&quot; width=&quot;400&quot; /&gt;
    &lt;figcaption&gt;https://www.javatpoint.com/auc-roc-curve-in-machine-learning&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;uG7k&quot;&gt;AUC aggregates the performance of the binary classifier across all thresholds into a single measure. The value of AUC ranges from 0 to 1: an excellent model will have an AUC near 1, indicating a good degree of separability.&lt;/p&gt;
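Equivalently, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal pure-Python sketch of this pairwise formulation (the scores and labels here are hypothetical):

```python
def auc(scores, labels):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as 1/2 (the pairwise Mann-Whitney formulation).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1,   1,   0,    1,   0]
print(auc(scores, labels))  # 5/6: five of the six positive-negative pairs are ranked correctly
```

This shows why AUC depends only on the ranking of the scores, not their absolute values.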
  &lt;h3 id=&quot;Vs8R&quot;&gt;When to Use AUC-ROC&lt;/h3&gt;
  &lt;p id=&quot;C9S2&quot;&gt;&lt;strong&gt;AUC is preferred due to the following cases:&lt;/strong&gt;&lt;/p&gt;
  &lt;ul id=&quot;L8Rq&quot;&gt;
    &lt;li id=&quot;xAby&quot;&gt;AUC is used to measure how well the predictions are ranked instead of giving their absolute values. Hence, we can say AUC is &lt;strong&gt;Scale-Invariant.&lt;/strong&gt;&lt;/li&gt;
    &lt;li id=&quot;PX9k&quot;&gt;It measures the quality of predictions of the model without considering the selected classification threshold. It means AUC is &lt;strong&gt;classification-threshold-invariant.&lt;/strong&gt;&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3 id=&quot;se0E&quot;&gt;When not to use AUC-ROC&lt;/h3&gt;
  &lt;ul id=&quot;6ONS&quot;&gt;
    &lt;li id=&quot;G2KS&quot;&gt;AUC is not preferable when we need well-calibrated probability outputs, since it ignores the absolute values of the predictions.&lt;/li&gt;
    &lt;li id=&quot;bOrL&quot;&gt;Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives vs. false positives and we need to minimize one specific type of classification error.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;figure id=&quot;1Nla&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/4jRBRDbJemM?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
    &lt;figcaption&gt;https://www.youtube.com/@statquest&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;h2 id=&quot;Zfs8&quot;&gt;References&lt;/h2&gt;
  &lt;p id=&quot;9wkp&quot;&gt;The main article - &lt;a href=&quot;https://www.javatpoint.com/auc-roc-curve-in-machine-learning&quot; target=&quot;_blank&quot;&gt;https://www.javatpoint.com/auc-roc-curve-in-machine-learning&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;42q0&quot;&gt;The video explanation - &lt;a href=&quot;https://www.youtube.com/watch?v=4jRBRDbJemM&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;index=12&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=4jRBRDbJemM&amp;amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;amp;index=12&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>asadbek_universe:RegularizationTechniques</id><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe/RegularizationTechniques?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><title>Regularization Techniques</title><published>2023-01-04T08:30:36.655Z</published><updated>2023-01-04T08:30:36.655Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img2.teletype.in/files/57/0e/570eaccb-66cb-46d3-b5d9-9f6585fdc7dd.png"></media:thumbnail><summary type="html">&lt;img src=&quot;https://miro.medium.com/max/568/1*DY3-IaGcHjjLg7oYXx1O3A.png&quot;&gt;From the articles of geeksforgeeks.org and javatpoint.com we have a look at overall information about regularization and it's types. After that we have a visual explanations in videos of the YouTube channel StatQuest with Josh Starmer. At the end we explore the article published in Towards Data Science by Prashant Gupta.</summary><content type="html">
  &lt;h2 id=&quot;Xi57&quot;&gt;Introduction&lt;/h2&gt;
  &lt;p id=&quot;Via5&quot;&gt;From the articles of &lt;a href=&quot;https://www.geeksforgeeks.org/&quot; target=&quot;_blank&quot;&gt;geeksforgeeks.org&lt;/a&gt; and &lt;a href=&quot;https://www.javatpoint.com/&quot; target=&quot;_blank&quot;&gt;javatpoint.com&lt;/a&gt; we take a look at overall information about regularization and its types. After that we turn to visual explanations in videos from the YouTube channel &lt;a href=&quot;https://www.youtube.com/@statquest&quot; target=&quot;_blank&quot;&gt;StatQuest with Josh Starmer&lt;/a&gt;. At the end we explore the article published in &lt;a href=&quot;https://towardsdatascience.com/&quot; target=&quot;_blank&quot;&gt;Towards Data Science&lt;/a&gt; by Prashant Gupta.&lt;/p&gt;
  &lt;h1 id=&quot;G8rx&quot;&gt;Regularization in Machine Learning&lt;/h1&gt;
  &lt;p id=&quot;F5y5&quot;&gt;&lt;em&gt;Regularization&lt;/em&gt; is a technique used to reduce errors by fitting the function appropriately on the given training set and avoiding overfitting.&lt;/p&gt;
  &lt;p id=&quot;7wyM&quot;&gt;Sometimes a machine learning model performs well on the training data but does not perform well on the test data: it has learned the noise in the training set and so cannot predict the output for unseen data, and the model is said to be overfitted. This problem can be dealt with using a regularization technique.&lt;/p&gt;
  &lt;p id=&quot;GB6W&quot;&gt;This technique allows the model to keep all variables or features while reducing their magnitudes. Hence, it maintains accuracy as well as the generalization of the model.&lt;/p&gt;
  &lt;p id=&quot;0mWX&quot;&gt;It mainly regularizes or reduces the coefficients of features toward zero. In simple words, &amp;quot;&lt;em&gt;in the regularization technique, we reduce the magnitude of the features while keeping the same number of features.&amp;quot;&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;bhTy&quot;&gt;The commonly used regularization techniques are:&lt;/p&gt;
  &lt;ul id=&quot;oUbH&quot;&gt;
    &lt;li id=&quot;3sal&quot;&gt;L1 regularization&lt;/li&gt;
    &lt;li id=&quot;PUMk&quot;&gt;L2 regularization&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;6fdJ&quot;&gt;A regression model that uses &lt;strong&gt;L2 regularization&lt;/strong&gt; technique is called &lt;strong&gt;Ridge regression&lt;/strong&gt;.&lt;/p&gt;
  &lt;figure id=&quot;hZyt&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/Q81RR3yKn30?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
    &lt;figcaption&gt;https://www.youtube.com/@statquest&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;cCqh&quot;&gt;A regression model which uses the &lt;strong&gt;L1 Regularization &lt;/strong&gt;technique is called &lt;strong&gt;LASSO (Least Absolute Shrinkage and Selection Operator)&lt;/strong&gt; regression.&lt;/p&gt;
  &lt;figure id=&quot;FdZ7&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/NGf0voTMlcs?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
    &lt;figcaption&gt;https://www.youtube.com/@statquest&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;blockquote id=&quot;jjg2&quot;&gt;You can find more visual explanations of regularization in the playlist &lt;a href=&quot;https://youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&quot; target=&quot;_blank&quot;&gt;https://youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&lt;/a&gt;&lt;/blockquote&gt;
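To make the shrinkage effect of the L2 penalty concrete, here is a minimal sketch (not from the articles above) of the closed-form ridge solution β = (XᵀX + λI)⁻¹Xᵀy on synthetic data; the design matrix, true coefficients, noise level, and λ values are all illustrative assumptions, and the intercept is omitted for simplicity:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution: beta = (X^T X + lam * I)^(-1) X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

for lam in [0.0, 10.0, 1000.0]:
    print(lam, np.round(ridge(X, y, lam), 3))
# As lam grows, the coefficient estimates shrink toward zero;
# at lam = 0 the solution equals ordinary least squares.
```

At λ = 0 the estimates match least squares, and as λ increases they move toward zero, which is exactly the behavior described in the ridge regression video.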
  &lt;h1 id=&quot;8541&quot;&gt;Regularization in Machine Learning II&lt;/h1&gt;
  &lt;p id=&quot;4e23&quot;&gt;One of the major aspects of training your machine learning model is avoiding overfitting. &lt;em&gt;The model will have a low accuracy if it is overfitting.&lt;/em&gt; This happens because your model is trying too hard to capture the noise in your training dataset. &lt;strong&gt;&lt;em&gt;By noise we mean the data points that don’t really represent the true properties of your data, but random chance&lt;/em&gt;&lt;/strong&gt;. Learning such data points makes your model more flexible, at the risk of overfitting.&lt;/p&gt;
  &lt;p id=&quot;42f4&quot;&gt;&lt;strong&gt;&lt;em&gt;The concept of balancing bias and variance is helpful in understanding the phenomenon of overfitting.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;h3 id=&quot;KoOf&quot;&gt;Balancing Bias and Variance to Control Errors in Machine Learning&lt;/h3&gt;
  &lt;blockquote id=&quot;Kc5S&quot;&gt;One of the ways of avoiding overfitting is using cross-validation, which helps in estimating the error over the test set and in deciding what parameters work best for your model.&lt;/blockquote&gt;
  &lt;p id=&quot;6afe&quot;&gt;This article will focus on a technique that helps in avoiding overfitting and also increasing model interpretability.&lt;/p&gt;
  &lt;h2 id=&quot;581e&quot;&gt;Regularization&lt;/h2&gt;
  &lt;p id=&quot;7042&quot;&gt;This is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards zero. In other words, &lt;strong&gt;&lt;em&gt;this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;cc32&quot;&gt;A simple relation for linear regression looks like this. Here Y represents the learned relation and &lt;em&gt;β represents the coefficient estimates for the different variables or predictors (X).&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;ad05&quot;&gt;&lt;strong&gt;&lt;em&gt;Y ≈ β0 + β1X1 + β2X2 + …+ βpXp&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;2776&quot;&gt;The fitting procedure involves a loss function, known as the residual sum of squares or RSS. The coefficients are chosen such that they minimize this loss function.&lt;/p&gt;
  &lt;figure id=&quot;FzS9&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/568/1*DY3-IaGcHjjLg7oYXx1O3A.png&quot; width=&quot;454&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;22b5&quot;&gt;Now, this will adjust the coefficients based on your training data. &lt;em&gt;If there is noise in the training data, then the estimated coefficients won’t generalize well to the future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.&lt;/em&gt;&lt;/p&gt;
  &lt;h2 id=&quot;d7cb&quot;&gt;Ridge Regression&lt;/h2&gt;
  &lt;figure id=&quot;IWuO&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/691/1*CiqZ8lhwxi5c4d1nV24w4g.png&quot; width=&quot;553&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;be62&quot;&gt;The image above shows ridge regression, where the &lt;em&gt;RSS is modified by adding the shrinkage quantity.&lt;/em&gt; Now, the coefficients are estimated by minimizing this function. Here, &lt;em&gt;λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.&lt;/em&gt; The increase in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the above function, then these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except the intercept β0. This intercept is a measure of the mean value of the response when xi1 = xi2 = …= xip = 0.&lt;/p&gt;
  &lt;p id=&quot;e35c&quot;&gt;&lt;em&gt;When λ = 0, the penalty term has no effect&lt;/em&gt;, and the estimates produced by ridge regression will be equal to least squares. However, &lt;em&gt;as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero&lt;/em&gt;. As can be seen, selecting a good value of λ is critical; cross-validation comes in handy for this purpose. The shrinkage penalty used by this method is &lt;em&gt;based on the L2 norm&lt;/em&gt; of the coefficients.&lt;/p&gt;
  &lt;p id=&quot;d992&quot;&gt;&lt;em&gt;The coefficients that are produced by the standard least squares method are scale equivariant&lt;/em&gt;, i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c. Therefore, regardless of how the predictor is scaled, the product of predictor and coefficient (Xjβj) remains the same. &lt;em&gt;However, this is not the case with ridge regression, and therefore we need to standardize the predictors, or bring them to the same scale, before performing ridge regression&lt;/em&gt;. The formula used to do this is given below.&lt;/p&gt;
  &lt;figure id=&quot;d6mS&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/351/1*6KRAdbf-CApFPR7gASZaSA.png&quot; width=&quot;281&quot; /&gt;
  &lt;/figure&gt;
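A minimal sketch of this scaling step (the example matrix is made up): following the formula above, each predictor is divided by its standard deviation, which scales the predictors without centering them:

```python
import numpy as np

def standardize(X):
    # Divide each predictor (column) by its population standard deviation,
    # so all predictors are on the same scale before ridge regression.
    return X / X.std(axis=0)

# Hypothetical data: the two predictors are on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Xs = standardize(X)
print(Xs.std(axis=0))  # each column now has standard deviation 1
```

After this step the penalty treats all coefficients comparably, regardless of the original units of the predictors.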
  &lt;h2 id=&quot;ada5&quot;&gt;Lasso&lt;/h2&gt;
  &lt;figure id=&quot;WYPv&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/684/1*tHJ4sSPYV0bDr8xxEdiwXA.png&quot; width=&quot;547&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;c8da&quot;&gt;Lasso is another variation, in which the above function is minimized. It&amp;#x27;s clear that &lt;em&gt;this variation differs from ridge regression only in how it penalizes the high coefficients&lt;/em&gt;: it uses |βj| (the modulus) instead of squares of β as its penalty. In statistics, this is&lt;strong&gt;&lt;em&gt; known as the L1 norm&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
  &lt;p id=&quot;6bd5&quot;&gt;Let&amp;#x27;s take a look at the above methods from a different perspective. &lt;em&gt;Ridge regression can be thought of as solving an equation where the summation of squares of the coefficients is less than or equal to s&lt;/em&gt;, and &lt;em&gt;the lasso can be thought of as an equation where the summation of the moduli of the coefficients is less than or equal to s&lt;/em&gt;. Here, s is a constant that exists for each value of the shrinkage factor &lt;em&gt;λ. These equations are also referred to as constraint functions.&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;8821&quot;&gt;&lt;em&gt;Consider that there are 2 parameters in a given problem&lt;/em&gt;. Then, according to the above formulation, &lt;strong&gt;&lt;em&gt;ridge regression is expressed by β1² + β2² ≤ s&lt;/em&gt;&lt;/strong&gt;. This implies that &lt;em&gt;ridge regression coefficients have the smallest RSS (loss function) among all points that lie within the circle given by β1² + β2² ≤ s.&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;7cd3&quot;&gt;Similarly, &lt;strong&gt;&lt;em&gt;for the lasso, the equation becomes |β1|+|β2| ≤ s&lt;/em&gt;&lt;/strong&gt;. This implies that &lt;em&gt;lasso coefficients have the smallest RSS (loss function) among all points that lie within the diamond given by |β1|+|β2| ≤ s.&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;ba7c&quot;&gt;The image below describes these equations.&lt;/p&gt;
  &lt;figure id=&quot;QWRI&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*XC-8tHoMxrO3ogHKylRfRA.png&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Credit : An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;eed3&quot;&gt;&lt;em&gt;The above image shows the constraint functions (green areas) for the lasso (left) and ridge regression (right), along with the contours for RSS (red ellipses)&lt;/em&gt;. Points on an ellipse share the same value of RSS. For a very large value of s, the green regions will contain the center of the ellipse, making the coefficient estimates of both regression techniques equal to the least squares estimates. But this is not the case in the above image. In this case, the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. &lt;em&gt;Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero.&lt;/em&gt; &lt;em&gt;However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero&lt;strong&gt;.&lt;/strong&gt;&lt;/em&gt; In higher dimensions (where there are many more than 2 parameters), many of the coefficient estimates may equal zero simultaneously.&lt;/p&gt;
  &lt;p id=&quot;3707&quot;&gt;This sheds light on the obvious disadvantage of ridge regression: model interpretability. It will shrink the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. In other words, the final model will include all predictors. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. &lt;strong&gt;Therefore, the lasso method also performs variable selection and is said to yield sparse models.&lt;/strong&gt;&lt;/p&gt;
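The variable-selection behavior can be illustrated with a tiny coordinate-descent sketch (a simplified illustration, not the article's code; the toy design matrix is chosen to be orthogonal so the result is exact after one pass). The soft-thresholding update is what sets weak coefficients exactly to zero:

```python
import numpy as np

def soft_threshold(rho, lam):
    # S(rho, lam) = sign(rho) * max(|rho| - lam, 0)
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    # Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1.
    # Each update soft-thresholds the coordinate-wise least squares fit.
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # residual excluding feature j
            b[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return b

# Toy data: feature 0 has a strong effect, feature 1 a weak one.
X = np.array([[1., 0.], [1., 0.], [1., 0.],
              [0., 1.], [0., 1.], [0., 1.]])
y = np.array([3., 3., 3., 0.5, 0.5, 0.5])
print(lasso_cd(X, y, lam=2.0))  # the weak coefficient is set exactly to zero
```

The strong coefficient survives (shrunk from 3 to about 2.33), while the weak one is zeroed out: the lasso has performed variable selection.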
  &lt;h3 id=&quot;c646&quot;&gt;What does Regularization achieve?&lt;/h3&gt;
  &lt;p id=&quot;f798&quot;&gt;A standard least squares model tends to have some variance in it, i.e. the model won’t generalize well to a data set different from its training data. &lt;em&gt;Regularization significantly reduces the variance of the model without a substantial increase in its bias&lt;/em&gt;. So the tuning parameter λ, used in the regularization techniques described above, controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. &lt;em&gt;Up to a point, this increase in λ is beneficial as it only reduces the variance (hence avoiding overfitting), without losing any important properties of the data.&lt;/em&gt; But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected.&lt;/p&gt;
  &lt;h2 id=&quot;HWWe&quot;&gt;References&lt;/h2&gt;
  &lt;p id=&quot;QQY7&quot;&gt;The Articles:&lt;/p&gt;
  &lt;ul id=&quot;3WDx&quot;&gt;
    &lt;li id=&quot;9ORV&quot;&gt;&lt;a href=&quot;https://www.geeksforgeeks.org/regularization-in-machine-learning/&quot; target=&quot;_blank&quot;&gt;https://www.geeksforgeeks.org/regularization-in-machine-learning/&lt;/a&gt;&lt;/li&gt;
    &lt;li id=&quot;3iqr&quot;&gt;&lt;a href=&quot;https://www.javatpoint.com/regularization-in-machine-learning&quot; target=&quot;_blank&quot;&gt;https://www.javatpoint.com/regularization-in-machine-learning&lt;/a&gt;&lt;/li&gt;
    &lt;li id=&quot;ePGp&quot;&gt;&lt;a href=&quot;https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a&quot; target=&quot;_blank&quot;&gt;https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;udQ5&quot;&gt;Video Explanations:&lt;/p&gt;
  &lt;ul id=&quot;HfBY&quot;&gt;
    &lt;li id=&quot;Ie7B&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Q81RR3yKn30&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;index=23&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=Q81RR3yKn30&amp;amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;amp;index=23&lt;/a&gt;&lt;/li&gt;
    &lt;li id=&quot;0kO7&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=NGf0voTMlcs&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;index=24&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=NGf0voTMlcs&amp;amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;amp;index=24&lt;/a&gt;&lt;/li&gt;
    &lt;li id=&quot;QWWu&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Xm2C_gTAl8c&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;index=25&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=Xm2C_gTAl8c&amp;amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;amp;index=25&lt;/a&gt;&lt;/li&gt;
    &lt;li id=&quot;IOft&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=1dKRdX9bfIo&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;index=26&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=1dKRdX9bfIo&amp;amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&amp;amp;index=26&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;

</content></entry><entry><id>asadbek_universe:LogisticRegression</id><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe/LogisticRegression?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><title>Logistic Regression</title><published>2023-01-04T07:51:26.606Z</published><updated>2023-01-04T08:35:18.129Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img1.teletype.in/files/87/12/8712d1d2-d2e2-4358-90f9-83f6343d1196.png"></media:thumbnail><summary type="html">&lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png&quot;&gt;Let's have a quick overall information given in one of the articles of javatpoint.com, after which we have a video from StatQuest with Josh Starmer, then we will go through  article with detailed overview published in Towards Data Science by Saishruthi Swaminathan.</summary><content type="html">
  &lt;h2 id=&quot;ZmNZ&quot;&gt;Introduction&lt;/h2&gt;
  &lt;p id=&quot;irxm&quot;&gt;Let&amp;#x27;s start with the quick overall information given in one of the articles of javatpoint.com, after which we have a video from StatQuest with Josh Starmer; then we will go through an article with a detailed overview published in Towards Data Science by Saishruthi Swaminathan.&lt;/p&gt;
  &lt;h1 id=&quot;u8Kx&quot;&gt;Logistic Regression in Machine Learning&lt;/h1&gt;
  &lt;ul id=&quot;RnU5&quot;&gt;
    &lt;li id=&quot;ojMi&quot;&gt;Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.&lt;/li&gt;
    &lt;li id=&quot;tAAh&quot;&gt;Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. But instead of giving the exact values 0 and 1, &lt;strong&gt;it gives probabilistic values which lie between 0 and 1&lt;/strong&gt;.&lt;/li&gt;
    &lt;li id=&quot;W1I0&quot;&gt;Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas &lt;strong&gt;Logistic Regression is used for solving classification problems&lt;/strong&gt;.&lt;/li&gt;
    &lt;li id=&quot;o1dl&quot;&gt;In Logistic regression, instead of fitting a regression line, we fit an &amp;quot;S&amp;quot; shaped logistic function, which predicts two maximum values (0 or 1).&lt;/li&gt;
    &lt;li id=&quot;fTHe&quot;&gt;The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.&lt;/li&gt;
    &lt;li id=&quot;oZVB&quot;&gt;Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.&lt;/li&gt;
    &lt;li id=&quot;gGnW&quot;&gt;Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The image below shows the logistic function:&lt;/li&gt;
  &lt;/ul&gt;
  &lt;figure id=&quot;Xoiq&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png&quot; width=&quot;500&quot; /&gt;
    &lt;figcaption&gt;https://www.javatpoint.com/logistic-regression-in-machine-learning&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;LqOr&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/yIYKR4sgzI8?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
    &lt;figcaption&gt;https://www.youtube.com/@statquest&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;h1 id=&quot;0331&quot;&gt;Logistic Regression — Detailed Overview&lt;/h1&gt;
  &lt;p id=&quot;731f&quot;&gt;Logistic Regression was used in the biological sciences in the early twentieth century. It was then used in many social science applications. Logistic Regression is used when the dependent variable (target) is categorical.&lt;/p&gt;
  &lt;p id=&quot;87e6&quot;&gt;For example,&lt;/p&gt;
  &lt;ul id=&quot;Kst7&quot;&gt;
    &lt;li id=&quot;395b&quot;&gt;To predict whether an email is spam (1) or (0)&lt;/li&gt;
    &lt;li id=&quot;059b&quot;&gt;Whether the tumor is malignant (1) or not (0)&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;065d&quot;&gt;Consider a scenario where we need to classify whether a tumor is malignant or not. If we use linear regression for this problem, we need to set a threshold on which the classification is based. Say the actual class is malignant, the predicted continuous value is 0.4, and the threshold value is 0.5: the data point will be classified as not malignant, which can lead to serious consequences in practice.&lt;/p&gt;
  &lt;p id=&quot;b360&quot;&gt;From this example, it can be inferred that linear regression is not suitable for classification problems. Linear regression is unbounded, and this brings logistic regression into the picture: its value strictly ranges from 0 to 1.&lt;/p&gt;
  &lt;p id=&quot;76e8&quot;&gt;&lt;strong&gt;Simple Logistic Regression&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;6bab&quot;&gt;(Full Source code: &lt;a href=&quot;https://github.com/SSaishruthi/LogisticRegression_Vectorized_Implementation/blob/master/Logistic_Regression.ipynb&quot; target=&quot;_blank&quot;&gt;https://github.com/SSaishruthi/LogisticRegression_Vectorized_Implementation/blob/master/Logistic_Regression.ipynb&lt;/a&gt;)&lt;/p&gt;
  &lt;p id=&quot;f8a5&quot;&gt;&lt;strong&gt;&lt;em&gt;Model&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;d8b4&quot;&gt;Output = 0 or 1&lt;/p&gt;
  &lt;p id=&quot;00b3&quot;&gt;Hypothesis =&amp;gt; Z = WX + B&lt;/p&gt;
  &lt;p id=&quot;9348&quot;&gt;hΘ(x) = sigmoid (Z)&lt;/p&gt;
  &lt;p id=&quot;e792&quot;&gt;&lt;strong&gt;&lt;em&gt;Sigmoid Function&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;Se4v&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*RqXFpiNGwdiKBWyLJc_E7g.png&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 2: Sigmoid Activation Function&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;07ab&quot;&gt;If ‘Z’ goes to infinity, Y(predicted) will become 1 and if ‘Z’ goes to negative infinity, Y(predicted) will become 0.&lt;/p&gt;
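The hypothesis hΘ(x) = sigmoid(Z) with Z = WX + B can be sketched in a few lines of Python (a minimal illustration, not the notebook's code; the weights in the hypothesis function are hypothetical):

```python
import math

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(w, x, b):
    # h(x) = sigmoid(Z), where Z = W.X + B
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1, approaching 1 as z -> +infinity
print(sigmoid(-10))  # close to 0, approaching 0 as z -> -infinity
```

This makes the limiting behavior described above concrete: large positive Z pushes the prediction toward 1, large negative Z toward 0.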
  &lt;p id=&quot;1eb5&quot;&gt;&lt;strong&gt;&lt;em&gt;Analysis of the hypothesis&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;64e2&quot;&gt;The output of the hypothesis is an estimated probability, which tells us how confident we can be that the predicted value is the actual value for a given input X. Consider the example below,&lt;/p&gt;
  &lt;p id=&quot;2e5c&quot;&gt;X = [x0 x1] = [1 IP-Address]&lt;/p&gt;
  &lt;p id=&quot;c3d8&quot;&gt;Based on the x1 value, let’s say we obtain an estimated probability of 0.8. This tells us that there is an 80% chance that the email is spam.&lt;/p&gt;
  &lt;p id=&quot;bfa0&quot;&gt;Mathematically this can be written as,&lt;/p&gt;
  &lt;figure id=&quot;eZkG&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/685/1*i_QQvUzXCETJEelf4mLx8Q.png&quot; width=&quot;548&quot; /&gt;
    &lt;figcaption&gt;Figure 3: Mathematical Representation&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;874b&quot;&gt;This justifies the name ‘logistic regression’. The data is fit by a linear regression model, which is then acted upon by a logistic function to predict the target categorical dependent variable.&lt;/p&gt;
  &lt;p id=&quot;4007&quot;&gt;&lt;strong&gt;&lt;em&gt;Types of Logistic Regression&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;df74&quot;&gt;1. Binary Logistic Regression&lt;/p&gt;
  &lt;p id=&quot;4844&quot;&gt;The categorical response has only two possible outcomes. Example: spam or not spam&lt;/p&gt;
  &lt;p id=&quot;8c22&quot;&gt;2. Multinomial Logistic Regression&lt;/p&gt;
  &lt;p id=&quot;d2a6&quot;&gt;Three or more categories without ordering. Example: Predicting which food is preferred more (Veg, Non-Veg, Vegan)&lt;/p&gt;
  &lt;p id=&quot;0ce1&quot;&gt;3. Ordinal Logistic Regression&lt;/p&gt;
  &lt;p id=&quot;d827&quot;&gt;Three or more categories with ordering. Example: Movie rating from 1 to 5&lt;/p&gt;
  &lt;p id=&quot;5834&quot;&gt;&lt;strong&gt;&lt;em&gt;Decision Boundary&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;4303&quot;&gt;To predict which class a data point belongs to, a threshold can be set. Based upon this threshold, the estimated probability is mapped to one of the classes.&lt;/p&gt;
  &lt;p id=&quot;daac&quot;&gt;Say, if predicted_value ≥ 0.5, then classify the email as spam, else as not spam.&lt;/p&gt;
  &lt;p id=&quot;d477&quot;&gt;The decision boundary can be linear or non-linear. The polynomial order can be increased to obtain a more complex decision boundary.&lt;/p&gt;
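The thresholding rule above is a one-liner in practice; a small sketch (the function name `classify` is our own):

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    # Probabilities at or above the threshold are labelled spam (1),
    # the rest not spam (0).
    return (np.asarray(probabilities) >= threshold).astype(int)

print(classify([0.2, 0.5, 0.8]))  # [0 1 1]
```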
  &lt;p id=&quot;30ed&quot;&gt;&lt;strong&gt;&lt;em&gt;Cost Function&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;gpEn&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/623/1*TqZ9myxIdLuKNmt8orCeew.png&quot; width=&quot;498&quot; /&gt;
    &lt;figcaption&gt;Figure 4: Cost Function of Logistic Regression&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;260d&quot;&gt;Why can the cost function used for linear regression not be used for logistic regression?&lt;/p&gt;
  &lt;p id=&quot;226d&quot;&gt;Linear regression uses mean squared error as its cost function. If this were used for logistic regression, the cost would be a non-convex function of the parameters (theta), and gradient descent is only guaranteed to converge to the global minimum when the function is convex.&lt;/p&gt;
  &lt;figure id=&quot;ajKR&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/579/1*ZyjEj3A_QyR4WY7y5cwIWQ.png&quot; width=&quot;463&quot; /&gt;
    &lt;figcaption&gt;Figure 5: Convex and non-convex cost function&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;8f5a&quot;&gt;&lt;strong&gt;&lt;em&gt;Cost function explanation&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;94jE&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*5AYaGPV-gjYUf37d2IhgTQ.jpeg&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 6: Cost Function part 1&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;iAQK&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*MFMIEUC_dobhJrRjGK7PBg.jpeg&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 7: Cost Function part 2&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;532f&quot;&gt;&lt;strong&gt;&lt;em&gt;Simplified cost function&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;O6QT&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*ueEwU1dE0Yu-KpMJanf9AQ.png&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 8: Simplified Cost Function&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;40cd&quot;&gt;&lt;strong&gt;&lt;em&gt;Why this cost function?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;Vi9D&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*heGae4aZ-dN-rLsfx2-P9g.jpeg&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 9: Maximum Likelihood Explanation part-1&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;DegL&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*JIpaau-jFfvX2yR9L1YZ6A.jpeg&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 10: Maximum Likelihood Explanation part-2&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;6727&quot;&gt;The function is negated because, when we train, we maximize the likelihood by minimizing the loss function. Decreasing the cost increases the likelihood, assuming the samples are drawn from an independent and identically distributed population.&lt;/p&gt;
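To make the effect of the negative log concrete, here is a small sketch of the binary cross-entropy cost (the function name `log_loss` is our own): a confident correct prediction costs little, a confident wrong one costs a lot.

```python
import numpy as np

def log_loss(y_true, y_prob):
    # Average negative log-likelihood (binary cross-entropy),
    # matching the simplified cost function above.
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(log_loss([1], [0.9]))  # ~0.105: confident and correct
print(log_loss([1], [0.1]))  # ~2.303: confident and wrong
```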
  &lt;p id=&quot;776e&quot;&gt;&lt;strong&gt;&lt;em&gt;Deriving the formula for Gradient Descent Algorithm&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;rjJm&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*r7fhk417IOuq7meXIctGXg.jpeg&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 11: Gradient Descent Algorithm part 1&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;EzMF&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/875/1*pJEi5f4gdVGezYev9MChBw.jpeg&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Figure 12: Gradient Descent part 2&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;5f8f&quot;&gt;&lt;strong&gt;&lt;em&gt;Python Implementation&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
  &lt;pre id=&quot;m7WX&quot;&gt;import numpy as np

def weightInitialization(n_features):
    w = np.zeros((1,n_features))
    b = 0
    return w,b

def sigmoid_activation(result):
    final_result = 1/(1+np.exp(-result))
    return final_result

def model_optimize(w, b, X, Y):
    m = X.shape[0]

    #Prediction
    final_result = sigmoid_activation(np.dot(w,X.T)+b)
    Y_T = Y.T
    cost = (-1/m)*(np.sum((Y_T*np.log(final_result)) + ((1-Y_T)*(np.log(1-final_result)))))

    #Gradient calculation
    dw = (1/m)*(np.dot(X.T, (final_result-Y.T).T))
    db = (1/m)*(np.sum(final_result-Y.T))

    grads = {&amp;quot;dw&amp;quot;: dw, &amp;quot;db&amp;quot;: db}

    return grads, cost

def model_predict(w, b, X, Y, learning_rate, no_iterations):
    costs = []
    for i in range(no_iterations):
        grads, cost = model_optimize(w,b,X,Y)
        dw = grads[&amp;quot;dw&amp;quot;]
        db = grads[&amp;quot;db&amp;quot;]
        #weight update
        w = w - (learning_rate * (dw.T))
        b = b - (learning_rate * db)

        if (i % 100 == 0):
            costs.append(cost)
            #print(&amp;quot;Cost after %i iteration is %f&amp;quot; %(i, cost))

    #final parameters (returned after the loop, so all iterations run)
    coeff = {&amp;quot;w&amp;quot;: w, &amp;quot;b&amp;quot;: b}
    gradient = {&amp;quot;dw&amp;quot;: dw, &amp;quot;db&amp;quot;: db}

    return coeff, gradient, costs

def predict(final_pred, m):
    y_pred = np.zeros((1,m))
    for i in range(final_pred.shape[1]):
        if final_pred[0][i] &amp;gt; 0.5:
            y_pred[0][i] = 1
    return y_pred&lt;/pre&gt;
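To see this training loop in action end-to-end, here is a self-contained sketch on tiny synthetic data (the dataset, learning rate, and iteration count are our own illustrative choices):

```python
import numpy as np

# Hypothetical toy data: one feature, linearly separable labels.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])

w, b = np.zeros((1, X.shape[1])), 0.0
for i in range(5000):
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, X.T) + b)))  # sigmoid(WX + B)
    dw = np.dot(X.T, (p - Y).T) / X.shape[0]         # gradient w.r.t. weights
    db = np.sum(p - Y) / X.shape[0]                  # gradient w.r.t. bias
    w -= 0.5 * dw.T                                  # learning rate 0.5
    b -= 0.5 * db

# Threshold the final probabilities at 0.5 to recover class labels.
preds = (1.0 / (1.0 + np.exp(-(np.dot(w, X.T) + b))) > 0.5).astype(int)
print(preds)  # [[0 0 0 1 1 1]]
```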
  &lt;p id=&quot;73dc&quot;&gt;Cost vs Number_of_Iterations&lt;/p&gt;
  &lt;figure id=&quot;UA7z&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/778/1*uRaeTkF5Ig_DYZwR8HiJMQ.png&quot; width=&quot;622&quot; /&gt;
    &lt;figcaption&gt;Figure 13: Cost Reduction&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;bc19&quot;&gt;The train and test accuracy of the system is 100% on this dataset.&lt;/p&gt;
  &lt;p id=&quot;5ca6&quot;&gt;This implementation is for binary logistic regression. For data with more than two classes, softmax regression has to be used.&lt;/p&gt;
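Softmax regression replaces the sigmoid with the softmax function, which turns a vector of class scores into a probability distribution over all classes; a minimal sketch (the function name and scores are our own):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())  # 1.0: a valid probability distribution
```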
  &lt;h2 id=&quot;dQpJ&quot;&gt;References&lt;/h2&gt;
  &lt;p id=&quot;jbqG&quot;&gt;The article about Logistic Regression - &lt;a href=&quot;https://www.javatpoint.com/logistic-regression-in-machine-learning&quot; target=&quot;_blank&quot;&gt;https://www.javatpoint.com/logistic-regression-in-machine-learning&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;Lmef&quot;&gt;The video about Logistic Regression - &lt;a href=&quot;https://www.youtube.com/watch?v=yIYKR4sgzI8&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=yIYKR4sgzI8&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;X9Ge&quot;&gt;The main post about Logistic Regression - &lt;a href=&quot;https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc&quot; target=&quot;_blank&quot;&gt;https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;sTR3&quot;&gt;Additional Info - &lt;a href=&quot;https://www.ibm.com/topics/logistic-regression#:~:text=Resources-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables&quot; target=&quot;_blank&quot;&gt;https://www.ibm.com/topics/logistic-regression#:~:text=Resources-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables&lt;/a&gt;.&lt;/p&gt;
  &lt;p id=&quot;Rwme&quot;&gt;The Source Code - &lt;a href=&quot;https://github.com/SSaishruthi/LogisticRegression_Vectorized_Implementation/blob/master/Logistic_Regression.ipynb&quot; target=&quot;_blank&quot;&gt;https://github.com/SSaishruthi/LogisticRegression_Vectorized_Implementation/blob/master/Logistic_Regression.ipynb&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>asadbek_universe:LinearRegression</id><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe/LinearRegression?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><title>Linear Regression</title><published>2023-01-04T06:59:11.802Z</published><updated>2023-01-04T06:59:11.802Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img3.teletype.in/files/a3/67/a36789f9-bc15-4324-8aca-67fe20c91de7.png"></media:thumbnail><summary type="html">&lt;img src=&quot;https://miro.medium.com/max/1400/1*Cw5ZSYDkIFpmhBwr-hN84A.png&quot;&gt;We begin to dive into the world of actual Machine Learning starting from this topic. There are few techniques of ML: Supervised Learning, Unsupervised Learning, Reinforcement Learning and Semi-supervised Learning. And Linear Regression is one of the bases of &quot;Supervised Learning&quot; techniques.</summary><content type="html">
  &lt;h3 id=&quot;TcaY&quot;&gt;&lt;em&gt;Foreword&lt;/em&gt;&lt;/h3&gt;
  &lt;p id=&quot;5UBs&quot;&gt;&lt;em&gt;We begin to dive into the world of actual &lt;strong&gt;Machine Learning&lt;/strong&gt; starting from this topic. There are few techniques of ML: Supervised Learning, Unsupervised Learning, Reinforcement Learning and Semi-supervised Learning. And &lt;strong&gt;Linear Regression&lt;/strong&gt; is one of the bases of &amp;quot;Supervised Learning&amp;quot; techniques.&lt;/em&gt;&lt;/p&gt;
  &lt;h2 id=&quot;5ItZ&quot;&gt;What is Linear Regression?&lt;/h2&gt;
  &lt;p id=&quot;Szyv&quot;&gt;Linear regression is a model that assumes a linear relationship between the independent (x) variables and one dependent (y) variable. It performs the task of predicting the dependent (y) variable value based on a given independent (x) variable. Simple linear regression is a type of linear regression where there is a single independent (x) variable.&lt;/p&gt;
  &lt;figure id=&quot;V72b&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/1400/1*Cw5ZSYDkIFpmhBwr-hN84A.png&quot; width=&quot;1400&quot; /&gt;
    &lt;figcaption&gt;https://miro.medium.com&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;Ymn1&quot;&gt;Let us understand the principle of the regression model visually with the help of a video published on the channel &lt;a href=&quot;https://www.youtube.com/@statquest&quot; target=&quot;_blank&quot;&gt;StatQuest with Josh Starmer&lt;/a&gt;&lt;/p&gt;
  &lt;figure id=&quot;kzwB&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/nk2CQITm_eo?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
    &lt;figcaption&gt;https://www.youtube.com/watch?v=nk2CQITm_eo&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;MqvA&quot;&gt;Now let&amp;#x27;s have a look at the post published in Towards Data Science by &lt;em&gt;Rohith Gandhi &lt;/em&gt;&lt;/p&gt;
  &lt;pre id=&quot;tToJ&quot;&gt;y = a_0 + a_1 * x ## Linear Equation&lt;/pre&gt;
  &lt;p id=&quot;gkge&quot;&gt;The motive of the linear regression algorithm is to find the best values for a_0 and a_1. Before moving on to the algorithm, let’s have a look at two important concepts you must know to better understand linear regression.&lt;/p&gt;
  &lt;h2 id=&quot;0bwQ&quot;&gt;Cost Function&lt;/h2&gt;
  &lt;p id=&quot;W8c0&quot;&gt;The cost function helps us to figure out the best possible values for a_0 and a_1 which would provide the best fit line for the data points. Since we want the best values for a_0 and a_1, we convert this search problem into a minimization problem where we would like to minimize the error between the predicted value and the actual value.&lt;/p&gt;
  &lt;figure id=&quot;7sqo&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/608/1*wQCSNJ486WxL4mZ3FOYtgw.png&quot; width=&quot;486&quot; /&gt;
    &lt;figcaption&gt;Minimization and Cost Function&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;ZrQk&quot;&gt;We choose the above function to minimize. The difference between the predicted values and the ground truth measures the error. We square this error, sum over all data points, and divide by the total number of data points. This gives the average squared error over all the data points; therefore, this cost function is also known as the Mean Squared Error (MSE) function. Now, using this MSE function, we are going to change the values of a_0 and a_1 such that the MSE value settles at its minimum.&lt;/p&gt;
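The MSE computation just described is short enough to write out directly; a small sketch (the function name and sample values are our own):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between predictions and ground truth.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([5, 7, 9], [4, 7, 11]))  # (1 + 0 + 4) / 3 ≈ 1.667
```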
  &lt;h2 id=&quot;7ajY&quot;&gt;Gradient Descent&lt;/h2&gt;
  &lt;p id=&quot;Faxf&quot;&gt;The next important concept needed to understand linear regression is gradient descent. Gradient descent is a method of updating a_0 and a_1 to reduce the cost function(MSE). The idea is that we start with some values for a_0 and a_1 and then we change these values iteratively to reduce the cost. Gradient descent helps us on how to change the values.&lt;/p&gt;
  &lt;figure id=&quot;djW2&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/proxy/0*QwE8M4MupSdqA3M4.png&quot; width=&quot;700&quot; /&gt;
    &lt;figcaption&gt;Gradient Descent&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;J3mw&quot;&gt;To draw an analogy, imagine a pit in the shape of a U: you are standing at the topmost point of the pit and your objective is to reach the bottom. There is a catch: you can only take a discrete number of steps to reach the bottom. If you take one small step at a time, you will eventually reach the bottom, but it will take longer. If you take larger steps each time, you will get there sooner, but there is a chance that you overshoot the bottom and do not land exactly on it. In the gradient descent algorithm, the size of the steps you take is the learning rate, and it decides how fast the algorithm converges to the minima.&lt;/p&gt;
  &lt;figure id=&quot;v6ti&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/579/0*GrBkYv2vodl_swFI.png&quot; width=&quot;463&quot; /&gt;
    &lt;figcaption&gt;Convex vs Non-convex function&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;3K1M&quot;&gt;Sometimes the cost function can be a non-convex function where you could settle at a local minima but for linear regression, it is always a convex function.&lt;/p&gt;
  &lt;p id=&quot;cs8v&quot;&gt;You may be wondering how to use gradient descent to update a_0 and a_1. To update them, we take gradients from the cost function by computing partial derivatives with respect to a_0 and a_1. Understanding how these partial derivatives are found requires some calculus, but if you don&amp;#x27;t know it, that is alright; you can take them as given.&lt;/p&gt;
  &lt;figure id=&quot;yCqz&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/1423/1*D4Q7zeRBmZ3z1CbD37CIhg.png&quot; width=&quot;470&quot; /&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;OJUi&quot; class=&quot;m_custom&quot;&gt;
    &lt;img src=&quot;https://miro.medium.com/max/708/1*fr-f6K1ebanMA4Roz8JENA.png&quot; width=&quot;531&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;rKto&quot;&gt;The partial derivates are the gradients and they are used to update the values of a_0 and a_1. Alpha is the learning rate which is a hyperparameter that you must specify. A smaller learning rate could get you closer to the minima but takes more time to reach the minima, a larger learning rate converges sooner but there is a chance that you could overshoot the minima.&lt;/p&gt;
  &lt;blockquote id=&quot;MUlH&quot;&gt;The article is very well written; I advise you to visit it in the end: &lt;a href=&quot;https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a&quot; target=&quot;_blank&quot;&gt;https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a&lt;/a&gt;&lt;/blockquote&gt;
  &lt;p id=&quot;TMXF&quot;&gt;I suspect a video explanation will paint an even better picture. The ML video tutorial from &lt;a href=&quot;https://www.youtube.com/@codebasics&quot; target=&quot;_blank&quot;&gt;codebasics&lt;/a&gt; suits our topic well.&lt;/p&gt;
  &lt;figure id=&quot;5Ndh&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/vsWrXfO3wWw?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;3obp&quot;&gt;The code from the video:&lt;/p&gt;
  &lt;pre id=&quot;TLdG&quot;&gt;import numpy as np

def gradient_descent(x,y):
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.08

    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1/n) * sum([val**2 for val in (y-y_predicted)])
        md = -(2/n)*sum(x*(y-y_predicted))
        bd = -(2/n)*sum(y-y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print (&amp;quot;m {}, b {}, cost {} iteration {}&amp;quot;.format(m_curr,b_curr,cost, i))

x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])

gradient_descent(x,y)&lt;/pre&gt;
  &lt;p id=&quot;y5yz&quot;&gt;&lt;/p&gt;
  &lt;h3 id=&quot;ZbwA&quot;&gt;References&lt;/h3&gt;
  &lt;p id=&quot;EaSs&quot;&gt;&lt;a href=&quot;https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a&quot; target=&quot;_blank&quot;&gt;https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;5FwG&quot;&gt;&lt;a href=&quot;https://machinelearningmastery.com/linear-regression-for-machine-learning/&quot; target=&quot;_blank&quot;&gt;https://machinelearningmastery.com/linear-regression-for-machine-learning/&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;fKto&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=vsWrXfO3wWw&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=vsWrXfO3wWw&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;Vxbt&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=nk2CQITm_eo&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/watch?v=nk2CQITm_eo&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>asadbek_universe:OverfittingUnderfitting</id><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe/OverfittingUnderfitting?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><title>Overfitting/Underfitting</title><published>2023-01-04T06:45:18.434Z</published><updated>2023-01-04T06:52:54.437Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img2.teletype.in/files/9a/0d/9a0dfb21-272a-4f1d-be3f-8b99bd18c97b.png"></media:thumbnail><summary type="html">&lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/overfitting-and-underfitting.png&quot;&gt;So, we started to explore Supervised Learning. Let's have a bit understanding of approximating a target function (f) that maps input variables (X) to an output variable (Y)</summary><content type="html">
  &lt;p id=&quot;anmq&quot;&gt;So, we started to explore Supervised Learning. Let&amp;#x27;s get a basic understanding of approximating a target function (f) that maps input variables (X) to an output variable (Y)&lt;/p&gt;
  &lt;pre id=&quot;NUaR&quot;&gt; Y = f(X)&lt;/pre&gt;
  &lt;p id=&quot;oSov&quot;&gt;This characterization describes the range of classification and prediction problems and the machine algorithms that can be used to address them. An important consideration in learning the target function from the training data is how well the model generalizes to new data.&lt;/p&gt;
  &lt;blockquote id=&quot;257L&quot;&gt;&lt;a href=&quot;https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/&quot; target=&quot;_blank&quot;&gt;https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/&lt;/a&gt;&lt;/blockquote&gt;
  &lt;h2 id=&quot;6bp7&quot;&gt;Generalization in Machine Learning&lt;/h2&gt;
  &lt;p id=&quot;dxKn&quot;&gt;&lt;em&gt;Generalization&lt;/em&gt; defines the ability of an ML model to provide a suitable output for previously unseen inputs. It means that, after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two conditions that need to be checked to judge whether the model generalizes well or not.&lt;/p&gt;
  &lt;p id=&quot;K0l9&quot;&gt;Before looking at overfitting and underfitting, let&amp;#x27;s understand some basic terms that will help to understand this topic well:&lt;/p&gt;
  &lt;ul id=&quot;luMH&quot;&gt;
    &lt;li id=&quot;scC6&quot;&gt;&lt;strong&gt;Signal:&lt;/strong&gt; It refers to the true underlying pattern of the data that helps the machine learning model to learn from the data.&lt;/li&gt;
    &lt;li id=&quot;HPyj&quot;&gt;&lt;strong&gt;Noise:&lt;/strong&gt; Noise is unnecessary and irrelevant data that reduces the performance of the model.&lt;/li&gt;
    &lt;li id=&quot;28I7&quot;&gt;&lt;strong&gt;Bias: &lt;/strong&gt;Assumptions made by a model to make the target function easier to learn. In practice it is the error rate on the training data: when that error rate is high, we speak of high bias, and when it is low, of low bias.&lt;/li&gt;
    &lt;li id=&quot;kcoJ&quot;&gt;&lt;strong&gt;Variance: &lt;/strong&gt;The difference between the error rate on the training data and on the testing data. If the difference is large, the model has high variance; if it is small, low variance. Usually, we want low variance so that our model generalizes.&lt;/li&gt;
  &lt;/ul&gt;
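The bias and variance definitions above can be read off directly from train and test error rates; a tiny sketch with hypothetical numbers of our own choosing:

```python
# Hypothetical error rates for illustration only.
train_error = 0.05  # low training error suggests low bias
test_error = 0.25

bias = train_error                   # error rate on the training data
variance = test_error - train_error  # gap between testing and training error

print(f"bias={bias:.2f}, variance={variance:.2f}")
# A large gap (high variance) with low bias is the classic overfitting signature.
```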
  &lt;blockquote id=&quot;bv5m&quot;&gt;&lt;a href=&quot;https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning#:~:text=Underfitting%20occurs%20when%20our%20machine,enough%20from%20the%20training%20data&quot; target=&quot;_blank&quot;&gt;https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning#:~:text=Underfitting%20occurs%20when%20our%20machine,enough%20from%20the%20training%20data&lt;/a&gt;.&lt;/blockquote&gt;
  &lt;h2 id=&quot;hpes&quot;&gt;Overfitting&lt;/h2&gt;
  &lt;p id=&quot;8JYG&quot;&gt;Overfitting occurs when our machine learning model tries to cover all the data points or more than the required data points present in the given dataset. Because of this, the model starts caching noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.&lt;/p&gt;
  &lt;p id=&quot;DZuy&quot;&gt;The chance of overfitting grows with the amount of training we give our model: the more we train it, the more likely an overfitted model becomes.&lt;/p&gt;
  &lt;figure id=&quot;rxmv&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/overfitting-and-underfitting.png&quot; width=&quot;550&quot; /&gt;
    &lt;figcaption&gt;https://www.javatpoint.com&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;njev&quot;&gt;* Overfitting is the main problem that occurs in supervised learning.&lt;/p&gt;
  &lt;p id=&quot;95Ti&quot;&gt;Reasons for Overfitting:&lt;/p&gt;
  &lt;ul id=&quot;QYv7&quot;&gt;
    &lt;li id=&quot;TIQ3&quot;&gt;High variance and low bias&lt;/li&gt;
    &lt;li id=&quot;Zn99&quot;&gt;The model is too complex&lt;/li&gt;
    &lt;li id=&quot;TcOY&quot;&gt;The size of the training data&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;GY5m&quot;&gt;Techniques to reduce overfitting:&lt;/p&gt;
  &lt;ul id=&quot;o7d4&quot;&gt;
    &lt;li id=&quot;6IE9&quot;&gt;Increase training data&lt;/li&gt;
    &lt;li id=&quot;FotP&quot;&gt;Reduce model complexity&lt;/li&gt;
    &lt;li id=&quot;n2q0&quot;&gt;Early stopping during the training phase (keep an eye on the loss during training; as soon as the loss begins to increase, stop training)&lt;/li&gt;
    &lt;li id=&quot;lLV7&quot;&gt;Other techniques (take a look if you are interested: &amp;quot;8 Simple Techniques to Prevent Overfitting&amp;quot; - &lt;a href=&quot;https://towardsdatascience.com/8-simple-techniques-to-prevent-overfitting-4d443da2ef7d&quot; target=&quot;_blank&quot;&gt;https://towardsdatascience.com/8-simple-techniques-to-prevent-overfitting-4d443da2ef7d&lt;/a&gt;)&lt;/li&gt;
  &lt;/ul&gt;
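The early-stopping idea from the list above can be sketched as a small helper that watches a sequence of validation losses (the function name, `patience` parameter, and loss values are our own illustrative choices):

```python
def early_stopping(val_losses, patience=2):
    # Returns the epoch at which training should stop: the last improvement
    # before the loss fails to improve for `patience` consecutive epochs.
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Loss falls, then starts rising: stop at the minimum (epoch 3).
print(early_stopping([0.9, 0.7, 0.6, 0.55, 0.58, 0.64]))  # 3
```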
  &lt;h2 id=&quot;pTIi&quot;&gt;Underfitting&lt;/h2&gt;
  &lt;p id=&quot;AyGx&quot;&gt;Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data may be stopped at an early stage, because of which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.&lt;/p&gt;
  &lt;p id=&quot;S5bu&quot;&gt;In the case of underfitting, the model is not able to learn enough from the training data, and hence it reduces the accuracy and produces unreliable predictions.&lt;/p&gt;
  &lt;figure id=&quot;U3N6&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://static.javatpoint.com/tutorial/machine-learning/images/overfitting-and-underfitting2.png&quot; width=&quot;500&quot; /&gt;
    &lt;figcaption&gt;https://www.javatpoint.com&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;YvoO&quot;&gt;Reasons for Underfitting:&lt;/p&gt;
  &lt;ul id=&quot;VjtJ&quot;&gt;
    &lt;li id=&quot;HarY&quot;&gt;High bias and low variance&lt;/li&gt;
    &lt;li id=&quot;KXxW&quot;&gt;The size of the training dataset used is not enough.&lt;/li&gt;
    &lt;li id=&quot;5hDa&quot;&gt;The model is too simple.&lt;/li&gt;
    &lt;li id=&quot;w2MO&quot;&gt;Training data is not cleaned and also contains noise in it.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;jXvq&quot;&gt;Techniques to reduce underfitting: &lt;/p&gt;
  &lt;ul id=&quot;XSfH&quot;&gt;
    &lt;li id=&quot;wC0x&quot;&gt;Increase model complexity&lt;/li&gt;
    &lt;li id=&quot;mqsb&quot;&gt;Remove noise from the data.&lt;/li&gt;
    &lt;li id=&quot;SpRQ&quot;&gt;Increase the size or number of parameters in the model.&lt;/li&gt;
    &lt;li id=&quot;9PoZ&quot;&gt;Increase the training time, until the cost function is minimised.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;pSM3&quot;&gt;&lt;em&gt;Visual explanation&lt;/em&gt;&lt;/p&gt;
  &lt;figure id=&quot;O2OT&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://media.geeksforgeeks.org/wp-content/uploads/20210323204619/imgonlinecomuaresizeLOjqonkALC.jpg&quot; width=&quot;1100&quot; /&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;vykj&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190523171258/overfitting_2.png&quot; width=&quot;1200&quot; /&gt;
    &lt;figcaption&gt;https://www.geeksforgeeks.org/&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;h2 id=&quot;75Su&quot;&gt;Good Fit in a Statistical Model&lt;/h2&gt;
  &lt;p id=&quot;KRFh&quot;&gt;The &amp;quot;goodness of fit&amp;quot; term is taken from statistics, and the goal of machine learning models is to achieve goodness of fit. In statistical modeling, &lt;em&gt;it defines how closely the results or predicted values match the true values of the dataset.&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;R0UX&quot;&gt;The model with a good fit is between the underfitted and overfitted model, and ideally, it makes predictions with 0 errors, but in practice, this is difficult to achieve.&lt;/p&gt;
  &lt;p id=&quot;QID9&quot;&gt;As we train our model for a while, the errors on the training data go down, and the same happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset. The errors on the test dataset start increasing, &lt;em&gt;so the point just before the errors start rising is the good point, and we can stop there to achieve a good model.&lt;/em&gt;&lt;/p&gt;
  &lt;h3 id=&quot;N0uz&quot;&gt;References&lt;/h3&gt;
  &lt;p id=&quot;MJGh&quot;&gt;&amp;quot;Overfitting and Underfitting in Machine Learning&amp;quot; &lt;a href=&quot;https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning#:~:text=Underfitting%20occurs%20when%20our%20machine,enough%20from%20the%20training%20data&quot; target=&quot;_blank&quot;&gt;https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning#:~:text=Underfitting%20occurs%20when%20our%20machine,enough%20from%20the%20training%20data&lt;/a&gt;.&lt;/p&gt;
  &lt;p id=&quot;RzUS&quot;&gt;&amp;quot;ML | Underfitting and Overfitting&amp;quot; &lt;a href=&quot;https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/&quot; target=&quot;_blank&quot;&gt;https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;C2mg&quot;&gt;&amp;quot;Overfitting and Underfitting With Machine Learning Algorithms&amp;quot; &lt;a href=&quot;https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/&quot; target=&quot;_blank&quot;&gt;https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>asadbek_universe:CategoricalData</id><link rel="alternate" type="text/html" href="https://teletype.in/@asadbek_universe/CategoricalData?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=asadbek_universe"></link><title>Categorical Data</title><published>2023-01-04T06:23:17.394Z</published><updated>2023-01-04T06:23:17.394Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img4.teletype.in/files/f3/45/f3454cf5-b1ce-4e60-b5f4-19c73976989d.png"></media:thumbnail><summary type="html">&lt;img src=&quot;https://www.kdnuggets.com/wp-content/uploads/garg_cat_variables_15.jpg&quot;&gt;The article published in Towards Data Science by Andrew Engel explains clearly about Categorical Variables for Machine Learning Algorithms. I advice you to visit and go over eyes this article</summary><content type="html">
  &lt;blockquote id=&quot;fe5f&quot;&gt;The article published in Towards Data Science by &lt;a href=&quot;https://engelanalytics.medium.com/?source=post_page-----d2768d587ab6--------------------------------&quot; target=&quot;_blank&quot;&gt;Andrew Engel&lt;/a&gt; gives a clear explanation of categorical variables for machine learning algorithms. I advise you to read &lt;a href=&quot;https://towardsdatascience.com/categorical-variables-for-machine-learning-algorithms-d2768d587ab6&quot; target=&quot;_blank&quot;&gt;this article&lt;/a&gt;.&lt;/blockquote&gt;
  &lt;p id=&quot;DPHB&quot;&gt;While most machine learning algorithms only work with numeric values, many important real-world features are categorical rather than numeric. Categorical features take on a set of levels or values, representing categories such as state or customer type. They can also be created by binning an underlying numeric feature, for example grouping individuals by age range (e.g., 0–10, 11–18, 19–30, 31–50, etc.). Finally, they can be numeric identifiers where the relationship between the values is not meaningful. ZIP codes are a common example: two ZIP codes that are numerically close may be geographically farther apart than another ZIP code that is numerically distant.&lt;/p&gt;
  &lt;p id=&quot;ff71&quot;&gt;Since categorical features cannot be used directly in most machine learning algorithms, they need to be transformed into numerical features. While numerous techniques exist for this, the most common is one-hot encoding.&lt;/p&gt;
  &lt;p id=&quot;62aR&quot;&gt;In one-hot encoding, a categorical variable is converted into a set of binary indicators, one per category in the entire dataset. For a feature with the seven levels clear, partly cloudy, rain, wind, snow, cloudy, and fog, seven new variables will be created, each containing either &lt;em&gt;1&lt;/em&gt; or &lt;em&gt;0&lt;/em&gt;. Then, for each observation, the variable that matches the category is set to &lt;em&gt;1&lt;/em&gt; and all other variables are set to &lt;em&gt;0&lt;/em&gt;.&lt;/p&gt;
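  &lt;p id=&quot;ohe1&quot;&gt;The description above can be sketched with pandas (a minimal illustration; the weather column and its values below are invented for the example):&lt;/p&gt;

```python
# A minimal sketch of one-hot encoding with pandas.get_dummies.
# The "weather" column is invented to mirror the levels mentioned in the text.
import pandas as pd

df = pd.DataFrame({"weather": ["rain", "clear", "fog", "clear"]})

# One binary indicator column is created per distinct level;
# for each row, the matching column is 1 and all others are 0.
encoded = pd.get_dummies(df, columns=["weather"])
print(encoded)
```

One indicator column appears per distinct level present in the data, so a real seven-level weather feature would yield seven columns.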
  &lt;p id=&quot;0LQs&quot;&gt;Let us watch the video tutorial from the &lt;a href=&quot;https://www.youtube.com/@codebasics&quot; target=&quot;_blank&quot;&gt;codebasics&lt;/a&gt; channel about categorical variables, a.k.a. dummy variables, and one-hot encoding.&lt;/p&gt;
  &lt;figure id=&quot;yI4q&quot; class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/9yl6-HEY7_s?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
  &lt;/figure&gt;
  &lt;blockquote id=&quot;CUZj&quot;&gt;The code from the tutorial is on GitHub: &lt;a href=&quot;https://github.com/codebasics/py/blob/master/ML/5_one_hot_encoding/one_hot_encoding.ipynb&quot; target=&quot;_blank&quot;&gt;https://github.com/codebasics/py/blob/master/ML/5_one_hot_encoding/one_hot_encoding.ipynb&lt;/a&gt;&lt;/blockquote&gt;
  &lt;p id=&quot;yPvc&quot;&gt;Also, the article on &lt;a href=&quot;https://www.kdnuggets.com/&quot; target=&quot;_blank&quot;&gt;www.kdnuggets.com&lt;/a&gt; by &lt;strong&gt;Shelvi Garg&lt;/strong&gt; clearly shows how to deal with categorical data for machine learning.&lt;/p&gt;
  &lt;p id=&quot;n7Kz&quot;&gt;&lt;strong&gt;Here is the list of the 15 types of encoding the library supports:&lt;/strong&gt;&lt;/p&gt;
  &lt;ul id=&quot;VsxY&quot;&gt;
    &lt;li id=&quot;SrP3&quot;&gt;One-hot Encoding&lt;/li&gt;
    &lt;li id=&quot;bApG&quot;&gt;Label Encoding&lt;/li&gt;
    &lt;li id=&quot;TQE1&quot;&gt;Ordinal Encoding&lt;/li&gt;
    &lt;li id=&quot;j2bJ&quot;&gt;Helmert Encoding&lt;/li&gt;
    &lt;li id=&quot;nvH5&quot;&gt;Binary Encoding&lt;/li&gt;
    &lt;li id=&quot;JJEm&quot;&gt;Frequency Encoding&lt;/li&gt;
    &lt;li id=&quot;EaO2&quot;&gt;Mean Encoding&lt;/li&gt;
    &lt;li id=&quot;exOK&quot;&gt;Weight of Evidence Encoding&lt;/li&gt;
    &lt;li id=&quot;DEhq&quot;&gt;Probability Ratio Encoding&lt;/li&gt;
    &lt;li id=&quot;EkBb&quot;&gt;Hashing Encoding&lt;/li&gt;
    &lt;li id=&quot;O1Xy&quot;&gt;Backward Difference Encoding&lt;/li&gt;
    &lt;li id=&quot;FY5Q&quot;&gt;Leave One Out Encoding&lt;/li&gt;
    &lt;li id=&quot;PrxB&quot;&gt;James-Stein Encoding&lt;/li&gt;
    &lt;li id=&quot;DCES&quot;&gt;M-estimator Encoding&lt;/li&gt;
    &lt;li id=&quot;sngu&quot;&gt;Thermometer Encoder&lt;/li&gt;
  &lt;/ul&gt;
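  &lt;p id=&quot;freq1&quot;&gt;As one illustration from the list above, frequency encoding can be sketched in plain pandas, without any extra encoder library (the city column is an invented example): each level is replaced by its relative frequency in the data.&lt;/p&gt;

```python
# A sketch of frequency encoding in plain pandas: each category level
# is replaced by how often it occurs in the column. The "city" data
# here is invented for the example.
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "NYC", "LA"]})

# Relative frequency of each level: NYC 3/6, LA 2/6, SF 1/6.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df)
```

Unlike one-hot encoding, this produces a single numeric column regardless of how many levels the feature has, which can help with high-cardinality features.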
  &lt;figure id=&quot;bEPu&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://www.kdnuggets.com/wp-content/uploads/garg_cat_variables_15.jpg&quot; width=&quot;800&quot; /&gt;
    &lt;figcaption&gt;https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;xbSK&quot;&gt;Visit &lt;a href=&quot;https://www.kdnuggets.com/2021/05/deal-with-categorical-data-machine-learning.html&quot; target=&quot;_blank&quot;&gt;the webpage&lt;/a&gt; to understand these categorical encoding methods and get a better idea of the topic.&lt;/p&gt;

</content></entry></feed>