January 4, 2023

Overfitting/Underfitting

So, we started to explore Supervised Learning. Let's build a bit of understanding of approximating a target function (f) that maps input variables (X) to an output variable (Y):

 Y = f(X)

This characterization describes the range of classification and prediction problems and the machine learning algorithms that can be used to address them. An important consideration in learning the target function from the training data is how well the model generalizes to new data.

https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
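
To make Y = f(X) concrete, here is a minimal sketch in Python (assuming NumPy and scikit-learn are available; the target function and data are made up for illustration): we define a hidden f, sample noisy (X, Y) pairs from it, and fit an approximation from the data.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # The hidden target function f (unknown in a real problem).
    def f(x):
        return 3.0 * x + 2.0

    # Training data: noisy observations of Y = f(X).
    X = rng.uniform(0, 10, size=(100, 1))
    Y = f(X).ravel() + rng.normal(0, 1.0, size=100)

    # Learn an approximation of f from the (X, Y) pairs.
    model = LinearRegression().fit(X, Y)
    print(model.coef_, model.intercept_)  # should land near 3 and 2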

Generalization in Machine Learning

Generalization refers to the ability of an ML model to produce suitable output for previously unseen inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output on new data. Hence, underfitting and overfitting are the two conditions to check when assessing the performance of the model and whether it generalizes well or not.

Before getting into overfitting and underfitting, let's define some basic terms that will help in understanding this topic:

  • Signal: It refers to the true underlying pattern of the data that helps the machine learning model to learn from the data.
  • Noise: Noise is unnecessary and irrelevant data that reduces the performance of the model.
  • Bias: Assumptions made by a model to make the target function easier to learn. Under this working definition, it is measured by the error rate on the training data: when the training error rate is high, we call it high bias, and when it is low, we call it low bias.
  • Variance: The difference between the error rate on the training data and on the testing data. If the difference is high, it's called high variance; when the difference is low, it's called low variance. Usually, we want low variance so the model generalizes well (a sketch of measuring both proxies follows below).
https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning
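
Under the working definitions above (bias ≈ training error, variance ≈ the train/test error gap), here is a minimal sketch of how one might estimate both; the dataset and the unconstrained decision tree are illustrative assumptions:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=300)  # signal + noise

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

    # An unconstrained tree memorizes the training set.
    model = DecisionTreeRegressor().fit(X_tr, y_tr)

    train_err = mean_squared_error(y_tr, model.predict(X_tr))   # bias proxy
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print("training error:", train_err)             # near zero -> low bias
    print("train/test gap:", test_err - train_err)  # large gap -> high variance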

Overfitting

Overfitting occurs when a machine learning model tries to fit every data point (even more points than required) in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the data, and these factors reduce the efficiency and accuracy of the model.

The chances of overfitting increase the more training we give our model: the longer we train it on the same data, the more likely it is to end up overfitted.

Note: Overfitting is one of the main problems that occur in supervised learning.

Reasons for Overfitting:

  • High variance and low bias
  • The model is too complex
  • The size of the training data is too small

Techniques to reduce overfitting:

  • Increase the amount of training data.
  • Reduce model complexity.
  • Use cross-validation.
  • Stop training early, before the model memorizes the noise.
  • Apply regularization, which penalizes overly complex models (sketched below).
  • Use ensembling methods that average out individual models' errors.
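
As a sketch of the regularization idea (the dataset, polynomial degree, and penalty strength are all illustrative assumptions): a high-degree polynomial fit chases the noise, while the same features with an L2 penalty (Ridge) generalize better on unseen inputs.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(2)
    X = np.linspace(0, 1, 30).reshape(-1, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=30)
    X_new = np.linspace(0, 1, 200).reshape(-1, 1)  # unseen inputs
    y_new = np.sin(2 * np.pi * X_new).ravel()      # true signal, no noise

    # Degree-15 polynomial: flexible enough to chase the noise.
    plain = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
    # Same features, but the L2 penalty discourages extreme coefficients.
    ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)).fit(X, y)

    for name, m in [("no penalty", plain), ("ridge", ridge)]:
        print(name, mean_squared_error(y_new, m.predict(X_new)))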

Underfitting

Underfitting occurs when a machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, but then the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data, which reduces its accuracy and produces unreliable predictions.

Reasons for Underfitting:

  • High bias and low variance
  • The size of the training dataset used is not enough.
  • The model is too simple.
  • The training data is not cleaned and contains noise.

Techniques to reduce underfitting:

  • Increase model complexity (see the sketch after this list).
  • Remove noise from the data.
  • Increase the size or number of parameters in the model.
  • Increase the training time until the cost function is minimised.
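
A minimal sketch of the first idea, with made-up data: a straight line underfits a curved trend, while a slightly more complex model captures the dominant trend.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(3)
    X = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = X.ravel() ** 2 + rng.normal(0, 0.5, size=200)  # quadratic trend + noise

    # Too simple: a line cannot follow the curve (high bias -> underfitting).
    line = LinearRegression().fit(X, y)
    # A bit more complex: a degree-2 polynomial matches the dominant trend.
    quad = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

    print("line MSE:", mean_squared_error(y, line.predict(X)))
    print("quad MSE:", mean_squared_error(y, quad.predict(X)))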

Visual explanation

[Figure: curves illustrating underfitting, a good fit, and overfitting, from https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/]

Good Fit in a Statistical Model

The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning models to achieve the goodness of fit. In statistics modeling, it defines how closely the result or predicted values match the true values of the dataset.

The model with a good fit sits between the underfitted and the overfitted model; ideally it would make predictions with zero error, but in practice that is difficult to achieve.

When we start training our model, the errors on the training data go down, and the same happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset, and the errors on the test dataset start increasing. The point just before the errors rise is the sweet spot, and we can stop there to obtain a good model.
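
One way to find that point is to track validation error as training proceeds and note where it bottoms out. A sketch using gradient-boosted trees, where the number of boosting stages stands in for training time (the data and the model choice are illustrative assumptions):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(4)
    X = rng.uniform(-3, 3, size=(500, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=500)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=4)

    # Train for "a long duration" (many boosting stages)...
    model = GradientBoostingRegressor(n_estimators=500, random_state=4).fit(X_tr, y_tr)

    # ...then replay the validation error after each stage.
    val_errors = [mean_squared_error(y_val, pred)
                  for pred in model.staged_predict(X_val)]
    best_stage = int(np.argmin(val_errors)) + 1
    print("validation error bottoms out at stage", best_stage, "of 500")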

References

"Overfitting and Underfitting in Machine Learning" https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning#:~:text=Underfitting%20occurs%20when%20our%20machine,enough%20from%20the%20training%20data.

"ML | Underfitting and Overfitting" https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/

"Overfitting and Underfitting With Machine Learning Algorithms" https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/