Math for Data Science Masterclass
May 31, 2023

2.3 Measurements of Dispersion - Variance and Standard Deviation

How far the data points are spread out from the average: that is what these measures calculate.

A low variance means the data points lie close together; a high variance means they are widely scattered around the mean.
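
To make this concrete, here is a minimal Python sketch comparing two small made-up data sets (both are illustrative assumptions, not data from the course) that share the same mean but differ in spread. It uses the population variance (divide by n) from the standard library.

```python
from statistics import mean, pvariance

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [49, 50, 51]
wide = [10, 50, 90]

for name, data in (("tight", tight), ("wide", wide)):
    # pvariance divides by n (population variance).
    print(name, "mean =", mean(data), "variance =", round(pvariance(data), 2))
```

The tightly clustered set should report a much smaller variance than the widely scattered one, even though both means are 50.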

Some questions I asked ChatGPT:


The reason for squaring the deviations in the variance calculation instead of using the absolute value is primarily based on mathematical and statistical considerations. Squaring the deviations offers several advantages:

  1. Emphasizing Larger Deviations: Squaring the deviations amplifies the effect of larger deviations, giving them more weight in the calculation of variance. This is desirable as larger deviations often represent more significant deviations from the mean and contribute more to the overall spread of the data.
  2. Mathematical Properties: Squaring the deviations gives the variance convenient mathematical properties. For example, the squared deviation is a smooth (differentiable) function, and the variances of independent variables add, which simplifies calculations and underlies many statistical techniques and formulas.
  3. Relation to Other Statistical Measures: Squaring the deviations in the variance calculation is linked to other important statistical measures, such as the standard deviation. The standard deviation, which is the square root of the variance, is widely used in statistical analysis due to its mathematical properties and interpretation.

It's important to note that the choice of using squared deviations in the variance calculation is a convention within statistics. Different measures of dispersion, such as the mean absolute deviation (MAD), use absolute differences instead. The choice of using squared deviations is based on mathematical convenience and the desire to capture the emphasis of larger deviations in assessing the spread of data.


Can you give me a real-life example with data to demonstrate the points you mentioned above? Please use numbers.

- Certainly! Let's consider the following dataset representing the heights of individuals (in centimeters):

Dataset: 165, 170, 175, 180, 185

We will calculate the variance using both squared deviations and absolute deviations to demonstrate their differences.

  1. Squared Deviations: First, we calculate the mean height:

     Mean = (165 + 170 + 175 + 180 + 185) / 5 = 175

     Next, we calculate the squared deviation for each data point:

     (165 - 175)^2 = 100
     (170 - 175)^2 = 25
     (175 - 175)^2 = 0
     (180 - 175)^2 = 25
     (185 - 175)^2 = 100

     Now, we find the variance by taking the average of the squared deviations:

     Variance = (100 + 25 + 0 + 25 + 100) / 5 = 50

  2. Absolute Deviations: Using absolute deviations, we calculate the deviation for each data point:

     |165 - 175| = 10
     |170 - 175| = 5
     |175 - 175| = 0
     |180 - 175| = 5
     |185 - 175| = 10

     To find the mean absolute deviation (MAD), we average the absolute deviations:

     MAD = (10 + 5 + 0 + 5 + 10) / 5 = 6

By comparing the squared deviations (variance) and the absolute deviations (MAD), we can observe the differences:

  • Emphasizing Larger Deviations: Squaring the deviations in the variance calculation places more emphasis on larger deviations. In this example, heights farther from the mean (175) contribute more to the variance. The squared deviations of 100 and 100 for heights of 165 and 185, respectively, reflect this increased emphasis.
  • Mathematical Properties: The use of squared deviations enables the variance to be calculated by summing the squared deviations and dividing by the number of data points, simplifying calculations. It also allows for various statistical properties and techniques to be applied.
  • Relation to Other Statistical Measures: The variance is a crucial component in calculating the standard deviation, which is the square root of the variance. The standard deviation is widely used in statistics to measure the spread of data and is based on the squared deviations.

In summary, the use of squared deviations in the variance calculation provides several statistical advantages, including the emphasis of larger deviations, mathematical properties, and relation to other statistical measures.
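
The arithmetic above can be checked with a short Python sketch. It follows the same steps (population variance, dividing by n) on the height data from the notes; the "share contributed by the extremes" calculation at the end is my own addition to illustrate the point about larger deviations.

```python
heights = [165, 170, 175, 180, 185]  # data set from the notes, in cm

n = len(heights)
mean = sum(heights) / n

# Population variance: average of the squared deviations (divide by n).
squared_devs = [(x - mean) ** 2 for x in heights]
variance = sum(squared_devs) / n

# Mean absolute deviation: average of the absolute deviations.
abs_devs = [abs(x - mean) for x in heights]
mad = sum(abs_devs) / n

print(mean, variance, mad)  # 175.0 50.0 6.0

# Share of each total contributed by the two extreme points (165 and 185).
extreme_share_sq = (squared_devs[0] + squared_devs[-1]) / sum(squared_devs)
extreme_share_abs = (abs_devs[0] + abs_devs[-1]) / sum(abs_devs)
print(extreme_share_sq, extreme_share_abs)  # 0.8 versus roughly 0.67
```

The two extreme heights account for 80% of the summed squared deviations but only about 67% of the summed absolute deviations, which is what "emphasizing larger deviations" means in numbers.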


Can you calculate the standard deviation of the above data, compare it with the MAD, and explain the significance of the standard deviation for this data set, that is, the impact of the standard deviation over the MAD?

Certainly! Let's calculate the standard deviation from the given dataset and compare it to the mean absolute deviation (MAD).

Dataset: 165, 170, 175, 180, 185

  1. Calculate the Mean: Mean = (165 + 170 + 175 + 180 + 185) / 5 = 175
  2. Calculate the Squared Deviation for each data point:
     (165 - 175)^2 = 100
     (170 - 175)^2 = 25
     (175 - 175)^2 = 0
     (180 - 175)^2 = 25
     (185 - 175)^2 = 100
  3. Calculate the Variance: Variance = (100 + 25 + 0 + 25 + 100) / 5 = 50
  4. Calculate the Standard Deviation: Standard Deviation = √Variance = √50 ≈ 7.071
  5. Calculate the Absolute Deviation for each data point:
     |165 - 175| = 10
     |170 - 175| = 5
     |175 - 175| = 0
     |180 - 175| = 5
     |185 - 175| = 10
  6. Calculate the Mean Absolute Deviation (MAD): MAD = (10 + 5 + 0 + 5 + 10) / 5 = 6

Now, let's compare the Standard Deviation and MAD:

  • Standard Deviation (7.071): The standard deviation is the square root of the average squared deviation from the mean (a root-mean-square deviation). Because the deviations are squared before averaging, larger deviations count for more. In this example, the standard deviation of approximately 7.071 indicates that a typical data point lies about 7 centimeters from the mean.
  • Mean Absolute Deviation (MAD) (6): The MAD is the plain average of the absolute deviations from the mean. It is less influenced by extreme values than the standard deviation. In this example, the MAD of 6 indicates that, on average, the data points deviate from the mean by 6 centimeters.

Significance of Standard Deviation: The standard deviation is widely used in statistics because it measures spread in the same units as the data. Squaring removes the algebraic signs of the deviations, so positive and negative deviations cannot cancel each other out. It helps us understand the variability and spread of a dataset and is often used together with the mean to summarize and compare datasets. Compared with the MAD, which is based on absolute deviations, the standard deviation gives more weight to large deviations from the mean.

The standard deviation is particularly useful when dealing with datasets that follow a normal distribution or when comparing datasets that have similar units or scales. It is also a fundamental component in many statistical techniques and hypothesis testing. Its squared nature allows for mathematical properties and facilitates the calculation of confidence intervals and significance testing.

While the MAD is less sensitive to outliers, the standard deviation weights large deviations more heavily and connects directly to the variance. Neither measure records the direction of the deviations from the mean: both depend only on their magnitude.
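
A rough way to see the difference in outlier sensitivity is to change one value and recompute both measures. In the sketch below, the second data set is hypothetical (I replaced 185 with 215 purely for illustration), and the helper names `pop_std` and `mean_abs_dev` are my own.

```python
import math

def pop_std(data):
    """Population standard deviation (divide by n)."""
    m = sum(data) / len(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

def mean_abs_dev(data):
    """Mean absolute deviation around the mean."""
    m = sum(data) / len(data)
    return sum(abs(x - m) for x in data) / len(data)

heights = [165, 170, 175, 180, 185]        # data set from the notes
with_outlier = [165, 170, 175, 180, 215]   # hypothetical: last value replaced

for label, data in (("original", heights), ("with outlier", with_outlier)):
    sd, mad = pop_std(data), mean_abs_dev(data)
    print(f"{label}: SD = {sd:.3f}, MAD = {mad:.3f}, SD/MAD = {sd / mad:.3f}")
```

Both measures grow when the outlier is introduced, but the ratio SD/MAD also grows, showing that the standard deviation reacts more strongly to the extreme value.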


You wrote "In summary, the use of squared deviations in the variance calculation provides several statistical advantages, including the emphasis of larger deviations, mathematical properties, and relation to other statistical measures." Can you explain which mathematical properties you are talking about, and which relations to other statistical measures?

Certainly! Here are the mathematical properties and relations to other statistical measures associated with the use of squared deviations in variance calculation:

  1. Simplified Calculation: Squaring the deviations in the variance formula allows for the convenient mathematical property of being able to sum the squared deviations. This simplifies the calculation process, as we can add up the squared deviations and divide by the number of data points to find the variance.
  2. Additivity of Variance: For independent random variables, variances add: Var(X + Y) = Var(X) + Var(Y). This does not mean that pooling two data sets into one gives a variance equal to the sum of the two variances; the rule applies to the sum of independent quantities. No comparably simple rule holds for the mean absolute deviation. This property underlies the analysis of variance and many other statistical techniques.
  3. Standard Deviation: The variance is the square of the standard deviation. The standard deviation is widely used as a measure of dispersion because, as the square root of the variance, it is expressed in the same units as the data.
  4. Covariance and Correlation: The variance is closely related to covariance and correlation. Covariance measures how two variables vary together and is built from products of their deviations from their means; the variance is the covariance of a variable with itself. Correlation is the covariance divided by the product of the two standard deviations, which gives the strength and direction of the linear relationship.
  5. Statistical Inference: The use of squared deviations in the variance calculation is fundamental to many statistical inference techniques. It plays a crucial role in hypothesis testing, confidence interval estimation, and model building. These techniques rely on the properties of variance to make inferences about population parameters and assess the reliability of statistical estimates.

By squaring the deviations, the variance gains mathematical properties such as additivity for independent variables, leads directly to other statistical measures like standard deviation, covariance, and correlation, and supports various statistical inference techniques. These properties and relationships make the variance a widely used and important statistical measure.
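
One precise form of the variance property in item 2 is Var(X + Y) = Var(X) + Var(Y) for independent X and Y. The sketch below checks this exactly for two fair dice; the dice are an illustrative assumption of mine, not an example from the course. It also shows that simply pooling two copies of a data set does not add their variances.

```python
from fractions import Fraction
from itertools import product

def variance(outcomes):
    """Population variance of equally likely outcomes, as an exact fraction."""
    n = len(outcomes)
    mean = Fraction(sum(outcomes), n)
    return sum((x - mean) ** 2 for x in outcomes) / n

die = [1, 2, 3, 4, 5, 6]
# All 36 equally likely sums of two independent dice.
sums = [a + b for a, b in product(die, repeat=2)]

var_die = variance(die)
var_sum = variance(sums)
print(var_die, var_sum)        # 35/12 35/6

# Pooling two identical data sets leaves the variance unchanged.
print(variance(die + die))     # 35/12
```

The variance of the sum is exactly twice the variance of one die, while the pooled data set keeps the variance of a single die.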


https://chat.openai.com/share/1fbb8455-ce54-4810-86b3-b801846abac0 (off topic)



Quartiles and IQR

The median and the quartiles are similar ideas.

The concept of quartiles is simple. Just as the median finds the middle of a range of data, the quartiles work in much the same way, except that they use three cut points, which divide the data into four parts.

The example data are 18 golf scores.

The median here is the average of the two middle values, 69 and 69: adding the two 69s and dividing by two gives 69.

In the calculation we work with three quartiles: the first quartile, the second quartile, and the third quartile.

Here 75 is the highest score and 66 is the lowest, and the median in the middle is 69. If I now split the data at the quartiles, the first cut point is the first quartile, which can also be called the lower quartile.

Q1 = the median of the first (lower) half is the first quartile, also called the lower quartile. It is the 25th percentile of the data. [25% of the data fall below the lower quartile, and 75% fall above it.]

Q2 = second quartile = median (50th percentile of the data). [50th percentile means that the first 50% of the data fall in the lower part and the remaining 50% fall in the upper part.]

Q3 = third quartile, also called the upper quartile, or the 75th percentile of the data.

Five-number summary = min, Q1, median, Q3, max (gives an idea of the center and the spread at the same time)

By examining the five-number summary, we can gain insights into the shape of the distribution. For example:

  • If the minimum and maximum are close together, the data may be tightly clustered, indicating a smaller variance.
  • If Q1 and Q3 are about equally far from the median, the data may be symmetrically distributed.
  • If Q3 is much farther from the median than Q1 is, it suggests a positively skewed distribution.
  • If Q1 is much farther from the median than Q3 is, it suggests a negatively skewed distribution.
  • The range between Q1 and Q3 (also known as the interquartile range, or IQR) provides information about the spread of the middle 50% of the data.

By considering these measures together, the five-number summary allows us to assess the dispersion, skewness, and central tendency of a dataset, giving us a sense of the distribution and variance.
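
As a minimal sketch, the five-number summary can be computed by taking the median of each half of the sorted data, which is the "median of the first part" idea used for Q1 above. The data set and the function name `five_number_summary` are hypothetical, and other quartile conventions can give slightly different Q1 and Q3 values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, leave the middle value out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return min(data), median(lower), median(data), median(upper), max(data)

scores = [2, 4, 4, 5, 7, 8, 9, 12]  # hypothetical data, not the golf scores
lo, q1, med, q3, hi = five_number_summary(scores)
print(lo, q1, med, q3, hi)           # 2 4.0 6.0 8.5 12
print("IQR =", q3 - q1)              # IQR = 4.5
print("Q3 - median =", q3 - med, "| median - Q1 =", med - q1)
```

Comparing how far Q1 and Q3 sit from the median gives a quick reading of skew: here Q3 is slightly farther away (2.5 versus 2.0), hinting at a mild positive skew.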

IQR = Q3 - Q1

We can also identify outliers using the Q1 and Q3 values.

Any value that falls outside the range (Q1 - 1.5 × IQR) to (Q3 + 1.5 × IQR) can be considered an outlier.
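
The 1.5 × IQR rule can be sketched in Python as follows. The data set and the helper name `iqr_outliers` are hypothetical, and the quartiles come from the standard library's default method, which is only one of several quartile conventions.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and the two fences."""
    q1, _, q3 = quantiles(values, n=4)  # standard-library default quartile method
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high], (low, high)

data = [2, 4, 4, 5, 7, 8, 9, 12, 40]  # hypothetical data with one extreme value
outliers, fences = iqr_outliers(data)
print(outliers)  # [40]
print(fences)
```

Only the value 40 lies beyond the upper fence, so it is the only point flagged.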


Questions to GPT:

Why is a quartile called a quartile? What is its relationship with the number 4?

Quartiles are called quartiles because they divide a dataset into four equal parts, or quarters. Each part contains approximately 25% of the data.

There are three quartiles commonly used in statistics: the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3). Q2 is also known as the median. These quartiles help analyze the distribution and dispersion of data.

The first quartile (Q1) represents the value below which 25% of the data falls. It divides the lowest 25% of the data from the rest. The second quartile (Q2) is the median, representing the value below which 50% of the data falls, splitting the dataset into two equal halves. The third quartile (Q3) represents the value below which 75% of the data falls, dividing the upper 25% of the data from the rest.

The quartiles are commonly used in statistical analysis to understand the spread of data, identify outliers, and analyze the distribution of a dataset. They provide a way to summarize the distribution of data into four equal parts, offering insights into the relative positions and variability of values within a dataset.


Interquartile range formula for a data set in Excel

To calculate the interquartile range (IQR) of a data set in Excel, you can use the following formula:

=QUARTILE.INC(data_range,3)-QUARTILE.INC(data_range,1)

Here's how the formula works:

  1. QUARTILE.INC(data_range, 1) represents the first quartile (25th percentile) of the data set.
  2. QUARTILE.INC(data_range, 3) represents the third quartile (75th percentile) of the data set.
  3. Subtracting the first quartile from the third quartile gives you the interquartile range.

Make sure to replace "data_range" in the formula with the actual range of data you want to calculate the interquartile range for.

For example, if your data is in cells A1 to A10, the formula would be:

=QUARTILE.INC(A1:A10,3)-QUARTILE.INC(A1:A10,1)

This formula will give you the interquartile range of the data set in Excel.
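
For comparison, here is a Python sketch of the inclusive interpolation rule that, as far as I understand, QUARTILE.INC follows: quartile q sits at fractional position (n - 1) × q / 4 (counting from zero) in the sorted data, with linear interpolation between neighbours. The function name `quartile_inc` and the values 1 to 10 standing in for cells A1:A10 are assumptions for illustration.

```python
def quartile_inc(values, q):
    """Quartile q (1, 2 or 3) by inclusive linear interpolation."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 4      # 0-based fractional position
    lower = int(pos)                   # index just below the position
    frac = pos - lower                 # how far to move toward the next value
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + frac * (data[upper] - data[lower])

values = list(range(1, 11))  # hypothetical stand-in for cells A1:A10
iqr = quartile_inc(values, 3) - quartile_inc(values, 1)
print(quartile_inc(values, 1), quartile_inc(values, 3), iqr)  # 3.25 7.75 4.5
```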


A business is trying to analyze its sales data from the past year. The data shows the amount of money made from each sale, sorted in ascending order. The business wants to find the three points that divide the dataset into four equal parts, in order to better understand the distribution of its sales. The data in USD is shown below: [10, 12, 15, 18, 20, 21, 22, 24, 25, 27, 28, 30, 32, 35, 38, 40, 42, 45, 50] What is the Interquartile Range for this data set?

To find the interquartile range (IQR) for the given data set, we need to calculate the first quartile (Q1) and the third quartile (Q3).

First, let's determine the position of the quartiles. Since we have 19 data points, we can use the following formulas:

Position of Q1 = (n + 1) / 4 = (19 + 1) / 4 = 5
Position of Q3 = 3 * (n + 1) / 4 = 3 * (19 + 1) / 4 = 15

Next, let's find the values corresponding to these positions in the sorted data set:

Q1 = The value at position 5 = 20
Q3 = The value at position 15 = 38

Finally, we can calculate the interquartile range (IQR) by subtracting Q1 from Q3:

IQR = Q3 - Q1 = 38 - 20 = 18

Therefore, the interquartile range for the given data set is 18 USD.
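
This result depends on the quartile convention. The sketch below recomputes it with Python's standard library: the "exclusive" method places quartile i at position i × (n + 1) / 4, matching the calculation above, while the "inclusive" method uses position 1 + i × (n - 1) / 4, which I believe is the rule behind Excel's QUARTILE.INC. Treat the Excel correspondence as an assumption worth verifying in a spreadsheet.

```python
from statistics import quantiles

sales = [10, 12, 15, 18, 20, 21, 22, 24, 25, 27,
         28, 30, 32, 35, 38, 40, 42, 45, 50]

# "exclusive": quartile i at position i * (n + 1) / 4, as in the answer above.
q1, q2, q3 = quantiles(sales, n=4, method="exclusive")
print(q1, q2, q3, q3 - q1)               # 20.0 27.0 38.0 18.0

# "inclusive": quartile i at position 1 + i * (n - 1) / 4 instead.
q1_inc, _, q3_inc = quantiles(sales, n=4, method="inclusive")
print(q1_inc, q3_inc, q3_inc - q1_inc)   # 20.5 36.5 16.0
```

So the same data gives an IQR of 18 under the (n + 1) convention and 16 under the inclusive one; it is worth stating which convention is in use whenever an IQR is reported.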