Kamrun Nahar

3.2 Scatter Plot

2023-05-31T13:04:24.869Z

Scatter plots often include a trend line. The formula of a straight line in algebra is:

y = mx + c

Here,

m is the slope, and c is the y-intercept (the value of y when x = 0).

m (higher value) - the increase is faster

m (lower value) - the increase is slower, the change is not rapid

m (negative) - the value is going downward

An m value of 7 means that for every 1 unit you move along the x-axis, y increases by 7 units.
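The slope interpretation above can be checked with a few lines of Python. This is only a small sketch: the function name `line` and the intercept value c = 2 are made up for illustration, and only the slope of 7 comes from the note.

```python
def line(x, m, c):
    """Return y for the straight line y = m*x + c."""
    return m * x + c


m, c = 7, 2  # slope from the note; the intercept 2 is a hypothetical value

# Evaluate the line at x = 0, 1, 2, 3
ys = [line(x, m, c) for x in range(4)]
print(ys)  # [2, 9, 16, 23]

# Each step of 1 along x changes y by exactly m
steps = [later - earlier for earlier, later in zip(ys, ys[1:])]
print(steps)  # [7, 7, 7]
```

Changing m to a negative number makes every step negative, which is the "going downward" case.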

3.1 Visualization of Data

2023-05-31T10:12:26.297Z

Visualizing data is a key aspect of data science!

It’s important to be able to convey information to others, especially people who may not have the full technical knowledge or understanding to view the raw data or statistical analysis.

As we continue learning about data visualization, you should always keep in mind:

“What is the information I want to share or story I am trying to tell? How does this visualization help in conveying that information to others?”

Data visualization is especially crucial in organizations because often decisions are made based on a final visualization or interpretation of the data.

Remember that the purpose of data science at an enterprise level is to use it to make key decisions and improve products or services!

To understand which data visualization to use, let’s take a quick tour of the different data visualization categories we’ll cover in this section:

Scatter Plots
Line Plots
Distribution Plots
Categorical Plots

Scatter Plots

A typical scatter plot will represent 2 dimensions (data features). For example, the height vs. weight of a group of people. A scatter plot can reveal relationships between two data features.

Imagine the following data set:

Is there a relationship between tip and bill?

The total bill goes on the x-axis.

We mark the tip on the y-axis.

Plotting the data, we see something like this:

From this we can say: Tip tends to increase as Total Bill increases.

Now,

We can add a trend line (or regression line, or best fit line, or least squares line).

The trend line is drawn over the data points of the scatter plot, and its purpose is to show the general tendency of the data. It can also be used to estimate where future data points are likely to fall, as long as the trend in the data is strong enough.
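To make the "least squares line" idea concrete, here is a minimal sketch that fits y = mx + c to a handful of points in plain Python. The bill and tip numbers are invented for illustration (they are not the data set from the note), and `fit_trend_line` is a hypothetical helper name.

```python
def fit_trend_line(xs, ys):
    """Return (m, c) of the least squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c


# Hypothetical total bills and tips, for illustration only
total_bill = [10.0, 20.0, 30.0, 40.0, 50.0]
tip = [1.5, 3.5, 4.0, 6.5, 7.0]

m, c = fit_trend_line(total_bill, tip)
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")  # slope m = 0.14, intercept c = 0.30

# Use the line to estimate the tip for a bill that is not in the data
estimated_tip = m * 35 + c
print(f"estimated tip for a 35 bill: {estimated_tip:.2f}")
```

A positive slope here matches the observation that the tip tends to increase as the total bill increases.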

When many points overlap in a scatter plot, lowering their opacity makes the dense regions easier to see. In most plotting libraries or software, you can set the alpha value to a decimal between 0 and 1, where 0 represents completely transparent and 1 represents fully opaque. By reducing the alpha value, you can make the stacked scatter points appear more transparent, allowing underlying points or patterns to be visible.

Here's an example using Python's Matplotlib library:

import matplotlib.pyplot as plt

# Generate example data
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
stacked_values = [5, 15, 25, 35, 45]

# Plotting scatter points with transparency
plt.scatter(x, y, c='blue', alpha=0.5)
plt.scatter(x, stacked_values, c='red', alpha=0.5)

plt.xlabel('X')
plt.ylabel('Y')
plt.title('Stacked Scatter Points with Transparency')

plt.show()

In this example, two scatter plots are created: one for the original data points (blue) and another for the stacked points (red). The alpha parameter is set to 0.5 for both scatter plots, making them semi-transparent.

Adjust the alpha value according to your desired level of transparency.

Line Plots

Sometimes we already know there should be a continuous relationship between points along an axis. This is where we can use a simple line to indicate a known relationship between points along a data feature.

For example, in our previous example, it would not make sense to draw a line between the points!

The person who came for lunch on Saturday has no connection to the person who came for dinner on Sunday night, which is why a line plot should not be used at all for this particular case.

So when is a line plot appropriate?

When we already know for certain there is a continuous relationship between data points along a feature. For example, timestamps. We know for certain there are continuous times in between each time stamped point.

We know flights occurred between years!

A line indicates that continuous knowledge.

Line plots also make it easy to stack features:

Distribution Plots

Distribution plots allow us to visualize the dispersion of data across a feature or variable. One of the most common ways to do this is through a histogram.

What is the distribution of Total Bill amounts?

We could answer this through some statistical metrics:

The mean Total Bill is $18.79, with a standard deviation of $8.9, a minimum value of $3.07, and a maximum value of $50.81.

A histogram will count the number of occurrences within a range and then create a bar of the height of the count of the occurrences within the range.

Note how the x-axis feature is continuous; this must be the case for a histogram!
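The counting step behind a histogram can be sketched without any plotting library. The bill amounts below are hypothetical (only the minimum 3.07 and maximum 50.81 echo the note), and `histogram_counts` is an illustrative helper, not a library function.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value exactly equal to `high` goes into the last bin
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Hypothetical total bill amounts, for illustration only
bills = [3.07, 8.5, 12.0, 14.2, 16.9, 18.3, 21.7, 24.1, 33.0, 50.81]

counts = histogram_counts(bills, low=0, high=60, bins=6)
print(counts)  # [2, 4, 2, 1, 0, 1]

# A rough text version of the bars: one '#' per occurrence in the range
for i, count in enumerate(counts):
    start = i * 10
    print(f"{start:>2}-{start + 10:<2} | {'#' * count}")
```

The bar height is just the count per range, which only makes sense because the bill amount is a continuous feature that can be cut into ranges.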

There are many more types of plots to display distribution, such as box-and-whisker plots and KDE (Kernel Density Estimation) plots, but by far the most common you will encounter is the histogram.

1. Distribution plots:

Suppose you have a dataset that contains customer ratings for three different product categories: electronics, clothing, and home appliances. The ratings range from 1 to 5, where 1 represents low satisfaction and 5 represents high satisfaction.
To analyze the distribution of ratings for each product category, you can create histograms for each category.
The histogram for the electronics category may show that a large proportion of customers gave ratings of 4 and 5, indicating higher satisfaction. It may also reveal a few customers who gave lower ratings, potentially indicating areas for improvement.
The histogram for the clothing category might show a more evenly distributed set of ratings, indicating moderate satisfaction across different customer preferences.
The histogram for the home appliances category could display a skewed distribution, indicating that most customers are highly satisfied, with fewer giving lower ratings.

2. Categorical plots:

To further analyze the relationship between customer satisfaction and other categorical variables, such as gender and age group, you can create categorical plots.
A bar plot can be used to show the average rating for each gender group. This plot may reveal that females tend to have slightly higher satisfaction ratings compared to males.
A count plot can be used to show the count of ratings for different age groups. It might indicate that younger customers (e.g., 18-25 age group) have a higher participation rate and a wider range of ratings compared to older customers.
A categorical scatter plot can be used to examine the relationship between customer satisfaction ratings and the price range of products. Different markers or colors can represent different price ranges. This plot may show that customers who purchased products in the higher price range tend to give higher satisfaction ratings.

Categorical Plots

Categorical plots simply display some metric per category. For example, a mean value per category or a count per category. There are many variations of these types of plots, but one of the most common is the simple bar plot.

Be careful not to confuse the bar plot for a histogram! While they may appear similar, carefully look at the x-axis feature. For a bar chart, the feature is categorical while for a histogram it is continuous!
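As a rough illustration of "some metric per category", the sketch below computes the count and the mean rating per category, which are the bar heights a count plot and a bar plot would show. The ratings are invented for illustration; the category names are borrowed from the example above.

```python
from collections import defaultdict

# Hypothetical (category, rating) pairs, for illustration only
ratings = [
    ("electronics", 5),
    ("clothing", 3),
    ("electronics", 4),
    ("home appliances", 5),
    ("clothing", 4),
    ("electronics", 3),
]

# Group the ratings by their category label
groups = defaultdict(list)
for category, rating in ratings:
    groups[category].append(rating)

# One number per category: this is what each bar would represent
counts = {category: len(values) for category, values in groups.items()}
means = {category: sum(values) / len(values) for category, values in groups.items()}

print(counts)  # {'electronics': 3, 'clothing': 2, 'home appliances': 1}
print(means)   # {'electronics': 4.0, 'clothing': 3.5, 'home appliances': 5.0}
```

Unlike a histogram, the x-axis here is a set of labels with no numeric ranges between them.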

2.3 Measurements of Dispersion - Variance and Standard Deviation

2023-05-31T06:33:17.580Z

A measure of how far the data is spread out from the average.

A lower variance means the data points are close together, and a higher variance means they are widely scattered.

Some questions I asked ChatGPT:

The reason for squaring the deviations in the variance calculation instead of using the absolute value is primarily based on mathematical and statistical considerations. Squaring the deviations offers several advantages:

Emphasizing Larger Deviations: Squaring the deviations amplifies the effect of larger deviations, giving them more weight in the calculation of variance. This is desirable as larger deviations often represent more significant deviations from the mean and contribute more to the overall spread of the data.
Mathematical Properties: Squaring the deviations allows for convenient mathematical properties in the variance formula. For example, the squared deviations can be summed, which simplifies calculations and enables the use of various statistical techniques and formulas.
Relation to Other Statistical Measures: Squaring the deviations in the variance calculation is linked to other important statistical measures, such as the standard deviation. The standard deviation, which is the square root of the variance, is widely used in statistical analysis due to its mathematical properties and interpretation.

It's important to note that the choice of using squared deviations in the variance calculation is a convention within statistics. Different measures of dispersion, such as the mean absolute deviation (MAD), use absolute differences instead. The choice of using squared deviations is based on mathematical convenience and the desire to capture the emphasis of larger deviations in assessing the spread of data.

Can you give me a real-life example with data to prove the points you mentioned above? Use numbers, please.

- Certainly! Let's consider the following dataset representing the heights of individuals (in centimeters):

Dataset: 165, 170, 175, 180, 185

We will calculate the variance using both squared deviations and absolute deviations to demonstrate their differences.

Squared Deviations: First, we calculate the mean height:
Mean = (165 + 170 + 175 + 180 + 185) / 5 = 175

Next, we calculate the squared deviation for each data point:
(165 - 175)^2 = 100
(170 - 175)^2 = 25
(175 - 175)^2 = 0
(180 - 175)^2 = 25
(185 - 175)^2 = 100

Now, we find the variance by taking the average of the squared deviations:
Variance = (100 + 25 + 0 + 25 + 100) / 5 = 50

Absolute Deviations: Using absolute deviations, we calculate the deviation for each data point:
|165 - 175| = 10
|170 - 175| = 5
|175 - 175| = 0
|180 - 175| = 5
|185 - 175| = 10

To find the mean absolute deviation (MAD), we average the absolute deviations:
MAD = (10 + 5 + 0 + 5 + 10) / 5 = 6

By comparing the squared deviations (variance) and the absolute deviations (MAD), we can observe the differences:

Emphasizing Larger Deviations: Squaring the deviations in the variance calculation places more emphasis on larger deviations. In this example, heights farther from the mean (175) contribute more to the variance. The squared deviations of 100 and 100 for heights of 165 and 185, respectively, reflect this increased emphasis.
Mathematical Properties: The use of squared deviations enables the variance to be calculated by summing the squared deviations and dividing by the number of data points, simplifying calculations. It also allows for various statistical properties and techniques to be applied.
Relation to Other Statistical Measures: The variance is a crucial component in calculating the standard deviation, which is the square root of the variance. The standard deviation is widely used in statistics to measure the spread of data and is based on the squared deviations.

In summary, the use of squared deviations in the variance calculation provides several statistical advantages, including the emphasis of larger deviations, mathematical properties, and relation to other statistical measures.

Can you calculate the standard deviation from the above mentioned data and compare it with MAD and write the significance of standard deviation from the data set? The impact of Standard deviation over MAD.

Certainly! Let's calculate the standard deviation from the given dataset and compare it to the mean absolute deviation (MAD).

Dataset: 165, 170, 175, 180, 185

Calculate the Mean: Mean = (165 + 170 + 175 + 180 + 185) / 5 = 175
Calculate the Squared Deviation for each data point: (165 - 175)^2 = 100, (170 - 175)^2 = 25, (175 - 175)^2 = 0, (180 - 175)^2 = 25, (185 - 175)^2 = 100
Calculate the Variance: Variance = (100 + 25 + 0 + 25 + 100) / 5 = 50
Calculate the Standard Deviation: Standard Deviation = √Variance = √50 ≈ 7.071
Calculate the Absolute Deviation for each data point: |165 - 175| = 10, |170 - 175| = 5, |175 - 175| = 0, |180 - 175| = 5, |185 - 175| = 10
Calculate the Mean Absolute Deviation (MAD): MAD = (10 + 5 + 0 + 5 + 10) / 5 = 6
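The steps above can be reproduced with Python's standard `statistics` module. This is a small sketch using the same five heights. It uses the population formulas (`pvariance`, `pstdev`), because the calculation above divides by the number of data points n. The sample versions, which divide by n - 1, are shown only for contrast.

```python
import statistics

heights = [165, 170, 175, 180, 185]  # same data set as above, in cm

mean = statistics.mean(heights)
variance = statistics.pvariance(heights)  # average squared deviation (divide by n)
std_dev = statistics.pstdev(heights)      # square root of the variance
mad = sum(abs(h - mean) for h in heights) / len(heights)  # mean absolute deviation

print(mean)               # 175
print(variance)           # 50
print(round(std_dev, 3))  # 7.071
print(mad)                # 6.0

# For contrast: the sample variance divides by n - 1 instead of n
sample_variance = statistics.variance(heights)
print(sample_variance)    # 62.5
```

The standard deviation (about 7.07) is larger than the MAD (6) because squaring gives the two extreme heights more weight.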

Now, let's compare the Standard Deviation and MAD:

Standard Deviation (7.071): The standard deviation is the square root of the average squared deviation from the mean, so it is a root-mean-square measure of spread rather than a plain average of the deviations. Because the deviations are squared before averaging, larger deviations count for more. In this example, the standard deviation of approximately 7.071 centimeters means that a typical data point lies roughly 7 centimeters from the mean.
Mean Absolute Deviation (MAD) (6): The MAD is the average absolute deviation of each data point from the mean. It is less influenced by extreme values than the standard deviation. In this example, the MAD of 6 indicates that, on average, the data points deviate from the mean by 6 centimeters.

Significance of Standard Deviation: The standard deviation is widely used in statistics because it measures the spread of the data in the same units as the data itself. Squaring removes the algebraic signs of the deviations, so positive and negative deviations cannot cancel each other out. It helps us understand the variability of a dataset and is often used together with the mean to summarize and compare datasets. Compared to the MAD, the standard deviation gives more weight to large deviations from the mean.

The standard deviation is particularly useful when dealing with datasets that follow a normal distribution or when comparing datasets that have similar units or scales. It is also a fundamental component in many statistical techniques and hypothesis testing. Its squared nature allows for mathematical properties and facilitates the calculation of confidence intervals and significance testing.

While the MAD is less sensitive to outliers, the standard deviation reacts more strongly to large deviations from the mean and connects directly to the variance, which is why it appears in so many statistical techniques. Neither measure records the direction of the deviations.

You wrote "In summary, the use of squared deviations in the variance calculation provides several statistical advantages, including the emphasis of larger deviations, mathematical properties, and relation to other statistical measures." can you explain which mathematical properties you are talking about? And also which relation to other statistical measures?

Certainly! Here are the mathematical properties and relations to other statistical measures associated with the use of squared deviations in variance calculation:

Simplified Calculation: Squaring the deviations in the variance formula allows for the convenient mathematical property of being able to sum the squared deviations. This simplifies the calculation process, as we can add up the squared deviations and divide by the number of data points to find the variance.
Additivity of Variance: For two independent random variables X and Y, the variance of their sum equals the sum of their variances: Var(X + Y) = Var(X) + Var(Y). This is not the same as pooling two datasets into one, and the same rule does not hold for the standard deviation or the MAD. This property facilitates the analysis of variance in many statistical techniques and hypothesis testing.
Standard Deviation: The variance is the squared value of the standard deviation. The standard deviation is widely used as a measure of dispersion, and its calculation involves taking the square root of the variance. The standard deviation inherits the mathematical properties and relations of the variance.
Covariance and Correlation: The variance is closely related to covariance and correlation. Covariance measures how two variables vary together, using products of their deviations from their means, and the variance is simply the covariance of a variable with itself. Correlation is the covariance divided by the product of the two standard deviations, and it measures the strength and direction of the linear relationship.
Statistical Inference: The use of squared deviations in the variance calculation is fundamental to many statistical inference techniques. It plays a crucial role in hypothesis testing, confidence interval estimation, and model building. These techniques rely on the properties of variance to make inferences about population parameters and assess the reliability of statistical estimates.

By squaring the deviations, the variance gains mathematical properties such as additivity for independent variables, facilitates the calculation of other statistical measures like standard deviation, covariance, and correlation, and supports various statistical inference techniques. These properties and relationships make the variance a widely used and important statistical measure.
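One way to check the additive property of variance is a small experiment. The sketch below assumes two tiny made-up value lists and pairs every value of x with every value of y, which makes the two quantities independent by construction. It then compares the variance of the sums with the sum of the variances, and also shows that simply pooling the two lists into one does not follow the same rule.

```python
from itertools import product
from math import isclose
from statistics import pvariance

# Hypothetical values, for illustration only
x = [1, 2, 3]
y = [10, 20]

# Every x paired with every y: x and y are independent by construction
sums = [a + b for a, b in product(x, y)]

var_of_sum = pvariance(sums)
sum_of_vars = pvariance(x) + pvariance(y)
print(round(var_of_sum, 4), round(sum_of_vars, 4))  # 25.6667 25.6667

# Pooling the two lists into one dataset is a different operation
var_pooled = pvariance(x + y)
print(round(var_pooled, 2))  # 50.96
```

The first two numbers agree, while the pooled variance does not, so the rule is about sums of independent variables rather than about merging datasets.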

https://chat.openai.com/share/1fbb8455-ce54-4810-86b3-b801846abac0 (off topic)

Quartiles and IQR

The median and the quartiles are similar concepts.

The concept of quartiles is simple. Just as the median finds the middle of a range of data, quartiles work the same way, except that with quartiles we use three cut points that split the data into four parts.

Below we can see some data consisting of 18 golf scores.

We know that the median here is the average of the two middle values, 69 and 69. To find the median, we add the two 69s and divide by two, which gives 69.

In the calculation we work with three quartiles: the first quartile, the second quartile, and the third quartile.

The scores range from 66 (the lowest) to 75 (the highest), and the middle value, 69, is our median. If I now split the data into four parts, the first cut point is the first quartile, which can also be called the lower quartile.

Q1 = the median of the first (lower) half is the first quartile. It is also called the lower quartile. It is the 25th percentile of the data. [25% of the data falls below the lower quartile, and 75% of the data falls above it.]

Q2 = second quartile = median (50th percentile of the data). [The 50th percentile means that the first 50% of the data falls in the lower part and the remaining 50% falls in the upper part.]

Q3 = third quartile, upper quartile, or 75th percentile of the data.

Five-number summary = min, Q1, median, Q3, max (gives an idea of the center and the spread at the same time)

By examining the five-number summary, we can gain insights into the shape of the distribution. For example:

If the minimum and maximum are close together, the data is tightly clustered, indicating a smaller variance.
If Q1 and Q3 are roughly the same distance from the median, the middle of the data is roughly symmetric.
If Q3 is much farther above the median than Q1 is below it, it suggests a positively skewed distribution.
If Q1 is much farther below the median than Q3 is above it, it suggests a negatively skewed distribution.
The range between Q1 and Q3 (also known as the interquartile range, or IQR) provides information about the spread of the middle 50% of the data.

By considering these measures together, the five-number summary allows us to assess the dispersion, skewness, and central tendency of a dataset, giving us a sense of the distribution and variance.

IQR = Q3 - Q1

We can also identify outliers by using the Q1 and Q3 values.

Any value that falls outside the range from (Q1 - 1.5 × IQR) to (Q3 + 1.5 × IQR) can be considered an outlier.
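The five-number summary, the IQR, and the 1.5 × IQR outlier rule can be sketched with the standard library. The scores below are hypothetical, and `five_number_summary` is an illustrative helper. Note that there are several conventions for computing quartiles; this sketch assumes the "inclusive" method of `statistics.quantiles`, and other conventions can give slightly different values.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(data)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


# Hypothetical scores with one suspiciously large value
scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]

low, q1, median, q3, high = five_number_summary(scores)
print(low, q1, median, q3, high)  # 2 4.0 6.0 8.0 30

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(iqr, lower_fence, upper_fence)  # 4.0 -2.0 14.0

# Anything outside the fences is flagged as an outlier
outliers = [v for v in scores if v < lower_fence or v > upper_fence]
print(outliers)  # [30]
```

Here Q3 is no farther from the median than Q1, so the quartiles alone look symmetric; it is the maximum of 30, far beyond the upper fence, that gives the outlier away.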

Questions to GPT:

Why is a quartile called a quartile? What is its relationship with 4?

Quartiles are called quartiles because they divide a dataset into four equal parts or quarters. The three quartile values are the cut points, and each of the four resulting parts holds approximately 25% of the data.

There are three quartiles commonly used in statistics: the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3). Q2 is also known as the median. These quartiles help analyze the distribution and dispersion of data.

The first quartile (Q1) represents the value below which 25% of the data falls. It divides the lowest 25% of the data from the rest. The second quartile (Q2) is the median, representing the value below which 50% of the data falls, splitting the dataset into two equal halves. The third quartile (Q3) represents the value below which 75% of the data falls, dividing the upper 25% of the data from the rest.

The quartiles are commonly used in statistical analysis to understand the spread of data, identify outliers, and analyze the distribution of a dataset. They provide a way to summarize the distribution of data into four equal parts, offering insights into the relative positions and variability of values within a dataset.

Interquartile Range formula of a data set in excel

To calculate the interquartile range (IQR) of a data set in Excel, you can use the following formula:

=QUARTILE.INC(data_range,3)-QUARTILE.INC(data_range,1)

Here's how the formula works:

QUARTILE.INC(data_range, 1) represents the first quartile (25th percentile) of the data set.
QUARTILE.INC(data_range, 3) represents the third quartile (75th percentile) of the data set.
Subtracting the first quartile from the third quartile gives you the interquartile range.

Make sure to replace "data_range" in the formula with the actual range of data you want to calculate the interquartile range for.

For example, if your data is in cells A1 to A10, the formula would be:

=QUARTILE.INC(A1:A10,3)-QUARTILE.INC(A1:A10,1)

This formula will give you the interquartile range of the data set in Excel.

A business is trying to analyze its sales data from the past year. The data shows the amount of money made from each sale, sorted in ascending order. The business wants to find the three points that divide the dataset into four equal parts, in order to better understand the distribution of its sales. The data in USD is shown below: [10, 12, 15, 18, 20, 21, 22, 24, 25, 27, 28, 30, 32, 35, 38, 40, 42, 45, 50] What is the Interquartile Range for this data set?

To find the interquartile range (IQR) for the given data set, we need to calculate the first quartile (Q1) and the third quartile (Q3).

First, let's determine the position of the quartiles. Since we have 19 data points, we can use the following formulas:

Position of Q1 = (n + 1) / 4 = (19 + 1) / 4 = 5
Position of Q3 = 3 * (n + 1) / 4 = 3 * (19 + 1) / 4 = 15

Next, let's find the values corresponding to these positions in the sorted data set:

Q1 = The value at position 5 = 20
Q3 = The value at position 15 = 38

Finally, we can calculate the interquartile range (IQR) by subtracting Q1 from Q3:

IQR = Q3 - Q1 = 38 - 20 = 18

Therefore, the interquartile range for the given data set is 18 USD.
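The answer above uses the (n + 1) / 4 position rule. As a hedged cross-check, the sketch below runs the same sales data through `statistics.quantiles` with both of its methods. The "exclusive" method uses the (n + 1) positions and reproduces the result above. The "inclusive" method uses (n - 1) positions with interpolation, which, as far as I understand, is the convention behind Excel's QUARTILE.INC, so the Excel formula from the earlier answer would give a slightly different IQR for this data.

```python
import statistics

sales = [10, 12, 15, 18, 20, 21, 22, 24, 25, 27,
         28, 30, 32, 35, 38, 40, 42, 45, 50]

# (n + 1) position rule, as used in the worked answer above
q1_ex, _, q3_ex = statistics.quantiles(sales, n=4, method="exclusive")
iqr_exclusive = q3_ex - q1_ex
print(q1_ex, q3_ex, iqr_exclusive)  # 20.0 38.0 18.0

# (n - 1) position rule with interpolation between neighbouring values
q1_in, _, q3_in = statistics.quantiles(sales, n=4, method="inclusive")
iqr_inclusive = q3_in - q1_in
print(q1_in, q3_in, iqr_inclusive)  # 20.5 36.5 16.0
```

Both answers are legitimate; the difference comes only from the quartile convention, so it is worth stating which one was used.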

2.2 Measurements of Central Tendency - Mean, Median, and Mode

2023-05-29T12:14:48.982Z

The easiest way to understand the mean and the median: suppose I have five people aged 8, 6, 6, 10, and 85, and I want to play a movie. What age group should the movie be suitable for?

If I average their ages, I get 23.

This data set has four children and one elderly person. As a result the average is 23, but that does not mean everyone will enjoy a movie suited to 23-year-olds. This is where the concept of the median comes in.

The median is the age found in the middle once all the ages are sorted.

In this case the median is 8 years, because after sorting all the data the middle age is 8. So if we put on Cartoon Network, the 8-year-old will enjoy it, the 6-year-olds will enjoy it, the 10-year-old boy will enjoy it, and most likely the 85-year-old grandma will enjoy it too.
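The movie example can be checked with Python's `statistics` module. This is a small sketch using the same five ages; the mode is included because the section title mentions it, even though the note itself only works through the mean and the median.

```python
import statistics

ages = [8, 6, 6, 10, 85]  # the five ages from the example above

mean_age = statistics.mean(ages)      # pulled upward by the single 85
median_age = statistics.median(ages)  # middle value of 6, 6, 8, 10, 85
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age)    # 23
print(median_age)  # 8
print(mode_age)    # 6
```

One extreme value moves the mean from the children's range up to 23, while the median and the mode stay close to the ages most of the group actually has.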

2.1 Core Data Concepts

2023-05-29T10:17:12.212Z

Let's explore some core concepts and the vocabulary used to describe them:

Continuous vs. Discrete (Categorical)
Nominal vs. Ordinal
Structured vs. Unstructured
Population vs. Sample

Discrete Data:

Can only take certain values; there are no values "in between".
Car models: Toyota, Tesla, Ferrari.

In other words, these are values that cannot be fractions. There is no such thing as half a Toyota or half a Tesla, which is why the name of a car is a discrete data type.

For the same reason, the number of stairs in a house or the number of members in a family is also discrete data.

A dice roll can never be 3.5.

Continuous Data:

Can take any value; there is an "infinite" number of values in between any two values if you are able to measure precisely enough.

Notice how someone's height could fall in between two whole numbers, for example 172.5 cm.

Remember that while continuous data is numeric (160 kg), discrete data can be numeric (a dice roll of 2) or a string ("Blue"). Keep in mind that sometimes the context and framing of a dataset will decide whether you should think of data as continuous or discrete.

But what if the context is physics and the visible spectrum of light in wavelengths?

Do not confuse numeric and ordered discrete data with continuous data!

Nominal vs. Ordinal:

Nominal data is classified without a natural order or rank. For example, categories of discrete animals: dogs, cats, lizards, horses, etc.

Nominal data is very easy to recognize. Suppose my data is a collection of animals such as dog, cat, horse, and lizard. If someone asks me whether the cat is better or the dog, whether the lizard is worse or the horse, or which is better between the horse and the dog, there is no standard parameter for good or bad. That is why this kind of data is called nominal.

A good test for nominal data is whether it can be clearly sorted or not. Nominal data cannot be sorted. If there is no meaningful way to put your data in order, then it is nominal data.

Ordinal Data:

The word "ordinal" already contains the word "order", which means ordinal data can be divided or arranged in order. It can be sorted.

Data such as Hot, Mild, or Cold can be sorted. For example, what was the highest temperature in Dhaka over the past week? What was the lowest? On that basis we can arrange, or sort, the seven days of weather. That is why weather categories such as Cold, Mild, and Hot are ordinal.
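The sorting test for ordinal data can be sketched in Python. The week of weather labels below is made up, and the rank mapping `ORDER` is an assumption that simply encodes Cold < Mild < Hot. The point is that ordinal labels need an explicit order; plain alphabetical sorting does not know it.

```python
# Explicit rank for each ordinal label (assumed order: Cold < Mild < Hot)
ORDER = {"Cold": 0, "Mild": 1, "Hot": 2}

# Hypothetical weather labels for seven days
week = ["Hot", "Mild", "Cold", "Hot", "Mild", "Mild", "Cold"]

# Sort by rank, not by spelling
sorted_week = sorted(week, key=ORDER.get)
print(sorted_week)
# ['Cold', 'Cold', 'Mild', 'Mild', 'Mild', 'Hot', 'Hot']

# Alphabetical sorting puts 'Hot' before 'Mild', which is not the natural order
alphabetical = sorted(week)
print(alphabetical)
# ['Cold', 'Cold', 'Hot', 'Hot', 'Mild', 'Mild', 'Mild']
```

For nominal data such as animal names, no such rank mapping exists, so only the alphabetical (and meaningless) ordering would be available.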

Structured vs. Unstructured

We also need to understand that not all data is formatted nicely in a table or spreadsheet, and in some cases we don't even want it in a structured format!

Structured data is highly specific and is stored in a predefined format. For example: Excel spreadsheets, JSON files, XML files, or SQL databases follow a predefined format.

Unstructured data is not in a particular format. For example, video or text data doesn't need to follow any particular predefined structured format.

Be careful not to confuse computer-encoded file formats with "formatted data"! Just because text is in a PDF format doesn't make it structured data.

Although unstructured data is harder to work with, there are cases where unstructured data is exactly what we need.

For example: DALL-E 2 from OpenAI

Population vs. Sample

Population:

To understand the concepts of population and sample, it is very important to understand the context. If the context is a single class, then all the students in that class are the population. But if the context is the whole school, then all the students of any one class can never be the population.

Often however it is not possible to record data on an entire population. In this case we rely on a sample from the population, which is a subset of the members of the group.
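The idea of a sample as a subset of the population can be sketched with Python's `random` module. Everything here is hypothetical: the 1,000 "exam scores" are randomly generated stand-ins for a school's students, the sample size of 50 is an arbitrary choice rather than a recommended one, and the seed only makes the run repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: exam scores for all 1,000 students in a school
population = [rng.randint(40, 100) for _ in range(1000)]

# A sample is a subset of the population, drawn without replacement
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

Re-running with a different seed or sample size shows how much the sample mean can drift from the population mean, which is the motivation for studying sample sizes.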

Later on we'll discover that sample sizes are a well studied science. For example: How many students should we survey for a school of 1,000 students to get a representative sample?

If I want to run a survey of a school's 1,000 students, will a sample of one or two students do?

- No, it will not.

On the other hand, surveying all 1,000 students is not feasible either.

There is some science behind determining the sample size for a survey; this article is worth reading for that.

https://en.wikipedia.org/wiki/Sample_size_determination

1. Intro to Math for Data Science

2023-05-29T09:49:10.367Z

In this course, we'll cover:

Understanding Data Concepts
Measurements of Dispersion and Central Tendency
Different ways to visualize data
Permutations
Combinatorics
Bayes' Theorem
Random Variables
Joint Distributions
Covariance and Correlation
Probability Mass and Density Functions
Binomial, Bernoulli, and Poisson Distributions
Normal Distribution and Z-Scores
Sampling and Bias
Central Limit Theorem
Hypothesis Testing
Linear Regression
and much more!

1. Getting Started with Pandas

2023-03-26T05:50:10.823Z

Install Anaconda

Fire up Anaconda (a software bundle for Python)

Installed? How would you know? Find this-

To update conda (Anaconda's package manager) to the latest version:

conda update conda

Create Environment

conda info --envs [asks conda for information about environments; envs is a flag, which is why it is preceded by two dashes (--)]

The asterisk shows that the currently active environment is base.

If we install any packages now, they will be installed in the base environment.

Create a new environment first

conda create --name pandas_playground

To activate we need one last command

conda activate pandas_playground

Activated:

Install Pandas Packages:

conda install pandas jupyter bottleneck numexpr matplotlib