Math for Data Science Masterclass
May 31, 2023

3.1 Visualization of Data

Visualizing data is a key aspect of data science!

It’s important to be able to convey information to others, especially people who may not have the full technical knowledge or understanding to view the raw data or statistical analysis.

As we continue learning about data visualization, you should always keep in mind:

“What is the information I want to share or story I am trying to tell? How does this visualization help in conveying that information to others?”

Data visualization is especially crucial in organizations because often decisions are made based on a final visualization or interpretation of the data.

Remember that the purpose of data science at an enterprise level is to use it to make key decisions and improve products or services!


To understand which data visualization to use, let’s take a quick tour of the data visualization categories we’ll cover in this section:

Scatter Plots

A typical scatter plot represents two dimensions (data features), for example the height vs. weight of a group of people. A scatter plot can reveal relationships between two data features.

Imagine the following data set:

Is there a relationship between tip and bill?

We put Total Bill on the x-axis and mark Tip on the y-axis.

After plotting the data, we see something like this:

From this we can say: Tip tends to increase as Total Bill increases.
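A minimal sketch of such a scatter plot is shown here. The bill and tip values are made up for illustration (they are not the course data set), and `plot_scatter` is a hypothetical helper name.

```python
# Hypothetical (made-up) restaurant data, for illustration only.
total_bill = [10.0, 15.0, 20.0, 25.0, 30.0, 40.0]
tip = [1.5, 2.0, 3.0, 3.5, 5.0, 6.0]


def plot_scatter(xs, ys):
    """Draw one point per (total bill, tip) pair. Requires Matplotlib."""
    import matplotlib.pyplot as plt

    plt.scatter(xs, ys)
    plt.xlabel('Total Bill ($)')
    plt.ylabel('Tip ($)')
    plt.title('Tip vs. Total Bill')
    plt.show()
```

Calling `plot_scatter(total_bill, tip)` draws the figure. Each point is one table, so an upward drift from left to right is what "tip tends to increase as total bill increases" looks like.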


Now we can add a trend line (also called a regression line, best-fit line, or least-squares line).

The trend line is drawn over the data points of the scatter plot, and its purpose is to highlight the general trend in the data. It can also be used to estimate or predict future data points, assuming they follow the observed trend.
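One way to compute such a line is the ordinary least-squares formula for a straight line, sketched below in plain Python. The data values and the function names `least_squares_line` and `plot_with_trend_line` are assumptions made for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def plot_with_trend_line(xs, ys):
    """Scatter plot with the fitted line drawn on top. Requires Matplotlib."""
    import matplotlib.pyplot as plt

    slope, intercept = least_squares_line(xs, ys)
    x_ends = [min(xs), max(xs)]
    plt.scatter(xs, ys, label='Observed')
    plt.plot(x_ends, [slope * x + intercept for x in x_ends],
             color='red', label='Trend line')
    plt.xlabel('Total Bill ($)')
    plt.ylabel('Tip ($)')
    plt.legend()
    plt.show()


# Hypothetical (made-up) data, for illustration only.
total_bill = [10, 20, 30, 40]
tip = [2, 3, 5, 6]

slope, intercept = least_squares_line(total_bill, tip)
# Use the fitted line to estimate the tip for a bill we have not observed.
predicted_tip = slope * 50 + intercept
print(f"tip = {slope:.2f} * total_bill + {intercept:.2f}")
print(f"Estimated tip for a $50 bill: {predicted_tip:.2f}")
```

For this toy data the slope is about 0.14 and the intercept about 0.50. Calling `plot_with_trend_line(total_bill, tip)` draws the points with the line over them.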

When many scatter points overlap (stack on top of one another), making them partly transparent helps. In most plotting libraries or software, you can set the alpha value to a decimal between 0 and 1, where 0 is completely transparent and 1 is fully opaque. Reducing the alpha value makes stacked points more transparent, so underlying points and dense regions remain visible.

Here's an example using Python's Matplotlib library:

import matplotlib.pyplot as plt
# Generate example data
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
stacked_values = [5, 15, 25, 35, 45]
# Plotting scatter points with transparency
plt.scatter(x, y, c='blue', alpha=0.5)
plt.scatter(x, stacked_values, c='red', alpha=0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Stacked Scatter Points with Transparency')
plt.show()

In this example, two sets of points are drawn on the same axes: the original data points (blue) and the stacked points (red). The alpha parameter is set to 0.5 for both, making them semi-transparent.

Adjust the alpha value according to your desired level of transparency.


Line Plots

Sometimes we already know there should be a continuous relationship between points along an axis. This is where we can use a simple line to indicate a known relationship between points along a data feature.

For instance, in our previous tip vs. total bill example, it would not make sense to draw a line between the points!

The person who came for lunch on Saturday has no connection to the person who came for dinner on Sunday night, so a line plot should not be used at all for that particular case.

So when is a line plot appropriate?

When we already know for certain that there is a continuous relationship between data points along a feature. Timestamps are a good example: we know for certain that time passes continuously between consecutive timestamped points.

We know flights occurred between years!

A line conveys that known continuity.

Line plots also make it easy to stack features:
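As a rough sketch of that idea, the following draws two series against the same time axis. The yearly counts, the series names, and the helper `plot_lines` are invented for illustration and do not come from the course data.

```python
# Hypothetical yearly counts for two features, made up for illustration.
years = [2019, 2020, 2021, 2022]
series = {
    'Domestic flights': [120, 60, 90, 130],
    'International flights': [80, 20, 45, 85],
}


def plot_lines(x, named_series):
    """Draw one line per feature on shared axes. Requires Matplotlib."""
    import matplotlib.pyplot as plt

    for label, values in named_series.items():
        # marker='o' keeps the actual observations visible on the line
        plt.plot(x, values, marker='o', label=label)
    plt.xticks(x)  # show whole years only on the x-axis
    plt.xlabel('Year')
    plt.ylabel('Number of flights')
    plt.legend()
    plt.show()
```

Calling `plot_lines(years, series)` draws both features on one figure, which makes them easy to compare over time.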

Distribution Plots

Distribution plots allow us to visualize the dispersion of data across a feature or variable. One of the most common ways to do this is through a histogram.

What is the distribution of Total Bill amounts?

We could answer this through some statistical metrics:

The mean Total Bill is $18.79, with a standard deviation of $8.90, a minimum of $3.07, and a maximum of $50.81.
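Metrics like these can be computed with Python's built-in `statistics` module. The sketch below uses a small made-up list of bills, so its numbers differ from the ones quoted above. It uses the sample standard deviation (`statistics.stdev`), which is an assumption, since the text does not say which variant was used.

```python
import statistics

# Hypothetical (made-up) bill amounts, for illustration only.
bills = [10.0, 20.0, 30.0, 40.0]

mean_bill = statistics.mean(bills)
std_bill = statistics.stdev(bills)  # sample standard deviation (divides by n - 1)
min_bill = min(bills)
max_bill = max(bills)

print(f"mean={mean_bill:.2f} std={std_bill:.2f} min={min_bill:.2f} max={max_bill:.2f}")
```

These four numbers summarize the data, but they do not show its shape, which is what a distribution plot adds.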

A histogram divides the range of values into intervals (bins), counts the number of occurrences that fall within each bin, and then draws a bar whose height equals that count.

Note that the x-axis feature is continuous. This must be the case for a histogram!
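The counting step behind a histogram can be sketched in a few lines of plain Python. The bill values, the bin edges, and the names `histogram_counts` and `plot_histogram` are assumptions for this illustration. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall in each bin defined by consecutive edges."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            # Half-open bins [left, right); the last bin also includes its right edge
            if left <= v < right or (i == last and v == right):
                counts[i] += 1
                break
    return counts


def plot_histogram(values, bin_edges):
    """Draw the histogram with the same bins. Requires Matplotlib."""
    import matplotlib.pyplot as plt

    plt.hist(values, bins=bin_edges, edgecolor='black')
    plt.xlabel('Total Bill ($)')
    plt.ylabel('Count')
    plt.show()


# Hypothetical (made-up) bill amounts and bin edges, for illustration only.
bills = [5, 12, 14, 18, 19, 21, 22, 27, 35, 50]
edges = [0, 10, 20, 30, 40, 50]

counts = histogram_counts(bills, edges)
print(counts)  # one count per bin: [1, 4, 3, 1, 1]
```

Calling `plot_histogram(bills, edges)` draws bars with these heights. Changing the bin edges changes the picture, so it is worth trying a few bin widths.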

There are many more types of plots to display distribution, such as box-and-whisker plots and KDE (Kernel Density Estimation) plots, but by far the most common you will encounter is the histogram.

  1. Distribution plots:
    • Suppose you have a dataset that contains customer ratings for three different product categories: electronics, clothing, and home appliances. The ratings range from 1 to 5, where 1 represents low satisfaction and 5 represents high satisfaction.
    • To analyze the distribution of ratings for each product category, you can create histograms for each category.
    • The histogram for the electronics category may show that a large proportion of customers gave ratings of 4 and 5, indicating higher satisfaction. It may also reveal a few customers who gave lower ratings, potentially indicating areas for improvement.
    • The histogram for the clothing category might show a more evenly distributed set of ratings, indicating moderate satisfaction across different customer preferences.
    • The histogram for the home appliances category could display a left-skewed distribution, with most ratings at 4 and 5 and a tail of fewer low ratings, indicating that most customers are highly satisfied.

  2. Categorical plots:
    • To further analyze the relationship between customer satisfaction and other categorical variables, such as gender and age group, you can create categorical plots.
    • A bar plot can be used to show the average rating for each gender group. This plot may reveal that females tend to have slightly higher satisfaction ratings compared to males.
    • A count plot can be used to show the number of ratings submitted by each age group. It might indicate that younger customers (e.g., the 18-25 age group) submit more ratings than older customers.
    • A categorical scatter plot can be used to examine the relationship between customer satisfaction ratings and the price range of products. Different markers or colors can represent different price ranges. This plot may show that customers who purchased products in the higher price range tend to give higher satisfaction ratings.

Categorical Plots

Categorical plots simply display some metric per category. For example, a mean value per category or a count per category. There are many variations of these types of plots, but one of the most common is the simple bar plot.

Be careful not to confuse a bar plot with a histogram! While they may appear similar, look carefully at the x-axis feature. For a bar plot, the feature is categorical, while for a histogram it is continuous!
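To make the "metric per category" idea concrete, here is a small sketch that computes a mean value per category and draws it as a bar plot. The (day, tip) records and the helper names `mean_per_category` and `plot_bar` are hypothetical.

```python
from collections import defaultdict


def mean_per_category(pairs):
    """Return {category: mean value} for an iterable of (category, value) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, value in pairs:
        totals[category] += value
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


def plot_bar(means):
    """Draw one bar per category. Requires Matplotlib."""
    import matplotlib.pyplot as plt

    plt.bar(list(means.keys()), list(means.values()))
    plt.xlabel('Day')          # categorical x-axis: this is a bar plot
    plt.ylabel('Mean Tip ($)')
    plt.show()


# Hypothetical (made-up) records of (day, tip), for illustration only.
records = [('Sat', 3.0), ('Sat', 5.0), ('Sun', 4.0),
           ('Sun', 2.0), ('Sun', 6.0), ('Fri', 2.5)]

means = mean_per_category(records)
print(means)
```

Calling `plot_bar(means)` draws the figure. The x-axis holds category labels with no numeric ordering or spacing between them, which is the key difference from a histogram's continuous, binned axis.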