Machine Learning with PySpark

Apache Spark ships with MLlib, a library for running Machine Learning tasks on the Spark framework. Because Spark has a Python API, PySpark, you can use MLlib from Python as well. MLlib contains many algorithms and Machine Learning utilities.

In this tutorial, you will learn how to use Machine Learning in PySpark, working with the Fortune 500 dataset. The dataset contains information about the top 5 companies ranked by Fortune 500 in the year 2017, and this tutorial uses its first five fields.

In this Spark ML tutorial, you will use Machine Learning to find out which of these fields is the most important factor in predicting the ranking of the above-mentioned companies in the coming years. You will work with DataFrames throughout.

What is PySpark MLlib?

Basic Introduction to PySpark MLlib

MLlib is short for the Spark Machine Learning library. It is easy to use from PySpark and it scales, because it runs on distributed systems. You can use it for data analysis with Machine Learning techniques such as regression and classification.

Parameters in PySpark MLlib

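Some commonly cited parameters are listed below. They are arguments of the ALS (alternating least squares) recommender in pyspark.mllib.recommendation, not settings of MLlib as a whole:

  • Ratings: An RDD of ratings, given as Rating objects or as (user, product, rating) tuples.
  • Rank: The number of latent features (factors) the model uses.
  • Lambda: The regularization parameter (written lambda_ in Python).
  • Blocks: The number of blocks used to parallelize the computation. The default value is −1, which lets Spark choose the number automatically.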
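
To make Rank and Lambda more concrete, here is a minimal plain-Python sketch, not Spark code. It assumes the usual matrix factorization picture behind ALS: each user and each item gets a vector of length rank, a predicted rating is their dot product, and lambda weights an L2 penalty on those vectors. The function names and the toy data are hypothetical, and Spark's own objective differs in details such as how the penalty is scaled.

```python
def predict_rating(user_vec, item_vec):
    """Predicted rating: dot product of two factor vectors of length `rank`."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def regularized_loss(ratings, user_factors, item_factors, lam):
    """Squared error over the observed ratings plus lam times an L2 penalty."""
    error = sum(
        (r - predict_rating(user_factors[u], item_factors[i])) ** 2
        for u, i, r in ratings
    )
    penalty = sum(x * x for vec in user_factors.values() for x in vec)
    penalty += sum(x * x for vec in item_factors.values() for x in vec)
    return error + lam * penalty


# Hypothetical toy data: (user, item, rating) tuples and rank-2 factor vectors.
ratings = [(0, 0, 4.0), (0, 1, 1.0)]
user_factors = {0: [1.0, 2.0]}
item_factors = {0: [1.0, 2.0], 1: [1.0, 0.0]}

loss = regularized_loss(ratings, user_factors, item_factors, lam=0.1)
print(loss)
```

A larger rank gives the model more factors per user and item to fit the ratings with, while a larger lambda pushes the factor values toward zero to limit overfitting.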

Performing Linear Regression on a Real-world Dataset

Let’s understand Machine Learning better by writing complete code that performs linear regression on the dataset of the top 5 Fortune 500 companies in the year 2017.

Loading Data

As mentioned above, you are going to use a DataFrame that is created directly from a CSV file. The following are the commands to load data into a DataFrame and to view the loaded data.

  • Input:
    • In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
# Create the Spark entry point, then wrap it for DataFrame operations
sc = SparkContext()
sqlContext = SQLContext(sc)
    • In [2]:
company_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv')
company_df.take(1)

The argument passed to take() sets how many rows are returned. Here, only the first row is displayed.

  • Output:
    • Out[2]:
[Row(Rank=1, Title='Walmart', Website='http:/www.walmart.com', Employees=2300000, Sector='retailing')]
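
From Spark 2.0 onward, the same load can also be written with SparkSession and the built-in CSV reader. The sketch below is one possible form, not the tutorial's own code: the helper name load_companies, the application name, and the default relative path are assumptions for illustration, and the file is assumed to be the downloaded Fortune5002017.csv.

```python
def load_companies(spark, path):
    """Read a CSV file that has a header row into a DataFrame."""
    # header=True takes column names from the first line;
    # inferSchema=True makes Spark scan the data to pick column types.
    return spark.read.csv(path, header=True, inferSchema=True)


def main(path="Fortune5002017.csv"):
    # Imported here so load_companies can be reused with an existing session.
    from pyspark.sql import SparkSession

    # "fortune500" is a hypothetical application name.
    spark = SparkSession.builder.appName("fortune500").getOrCreate()
    company_df = load_companies(spark, path)
    print(company_df.take(1))  # a list holding the first Row
    spark.stop()
```

Call main() with the location of the CSV file on your machine to run the whole sequence.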

Data Exploration

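To check the data type of every column of a DataFrame and to print its schema in a tree format, you can use the following commands. cache() keeps the DataFrame in memory for reuse and returns it, so its one-line summary lists each column with its type; printSchema() prints the tree: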

  • Input:
    • In [3]:
company_df.cache()
company_df.printSchema()
  • Output:
    • Out[3]:
DataFrame[Rank: int, Title: string, Website: string, Employees: int, Sector: string]
root
 |-- Rank: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Website: string (nullable = true)
 |-- Employees: integer (nullable = true)
 |-- Sector: string (nullable = true)
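
The same type information is available programmatically through the DataFrame's dtypes attribute, which is a list of (column name, type name) pairs. As a small sketch, the hypothetical helper below picks out the numeric columns, which is often a first step when choosing features for a model. The set of type names treated as numeric is an assumption and is not exhaustive (decimal types, for example, are left out).

```python
# Spark SQL type names, as reported by DataFrame.dtypes, treated as numeric here.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(df):
    """Return the names of the columns whose dtype string is a numeric type."""
    return [name for name, dtype in df.dtypes if dtype in NUMERIC_TYPES]
```

For the schema shown above, numeric_columns(company_df) would return ['Rank', 'Employees'].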

Performing Descriptive Analysis

  • Input:
    • In [4]:
company_df.describe().toPandas().transpose()
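
describe() reports count, mean, stddev, min, and max for each numeric column, and toPandas().transpose() turns that summary into a table with one row per column. The plain-Python sketch below shows what those five statistics mean for a single column. It is an illustration and not Spark code; the function name and the employee counts are hypothetical and not taken from the dataset.

```python
import statistics


def describe_column(values):
    """Compute the five summary statistics reported for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical employee counts, not taken from the dataset.
employees = [100, 200, 300, 400, 500]
summary = describe_column(employees)
print(summary)
```

Note that toPandas() collects the result to the driver and needs the pandas package installed, which is fine here because the summary is only a handful of rows.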