Priya Reddy

Machine Learning in R for Beginners with Example

2021-09-21T08:11:23.563Z

Machine Learning with R

Machine learning is the present and the future! From Netflix’s recommendation engine to Google’s self-driving car, it’s all machine learning. This blog on Machine Learning with R helps you understand the core concepts of machine learning, walks through the different types of machine learning algorithms, and shows how to implement those algorithms with R.

This blog on “Machine Learning with R” comprises these sections:

Understanding Machine Learning
Types of Machine Learning Algorithms
Implementing Machine Learning Algorithms with R

Understanding Machine Learning

How do you know all of those are fish?

As a kid, you might have come across a picture of a fish and you would have been told by your kindergarten teachers or parents that this is a fish and it has some specific features associated with it like it has fins, gills, a pair of eyes, a tail and so on. Now, whenever your brain comes across an image with that set of features, it automatically registers it as a fish because your brain has learned that it is a fish.

That’s how our brain functions but what about a machine? If the same image is fed to a machine, how will the machine identify it to be a fish?

This is where Machine Learning comes in. We’ll keep on feeding images of a fish to a computer with the tag “fish” until the machine learns all the features associated with a fish.

Once the machine learns all the features associated with a fish, we will feed it new data to determine how much it has learned.

In other words, Raw Data/Training Data is given to the machine, so that it learns all the features associated with the Training Data. Once the learning is done, it is given New Data/Test Data to determine how well the machine has learned.

Let us move ahead in this Machine Learning with R blog and understand the types of Machine Learning.

Types of Machine Learning

Supervised Learning:

A Supervised Learning algorithm learns from a known, labeled data-set (Training Data) in order to make predictions.

Regression and Classification are some examples of Supervised Learning.

#Classification:

Classification determines which of a set of categories a new observation belongs to, i.e. a classification algorithm learns all the features and labels of the training data and, when new data is given to it, it has to assign labels to the new observations depending on what it has learned from the training data.

For this example, if the first observation is given the label “Man” then it is rightly classified but if it is given the label “Woman”, the classification is wrong. Similarly for the second observation, if the label given is “Woman”, it is rightly classified, else the classification is wrong.

#Regression:

Regression is a supervised learning algorithm which helps in determining how one variable influences another variable.

Over here, “living_area” is the independent variable and “price” is the dependent variable, i.e. we are determining how “price” varies with respect to “living_area”.

Unsupervised Learning:

An Unsupervised Learning algorithm draws inferences from data which does not have labels.

Clustering is an example of unsupervised learning. “K-means”, “Hierarchical”, “Fuzzy C-Means” are some examples of clustering algorithms.

In this example, the set of observations is divided into two clusters. Clustering is done on the basis of similarity between the observations. There is a high intra-cluster similarity and low inter-cluster similarity i.e. there is a very high similarity between all the buses but low similarity between the buses and cars.

Reinforcement Learning:

Reinforcement Learning is a type of machine learning algorithm where the machine/agent in an environment learns ideal behavior in order to maximize its performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

Let’s take pacman for example. As long as pacman keeps eating food, it earns points, but when it crashes against a monster it loses its life. Thus pacman learns that it needs to eat more food and avoid monsters so as to improve its performance.

Implementing Machine Learning with R:

Linear Regression:

We’ll be working with the diamonds data-set to implement the linear regression algorithm:

Description of the data-set:

Prior to building any model on the data, we are supposed to split the data into “train” and “test” sets. The model will be built on the “train” set and its accuracy will be checked on the “test” set.

We need to load the “caTools” package to split the data into two sets. The diamonds data-set itself ships with the “ggplot2” package, so we load that as well.

library(ggplot2)
library(caTools)

“caTools” package provides a function “sample.split()” which helps in splitting the data.

sample.split(diamonds$price,SplitRatio = 0.65)->split_index

65% of the observations from the price column have been assigned the “TRUE” label and the remaining 35% have been assigned the “FALSE” label.

subset(diamonds,split_index==T)->train
subset(diamonds,split_index==F)->test

All the observations which have the “TRUE” label have been stored in the “train” object and those observations having the “FALSE” label have been assigned to the “test” set.
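Because the split is random, each run produces a different “train” and “test” set unless the random seed is fixed first. Here is a minimal base-R sketch of the same idea. It assumes the built-in mtcars data-set as a stand-in (it is not the data used in this blog) and uses a plain random split with sample() rather than sample.split(); the names train_demo and test_demo are made up for illustration.

```r
# Illustrative only: mtcars stands in for the blog's data-set.
set.seed(123)                        # fix the seed so the split is repeatable

n <- nrow(mtcars)
train_rows <- sample(n, size = floor(0.65 * n))   # 65% of the row indices

train_demo <- mtcars[train_rows, ]   # rows chosen for training
test_demo  <- mtcars[-train_rows, ]  # every remaining row is held out

nrow(train_demo)
nrow(test_demo)
```

Calling set.seed() before sample.split() should make that split repeatable in the same way.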

Now that the splitting is done and we have our “train” and “test” sets, it’s time to build the linear regression model on the training set.

We’ll be using the “lm()” function to build the linear regression model on the “train” data. We are determining the price of the diamonds with respect to all other variables of the data-set. The built model is stored in the object “mod_regress”.

lm(price~.,data = train)->mod_regress

Now that we have built the model, we need to make predictions on the “test” set. The “predict()” function is used to get predictions. It takes two arguments: the built model and the test set. The predicted results are stored in the “result_regress” object.

predict(mod_regress,test)->result_regress

Let’s bind the actual price values from the “test” data-set and the predicted values into a single data-set using the “cbind()” function. The new data-frame is stored in “Final_Data”

cbind(Actual=test$price,Predicted=result_regress)->Final_Data

as.data.frame(Final_Data)->Final_Data

A glance at the “Final_Data”, which comprises actual values and predicted values:

Let’s find the error by subtracting the predicted values from the actual values and add this error as a new column to the “Final_Data”:

(Final_Data$Actual- Final_Data$Predicted)->error

cbind(Final_Data,error)->Final_Data

A glance at the “Final_Data”, which now also includes the error in prediction:

Now, we’ll go ahead and calculate the “Root Mean Square Error”, which gives an aggregate error for all the predictions:

rmse1<-sqrt(mean(Final_Data$error^2))

rmse1
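To make the formula concrete, here is a small worked sketch of the same calculation. The helper name rmse and the four actual/predicted values are made up purely for illustration; they are not taken from the diamonds data.

```r
# Root mean square error: square the errors, average them, take the square root.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Invented numbers, for illustration only.
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 380)

# Errors are -10, 10, -10, 20, so the squared errors average to 175.
rmse(actual, predicted)   # sqrt(175), roughly 13.23
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.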

Going ahead, let’s build another model, so that we can compare the accuracy of both these models and determine which is a better one.

We’ll build a new linear regression model on the “train” set but this time, we’ll be dropping the ‘y’ and ‘z’ columns from the independent variables, i.e. the “price” of the diamonds is determined by all the columns except ‘y’ and ‘z’.

The model built is stored in “mod_regress2”:

lm(price~.-y-z,data = train)->mod_regress2

The predicted results are stored in “result_regress2”

predict(mod_regress2,test)->result_regress2

Actual and Predicted values are combined and stored in “Final_Data2”:

cbind(Actual=test$price,Predicted=result_regress2)->Final_Data2

as.data.frame(Final_Data2)->Final_Data2

Let’s also add the error in prediction to “Final_Data2”

(Final_Data2$Actual- Final_Data2$Predicted)->error2

cbind(Final_Data2,error2)->Final_Data2

A glance at “Final_Data2”:

Finding Root Mean Square Error to get the aggregate error:

rmse2<-sqrt(mean(Final_Data2$error2^2))

We see that “rmse2” is marginally less than “rmse1” and hence the second model is marginally better than the first model.

Classification:

We’ll be working with the “car_purchase” data-set to implement recursive partitioning which is a classification algorithm.

Let’s split the data into “train” and “test” sets using “sample.split()” function from “caTools” package.

library(caTools)

65% of the observations from ‘Purchased’ column will be assigned “TRUE” labels and the rest will be assigned “FALSE” labels.

sample.split(car_purchase$Purchased,SplitRatio = 0.65)->split_values

All those observations which have “TRUE” label will be stored into ‘train’ data and those observations having “FALSE” label will be assigned to ‘test’ data.

subset(car_purchase,split_values==T)->train_data

subset(car_purchase,split_values==F)->test_data

Time to build the Recursive Partitioning algorithm:

We’ll start off by loading the ‘rpart’ package:

library(rpart)

“Purchased” column will be the dependent variable and all other columns are the independent variables i.e. we are determining whether the person has bought the car or not with respect to all other columns. The model is built on the “train_data” and the result is stored in “mod1”.

rpart(Purchased~.,data = train_data)->mod1

Let’s plot the result:

plot(mod1,margin = 0.1)
text(mod1,pretty = T,cex=0.8)

Now, let’s go ahead and predict the results on “test_data”. We are giving the built rpart model “mod1” as the first argument, the test set “test_data” as the second argument and prediction type as “class” for the third argument. The result is stored in ‘result1’ object.

predict(mod1,test_data,type = "class")->result1

Let’s evaluate the accuracy of the model using “confusionMatrix()” function from caret package.

library(caret)
confusionMatrix(table(test_data$Purchased,result1))

The confusion matrix tells us that out of the 90 observations where the person did not buy the car, 79 observations have been rightly classified as “No” and 11 have been wrongly classified as “Yes”. Similarly, out of the 50 observations where the person actually bought the car, 47 have been rightly classified as “Yes” and 3 have been wrongly classified as “No”.

We can find the accuracy of the model by dividing the correct predictions by the total predictions, i.e. (79+47)/(79+47+11+3) = 126/140 = 0.9.
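The same arithmetic can be checked in R by rebuilding the table from the four counts quoted above. This is only a sketch: the matrix is typed in by hand (rows as actual values, columns as predicted values) instead of coming from confusionMatrix(), and the object names cm and accuracy are illustrative.

```r
# Counts quoted above, entered by hand (rows = actual, columns = predicted).
cm <- matrix(c(79, 11,
                3, 47),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual    = c("No", "Yes"),
                             Predicted = c("No", "Yes")))

# Correct predictions sit on the diagonal of the table.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # 126 / 140 = 0.9
```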

K-Means Clustering:

We’ll work with “iris” data-set to implement k-means clustering:

Let’s remove the “Species” column and create a new data-set which comprises only the first four columns from the ‘iris’ data-set.

iris[1:4]->iris_k

Let us take the number of clusters to be 3. The “kmeans()” function takes the input data and the number of clusters into which the data is to be clustered. The syntax is: kmeans(data, k), where k is the number of cluster centers.

kmeans(iris_k,3)->k1

Analyzing the clustering:

str(k1)

The str() function gives the structure of the kmeans object, which includes components such as withinss and betweenss. Analyzing these tells you how well kmeans has performed.

betweenss : Between-cluster sum of squares, i.e. how far apart the clusters are from each other (inter-cluster separation)

withinss : Within-cluster sum of squares, one value per cluster, i.e. how tightly packed each cluster is (intra-cluster similarity)

tot.withinss : Sum of the withinss of all the clusters, i.e. the total intra-cluster variation

A good clustering will have a lower value of “tot.withinss” and a higher value of “betweenss”, both of which depend on the number of clusters ‘k’ chosen initially.
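As a rough sketch of how these values can guide the choice of ‘k’, the snippet below fits k-means on the four numeric iris columns for several values of k and compares “tot.withinss”. The seed, the nstart = 10 setting and the range 1 to 6 are arbitrary choices for illustration, not recommendations from this blog.

```r
set.seed(42)                       # k-means starts from random centers
iris_demo <- iris[1:4]             # numeric columns only

fit <- kmeans(iris_demo, centers = 3, nstart = 10)

fit$tot.withinss                   # total within-cluster sum of squares
fit$betweenss                      # between-cluster sum of squares

# The two components add up to the total sum of squares.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# tot.withinss for k = 1..6; it generally keeps shrinking as k grows,
# so look for the point where the improvement flattens out (the "elbow").
wss <- sapply(1:6, function(k) {
  kmeans(iris_demo, centers = k, nstart = 10)$tot.withinss
})
round(wss, 1)
```

Since “tot.withinss” tends to keep falling as k increases, the smallest value on its own is not a useful criterion; the elbow in the curve is the usual heuristic.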

The time is ripe to become an expert in Machine Learning to take advantage of new opportunities that come your way. This brings us to the end of this “Machine Learning” blog. I hope this blog was informative and fruitful.

How to Learn Python for Data Science In 5 Steps

2021-09-19T05:21:03.168Z

Why Learn Python For Data Science?

Before we explore how to learn Python for data science, we should briefly answer why you should learn Python in the first place.

In short, understanding Python is one of the valuable skills needed for a data science career.

Though it hasn’t always been, Python is the programming language of choice for data science.

Data science experts expect this trend to continue with increasing development in the Python ecosystem. And while your journey to learn Python programming may be just beginning, it’s nice to know that employment opportunities are abundant (and growing) as well.

According to Indeed, the average salary for a Data Scientist is $121,583.

The good news? That number is only expected to increase, as demand for data scientists is expected to keep growing. In 2021-2022, there are three times as many job postings in data science as job searches for data science, according to Quanthub. That means the demand for data scientists is vastly outstripping the supply.

So, the future is bright for data science, and Python is just one piece of the proverbial pie. Fortunately, learning Python and other programming fundamentals is as attainable as ever. We’ll show you how in five simple steps.

But remember – just because the steps are simple doesn’t mean you won’t have to put in the work. If you apply yourself and dedicate meaningful time to learning Python, you have the potential to not only pick up a new skill, but potentially bring your career to a new level.

How to Learn Python for Data Science

First, you’ll want to find the right course to help you learn Python programming. Dataquest’s courses are specifically designed for you to learn Python for data science at your own pace, challenging you to write real code and use real data in our interactive, in-browser interface.

In addition to learning Python in a course setting, your journey to becoming a data scientist should also include soft skills. Plus, there are some complementary technical skills we recommend you learn along the way.

Step 1: Learn Python Fundamentals

Everyone starts somewhere. This first step is where you’ll learn Python programming basics. You’ll also want an introduction to data science.

One of the important tools you should start using early in your journey is Jupyter Notebook, which comes prepackaged with Python libraries to help you learn these two things.

Kickstart your learning by: Joining a community

By joining a community, you’ll put yourself around like-minded people and increase your opportunities for employment. According to the Society for Human Resource Management, employee referrals account for 30% of all hires.

Related skills: Try the Command Line Interface

The Command Line Interface (CLI) lets you run scripts more quickly, allowing you to test programs faster and work with more data.

Step 2: Practice Mini Python Projects

We truly believe in hands-on learning. You may be surprised by how soon you’ll be ready to build small Python projects. We've already put together a great guide to Python projects for beginners, which includes ideas like:

Tracking and Analyzing Your Personal Amazon.com Spending Habits — A fun project that'll help you practice Python and pandas basics while also giving you some real insight into your personal finance.
Analyze Data from a Survey — Find public survey data or use survey data from your own work in this beginner project that'll teach you to drill down into answers to mine insights.
Try one of our Guided Projects — Interactive Python projects for every skill level that use real data and offer guidance while still challenging you to apply your skills in new ways.

But that's just the tip of the iceberg, really. You can try programming things like calculators for an online game, or a program that fetches the weather from Google in your city. You can also build simple games and apps to help you familiarize yourself with working with Python.

Building mini projects like these will help you learn Python. Programming projects like these are standard for all languages, and a great way to solidify your understanding of the basics.

You should start to build your experience with APIs and begin web scraping. Beyond helping you learn Python programming, web scraping will be useful for you in gathering data later.

Kickstart your learning by: Reading

Enhance your coursework and find answers to the Python programming challenges you encounter. Read guidebooks, blog posts, and even other people’s open source code to learn Python and data science best practices – and get new ideas.

Automate The Boring Stuff With Python by Al Sweigart is an excellent and entertaining resource. But we've put together an entire list of data science ebooks that are totally free for you to check out, too. Highlights include:

The Data Science Handbook — A great collection of interviews with working data scientists that'll give you a better idea of what real data science work is like and how you can succeed in the field.
Python Libraries for Machine Learning — A helpful guide that's also available in a convenient format, so you can dive in and run all the sample code for yourself.
Elements of Statistical Learning — A massive and recently-updated statistics textbook that can serve as a great reference as you're learning Python to make sure your work is statistically valid.

Step 3: Learn Python Data Science Libraries

Unlike some other programming languages, in Python, there is generally a best way of doing something. The three best and most important Python libraries for data science are NumPy, Pandas, and Matplotlib.

We've put together a helpful guide to the 15 most important Python libraries for data science, but here are a few that are really critical for any data work in Python:

NumPy — A library that makes a variety of mathematical and statistical operations easier; it is also the basis for many features of the pandas library.
pandas — A Python library created specifically to facilitate working with data, this is the bread and butter of a lot of Python data science work.
Matplotlib — A visualization library that makes it quick and easy to generate charts from your data.
scikit-learn — The most popular library for machine learning work in Python.

NumPy and Pandas are great for exploring and playing with data. Matplotlib is a data visualization library that makes graphs like you’d find in Excel or Google Sheets.

Kickstart your learning by: Asking questions

You don’t know what you don’t know!

Python has a rich community of experts who are eager to help you learn Python. Online Q&A sites and community forums are full of people excited to share their knowledge and help you learn Python programming. We also have an FAQ for each lesson to help with questions you encounter throughout your programming courses.

Related skills: Use Git for version control

Git is a popular tool that helps you keep track of changes made to your code, which makes it much easier to correct mistakes, experiment, and collaborate with others.

Step 4: Build a Data Science Portfolio as you Learn Python

For aspiring data scientists, a portfolio is a must.

These projects should include work with several different datasets and should leave readers with interesting insights that you’ve gleaned. Some types of projects to consider:

Data Cleaning Project — Any project that involves dirty or "unstructured" data that you clean up and analyze will impress potential employers, since most real-world data is going to require cleaning.
Data Visualization Project — Making attractive, easy-to-read visualizations is both a programming and a design challenge, but if you can do it right, your analysis will be considerably more impactful. Having great-looking charts in a project will make your portfolio stand out.
Machine Learning Project — If you aspire to work as a data scientist, you definitely will need a project that shows off your ML chops (and you may want a few different machine learning projects, with each focused on your use of a different popular algorithm).

Your analysis should be presented clearly and visually, so that technical folks can read your code, but non-technical people can also follow along with your charts and written explanations.

Your portfolio doesn’t necessarily need a particular theme. Find datasets that interest you, then come up with a way to put them together. However, if you aspire to work at a particular company or industry, showcasing projects relevant to that industry in your portfolio is a good idea.

Displaying projects like these gives fellow data scientists an opportunity to potentially collaborate with you, and shows future employers that you’ve truly taken the time to learn Python and other important programming skills.

One of the nice things about data science is that your portfolio doubles as a resume while highlighting the skills you’ve learned, like Python programming.

Kickstart your learning by: Communicating, collaborating, and focusing on technical competence

During this time, you’ll want to make sure you’re cultivating those soft skills required to work with others, making sure you really understand the inner workings of the tools you’re using.

Related skills: Learn beginner and intermediate statistics

While learning Python for data science, you’ll also want to get a solid background in statistics. Understanding statistics will give you the mindset you need to focus on the right things, so you’ll find valuable insights (and real solutions) rather than just executing code.

Step 5: Apply Advanced Data Science Techniques

Finally, aim to sharpen your skills. Your data science journey will be full of constant learning, but there are advanced courses you can complete to ensure you’ve covered all the bases.

You’ll want to be comfortable with regression, classification, and k-means clustering models. You can also step into machine learning – bootstrapping models and creating neural networks using scikit-learn.

At this point, programming projects can include creating models using live data feeds. Machine learning models of this kind adjust their predictions over time.

Remember to: Keep learning!

Data science is an ever-growing field that spans numerous industries.

At the rate that demand is increasing, there are exponential opportunities to learn. Continue reading, collaborating, and conversing with others, and you’re sure to maintain interest and a competitive edge over time.

How Long Will It Take To Learn Python?

After reading these steps, the most common question we have people ask us is: “How long does all this take?”

There are a lot of estimates for how long it takes to learn Python. For data science specifically, estimates range from three months to a year of consistent practice.

We’ve watched people move through our courses at lightning speed and others who have taken it much slower.

Really, it all depends on your desired timeline, free time that you can dedicate to learn Python programming and the pace at which you learn.

The courses are created for you to go at your own speed. Each path is full of lessons, hands-on learning and opportunities to ask questions so that you can get an in-depth mastery of data science fundamentals.

Get started for free. Learn Python with our Data Scientist path and start mastering a new skill today!

Where Can I Learn Python for Data Science?

There are tons of Python learning resources out there, but if you're looking to learn it for data science, it's best to choose somewhere that teaches about data science specifically.

This is because Python is also used in a variety of other programming disciplines from game development to mobile apps. Generic "learn Python" resources try to teach a bit of everything, but this means you'll be learning quite a few things that aren't actually relevant to data science work.

Moreover, working on something that doesn't feel connected to your goals can feel really demotivating. If you want to be doing data analysis and instead you're struggling through a course that's teaching you to build a game with Python, it's going to be easy to get frustrated and quit.

There are lots of free Python tutorials and beginner books out there. If you don't want to pay to learn Python, these can be a good option, and they are available across many difficulty levels and focus areas.

If you're serious about it, though, it may be best to find a platform that'll teach you interactively, with a curriculum that's been constructed to guide you through your data science learning journey.

Is Python Necessary in the Data Science Field?

It's possible to work as a data scientist using either Python or R. Each language has its strengths and weaknesses, and both are widely used in the industry. Python is more popular overall, but R dominates in some industries (particularly in academia and research).

To do data science work, you'll definitely need to learn at least one of these two languages. It doesn't have to be Python, but it does have to be one of either Python or R.

(Of course, you'll also have to learn some SQL no matter which of Python or R you pick to be your primary programming language).

Is Python Better than R for Data Science?

This is a constant topic of discussion in data science, but the true answer is that it depends on what you're looking for, and what you like.

R was built with statistics and mathematics in mind, and there are amazing packages that make it easy to use for data science. It also has a very supportive online community.

Python is a much better language for all-around work, meaning that your Python skills would be more transferable to other disciplines. It's also slightly more popular, and some would argue that it's the easier of the two to learn (although plenty of R folks would disagree).

Rather than reading opinions, try a beginner resource for each language and see which one looks more approachable to you.

How is Python Used for Data Science?

Programming languages like Python are used at every step in the data science process. For example, a data science project workflow might look something like this:

1. Using Python and SQL, you write a query to pull the data you need from your company database.
2. Using Python and the pandas library, you clean and sort the data into a dataframe (table) that's ready for analysis.
3. Using Python and the pandas and matplotlib libraries, you begin analyzing, exploring, and visualizing the data.
4. After learning more about the data through your exploration, you use Python and the scikit-learn library to build a predictive model that forecasts future outcomes for your company based on the data you pulled.
5. You arrange your final analysis and your model results into an appropriate format for communicating with your coworkers.

Python is used at almost every step along the way!

Machine Learning Interview Questions Asked in Top Companies

2021-09-16T10:00:03.950Z

This page will guide you to brush up on the skills of machine learning to crack the interview.

Here, our focus will be on real-world scenario ML interview questions asked at Microsoft, Amazon, etc., and how to answer them.

Let’s get started!

Firstly, machine learning refers to the process of training a computer program to build a statistical model based on data. The goal of machine learning (ML) is to identify the key patterns in data and turn them into key insights.

For example, if we have a historical dataset of actual sales figures, we can train machine learning models to predict sales for the coming future.

Why is the Machine Learning trend emerging so fast?

Machine learning solves real-world problems. Instead of relying on hard-coded rules to solve a problem, machine learning algorithms learn from the data.

The learnings can later be used to predict the future. It is paying off for early adopters.

A full 82% of enterprises adopting machine learning and Artificial Intelligence (AI) have gained a significant financial advantage from their investments.

According to Deloitte, companies have an impressive median ROI of 17%.

1. Why was Machine Learning Introduced?

The simplest answer is to make our lives easier. In the early days of “intelligent” applications, many systems used hardcoded rules of “if” and “else” decisions to process data or adjust the user input. Think of a spam filter whose job is to move the appropriate incoming email messages to a spam folder.

But with machine learning algorithms, we supply ample data, and the algorithm learns and identifies the patterns from that data.

Unlike with conventional programming, we don’t need to write new rules for each problem in machine learning; we just need to use the same workflow with a different dataset.

Let’s talk about Alan Turing. In his 1950 paper, “Computing Machinery and Intelligence”, Turing asked, “Can machines think?”

The paper describes the “Imitation Game”, which includes three participants -

A human acting as a judge,
Another human, and
A computer that attempts to convince the judge that it is human.

The judge asks the other two participants to talk. While they respond, the judge needs to decide which response came from the computer. If the judge cannot tell the difference, the computer wins the game.

The test continues today as an annual competition in artificial intelligence. The aim is simple enough: convince the judge that they are chatting to a human instead of a computer chatbot program.

2. What are Different Types of Machine Learning algorithms?

There are various types of machine learning algorithms. They can be grouped into broad categories based on:

Whether they are trained with human supervision (supervised, unsupervised, or reinforcement learning)
The criteria in the diagram below are not exclusive; we can combine them in any way we like.

3. What is Supervised Learning?

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consists of a set of training examples.

Example: 01

Identifying the gender of a person, knowing their height and weight. Below are the popular supervised learning algorithms.

Support Vector Machines
Regression
Naive Bayes
Decision Trees
K-nearest Neighbour Algorithm
Neural Networks

Example: 02

If you build a T-shirt classifier, the labels will be “this is an S, this is an M and this is L”, based on showing the classifier examples of S, M, and L.

4. What is Unsupervised Learning?

Unsupervised learning is also a type of machine learning, used to find patterns in a given set of data. Here, we don’t have any dependent variable or label to predict. Unsupervised learning algorithms:

Clustering
Anomaly Detection
Neural Networks
Latent Variable Models

Example:

In the same T-shirt example, clustering will group the shirts into categories such as “collar style and V neck style”, “crew neck style” and “sleeve types”.

5. What is ‘Naive’ in a Naive Bayes?

The Naive Bayes method is a supervised learning algorithm. It is naive because, when applying Bayes’ theorem, it assumes that all attributes are independent of each other given the class.

Bayes’ theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naive conditional independence assumption that, for all $i$,

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y),$$

this relationship is simplified to:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Since $P(x_1, \dots, x_n)$ is a constant given the input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

We can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.

The different naive Bayes classifiers mainly differ by the assumptions they make regarding the distribution of $P(x_i \mid y)$, which can be Bernoulli, multinomial, Gaussian, and so on.
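To see the classification rule at work, here is a small hand-rolled sketch in base R that scores each class as the prior times the product of the per-feature conditional probabilities. The toy data, the column names and the helper nb_score are all invented for illustration, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```r
# Toy training data, invented for illustration.
toy <- data.frame(
  outlook = c("sunny", "sunny", "rainy", "rainy", "sunny", "rainy"),
  windy   = c("yes",   "no",    "yes",   "no",    "no",    "yes"),
  play    = c("no",    "yes",   "no",    "yes",   "yes",   "no")
)

# Score one class: prior P(y) times the product of P(x_i | y).
nb_score <- function(class, outlook, windy) {
  rows  <- toy[toy$play == class, ]
  prior <- nrow(rows) / nrow(toy)          # relative frequency of the class
  prior * mean(rows$outlook == outlook) * mean(rows$windy == windy)
}

# Unnormalised scores for a new observation: outlook = sunny, windy = no.
scores <- sapply(c("yes", "no"), nb_score, outlook = "sunny", windy = "no")

scores / sum(scores)        # normalised posterior probabilities
names(which.max(scores))    # predicted class (the arg max)
```

For this observation the score for “yes” is 0.5 × 2/3 × 1 = 1/3, while the score for “no” is 0 because no “no” row has windy = "no", so the prediction is “yes”.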

6. What is PCA? When do you use it?

Principal component analysis (PCA) is most commonly used for dimension reduction.

In this case, PCA finds new axes (principal components), each a linear combination of the original variables (the columns in the table), ordered by how much of the variation in the data they capture. Components that capture little variation can be dropped, as illustrated in the figure below:

Principal component analysis (PCA)

Keeping only the first few components makes the dataset easier to visualize. PCA is used in finance, neuroscience, and pharmacology.

It is very useful as a preprocessing step, especially when there are linear correlations between features.
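As a brief sketch of dimension reduction in practice, the snippet below runs PCA on the four numeric columns of the built-in iris data with base R's prcomp() and keeps the first two components. Standardising the columns and keeping exactly two components are choices made here for illustration.

```r
# Standardise the four numeric iris columns, then rotate onto principal components.
pca <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component.
summary(pca)$importance["Proportion of Variance", ]

# Keep only the first two components: 4 columns reduced to 2.
iris_2d <- pca$x[, 1:2]
dim(iris_2d)
```

The reduced matrix iris_2d can be passed straight to plot() for a two-dimensional view of the data.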

7. Explain SVM Algorithm in Detail

A Support Vector Machine (SVM) is a very powerful and versatile supervised machine learning model, capable of performing linear or non-linear classification, regression, and even outlier detection.

Suppose we are given some data points that each belong to one of two classes, and the goal is to separate the two classes based on a set of examples.

In SVM, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a linear classifier.

There are many hyperplanes that could classify the data. A reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes.
If such a hyperplane exists, it is known as a maximum-margin hyperplane and the linear classifier it defines is known as a maximum-margin classifier. In the figure, the best hyperplane that divides the data is H3.

We have data $(x_1, y_1), \dots, (x_n, y_n)$, where each $x_i$ has $p$ features $(x_{i1}, \dots, x_{ip})$ and $y_i$ is either 1 or -1.

The equation of the hyperplane H3 is the set of points $x$ satisfying:

$$w \cdot x - b = 0$$

where $w$ is the normal vector of the hyperplane. The parameter $b / \lVert w \rVert$ determines the offset of the hyperplane from the origin along the normal vector $w$.

So for each $i$, $x_i$ lies on the side of the margin given by its label, 1 or -1. That is, $x_i$ satisfies:

$$w \cdot x_i - b \ge 1 \ \text{ if } y_i = 1, \qquad \text{or} \qquad w \cdot x_i - b \le -1 \ \text{ if } y_i = -1$$
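The two margin conditions (the score w · x_i - b is at least 1 for points labelled 1 and at most -1 for points labelled -1) can be combined into the single check y_i (w · x_i - b) >= 1. The base-R sketch below verifies that check for a hand-picked hyperplane; the four points and the values of w and b are made up for illustration and are not the output of a fitted SVM.

```r
# Four 2-D points with labels +1 / -1 (made up for illustration).
X <- matrix(c( 2, 2,
               3, 3,
               0, 0,
              -1, 0), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)

# A hand-picked candidate hyperplane w . x - b = 0 (not a fitted model).
w <- c(0.5, 0.5)
b <- 1

scores <- as.vector(X %*% w) - b   # w . x_i - b for every point
scores

# Both margin conditions collapse into y_i * (w . x_i - b) >= 1.
all(y * scores >= 1)

# Width of the margin between the two classes is 2 / ||w||.
2 / sqrt(sum(w^2))
```

Training an SVM amounts to searching for the w and b that satisfy these constraints while making the margin 2 / ||w|| as wide as possible.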

8. What are Support Vectors in SVM?

A Support Vector Machine (SVM) is an algorithm that tries to fit a line (or plane or hyperplane) between the different classes that maximizes the distance from the line to the points of the classes.

In this way, it tries to find a robust separation between the classes. The support vectors are the points closest to the dividing hyperplane, lying on the edge of the margin, as in the figure below.

9. What are Different Kernels in SVM?

Four commonly used kernels in SVM are:

Linear kernel - used when data is linearly separable.
Polynomial kernel - used when you have discrete data that has no natural notion of smoothness.
Radial basis kernel - creates a decision boundary able to do a much better job of separating two classes than the linear kernel when the data is not linearly separable.
Sigmoid kernel - based on the same kind of function that is used as an activation function in neural networks.
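Each kernel is simply a function that takes two data points and returns a similarity score. The base-R sketch below writes out the standard textbook form of the four kernels listed above; the function names and the default values for degree, gamma and coef0 are assumptions chosen for illustration, not the defaults of any particular SVM package.

```r
# Each kernel maps a pair of vectors to a similarity score.
linear_kernel  <- function(x, z) sum(x * z)
poly_kernel    <- function(x, z, degree = 3, coef0 = 1) (sum(x * z) + coef0)^degree
rbf_kernel     <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))
sigmoid_kernel <- function(x, z, gamma = 0.5, coef0 = 0) tanh(gamma * sum(x * z) + coef0)

x <- c(1, 2)
z <- c(2, 0)

linear_kernel(x, z)    # dot product: 1*2 + 2*0 = 2
poly_kernel(x, z)      # (2 + 1)^3 = 27
rbf_kernel(x, z)       # exp(-0.5 * 5) = exp(-2.5)
sigmoid_kernel(x, z)   # tanh(0.5 * 2) = tanh(1)
```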

10. What is Cross-Validation?

Cross-validation is a method of estimating how well a model generalizes by repeatedly splitting your data into training and held-out testing parts. Data is split into k subsets, and the model is trained on k-1 of those subsets.

The last subset is held out for testing. This is repeated so that each of the subsets serves as the test set once. This is k-fold cross-validation. Finally, the scores from all k folds are averaged to produce the final score.
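The procedure can be sketched in a few lines of base R. The example below is an illustration under assumptions: it uses the built-in mtcars data, a simple linear model of mpg on wt and hp, k = 5, and RMSE as the score, none of which come from the text above.

```r
set.seed(1)                       # fold assignment is random
k <- 5
n <- nrow(mtcars)

# Assign every row to one of k folds at random.
folds <- sample(rep(1:k, length.out = n))

fold_rmse <- sapply(1:k, function(i) {
  train_part <- mtcars[folds != i, ]           # k-1 folds for training
  test_part  <- mtcars[folds == i, ]           # 1 fold held out for testing
  fit  <- lm(mpg ~ wt + hp, data = train_part)
  pred <- predict(fit, newdata = test_part)
  sqrt(mean((test_part$mpg - pred)^2))         # RMSE on the held-out fold
})

fold_rmse          # one score per fold
mean(fold_rmse)    # averaged cross-validated score
```

Every row is used for testing exactly once and for training k-1 times, which is why the averaged score is less dependent on one lucky or unlucky split.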