How to Extract the Web to Get Data about the Top Rated Movies on TV?
This listing having these films would get stored within the SQLite database as well as emailed. That’s how you would never miss any blockbuster movies on TV again.
We will start with the online TV guide for finding films on different Belgian TV channels. However, you can easily adapt our code to utilize it for other websites. For making the life easier while extracting for films, ensure the site you wish to extract:
- has different HTML tags having a clear id or class
- utilizes ids and classes in a constant way
- provides well-structured URLs
- contains all applicable TV channels on single page
- has a different page every weekday
- only lists films as well as no other programs like news, live shows, reportage, etc. Except you can easily differentiate films from other program kinds.
With available results, we will scrape The Movie Database (TMDB) data for film ratings and other information.
We will extract the following details about films:
- Film Title
- TV Channel
- TMDB Rating
- The Time When a Film Starts
- The Date Film is on TV
- Release Date
- Plot
- Link To The Details Page On TMDB
- Genre
You can complement the list with different actors, director, interesting facts, and more– all the data you’d love to know about.
In Scrapy, the information would get stored in an item’s fields.
We assume here that you have already installed Scrapy. If not, just install it first.
When Scrapy gets installed, just open a command line as well as go to a directory wherever you wish to store a Scrapy project. After that, run:
scrapy startproject topfilms
It will make a folder structure for top films project given below. You could ignore a topfilms.db file now.
Define Scrapy Items
We would be dealing with a file items.py as it is made by default while making a Scrapy project.
A scrapy.Item is the container, which will get filled during web scraping. This will hold different fields, which we wish to scrape from web page(s). Different contents of an Item could get accessed in similar way like the Python dict.
Then, open items.py as well as add Scrapy.Item class using following fields:
import scrapy class TVGuideItem(scrapy.Item): title = scrapy.Field() channel = scrapy.Field() start_ts = scrapy.Field() film_date_long = scrapy.Field() film_date_short = scrapy.Field() genre = scrapy.Field() plot = scrapy.Field() rating = scrapy.Field() tmdb_link = scrapy.Field() release_date = scrapy.Field() nb_votes = scrapy.Field()
Process Items using Pipelines
After beginning a new Scrapy project, you would get a file named pipelines.py. Open the file as well as copy-paste a code given below. Then, we will show step-by-step what every part of a code does.
import sqlite3 as lite con = None # db connection class StoreInDBPipeline(object): def __init__(self): self.setupDBCon() self.dropTopFilmsTable() self.createTopFilmsTable() def process_item(self, item, spider): self.storeInDb(item) return item def storeInDb(self, item): self.cur.execute("INSERT INTO topfilms(\ title, \ channel, \ start_ts, \ film_date_long, \ film_date_short, \ rating, \ genre, \ plot, \ tmdb_link, \ release_date, \ nb_votes \ ) \ VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )", ( item['title'], item['channel'], item['start_ts'], item['film_date_long'], item['film_date_short'], float(item['rating']), item['genre'], item['plot'], item['tmdb_link'], item['release_date'], item['nb_votes'] )) self.con.commit() def setupDBCon(self): self.con = lite.connect('topfilms.db') self.cur = self.con.cursor() def __del__(self): self.closeDB() def createTopFilmsTable(self): self.cur.execute("CREATE TABLE IF NOT EXISTS topfilms(id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \ title TEXT, \ channel TEXT, \ start_ts TEXT, \ film_date_long TEXT, \ film_date_short TEXT, \ rating TEXT, \ genre TEXT, \ plot TEXT, \ tmdb_link TEXT, \ release_date TEXT, \ nb_votes \ )") def dropTopFilmsTable(self): self.cur.execute("DROP TABLE IF EXISTS topfilms") def closeDB(self): self.con.close()
Initially, we begin by importing SQLite package as well as provide that an alias lite. Also, we initialize an adjustable con that is utilized for a database connection.
Create a Class for Storing Items in a Database
After that, you make a class using the logical name. After allowing the pipeline within a settings file, this class would be named.
class StoreInDBPipeline(object):
Define a Constructor Method
The constructor method is a method having name __init__. It is run automatically while making an example of a StoreInDBPipeline class.
def __init__(self): self.setupDBCon() self.dropTopFilmsTable() self.createTopFilmsTable()
In this constructor method, we have launched three other techniques that are given below a constructor method.
SetupDBCon Method
Using this method called setupDBCon, we have created a topfilms database as well as make connection of that with a connect function.
def setupDBCon(self): self.con = lite.connect('topfilms.db') self.cur = self.con.cursor()
Now, we utilize an alias lite for SQLite package. After that, we make a Cursor object having a cursor function. Using the Cursor object, we can perform SQL statements in a database.
DropTopFilmsTable Method
Another method is a dropTopFilmsTable method. Here, it drops a table in a SQLite database.
Every time a web scraper runs a database would get completely removed. If you wish to do it also. If you wish to do a few inquiring or analysis of a films’ data, you can keep scraping results of every run.
We just wish to see top-rated films of coming days as well as nothing more. So, we have decided to delete a database in every run.
def dropTopFilmsTable(self): self.cur.execute("DROP TABLE IF EXISTS topfilms")
Using a Cursor object cur we have executed a DROP statement. Let’s understand more about this.
CreateTopFilmsTable Method
After dropping a top films table, we require to make it. That is done with a last method call using a constructor method.
def createTopFilmsTable(self): self.cur.execute("CREATE TABLE IF NOT EXISTS topfilms(id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \ title TEXT, \ channel TEXT, \ start_ts TEXT, \ film_date_long TEXT, \ film_date_short TEXT, \ rating TEXT, \ genre TEXT, \ plot TEXT, \ tmdb_link TEXT, \ release_date TEXT, \ nb_votes \ )")
After that, we utilize a Cursor object cur for executing CREATE TABLE statement. Different fields, which are added in table top films are similar like in a Scrapy Item that we have created before. For keeping things easy, we utilize exactly similar names in a SQLite table like in an Item. Only id field is additional.
Note: Any good application to search at the SQLite databases is a SQLite Manager plug-in within Firefox. You could watch any SQLite Manager tutorial on YouTube to find how to utilize this plugin.
Process_item Method
This technique must get implemented in a Pipeline class as well as this must return the dict, a DropItem or Item exception. In the web scraper, we would return an item.
def process_item(self, item, spider): self.storeInDb(item) return item
In contrast to other methods given, this has two additional arguments. The item, which was extracted as well as the spider, which extrcted the items. Using this method, we have launch a storeInDb method as well as afterwards returned the item.
StoreInDb Method
The given method implements the INSERT statement for inserting the extracted items into a SQLite database.
def storeInDb(self, item): self.cur.execute("INSERT INTO topfilms(\ title, \ channel, \ start_ts, \ film_date_long, \ film_date_short, \ rating, \ genre, \ plot, \ tmdb_link, \ release_date, \ nb_votes \ ) \ VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )", ( item['title'], item['channel'], item['start_ts'], item['film_date_long'], item['film_date_short'], float(item['rating']), item['genre'], item['plot'], item['tmdb_link'], item['release_date'], item['nb_votes'] )) self.con.commit()
Different values for a table fields are available from an item that is the argument for the given method. All the values are known as the dict value (keep in mind that an Item is not more than the dict).
Every Constructor Comes with a Destructor!
The equivalent of a constructor method is a destructor method having a name __del__. In this destructor method for a pipelines class, we have closed the connection with a database.
def __del__(self): self.closeDB()
CloseDB Method
def closeDB(self): self.con.close()
Using the last method, we have closed a database connection using a close function. Therefore, we have written a well-functional pipeline. Still, there is one last step left for helping the pipeline.
Allowing a Pipeline in settings.py
Open a settings.py file as well as add the given code:
ITEM_PIPELINES = { 'topfilms.pipelines.StoreInDBPipeline':1 }
The integer values indicate the order at which pipelines are running. As we get merely one pipeline, we have assigned it with value 1.
Making a Spider in Scrapy
Now, we would be looking at core of the Scrapy. That is where heavy lifting of the web scraper would be done. We will show here step-by-step about how to make one.
Importing the Required Packages
Initially, we’ll import all necessary modules and packages. We utilize a CrawlSpider module for following links right through an online TV guide.
LinkExtractor and Rule are used for determining which links are needed to follow.
A config module has a few constants including DOM_1, DOM_2 as well as START_URL which are utilized in a Spider. A config module has found one directory equal to a current directory and that’s why you can see a couple of dots before a config module.
In the last, we import TVGuideItem as it would be utilized to have the information during scraping.
import scrapy from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor from fuzzywuzzy import fuzz from ..config import * from topfilms.items import TVGuideItem
Instructing a Spider about Where to Go
After that, we subclass a CrawlSpider class. That is done through inserting CrawlSpider like an argument for a TVGuideSpider class.
We provide a Spider with name, offer allowed_domains (e.g. themoviedb.org) as well as start_urls. A start_urls is within our case a web page of a TV guide, therefore you need to change that by your individual preferred website.
Using rules as well as deny argument, we inform the Spider about which URLs are not required to follow on a start URL. The not to URLs are specified within a regular appearance.
We are not interested in films, which were shown yesterday, so we don't permit the Spider to follow the URLs ending at “gisteren“.
So, now the questions is, which URLs should a Spider follow? To do that, we have used a restrict_xpaths argument. This tells to follow all the URLs with class=”button button–beta”. All these are in fact the URLs having films for coming week.
In the end, with a callback argument, we allow a Spider to know what to perform when this is following one of URLs. This will implement a function parse_by_day. We will explain it later.
class TVGuideSpider(CrawlSpider): name = "tvguide" allowed_domains = [DOM_1, DOM_2] start_urls = [START_URL] # Extract the links from the navigation per day # We will not crawl the films for yesterday rules = ( Rule(LinkExtractor(allow=(), deny=(r'\/gisteren'), restrict_xpaths=('//a[@class="button button--beta"]',)), callback="parse_by_day", follow= True), )
Parse the Followed URLs
The parse_by_day function is a part of TVGuideScraper, extracts webpages with an overview of different films for every channel every day. The response argument is available from a Request which has been launched while running a data scraping program.
On the webpage, having scraped you need to discover HTML elements that are used to show data, which we are associated with. Some good tools for it include Chrome Developer Tools as well as Firebug plugin used in Firefox.
A thing we wish to store is date for films that we are extracting. This date could be available in a paragraph (p) within the div having class="grid__col__inner". Evidently, it is something that you need to modify for a page that you are extracting.
With xpath method of a Response object, we scrape text within a paragraph. We have learned a lot from this with the tutorial about how to utilize a xpath function.
Through utilizing extract_first, we ensure that we are not storing the date like a list. Or else, it will provide us problems while storing date in a SQLite database.
Subsequently, we have performed some cleaning of data on film_date_long as well as make film_date_short using a format YYYYMMDD. We created the YYYYMMDD format for sorting films chronologically afterwards.
Then, a TV channel gets scraped. In case, this is within a list of ALLOWED_CHANNELS (well-defined in a config module), we will continue to extract a title as well as starting time. The information gets stored an item that is started by TVGuideItem().
After that, we wish to continue extracting on The Movie Database. We would utilize a URL https://www.themoviedb.org/search?query= for showing search results for a film getting scraped. For this URL, we wish to add a film title (url_part within the code). We just re-use a URL part, which is available in a link on a TV guide’s web page.
Using this URL, we make a newer request as well as continue on the TMDB. Using request.meta['item'] = item , we have added already extracted data to a request. That way, we could continue filling up your current TVGuideItem.
Yield request in fact launches a request.
def parse_by_day(self, response): film_date_long = response.xpath('//div[@class="grid__col__inner"]/p/text()').extract_first() film_date_long = film_date_long.rsplit(',',1)[-1].strip() # Remove day name and white spaces # Create a film date with a short format like YYYYMMDD to sort the results chronologically film_day_parts = film_date_long.split() months_list = ['januari', 'februari', 'maart', 'april', 'mei', 'juni', 'juli', 'augustus', 'september', 'oktober', 'november', 'december' ] year = str(film_day_parts[2]) month = str(months_list.index(film_day_parts[1]) + 1).zfill(2) day = str(film_day_parts[0]).zfill(2) film_date_short = year + month + day for col_inner in response.xpath('//div[@class="grid__col__inner"]'): chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract_first() if chnl in ALLOWED_CHANNELS: for program in col_inner.xpath('.//div[@class="program"]'): item = TVGuideItem() item['channel'] = chnl item['title'] = program.xpath('.//div[@class="title"]/a/text()').extract_first() item['start_ts'] = program.xpath('.//div[@class="time"]/text()').extract_first() item['film_date_long'] = film_date_long item['film_date_short'] = film_date_short detail_link = program.xpath('.//div[@class="title"]/a/@href').extract_first() url_part = detail_link.rsplit('/',1)[-1] # Extract information from the Movie Database www.themoviedb.org request = scrapy.Request("https://www.themoviedb.org/search?query="+url_part,callback=self.parse_tmdb) request.meta['item'] = item # Pass the item with the request to the detail page yield request
Scrape Additional Data on The Movie Database
As you could see in a request made in the functions parse_by_day, we utilize a callback function parse_tmdb. The function is utilized during a request to extract a TMDB website.
In the initial step, we get item data, which was passed through parse_by_day function.
A page having search results about TMDB could possibly list different search results for similar film titles (url_part passed within the query). We also observe if there are results having if tmddb_titles.
We utilize the fuzzywuzzy package for doing fuzzy matching on film titles. To utilize a fuzzywuzzy package, we require to add an import statement alongside the past import statements.
from fuzzywuzzy import fuzz
In case, we get a match of 90%, we utilize search results to perform rest of extracting. We won’t look at any other search results. To perform that, we utilize a break statement.
After that, we collect rating, genre, as well as release_date from search result pages in the similar way that we used an xpath function earlier. To get the YYYYMMDD format for any release date, we have executed data processing using a join and split functions.
Yet again, we need to launch the new requests to details’ page on the TMDB. The request would call parse_tmdb_detail function for scraping a film plot as well as total votes on the TMDB. It is explained within next section.
def parse_tmdb(self, response): item = response.meta['item'] # Use the passed item tmdb_titles = response.xpath('//a[@class="title result"]/text()').extract() if tmdb_titles: # Check if there are results on TMDB for tmdb_title in tmdb_titles: match_ratio = fuzz.ratio(item['title'], tmdb_title) if match_ratio > 90: item['genre'] = response.xpath('.//span[@class="genres"]/text()').extract_first() item['rating'] = response.xpath('//span[@class="vote_average"]/text()').extract_first() release_date = response.xpath('.//span[@class="release_date"]/text()').extract_first() release_date_parts = release_date.split('/') item['release_date'] = "/".join( [release_date_parts[1].strip(), release_date_parts[0].strip(), release_date_parts[2].strip()]) tmdb_link = "https://www.themoviedb.org" + response.xpath( '//a[@class="title result"]/@href').extract_first() item['tmdb_link'] = tmdb_link # Extract more info from the detail page request = scrapy.Request(tmdb_link, callback=self.parse_tmdb_detail) request.meta['item'] = item # Pass the item with the request to the detail page yield request break # We only consider the first match else: return
Scrape a Film Plot from Details Page
The last function that we will talk about is the short one. Before that, we get an item passed through a parse_tmdb function as well as extract details page for a plot as well as number of votes.
Here, we are completing scraping of information for a film. The items for a film is totally filled. Then, Scrapy will use a code written within a pipelines to procedure these data as well as put that in a database.
def parse_tmdb_detail(self, response): item = response.meta['item'] # Use the passed item item['nb_votes'] = response.xpath('//span[@itemprop="ratingCount"]/text()').extract_first() item['plot'] = response.xpath('.//p[@id="overview"]/text()').extract_first() yield item
Here, in the section of Pipelines, we have already seen how we have stored the extracting results in the SQLite database. Currently, we will exhibit how you could send the extraction results through email. That way, you will get a good overview about the best films for coming weeks in the mailbox.
Import of Required Packages
We would be working with a file called extensions.py. The file is automatically made in a root directory while creating a Scrapy project. We begin by importing packages that we will utilize later in a file.
import logging from scrapy import signals from scrapy.exceptions import NotConfigured import smtplib import sqlite3 as lite from config import *
A logging package is not required. However, this package could be useful in debugging a program or only to write information in the log.
The signals module would assist us in know when a spider has been opened as well as closed. We would send an email with films after a spider has completed the job.
From the scrapy.exceptions component, we have imported a method called NotConfigured. That will get raised while an extension isn’t organized in a settings.py file. So, a parameter MYEXT_ENABLED has to set to True. We would see that later in a code.
A smtplib package gets imported to send emails. We have used a Gmail address for sending the email, however, you can adapt a code in the config.py to utilize other email service.
Finally, we import a sqlite3 package for scraping top-rated films from a database as well as import config for getting constants.
Create a SendEmail Class in Extensions
Initially, we define a logger object. Having the object, we could write messages to a log at definite events. After that, we make a SendEmail class having a constructor method. With constructor, we have assigned FROMADDR as well as TOADDR to corresponding attributes about the class. All the constants are set with a config.py file. We have used our Gmail addressed for both the attributes.
logger = logging.getLogger(__name__) class SendEmail(object): def __init__(self): self.fromaddr = FROMADDR self.toaddr = TOADDR
Instantiating an Extension Object
The initial method of a SendEmail object is from_crawler. The initial check we have done is if MYEXT_ENABLED is allowed in a settings.py file. In case, that is not a case, we advance with the NotConfigured exception. While this happens, rest of codes in an extension isn’t performed.
In a settings.py file, we require to add these following codes to allow the extension.
MYEXT_ENABLED = True EXTENSIONS = { 'topfilms.extensions.SendEmail': 500, 'scrapy.telnet.TelnetConsole': None }
Therefore, we set a Boolean flag MYEXT_ENABLED with True. After that, we added our individual extension SendEmail to an EXTENSIONS dictionary. The integer values of 500 specifies a order in which extension must get executed. We also need to disable TelnetConsole. Otherwise sending an email did not work. The extension get disabled with putting None rather than integer order values.
After that, we instantiate an extension object having a cls() function. For the extension object, we bond some signals. We are involved in a spider_opened as well as spider_closed signals and finally, we return an ext object.
@classmethod def from_crawler(cls, crawler): # first check if the extension should be enabled and raise # NotConfigured otherwise if not crawler.settings.getbool('MYEXT_ENABLED'): raise NotConfigured # instantiate the extension object ext = cls() # connect the extension object to signals crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened) crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed) # return the extension object return ext
Define Actions in a spider_opened Event
When a spider has get opened, we just wish to write that to a log. So, we utilize a logger object that we have created on top of a code. Having the info technique, we write the message to a log. Spider.name gets replaced by a name that we have defined in a TVGuideSpider.py file.
def spider_opened(self, spider): logger.info("opened spider %s", spider.name)
Send an Email After spider_closed event
With the last method about SendEmail class, we had sent an email having an overview with the top-rated films.
Also, we send the notifications to a log that a spider is closed. Secondly, we make a connection to SQLite database having all films of coming week to ALLOWED_CHANNELS. We choose films with the rating >= 6.5. You may change ratings to a lower or higher threshold as you want. Then, the result films are sorted by film_date_short that has YYYYMMDD format as well as by starting time at start_ts.
We draw all the rows in a cursor cur as well as check if we have a few results in a len function. It is quite possible to get no results while setting the threshold ratings too high, for instance.
Having for row in data, we just go through every resulting film. We scrape all interesting data from a row. For a few data, we have applied a few encodings with encode('ascii','ignore'). It is to overlook a few of special characters including é, è, à, etc. Otherwise, we have errors while sending the emails.
If all the data about a film is collected, we comprise a string variable called topfilm. Then, every topfilm is concatenated to a variable topfilms_overview that will be a message of an email we send. In case, we have no films in the query results, we mention that in the short message.
In the end, we send a message having a Gmail address because of a smtplib package.
def spider_closed(self, spider): logger.info("closed spider %s", spider.name) # Getting films with a rating above a threshold topfilms_overview = "" con = lite.connect('topfilms.db') cur = con.execute( "SELECT title, channel, start_ts, film_date_long, plot, genre, release_date, rating, tmdb_link, nb_votes " "FROM topfilms " "WHERE rating >= 6.5 " "ORDER BY film_date_short, start_ts") data = cur.fetchall() if len(data) > 0: # Check if we have records in the query result for row in data: title = row[0].encode('ascii', 'ignore') channel = row[1] start_ts = row[2] film_date_long = row[3] plot = row[4].encode('ascii', 'ignore') genre = row[5] release_date = row[6].rstrip() rating = row[7] tmdb_link = row[8] nb_votes = row[9] topfilm = ' - '.join([title, channel, film_date_long, start_ts]) topfilm = topfilm + "\r\n" + "Release date: " + release_date topfilm = topfilm + "\r\n" + "Genre: " + str(genre) topfilm = topfilm + "\r\n" + "TMDB rating: " + rating + " from " + nb_votes + " votes" topfilm = topfilm + "\r\n" + plot topfilm = topfilm + "\r\n" + "More info on: " + tmdb_link topfilms_overview = "\r\n\r\n".join([topfilms_overview, topfilm]) con.close() if len(topfilms_overview) > 0: message = topfilms_overview else: message = "There are no top rated films for the coming week." msg = "\r\n".join([ "From: " + self.fromaddr, "To: " + self.toaddr, "Subject: Top Films Overview", message ]) username = UNAME password = PW server = smtplib.SMTP(GMAIL) server.ehlo() server.starttls() server.login(username, password) server.sendmail(self.fromaddr, self.toaddr, msg) server.quit()
Results of Sending Emails through Extensions
The final results about this code is the overview having top-rated films within your mailbox. Wonderful! Now, you don’t need to look at this any longer on an online TV guide.
Whenever you make different requests in the shortest time, you risk getting banned by a server. In the final section, we will show some tricks for avoiding IP banning.
Delay Your Requests
One easy way of avoiding IP banning is taking a pause between every request. With Scrapy, it can be done with setting the parameters in a settings.py file. If you have probably observed, the settings.py file is having many parameters mentioned out.
Search for a parameter DOWNLOAD_DELAY as well as uncomment that. We set a pause length of 2 seconds. Subject to how many requests you need to make, you could change this. However, we would set that at a minimum of 1 second.
DOWNLOAD_DELAY=2
More Superior Way of Avoiding IP Banning
By default, every time you do any request, you can do that with similar user agent. Because of a package fake_useragent, we could easily change a user agent for every request.
All the credits about this piece for code going to Alecxe that wrote the nice Python scripts to utilize a fake_useragent package.
Initially, we make a folder scrapy_fake_useragent within a root directory of the web scraping project. In the given folder, we have added two files:
For using this middleware, we want to allow it in a settings.py file. It is done using a code:
DOWNLOADER_MIDDLEWARES = { 'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None, 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400, }
Initially, we disable default UserAgentMiddleware about Scrapy by specifying none rather than integer values. After that, we allow our own middleware RandomUserAgentMiddleware. Spontaneously, middleware is the piece of code, which is performed during the request.
In a file middleware.py, we add a code for randomizing a user agent for every request. Ensure you have a fake_useragent set installed. From a fake_usergent set, we import a UserAgent module. It contains a list of various user agents. With a constructor of RandomUserAgentMiddleware class, we have instantiated a UserAgent object. Using the method process_request, we have set a user agent with the random user agents from an object with a header of a request.
from fake_useragent import UserAgent class RandomUserAgentMiddleware(object): def __init__(self): super(RandomUserAgentMiddleware, self).__init__() self.ua = UserAgent() def process_request(self, request, spider): request.headers.setdefault('User-Agent', self.ua.random)
We hope that you have a clear view of how to utilize Scrapy for web scraping projects.
For more information about scraping the web for top-rated movies on TV, contact 3i Data Scraping or ask for a free quote!