Automate Routine Like a Programmer
Lately, I've been using Notion a lot to take notes and work with tabular data. This tool provides rich functionality, and its databases stand out in particular.
I'm also currently watching the lecture streams from the Yandex Summer Schools Open Lecture Hall, and given the number of lectures, Notion databases are a convenient way to keep track of the lessons I've studied. However, entering the data for each lesson by hand is quite tedious, so I could not resist automating this process with web scraping and the Notion API.
The functionality of this small project is quite simple: a Python script reads data from the Yandex schedule page using the Beautiful Soup 4 library, then processes the received information about lectures and saves it to the Notion database.
Prerequisites
Before writing the actual code, you should create a new Notion integration on the Notion integrations page. Once the integration is created, you will receive a secret token for using the Notion API. Copy this token into a config.toml file in the root of the project. The config file should also contain the Notion database ID, which can be found in the URL of the database page you are working with.
In addition, don't forget to add the integration to the database you are working with to allow the script to edit the content.
You should also retrieve your cookie and browser User-Agent by opening DevTools on https://yandex.ru/yaintern/schools/open-lectures and inspecting the headers of the initial request. In the end, your config file should look like this:
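Here is a sketch based on the keys the script reads below, with placeholder values instead of the real ones:
YANDEX_COOKIE = "<cookie copied from the DevTools request headers>"
USER_AGENT = "<your browser's User-Agent string>"
[Notion]
SECRET = "<Notion integration secret token>"
DATABASE_ID = "<database ID from the database page URL>"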
Then, a new Python project must be initialized. I highly recommend using a tool called Poetry, which drastically simplifies working with virtual environments and dependencies.
poetry init
To add third-party packages to the project, use:
poetry add requests beautifulsoup4 toml
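Later, once the script is written, it can be run inside the Poetry-managed virtual environment (main.py here is just an illustrative file name):
poetry run python main.py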
The Code
Imports
As for imports, HTTPAdapter and Retry (shipped with requests and urllib3) are used to build a session that retries failed requests. From the standard library we need deepcopy, datetime, and unicodedata.normalize; the latter cleans up the text extracted from the Yandex Lecture Hall schedule page mentioned above. Among the third-party packages, toml reads the config file, Beautiful Soup 4 parses the HTML content, and requests makes the HTTP requests, both to the schedule page and to the Notion API.
from copy import deepcopy
from requests.adapters import HTTPAdapter
from unicodedata import normalize
from urllib3.util.retry import Retry
from datetime import datetime
from bs4 import BeautifulSoup
import requests
import toml
Reading config and specifying constants
First, the config is read. API_DATABASE_QUERY and API_PAGE_CREATE are Notion API endpoints. The HEADERS variable holds the request headers for accessing the Notion API, and LESSON_TRACKS defines the categories of Yandex lectures as specified on the website.
config = toml.load("config.toml")
NOTION_DATABASE_ID = config["Notion"]["DATABASE_ID"]
NOTION_SECRET = config["Notion"]["SECRET"]
LECTURES_SCHEDULE_URL = "https://yandex.ru/yaintern/schools/open-lectures"
API_DATABASE_QUERY = f"https://api.notion.com/v1/databases/{NOTION_DATABASE_ID}/query"
API_PAGE_CREATE = "https://api.notion.com/v1/pages"
HEADERS = {
    "Authorization": f"Bearer {NOTION_SECRET}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"
}
LESSON_TRACKS = [
    "interfaces development",
    "backend (Python)",
    "backend (Java)",
    "backend (C++)",
    "backend (Go)",
    "mobile (Android)",
    "mobile (iOS)",
    "mobile (Flutter)",
    "management",
    "marketing",
    "product analytics"
]
Getting data from Yandex
This function, like the other functions I am going to create, uses a session object to make requests. The session will be defined later; essentially, it is a requests session configured to retry failed requests. When requesting the Yandex Open Lecture Hall website, a cookie and a browser User-Agent must be provided, because captcha checks may otherwise prevent you from reading the data.
def get_lessons_data(session):
    """Get lessons data from Yandex."""
    data = session.get(LECTURES_SCHEDULE_URL, headers={
        "Cookie": config["YANDEX_COOKIE"].encode(),
        "User-Agent": config["USER_AGENT"]
    })
    return BeautifulSoup(data.content, features="html.parser")
Using BeautifulSoup to scrape data from the page
Using the selectors provided by the Beautiful Soup library, we can find elements by their class names and access their attributes and inner text. This way, we can get the containers with data about each lecture and then retrieve the lecture titles, dates, descriptions, links to the YouTube streams, and lists of speakers. The data for each lecture is saved to a dictionary and appended to a list.
def scrape_lessons(soup):
    """Scrape lesson data from the schedule page."""
    lessons = []
    for el in soup.find_all("div", class_="lc-events-program__container"):
        data = el.find_all("div", class_="lc-styled-text__text")
        lesson_title = normalize("NFKD", data[1].text)
        lesson_desc = normalize("NFKD", data[2].text).strip().replace("\n", " ")
        lesson_date = data[0].text[:5].strip()
        lesson_link = el.find("a").attrs["href"] if el.find("a") else None
        lesson_speakers = []
        speakers_el = el.find_all("div", class_="lc-events-speaker__name")
        for speaker in speakers_el:
            lesson_speakers.append(speaker.text)
        lessons.append({
            "title": lesson_title,
            "link": lesson_link,
            "description": lesson_desc,
            "date": lesson_date,
            "speakers": lesson_speakers
        })
    return lessons
Processing lessons data
The list is then processed. On the Yandex Open Lecture Hall website, each track starts with an opening ceremony held on the 6th of June, so the lectures can be split into tracks by this exact date.
Also, the date field must be transformed into an ISO date string like "2023-06-06" (instead of "06.06"). This can be accomplished with the datetime module.
Moreover, the extra ?feature=share query parameter can be removed from the video links to make the Notion database look cleaner.
def process_lessons(lessons):
    """Process scraped lessons."""
    cur_track = -1
    for lesson in lessons:
        if lesson["date"] == "06.06":
            cur_track += 1
        lesson["track"] = [LESSON_TRACKS[cur_track]]
        lesson["date"] = datetime.strptime(lesson["date"], "%d.%m")\
            .replace(year=2023).isoformat().split("T")[0]
        lesson["link"] = lesson["link"].split("?")[0] if lesson["link"] else None
    return lessons
Merge lessons with the same URLs
Looking at the processed lessons, it turns out that some lectures are shared between different tracks. To address this, we can group the lessons by the URLs of their videos, which reduces the total number of lectures to upload from 260 to 161. We also need to attach all the corresponding tracks to each merged lesson object and, of course, remove the duplicate lessons from the original list.
def merge_lessons(lessons):
    """Merge lessons with same video URLs."""
    urls = set(lesson["link"] for lesson in lessons)
    urls.remove(None)
    for url in urls:
        lesson_versions = [lesson for lesson in lessons if lesson['link'] == url]
        new_lesson = deepcopy(lesson_versions[0])
        new_lesson['track'] = [lesson['track'][0] for lesson in lesson_versions]
        for lesson in lesson_versions:
            lessons.remove(lesson)
        lessons.append(new_lesson)
    return lessons
Creating a new database page with Notion API
To create a new page in the Notion database, the request body must follow the JSON format specified in the Notion API documentation; the format differs for each type of database property. After transforming a lecture's data into this format, we make a POST request to the Notion API endpoint.
def create_page(lesson_data, session):
    """Create a new lesson page in the database using Notion API."""
    page_data = {
        "Title": {"title": [{"text": {"content": lesson_data["title"]}}]},
        "Track": {"multi_select": [{"name": track} for track in lesson_data["track"]]},
        "Video": {"url": lesson_data["link"]},
        "Date": {"date": {"start": lesson_data["date"]}},
        "Lecturers": {"multi_select": [{"name": lecturer} for lecturer in lesson_data["speakers"]]},
        "Description": {"rich_text": [{"text": {"content": lesson_data["description"]}}]}
    }
    payload = {
        "parent": {"database_id": NOTION_DATABASE_ID},
        "properties": page_data
    }
    response = session.post(API_PAGE_CREATE, headers=HEADERS, json=payload)
    return response
Connecting functions and creating lessons
In this block of code, the lectures are finally retrieved, processed, and merged. The script then iterates over them and uploads each one to the database, printing the progress along the way.
def create_lessons(session):
    """Scrape, process, merge, and upload lessons to the Notion database."""
    lessons = merge_lessons(process_lessons(scrape_lessons(get_lessons_data(session))))
    processed = 0
    for lesson in lessons:
        response = create_page(lesson, session=session)
        processed += 1
        if response.status_code == 200:
            print(f"Lesson {lesson['title']} was created successfully... {round(processed / len(lessons) * 100)}%")
        else:
            print(f"Error {response.status_code}\n{response.content}")
Creating a session and finalizing the code
A session is required to retry requests on failure. After the session is initialized, lectures can be created.
def main():
    # Requests session initialization
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    create_lessons(session)

if __name__ == "__main__":
    main()
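One loose end before wrapping up: API_DATABASE_QUERY is defined alongside the other constants but never used in the snippets above. As a rough sketch (not part of the code shown in this article), the endpoint could be used to list the lessons already stored in the database, for example to skip duplicates when the script is re-run:
def get_existing_titles(session):
    """Hypothetical helper: collect titles of lessons already present in the database."""
    titles = set()
    payload = {}
    while True:
        # The query endpoint is paginated, so keep fetching pages until has_more is False
        response = session.post(API_DATABASE_QUERY, headers=HEADERS, json=payload)
        data = response.json()
        for page in data.get("results", []):
            title_items = page["properties"]["Title"]["title"]
            if title_items:
                titles.add(title_items[0]["plain_text"])
        if not data.get("has_more"):
            break
        payload = {"start_cursor": data["next_cursor"]}
    return titles
Inside create_lessons, lessons whose titles appear in this set could then be skipped before calling create_page.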
Conclusion
This is just a tiny example of how routine and complex tasks can be automated or simplified with a bit of programming. I hope some of you will find this guide useful for scraping other data (for instance, from university admission lists), or will take it as inspiration for your own automation ideas.
All the code shown in this article can be found in a GitHub repository.