July 20, 2023

Automate Routine Like a Programmer

Notion database with information about lectures

Lately, I've been using Notion a lot to take notes and work with tabular data. This tool provides rich functionality, and its databases stand out in particular.

I'm also currently watching streams from the Yandex Summer Schools Open Lecture Hall, and given the number of lectures, a Notion database is a convenient way to keep track of the lessons I've studied. However, entering the data for each lesson by hand is quite tedious, so I could not resist automating the process with web scraping and the Notion API.

The project itself is quite simple: a Python script reads the Yandex schedule page using the Beautiful Soup 4 library, processes the retrieved information about the lectures, and saves it to the Notion database.

Data that I am going to scrape and upload to Notion

Prerequisites

Notion integrations page

Before writing the actual code, create a new Notion integration on the integrations page at notion.so/my-integrations. Once the integration is created, you will receive a secret token for the Notion API. Copy this token into a config.toml file in the root of the project. The config file should also contain the Notion database ID, which can be found in the URL of the database page you are working with.

Notion database ID retrieval

In addition, don't forget to add the integration to the database you are working with to allow the script to edit the content.

You should also retrieve your cookie and browser User-Agent: open DevTools at https://yandex.ru/yaintern/schools/open-lectures and inspect the headers of the initial request. In the end, your config file should look like this:

The config file (all sensitive data is blurred)
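
For reference, here is a minimal sketch of config.toml. The keys match the ones read by the script below; all values are placeholders that should be replaced with your own data.

YANDEX_COOKIE = "<cookie copied from the request headers>"
USER_AGENT = "<your browser User-Agent>"

[Notion]
SECRET = "<integration secret token>"
DATABASE_ID = "<database ID from the page URL>"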

Then, a new Python project must be initialized. I highly recommend using a tool called Poetry, which drastically simplifies working with virtual environments and dependencies.

poetry init

To add third-party packages to the project, use:

poetry add requests beautifulsoup4 toml
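
After running these commands, the dependencies section of the generated pyproject.toml should look roughly like this (the version constraints below are just examples; Poetry will pin whatever it resolves at install time):

[tool.poetry.dependencies]
python = "^3.10"
requests = "^2.31"
beautifulsoup4 = "^4.12"
toml = "^0.10"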

The Code

Imports

As for imports, deepcopy and datetime come from the standard library, while HTTPAdapter and Retry are needed to configure retries for the requests session. The unicodedata.normalize function is required to clean up the text we get from the aforementioned Yandex Lecture Hall schedule page. Speaking of third-party packages, toml is needed to read the config file, Beautiful Soup 4 to parse the HTML page content, and requests to make HTTP requests to the Notion API.

from copy import deepcopy
from datetime import datetime
from unicodedata import normalize

import requests
import toml
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

Reading config and specifying constants

First, the config should be read. API_DATABASE_QUERY and API_PAGE_CREATE are Notion API endpoints. The HEADERS variable holds the request headers for accessing the Notion API, and LESSON_TRACKS lists the categories of Yandex lectures, as specified on the website.

config = toml.load("config.toml")

NOTION_DATABASE_ID = config["Notion"]["DATABASE_ID"]
NOTION_SECRET = config["Notion"]["SECRET"]

LECTURES_SCHEDULE_URL = "https://yandex.ru/yaintern/schools/open-lectures"
API_DATABASE_QUERY = f"https://api.notion.com/v1/databases/{NOTION_DATABASE_ID}/query"
API_PAGE_CREATE = "https://api.notion.com/v1/pages"

HEADERS = {
    "Authorization": f"Bearer {NOTION_SECRET}",
    "Content-Type": "application/json",
    "Notion-Version": "2022-06-28"
}

LESSON_TRACKS = [
    "interfaces development",
    "backend (Python)",
    "backend (Java)",
    "backend (C++)",
    "backend (Go)",
    "mobile (Android)",
    "mobile (iOS)",
    "mobile (Flutter)",
    "management",
    "marketing",
    "product analytics"
]

Getting data from Yandex

This function, like the other functions below, uses a session object to make requests. The session will be defined later; essentially, it is just a requests session configured to retry failed requests. When requesting the Yandex Open Lecture Hall website, a cookie and a browser User-Agent must be provided, since captcha checks may otherwise prevent you from reading the data.

def get_lessons_data(session):
    """Get lessons data from Yandex."""
    data = session.get(LECTURES_SCHEDULE_URL, headers={
        "Cookie": config["YANDEX_COOKIE"].encode(),
        "User-Agent": config["USER_AGENT"]
    })
    return BeautifulSoup(data.content, features="html.parser")

Using Beautiful Soup to scrape data from the page

Using the selectors provided by the Beautiful Soup library, we can find elements by their class names and access their attributes and inner text. This way, we can get the container for each lecture and then retrieve its title, date, description, link to the YouTube stream, and list of speakers. The data for each lecture is saved to a dictionary and appended to a list.

def scrape_lessons(soup):
    """Scrape lesson data from the schedule page."""
    lessons = []
    for el in soup.find_all("div", class_="lc-events-program__container"):
        data = el.find_all("div", class_="lc-styled-text__text")

        lesson_title = normalize("NFKD", data[1].text)
        lesson_desc = normalize("NFKD", data[2].text).strip().replace("\n", " ")
        lesson_date = data[0].text[:5].strip()
        lesson_link = el.find("a").attrs["href"] if el.find("a") else None

        lesson_speakers = []
        speakers_el = el.find_all("div", class_="lc-events-speaker__name")
        for speaker in speakers_el:
            lesson_speakers.append(speaker.text)

        lessons.append({
            "title": lesson_title,
            "link": lesson_link,
            "description": lesson_desc,
            "date": lesson_date,
            "speakers": lesson_speakers
        })
    return lessons

Processing lessons data

The list then gets processed. Notice that each track on the Yandex Open Lecture Hall website starts with an opening ceremony held on the 6th of June, so the lectures can be split into tracks by this exact date.

Also, the date field must be transformed into an ISO date string, like "2023-06-06" (instead of "06.06"). This can be accomplished with the datetime module.

Moreover, the extra ?feature=share query parameter can be stripped from the video links to make the Notion database look cleaner.

def process_lessons(lessons):
    """Process scraped lessons."""
    cur_track = -1
    for lesson in lessons:
        if lesson["date"] == "06.06":
            cur_track += 1
        lesson["track"] = [LESSON_TRACKS[cur_track]]
        lesson["date"] = datetime.strptime(lesson["date"], "%d.%m")\
            .replace(year=2023).isoformat().split("T")[0]
        lesson["link"] = lesson["link"].split("?")[0] if lesson["link"] else None
    return lessons
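
For illustration, here is a small hypothetical example of what process_lessons does to a single scraped entry (the title, link, and speaker values are made up):

# Hypothetical example of processing a single scraped entry
example = [{
    "title": "Opening ceremony",
    "link": "https://youtu.be/VIDEO_ID?feature=share",
    "description": "...",
    "date": "06.06",
    "speakers": ["Speaker Name"],
}]
process_lessons(example)
# -> "date" becomes "2023-06-06", "link" loses "?feature=share",
#    and "track" becomes ["interfaces development"]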

Merging lessons with the same URLs

Exploring the processed lessons shows that some lectures are shared between different tracks. To address this, we can group lessons by the URLs of the referenced videos, which reduces the total number of lectures to upload to the database from 260 to 161. We should also remember to collect all corresponding tracks in the merged lesson object and, of course, remove the duplicate lessons from the original list.

def merge_lessons(lessons):
    """Merge lessons that share the same video URL."""
    urls = {lesson["link"] for lesson in lessons}
    urls.discard(None)  # lessons without a video link are kept as they are
    for url in urls:
        lesson_versions = [lesson for lesson in lessons if lesson["link"] == url]
        new_lesson = deepcopy(lesson_versions[0])
        # Collect the tracks from every version of the lesson
        new_lesson["track"] = [lesson["track"][0] for lesson in lesson_versions]
        for lesson in lesson_versions:
            lessons.remove(lesson)
        lessons.append(new_lesson)
    return lessons

Creating a new database page with the Notion API

To create a new page in the Notion database, the request body must follow the JSON format specified in the documentation; the format differs for each type of database property. After transforming the lecture data into this format, a POST request is made to the Notion API endpoint.

def create_page(lesson_data, session):
    """Create a new lesson page in the database using Notion API."""
    page_data = {
        "Title": {"title": [{"text": {"content": lesson_data["title"]}}]},
        "Track": {"multi_select": [{"name": track} for track in lesson_data["track"]]},
        "Video": {"url": lesson_data["link"]},
        "Date": {"date": {"start": lesson_data["date"]}},
        "Lecturers": {"multi_select": [{"name": lecturer} for lecturer in lesson_data["speakers"]]},
        "Description": {"rich_text": [{"text": {"content": lesson_data["description"]}}]}
    }

    payload = {
        "parent": {"database_id": NOTION_DATABASE_ID},
        "properties": page_data
    }

    response = session.post(API_PAGE_CREATE, headers=HEADERS, json=payload)
    return response

Connecting functions and creating lessons

In this block of code, the lectures are finally retrieved, processed, and merged. Then each lecture is uploaded to the database, with a simple progress indicator printed along the way.

def create_lessons(session):
    """Scrape, process, merge, and upload the lessons to Notion."""
    lessons = merge_lessons(process_lessons(scrape_lessons(get_lessons_data(session))))

    processed = 0
    for lesson in lessons:
        response = create_page(lesson, session=session)
        processed += 1
        if response.status_code == 200:
            print(f"Lesson {lesson['title']} was created successfully... {round(processed / len(lessons) * 100)}%")
        else:
            print(f"Error {response.status_code}\n{response.content}")

Creating a session and finalizing the code

A session is required to retry requests on failure. After the session is initialized, lectures can be created.

def main():
    # Requests session initialization
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    create_lessons(session)


if __name__ == "__main__":
    main()
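
Assuming the script is saved as main.py (any filename works), it can be run inside the Poetry-managed environment with:

poetry run python main.py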

Conclusion

This is just a tiny example of how routine and complex tasks can be automated or simplified with a bit of programming. I hope some of you find this guide useful for scraping other data (for instance, university admission lists) or take it as inspiration for other automation ideas.

All the code shown in this article can be found in a GitHub repository.