3.3. Web Scraping

Requests HTML https://github.com/psf/requests-html
BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Scrapy https://scrapy.org/

3.3.1. `BeautifulSoup`

3.3.2. Example usage

https://github.com/AstroMatt/thesis-masters-aerospace/blob/master/src/worldspaceflight-astronaut-bios.py

3.3.3. Install

$ pip install beautifulsoup4

3.3.4. Parser

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	`BeautifulSoup(markup, "html.parser")`	Batteries included; decent speed; tolerant (as of Python 2.7.3 and 3.2.2)	Not very tolerant (before Python 2.7.3 or 3.2.2)
lxml's HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; tolerant	External C dependency
lxml's XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only currently supported XML parser	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely tolerant; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
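
The trade-off in the table (speed versus an extra dependency) can be handled at runtime. The following is only a minimal sketch, assuming you prefer `lxml` when it is installed and accept `html.parser` otherwise; the helper name `make_soup` and its `preferred` parameter are hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # The requested parser library is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>Unclosed paragraph")
print(soup.p.string)
# Unclosed paragraph
```

Different parsers may build different trees from invalid markup, so a fallback like this is safest when you only rely on elements that every parser will find.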

3.3.5. Open

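from bs4 import BeautifulSoup

# A file object can be passed to BeautifulSoup directly
with open("index.html") as file:
    html = BeautifulSoup(file, 'html.parser')

# Remove the element with id='menubox' (and its children) from the tree;
# find() returns None if no such element exists
html.find(id='menubox').decompose()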
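
Since `find()` returns `None` when nothing matches, calling `.decompose()` on its result fails on pages without that element. A minimal sketch of a guarded removal, using a made-up in-memory string in place of `index.html` (the markup below is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for index.html
page = '<div id="menubox">Menu</div><p id="content">Article text</p>'
html = BeautifulSoup(page, "html.parser")

menubox = html.find(id="menubox")   # None when nothing matches
if menubox is not None:
    menubox.decompose()             # removes the tag and its children

# A second lookup finds nothing, so the guard skips the removal
assert html.find(id="menubox") is None

print(html)
# <p id="content">Article text</p>
```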

3.3.6. Basic Usage

from bs4 import BeautifulSoup


html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
"""

html = BeautifulSoup(html_doc, 'html.parser')

print(html.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="https://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="https://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="https://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

html.title              # <title>The Dormouse's story</title>
html.title.name         # 'title'
html.title.string       # "The Dormouse's story"
html.title.parent.name  # 'head'
html.p                  # <p class="title"><b>The Dormouse's story</b></p>
html.p['class']         # ['title']
html.a                  # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>

html.find_all('a')
# [<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]

html.find(id="link3")
# <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>
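
Beyond tag names and `id`, `find_all()` accepts further filters. The sketch below, which uses a shortened copy of the document above so that it runs on its own, shows the `class_` and `limit` arguments and what the two search methods return when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
html = BeautifulSoup(html_doc, "html.parser")

# class_ (trailing underscore) filters by CSS class; `class` is a Python keyword
stories = html.find_all("p", class_="story")
print(len(stories))                      # 1

# limit stops the search after the given number of matches
first_two = html.find_all("a", limit=2)
print([a["id"] for a in first_two])      # ['link1', 'link2']

# find() returns None when nothing matches, find_all() returns an empty list
print(html.find("table"))                # None
print(html.find_all("table"))            # []
```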

3.3.7. Iterating over items

for link in html.find_all('a'):
    print(link.get('href'))

# https://example.com/elsie
# https://example.com/lacie
# https://example.com/tillie
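
The same loop can collect results instead of printing them. This is a minimal sketch on made-up markup in which the third link deliberately has no `href`, to show that `.get()` returns `None` for a missing attribute and that `href=True` keeps only tags that have one.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the last link has no href attribute
html_doc = """
<a href="https://example.com/elsie" id="link1">Elsie</a>
<a href="https://example.com/lacie" id="link2">Lacie</a>
<a id="link3">Tillie</a>
"""
html = BeautifulSoup(html_doc, "html.parser")

# .get() returns None for a missing attribute, while a['href'] would raise KeyError
links = [
    {"text": a.get_text(), "href": a.get("href")}
    for a in html.find_all("a")
]
print(links[2])
# {'text': 'Tillie', 'href': None}

# href=True matches only tags that have an href attribute
urls = [a["href"] for a in html.find_all("a", href=True)]
print(urls)
# ['https://example.com/elsie', 'https://example.com/lacie']
```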

3.3.8. Getting Page Text

html.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
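
By default `get_text()` keeps the whitespace of the source document, which explains the blank lines in the output above. A minimal sketch of the `separator` and `strip` arguments and of the related `stripped_strings` generator, on a short made-up fragment:

```python
from bs4 import BeautifulSoup

html = BeautifulSoup("<p>  Elsie, <b>Lacie</b>\n and Tillie  </p>", "html.parser")

# Default: text nodes are concatenated unchanged
print(repr(html.get_text()))
# '  Elsie, Lacie\n and Tillie  '

# strip=True trims each text node, separator joins the trimmed pieces
print(html.get_text(separator=" ", strip=True))
# Elsie, Lacie and Tillie

# stripped_strings yields the same trimmed pieces one by one
print(list(html.stripped_strings))
# ['Elsie,', 'Lacie', 'and Tillie']
```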

3.3.9. Assignments

3.3.9.1. Scraping Iris

About:

Name: Scraping Iris
Difficulty: medium
Lines: 20
Minutes: 21

License:

Copyright 2025, Matt Harasymczuk <matt@python3.info>
This code can be used only for learning by humans (self-education)
This code cannot be used for teaching others (trainings, bootcamps, etc.)
This code cannot be used for teaching LLMs and AI algorithms
This code cannot be used in commercial or proprietary products
This code cannot be distributed in any form
This code cannot be changed in any form outside of training course
This code cannot have its license changed
If you use this code in your product, you must open-source it under GPLv2
Exception can be granted only by the author (Matt Harasymczuk)

English:

Using BeautifulSoup4, download the Iris dataset from https://python3.info/_static/iris-dirty.csv
Parse the HTML code to clean the data
Delete the first header row
Name the columns: sepal_length, sepal_width, petal_length, petal_width, species
Display the data as a list of dicts whose keys are the column names
Run doctests - all must succeed

3.3.9.2. Scraping EVA

About:

Name: Scraping EVA
Difficulty: medium
Lines: 100
Minutes: 21

License:

This code can be used only for learning by humans (self-education)
This code cannot be used for teaching others (trainings, bootcamps, etc.)
This code cannot be used for teaching LLMs and AI algorithms
This code cannot be used in commercial or proprietary products
This code cannot be distributed in any form
This code cannot be changed in any form outside of training course
This code cannot have its license changed
If you use this code in your product, you must open-source it under GPLv2
Exception can be granted only by the author (Matt Harasymczuk)

English:

Based on the given URLs:
Scrape the page using BeautifulSoup4
Prepare a CSV file with data about spacewalks
Try to do the same using pandas.read_html():
1. Providing the fourth URL as a parameter
2. For a partially parsed page, e.g. an extracted table
Run doctests - all must succeed

