3.3. Web Scraping
Requests-HTML https://github.com/psf/requests-html
BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Scrapy https://scrapy.org/
3.3.1. BeautifulSoup
BeautifulSoup is a Python library for pulling data out of HTML and XML files; it sits on top of a parser of your choice and provides idiomatic ways of navigating, searching, and modifying the parse tree.
3.3.2. Install
$ pip install beautifulsoup4
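The lxml and html5lib parsers listed in the table below are optional and come from separate PyPI packages:
$ pip install lxml
$ pip install html5lib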
3.3.3. Parser
Parser | Typical usage | Advantages | Disadvantages
Python's html.parser | BeautifulSoup(markup, 'html.parser') | Batteries included, decent speed | Not as fast as lxml, less lenient than html5lib
lxml's HTML parser | BeautifulSoup(markup, 'lxml') | Very fast, lenient | External C dependency
lxml's XML parser | BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml') | Very fast, the only currently supported XML parser | External C dependency
html5lib | BeautifulSoup(markup, 'html5lib') | Extremely lenient, parses pages the same way a web browser does, creates valid HTML5 | Very slow, external Python dependency
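The parser is selected with the second argument to BeautifulSoup(). A small sketch of how the parsers treat the same invalid markup (example and outputs follow the upstream documentation; lxml and html5lib must be installed as shown above):
from bs4 import BeautifulSoup

print(BeautifulSoup('<a><b/></a>', 'html.parser'))  # <a><b></b></a>
print(BeautifulSoup('<a><b/></a>', 'lxml'))         # <html><body><a><b></b></a></body></html>
print(BeautifulSoup('<a><b/></a>', 'html5lib'))     # <html><head></head><body><a><b></b></a></body></html>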
3.3.4. Open
from bs4 import BeautifulSoup

with open("index.html") as file:
    html = BeautifulSoup(file, 'html.parser')

# remove the element with id='menubox' from the tree (find() returns None if absent)
html.find(id='menubox').decompose()
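The same parsing works for live pages; a minimal sketch using the requests library (the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')  # placeholder URL
response.raise_for_status()  # fail loudly on HTTP errors
html = BeautifulSoup(response.text, 'html.parser')
print(html.title.string)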
3.3.5. Basic Usage
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
html = BeautifulSoup(html_doc, 'html.parser')
print(html.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="https://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="https://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="https://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
html.title # <title>The Dormouse's story</title>
html.title.name # 'title'
html.title.string # "The Dormouse's story"
html.title.parent.name # 'head'
html.p # <p class="title"><b>The Dormouse's story</b></p>
html.p['class'] # ['title'] (class is a multi-valued attribute, so a list is returned)
html.a # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
html.find_all('a')
# [<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]
html.find(id="link3")
# <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>
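find() and find_all() have CSS-selector counterparts, select_one() and select(), which work on the same document:
html.select('p.story a.sister')  # the same three links as html.find_all('a')
html.select_one('#link2')        # <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>
[a['href'] for a in html.select('a')]
# ['https://example.com/elsie', 'https://example.com/lacie', 'https://example.com/tillie']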
3.3.6. Iterating Over Items
for link in html.find_all('a'):
    print(link.get('href'))
# https://example.com/elsie
# https://example.com/lacie
# https://example.com/tillie
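The same loop can collect several attributes at once, e.g. into a list of dicts (the key names here are illustrative):
links = [{'href': a.get('href'), 'text': a.get_text()} for a in html.find_all('a')]
# [{'href': 'https://example.com/elsie', 'text': 'Elsie'},
#  {'href': 'https://example.com/lacie', 'text': 'Lacie'},
#  {'href': 'https://example.com/tillie', 'text': 'Tillie'}]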
3.3.7. Getting Page Text
html.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
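get_text() also accepts a separator string and a strip flag, which is often more convenient for scraped text:
html.get_text('|', strip=True)
# "The Dormouse's story|The Dormouse's story|Once upon a time there were three little sisters; and their names were|Elsie|,|..." (one string, fragments joined with '|')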
3.3.8. Assignments
3.3.8.1. Scraping Iris
- About:
Name: Scraping Iris
Difficulty: medium
Lines: 20
Minutes: 21
- License:
Copyright 2025, Matt Harasymczuk <matt@python3.info>
This code can be used only for learning by humans (self-education)
This code cannot be used for teaching others (trainings, bootcamps, etc.)
This code cannot be used for teaching LLMs and AI algorithms
This code cannot be used in commercial or proprietary products
This code cannot be distributed in any form
This code cannot be changed in any form outside of training course
This code cannot have its license changed
If you use this code in your product, you must open-source it under GPLv2
Exception can be granted only by the author (Matt Harasymczuk)
- English:
Using BeautifulSoup4, download the Iris dataset from https://python3.info/_static/iris-dirty.csv.
Parse the HTML code to clean the data.
Delete the first header row.
Name the columns:
sepal_length
sepal_width
petal_length
petal_width
species
Display the data as a list of dicts; the keys should be the column names (a starting-point sketch follows this assignment).
Run doctests - all must succeed
- Polish:
Za pomocą BeautifulSoup4 ze strony https://python3.info/_static/iris-dirty.csv pobierz dane zbioru Irysów.
Parsując kod HTML oczyść dane.
Skasuj pierwszy wiersz nagłówkowy.
Kolumny nazwij:
sepal_length
sepal_width
petal_length
petal_width
species
Wyświetl dane w formacie listy dictów, kluczami mają być nazwy kolumn.
Uruchom doctesty - wszystkie muszą się powieść
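A possible starting point for this assignment, assuming the file is a plain CSV whose first row is a dirty header (the URL and column names come from the task; everything else is illustrative):
from urllib.request import urlopen

URL = 'https://python3.info/_static/iris-dirty.csv'
COLUMNS = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

data = urlopen(URL).read().decode()  # download the dataset
rows = data.splitlines()[1:]         # delete the first header row
result = [dict(zip(COLUMNS, row.split(','))) for row in rows if row]
print(result)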
3.3.8.2. Scraping EVA
- About:
Name: Scraping EVA
Difficulty: medium
Lines: 100
Minutes: 21
- License:
Copyright 2025, Matt Harasymczuk <matt@python3.info>
This code can be used only for learning by humans (self-education)
This code cannot be used for teaching others (trainings, bootcamps, etc.)
This code cannot be used for teaching LLMs and AI algorithms
This code cannot be used in commercial or proprietary products
This code cannot be distributed in any form
This code cannot be changed in any form outside of training course
This code cannot have its license changed
If you use this code in your product, you must open-source it under GPLv2
Exception can be granted only by the author (Matt Harasymczuk)
- English:
Based on the given URLs:
Scrape the page using BeautifulSoup4.
Prepare a CSV file with data about the spacewalks.
Try to do the same using pandas.read_html() (see the sketch after this assignment):
providing the fourth URL as the parameter,
for a partially parsed page, e.g. an extracted table.
Run doctests - all must succeed
- Polish:
Na podstawie podanych URL:
Skrapuj stronę wykorzystując BeautifulSoup4.
Przygotuj plik CSV z danymi dotyczącymi spacerów kosmicznych.
Spróbuj to samo zrobić za pomocą pandas.read_html():
podając jako parametr czwarty URL,
dla częściowo sparsowanej strony, np. wyciągniętej tabelki.
Uruchom doctesty - wszystkie muszą się powieść
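A minimal sketch of the pandas.read_html() variant (the URL is a placeholder; read_html() returns a list of DataFrames, one per <table> found on the page):
import pandas as pd

tables = pd.read_html('https://example.com/eva.html')  # placeholder URL
df = tables[0]                     # pick the table holding the spacewalk data
df.to_csv('eva.csv', index=False)  # write the extracted table to CSV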