5.10. DataFrame Query
.query()
... df[df['sales'] > 50000]
... df.query('sales > 50000')

5.10.1. SetUp
import pandas as pd
df = pd.DataFrame({
'name': ['William','Emma','Sofia','Markus','Edward','Thomas','Ethan','Olivia','Arun','Anika','Paulo'],
'region': ['East','North','East','South','West','West','South','West','West','East','South'],
'sales': [50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000],
'expenses': [42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})
df
name region sales expenses
0 William East 50000 42000
1 Emma North 52000 43000
2 Sofia East 90000 50000
3 Markus South 34000 44000
4 Edward West 42000 38000
5 Thomas West 72000 39000
6 Ethan South 49000 42000
7 Olivia West 55000 60000
8 Arun West 67000 39000
9 Anika East 65000 44000
10 Paulo South 67000 45000
5.10.2. Query Data
Subset a pandas dataframe based on a numeric variable:
df.query('sales > 60000')
name region sales expenses
2 Sofia East 90000 50000
5 Thomas West 72000 39000
8 Arun West 67000 39000
9 Anika East 65000 44000
10 Paulo South 67000 45000
Select rows based on a categorical variable:
df.query('region == "East"')
name region sales expenses
0 William East 50000 42000
2 Sofia East 90000 50000
9 Anika East 65000 44000
Subset a pandas dataframe with multiple conditions:
df.query('(sales > 50000) and (region in ["East", "West"])')
name region sales expenses
2 Sofia East 90000 50000
5 Thomas West 72000 39000
7 Olivia West 55000 60000
8 Arun West 67000 39000
9 Anika East 65000 44000
5.10.3. Query Index
Works also with Time Series
Subset a dataframe by index:
df.query('index < 3')
name region sales expenses
0 William East 50000 42000
1 Emma North 52000 43000
2 Sofia East 90000 50000
Every odd index:
df.query('index%2 == 1')
name region sales expenses
1 Emma North 52000 43000
3 Markus South 34000 44000
5 Thomas West 72000 39000
7 Olivia West 55000 60000
9 Anika East 65000 44000
5.10.4. Query Columns
Subset a pandas dataframe by comparing two columns:
df.query('sales < expenses')
name region sales expenses
3 Markus South 34000 44000
7 Olivia West 55000 60000
5.10.5. Query Variable
Reference local variables inside of query:
mean = df['sales'].mean()
mean
np.float64(58454.545454545456)
df.query('sales > @mean')
name region sales expenses
2 Sofia East 90000 50000
5 Thomas West 72000 39000
8 Arun West 67000 39000
9 Anika East 65000 44000
10 Paulo South 67000 45000
regions = ['East','North',]
df.query('region in @regions')
name region sales expenses
0 William East 50000 42000
1 Emma North 52000 43000
2 Sofia East 90000 50000
9 Anika East 65000 44000
5.10.6. Query Save
Modify a dataframe in place:
df2 = df.copy()
df2.query('index < 5', inplace=True)
df2
name region sales expenses
0 William East 50000 42000
1 Emma North 52000 43000
2 Sofia East 90000 50000
3 Markus South 34000 44000
4 Edward West 42000 38000
5.10.7. References
5.10.8. Assignments
# FIXME: English translation
# %% About
# - Name: DataFrame Select
# - Difficulty: easy
# - Lines: 5
# - Minutes: 3
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% English
# TODO: English translation
# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Wybierz wiersze, gdzie wartość 'petal_length' jest powyżej 2.0
# 3. Wyświetl 5 pierwszych wierszy
# 4. Użyj `.query()`
# 5. Uruchom doctesty - wszystkie muszą się powieść
# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
sepal_length sepal_width petal_length petal_width species
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
3 7.3 2.9 6.3 1.8 virginica
4 5.6 2.5 3.9 1.1 versicolor
6 5.5 2.6 4.4 1.2 versicolor
"""
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% Imports
import pandas as pd
# %% Types
result: pd.DataFrame
# %% Data
DATA = 'https://python3.info/_static/iris-clean.csv'
# %% Result
result = ...