5.10. DataFrame Query

  • .query()


... df[df['sales'] > 50000]
... df.query('sales > 50000')
../../_images/pandas-dataframe-query.png

Figure 5.6. Pandas query expression [1]

5.10.1. SetUp

import pandas as pd


df = pd.DataFrame({
    'name': ['William','Emma','Sofia','Markus','Edward','Thomas','Ethan','Olivia','Arun','Anika','Paulo'],
    'region': ['East','North','East','South','West','West','South','West','West','East','South'],
    'sales': [50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000],
    'expenses': [42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

df
       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

5.10.2. Query Data

Subset a pandas dataframe based on a numeric variable:

df.query('sales > 60000')
      name region  sales  expenses
2    Sofia   East  90000     50000
5   Thomas   West  72000     39000
8     Arun   West  67000     39000
9    Anika   East  65000     44000
10   Paulo  South  67000     45000

Select rows based on a categorical variable:

df.query('region == "East"')
      name region  sales  expenses
0  William   East  50000     42000
2    Sofia   East  90000     50000
9    Anika   East  65000     44000

Subset a pandas dataframe with multiple conditions:

df.query('(sales > 50000) and (region in ["East", "West"])')
     name region  sales  expenses
2   Sofia   East  90000     50000
5  Thomas   West  72000     39000
7  Olivia   West  55000     60000
8    Arun   West  67000     39000
9   Anika   East  65000     44000

5.10.3. Query Index

  • Works also with Time Series

Subset a dataframe by index:

df.query('index < 3')
      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000

Every odd index:

df.query('index%2 == 1')
     name region  sales  expenses
1    Emma  North  52000     43000
3  Markus  South  34000     44000
5  Thomas   West  72000     39000
7  Olivia   West  55000     60000
9   Anika   East  65000     44000

5.10.4. Query Columns

Subset a pandas dataframe by comparing two columns:

df.query('sales < expenses')
     name region  sales  expenses
3  Markus  South  34000     44000
7  Olivia   West  55000     60000

5.10.5. Query Variable

Reference local variables inside of query:

mean = df['sales'].mean()
mean
np.float64(58454.545454545456)

df.query('sales > @mean')
      name region  sales  expenses
2    Sofia   East  90000     50000
5   Thomas   West  72000     39000
8     Arun   West  67000     39000
9    Anika   East  65000     44000
10   Paulo  South  67000     45000
regions = ['East','North',]
df.query('region in @regions')
      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000
9    Anika   East  65000     44000

5.10.6. Query Save

Modify a dataframe in place:

df2 = df.copy()
df2.query('index < 5', inplace=True)

df2
      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000
3   Markus  South  34000     44000
4   Edward   West  42000     38000

5.10.7. References

5.10.8. Assignments

# FIXME: English translation

# %% About
# - Name: DataFrame Select
# - Difficulty: easy
# - Lines: 5
# - Minutes: 3

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# TODO: English translation

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Wybierz wiersze, gdzie wartość 'petal_length' jest powyżej 2.0
# 3. Wyświetl 5 pierwszych wierszy
# 4. Użyj `.query()`
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
   sepal_length  sepal_width  petal_length  petal_width     species
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor
3           7.3          2.9           6.3          1.8   virginica
4           5.6          2.5           3.9          1.1  versicolor
6           5.5          2.6           4.4          1.2  versicolor
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = 'https://python3.info/_static/iris-clean.csv'

# %% Result
result = ...