6.10. DataFrame Query

.query()

>>> #
... df[df['sales'] > 50000]
... df.query('sales > 50000')

../../_images/pandas-dataframe-query.png — Figure 6.10. Pandas query expression [1]

6.10.1. SetUp

>>> import pandas as pd
>>>
>>>
>>> df = pd.DataFrame({
...     'name': ['William','Emma','Sofia','Markus','Edward','Thomas','Ethan','Olivia','Arun','Anika','Paulo'],
...     'region': ['East','North','East','South','West','West','South','West','West','East','South'],
...     'sales': [50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000],
...     'expenses': [42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})
>>>
>>> df
       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

6.10.2. Query Data

Subset a pandas dataframe based on a numeric variable:

>>> df.query('sales > 60000')
      name region  sales  expenses
  Sofia   East  90000     50000
 Thomas   West  72000     39000
   Arun   West  67000     39000
  Anika   East  65000     44000
 Paulo  South  67000     45000

Select rows based on a categorical variable:

>>> df.query('region == "East"')
      name region  sales  expenses
0  William   East  50000     42000
2    Sofia   East  90000     50000
9    Anika   East  65000     44000

Subset a pandas dataframe with multiple conditions:

>>> df.query('(sales > 50000) and (region in ["East", "West"])')
     name region  sales  expenses
 Sofia   East  90000     50000
Thomas   West  72000     39000
Olivia   West  55000     60000
  Arun   West  67000     39000
 Anika   East  65000     44000

6.10.3. Query Index

Works also with Time Series

Subset a dataframe by index:

>>> df.query('index < 3')
      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000

Every odd index:

>>> df.query('index%2 == 1')
     name region  sales  expenses
  Emma  North  52000     43000
Markus  South  34000     44000
Thomas   West  72000     39000
Olivia   West  55000     60000
 Anika   East  65000     44000

6.10.4. Query Columns

Subset a pandas dataframe by comparing two columns:

>>> df.query('sales < expenses')
     name region  sales  expenses
3  Markus  South  34000     44000
7  Olivia   West  55000     60000

6.10.5. Query Variable

Reference local variables inside of query:

>>> mean = df['sales'].mean()
>>> mean
np.float64(58454.545454545456)
>>>
>>> df.query('sales > @mean')
      name region  sales  expenses
2    Sofia   East  90000     50000
5   Thomas   West  72000     39000
8     Arun   West  67000     39000
9    Anika   East  65000     44000
10   Paulo  South  67000     45000

>>> regions = ['East','North',]
>>> df.query('region in @regions')
      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000
9    Anika   East  65000     44000

6.10.6. Query Save

Modify a dataframe in place:

>>> df2 = df.copy()
>>> df2.query('index < 5', inplace=True)
>>>
>>> df2
      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000
3   Markus  South  34000     44000
4   Edward   West  42000     38000

6.10.7. References

6.10.8. Assignments

# FIXME: English translation

# %% About
# - Name: DataFrame Select
# - Difficulty: easy
# - Lines: 5
# - Minutes: 3

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# TODO: English translation

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Wybierz wiersze, gdzie wartość 'petal_length' jest powyżej 2.0
# 3. Wyświetl 5 pierwszych wierszy
# 4. Użyj `.query()`
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Expected
# >>> result  # doctest: +NORMALIZE_WHITESPACE
#    sepal_length  sepal_width  petal_length  petal_width     species
# 1           5.9          3.0           5.1          1.8   virginica
# 2           6.0          3.4           4.5          1.6  versicolor
# 3           7.3          2.9           6.3          1.8   virginica
# 4           5.6          2.5           3.9          1.1  versicolor
# 6           5.5          2.6           4.4          1.2  versicolor

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0

>>> assert sys.version_info >= (3, 9), \
'Python has an is invalid version; expected: `3.9` or newer.'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert 'result' in globals(), \
'Variable `result` is not defined; assign result of your program to it.'

>>> assert result is not Ellipsis, \
'Variable `result` has an invalid value; assign result of your program to it.'

>>> assert type(result) is pd.DataFrame, \
'Variable `result` has an invalid type; expected: `pd.DataFrame`.'

>>> result  # doctest: +NORMALIZE_WHITESPACE
   sepal_length  sepal_width  petal_length  petal_width     species
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor
3           7.3          2.9           6.3          1.8   virginica
4           5.6          2.5           3.9          1.1  versicolor
6           5.5          2.6           4.4          1.2  versicolor
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -f -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = 'https://python3.info/_static/iris-clean.csv'

# %% Result
result = ...