5.13. DataFrame NA

Important

pd.NA and np.nan both represent missing values
pd.NA is expected to be used in the future, but for now there are functions that do not support it yet
.isna()
.dropna(how='any|all', axis='rows|columns')
.any()
.all()
.fillna(value|dict)
.ffill()
.bfill()
.interpolate() - works only with np.nan (not pd.NA)
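
The two missing-value markers listed above behave differently in comparisons. The following is a minimal sketch (the variable names are chosen only for illustration) showing that `np.nan` compares as `False`, `pd.NA` propagates through the comparison, and `pd.isna()` recognizes both:

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary float; comparing it with itself gives False.
nan_equals_nan = (np.nan == np.nan)

# pd.NA propagates through comparisons: the result is pd.NA, not a bool.
na_equals_na = (pd.NA == pd.NA)

# pd.isna() recognizes both markers, so it is the safe way to test.
both_missing = pd.isna(np.nan) and pd.isna(pd.NA)

# The default float64 dtype stores missing values as np.nan,
# while the nullable 'Int64' extension dtype stores them as pd.NA.
as_float = pd.Series([1, None])
as_nullable = pd.Series([1, None], dtype='Int64')

print(nan_equals_nan)      # False
print(na_equals_na)        # <NA>
print(both_missing)        # True
print(as_float.dtype)      # float64
print(as_nullable.dtype)   # Int64
```

Because `pd.NA == pd.NA` is neither `True` nor `False`, using it directly in an `if` statement raises an error; test with `pd.isna()` instead.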

math.nan is a floating-point 'not a number' (NaN) value, equivalent to the output of float('nan'). Due to the requirements of the IEEE-754 standard, math.nan and float('nan') are not considered equal to any other numeric value, including themselves. To check whether a number is a NaN, use the math.isnan() function instead of is or ==. Example [1]:

Python Standard Library:

>>> import math
>>>
>>> math.nan == math.nan
False
>>> float('nan') == float('nan')
False
>>> math.isnan(math.nan)
True
>>> math.isnan(float('nan'))
True

5.13.1. SetUp

>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({
...     'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
...     'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
...     'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
...     'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
... })
>>>
>>> df
     A    B    C   D
0  1.0  1.1    a NaN
1  2.0  2.2    b NaN
2  NaN  NaN  NaN NaN
3  NaN  NaN  NaN NaN
4  3.0  3.3    c NaN
5  NaN  NaN  NaN NaN
6  4.0  4.4    d NaN

5.13.2. Check if Any

>>> df.any()
A     True
B     True
C     True
D    False
dtype: bool

5.13.3. Check if All

>>> df.all()
A    True
B    True
C    True
D    True
dtype: bool
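
Called directly on the data, `.any()` and `.all()` test truthiness of the values, with missing values skipped by default. To ask about missing values themselves, they are usually chained after `.isna()`. A minimal sketch, rebuilding the DataFrame from the SetUp section so that it runs on its own (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

missing = df.isna()

# Per column: does it contain at least one missing value?
cols_with_any_missing = missing.any()

# Per column: is every value missing?
cols_all_missing = missing.all()

# Per row: is every value in the row missing?
rows_all_missing = missing.all(axis='columns')

print(cols_with_any_missing)
print(cols_all_missing)
print(rows_all_missing)
```

Here every column contains at least one missing value, only column `D` is entirely missing, and rows 2, 3 and 5 are entirely missing.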

5.13.4. Check if Null

>>> df.isnull()
       A      B      C     D
0  False  False  False  True
1  False  False  False  True
2   True   True   True  True
3   True   True   True  True
4  False  False  False  True
5   True   True   True  True
6  False  False  False  True

5.13.5. Check if NA

>>> df.isna()
       A      B      C     D
0  False  False  False  True
1  False  False  False  True
2   True   True   True  True
3   True   True   True  True
4  False  False  False  True
5   True   True   True  True
6  False  False  False  True
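
The two listings above are identical because `.isnull()` is an alias of `.isna()`. A short sketch, using a reduced version of the SetUp DataFrame, that checks this and also shows the inverse `.notna()` and a per-column count of missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan] * 7,
})

# .isnull() and .isna() return the same boolean mask.
same_mask = df.isnull().equals(df.isna())

# .notna() is the element-wise negation of .isna().
is_inverse = df.notna().equals(~df.isna())

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isna().sum()

print(same_mask)
print(is_inverse)
print(missing_per_column)
```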

5.13.6. Fill With Scalar Value

>>> df.fillna(0.0)
     A    B    C    D
0  1.0  1.1    a  0.0
1  2.0  2.2    b  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
4  3.0  3.3    c  0.0
5  0.0  0.0  0.0  0.0
6  4.0  4.4    d  0.0

5.13.7. Fill With Dict Values

>>> df.fillna({
...     'A': 99,
...     'B': 88,
...     'C': 77
... })
      A     B   C   D
0   1.0   1.1   a NaN
1   2.0   2.2   b NaN
2  99.0  88.0  77 NaN
3  99.0  88.0  77 NaN
4   3.0   3.3   c NaN
5  99.0  88.0  77 NaN
6   4.0   4.4   d NaN
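
Besides a scalar or a dict, `.fillna()` also accepts a Series whose index holds column names. One common use, sketched here on a small made-up numeric DataFrame (not the one from SetUp), is filling each column with its own mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0],
    'B': [10.0, np.nan, np.nan, 40.0],
})

# .mean() skips missing values and returns one value per column.
means = df.mean()

# Each column is filled with the entry of `means` that has its name.
filled = df.fillna(means)

print(means)
print(filled)
```

Whether the mean is a sensible replacement depends on the data; this only illustrates the per-column mechanism.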

5.13.8. Fill Forwards

ffill: propagate last valid observation forward:

>>> df.ffill()
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
2  2.0  2.2  b NaN
3  2.0  2.2  b NaN
4  3.0  3.3  c NaN
5  3.0  3.3  c NaN
6  4.0  4.4  d NaN

5.13.9. Fill Backwards

bfill: use the next valid observation to fill the gap:

>>> df.bfill()
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
2  3.0  3.3  c NaN
3  3.0  3.3  c NaN
4  3.0  3.3  c NaN
5  4.0  4.4  d NaN
6  4.0  4.4  d NaN
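
Forward and backward filling can be restricted and combined. The sketch below uses the `limit` parameter to cap how many consecutive gaps are filled, and chains both directions to cover a leading gap that forward fill alone cannot reach. The Series are made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, 3, np.nan, 4], dtype='float64')

# Fill at most one consecutive missing value after each valid one;
# the second NaN in the run at positions 2-3 stays missing.
limited = s.ffill(limit=1)

# A leading NaN has no previous value, so ffill leaves it untouched;
# a following bfill takes the next valid value instead.
leading = pd.Series([np.nan, 1.0, np.nan])
forward_only = leading.ffill()
both_directions = leading.ffill().bfill()

print(limited)
print(forward_only)
print(both_directions)
```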

5.13.10. Interpolate

>>> df.interpolate()
          A         B    C   D
0  1.000000  1.100000    a NaN
1  2.000000  2.200000    b NaN
2  2.333333  2.566667  NaN NaN
3  2.666667  2.933333  NaN NaN
4  3.000000  3.300000    c NaN
5  3.500000  3.850000  NaN NaN
6  4.000000  4.400000    d NaN

Table 5.2. Interpolation techniques
Method	Description
`linear`	Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes
`time`	Works on daily and higher resolution data to interpolate given length of interval
`index`, `values`	Use the actual numerical values of the index
`pad`	Fill in NA using existing values
`nearest`, `zero`, `slinear`, `quadratic`, `cubic`, `spline`, `barycentric`, `polynomial`	Passed to `scipy.interpolate.interp1d`. These methods use the numerical values of the index. Both `polynomial` and `spline` require that you also specify an `order` (int), e.g. `df.interpolate(method='polynomial', order=5)`
`krogh`, `piecewise_polynomial`, `spline`, `pchip`, `akima`	Wrappers around the SciPy interpolation methods of similar names
`from_derivatives`	Refers to `scipy.interpolate.BPoly.from_derivatives` which replaces `piecewise_polynomial` interpolation method in scipy 0.18.
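
The difference between `linear` and `index` from the table only shows up when the index is not evenly spaced. A minimal sketch with a made-up Series whose index values are 0, 1 and 10:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the missing point sits at index 1,
# much closer to index 0 than to index 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index and treats the points as equally spaced,
# so the gap is filled with the midpoint of its neighbours.
linear = s.interpolate(method='linear')

# 'index' uses the numerical index values as x-coordinates.
by_index = s.interpolate(method='index')

print(linear)
print(by_index)
```

With `linear` the gap becomes 5.0, while `index` gives 1.0 because index 1 lies one tenth of the way from 0 to 10.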

5.13.11. Drop Rows with NA

>>> df.dropna(how='all')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN

>>> df.dropna(how='all', axis='rows')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN

>>> df.dropna(how='all', axis=0)
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN

>>> df.dropna(how='any')
Empty DataFrame
Columns: [A, B, C, D]
Index: []

>>> df.dropna(how='any', axis=0)
Empty DataFrame
Columns: [A, B, C, D]
Index: []

>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
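
With `how='any'` every row is dropped here, because column `D` is missing everywhere. Two further `.dropna()` parameters, not covered in the listings above, give finer control: `subset` restricts the check to selected columns and `thresh` keeps rows with at least a given number of non-missing values. A sketch on the SetUp DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan] * 7,
})

# Only column A decides whether a row is dropped.
by_subset = df.dropna(subset=['A'])

# Keep rows that have at least 3 non-missing values (A, B and C here).
by_thresh = df.dropna(thresh=3)

# Requiring 4 non-missing values removes every row, since D is always NaN.
too_strict = df.dropna(thresh=4)

print(by_subset.index.tolist())
print(by_thresh.index.tolist())
print(len(too_strict))
```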

5.13.12. Drop Columns with NA

>>> df.dropna(how='all', axis='columns')
     A    B    C
0  1.0  1.1    a
1  2.0  2.2    b
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  3.0  3.3    c
5  NaN  NaN  NaN
6  4.0  4.4    d

>>> df.dropna(how='all', axis=1)
     A    B    C
0  1.0  1.1    a
1  2.0  2.2    b
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  3.0  3.3    c
5  NaN  NaN  NaN
6  4.0  4.4    d

>>> df.dropna(how='all', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame

>>> df.dropna(how='any', axis='columns')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]

>>> df.dropna(how='any', axis=1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]

>>> df.dropna(how='any', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
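
Row-wise and column-wise dropping are often combined. One possible order, sketched below on the SetUp DataFrame, is to first remove columns that are entirely missing and then remove rows that still contain any missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan] * 7,
})

cleaned = (
    df
    .dropna(how='all', axis='columns')   # removes the all-NaN column D
    .dropna(how='any', axis='rows')      # removes rows 2, 3 and 5
)

print(cleaned)
```

Doing it in the opposite order would first drop every row (each one has a missing `D`), leaving nothing to work with.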

5.13.13. Recap

>>> data = pd.DataFrame({
...     'A': [1,2,3,4,5,6,7,8,9],
...     'B': [1,2,np.nan,np.nan,5,6,7,8,9]
... })
>>>
>>> a = data['A'].isnull()
>>> b = data['B'].isnull()
>>> c = data['B'].isnull().any()
>>> d = data['B'].isnull().all()
>>>
>>> e = data.fillna(0)
>>>
>>> f = data.dropna()
>>> g = data.dropna(how='any')
>>> h = data.dropna(how='any', axis='rows')
>>> i = data.dropna(how='all', axis='columns')
>>>
>>> j = data.ffill()
>>> k = data.bfill()
>>> l = data.interpolate('linear')
>>> m = data.interpolate('quadratic')
>>> n = data.interpolate('polynomial', order=3)
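
To see how the filling strategies from the recap differ, the results for column `B` can be placed side by side. This is only an illustrative sketch; the `comparison` frame and its column names are not part of the recap:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'B': [1, 2, np.nan, np.nan, 5, 6, 7, 8, 9],
})

# One column per strategy, all computed from the same input.
comparison = pd.DataFrame({
    'original': data['B'],
    'ffill': data['B'].ffill(),
    'bfill': data['B'].bfill(),
    'linear': data['B'].interpolate(method='linear'),
})

# Show only the rows around the gap at positions 2 and 3.
print(comparison.loc[1:4])
```

In the gap, forward fill repeats 2.0, backward fill repeats 5.0, and linear interpolation produces 3.0 and 4.0.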

5.13.14. References

5.13.15. Assignments

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` without rows having any empty value (`pd.NA`)
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` bez wierszy mających jakiekolwiek puste wartości (`pd.NA`)
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role
0      Mark     Watney   botanist
1   Melissa      Lewis  commander
2      Rick   Martinez      pilot
4      Beth  Johanssen   engineer
5     Chris       Back      medic
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander'},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot'},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': pd.NA},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer'},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic'},
])

# %% Result
result = ...

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` without empty rows
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` bez pustych wierszy
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role
0      Mark     Watney   botanist
1   Melissa      Lewis  commander
2      Rick   Martinez      pilot
3      Alex      Vogel       <NA>
5      Beth  Johanssen   engineer
6     Chris       Back      medic
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander'},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot'},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': pd.NA},
    {'firstname': pd.NA, 'lastname': pd.NA, 'role': pd.NA},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer'},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic'},
])

# %% Result
result = ...

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` without empty columns
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` bez pustych kolumn
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role
0      Mark     Watney   botanist
1   Melissa      Lewis  commander
2      Rick   Martinez      pilot
3      Alex      Vogel    chemist
4      Beth  Johanssen   engineer
5     Chris       Back      medic
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist', 'mission': pd.NA},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander', 'mission': pd.NA},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot', 'mission': pd.NA},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': 'chemist', 'mission': pd.NA},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer', 'mission': pd.NA},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic', 'mission': pd.NA},
])

# %% Result
result = ...

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` with values in column `agency` forward filled
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` z wartościami w kolumnie `agency` wypełnionymi w przód
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role agency
0      Mark     Watney   botanist   NASA
1   Melissa      Lewis  commander   NASA
2      Rick   Martinez      pilot   NASA
3      Alex      Vogel    chemist    ESA
4      Beth  Johanssen   engineer   NASA
5     Chris       Back      medic   NASA
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist', 'agency': 'NASA'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander', 'agency': pd.NA},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot', 'agency': pd.NA},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': 'chemist', 'agency': 'ESA'},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer', 'agency': 'NASA'},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic', 'agency': pd.NA},
])

# %% Result
result = ...

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Skip first line with metadata
# 3. Rename columns to:
#    - sepal_length
#    - sepal_width
#    - petal_length
#    - petal_width
#    - species
# 4. Replace values in column species
#    - 0 -> 'setosa',
#    - 1 -> 'versicolor',
#    - 2 -> 'virginica'
# 5. Select values in column 'petal_length' less than 4
# 6. Set selected values to `NaN`
# 7. Drop rows with remaining `NaN` values
# 8. Define `result` as first two rows
# 9. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Pomiń pierwszą linię z metadanymi
# 3. Zmień nazwy kolumn na:
#    - sepal_length
#    - sepal_width
#    - petal_length
#    - petal_width
#    - species
# 4. Podmień wartości w kolumnie species
#    - 0 -> 'setosa',
#    - 1 -> 'versicolor',
#    - 2 -> 'virginica'
# 5. Wybierz wartości w kolumnie 'petal_length' mniejsze od 4
# 6. Wybrane wartości ustaw na `NaN`
# 7. Usuń wiersze z pozostałymi wartościami `NaN`
# 8. Zdefiniuj `result` jako dwa pierwsze wiersze
# 9. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
   sepal_length  sepal_width  petal_length  petal_width     species
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = 'https://python3.info/_static/iris-dirty.csv'

COLUMNS = [
    'sepal_length',
    'sepal_width',
    'petal_length',
    'petal_width',
    'species']

LABELS = {
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica',
}

# %% Result
result = ...