5.13. DataFrame NA

  • pd.NA and np.nan Represents missing values

  • pd.NA will be used in future, but for now, there are function which does not support it yet

  • .isna()

  • .dropna(how='any|all', axis='rows|columns')

  • .any()

  • .all()

  • .fillna(value|dict)

  • .ffill()

  • .bfill()

  • .interpolate() - works only with np.nan (not pd.NA)

A floating-point 'not a number' (NaN) value. Equivalent to the output of float('nan'). Due to the requirements of the IEEE-754 standard, math.nan and float('nan') are not considered to equal to any other numeric value, including themselves. To check whether a number is a NaN, use the isnan() function to test for NaNs instead of is or ==. Example [1]:

Python Standard Library:

>>> import math
>>>
>>> math.nan == math.nan
False
>>> float('nan') == float('nan')
False
>>> math.isnan(math.nan)
True
>>> math.isnan(float('nan'))
True

5.13.1. SetUp

>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({
...     'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
...     'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
...     'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
...     'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
... })
>>>
>>> df
     A    B    C   D
0  1.0  1.1    a NaN
1  2.0  2.2    b NaN
2  NaN  NaN  NaN NaN
3  NaN  NaN  NaN NaN
4  3.0  3.3    c NaN
5  NaN  NaN  NaN NaN
6  4.0  4.4    d NaN

5.13.2. Check if Any

>>> df.any()
A     True
B     True
C     True
D    False
dtype: bool

5.13.3. Check if All

>>> df.all()
A    True
B    True
C    True
D    True
dtype: bool

5.13.4. Check if Null

>>> df.isnull()
       A      B      C     D
0  False  False  False  True
1  False  False  False  True
2   True   True   True  True
3   True   True   True  True
4  False  False  False  True
5   True   True   True  True
6  False  False  False  True

5.13.5. Check if NA

>>> df.isna()
       A      B      C     D
0  False  False  False  True
1  False  False  False  True
2   True   True   True  True
3   True   True   True  True
4  False  False  False  True
5   True   True   True  True
6  False  False  False  True

5.13.6. Fill With Scalar Value

>>> df.fillna(0.0)
     A    B    C    D
0  1.0  1.1    a  0.0
1  2.0  2.2    b  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
4  3.0  3.3    c  0.0
5  0.0  0.0  0.0  0.0
6  4.0  4.4    d  0.0

5.13.7. Fill With Dict Values

>>> df.fillna({
...     'A': 99,
...     'B': 88,
...     'C': 77
... })
      A     B   C   D
0   1.0   1.1   a NaN
1   2.0   2.2   b NaN
2  99.0  88.0  77 NaN
3  99.0  88.0  77 NaN
4   3.0   3.3   c NaN
5  99.0  88.0  77 NaN
6   4.0   4.4   d NaN

5.13.8. Fill Forwards

ffill: propagate last valid observation forward:

>>> df.fillna(method='ffill')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
2  2.0  2.2  b NaN
3  2.0  2.2  b NaN
4  3.0  3.3  c NaN
5  3.0  3.3  c NaN
6  4.0  4.4  d NaN

5.13.9. Fill Backwards

bfill: use NEXT valid observation to fill gap:

>>> df.fillna(method='bfill')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
2  3.0  3.3  c NaN
3  3.0  3.3  c NaN
4  3.0  3.3  c NaN
5  4.0  4.4  d NaN
6  4.0  4.4  d NaN

5.13.10. Interpolate

>>> df.interpolate()
          A         B    C   D
0  1.000000  1.100000    a NaN
1  2.000000  2.200000    b NaN
2  2.333333  2.566667  NaN NaN
3  2.666667  2.933333  NaN NaN
4  3.000000  3.300000    c NaN
5  3.500000  3.850000  NaN NaN
6  4.000000  4.400000    d NaN
Table 5.7. Interpolation techniques

Method

Description

linear

Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes

time

Works on daily and higher resolution data to interpolate given length of interval

index, values

use the actual numerical values of the index.

pad

Fill in NA using existing values

nearest, zero, slinear, quadratic, cubic, spline, barycentric, polynomial

Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both polynomial and spline require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5)

krogh, piecewise_polynomial, spline, pchip, akima

Wrappers around the SciPy interpolation methods of similar names

from_derivatives

Refers to scipy.interpolate.BPoly.from_derivatives which replaces piecewise_polynomial interpolation method in scipy 0.18.

5.13.11. Drop Rows with NA

>>> df.dropna(how='all')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN
>>> df.dropna(how='all', axis='rows')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN
>>> df.dropna(how='all', axis=0)
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN
>>> df.dropna(how='any')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis=0)
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [A, B, C, D]
Index: []

5.13.12. Drop Columns with NA

>>> df.dropna(how='all', axis='columns')
     A    B    C
0  1.0  1.1    a
1  2.0  2.2    b
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  3.0  3.3    c
5  NaN  NaN  NaN
6  4.0  4.4    d
>>> df.dropna(how='all', axis=1)
     A    B    C
0  1.0  1.1    a
1  2.0  2.2    b
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  3.0  3.3    c
5  NaN  NaN  NaN
6  4.0  4.4    d
>>> df.dropna(how='all', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
>>> df.dropna(how='any', axis='columns')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame

5.13.13. Recap

>>> data = pd.DataFrame({
...     'A': [1,2,3,4,5,6,7,8,9],
...     'B': [1,2,np.nan,np.nan,5,6,7,8,9]
... })
>>>
>>> a = data['A'].isnull()
>>> b = data['B'].isnull()
>>> c = data['B'].isnull().any()
>>> d = data['B'].isnull().all()
>>>
>>> e = data.fillna(0)
>>>
>>> f = data.dropna()
>>> g = data.dropna(how='any')
>>> h = data.dropna(how='any', axis='rows')
>>> i = data.dropna(how='all', axis='columns')
>>>
>>> j = data.ffill()
>>> k = data.bfill()
>>> l = data.interpolate('linear')
>>> m = data.interpolate('quadratic')  
>>> n = data.interpolate('polynomial', order=3)  

5.13.14. References

5.13.15. Assignments

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` without rows having any empty value (`pd.NA`)
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` bez wierszy mających jakiekolwiek puste wartości (`pd.NA`)
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role
0      Mark     Watney   botanist
1   Melissa      Lewis  commander
2      Rick   Martinez      pilot
4      Beth  Johanssen   engineer
5     Chris       Back      medic
"""

import pandas as pd
import numpy as np


DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander'},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot'},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': pd.NA},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer'},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic'},
])

# dataframe `DATA` without rows having any empty value (`pd.NA`)
# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` without empty rows
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` bez pustych wierszy
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role
0      Mark     Watney   botanist
1   Melissa      Lewis  commander
2      Rick   Martinez      pilot
3      Alex      Vogel       <NA>
5      Beth  Johanssen   engineer
6     Chris       Back      medic
"""

import pandas as pd


DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander'},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot'},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': pd.NA},
    {'firstname': pd.NA, 'lastname': pd.NA, 'role': pd.NA},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer'},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic'},
])

# dataframe `DATA` without empty rows
# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` without empty columns
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` bez pustych kolumn
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role
0      Mark     Watney   botanist
1   Melissa      Lewis  commander
2      Rick   Martinez      pilot
3      Alex      Vogel    chemist
4      Beth  Johanssen   engineer
5     Chris       Back      medic
"""

import pandas as pd


DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist', 'mission': pd.NA},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander', 'mission': pd.NA},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot', 'mission': pd.NA},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': 'chemist', 'mission': pd.NA},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer', 'mission': pd.NA},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic', 'mission': pd.NA},
])

# dataframe `DATA` without empty columns
# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Define variable `result` with
#    dataframe `DATA` with values in column `mission` forward filled
# 2. Run doctests - all must succeed

# %% Polish
# 1. Zdefiniuj zmienną `result` z
#    dataframe `DATA` z wartościami w kolumnie `mission` wypełnionymi w przód
# 2. Uruchom doctesty - wszystkie muszą się powieść

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
  firstname   lastname       role agency
0      Mark     Watney   botanist   NASA
1   Melissa      Lewis  commander   NASA
2      Rick   Martinez      pilot   NASA
3      Alex      Vogel    chemist    ESA
4      Beth  Johanssen   engineer   NASA
5     Chris       Back      medic   NASA
"""

import pandas as pd


DATA = pd.DataFrame([
    {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist', 'agency': 'NASA'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander', 'agency': pd.NA},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot', 'agency': pd.NA},
    {'firstname': 'Alex', 'lastname': 'Vogel', 'role': 'chemist', 'agency': 'ESA'},
    {'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer', 'agency': 'NASA'},
    {'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic', 'agency': pd.NA},
])

# dataframe `DATA` with values in column `mission` forward filled
# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Skip first line with metadata
# 3. Rename columns to:
#    - sepal_length
#    - sepal_width
#    - petal_length
#    - petal_width
#    - species
# 4. Replace values in column species
#    - 0 -> 'setosa',
#    - 1 -> 'versicolor',
#    - 2 -> 'virginica'
# 5. Select values in column 'petal_length' less than 4
# 6. Set selected values to `NaN`
# 7. Drop rows with remaining `NaN` values
# 8. Define `result` as first two rows
# 9. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Pomiń pierwszą linię z metadanymi
# 3. Zmień nazwy kolumn na:
#    - sepal_length
#    - sepal_width
#    - petal_length
#    - petal_width
#    - species
# 4. Podmień wartości w kolumnie species
#    - 0 -> 'setosa',
#    - 1 -> 'versicolor',
#    - 2 -> 'virginica'
# 5. Wybierz wartości w kolumnie 'petal_length' mniejsze od 4
# 6. Wybrane wartości ustaw na `NaN`
# 7. Usuń wiersze z pozostałymi wartościami `NaN`
# 8. Zdefiniuj `result` jako dwa pierwsze wiersze
# 9. Uruchom doctesty - wszystkie muszą się powieść

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
   sepal_length  sepal_width  petal_length  petal_width     species
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor
"""

import pandas as pd


DATA = 'https://python3.info/_static/iris-dirty.csv'

COLUMNS = [
    'sepal_length',
    'sepal_width',
    'petal_length',
    'petal_width',
    'species']

LABELS = {
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica',
}

# type: pd.DataFrame
result = ...