5.13. DataFrame NA
pd.NA and np.nan represent missing values
pd.NA will be used in the future, but for now, there are functions which do not support it yet
.isna()
.dropna(how='any|all', axis='rows|columns')
.any()
.all()
.fillna(value|dict)
.ffill()
.bfill()
.interpolate() - works only with np.nan (not pd.NA)
A floating-point 'not a number' (NaN) value. Equivalent to the output of float('nan'). Due to the requirements of the IEEE-754 standard, math.nan and float('nan') are not considered equal to any other numeric value, including themselves. To check whether a number is a NaN, use the isnan() function to test for NaNs instead of is or ==. Example [1]:
Python Standard Library:
>>> import math
>>>
>>> math.nan == math.nan
False
>>> float('nan') == float('nan')
False
>>> math.isnan(math.nan)
True
>>> math.isnan(float('nan'))
True
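The same rule matters when NaN values sit inside ordinary Python containers. The following is a minimal sketch using only the standard library; the list `values` is made up for illustration. Counting by equality finds nothing, while math.isnan() finds both NaNs:

```python
import math

# Hypothetical sample data: two of the four entries are NaN.
values = [1.0, float('nan'), 2.5, math.nan]

# Equality never matches NaN, so this counter stays at zero.
equal_hits = sum(1 for v in values if v == math.nan)

# math.isnan() is the reliable test.
nan_hits = sum(1 for v in values if math.isnan(v))

# Keep only the real numbers.
cleaned = [v for v in values if not math.isnan(v)]

print(equal_hits, nan_hits, cleaned)
```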
5.13.1. SetUp
>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({
... 'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
... 'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
... 'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
... 'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
... })
>>>
>>> df
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 3.0 3.3 c NaN
5 NaN NaN NaN NaN
6 4.0 4.4 d NaN
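Column A was given as integers, yet it prints as 1.0, 2.0 and so on: np.nan is a float, so a numeric column holding it is stored as float64. A minimal sketch of the contrast with pd.NA, assuming the nullable 'Int64' extension dtype suits your data (the names `with_nan` and `with_na` are illustrative only):

```python
import numpy as np
import pandas as pd

# np.nan forces an upcast to float64.
with_nan = pd.Series([1, 2, np.nan])

# pd.NA with the nullable Int64 dtype keeps the values as integers.
with_na = pd.Series([1, 2, pd.NA], dtype='Int64')

print(with_nan.dtype)   # float64
print(with_na.dtype)    # Int64

# Both kinds of missing value are detected the same way.
print(with_nan.isna().tolist())
print(with_na.isna().tolist())

# Comparing with pd.NA propagates the missing value instead of giving False.
comparison_is_na = (pd.NA == 1) is pd.NA
print(comparison_is_na)
```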
5.13.2. Check if Any
>>> df.any()
A True
B True
C True
D False
dtype: bool
5.13.3. Check if All
>>> df.all()
A True
B True
C True
D True
dtype: bool
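Note that any() and all() test the truthiness of the values and skip missing ones by default, which is why the all-NaN column D yields False for any() and True for all(). To ask about the missing values themselves, one option is to apply them to the isna() mask. A minimal sketch on a smaller, made-up frame called `small`:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': [1, np.nan, 3],
    'D': [np.nan, np.nan, np.nan],
})

# Per column: is at least one value missing?
has_missing = small.isna().any()

# Per column: is every value missing?
all_missing = small.isna().all()

# Per row: does the row contain at least one missing value?
rows_with_missing = small.isna().any(axis='columns')

print(has_missing)
print(all_missing)
print(rows_with_missing)
```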
5.13.4. Check if Null
>>> df.isnull()
A B C D
0 False False False True
1 False False False True
2 True True True True
3 True True True True
4 False False False True
5 True True True True
6 False False False True
5.13.5. Check if NA
>>> df.isna()
A B C D
0 False False False True
1 False False False True
2 True True True True
3 True True True True
4 False False False True
5 True True True True
6 False False False True
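isnull() is an alias of isna(), so the two outputs above are identical. A common follow-up is to count missing values per column by summing the boolean mask. A short sketch with made-up data:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [1.1, 2.2, np.nan, 4.4],
})

# Both spellings produce the same boolean mask.
same_mask = small.isnull().equals(small.isna())

# True counts as 1, so summing the mask counts missing values.
missing_per_column = small.isna().sum()
present_per_column = small.notna().sum()
total_missing = int(small.isna().sum().sum())

print(same_mask)
print(missing_per_column)
print(present_per_column)
print(total_missing)
```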
5.13.6. Fill With Scalar Value
>>> df.fillna(0.0)
A B C D
0 1.0 1.1 a 0.0
1 2.0 2.2 b 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 3.0 3.3 c 0.0
5 0.0 0.0 0.0 0.0
6 4.0 4.4 d 0.0
5.13.7. Fill With Dict Values
>>> df.fillna({
... 'A': 99,
... 'B': 88,
... 'C': 77
... })
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 99.0 88.0 77 NaN
3 99.0 88.0 77 NaN
4 3.0 3.3 c NaN
5 99.0 88.0 77 NaN
6 4.0 4.4 d NaN
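Columns not listed in the dict, such as D above, are left untouched. Besides a scalar or a dict, fillna() also accepts a Series whose index matches the column names. The sketch below uses that to fill each column with its own mean; the frame is made up, and the mean is only one illustrative choice of fill value:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, 4.0, 8.0],
})

# Fill only column A; column B keeps its missing value.
only_a = small.fillna({'A': 0.0})

# small.mean() is a Series indexed by column name (A -> 2.0, B -> 6.0),
# so each column is filled with its own mean.
by_mean = small.fillna(small.mean())

print(only_a)
print(by_mean)
```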
5.13.8. Fill Forwards
ffill: propagate the last valid observation forward:
>>> df.ffill()
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 2.0 2.2 b NaN
3 2.0 2.2 b NaN
4 3.0 3.3 c NaN
5 3.0 3.3 c NaN
6 4.0 4.4 d NaN
5.13.9. Fill Backwards
bfill: use the next valid observation to fill the gap:
>>> df.bfill()
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 3.0 3.3 c NaN
3 3.0 3.3 c NaN
4 3.0 3.3 c NaN
5 4.0 4.4 d NaN
6 4.0 4.4 d NaN
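Two details are worth noting: a fill needs a valid neighbour in the direction it comes from (column D has none, so it stays NaN), and by default a gap of any length is filled. Both ffill() and bfill() take a limit argument that caps how many consecutive values are filled. A sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# The leading NaN has no earlier value, so forward fill leaves it alone.
forward = s.ffill()

# Fill at most one consecutive missing value per gap.
forward_one = s.ffill(limit=1)

# Backward fill takes values from the right, so the leading NaN is filled.
backward = s.bfill()

print(forward.tolist())
print(forward_one.tolist())
print(backward.tolist())
```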
5.13.10. Interpolate
>>> df.interpolate()
A B C D
0 1.000000 1.100000 a NaN
1 2.000000 2.200000 b NaN
2 2.333333 2.566667 NaN NaN
3 2.666667 2.933333 NaN NaN
4 3.000000 3.300000 c NaN
5 3.500000 3.850000 NaN NaN
6 4.000000 4.400000 d NaN
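| Method | Description |
|---|---|
| 'linear' | Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes |
| 'time' | Works on daily and higher resolution data to interpolate given length of interval |
| 'index', 'values' | Use the actual numerical values of the index |
| 'pad' | Fill in NA using existing values |
| 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'polynomial' | Passed to scipy.interpolate.interp1d |
| 'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima', 'cubicspline' | Wrappers around the SciPy interpolation methods of similar names |
| 'from_derivatives' | Refers to scipy.interpolate.BPoly.from_derivatives |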
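To illustrate the difference between ignoring the index and using its numerical values, the sketch below interpolates the same made-up Series twice. With 'linear' the unevenly spaced index plays no role; with 'index' the gap is filled in proportion to the index values:

```python
import numpy as np
import pandas as pd

# The index is deliberately uneven: 0, 1, 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# Values are treated as equally spaced, so the gap becomes the midpoint.
linear = s.interpolate(method='linear')

# The index position 1 lies one tenth of the way from 0 to 10.
by_index = s.interpolate(method='index')

print(linear.tolist())     # [0.0, 5.0, 10.0]
print(by_index.tolist())   # [0.0, 1.0, 10.0]
```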
5.13.11. Drop Rows with NA
>>> df.dropna(how='all')
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
4 3.0 3.3 c NaN
6 4.0 4.4 d NaN
>>> df.dropna(how='all', axis='rows')
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
4 3.0 3.3 c NaN
6 4.0 4.4 d NaN
>>> df.dropna(how='all', axis=0)
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
4 3.0 3.3 c NaN
6 4.0 4.4 d NaN
>>> df.dropna(how='any')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis=0)
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
5.13.12. Drop Columns with NA
>>> df.dropna(how='all', axis='columns')
A B C
0 1.0 1.1 a
1 2.0 2.2 b
2 NaN NaN NaN
3 NaN NaN NaN
4 3.0 3.3 c
5 NaN NaN NaN
6 4.0 4.4 d
>>> df.dropna(how='all', axis=1)
A B C
0 1.0 1.1 a
1 2.0 2.2 b
2 NaN NaN NaN
3 NaN NaN NaN
4 3.0 3.3 c
5 NaN NaN NaN
6 4.0 4.4 d
>>> df.dropna(how='all', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
>>> df.dropna(how='any', axis='columns')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
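how='any' and how='all' are the two extremes. dropna() also accepts subset, to consider only selected columns, and thresh, the minimum number of non-missing values a row must have to be kept. A sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, np.nan, 6.0],
    'C': ['x', np.nan, 'z'],
})

# Drop rows only when column A is missing; B and C are not inspected.
by_subset = small.dropna(subset=['A'])

# Keep rows with at least three non-missing values.
by_thresh = small.dropna(thresh=3)

print(by_subset.index.tolist())   # [0, 2]
print(by_thresh.index.tolist())   # [2]
```

dropna() returns a new DataFrame, so `small` itself still has all three rows afterwards.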
5.13.13. Recap
>>> data = pd.DataFrame({
... 'A': [1,2,3,4,5,6,7,8,9],
... 'B': [1,2,np.nan,np.nan,5,6,7,8,9]
... })
>>>
>>> a = data['A'].isnull()
>>> b = data['B'].isnull()
>>> c = data['B'].isnull().any()
>>> d = data['B'].isnull().all()
>>>
>>> e = data.fillna(0)
>>>
>>> f = data.dropna()
>>> g = data.dropna(how='any')
>>> h = data.dropna(how='any', axis='rows')
>>> i = data.dropna(how='all', axis='columns')
>>>
>>> j = data.ffill()
>>> k = data.bfill()
>>> l = data.interpolate('linear')
>>> m = data.interpolate('quadratic')
>>> n = data.interpolate('polynomial', order=3)
5.13.14. References
5.13.15. Assignments
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8
# %% English
# 1. Define variable `result` with
# dataframe `DATA` without rows having any empty value (`pd.NA`)
# 2. Run doctests - all must succeed
# %% Polish
# 1. Zdefiniuj zmienną `result` z
# dataframe `DATA` bez wierszy mających jakiekolwiek puste wartości (`pd.NA`)
# 2. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result
firstname lastname role
0 Mark Watney botanist
1 Melissa Lewis commander
2 Rick Martinez pilot
4 Beth Johanssen engineer
5 Chris Back medic
"""
import pandas as pd
import numpy as np
DATA = pd.DataFrame([
{'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist'},
{'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander'},
{'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot'},
{'firstname': 'Alex', 'lastname': 'Vogel', 'role': pd.NA},
{'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer'},
{'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic'},
])
# dataframe `DATA` without rows having any empty value (`pd.NA`)
# type: pd.DataFrame
result = ...
# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8
# %% English
# 1. Define variable `result` with
# dataframe `DATA` without empty rows
# 2. Run doctests - all must succeed
# %% Polish
# 1. Zdefiniuj zmienną `result` z
# dataframe `DATA` bez pustych wierszy
# 2. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result
firstname lastname role
0 Mark Watney botanist
1 Melissa Lewis commander
2 Rick Martinez pilot
3 Alex Vogel <NA>
5 Beth Johanssen engineer
6 Chris Back medic
"""
import pandas as pd
DATA = pd.DataFrame([
{'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist'},
{'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander'},
{'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot'},
{'firstname': 'Alex', 'lastname': 'Vogel', 'role': pd.NA},
{'firstname': pd.NA, 'lastname': pd.NA, 'role': pd.NA},
{'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer'},
{'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic'},
])
# dataframe `DATA` without empty rows
# type: pd.DataFrame
result = ...
# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8
# %% English
# 1. Define variable `result` with
# dataframe `DATA` without empty columns
# 2. Run doctests - all must succeed
# %% Polish
# 1. Zdefiniuj zmienną `result` z
# dataframe `DATA` bez pustych kolumn
# 2. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result
firstname lastname role
0 Mark Watney botanist
1 Melissa Lewis commander
2 Rick Martinez pilot
3 Alex Vogel chemist
4 Beth Johanssen engineer
5 Chris Back medic
"""
import pandas as pd
DATA = pd.DataFrame([
{'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist', 'mission': pd.NA},
{'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander', 'mission': pd.NA},
{'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot', 'mission': pd.NA},
{'firstname': 'Alex', 'lastname': 'Vogel', 'role': 'chemist', 'mission': pd.NA},
{'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer', 'mission': pd.NA},
{'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic', 'mission': pd.NA},
])
# dataframe `DATA` without empty columns
# type: pd.DataFrame
result = ...
# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8
# %% English
# 1. Define variable `result` with
# dataframe `DATA` with values in column `agency` forward filled
# 2. Run doctests - all must succeed
# %% Polish
# 1. Zdefiniuj zmienną `result` z
# dataframe `DATA` z wartościami w kolumnie `agency` wypełnionymi w przód
# 2. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result
firstname lastname role agency
0 Mark Watney botanist NASA
1 Melissa Lewis commander NASA
2 Rick Martinez pilot NASA
3 Alex Vogel chemist ESA
4 Beth Johanssen engineer NASA
5 Chris Back medic NASA
"""
import pandas as pd
DATA = pd.DataFrame([
{'firstname': 'Mark', 'lastname': 'Watney', 'role': 'botanist', 'agency': 'NASA'},
{'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'commander', 'agency': pd.NA},
{'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'pilot', 'agency': pd.NA},
{'firstname': 'Alex', 'lastname': 'Vogel', 'role': 'chemist', 'agency': 'ESA'},
{'firstname': 'Beth', 'lastname': 'Johanssen', 'role': 'engineer', 'agency': 'NASA'},
{'firstname': 'Chris', 'lastname': 'Back', 'role': 'medic', 'agency': pd.NA},
])
# dataframe `DATA` with values in column `agency` forward filled
# type: pd.DataFrame
result = ...
# %% About
# - Name: DataFrame NaN
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8
# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Skip first line with metadata
# 3. Rename columns to:
# - sepal_length
# - sepal_width
# - petal_length
# - petal_width
# - species
# 4. Replace values in column species
# - 0 -> 'setosa',
# - 1 -> 'versicolor',
# - 2 -> 'virginica'
# 5. Select values in column 'petal_length' less than 4
# 6. Set selected values to `NaN`
# 7. Drop rows with remaining `NaN` values
# 8. Define `result` as first two rows
# 9. Run doctests - all must succeed
# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Pomiń pierwszą linię z metadanymi
# 3. Zmień nazwy kolumn na:
# - sepal_length
# - sepal_width
# - petal_length
# - petal_width
# - species
# 4. Podmień wartości w kolumnie species
# - 0 -> 'setosa',
# - 1 -> 'versicolor',
# - 2 -> 'virginica'
# 5. Wybierz wartości w kolumnie 'petal_length' mniejsze od 4
# 6. Wybrane wartości ustaw na `NaN`
# 7. Usuń wiersze z pozostałymi wartościami `NaN`
# 8. Zdefiniuj `result` jako dwa pierwsze wiersze
# 9. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
sepal_length sepal_width petal_length petal_width species
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
"""
import pandas as pd
DATA = 'https://python3.info/_static/iris-dirty.csv'
COLUMNS = [
'sepal_length',
'sepal_width',
'petal_length',
'petal_width',
'species']
LABELS = {
0: 'setosa',
1: 'versicolor',
2: 'virginica',
}
# type: pd.DataFrame
result = ...