5.12. DataFrame Alter
pd.date_range('1999-12-30', periods=7)
df['not-existing'] = 1
df['not-existing'] = range(0,10)
df['not-existing'] = np.arange(0,10)
.transpose()
.drop(columns=[...])
.drop(index=[...])
.drop_duplicates()
5.12.1. SetUp
>>> import pandas as pd
>>> import numpy as np
5.12.2. Add Rows and Columns
Add Column:
>>> df = pd.DataFrame({
... 'A': [10, 11, 12],
... 'B': [20, 21, 22],
... 'C': [30, 31, 32]})
>>>
>>> df
A B C
0 10 20 30
1 11 21 31
2 12 22 32
>>> df['X'] = ['a', 'b', 'c']
>>> df
A B C X
0 10 20 30 a
1 11 21 31 b
2 12 22 32 c
>>> df['X'] = ['a', 'b']
Traceback (most recent call last):
ValueError: Length of values (2) does not match length of index (3)
>>> df['X'] = ['a', 'b', 'c', 'd']
Traceback (most recent call last):
ValueError: Length of values (4) does not match length of index (3)
>>> df['Z'] = np.arange(3.0)
>>> df
A B C X Z
0 10 20 30 a 0.0
1 11 21 31 b 1.0
2 12 22 32 c 2.0
5.12.3. Drop Rows and Columns
Works with
inplace=True
Drop Column:
>>> df = pd.DataFrame({
... 'A': [10, 11, 12],
... 'B': [20, 21, 22],
... 'C': [30, 31, 32]})
>>>
>>> df
A B C
0 10 20 30
1 11 21 31
2 12 22 32
>>> df.drop('A', axis='columns')
B C
0 20 30
1 21 31
2 22 32
>>> df.drop(columns='A')
B C
0 20 30
1 21 31
2 22 32
>>> df.drop(columns=['A', 'B'])
C
0 30
1 31
2 32
Drop Row:
>>> df = pd.DataFrame({
... 'A': [10, 11, 12],
... 'B': [20, 21, 22],
... 'C': [30, 31, 32]})
>>>
>>> df
A B C
0 10 20 30
1 11 21 31
2 12 22 32
>>> df.drop(1)
A B C
0 10 20 30
2 12 22 32
>>> df.drop([0, 2])
A B C
1 11 21 31
>>> rows = df[:2].index
>>> df.drop(rows)
A B C
2 12 22 32
Drop from Timeseries:
>>> np.random.seed(0)
>>>
>>> df = pd.DataFrame(
... columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
... index = pd.date_range('1999-12-30', periods=7),
... data = np.random.randn(7, 4))
>>>
>>> df
Morning Noon Evening Midnight
1999-12-30 1.764052 0.400157 0.978738 2.240893
1999-12-31 1.867558 -0.977278 0.950088 -0.151357
2000-01-01 -0.103219 0.410599 0.144044 1.454274
2000-01-02 0.761038 0.121675 0.443863 0.333674
2000-01-03 1.494079 -0.205158 0.313068 -0.854096
2000-01-04 -2.552990 0.653619 0.864436 -0.742165
2000-01-05 2.269755 -1.454366 0.045759 -0.187184
>>> df.drop('1999-12-30')
Morning Noon Evening Midnight
1999-12-31 1.867558 -0.977278 0.950088 -0.151357
2000-01-01 -0.103219 0.410599 0.144044 1.454274
2000-01-02 0.761038 0.121675 0.443863 0.333674
2000-01-03 1.494079 -0.205158 0.313068 -0.854096
2000-01-04 -2.552990 0.653619 0.864436 -0.742165
2000-01-05 2.269755 -1.454366 0.045759 -0.187184
5.12.4. Transpose
df.transpose()
ordf.T
df.transpose()
is preferred
>>> df = pd.DataFrame({
... 'A': [10, 11, 12],
... 'B': [20, 21, 22],
... 'C': [30, 31, 32]})
>>>
>>> df
A B C
0 10 20 30
1 11 21 31
2 12 22 32
>>>
>>> df.transpose()
0 1 2
A 10 11 12
B 20 21 22
C 30 31 32
>>>
>>> df.T
0 1 2
A 10 11 12
B 20 21 22
C 30 31 32
>>> df = pd.DataFrame({
... 'A': [10, 11, 12],
... 'B': [20, 21, 22],
... 'C': [30, 31, 32]})
>>>
>>> x = df['A'] # will select column A
>>> x = df['B'] # will select column B
>>> x = df['C'] # will select column C
>>>
>>> x = df.A # will select column A
>>> x = df.B # will select column B
>>> x = df.C # will select column C
>>>
>>> x = df.T # will transpose data
>>> x = df.transpose() # will transpose data
>>> df = pd.DataFrame({
... 'R': [10, 11, 12],
... 'S': [20, 21, 22],
... 'T': [30, 31, 32]})
>>>
>>> x = df['R'] # will select column R
>>> x = df['S'] # will select column S
>>> x = df['T'] # will select column T
>>>
>>> x = df.R # will select column R
>>> x = df.S # will select column S
>>> x = df.T # will transpose data
>>>
>>> x = df.transpose() # will transpose data
5.12.5. Assignments
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame Alter Categorize
# - Difficulty: medium
# - Lines: 8
# - Minutes: 8
# %% English
# 1. Modify `df: pd.DataFrame` (cars dataset)
# 2. Add column `status` with values:
# - `new` if `mileage` from 0 to 10_000 km
# - `young` if `mileage` from 10_000 km to 100_000 km
# - `old` if `mileage` above 100_000 km
# 3. All ranges includes lower bounds and exclude upper bounds
# 4. Do not use `pd.cut()` or `pd.select()`
# 5. Run doctests - all must succeed
# %% Polish
# 1. Zmodyfikuj `df: pd.DataFrame` (zestaw danych o samochodach)
# 2. Dodaj kolumnę `status` o wartościach:
# - `new` jeżeli `mileage` od 0 do 10_000 km
# - `young` jeżeli `mileage` od 10_000 km do 100_000 km
# - `old` jeżeli `mileage` powyżej 100_000 km
# 3. Wszystkie przedziały włączają dolny zares i wyłączają górny zakres
# 4. Nie używaj `pd.cut()` ani `pd.select()`
# 5. Uruchom doctesty - wszystkie muszą się powieść
# %% Hints
# - `pd.NA`
# - `DataFrame.loc[query, column] = value`
# - `DataFrame.between()`
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
mileage consumption status
0 199340 2 old
1 43567 0 young
2 173685 0 old
3 117952 4 old
4 176963 5 old
.. ... ... ...
45 49866 16 young
46 123031 17 old
47 125603 5 old
48 11723 9 young
49 174962 3 old
<BLANKLINE>
[50 rows x 3 columns]
"""
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
'mileage': np.random.randint(0, 200_000, size=50),
'consumption': np.random.randint(0, 21, size=50),
})
# Add column `status` with values:
# - `new` if `mileage` from 0 to 10_000 km
# - `young` if `mileage` from 10_000 km to 100_000 km
# - `old` if `mileage` above 100_000 km
# All ranges includes lower bounds and exclude upper bounds
# Do not use `pd.cut()` or `pd.select()`
# type: pd.DataFrame
result = ...
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame Alter Cut
# - Difficulty: medium
# - Lines: 15
# - Minutes: 21
# %% English
# 1. Modify `df: pd.DataFrame` (cars dataset)
# 2. Using `pd.cut` add column `type`:
# - if `consumption` from 0 to 1 then `type` is `electric`
# - if `consumption` from 1 to 10 then `type` is `car`
# - if `consumption` from 10 to 100 then `type` is `truck`
# 3. All ranges includes lower bounds and exclude upper bounds
# 4. Use `pd.cut()` function
# 5. Run doctests - all must succeed
# %% Polish
# 1. Zmodyfikuj `df: pd.DataFrame` (zestaw danych o samochodach)
# 2. Używając `pd.cut` dodaj kolumnę `type`:
# - jeżeli `consumption` od 0 do 1 to `type` jest `electric`
# - jeżeli `consumption` od 2 do 10 to `type` jest `car`
# - jeżeli `consumption` od 10 do 100 to `type` jest `truck`
# 3. Wszystkie przedziały włączają dolny zares i wyłączają górny zakres
# 4. Użyj funkcji `pd.cut()`
# 5. Uruchom doctesty - wszystkie muszą się powieść
# %% Hints
# - `pd.DataFrame()`
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result
mileage consumption type
0 199340 2 car
1 43567 0 electric
2 173685 0 electric
3 117952 4 car
4 176963 5 car
.. ... ... ...
45 49866 16 truck
46 123031 17 truck
47 125603 5 car
48 11723 9 car
49 174962 3 car
<BLANKLINE>
[50 rows x 3 columns]
"""
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
'mileage': np.random.randint(0, 200_000, size=50),
'consumption': np.random.randint(0, 21, size=50),
})
# Using `pd.cut` add column `type`:
# - if `consumption` from 0 to 1 then `type` is `electric`
# - if `consumption` from 1 to 10 then `type` is `car`
# - if `consumption` from 10 to 100 then `type` is `truck`
# All ranges includes lower bounds and exclude upper bounds
# Use `pd.cut()` function
# type: pd.DataFrame
result = ...