5.12. DataFrame Alter

  • pd.date_range('1999-12-30', periods=7)

  • df['not-existing'] = 1

  • df['not-existing'] = range(0,10)

  • df['not-existing'] = np.arange(0,10)

  • .transpose()

  • .drop(columns=[...])

  • .drop(index=[...])

  • .drop_duplicates()

5.12.1. SetUp

>>> import pandas as pd
>>> import numpy as np

5.12.2. Add Rows and Columns

Add Column:

>>> df = pd.DataFrame({
...     'A': [10, 11, 12],
...     'B': [20, 21, 22],
...     'C': [30, 31, 32]})
>>>
>>> df
    A   B   C
0  10  20  30
1  11  21  31
2  12  22  32
>>> df['X'] = ['a', 'b', 'c']
>>> df
    A   B   C  X
0  10  20  30  a
1  11  21  31  b
2  12  22  32  c
>>> df['X'] = ['a', 'b']
Traceback (most recent call last):
ValueError: Length of values (2) does not match length of index (3)
>>> df['X'] = ['a', 'b', 'c', 'd']
Traceback (most recent call last):
ValueError: Length of values (4) does not match length of index (3)
>>> df['Z'] = np.arange(3.0)
>>> df
    A   B   C  X    Z
0  10  20  30  a  0.0
1  11  21  31  b  1.0
2  12  22  32  c  2.0

5.12.3. Drop Rows and Columns

  • Works with inplace=True

Drop Column:

>>> df = pd.DataFrame({
...     'A': [10, 11, 12],
...     'B': [20, 21, 22],
...     'C': [30, 31, 32]})
>>>
>>> df
    A   B   C
0  10  20  30
1  11  21  31
2  12  22  32
>>> df.drop('A', axis='columns')
    B   C
0  20  30
1  21  31
2  22  32
>>> df.drop(columns='A')
    B   C
0  20  30
1  21  31
2  22  32
>>> df.drop(columns=['A', 'B'])
    C
0  30
1  31
2  32

Drop Row:

>>> df = pd.DataFrame({
...     'A': [10, 11, 12],
...     'B': [20, 21, 22],
...     'C': [30, 31, 32]})
>>>
>>> df
    A   B   C
0  10  20  30
1  11  21  31
2  12  22  32
>>> df.drop(1)
    A   B   C
0  10  20  30
2  12  22  32
>>> df.drop([0, 2])
    A   B   C
1  11  21  31
>>> rows = df[:2].index
>>> df.drop(rows)
    A   B   C
2  12  22  32

Drop from Timeseries:

>>> np.random.seed(0)
>>>
>>> df = pd.DataFrame(
...     columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
...     index = pd.date_range('1999-12-30', periods=7),
...     data = np.random.randn(7, 4))
>>>
>>> df
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.867558 -0.977278  0.950088 -0.151357
2000-01-01 -0.103219  0.410599  0.144044  1.454274
2000-01-02  0.761038  0.121675  0.443863  0.333674
2000-01-03  1.494079 -0.205158  0.313068 -0.854096
2000-01-04 -2.552990  0.653619  0.864436 -0.742165
2000-01-05  2.269755 -1.454366  0.045759 -0.187184
>>> df.drop('1999-12-30')
             Morning      Noon   Evening  Midnight
1999-12-31  1.867558 -0.977278  0.950088 -0.151357
2000-01-01 -0.103219  0.410599  0.144044  1.454274
2000-01-02  0.761038  0.121675  0.443863  0.333674
2000-01-03  1.494079 -0.205158  0.313068 -0.854096
2000-01-04 -2.552990  0.653619  0.864436 -0.742165
2000-01-05  2.269755 -1.454366  0.045759 -0.187184

5.12.4. Transpose

  • df.transpose() or df.T

  • df.transpose() is preferred

>>> df = pd.DataFrame({
...     'A': [10, 11, 12],
...     'B': [20, 21, 22],
...     'C': [30, 31, 32]})
>>>
>>> df
    A   B   C
0  10  20  30
1  11  21  31
2  12  22  32
>>>
>>> df.transpose()
    0   1   2
A  10  11  12
B  20  21  22
C  30  31  32
>>>
>>> df.T
    0   1   2
A  10  11  12
B  20  21  22
C  30  31  32
>>> df = pd.DataFrame({
...     'A': [10, 11, 12],
...     'B': [20, 21, 22],
...     'C': [30, 31, 32]})
>>>
>>> x = df['A']         # will select column A
>>> x = df['B']         # will select column B
>>> x = df['C']         # will select column C
>>>
>>> x = df.A            # will select column A
>>> x = df.B            # will select column B
>>> x = df.C            # will select column C
>>>
>>> x = df.T            # will transpose data
>>> x = df.transpose()  # will transpose data
>>> df = pd.DataFrame({
...     'R': [10, 11, 12],
...     'S': [20, 21, 22],
...     'T': [30, 31, 32]})
>>>
>>> x = df['R']         # will select column R
>>> x = df['S']         # will select column S
>>> x = df['T']         # will select column T
>>>
>>> x = df.R            # will select column R
>>> x = df.S            # will select column S
>>> x = df.T            # will transpose data
>>>
>>> x = df.transpose()  # will transpose data

5.12.5. Assignments

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame Alter Categorize
# - Difficulty: medium
# - Lines: 8
# - Minutes: 8

# %% English
# 1. Modify `df: pd.DataFrame` (cars dataset)
# 2. Add column `status` with values:
#    - `new` if `mileage` from 0 to 10_000 km
#    - `young` if `mileage` from 10_000 km to 100_000 km
#    - `old` if `mileage` above 100_000 km
# 3. All ranges includes lower bounds and exclude upper bounds
# 4. Do not use `pd.cut()` or `pd.select()`
# 5. Run doctests - all must succeed

# %% Polish
# 1. Zmodyfikuj `df: pd.DataFrame` (zestaw danych o samochodach)
# 2. Dodaj kolumnę `status` o wartościach:
#    - `new` jeżeli `mileage` od 0 do 10_000 km
#    - `young` jeżeli `mileage` od 10_000 km do 100_000 km
#    - `old` jeżeli `mileage` powyżej 100_000 km
# 3. Wszystkie przedziały włączają dolny zares i wyłączają górny zakres
# 4. Nie używaj `pd.cut()` ani `pd.select()`
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - `pd.NA`
# - `DataFrame.loc[query, column] = value`
# - `DataFrame.between()`

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
    mileage  consumption status
0    199340            2    old
1     43567            0  young
2    173685            0    old
3    117952            4    old
4    176963            5    old
..      ...          ...    ...
45    49866           16  young
46   123031           17    old
47   125603            5    old
48    11723            9  young
49   174962            3    old
<BLANKLINE>
[50 rows x 3 columns]
"""

import pandas as pd
import numpy as np
np.random.seed(0)


df = pd.DataFrame({
    'mileage': np.random.randint(0, 200_000, size=50),
    'consumption': np.random.randint(0, 21, size=50),
})

# Add column `status` with values:
# - `new` if `mileage` from 0 to 10_000 km
# - `young` if `mileage` from 10_000 km to 100_000 km
# - `old` if `mileage` above 100_000 km
# All ranges includes lower bounds and exclude upper bounds
# Do not use `pd.cut()` or `pd.select()`
# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame Alter Cut
# - Difficulty: medium
# - Lines: 15
# - Minutes: 21

# %% English
# 1. Modify `df: pd.DataFrame` (cars dataset)
# 2. Using `pd.cut` add column `type`:
#    - if `consumption` from 0 to 1 then `type` is `electric`
#    - if `consumption` from 1 to 10 then `type` is `car`
#    - if `consumption` from 10 to 100 then `type` is `truck`
# 3. All ranges includes lower bounds and exclude upper bounds
# 4. Use `pd.cut()` function
# 5. Run doctests - all must succeed

# %% Polish
# 1. Zmodyfikuj `df: pd.DataFrame` (zestaw danych o samochodach)
# 2. Używając `pd.cut` dodaj kolumnę `type`:
#    - jeżeli `consumption` od 0 do 1 to `type` jest `electric`
#    - jeżeli `consumption` od 2 do 10 to `type` jest `car`
#    - jeżeli `consumption` od 10 do 100 to `type` jest `truck`
# 3. Wszystkie przedziały włączają dolny zares i wyłączają górny zakres
# 4. Użyj funkcji `pd.cut()`
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - `pd.DataFrame()`

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result
    mileage  consumption      type
0    199340            2       car
1     43567            0  electric
2    173685            0  electric
3    117952            4       car
4    176963            5       car
..      ...          ...       ...
45    49866           16     truck
46   123031           17     truck
47   125603            5       car
48    11723            9       car
49   174962            3       car
<BLANKLINE>
[50 rows x 3 columns]
"""

import pandas as pd
import numpy as np
np.random.seed(0)


df = pd.DataFrame({
    'mileage': np.random.randint(0, 200_000, size=50),
    'consumption': np.random.randint(0, 21, size=50),
})

# Using `pd.cut` add column `type`:
# - if `consumption` from 0 to 1 then `type` is `electric`
# - if `consumption` from 1 to 10 then `type` is `car`
# - if `consumption` from 10 to 100 then `type` is `truck`
# All ranges includes lower bounds and exclude upper bounds
# Use `pd.cut()` function
# type: pd.DataFrame
result = ...