DataFrame Create

  • pd.DataFrame(list[dict])

  • pd.DataFrame(dict[str,list])

>>> data = [
...     {"firstname": "Rick", "lastname": "Martinez"},
...     {"firstname": "Melissa", "lastname": "Lewis"},
...     {"firstname": "Mark", "lastname": "Watney"},
... ]
>>>
>>> df = pd.DataFrame(data)
>>>
>>> df
  firstname  lastname
0      Rick  Martinez
1   Melissa     Lewis
2      Mark    Watney
>>> df = pd.DataFrame({
...     'firstname': ['Mark', 'Melissa', 'Rick'],
...     'lastname': ['Watney', 'Lewis', 'Martinez'],
... })
>>>
>>> df
  firstname  lastname
0      Mark    Watney
1   Melissa     Lewis
2      Rick  Martinez

Create from List of Dicts

Create from Dict

Create from NDArray

DataFrame Attributes

  • df.columns

  • df.index

  • df.values

>>> df.columns
Index(['firstname', 'lastname'], dtype='object')
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.values
array([['Mark', 'Watney'],
       ['Melissa', 'Lewis'],
       ['Rick', 'Martinez']], dtype=object)
>>> df
  firstname  lastname is active
0      Mark    Watney      True
1   Melissa     Lewis     False
2      Rick  Martinez      True

df.firstname # works df['firstname'] # works

df.is active # error df['is active'] # works

Columns

Indexes

Values

DataFrame Index

  • DataFrame.index

  • pd.DataFrame(index=...)

  • Range Index

  • Index

  • Object Index

  • Datetime Index

  • Timedelta Index

  • Period Index

  • Interval Index

  • Categorical Index

  • Multi Index

>>> df
  firstname  lastname
0      Mark    Watney
1   Melissa     Lewis
2      Rick  Martinez
>>> df = pd.DataFrame({
...     'firstname': ['Mark', 'Melissa', 'Rick'],
...     'lastname': ['Watney', 'Lewis', 'Martinez'],
... }, index=pd.date_range(start='2000-01-01', periods=3))
>>>
>>>
>>> df
           firstname  lastname
2000-01-01      Mark    Watney
2000-01-02   Melissa     Lewis
2000-01-03      Rick  Martinez
>>> df['firstname']
0       Mark
1    Melissa
2       Rick
Name: firstname, dtype: object
>>>
>>> df.firstname
0       Mark
1    Melissa
2       Rick
Name: firstname, dtype: object
>>> np.random.seed(0)
>>>
>>> data = np.random.randn(7,4)
>>> index = pd.date_range(start='1999-12-29', periods=len(data))
>>> columns = ['Morning', 'Noon', 'Evening', 'Midnight']
>>>
>>> df = pd.DataFrame(data=data, columns=columns, index=index)
>>>
>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>> df.loc['1999']
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
>>> df.iloc[0]
Morning     1.764052
Noon        0.400157
Evening     0.978738
Midnight    2.240893
Name: 1999-12-29 00:00:00, dtype: float64
>>> df.iloc[0:5]
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
>>> df.loc['1999-12-27':'2000-01-02']
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
>>> df.loc['1999-12-27':'2000-01-02']
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
>>>
>>>
>>> df.loc['1999-12-27':'2000-01-02', 'Noon']
1999-12-29    0.400157
1999-12-30   -0.977278
1999-12-31    0.410599
2000-01-01    0.121675
2000-01-02   -0.205158
Freq: D, Name: Noon, dtype: float64
>>>
>>> df.loc['1999-12-27':'2000-01-02', ['Noon', 'Midnight']]
                Noon  Midnight
1999-12-29  0.400157  2.240893
1999-12-30 -0.977278 -0.151357
1999-12-31  0.410599  1.454274
2000-01-01  0.121675  0.333674
2000-01-02 -0.205158 -0.854096
>>>
>>> df.loc['1999-12-27':'2000-01-02', 'Noon':'Midnight']
                Noon   Evening  Midnight
1999-12-29  0.400157  0.978738  2.240893
1999-12-30 -0.977278  0.950088 -0.151357
1999-12-31  0.410599  0.144044  1.454274
2000-01-01  0.121675  0.443863  0.333674
2000-01-02 -0.205158  0.313068 -0.854096
>>> df.loc['2000-01-01']
Morning     0.761038
Noon        0.121675
Evening     0.443863
Midnight    0.333674
Name: 2000-01-01 00:00:00, dtype: float64
>>>
>>> df.loc['2000-01-01':'2000-01-03']
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
>>>
>>>
>>> df.loc['2000-01-01':'2000-01-03':2]
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
>>>
>>> df.loc['2000-01-01':'2000-01-03':2, 'Morning'::2]
             Morning   Evening
2000-01-01  0.761038  0.443863
2000-01-03 -2.552990  0.864436
>>> df.at['2000-01-01', 'Noon']
0.12167501649282841
>>>
>>>
>>> df.iat[3, 1]
0.12167501649282841
>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> df.loc[:]
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> df.loc[:, 'Noon']
1999-12-29    0.400157
1999-12-30   -0.977278
1999-12-31    0.410599
2000-01-01    0.121675
2000-01-02   -0.205158
2000-01-03    0.653619
2000-01-04   -1.454366
Freq: D, Name: Noon, dtype: float64

Deprecation

Range Index

Integer Index

  • Int64Index, UInt64Index and Float64Index have been deprecated in favor of

>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> df.ndim
2
>>> df.size
28
>>>
>>> len(df)
7

Object Index

Datetime Index

Set Index

DataFrame Sample

  • .sample(n=5)

  • .sample(n=5, replace=True)

  • .sample(frac=.5)

  • .sample(frac=1/2)

  • .head(n=5)

  • .tail(n=5)

  • .first('5D') - works only on time series

  • .last('5D') - works only on time series

  • .reset_index(drop=True)

>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> df.sample()
             Morning      Noon   Evening  Midnight
1999-12-31 -0.103219  0.410599  0.144044  1.454274
>>>
>>> df.sample(n=5)
             Morning      Noon   Evening  Midnight
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-01  0.761038  0.121675  0.443863  0.333674
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-29  1.764052  0.400157  0.978738  2.240893
>>>
>>> df.sample(n=5, replace=True)
             Morning      Noon   Evening  Midnight
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
2000-01-01  0.761038  0.121675  0.443863  0.333674

Tail

First

Last

Sample

  • 1/4 is 25%

  • .05 is 5%

  • 0.5 is 50%

  • 1.0 is 100%

Reset Index

DataFrame Getitem

  • .at[] - takes tuple[str,str] as argument

  • .loc[] - takes tuple[str,str] as argument

  • .iat[] - takes tuple[int,int] as argument

  • .iloc[] - takes tuple[int,int] as argument

Columns

Rows

Columns by Index

DataFrame Slice

  • df[1:3]

  • .loc[:, 'Morning':'Evening']

Slicing by index

Column range

DataFrame At

  • .at[row, col] - fancy indexing

  • .iat[row, col] - integer at (no fancy indexing)

  • Access a single value for a row/column pair by integer position

  • Use iat if you need to get or set a single value in a DataFrame or Series

Get value at specified row/column pair

  • First argument is column

  • Second argument is row

Get value from row

  • loc returns Series

Set value at a position

DataFrame Loc

  • .loc[] - uses fancy indexing, start and stop are included!!

  • .iloc[] - only index numbers, behaves like Python slices

Single row

  • Returns the row as a pd.Series

Range of rows

  • Returns the rows as a pd.DataFrame

Range of dates

Values in Selected Columns

  • Note that both the start and stop of the slice are included

Fancy Indexing

  • Return row for given index is True

Callable

Setting Values

DataFrame Select

  • df[df['Morning'] > 0.0]

  • ~ - logical not

  • & - logical and

  • | - logical or

  • ^ - logical xor

>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>>
>>> df > 0.0
            Morning   Noon  Evening  Midnight
1999-12-29     True   True     True      True
1999-12-30     True  False     True     False
1999-12-31    False   True     True      True
2000-01-01     True   True     True      True
2000-01-02     True  False     True     False
2000-01-03    False   True     True     False
2000-01-04     True  False     True     False
>>>
>>> df[df>0.0]
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558       NaN  0.950088       NaN
1999-12-31       NaN  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079       NaN  0.313068       NaN
2000-01-03       NaN  0.653619  0.864436       NaN
2000-01-04  2.269755       NaN  0.045759       NaN
>>> query = df >= 0.0
>>> df[query]
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558       NaN  0.950088       NaN
1999-12-31       NaN  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079       NaN  0.313068       NaN
2000-01-03       NaN  0.653619  0.864436       NaN
2000-01-04  2.269755       NaN  0.045759       NaN
>>> query = df['Noon'] >= 0.0
>>> df[query]
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
>>> df.query('Noon >= 0.0')
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-03 -2.552990  0.653619  0.864436 -0.742165

Query Data

  • df.where() Works with inplace=True

>>> df.where((df>0), (df==0))
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558     False  0.950088     False
1999-12-31     False  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079     False  0.313068     False
2000-01-03     False  0.653619  0.864436     False
2000-01-04  2.269755     False  0.045759     False

Logical NOT

>>> df[~query]
             Morning      Noon  Evening  Midnight
1999-12-29  1.764052       NaN      NaN  2.240893
1999-12-30  1.867558 -0.977278      NaN -0.151357
1999-12-31 -0.103219       NaN      NaN  1.454274
2000-01-01       NaN       NaN      NaN       NaN
2000-01-02  1.494079 -0.205158      NaN -0.854096
2000-01-03 -2.552990       NaN      NaN -0.742165
2000-01-04  2.269755 -1.454366      NaN -0.187184

Logical AND

  • In first and in second query

query = (df>0.0) & (df<1.0) >>> df[query]

Morning Noon Evening Midnight

1999-12-29 NaN 0.400157 0.978738 NaN 1999-12-30 NaN NaN 0.950088 NaN 1999-12-31 NaN 0.410599 0.144044 NaN 2000-01-01 0.761038 0.121675 0.443863 0.333674 2000-01-02 NaN NaN 0.313068 NaN 2000-01-03 NaN 0.653619 0.864436 NaN 2000-01-04 NaN NaN 0.045759 NaN

>>> query
            Morning   Noon  Evening  Midnight
1999-12-29    False   True     True     False
1999-12-30    False  False     True     False
1999-12-31    False   True     True     False
2000-01-01     True   True     True      True
2000-01-02    False  False     True     False
2000-01-03    False   True     True     False
2000-01-04    False  False     True     False

Logical OR

  • In first or in second query

>>> query = (df>0) | (df==0)
>>> df[query]
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558       NaN  0.950088       NaN
1999-12-31       NaN  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079       NaN  0.313068       NaN
2000-01-03       NaN  0.653619  0.864436       NaN
2000-01-04  2.269755       NaN  0.045759       NaN

Logical XOR

  • In first or in second, but not in both queries

DataFrame Query

  • .query()

  • .eval()

  • .assign()

>>> value = 0.0
>>> df.query('Noon >= @value')
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
>>> df.query('index >= "2000-01-01"')
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184

days = ['1999-12-28', '2000-01-01', '2000-01-03'] >>> df.query('index in @days') # deprecated

Morning Noon Evening Midnight

2000-01-01 0.761038 0.121675 0.443863 0.333674 2000-01-03 -2.552990 0.653619 0.864436 -0.742165

>>> days = [
...     pd.Timestamp('1999-12-28'),
...     pd.Timestamp('2000-01-01'),
...     pd.Timestamp('2000-01-03'),
... ]
>>> days
[Timestamp('1999-12-28 00:00:00'), Timestamp('2000-01-01 00:00:00'), Timestamp('2000-01-03 00:00:00')]
>>>
>>> df.query('index in @days')
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
>>> days = pd.to_datetime(['1999-12-28', '2000-01-01', '2000-01-03'])
>>> days
DatetimeIndex(['1999-12-28', '2000-01-01', '2000-01-03'], dtype='datetime64[ns]', freq=None)
>>>
>>> df.query('index in @days')
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-03 -2.552990  0.653619  0.864436 -0.742165

Query Data

>>> df.assign(
...     mean=lambda df: df.mean(),
...     std=lambda df: df.std(),
... )
             Morning      Noon   Evening  Midnight  mean  std
1999-12-29  1.764052  0.400157  0.978738  2.240893   NaN  NaN
1999-12-30  1.867558 -0.977278  0.950088 -0.151357   NaN  NaN
1999-12-31 -0.103219  0.410599  0.144044  1.454274   NaN  NaN
2000-01-01  0.761038  0.121675  0.443863  0.333674   NaN  NaN
2000-01-02  1.494079 -0.205158  0.313068 -0.854096   NaN  NaN
2000-01-03 -2.552990  0.653619  0.864436 -0.742165   NaN  NaN
2000-01-04  2.269755 -1.454366  0.045759 -0.187184   NaN  NaN
>>>
>>> df.mean()
Morning     0.785753
Noon       -0.150107
Evening     0.534285
Midnight    0.299148
dtype: float64
>>> x = df.mean()
>>> type(x)
<class 'pandas.core.series.Series'>
>>>
>>> x.index
Index(['Morning', 'Noon', 'Evening', 'Midnight', 'A', 'B'], dtype='object')
>>>
>>> x.values
array([ 0.7857533 , -0.1501075 ,  0.53428509,  0.29914846,  0.        ,
        4.        ])

Query Index

  • Works also with Time Series

Query Columns

Query Variable

Query Save

DataFrame Update

  • df['column'] = 0

  • df[1:2] = 0

  • .loc[df['species'] == 0, 'species'] = 'Setosa'

  • .replace()

  • .eval()

>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>>
>>> df['A'] = 0
>>>
>>> df
             Morning      Noon   Evening  Midnight  A
1999-12-29  1.764052  0.400157  0.978738  2.240893  0
1999-12-30  1.867558 -0.977278  0.950088 -0.151357  0
1999-12-31 -0.103219  0.410599  0.144044  1.454274  0
2000-01-01  0.761038  0.121675  0.443863  0.333674  0
2000-01-02  1.494079 -0.205158  0.313068 -0.854096  0
2000-01-03 -2.552990  0.653619  0.864436 -0.742165  0
2000-01-04  2.269755 -1.454366  0.045759 -0.187184  0
>>>
>>> df['B'] = [1,2,3,4,5,6,7]
>>>
>>> df
             Morning      Noon   Evening  Midnight  A  B
1999-12-29  1.764052  0.400157  0.978738  2.240893  0  1
1999-12-30  1.867558 -0.977278  0.950088 -0.151357  0  2
1999-12-31 -0.103219  0.410599  0.144044  1.454274  0  3
2000-01-01  0.761038  0.121675  0.443863  0.333674  0  4
2000-01-02  1.494079 -0.205158  0.313068 -0.854096  0  5
2000-01-03 -2.552990  0.653619  0.864436 -0.742165  0  6
2000-01-04  2.269755 -1.454366  0.045759 -0.187184  0  7
>>> df[1:5]
             Morning  Noon   Evening  Midnight  A  B
1999-12-30  1.867558     0  0.950088 -0.151357  0  2
1999-12-31 -0.103219     0  0.144044  1.454274  0  3
2000-01-01  0.761038     0  0.443863  0.333674  0  4
2000-01-02  1.494079     0  0.313068 -0.854096  0  5
>>>
>>>
>>> df[1:5] = 0.0
>>>
>>> df
             Morning  Noon   Evening  Midnight  A  B
1999-12-29  1.764052     0  0.978738  2.240893  0  1
1999-12-30  0.000000     0  0.000000  0.000000  0  0
1999-12-31  0.000000     0  0.000000  0.000000  0  0
2000-01-01  0.000000     0  0.000000  0.000000  0  0
2000-01-02  0.000000     0  0.000000  0.000000  0  0
2000-01-03 -2.552990     0  0.864436 -0.742165  0  6
2000-01-04  2.269755     0  0.045759 -0.187184  0  7
>>> df.iloc[0:1]
             Morning  Noon   Evening  Midnight  A  B
1999-12-29  1.764052     0  0.978738  2.240893  0  1
>>>
>>> df.iloc[:, 0:1]
             Morning
1999-12-29  1.764052
1999-12-30  0.000000
1999-12-31  0.000000
2000-01-01  0.000000
2000-01-02  0.000000
2000-01-03 -2.552990
2000-01-04  2.269755
>>> df['Morning']
1999-12-29    1.764052
1999-12-30    0.000000
1999-12-31    0.000000
2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03   -2.552990
2000-01-04    2.269755
Freq: D, Name: Morning, dtype: float64
>>>
>>> df[0]
KeyError: 0

The above exception was the direct cause of the following exception: KeyError: 0

>>> df[0:1]
             Morning  Noon   Evening  Midnight  A  B
1999-12-29  1.764052     0  0.978738  2.240893  0  1

Update Column

Update Row

DataFrame Alter

  • pd.date_range('1999-12-30', periods=7)

  • df['not-existing'] = 1

  • df['not-existing'] = range(0,10)

  • df['not-existing'] = np.arange(0,10)

  • .transpose()

  • .drop(columns=[...])

  • .drop(index=[...])

  • .drop_duplicates()

>>> df.drop(index=['1999-12-29', '1999-12-30'])
             Morning  Noon   Evening  Midnight  A  B
1999-12-31  0.000000     0  0.000000  0.000000  0  0
2000-01-01  0.000000     0  0.000000  0.000000  0  0
2000-01-02  0.000000     0  0.000000  0.000000  0  0
2000-01-03 -2.552990     0  0.864436 -0.742165  0  6
2000-01-04  2.269755     0  0.045759 -0.187184  0  7
>>> df.drop(columns=['Morning', 'Noon'])
             Evening  Midnight  A  B
1999-12-29  0.978738  2.240893  0  1
1999-12-30  0.000000  0.000000  0  0
1999-12-31  0.000000  0.000000  0  0
2000-01-01  0.000000  0.000000  0  0
2000-01-02  0.000000  0.000000  0  0
2000-01-03  0.864436 -0.742165  0  6
2000-01-04  0.045759 -0.187184  0  7

Add Rows and Columns

Drop Rows and Columns

  • Works with inplace=True

Transpose

  • df.transpose() or df.T

  • df.transpose() is preferred

DataFrame NA

  • pd.NA and np.nan Represents missing values

  • pd.NA will be used in future, but for now, there are function which does not support it yet

  • .isna()

  • .dropna(how='any|all', axis='rows|columns')

  • .any()

  • .all()

  • .fillna(value|dict)

  • .ffill()

  • .bfill()

  • .interpolate() - works only with np.nan (not pd.NA)

>>> df.fillna(1)
             Morning  Noon   Evening  Midnight    A    B  T
1999-12-29  1.764052   1.0  0.978738  2.240893  1.0  1.0  1
1999-12-30  1.000000   1.0  1.000000  1.000000  1.0  1.0  1
1999-12-31  1.000000   1.0  1.000000  1.000000  1.0  1.0  1
2000-01-01  1.000000   1.0  1.000000  1.000000  1.0  1.0  1
2000-01-02  1.000000   1.0  1.000000  1.000000  1.0  1.0  1
2000-01-03 -2.552990   1.0  0.864436 -0.742165  1.0  6.0  1
2000-01-04  2.269755   1.0  0.045759 -0.187184  1.0  7.0  1
>>>
>>>
>>> df.fillna({
...     'Noon': 5.0,
...     'Evening': 0.0,
...     })
             Morning  Noon   Evening  Midnight   A    B  T
1999-12-29  1.764052   5.0  0.978738  2.240893 NaN  1.0  1
1999-12-30       NaN   5.0  0.000000       NaN NaN  NaN  1
1999-12-31       NaN   5.0  0.000000       NaN NaN  NaN  1
2000-01-01       NaN   5.0  0.000000       NaN NaN  NaN  1
2000-01-02       NaN   5.0  0.000000       NaN NaN  NaN  1
2000-01-03 -2.552990   5.0  0.864436 -0.742165 NaN  6.0  1
2000-01-04  2.269755   5.0  0.045759 -0.187184 NaN  7.0  1
>>> df
             Morning  Noon   Evening  Midnight   A    B  T
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
>>>
>>>
>>> df.drop(how='all')
TypeError: DataFrame.drop() got an unexpected keyword argument 'how'
>>> df.dropna(how='all')
             Morning  Noon   Evening  Midnight   A    B  T
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
>>>
>>> df.dropna(how='any')
Empty DataFrame
Columns: [Morning, Noon, Evening, Midnight, A, B, T]
Index: []
>>>
>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [Morning, Noon, Evening, Midnight, A, B, T]
Index: []
>>>
>>> df.dropna(how='any', axis='columns')
            T
1999-12-29  1
1999-12-30  1
1999-12-31  1
2000-01-01  1
2000-01-02  1
2000-01-03  1
2000-01-04  1
>>> df.dropna(how='all', axis='columns')
             Morning   Evening  Midnight    B  T
1999-12-29  1.764052  0.978738  2.240893  1.0  1
1999-12-30       NaN       NaN       NaN  NaN  1
1999-12-31       NaN       NaN       NaN  NaN  1
2000-01-01       NaN       NaN       NaN  NaN  1
2000-01-02       NaN       NaN       NaN  NaN  1
2000-01-03 -2.552990  0.864436 -0.742165  6.0  1
2000-01-04  2.269755  0.045759 -0.187184  7.0  1

Check if Any

Check if All

Check if Null

Check if NA

Fill With Scalar Value

Fill With Dict Values

Fill Forwards

Fill Backwards

Interpolate

Drop Rows with NA

Drop Columns with NA

Recap

DataFrame Sort

  • .sort_index(ascending, axis)

  • .sort_values(by, ascending, axis)

  • Works with inplace=True

  • axis=0 - Rows

  • axis=1 - Columns

>>> df.sort_values(by='Morning')
             Morning  Noon   Evening  Midnight   A    B  T
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
>>>
>>>
>>> df.sort_values(by=['Morning', 'Evening'])
             Morning  Noon   Evening  Midnight   A    B  T
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
>>> df.sort_values(by=['Morning', 'Evening'], ascending=False)
             Morning  Noon   Evening  Midnight   A    B  T
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
>>> df
             Morning  Noon   Evening  Midnight   A    B  T
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
>>>
>>>
>>> df.sort_values(by='Morning', axis='rows')
             Morning  Noon   Evening  Midnight   A    B  T
2000-01-03 -2.552990   NaN  0.864436 -0.742165 NaN  6.0  1
1999-12-29  1.764052   NaN  0.978738  2.240893 NaN  1.0  1
2000-01-04  2.269755   NaN  0.045759 -0.187184 NaN  7.0  1
1999-12-30       NaN   NaN       NaN       NaN NaN  NaN  1
1999-12-31       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-01       NaN   NaN       NaN       NaN NaN  NaN  1
2000-01-02       NaN   NaN       NaN       NaN NaN  NaN  1
>>>
>>> df.sort_values(by='Morning', axis='columns')
KeyError: 'Morning'
>>> df.sort_values(by='2000-01-03', axis='columns')
             Morning  Midnight   Evening  T    B  Noon   A
1999-12-29  1.764052  2.240893  0.978738  1  1.0   NaN NaN
1999-12-30       NaN       NaN       NaN  1  NaN   NaN NaN
1999-12-31       NaN       NaN       NaN  1  NaN   NaN NaN
2000-01-01       NaN       NaN       NaN  1  NaN   NaN NaN
2000-01-02       NaN       NaN       NaN  1  NaN   NaN NaN
2000-01-03 -2.552990 -0.742165  0.864436  1  6.0   NaN NaN
2000-01-04  2.269755 -0.187184  0.045759  1  7.0   NaN NaN

Sort Series by Index

Sort Rows by Index

Sort Columns by Index

Sort Rows by Value

Sorting Columns by Value

DataFrame Statistics

  • .count()

  • .value_counts()

  • .nunique()

  • .sum()

  • .cumsum()

  • .prod()

  • .cumprod()

  • .min()

  • .idxmin()

  • .cummin()

  • .max()

  • .idxmax()

  • .cummax()

  • .mean()

  • .median()

  • .mode()

  • .rolling(window=3).mean()

  • .abs()

  • .std()

  • .mad() - Mean absolute deviation

  • .sem() - Standard Error of the Mean (SEM)

  • .skew() - Skewness (3rd moment)

  • .kurt() - Kurtosis (4th moment)

  • .quantile(.33) - Sample quantile (value at %). Quantile also known as Percentile

  • .quantile([.25, .5, .75])

  • .var()

  • .corr()

  • .describe()

  • .nsmallest()

  • .values()

  • .rank()

>>> np.random.seed(0)
>>>
>>> df = pd.DataFrame(
...     data=np.random.randn(7,4),
...     index=pd.date_range(start='1999-12-29', freq='D', periods=7),
...     columns=['Morning', 'Noon', 'Evening', 'Midnight'],
... )
>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184

Count

>>> df.describe()
        Morning      Noon   Evening  Midnight
count  7.000000  7.000000  7.000000  7.000000
mean   0.785753 -0.150107  0.534285  0.299148
std    1.671798  0.787967  0.393169  1.151785
min   -2.552990 -1.454366  0.045759 -0.854096
25%    0.328909 -0.591218  0.228556 -0.464674
50%    1.494079  0.121675  0.443863 -0.151357
75%    1.815805  0.405378  0.907262  0.893974
max    2.269755  0.653619  0.978738  2.240893
>>>
>>> df.count()
Morning     7
Noon        7
Evening     7
Midnight    7
dtype: int64
>>>
>>> df.mean()
Morning     0.785753
Noon       -0.150107
Evening     0.534285
Midnight    0.299148
dtype: float64
>>>
>>> df.std()
Morning     1.671798
Noon        0.787967
Evening     0.393169
Midnight    1.151785
dtype: float64
>>>
>>> df.min()
Morning    -2.552990
Noon       -1.454366
Evening     0.045759
Midnight   -0.854096
dtype: float64
>>>
>>> df.quantile(.25)
Morning     0.328909
Noon       -0.591218
Evening     0.228556
Midnight   -0.464674
Name: 0.25, dtype: float64
>>>
>>> df.quantile(.50)
Morning     1.494079
Noon        0.121675
Evening     0.443863
Midnight   -0.151357
Name: 0.5, dtype: float64
>>>
>>> df.quantile(.75)
Morning     1.815805
Noon        0.405378
Evening     0.907262
Midnight    0.893974
Name: 0.75, dtype: float64
>>>
>>> df.max()
Morning     2.269755
Noon        0.653619
Evening     0.978738
Midnight    2.240893
dtype: float64
>>> df['Morning'].describe()
count    7.000000
mean     0.785753
std      1.671798
min     -2.552990
25%      0.328909
50%      1.494079
75%      1.815805
max      2.269755
Name: Morning, dtype: float64
>>> df['Morning'].describe()
count    7.000000
mean     0.785753
std      1.671798
min     -2.552990
25%      0.328909
50%      1.494079
75%      1.815805
max      2.269755
Name: Morning, dtype: float64
>>>
>>>
>>>
>>> df.Morning
1999-12-29    1.764052
1999-12-30    1.867558
1999-12-31   -0.103219
2000-01-01    0.761038
2000-01-02    1.494079
2000-01-03   -2.552990
2000-01-04    2.269755
Freq: D, Name: Morning, dtype: float64
>>>
>>> df['Morning']
1999-12-29    1.764052
1999-12-30    1.867558
1999-12-31   -0.103219
2000-01-01    0.761038
2000-01-02    1.494079
2000-01-03   -2.552990
2000-01-04    2.269755
Freq: D, Name: Morning, dtype: float64
>>>
>>> df[['Morning', 'Midnight']]
             Morning  Midnight
1999-12-29  1.764052  2.240893
1999-12-30  1.867558 -0.151357
1999-12-31 -0.103219  1.454274
2000-01-01  0.761038  0.333674
2000-01-02  1.494079 -0.854096
2000-01-03 -2.552990 -0.742165
2000-01-04  2.269755 -0.187184
>>>
>>> df['Morning':'Midnight']
DateParseError: Unknown datetime string format, unable to parse: Morning

The above exception was the direct cause of the following exception: TypeError: cannot do slice indexing on DatetimeIndex with these indexers [Morning] of type str

>>>
>>>
>>> df.loc[:, 'Morning':'Evening']
             Morning      Noon   Evening
1999-12-29  1.764052  0.400157  0.978738
1999-12-30  1.867558 -0.977278  0.950088
1999-12-31 -0.103219  0.410599  0.144044
2000-01-01  0.761038  0.121675  0.443863
2000-01-02  1.494079 -0.205158  0.313068
2000-01-03 -2.552990  0.653619  0.864436
2000-01-04  2.269755 -1.454366  0.045759
>>> columns = ['Morning', 'Midnight']
>>> df[columns]
             Morning  Midnight
1999-12-29  1.764052  2.240893
1999-12-30  1.867558 -0.151357
1999-12-31 -0.103219  1.454274
2000-01-01  0.761038  0.333674
2000-01-02  1.494079 -0.854096
2000-01-03 -2.552990 -0.742165
2000-01-04  2.269755 -0.187184
>>>
>>>
>>> df[['Morning', 'Midnight']]
             Morning  Midnight
1999-12-29  1.764052  2.240893
1999-12-30  1.867558 -0.151357
1999-12-31 -0.103219  1.454274
2000-01-01  0.761038  0.333674
2000-01-02  1.494079 -0.854096
2000-01-03 -2.552990 -0.742165
2000-01-04  2.269755 -0.187184
def loc(x):
if type(x) is str:

...

df.loc['2000-01-01'] Morning 0.761038 Noon 0.121675 Evening 0.443863 Midnight 0.333674 Name: 2000-01-01 00:00:00, dtype: float64

def loc(x):
if type(x) is list[str]:

...

>>> df.loc[ ['2000-01-01', '2000-01-02', '2000-01-03'] ]
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
def loc(x):
if type(x) is slice:

...

>>> # [] -> list()
>>> # () -> tuple()
>>> # user.firstname = 'Mark' -> setattr(user, 'firstname', 'Mark')
>>> # user.firstname -> getattr(user, 'firstname')
>>> # 1 -> int(1)
>>> # 1,2 -> tuple((1,2))
>>> # 1:4 -> slice(1,4)
>>> # 1:4:2 -> slice(1,4,2)
>>> # :2 -> slice(None, 2)
>>> df.loc['2000-01-01':'2000-01-03']
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
>>>
>>> x = slice('2000-01-01', '2000-01-03')
>>> df.loc[x]
             Morning      Noon   Evening  Midnight
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
def loc(x):
if type(x) is tuple[str,str]:

...

>>> df.loc['2000-01-01', 'Morning']
0.7610377251469934
>>> x = '2000-01-01', 'Morning'
>>> type(x)
<class 'tuple'>
def loc(x):
if type(x) is tuple[slice,str]:

...

>>> df.loc['2000-01-01':'2000-01-03', 'Morning']
2000-01-01    0.761038
2000-01-02    1.494079
2000-01-03   -2.552990
Freq: D, Name: Morning, dtype: float64
def loc(x):
if type(x) is tuple[slice,list[str]]:

...

>>> df.loc['2000-01-01':'2000-01-03', ['Morning', 'Evening']]
             Morning   Evening
2000-01-01  0.761038  0.443863
2000-01-02  1.494079  0.313068
2000-01-03 -2.552990  0.864436
def loc(x):
if type(x) is tuple[slice,slice]:

...

>>> df.loc['2000-01-01':'2000-01-03', 'Morning':'Evening']
             Morning      Noon   Evening
2000-01-01  0.761038  0.121675  0.443863
2000-01-02  1.494079 -0.205158  0.313068
2000-01-03 -2.552990  0.653619  0.864436

Sum

>>> df.sum()
Morning     5.500273
Noon       -1.050752
Evening     3.739996
Midnight    2.094039
dtype: float64
>>>
>>> df.sum(axis='rows')
Morning     5.500273
Noon       -1.050752
Evening     3.739996
Midnight    2.094039
dtype: float64
>>>
>>> df.sum(axis='columns')
1999-12-29    5.383841
1999-12-30    1.689011
1999-12-31    1.905697
2000-01-01    1.660250
2000-01-02    0.747893
2000-01-03   -1.777100
2000-01-04    0.673964
Freq: D, dtype: float64
>>> df.sum()
Morning     5.500273
Noon       -1.050752
Evening     3.739996
Midnight    2.094039
dtype: float64
>>>
>>> df.sum().sum()
10.283555433636499

Product

Extremes

Average

średnia: sum() / len()

>>> data = pd.Series([1, 2, 3, 4, 5])
>>> data.mean()
3.0
>>> data.median()
3.0
>>> data = pd.Series([1, 2, 3, 4, 1000000])
>>> data.mean()
200002.0
>>> data.median()
3.0
>>> from statistics import median_low, median_high, median, mean, median_grouped
>>>
>>>
>>> data = [1, 2, 3, 4, 5, 1000000]
>>>
>>> median(data)
3.5
>>> median_low(data)
3
>>> median_high(data)
4
>>> median_grouped(data)
3.5
>>>
>>> mean(data)
166669.16666666666
>>> df
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>>
>>> df.rolling(window=3).mean()
             Morning      Noon   Evening  Midnight
1999-12-29       NaN       NaN       NaN       NaN
1999-12-30       NaN       NaN       NaN       NaN
1999-12-31  1.176130 -0.055507  0.690957  1.181270
2000-01-01  0.841792 -0.148335  0.512665  0.545530
2000-01-02  0.717299  0.109038  0.300325  0.311284
2000-01-03 -0.099291  0.190045  0.540456 -0.420862
2000-01-04  0.403615 -0.335302  0.407754 -0.594482
>>>
>>>
>>> df.rolling(window=3)
Rolling [window=3,center=False,axis=0,method=single]
>>>
>>> df.rolling(window=3).median()
             Morning      Noon   Evening  Midnight
1999-12-29       NaN       NaN       NaN       NaN
1999-12-30       NaN       NaN       NaN       NaN
1999-12-31  1.764052  0.400157  0.950088  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  0.761038  0.121675  0.313068  0.333674
2000-01-03  0.761038  0.121675  0.443863 -0.742165
2000-01-04  1.494079 -0.205158  0.313068 -0.742165
>>>
>>> df.rolling(window=3).max()
             Morning      Noon   Evening  Midnight
1999-12-29       NaN       NaN       NaN       NaN
1999-12-30       NaN       NaN       NaN       NaN
1999-12-31  1.867558  0.410599  0.978738  2.240893
2000-01-01  1.867558  0.410599  0.950088  1.454274
2000-01-02  1.494079  0.410599  0.443863  1.454274
2000-01-03  1.494079  0.653619  0.864436  0.333674
2000-01-04  2.269755  0.653619  0.864436 -0.187184
>>>
>>>
>>> df.rolling(window=3).std()
             Morning      Noon   Evening  Midnight
1999-12-29       NaN       NaN       NaN       NaN
1999-12-30       NaN       NaN       NaN       NaN
1999-12-31  1.109157  0.798294  0.473857  1.219268
2000-01-01  0.987867  0.732277  0.407403  0.823514
2000-01-02  0.799547  0.308073  0.150315  1.154347
2000-01-03  2.156343  0.433452  0.288096  0.655840
2000-01-04  2.589701  1.060001  0.417471  0.357142

Distribution

>>> df.skew()
Morning    -1.602706
Noon       -0.907414
Evening     0.031047
Midnight    0.915190
dtype: float64
>>>
>>> df.std()
Morning     1.671798
Noon        0.787967
Evening     0.393169
Midnight    1.151785
dtype: float64
>>>
>>> df.sem()
Morning     0.631880
Noon        0.297824
Evening     0.148604
Midnight    0.435334
dtype: float64
>>>
>>> df.kurt()
Morning     2.502051
Noon       -0.588010
Evening    -2.208781
Midnight   -0.351782
dtype: float64
>>>
>>> df.kurtosis()
Morning     2.502051
Noon       -0.588010
Evening    -2.208781
Midnight   -0.351782
dtype: float64
>>> df.var()
Morning     2.794907
Noon        0.620892
Evening     0.154582
Midnight    1.326610
dtype: float64
>>>
>>> df.corr()
           Morning      Noon   Evening  Midnight
Morning   1.000000 -0.698340 -0.190219  0.201034
Noon     -0.698340  1.000000  0.307686  0.359761
Evening  -0.190219  0.307686  1.000000  0.136436
Midnight  0.201034  0.359761  0.136436  1.000000

Describe

Other

  • .nsmallest()

  • .values()

  • .rank()

>>> df.nsmallest(3, columns=['Morning', 'Noon'])
             Morning      Noon   Evening  Midnight
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674

DataFrame Rolling

  • .rolling(window=10)

  • .resample(freq)

  • .shift(periods=1, freq="D")

  • .diff()

>>> orig = df
>>> plus1 = df.shift(periods=1, freq='D')
>>>
>>> orig
             Morning      Noon   Evening  Midnight
1999-12-29  1.764052  0.400157  0.978738  2.240893
1999-12-30  1.867558 -0.977278  0.950088 -0.151357
1999-12-31 -0.103219  0.410599  0.144044  1.454274
2000-01-01  0.761038  0.121675  0.443863  0.333674
2000-01-02  1.494079 -0.205158  0.313068 -0.854096
2000-01-03 -2.552990  0.653619  0.864436 -0.742165
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> plus1
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.867558 -0.977278  0.950088 -0.151357
2000-01-01 -0.103219  0.410599  0.144044  1.454274
2000-01-02  0.761038  0.121675  0.443863  0.333674
2000-01-03  1.494079 -0.205158  0.313068 -0.854096
2000-01-04 -2.552990  0.653619  0.864436 -0.742165
2000-01-05  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> orig - plus1
             Morning      Noon   Evening  Midnight
1999-12-29       NaN       NaN       NaN       NaN
1999-12-30  0.103506 -1.377435 -0.028650 -2.392250
1999-12-31 -1.970777  1.387876 -0.806045  1.605631
2000-01-01  0.864257 -0.288923  0.299820 -1.120599
2000-01-02  0.733041 -0.326833 -0.130796 -1.187770
2000-01-03 -4.047069  0.858777  0.551368  0.111931
2000-01-04  4.822744 -2.107984 -0.818678  0.554981
2000-01-05       NaN       NaN       NaN       NaN
>>>
>>>
>>> 1.867558 - 1.764052
0.1035060000000001
>>> df.resample('W')
<pandas.core.resample.DatetimeIndexResampler object at 0x16e563080>
>>> df.resample('W').mean()
             Morning      Noon   Evening  Midnight
2000-01-02  1.156702 -0.050001  0.565960  0.604678
2000-01-09 -0.141618 -0.400374  0.455097 -0.464674
>>>
>>> df.resample(rule='W').mean()
             Morning      Noon   Evening  Midnight
2000-01-02  1.156702 -0.050001  0.565960  0.604678
2000-01-09 -0.141618 -0.400374  0.455097 -0.464674
>>>
>>> df.resample(rule='2D').mean()
             Morning      Noon   Evening  Midnight
1999-12-29  1.815805 -0.288560  0.964413  1.044768
1999-12-31  0.328909  0.266137  0.293953  0.893974
2000-01-02 -0.529455  0.224230  0.588752 -0.798130
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>> df.resample(rule='3D').mean()
             Morning      Noon   Evening  Midnight
1999-12-29  1.176130 -0.055507  0.690957  1.181270
2000-01-01 -0.099291  0.190045  0.540456 -0.420862
2000-01-04  2.269755 -1.454366  0.045759 -0.187184
>>>
>>>
>>> df.resample(rule='Q').mean()
<ipython-input-561-ad2ca8e5f3e2>:1: FutureWarning: 'Q' is deprecated and will be removed in a future version, please use 'QE' instead.
  df.resample(rule='Q').mean()
            Morning      Noon   Evening  Midnight
1999-12-31  1.17613 -0.055507  0.690957  1.181270
2000-03-31  0.49297 -0.221058  0.416781 -0.362443
>>>
>>> df.resample(rule='Y').mean()
<ipython-input-562-7e9570670f71>:1: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
  df.resample(rule='Y').mean()
            Morning      Noon   Evening  Midnight
1999-12-31  1.17613 -0.055507  0.690957  1.181270
2000-12-31  0.49297 -0.221058  0.416781 -0.362443
>>>
>>> df.resample(rule='YE').mean()
            Morning      Noon   Evening  Midnight
1999-12-31  1.17613 -0.055507  0.690957  1.181270
2000-12-31  0.49297 -0.221058  0.416781 -0.362443
>>>
>>>
>>> df.resample(rule='QE').mean()
            Morning      Noon   Evening  Midnight
1999-12-31  1.17613 -0.055507  0.690957  1.181270
2000-03-31  0.49297 -0.221058  0.416781 -0.362443

Resample

Rolling

Shift

  • Subtract DataFrame object from other shifted by index

Diff

  • Subtract next row from the previous

DataFrame Mapping

  • Series.map() - Map values of Series according to an input mapping or function

  • Series.apply() - Invoke function on values of Series

  • DataFrame.apply() - Apply a function along an axis of the DataFrame

  • DataFrame.applymap() - Apply a function to a Dataframe elementwise

  • DataFrame.pipe() - Apply chainable functions that expect Series or DataFrames

  • DataFrame.where(cond, sub) - Replace values where the condition is False

  • DataFrame.mask(cond, sub) - Replace values where the condition is True

  • Series.str.split(regex, expand=True) - Split strings around given separator/delimiter

  • Series.str.extract(regex, expand=True) - Extract capture groups in the regex pat as columns in a DataFrame

  • Series.str.extractall() - Extract capture groups in the regex pat as columns in DataFrame

  • Series.str.findall(regex) - Find all occurrences of pattern or regular expression in the Series/Index

  • Series.str.fullmatch(regex) - Determine if each string entirely matches a regular expression

  • Series.dt.strftime(...) - formatted strings specified by date_format, which supports the same string format as the python standard library

  • Series.dt.date - Returns numpy array of python datetime.date objects

  • Series.dt.time - Returns numpy array of datetime.time objects

  • Series.dt.timez - Returns numpy array of datetime.time objects with timezone information

Map

  • Works only on Series

  • Argument: dict, Series, or Callable

  • Works element-wise on a Series

  • Operate on one element at time

  • When passed a dictionary/Series will map elements based on the keys in that dictionary/Series, missing values will be recorded as NaN in the output

  • Is optimised for elementwise mappings and transformation

  • Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance [#stackoverflowMapApplyApplyMap]_

Apply

  • Works on both Series and DataFrame

  • Argument: Callable

  • On Series: operate on one element at time

  • On DataFrame: elementwise but also row / column basis

  • Suited to more complex operations and aggregation

  • The behaviour and return value depends on the function

  • Returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply

  • Has fastpaths when called with certain NumPy functions such as mean, sum, etc. [#stackoverflowMapApplyApplyMap]_

Applymap

  • Deprecated in favour of DataFrame.map()

  • Works only on DataFrame

  • Argument: Callable

  • Works element-wise on a DataFrame

  • Operate on one element at time

  • In more recent versions has been optimised for some operations

  • You will find applymap slightly faster than apply in some cases.

  • Test both and use whatever works better [#stackoverflowMapApplyApplyMap]_

Summary

Differentiation

Cleaning User Input

  • 80% of machine learning and data science is cleaning data

Is This the Same Address?

  • This is a dump of distinct records of a single address

  • Which one of the below is a true address?

Spelling and Abbreviations

House and Apartment Number

Phone Numbers

Conversion

DataFrame Pivot

  • pd.pivot_table()

  • Create a spreadsheet-style pivot table as a DataFrame

  • Levels in the pivot table will be stored in MultiIndex objects

Parameters

Returns

See Also

DataFrame GroupBy

  • .size()

  • .mean()

  • .nunique()

  • .sum()

  • .count()

  • .max()

  • .first()

Grouping

Groupby Methods

  • Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns

  • .size()

  • .mean()

  • .nunique()

  • .sum()

  • .count()

  • .max()

  • .first()

Size

Mean

Number of Uniques

Sum

Count

Minimum

Maximum

First

Last

Output format

  • Series or DataFrame?

Datasets

DataFrame Aggregations

  • .count()

  • .sum()

  • .nunique()

  • .mean()

  • .median()

  • .std()

  • .std2()

  • .min()

  • .quantile()

  • .max()

  • .first()

  • .last()

  • lambda column:

Single Statistic

Multiple Statistics per Group

Applying multiple functions to columns in groups

Named Aggregations

Renaming index

  • using droplevel and ravel

  • Dictionary groupby format is deprecated

DataFrame Join

  • pd.concat()

  • DetaFrame.merge()

  • DetaFrame.join()

  • DetaFrame.melt() - stack columns

Concatenate

  • Useful for merging data from two files or datasources

>>> dfs = [
...     pd.read_csv(file1),
...     pd.read_csv(file2),
...     pd.read_csv(file3),
... ]
>>>
>>> df = pd.concat(dfs)
>>> df = pd.concat([
...     pd.read_csv(file1),
...     pd.read_csv(file2),
...     pd.read_csv(file3),
... ])
>>>
>>> df
>>> from pathlib import Path
>>>
>>> files = Path('mydirectory').rglob('*.csv')
>>> dfs = map(pd.read_csv, files)
>>> df = pd.concat(dfs)

Merge

  • Merge DataFrame or named Series objects with a database-style join.

  • The join is done on columns or indexes.

  • If joining columns on columns, the DataFrame indexes will be ignored.

  • Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Join

  • Join columns of another DataFrame.

  • Join columns with other DataFrame either on index or on a key column.

  • Efficiently join multiple DataFrame objects by index at once by passing a list.

  • rfuffix - If two columns has the same name, add suffix to right

  • lfuffix - If two columns has the same name, add suffix to left

DataFrame Plotting

Plot kinds

  • line - Line Plot

  • bar - Vertical Bar Plot

  • barh - Horizontal Bar Plot

  • hist - Histogram

  • box - Boxplot

  • density, kde - Kernel Density Estimation Plot

  • area - Area Plot

  • pie - Pie Plot

  • scatter - Scatter Plot

  • hexbin - Hexbin Plot

Parameters

Line Plot

  • default

Vertical Bar Plot

Horizontal Bar Plot

Histogram

Boxplot

Kernel Density Estimation Plot

  • Also known as kind='kde' - Kernel Density Estimation

Area Plot

Pie Plot

Scatter Plot

Hexbin Plot

Scatter matrix

  • The in pandas version 0.22 plotting module has been moved from pandas.tools.plotting to pandas.plotting

  • As of version 0.19, the pandas.plotting library did not exist

Actinograms

Further Reading