5.3. DataFrame Index
DataFrame.index
pd.DataFrame(index=...)
Range Index
Index
Object Index
Datetime Index
Timedelta Index
Period Index
Interval Index
Categorical Index
Multi Index
5.3.1. Setup
>>> import pandas as pd
>>> import numpy as np
5.3.2. Deprecation
Note
Non-monotonic indexes require exact matches. If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes. [2]
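A minimal sketch of both behaviours, using a small illustrative Series with a sorted but sparse index (the variable name s is arbitrary):
>>> s = pd.Series([1, 2, 3], index=['a', 'c', 'e'], dtype='int64')
>>>
>>> s.index.is_monotonic_increasing
True
>>>
>>> s.loc['b':'d']  # 'b' and 'd' are not in the index, but the index is monotonic
c    2
dtype: int64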
Note
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. [2]
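For comparison, a short illustrative example (the Series below is made up for this note):
>>> s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'], dtype='int64')
>>>
>>> [10, 20, 30, 40][0:2]  # Python slicing excludes the stop position
[10, 20]
>>>
>>> s.loc['a':'c']  # label-based slicing includes both endpoints
a    10
b    20
c    30
dtype: int64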
Since pandas 2.0, an integer index is created as the generic Index class with an explicit dtype:
>>> pd.Index([1, 2, 3], dtype='int64')
Index([1, 2, 3], dtype='int64')
5.3.3. Range Index
>>> df = pd.DataFrame({
... 'A': [10, 11, 12],
... 'B': [20, 21, 22],
... 'C': [30, 31, 32]})
>>>
>>> df
A B C
0 10 20 30
1 11 21 31
2 12 22 32
>>>
>>> df.index
RangeIndex(start=0, stop=3, step=1)
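RangeIndex is the default index assigned when none is given, and it can also be constructed explicitly; a minimal sketch:
>>> pd.RangeIndex(start=0, stop=10, step=2)
RangeIndex(start=0, stop=10, step=2)
>>>
>>> pd.RangeIndex(start=0, stop=10, step=2).tolist()
[0, 2, 4, 6, 8]
Because it stores only start, stop and step, a RangeIndex is more memory efficient than a materialized integer index.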
5.3.4. Integer Index
Int64Index, UInt64Index and Float64Index have been deprecated in favor of the base Index class and were removed in pandas 2.0 [1]
>>> df = pd.DataFrame(
... data = np.arange(16).reshape(4,4),
... index = [99, 88, 77, 66],
... columns = ['A', 'B', 'C', 'D'])
>>>
>>> df
A B C D
99 0 1 2 3
88 4 5 6 7
77 8 9 10 11
66 12 13 14 15
>>>
>>> df.index
Index([99, 88, 77, 66], dtype='int64')
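With integer labels like these, .loc selects by label while .iloc selects by position; a short sketch reusing the df above:
>>> df.loc[99].equals(df.iloc[0])  # label 99 happens to be the first row
True
>>>
>>> df.loc[66].equals(df.iloc[-1])  # label 66 is the last row
True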
5.3.5. Object Index
>>> df = pd.DataFrame(
... data = np.arange(16).reshape(4,4),
... index = ['a', 'b', 'c', 'd'],
... columns = ['A', 'B', 'C', 'D'])
>>>
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>>
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
5.3.6. Datetime Index
>>> np.random.seed(0)
>>>
>>>
>>> df = pd.DataFrame(
... columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
... index = pd.date_range('1999-12-30', periods=7),
... data = np.random.randn(7, 4))
>>>
>>> df
Morning Noon Evening Midnight
1999-12-30 1.764052 0.400157 0.978738 2.240893
1999-12-31 1.867558 -0.977278 0.950088 -0.151357
2000-01-01 -0.103219 0.410599 0.144044 1.454274
2000-01-02 0.761038 0.121675 0.443863 0.333674
2000-01-03 1.494079 -0.205158 0.313068 -0.854096
2000-01-04 -2.552990 0.653619 0.864436 -0.742165
2000-01-05 2.269755 -1.454366 0.045759 -0.187184
>>>
>>> df.index
DatetimeIndex(['1999-12-30', '1999-12-31', '2000-01-01', '2000-01-02',
'2000-01-03', '2000-01-04', '2000-01-05'],
dtype='datetime64[ns]', freq='D')
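A DatetimeIndex additionally supports partial string indexing and inclusive date slicing; a short sketch continuing the df above (only shapes are shown to keep the output compact):
>>> df.loc['1999'].shape  # every row from the year 1999
(2, 4)
>>>
>>> df.loc['1999-12-31':'2000-01-02'].shape  # both endpoints included
(3, 4)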
5.3.7. Set Index
>>> df = pd.DataFrame([
... {'id': 1, 'firstname': 'Mark', 'lastname': 'Watney'},
... {'id': 2, 'firstname': 'Melissa', 'lastname': 'Lewis'},
... {'id': 3, 'firstname': 'Rick', 'lastname': 'Martinez'},
... {'id': 4, 'firstname': 'Alex', 'lastname': 'Vogel'},
... ])
>>>
>>> df
id firstname lastname
0 1 Mark Watney
1 2 Melissa Lewis
2 3 Rick Martinez
3 4 Alex Vogel
>>>
>>> df.set_index('id')
firstname lastname
id
1 Mark Watney
2 Melissa Lewis
3 Rick Martinez
4 Alex Vogel
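set_index() returns a new DataFrame (unless inplace=True is passed), and reset_index() moves the index back into a regular column; a quick check on the df above:
>>> indexed = df.set_index('id')
>>>
>>> indexed.reset_index().equals(df)  # reset_index() restores the original frame
True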
5.3.8. Use Case - 1
>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)
>>>
>>>
>>> def quantile25(column):
... return column.quantile(.25)
>>>
>>> def quantile50(column):
... return column.quantile(.50)
>>>
>>> def quantile75(column):
... return column.quantile(.75)
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)
>>>
>>> result = df.groupby(['month','item']).agg(
... duration_count=('duration', 'count'),
... duration_sum=('duration', 'sum'),
... duration_nunique=('duration', 'nunique'),
...
... duration_mean=('duration', 'mean'),
... duration_median=('duration', 'median'),
... duration_std=('duration', 'std'),
... duration_std2=('duration', lambda column: column.std().astype(int)),
...
... duration_min=('duration', 'min'),
... duration_q25=('duration', quantile25),
... duration_q50=('duration', quantile50),
... duration_q75=('duration', quantile75),
... duration_max=('duration', 'max'),
...
... when_first=('date', 'first'),
... when_last=('date', 'last'),
... )
>>>
>>> result
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month item
2014-11 call 107 25547.000 76 238.757009 48.000 387.128905 387 1.000 5.500 48.000 328.000 1940.000 2014-10-15 06:58:00 2014-12-11 19:01:00
data 29 998.441 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-10-15 06:58:00 2014-12-11 06:58:00
sms 94 94.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 call 79 13561.000 61 171.658228 55.000 324.731798 324 2.000 10.500 55.000 152.000 2120.000 2014-11-14 17:24:00 2014-12-14 19:54:00
data 30 1032.870 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-11-13 06:58:00 2014-12-12 06:58:00
sms 48 48.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 call 88 17070.000 70 193.977273 55.500 300.671661 300 2.000 15.500 55.500 273.500 1859.000 2014-12-15 20:03:00 2015-01-14 20:47:00
data 31 1067.299 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
sms 86 86.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 call 67 14416.000 63 215.164179 89.000 329.672914 329 1.000 30.000 89.000 241.000 1863.000 2015-01-15 10:36:00 2015-09-02 17:54:00
data 31 1067.299 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2015-01-13 06:58:00 2015-12-02 06:58:00
sms 39 39.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 call 47 21727.000 46 462.276596 107.000 1552.192218 1552 2.000 33.500 107.000 320.000 10528.000 2015-12-02 20:15:00 2015-04-03 12:29:00
data 29 998.441 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2015-02-13 06:58:00 2015-03-13 06:58:00
sms 25 25.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2015-02-19 18:46:00 2015-03-14 00:16:00
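Grouping by two keys produces a MultiIndex on the result; a quick look at its levels (output assumes the data above):
>>> result.index.names
FrozenList(['month', 'item'])
>>>
>>> result.index.nlevels
2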
Select a single group with a full label tuple:
>>> result.loc[('2015-01', 'call')]
duration_count 88
duration_sum 17070.0
duration_nunique 70
duration_mean 193.977273
duration_median 55.5
duration_std 300.671661
duration_std2 300
duration_min 2.0
duration_q25 15.5
duration_q50 55.5
duration_q75 273.5
duration_max 1859.0
when_first 2014-12-15 20:03:00
when_last 2015-01-14 20:47:00
Name: (2015-01, call), dtype: object
Partial indexing with only the first level:
>>> result.loc['2015-01']
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
item
call 88 17070.000 70 193.977273 55.500 300.671661 300 2.000 15.500 55.500 273.500 1859.000 2014-12-15 20:03:00 2015-01-14 20:47:00
data 31 1067.299 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
sms 86 86.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
>>> result.loc['2015-01'].transpose()
item call data sms
duration_count 88 31 86
duration_sum 17070.0 1067.299 86.0
duration_nunique 70 1 1
duration_mean 193.977273 34.429 1.0
duration_median 55.5 34.429 1.0
duration_std 300.671661 0.0 0.0
duration_std2 300 0 0
duration_min 2.0 34.429 1.0
duration_q25 15.5 34.429 1.0
duration_q50 55.5 34.429 1.0
duration_q75 273.5 34.429 1.0
duration_max 1859.0 34.429 1.0
when_first 2014-12-15 20:03:00 2014-12-13 06:58:00 2014-12-15 19:56:00
when_last 2015-01-14 20:47:00 2015-12-01 06:58:00 2015-01-14 23:36:00
Boolean mask built from an index level:
>>> sms = result.index.get_level_values('item') == 'sms'
>>> sms
array([False, False, True, False, False, True, False, False, True,
False, False, True, False, False, True])
>>>
>>> result[sms]
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month item
2014-11 sms 94 94.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 sms 48 48.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 sms 86 86.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 sms 39 39.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 sms 25 25.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-02-19 18:46:00 2015-03-14 00:16:00
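get_level_values() returns an ordinary Index, so any vectorized predicate works on it; for example, isin() selects several items at once (the shape assumes the data above):
>>> mask = result.index.get_level_values('item').isin(['sms', 'data'])
>>>
>>> result[mask].shape  # 5 months x 2 items, 14 aggregated columns
(10, 14)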
Cross-section:
>>> result.xs('sms', level='item')
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month
2014-11 94 94.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 48 48.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 86 86.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 39 39.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 25 25.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-02-19 18:46:00 2015-03-14 00:16:00
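By default xs() drops the level it selects on; passing drop_level=False keeps the full MultiIndex:
>>> result.xs('sms', level='item').index.nlevels
1
>>>
>>> result.xs('sms', level='item', drop_level=False).index.nlevels
2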
Slicer Object:
>>> result.loc[(slice(None), 'sms'), :]
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month item
2014-11 sms 94 94.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 sms 48 48.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 sms 86 86.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 sms 39 39.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 sms 25 25.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-02-19 18:46:00 2015-03-14 00:16:00
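pd.IndexSlice provides a more readable spelling of the same slicer object; a minimal sketch showing the equivalence:
>>> idx = pd.IndexSlice
>>>
>>> result.loc[idx[:, 'sms'], :].equals(result.loc[(slice(None), 'sms'), :])
True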