1.8. About Types
1.8.1. SetUp
>>> import pandas as pd
1.8.2. Series
1-dimensional data structure similar to
ndarray
Has index
Can have name
>>> pd.Series([1.0, 2.0, 3.0, 4.0])
0 1.0
1 2.0
2 3.0
3 4.0
dtype: float64
1.8.3. DataFrame
2-dimensional object
All columns share the same index
List of
Series
Each column must have name
Operations can be executed on columns or rows
>>> pd.DataFrame({
... 'A': ['a', 'b', 'c', 'd'],
... 'B': [11, 22, 33, 44],
... 'C': [1.1, 2.2, 3.3, 4.4],
... })
A B C
0 a 11 1.1
1 b 22 2.2
2 c 33 3.3
3 d 44 4.4
1.8.4. SparseArray
Data where a single value is repeated many times (e.g.
0
orNaN
) may be stored efficiently as aSparseArray
Sparse data with Series:
>>> pd.arrays.SparseArray([1, None, None, None, 3])
[1.0, nan, nan, nan, 3.0]
Fill: nan
IntIndex
Indices: array([0, 4], dtype=int32)
Sparse data with DataFrame:
>>> df = pd.DataFrame({
... 'A': [1, 2, 3],
... 'B': pd.arrays.SparseArray([None, None, None])})
>>>
>>> df
A B
0 1 NaN
1 2 NaN
2 3 NaN
>>>
>>> df.dtypes
A int64
B Sparse[object, nan]
dtype: object
1.8.5. Interval
Definition:
>>> pd.Interval(0, 5)
Interval(0, 5, closed='right')
>>> pd.Interval(left=0, right=5)
Interval(0, 5, closed='right')
>>> pd.Interval(left=0, right=5, closed='both')
Interval(0, 5, closed='both')
Contains:
>>> interval = pd.Interval(0, 5, closed='left')
>>>
>>> 2.5 in interval
True
>>>
>>> 5.0 in interval
False
Interval between Timestamps:
>>> year_1970 = pd.Interval(left=pd.Timestamp('1970-01-01 00:00:00'),
... right=pd.Timestamp('1971-01-01 00:00:00'),
... closed='left')
>>>
>>> apollo11 = pd.Timestamp('1969-07-16')
>>> apollo13 = pd.Timestamp('1970-04-11')
>>>
>>> apollo11 in year_1970
False
>>>
>>> apollo13 in year_1970
True
>>>
>>> year_1970.length
Timedelta('365 days 00:00:00')
>>> fiscalyear2020 = pd.Interval(
... left=pd.Timestamp('2020-01-01'),
... right=pd.Timestamp('2021-01-01'),
... closed='left')
>>>
>>> fiscalyear2021 = pd.Interval(
... left=pd.Timestamp('2021-01-01'),
... right=pd.Timestamp('2022-01-01'),
... closed='left')
>>>
>>>
>>> event1 = pd.Timestamp('2020-04-12')
>>> event2 = pd.Timestamp('2021-07-21')
>>>
>>> event1 in fiscalyear2020
True
>>> event1 in fiscalyear2021
False
>>> event2 in fiscalyear2020
False
>>> event2 in fiscalyear2021
True
1.8.6. Categorical
Limited, fixed set of values
>>> iris = pd.Categorical(['setosa', 'virginica', 'versicolor'])
>>>
>>> iris
['setosa', 'virginica', 'versicolor']
Categories (3, object): ['setosa', 'versicolor', 'virginica']
>>>
>>> 'arctica' in iris
False
>>> status = pd.Categorical(['todo', 'done', 'todo', 'done'])
>>>
>>> status
['todo', 'done', 'todo', 'done']
Categories (2, object): ['done', 'todo']
>>>
>>> 'in progress' in status
False
>>>
>>> 'todo' in status
True
>>>
>>> status.categories
Index(['done', 'todo'], dtype='object')
>>> moon_landings = pd.Categorical(['apollo11', 'apollo12', 'apollo14',
... 'apollo15', 'apollo16', 'apollo17'])
>>>
>>> moon_landings
['apollo11', 'apollo12', 'apollo14', 'apollo15', 'apollo16', 'apollo17']
Categories (6, object): ['apollo11', 'apollo12', 'apollo14', 'apollo15', 'apollo16', 'apollo17']
>>>
>>> 'apollo11' in moon_landings
True
>>>
>>> 'apollo13' in moon_landings
False
>>>
>>> moon_landings.categories
Index(['apollo11', 'apollo12', 'apollo14', 'apollo15', 'apollo16', 'apollo17'], dtype='object')