2.2. Read CSV

  • File paths works also with URLs

2.2.1. SetUp

>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

2.2.3. Content

>>> DATA = 'https://python3.info/_static/iris-clean.csv'
>>>
>>> df = pd.read_csv(DATA)
>>> df.head(3)
   sepal_length  sepal_width  petal_length  petal_width     species
0           5.4          3.9           1.3          0.4      setosa
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor

2.2.4. Rename Columns

>>> DATA = 'https://python3.info/_static/iris-dirty.csv'
>>>
>>> COLUMNS =  ['sepal_length', 'sepal_width',
...             'petal_length', 'petal_width', 'species']
>>>
>>> df = pd.read_csv(DATA)
>>> df.head(3)
   150    4  setosa  versicolor  virginica
0  5.4  3.9     1.3         0.4          0
1  5.9  3.0     5.1         1.8          2
2  6.0  3.4     4.5         1.6          1
>>>
>>> df = pd.read_csv(DATA, skiprows=1, names=COLUMNS)
>>> df.head(3)
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.4          3.9           1.3          0.4        0
1           5.9          3.0           5.1          1.8        2
2           6.0          3.4           4.5          1.6        1
>>>
>>> df['species'].replace({
...     0: 'setosa',
...     1: 'versicolor',
...     2: 'virginica',
... }, inplace=True)
>>>
>>> df.head(n=3)
   sepal_length  sepal_width  petal_length  petal_width     species
0           5.4          3.9           1.3          0.4      setosa
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor

2.2.5. Compressed

  • If the extension is .gz, .bz2, .zip, and .xz, the corresponding compression method is automatically selected

>>> df = pd.read_csv('sample_file.zip', compression='zip')  
>>> df = pd.read_csv('sample_file.gz', compression='infer')  

2.2.6. Use Case - 0x01

>>> DATA = 'https://python3.info/_static/iris-dirty.csv'
>>>
>>> COLUMNS =  ['sepal_length', 'sepal_width',
...             'petal_length', 'petal_width', 'species']
>>> header = pd.read_csv(DATA, nrows=0)
>>> nrows, ncols, *class_labels = header.columns
>>> label_encoder = dict(enumerate(class_labels))
>>>
>>> label_encoder
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
>>> df = pd.read_csv(DATA, names=COLUMNS, skiprows=1)
>>> df['species'].replace(label_encoder, inplace=True)
>>> df.head(n=5)
   sepal_length  sepal_width  petal_length  petal_width     species
0           5.4          3.9           1.3          0.4      setosa
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor
3           7.3          2.9           6.3          1.8   virginica
4           5.6          2.5           3.9          1.1  versicolor

2.2.7. Use Case - 0x02

>>> DATA = 'https://python3.info/_static/martian-en.csv'
>>> pd.read_csv(DATA)
  firstname   lastname         birthdate  gender          ssn                email               phone
0      Mark     Watney   October 12 1994    male  94101212345     mwatney@nasa.gov   +1 (234) 555-0000
1   Melissa      Lewis      July 15 1995  female  95071512345      mlewis@nasa.gov   +1 (234) 555-0001
2      Rick   Martinez   January 21 1996    male  96012112345   rmartinez@nasa.gov   +1 (234) 555-0010
3      Alex      Vogel  November 15 1994    male  94111512345       avogel@esa.int  +49 (234) 555-0011
4      Beth  Johanssen        May 9 2006  female   6250912345  bjohanssen@nasa.gov   +1 (234) 555-0100
5     Chris       Beck     August 2 1999    male  99080212345       cbeck@nasa.gov   +1 (234) 555-0101
>>> pd.read_csv(DATA, parse_dates=['birthdate'])
  firstname   lastname  birthdate  gender          ssn                email               phone
0      Mark     Watney 1994-10-12    male  94101212345     mwatney@nasa.gov   +1 (234) 555-0000
1   Melissa      Lewis 1995-07-15  female  95071512345      mlewis@nasa.gov   +1 (234) 555-0001
2      Rick   Martinez 1996-01-21    male  96012112345   rmartinez@nasa.gov   +1 (234) 555-0010
3      Alex      Vogel 1994-11-15    male  94111512345       avogel@esa.int  +49 (234) 555-0011
4      Beth  Johanssen 2006-05-09  female   6250912345  bjohanssen@nasa.gov   +1 (234) 555-0100
5     Chris       Beck 1999-08-02    male  99080212345       cbeck@nasa.gov   +1 (234) 555-0101

2.2.8. Assignments

Code 2.44. Solution
"""
* Assignment: Pandas Read CSV Dates
* Complexity: easy
* Lines of code: 1 lines
* Time: 3 min

English:
    1. Read data from `DATA` to `result: pd.DataFrame`
    2. Parse dates in "birthdate" column
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
    2. Sparsuj daty w kolumnie "birthdate"
    3. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `parse_dates`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result[['firstname', 'lastname', 'birthdate']]
      firstname   lastname  birthdate
    0      Mark     Watney 1994-10-12
    1   Melissa      Lewis 1995-07-15
    2      Rick   Martinez 1996-01-21
    3      Alex      Vogel 1994-11-15
    4      Beth  Johanssen 2006-05-09
    5     Chris       Beck 1999-08-02
"""

import pandas as pd


DATA = 'https://python3.info/_static/martian-en.csv'


# Read DATA and parse dates in "birthdate" column
# type: pd.DataFrame
result = ...


Code 2.45. Solution
"""
* Assignment: Pandas Read CSV Replace
* Complexity: easy
* Lines of code: 5 lines
* Time: 8 min

English:
    1. Read data from `DATA` to `result: pd.DataFrame`
    2. Use provided column names in `COLUMNS`
    3. Read labels from the first row
    4. Replace data in `label` column with values extracted above
    5. Define `result: pd.DataFrame` with 25 first rows
    6. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
    2. Użyj podanych w `COLUMNS` nazw kolumn
    3. Wczytaj nazwy labeli z pierwszego wiersza
    4. Podmień dane w kolumnie `label` na wartości wyciągnięte powyżej
    5. Zdefiniuj `result: pd.DataFrame` z 25 pierwszymi wierszami
    6. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `class_labels = pd.read_csv(DATA, nrows=0).columns[2:]`
    * `label_encoder = dict(enumerate(class_labels))`
    * `pd.Series.replace()`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'
    >>> assert len(result) == 25, \
    'Select only 25 first rows'

    >>> result.loc[[0,1,2,3,4,5], ['mean radius', 'mean texture', 'label']]
       mean radius  mean texture      label
    0        17.99         10.38  malignant
    1        20.57         17.77  malignant
    2        19.69         21.25  malignant
    3        11.42         20.38  malignant
    4        20.29         14.34  malignant
    5        12.45         15.70  malignant

    >>> result['label'].value_counts()
    label
    malignant    22
    benign        3
    Name: count, dtype: int64
"""

import pandas as pd


DATA = 'https://python3.info/_static/breast-cancer.csv'

COLUMNS = [
    'mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error',
    'fractal dimension error', 'worst radius', 'worst texture',
    'worst perimeter', 'worst area', 'worst smoothness',
    'worst compactness', 'worst concavity', 'worst concave points',
    'worst symmetry', 'worst fractal dimension', 'label',
]


# Read DATA, substitute column names, and labels, select 25 rows
# type: pd.DataFrame
result = ...