17.1. CSV About

  • CSV - Comma/Character Separated Values

  • No CSV formal standard, just a practice

  • Flat file (2D) without relations

  • Relations has to be flatten (serialization, additional columns, etc...)

  • Typically first line (header) represents column names

  • Rarely first line can also have a structure (nrows, ncols)

  • Internationalization: encoding

  • Localization: decimal separator, thousands separator, date format

  • Parameters: delimiter, quotechar, quoting, lineterminator, dialect

Example CSV file:

firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36

17.1.1. Header with Column Names

  • Typically first line (header) represents column names

File without header:

Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36

First line is a header:

firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36

17.1.2. Header with Structure

  • First line can also have a structure

  • Used in Machine Learning

  • nrows - number of rows

  • ncols - number of columns

  • nvalues - number of values

First line is a structure: number of rows (nrows) and columns (ncols):

6,3
Mark,Watney,staff
Melissa,Lewis,admins
Rick,Martinez,staff
Alex,Vogel,users
Beth,Johanssen,staff
Chris,Beck,staff

First line is a structure: number of rows (nrows) and features (nfeatures), followed by label_encoder values for label column:

6,2,users,staff,admins
Mark,Watney,1
Melissa,Lewis,2
Rick,Martinez,1
Alex,Vogel,0
Beth,Johanssen,1
Chris,Beck,1
>>> encoder = {
...     0: 'users',
...     1: 'staff',
...     2: 'admins',
... }

17.1.3. Delimiter

  • csv module expects delimeter to be 1-character in length

delimiter=',':

firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36

delimiter=';':

firstname;lastname;age
Mark;Watney;42
Melissa;Lewis;41
Rick;Martinez;40
Alex;Vogel;42
Beth;Johanssen;29
Chris;Beck;36

delimiter='\t':

firstname   lastname        age
Mark        Watney  42
Melissa     Lewis   41
Rick        Martinez        40
Alex        Vogel   42
Beth        Johanssen       29
Chris       Beck    36

delimiter=':':

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
mwatney:x:1000:1000:Mark Watney:/home/mwatney:/bin/bash
mlewis:x:1001:1001:Melissa Lewis:/home/mlewis:/bin/bash
rmartinez:x:1002:1002:Rick Martinez:/home/rmartinez:/bin/bash
avogel:x:1003:1003:Alex Vogel:/home/avogel:/bin/bash
bjohanssen:x:1004:1004:Beth Johanssen:/home/bjohanssen:/bin/bash
cbeck:x:1005:1005:Chris Beck:/home/cbeck:/bin/bash

delimiter='|':

| Firstname | Lastname  | Age |
|-----------|-----------|-----|
| Mark      | Watney    |  42 |
| Melissa   | Lewis     |  41 |
| Rick      | Martinez  |  40 |
| Alex      | Vogel     |  42 |
| Beth      | Johanssen |  29 |
| Chris     | Beck      |  36 |

17.1.4. Quotechar

  • " - quote char (best)

  • ' - apostrophe

quotechar='"':

"firstname","lastname","age"
"Mark","Watney","42"
"Melissa","Lewis","41"
"Rick","Martinez","40"
"Alex","Vogel","42"
"Beth","Johanssen","29"
"Chris","Beck","36"

quotechar="'":

'firstname','lastname','age'
'Mark','Watney','42'
'Melissa','Lewis','41'
'Rick','Martinez','40'
'Alex','Vogel','42'
'Beth','Johanssen','29'
'Chris','Beck','36'

quotechar='|':

|firstname|,|lastname|,|age|
|Mark|,|Watney|,|42|
|Melissa|,|Lewis|,|41|
|Rick|,|Martinez|,|40|
|Alex|,|Vogel|,|42|
|Beth|,|Johanssen|,|29|
|Chris|,|Beck|,|36|

quotechar='/':

/firstname/,/lastname/,/age/
/Mark/,/Watney/,/42/
/Melissa/,/Lewis/,/41/
/Rick/,/Martinez/,/40/
/Alex/,/Vogel/,/42/
/Beth/,/Johanssen/,/29/
/Chris/,/Beck/,/36/

17.1.5. Quoting

  • csv.QUOTE_ALL (safest)

  • csv.QUOTE_MINIMAL

  • csv.QUOTE_NONE

  • csv.QUOTE_NONNUMERIC

quoting=csv.QUOTE_ALL:

"firstname","lastname","age"
"Mark","Watney","42"
"Melissa","Lewis","41"
"Rick","Martinez","40"
"Alex","Vogel","42"
"Beth","Johanssen","29"
"Chris","Beck","36"

quoting=csv.QUOTE_MINIMAL:

firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36

quoting=csv.QUOTE_NONE:

firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36

quoting=csv.QUOTE_NONNUMERIC:

"firstname","lastname","age"
"Mark","Watney",42
"Melissa","Lewis",41
"Rick","Martinez",40
"Alex","Vogel",42
"Beth","Johanssen",29
"Chris","Beck",36

17.1.6. Lineterminator

  • \r\n - New line on Windows

  • \n - New line on *nix

  • *nix operating systems: Linux, macOS, BSD and other POSIX compliant OSes (excluding Windows)

17.1.7. Decimal Separator

  • 1.0 - Decimal point

  • 1,0 - Decimal comma

  • 0٫‎1 - Arabic decimal separator (Left to right)

  • More information: [1]

../../_images/l10n-decimal-separator.png
firstname,lastname,age
Mark,Watney,42.0
Melissa,Lewis,41.0
Rick,Martinez,40.0
Alex,Vogel,42.0
Beth,Johanssen,29.0
Chris,Beck,36.0
firstname,lastname,age
Mark;Watney;42,0
Melissa;Lewis;41,0
Rick;Martinez;40,0
Alex;Vogel;42,0
Beth;Johanssen;29,0
Chris;Beck;36,0

17.1.8. Thousands Separator

  • 1000000 - None

  • 1'000'000 - Apostrophe

  • 1 000 000 - Space, the internationally recommended thousands separator

  • 1.000.000 - Period, used in many non-English speaking countries

  • 1,000,000 - Comma, used in most English-speaking countries

17.1.9. Date and Time

>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'
>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'
>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'

17.1.10. Encoding

  • utf-8 - international standard (should be always used!)

  • iso-8859-1 - ISO standard for Western Europe and USA

  • iso-8859-2 - ISO standard for Central Europe (including Poland)

  • cp1250 or windows-1250 - Central European encoding on Windows

  • cp1251 or windows-1251 - Eastern European encoding on Windows

  • cp1252 or windows-1252 - Western European encoding on Windows

  • ASCII - ASCII characters only

with open(FILE, encoding='utf-8') as file:
    ...

17.1.11. Dialects

import csv

csv.list_dialects()
# ['excel', 'excel-tab', 'unix']
  • Microsoft Excel 2016-2020:

    • quoting=csv.QUOTE_MINIMAL

    • quotechar='"'

    • delimiter=',' or delimiter=';' depending on Windows locale decimal separator

    • lineterminator='\r\n'

    • encoding='...' - depends on Windows locale typically windows-*

  • Microsoft Excel macOS:

    • quoting=csv.QUOTE_MINIMAL

    • quotechar='"'

    • delimiter=','

    • lineterminator='\r\n'

    • encoding='utf-8'

  • Microsoft export options:

../../_images/csv-standard-dialects.png
$ file utf8.csv
utf8.csv: CSV text

$ cat utf8.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,zażółć gęślą jaźń
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file standard.csv
standard.csv: CSV text

$ cat standard.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file dos.csv
dos.csv: CSV text

$ cat dos.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file macintosh.csv
macintosh.csv: Non-ISO extended-ASCII text, with CR line terminators

$ cat macintosh.csv
,,"21,5",Some; Comment

17.1.12. Good Practices

Always specify:

  • delimiter=',' to csv.DictReader() object

  • quotechar='"' to csv.DictReader() object

  • quoting=csv.QUOTE_ALL to csv.DictReader() object

  • lineterminator='\n' to csv.DictReader() object

  • encoding='utf-8' to open() function (especially when working with Microsoft Excel)

17.1.13. References