17.1. CSV About
CSV - Comma/Character Separated Values
No CSV formal standard, just a practice
Flat file (2D) without relations
Relations has to be flatten (serialization, additional columns, etc...)
Typically first line (header) represents column names
Rarely first line can also have a structure (nrows, ncols)
Internationalization: encoding
Localization: decimal separator, thousands separator, date format
Parameters: delimiter, quotechar, quoting, lineterminator, dialect
Example CSV file:
firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36
17.1.1. Header with Column Names
Typically first line (header) represents column names
File without header:
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36
First line is a header:
firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36
17.1.2. Header with Structure
First line can also have a structure
Used in Machine Learning
nrows
- number of rowsncols
- number of columnsnvalues
- number of values
First line is a structure: number of rows (nrows) and columns (ncols):
6,3
Mark,Watney,staff
Melissa,Lewis,admins
Rick,Martinez,staff
Alex,Vogel,users
Beth,Johanssen,staff
Chris,Beck,staff
First line is a structure: number of rows (nrows) and features (nfeatures), followed by label_encoder values for label column:
6,2,users,staff,admins
Mark,Watney,1
Melissa,Lewis,2
Rick,Martinez,1
Alex,Vogel,0
Beth,Johanssen,1
Chris,Beck,1
>>> encoder = {
... 0: 'users',
... 1: 'staff',
... 2: 'admins',
... }
17.1.3. Delimiter
csv
module expects delimeter to be 1-character in length
delimiter=','
:
firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36
delimiter=';'
:
firstname;lastname;age
Mark;Watney;42
Melissa;Lewis;41
Rick;Martinez;40
Alex;Vogel;42
Beth;Johanssen;29
Chris;Beck;36
delimiter='\t'
:
firstname lastname age
Mark Watney 42
Melissa Lewis 41
Rick Martinez 40
Alex Vogel 42
Beth Johanssen 29
Chris Beck 36
delimiter=':'
:
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
mwatney:x:1000:1000:Mark Watney:/home/mwatney:/bin/bash
mlewis:x:1001:1001:Melissa Lewis:/home/mlewis:/bin/bash
rmartinez:x:1002:1002:Rick Martinez:/home/rmartinez:/bin/bash
avogel:x:1003:1003:Alex Vogel:/home/avogel:/bin/bash
bjohanssen:x:1004:1004:Beth Johanssen:/home/bjohanssen:/bin/bash
cbeck:x:1005:1005:Chris Beck:/home/cbeck:/bin/bash
delimiter='|'
:
| Firstname | Lastname | Age |
|-----------|-----------|-----|
| Mark | Watney | 42 |
| Melissa | Lewis | 41 |
| Rick | Martinez | 40 |
| Alex | Vogel | 42 |
| Beth | Johanssen | 29 |
| Chris | Beck | 36 |
17.1.4. Quotechar
"
- quote char (best)'
- apostrophe
quotechar='"'
:
"firstname","lastname","age"
"Mark","Watney","42"
"Melissa","Lewis","41"
"Rick","Martinez","40"
"Alex","Vogel","42"
"Beth","Johanssen","29"
"Chris","Beck","36"
quotechar="'"
:
'firstname','lastname','age'
'Mark','Watney','42'
'Melissa','Lewis','41'
'Rick','Martinez','40'
'Alex','Vogel','42'
'Beth','Johanssen','29'
'Chris','Beck','36'
quotechar='|'
:
|firstname|,|lastname|,|age|
|Mark|,|Watney|,|42|
|Melissa|,|Lewis|,|41|
|Rick|,|Martinez|,|40|
|Alex|,|Vogel|,|42|
|Beth|,|Johanssen|,|29|
|Chris|,|Beck|,|36|
quotechar='/'
:
/firstname/,/lastname/,/age/
/Mark/,/Watney/,/42/
/Melissa/,/Lewis/,/41/
/Rick/,/Martinez/,/40/
/Alex/,/Vogel/,/42/
/Beth/,/Johanssen/,/29/
/Chris/,/Beck/,/36/
17.1.5. Quoting
csv.QUOTE_ALL
(safest)csv.QUOTE_MINIMAL
csv.QUOTE_NONE
csv.QUOTE_NONNUMERIC
quoting=csv.QUOTE_ALL
:
"firstname","lastname","age"
"Mark","Watney","42"
"Melissa","Lewis","41"
"Rick","Martinez","40"
"Alex","Vogel","42"
"Beth","Johanssen","29"
"Chris","Beck","36"
quoting=csv.QUOTE_MINIMAL
:
firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36
quoting=csv.QUOTE_NONE
:
firstname,lastname,age
Mark,Watney,42
Melissa,Lewis,41
Rick,Martinez,40
Alex,Vogel,42
Beth,Johanssen,29
Chris,Beck,36
quoting=csv.QUOTE_NONNUMERIC
:
"firstname","lastname","age"
"Mark","Watney",42
"Melissa","Lewis",41
"Rick","Martinez",40
"Alex","Vogel",42
"Beth","Johanssen",29
"Chris","Beck",36
17.1.6. Lineterminator
\r\n
- New line on Windows\n
- New line on*nix
*nix
operating systems: Linux, macOS, BSD and other POSIX compliant OSes (excluding Windows)
17.1.7. Decimal Separator
1.0
- Decimal point1,0
- Decimal comma0٫1
- Arabic decimal separator (Left to right)More information: [1]
firstname,lastname,age
Mark,Watney,42.0
Melissa,Lewis,41.0
Rick,Martinez,40.0
Alex,Vogel,42.0
Beth,Johanssen,29.0
Chris,Beck,36.0
firstname,lastname,age
Mark;Watney;42,0
Melissa;Lewis;41,0
Rick;Martinez;40,0
Alex;Vogel;42,0
Beth;Johanssen;29,0
Chris;Beck;36,0
17.1.8. Thousands Separator
1000000
- None1'000'000
- Apostrophe1 000 000
- Space, the internationally recommended thousands separator1.000.000
- Period, used in many non-English speaking countries1,000,000
- Comma, used in most English-speaking countries
17.1.9. Date and Time
>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'
>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'
>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'
17.1.10. Encoding
utf-8
- international standard (should be always used!)iso-8859-1
- ISO standard for Western Europe and USAiso-8859-2
- ISO standard for Central Europe (including Poland)cp1250
orwindows-1250
- Central European encoding on Windowscp1251
orwindows-1251
- Eastern European encoding on Windowscp1252
orwindows-1252
- Western European encoding on WindowsASCII
- ASCII characters only
with open(FILE, encoding='utf-8') as file:
...
17.1.11. Dialects
import csv
csv.list_dialects()
# ['excel', 'excel-tab', 'unix']
Microsoft Excel 2016-2020:
quoting=csv.QUOTE_MINIMAL
quotechar='"'
delimiter=','
ordelimiter=';'
depending on Windows locale decimal separatorlineterminator='\r\n'
encoding='...'
- depends on Windows locale typicallywindows-*
Microsoft Excel macOS:
quoting=csv.QUOTE_MINIMAL
quotechar='"'
delimiter=','
lineterminator='\r\n'
encoding='utf-8'
Microsoft export options:
$ file utf8.csv
utf8.csv: CSV text
$ cat utf8.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,zażółć gęślą jaźń
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file standard.csv
standard.csv: CSV text
$ cat standard.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file dos.csv
dos.csv: CSV text
$ cat dos.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file macintosh.csv
macintosh.csv: Non-ISO extended-ASCII text, with CR line terminators
$ cat macintosh.csv
,,"21,5",Some; Comment
17.1.12. Good Practices
Always specify:
delimiter=','
tocsv.DictReader()
object
quotechar='"'
tocsv.DictReader()
object
quoting=csv.QUOTE_ALL
tocsv.DictReader()
object
lineterminator='\n'
tocsv.DictReader()
object
encoding='utf-8'
toopen()
function (especially when working with Microsoft Excel)