Skip to content
GitLab
Projects Groups Snippets
  • /
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
  • C csvkit
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 61
    • Issues 61
    • List
    • Boards
    • Service Desk
    • Milestones
  • Merge requests 4
    • Merge requests 4
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Packages and registries
    • Packages and registries
    • Package Registry
    • Infrastructure Registry
  • Monitor
    • Monitor
    • Incidents
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Repository
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • wireservice
  • csvkit
  • Issues
  • #1007
Closed
Open
Issue created Dec 06, 2018 by Administrator@rootContributor

Detect and remove UTF-8 BOM

Created by: doekman

The bug is illustrated with files from this zip

example.csv is a CSV file, saved with UTF-8 with BOM marker and LF linefeeds (the webapp highrise uses this format).

When running ./setup.sh the file is imported into the user-database, and everything seems fine:

csvsql 1.0.2
Python 2.7.15
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G3025
NOTICE:  table "example" does not exist, skipping
DROP TABLE

However, when running ./test.sh, the second command fails (and the first and third succeed):

----[ select * ]--------------------------------------------
   FullName   |        Email        
--------------+---------------------
 Test User    | [email protected]
 Another User | [email protected]
(2 rows)

----[ select FullName, Email ]------------------------------
ERROR:  column "fullname" does not exist
LINE 1: SELECT FullName, Email from example;
               ^
HINT:  Perhaps you meant to reference the column "example.FullName".
----[ select [BOM-marker]FullName, Email ]------------------
   FullName   |        Email        
--------------+---------------------
 Test User    | [email protected]
 Another User | [email protected]
(2 rows)

This is because the first column name is prefixed with the BOM marker, and is demonstrated by the third command. When you copy/paste the output into an editor that shows control characters (TextMate), you see the name of the first column is actually `U&"\FEFFFullName"

When the header columns in the CSV file are double-quoted ("FullName","Email"), the behaviour is different but also undesirable, but I haven't checked that.

The error message could be nicer (non-printable characters could be shown escaped as I did in the third example), but I'm not sure which software part that would have to do...

From csvkit, I expect the UTF-BOM marker to be detected and removed.

Assignee
Assign to
Time tracking