wireservice / csvkit / Issues / #986
Issue created Sep 26, 2018 by Administrator (@root)

Improve dialect sniffing

Created by: akkartik

Back in June 2017, @petersonjr pointed out some test cases that cause csvkit/agate to misinterpret the dialect of files: https://github.com/wireservice/csvkit/issues/751#issuecomment-310803282

The reasonable workaround @jpmckinney pointed out was to pass --snifflimit 0 on the command line. While this helps, I'd like to point out a few things:

a) Those test cases work fine without the override to the Sniffer class introduced in https://github.com/wireservice/agate/commit/3b9ceea131ba143cc72b6d1a9f7871d059188b52

b) I dug into the default Sniffer in the Python standard library. It's not very robust; it uses regular expressions to parse each line, which means that it gets confused by, say, commas inside quoted strings.

c) While agate has a reasonable default sniff_limit of 0 (no sniffing), the command-line argument parsing in csvkit overrides it to None, which has the effect of sniffing the entire file. As the input file grows, it becomes increasingly likely that the sniffer will encounter something the Python Sniffer can't handle.
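
For anyone who wants to poke at (b) directly, here is a minimal sketch that calls the standard library's `csv.Sniffer` on a text sample and reports the delimiter it guesses. The helper name `sniff_delimiter` is made up for illustration and is not part of csvkit or agate; it only wraps `csv.Sniffer().sniff()`, which raises `csv.Error` when it cannot decide.

```python
import csv


def sniff_delimiter(sample):
    """Return the delimiter csv.Sniffer guesses for `sample`, or None.

    Hypothetical helper for experimentation; not csvkit/agate code.
    """
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        # The sniffer gives up with csv.Error when no delimiter is found.
        return None


if __name__ == "__main__":
    # A simple, consistent sample the sniffer handles.
    print(repr(sniff_delimiter("a,b,c\n1,2,3\n4,5,6\n")))
    # Paste a problematic sample (e.g. one from the linked comment) here
    # to see what the stock Sniffer makes of it.
```

Feeding it progressively larger prefixes of a problem file is one way to see at what point the guess changes.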

These observations indicate a few places in the stack where a change may be beneficial. I'm not sure if any of them is a good idea -- I may well be missing context as an outsider -- but thought I'd try to start a discussion. Is there a test suite for the custom Sniffer in agate? Is it an option to default --snifflimit to something smaller, say 1024? (Obviously changing the Python standard library is a bigger task. The maintainer for the csv module is also not active anymore.)
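
To make the smaller-default idea concrete, here is a rough sketch of sniffing only a bounded prefix and then parsing the whole input with the resulting dialect. It is not how csvkit or agate implement this; `read_rows` and its fallback to `csv.excel` are assumptions for illustration. The `sniff_limit` parameter mirrors the semantics described above: 0 skips sniffing, None sniffs everything, and a number sniffs that many leading characters.

```python
import csv
import io


def read_rows(text, sniff_limit=1024):
    """Parse CSV text, sniffing the dialect from a bounded prefix.

    Hypothetical helper; falls back to csv.excel (comma-delimited)
    when sniffing is disabled or the sniffer cannot decide.
    """
    if sniff_limit == 0:
        dialect = csv.excel  # no sniffing at all
    else:
        # None means "sniff everything"; otherwise only the prefix.
        sample = text if sniff_limit is None else text[:sniff_limit]
        try:
            dialect = csv.Sniffer().sniff(sample)
        except csv.Error:
            dialect = csv.excel
    return list(csv.reader(io.StringIO(text), dialect))


if __name__ == "__main__":
    data = "name;note\nx;1\ny;2\n"
    print(read_rows(data))                 # dialect sniffed from the prefix
    print(read_rows(data, sniff_limit=0))  # sniffing disabled
```

A bounded prefix keeps the sniffer's input small, so oddities deep in a large file cannot influence the guess; the trade-off is that a file whose first kilobyte is unrepresentative would still be misread.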
