CSV Parsing: Handling Quotes, Commas, and Encoding Edge Cases
Troubleshoot common CSV parsing failures. Covers quoted fields with embedded commas, multiline values, BOM issues, and encoding mismatches that cause data corruption in spreadsheets and import tools.
Key Takeaways
- CSV (Comma-Separated Values) appears trivially simple but hides surprising complexity.
- Excel on Windows relies on a UTF-8 BOM (byte order mark, EF BB BF) to detect UTF-8 when a CSV file is opened directly.
- Never split on commas directly — use a proper CSV parser that handles quoting, escaping, and multiline values.
- Most programming languages and Unix tools do not add a BOM by default — you must add it explicitly for Excel compatibility.
CSV Is Not Simple
CSV (Comma-Separated Values) appears trivially simple but hides surprising complexity. RFC 4180 documents a common format, but it is an informational memo rather than a binding standard, and many CSV files do not conform to it: they use different delimiters, quoting rules, or line endings. Parsing tools that assume well-formed CSV can silently produce wrong results on real-world data.
Common Parsing Failures
| Problem | Cause | Solution |
|---|---|---|
| Fields shifted right | Unquoted field contains comma | Quote fields with commas |
| Truncated fields | Unquoted field contains newline | Quote multiline fields |
| Extra quotes visible | Double-quote escaping not applied | Write each literal " as "" inside a quoted field |
| Garbled characters | UTF-8 file opened as Latin-1 | Specify encoding explicitly |
| Leading zeros dropped | Excel interprets the value as a number | Write the value as ="00123" or import the column as text |
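The quoting rules in the table can be checked with a round trip. This is a minimal sketch using Python's standard `csv` module; the helper names `write_rows` and `read_rows` and the sample data are made up for illustration.

```python
import csv
import io

def write_rows(rows):
    """Serialize rows to CSV text; QUOTE_MINIMAL quotes only fields that need it."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerows(rows)
    return buf.getvalue()

def read_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))

# Illustrative data: an embedded comma, embedded quotes, and a multiline value.
rows = [
    ["id", "name", "note"],
    ["1", "Smith, Jane", 'said "hi"'],
    ["2", "Lee", "line one\nline two"],
]

text = write_rows(rows)
print(text)

# The parser recovers exactly what was written.
assert read_rows(text) == rows
```

The writer wraps `Smith, Jane` in double quotes, doubles the quotes around `hi`, and quotes the multiline note, so none of the failures listed in the table occur on the way back in.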
The BOM Problem
When a CSV file is opened directly, Excel on Windows relies on a UTF-8 BOM (byte order mark, bytes EF BB BF) to detect UTF-8 encoding. Without it, Excel falls back to the system locale encoding (for example Windows-1252), which corrupts non-ASCII characters. Most programming languages and Unix tools do not write a BOM by default, so you must add it explicitly for Excel compatibility.
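One way to add the BOM is Python's `utf-8-sig` codec, which writes EF BB BF at the start of the file and strips it again on read. The sketch below assumes that approach; the function names, the file name `export.csv`, and the sample rows are hypothetical.

```python
import codecs
import csv
import os
import tempfile

def write_csv_for_excel(path, rows):
    # "utf-8-sig" prepends the BOM; newline="" leaves line endings to the csv module.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)

def read_csv_utf8(path):
    # "utf-8-sig" drops a leading BOM if present and also reads BOM-less UTF-8.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return list(csv.reader(f))

# Illustrative usage with a temporary file.
path = os.path.join(tempfile.mkdtemp(), "export.csv")
rows = [["name", "city"], ["Zoë", "München"]]
write_csv_for_excel(path, rows)

with open(path, "rb") as f:
    first_bytes = f.read(3)

assert first_bytes == codecs.BOM_UTF8  # EF BB BF
assert read_csv_utf8(path) == rows
```

Reading with `utf-8-sig` matters as well: opening a BOM-prefixed file as plain `utf-8` leaves an invisible U+FEFF character glued to the first header name.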
Delimiter Detection
Not all CSV files use commas. Files from locales where the comma is the decimal separator, which covers much of Europe, often use semicolons; TSV uses tabs, and some files use pipes. When receiving unknown CSV files, detect the delimiter by counting how often each candidate separator appears in the first few lines and preferring the one whose count is the same on every line.
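The counting heuristic might look like the sketch below. The function name `guess_delimiter`, the candidate list, and the scoring rule are assumptions chosen for illustration, and the heuristic ignores quoting, so a delimiter inside a quoted field can skew the counts. Python's standard library also ships `csv.Sniffer` for the same job.

```python
def guess_delimiter(lines, candidates=(",", ";", "\t", "|")):
    """Return the most plausible delimiter for the sample lines, or None."""
    sample = [ln for ln in lines if ln.strip()]  # skip blank lines
    best = None
    best_key = (False, 0)
    for delim in candidates:
        counts = [ln.count(delim) for ln in sample]
        if not counts or min(counts) == 0:
            continue  # a real delimiter should appear on every row
        # Prefer a count that is identical on every line, then a higher count.
        key = (len(set(counts)) == 1, min(counts))
        if key > best_key:
            best, best_key = delim, key
    return best

# Illustrative sample: semicolon-delimited with decimal commas.
european = ["name;price;qty", "Widget;3,50;2", "Gadget;12,00;10"]
print(repr(guess_delimiter(european)))
```

In the sample, the comma appears in the data rows but not in the header, so it is rejected, while the semicolon appears exactly twice on every line.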
Robust Parsing Strategy
Never split on commas directly. Use a proper CSV parser that handles quoting, escaping, and multiline values. In JavaScript, PapaParse handles these edge cases correctly. Parse and validate CSV files with the Peasy CSV tools for instant error detection and format correction.
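To see why splitting on commas fails, compare it with a real parser on a single quoted record. This sketch uses Python's standard `csv` module; `parse_strict` is a hypothetical helper that adds a simple field-count check, not part of any library.

```python
import csv
import io

line = '42,"Doe, John","He said ""ok"""'

# Naive approach: breaks the quoted name in two and keeps the raw quotes.
naive = line.split(",")

# Proper parser: honors quoting and unescapes the doubled quotes.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)

def parse_strict(text):
    """Parse CSV text and reject records whose field count differs from the header."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if rows:
        width = len(rows[0])
        for number, row in enumerate(rows[1:], start=2):
            if len(row) != width:
                raise ValueError(
                    f"record {number} has {len(row)} fields, expected {width}"
                )
    return rows
```

The naive split yields four pieces for a three-field record, while the parser returns `Doe, John` and `He said "ok"` intact. The width check in `parse_strict` is a cheap way to catch shifted or broken records early instead of importing them silently.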