Converting Data Between Formats Without Losing Your Mind
A practical guide to moving data between CSV, JSON, YAML, and XML — the real pipeline for migrations and integrations.
You've been there. The database export gives you CSV. The new API wants JSON. The deployment config needs YAML. The legacy system only speaks XML. And somehow, you're the one who has to make all of them talk to each other.
Data format conversion sounds simple until you actually do it. Here's how to build a real pipeline that doesn't fall apart.
The Scenario: A Real Migration
Let's say you're migrating user data from an old system to a new one. Here's what you're working with:
- Source: CSV export from the old database (10,000 rows, 15 columns)
- API layer: The new system accepts JSON via REST API
- Config files: Deployment settings in YAML
- Legacy integration: A partner system that still requires XML feeds
Four formats. One dataset. Let's go.
Step 1: CSV to JSON — From Export to API-Ready
Your CSV looks like this:
id,name,email,plan,created
1,Alice,alice@example.com,pro,2024-03-15
2,Bob,bob@example.com,free,2024-06-22
Convert it to JSON and you get structured, nested data your API can consume:
[
  {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "plan": "pro",
    "created": "2024-03-15"
  }
]
Watch out for: quoted fields with commas inside them, empty values that should become null rather than empty strings, and date formats that vary across rows. These are the bugs that show up at row 4,738 when you thought it was working.
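Here's a minimal sketch of that conversion using Python's standard library. The helper name and sample rows are illustrative, not part of any real export; the point is that `csv.DictReader` already handles quoted commas, and the empty-cell-to-null decision is made explicitly rather than by accident:

```python
import csv
import io
import json

def csv_to_records(text):
    """Parse CSV text into a list of dicts, mapping empty cells to None (not "")."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: (value if value != "" else None) for key, value in row.items()}
            for row in reader]

sample = (
    "id,name,email,plan,created\n"
    "1,Alice,alice@example.com,pro,2024-03-15\n"
    '2,"Bob, Jr.",bob@example.com,,2024-06-22\n'
)
records = csv_to_records(sample)
# The csv module parses the quoted comma in "Bob, Jr." correctly,
# and Bob's empty plan cell comes out as None, not "".
# Note that every value is still a string - type fixes come later.
print(json.dumps(records, indent=2))
```

Notice that `id` stays `"1"` here, a string. That's deliberate: deciding which columns are really numbers is a transformation step, not a conversion step.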
Step 2: JSON to YAML — For Configuration Files
Your deployment needs YAML config. The data structure is the same — you just need a different syntax:
users:
  - id: 1
    name: Alice
    email: alice@example.com
    plan: pro
    created: "2024-03-15"
YAML is more readable for config files, which is why Kubernetes, Docker Compose, and CI/CD pipelines all use it. But it's whitespace-sensitive, so one wrong indent and your deployment breaks at 2 AM.
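In a real pipeline you'd hand this to a YAML library (PyYAML is the usual choice in Python); to keep the sketch dependency-free, here's a stdlib-only emitter for exactly the flat structure above. The function name is made up for illustration:

```python
def to_yaml(users):
    """Emit a flat list of user dicts as the YAML shown above.

    Sketch only - a real pipeline should use a YAML library, which
    also handles escaping, deep nesting, and multi-line strings.
    """
    lines = ["users:"]
    for user in users:
        first = True
        for key, value in user.items():
            prefix = "  - " if first else "    "
            if key == "created":
                # Quote dates: unquoted, some YAML parsers read them as timestamps
                value = f'"{value}"'
            lines.append(f"{prefix}{key}: {value}")
            first = False
    return "\n".join(lines)

users = [{"id": 1, "name": "Alice", "email": "alice@example.com",
          "plan": "pro", "created": "2024-03-15"}]
print(to_yaml(users))
```

The quoting detail is the kind of thing a library gets right for free: `2024-03-15` unquoted is a YAML date, which may deserialize as a timestamp object instead of a string.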
Step 3: JSON to XML — For Legacy Systems
That partner system from 2015 needs XML. No, they won't upgrade their API. Yes, you still have to support it.
<users>
  <user>
    <id>1</id>
    <name>Alice</name>
    <email>alice@example.com</email>
    <plan>pro</plan>
    <created>2024-03-15</created>
  </user>
</users>
The conversion is straightforward, but XML has quirks — attributes vs. elements, namespace requirements, schema validation. The XML converter handles the structural transformation; you handle the business logic.
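A sketch of that structural transformation with Python's built-in `xml.etree.ElementTree` (the function name and sample data are illustrative):

```python
import xml.etree.ElementTree as ET

def users_to_xml(users):
    """Build a <users><user>...</user></users> tree from a list of dicts."""
    root = ET.Element("users")
    for record in users:
        user = ET.SubElement(root, "user")
        for key, value in record.items():
            child = ET.SubElement(user, key)
            # ElementTree escapes <, >, and & in text for you
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")

users = [{"id": 1, "name": "Alice", "email": "alice@example.com",
          "plan": "pro", "created": "2024-03-15"}]
print(users_to_xml(users))
```

This covers the simple element-per-field case. The business-logic parts the partner's schema dictates - which fields become attributes, which namespace prefixes apply - are exactly the parts you still have to write by hand.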
Step 4: Validate Everything with JSON Formatter
At every stage, format and validate your JSON. One misplaced bracket, one trailing comma, one unescaped quote — and your import silently drops records or crashes entirely.
Before sending any converted data to an API, format it and visually scan the structure. Five seconds of formatting saves five hours of debugging "why are 200 records missing."
Common Gotchas in Data Migration
Encoding issues: CSV from Excel on Windows might be Windows-1252, not UTF-8. Your Japanese customer names turn into question marks. Always check encoding first.
Type coercion: CSV treats everything as strings. The number 001 in CSV becomes 1 in JSON. Zip codes, phone numbers, and ID fields with leading zeros are the usual victims.
Nested data: CSV is flat. JSON is nested. You'll need to decide how address_street, address_city, address_zip in CSV maps to an address object in JSON.
Null vs empty: Is an empty CSV cell null, "", or should the key be omitted entirely? Define this before you start, not when you find the bug.
The Pipeline Pattern
For any data migration, the workflow is:
- Export from source system (usually CSV or SQL dump)
- Convert to your working format (CSV to JSON)
- Transform the structure (rename fields, nest objects, fix types)
- Validate the output (format and check)
- Convert to target format (YAML or XML)
- Import into the destination system
The tools handle steps 2, 4, and 5. Steps 1, 3, and 6 are your code. Separating "conversion" from "transformation" keeps things manageable.
Data migrations are never as simple as they look in the planning doc. But having reliable conversion tools at each step means you spend your time on the real problems — business logic, edge cases, and data quality — instead of fighting with format syntax.