Data Wrangling in an Ever-Changing SaaS System

Data Engineering

A few years ago, I applied to lead a session at PyCon 2024. The topic centered on an issue I had spent countless hours working on at my job: the challenges of evolving from an on-prem IVR system to a SaaS-based platform and managing constantly shifting API contracts.

Moving to a SaaS IVR platform gave us new capabilities, but it also shifted our entire ingestion model. Instead of a single, well-understood database, we now depended exclusively on vendor APIs. Anyone who has worked with SaaS integrations knows these APIs evolve constantly—and not always predictably. This blog is the story of how those changes repeatedly broke our pipelines and how we eventually built a more resilient system.


From On-Prem Stability to SaaS Complexity

Our old on-prem setup gave corporate analysts direct read access to SQL servers. They could pull any metric they needed: disconnect data, average call time, call records—everything lived neatly in one place. The system wasn’t perfect, but it was reliable.

Then came the shift to a SaaS IVR platform.

The new system required all data to be extracted through APIs. After analyzing our old metrics versus the new system’s available endpoints, we realized it would take 13 different API calls to gather the data that previously lived in a single database. This was our first sign of what was coming.


Rebuilding Storage and ETL

Preparing storage for the new data wasn’t as simple as writing a few CREATE TABLE statements. In the corporate world, schema changes require DBA tickets and fully defined SQL upfront. I pulled the API models from Swagger and built table definitions based on the vendor’s “version 15” API.

The DBAs created the tables and set up ETL pipelines to ingest CSV files. We were a Python shop; the Data team was a .NET shop. CSV exchange became our common denominator.

Our ingestion engine began simply: a set of Python scripts triggered by Windows Task Scheduler on a domain jump box.

Week one—everything worked.
Week two—still fine.
Week three—everything broke.


Incident #1: Unexpected Extra Fields

The analysts reported missing data. Our logs showed successful authentication, a clean extraction, and the expected record counts. But the Data team said the CSV was malformed.

Opening the CSV revealed extra columns we had never mapped. The API had changed silently: while our config still referenced version 15, the vendor had pushed a new version and modified the old one, with no release note mentioning either change.

The business impact: reporting dashboards failed for the entire morning.

The fix: manually map fields using a dictionary, then reshape the data before writing CSV. Not elegant, but it worked.
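
A minimal sketch of that stopgap, with illustrative field names rather than the vendor's actual schema:

import csv

# Map whatever the API calls a field to the column name our tables expect.
# These names are illustrative, not the vendor's real ones.
FIELD_MAP = {
    "callId": "call_id",
    "callDuration": "duration",
    "agentName": "agent",
}

def reshape(api_record: dict) -> dict:
    # Keep only the fields we know about, renamed to our internal columns.
    return {ours: api_record[theirs] for theirs, ours in FIELD_MAP.items() if theirs in api_record}

def write_csv(records: list[dict], path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(FIELD_MAP.values()))
        writer.writeheader()
        writer.writerows(reshape(r) for r in records)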


Incident #2: Character Limits and Messy Text

A month later, another issue appeared. This time, we hit a field-length limit on a notes column. In the database, we had set a reasonable character limit—DBAs rarely allow “max everything” fields.

The culprit was a badly formatted copy-paste block from a document. Historically, comments were short; this one wasn’t.

We had two options:

  1. Hash the string

  2. Clean and normalize it

We chose the second, trimming whitespace and collapsing formatting issues before writing CSV.
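
The cleanup itself was nothing fancy. A sketch of the idea (the 500-character limit here is illustrative):

def clean_notes(raw: str, limit: int = 500) -> str:
    # Collapse newlines, tabs, and repeated spaces into single spaces,
    # then truncate to the column's character limit.
    collapsed = " ".join(raw.split())
    return collapsed[:limit]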


Incident #3: Another API Version Drop… and More Silent Changes

Another API version appeared, again with unexpected changes to previous versions. At this point, patching individual failures no longer made sense. We needed a layer between the API’s unpredictable structure and our internal schema.

This is where Pydantic became invaluable.


Why We Switched to Pydantic

After repeated breakages, we needed a formal data contract—something flexible enough to tolerate API drift but strict enough to protect our pipeline. Pydantic gave us that structure.


1. Automatic Field Mapping and Ignoring Extra Fields

Pydantic can automatically map JSON to typed models while ignoring unexpected fields. This alone solved one of our biggest recurring problems.

Example:

from pydantic import BaseModel

class CallRecord(BaseModel, extra="ignore"):
    call_id: int
    duration: float
    agent: str

If the API suddenly returns:

{
    "call_id": 123,
    "duration": 32.5,
    "agent": "John",
    "newField": "unexpected"
}

It still loads cleanly:

record = CallRecord(**data)
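
The examples here assume Pydantic v2, where the same behavior can also be declared with model_config instead of a class keyword; some teams find this easier to spot in review. A sketch of the equivalent model:

from pydantic import BaseModel, ConfigDict

class CallRecord(BaseModel):
    # Equivalent to passing extra="ignore" as a class keyword:
    # unknown keys in the payload are silently dropped.
    model_config = ConfigDict(extra="ignore")

    call_id: int
    duration: float
    agent: str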

2. Aliases for Field Name Changes

The vendor frequently renamed fields between versions—sometimes snake_case, sometimes camelCase. Aliases allowed us to keep stable internal names regardless of what the API returned.

Example:

from pydantic import BaseModel, Field

class CallRecord(BaseModel):
    call_id: int = Field(..., alias="callId")
    agent_name: str = Field(..., alias="agentName")

API response:

{
    "callId": 10,
    "agentName": "Maria"
}

Yet internally we always use:

record.call_id
record.agent_name
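
When the vendor flips between naming styles across versions, a single alias isn't enough. On Pydantic v2, validation_alias with AliasChoices lets one model accept either spelling; a sketch, assuming only these two variants appear:

from pydantic import AliasChoices, BaseModel, Field

class CallRecord(BaseModel):
    # Accept either the camelCase or snake_case spelling from the API,
    # but always expose the snake_case name internally.
    call_id: int = Field(validation_alias=AliasChoices("callId", "call_id"))
    agent_name: str = Field(validation_alias=AliasChoices("agentName", "agent_name"))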

3. Field Validators for Cleaning and Normalizing Data

Validators ensured that messy text, whitespace, or oversized fields were cleaned before our ETL ever saw them.

Example:

from pydantic import BaseModel, field_validator

class CallRecord(BaseModel):
    notes: str

    @field_validator("notes")
    def clean_notes(cls, v):
        cleaned = " ".join(v.split())
        return cleaned[:500]  # enforce DB limit

This made even chaotic copy-pasted comments predictable, safe, and consistent.


4. The Resilient Ingestion Loop

With these models in place, our ingestion script transformed from fragile to robust. Instead of crashing on bad data, we could filter it out and log it.

from pydantic import ValidationError
import logging

# ... setup logging ...

def ingest_call_records(raw_json_data):
    valid_records = []
    for item in raw_json_data:
        try:
            # The validation happens here!
            record = CallRecord(**item)
            valid_records.append(record)
        except ValidationError as e:
            logging.error(f"Skipping bad record ID {item.get('callId')}: {e}")

    return valid_records
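
From there, handing the data to the .NET team's ETL worked the same as before: dump the validated models to CSV. A sketch (the file name is illustrative):

import csv

def write_call_records_csv(records, path="call_records.csv"):
    if not records:
        return
    # model_dump() gives plain dicts with our internal field names,
    # already validated and cleaned by the model.
    rows = [r.model_dump() for r in records]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)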

Conclusion

Looking back, adopting Pydantic earlier would have saved us weeks of firefighting. Strong data models transformed our pipeline from a fragile set of scripts into a resilient, version-tolerant ingestion system. Working with SaaS APIs taught us a critical lesson:

When integrating with fast-moving external APIs, treat all incoming data as untrusted, validate aggressively, and maintain your own internal contract.

