The data analytics stack is changing. The proliferation of data sources and the growing complexity of pipelines have driven companies to look beyond traditional SaaS solutions. Here is a look at how Kubernetes is affecting data analytics workflows.

Workflow Orchestration

Airflow and similar orchestrators such as Prefect and Dagster (along with managed Airflow offerings like Astronomer) provide a workflow execution engine for data-intensive workloads. Usually, this means some combination of extracting, transforming, and moving data.

Kubernetes is a possible execution platform for many of these tools. It provides the API to launch and track the individual tasks in a DAG. Combine this with autoscaling or a smart scheduler, and many of these tasks can run more cheaply and more quickly.
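To make the idea of launching DAG tasks through the Kubernetes API a little more concrete, here is a minimal sketch in Python. It only builds the Job manifests and orders them by dependency; it does not talk to a cluster. The DAG, the task names, the image reference, and the helper names (job_manifest, plan) are all hypothetical and chosen for illustration. Real orchestrators handle this with far more machinery (retries, status tracking, logs).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: each task maps to the set of upstream tasks it depends on.
DAG = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def job_manifest(task: str, image: str = "example.com/etl:latest") -> dict:
    """Build a minimal batch/v1 Job manifest for one task.

    The image and args are placeholders, not taken from any real tool.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"dag-{task}", "labels": {"dag-task": task}},
        "spec": {
            "backoffLimit": 2,  # retry a failed task up to two times
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": task, "image": image, "args": ["run", task]}
                    ],
                }
            },
        },
    }


def plan(dag: dict) -> list:
    """Return one Job manifest per task, upstream tasks first."""
    order = TopologicalSorter(dag).static_order()
    return [job_manifest(task) for task in order]


if __name__ == "__main__":
    for manifest in plan(DAG):
        print(manifest["metadata"]["name"])
```

An orchestrator would submit each manifest to the cluster once its upstream Jobs have succeeded, then watch the Job status to decide what to run next. Because each task is its own Job, the cluster autoscaler can add or remove nodes as the DAG fans out and back in.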

For more DevOps-savvy teams, Kubernetes can also serve as a common deployment target.

Data Ingestion

At the top of the analytics funnel is the collection and storage of real-time data. Startups like PostHog provide a simple Kubernetes deployment for self-hosted analytics. RudderStack, an open-source Segment alternative, also deploys to Kubernetes. These kinds of products need a columnar database to query efficiently, so PostHog ships with a high-performance database called ClickHouse (whose maintainers have recently started a company around it).

ClickHouse has a battle-tested Kubernetes operator for scaling deployments up and down, maintained by a different company.

Extract & Load

Closely related to workflow orchestration is the process of extracting data from sources and loading it into a data warehouse like Snowflake. Think Zapier but more operational. The main problem here is ingesting data from many third-party APIs while maintaining data quality and surfacing API changes or breakages.

Historically, Fivetran has approached this by maintaining high-quality in-house connectors. Two startups have tried an open-source approach. Meltano, a GitLab spin-out, has focused on an open-source ecosystem of third-party connectors. Airbyte runs on Kubernetes and also leverages open-source connectors. Containers map one-to-one to connectors: each connector runs in its own container.
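As a rough illustration of the one-container-per-connector idea, the sketch below turns a list of connectors into one single-container Pod manifest each. This is not how Airbyte is implemented. The Connector class, the image references, and the naming scheme are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    name: str   # e.g. a source or destination connector
    image: str  # hypothetical container image for that connector


# Hypothetical connector catalog; names and images are placeholders.
CONNECTORS = [
    Connector("source-postgres", "example.com/connectors/source-postgres:0.1"),
    Connector("destination-snowflake", "example.com/connectors/destination-snowflake:0.1"),
]


def pod_manifest(connector: Connector, sync_id: str) -> dict:
    """Build a Pod manifest that runs exactly one connector container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{connector.name}-{sync_id}",
            "labels": {"connector": connector.name, "sync": sync_id},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": connector.name, "image": connector.image}],
        },
    }


def pods_for_sync(connectors: list, sync_id: str) -> list:
    """One Pod per connector, preserving the one-to-one mapping."""
    return [pod_manifest(c, sync_id) for c in connectors]


if __name__ == "__main__":
    for pod in pods_for_sync(CONNECTORS, "sync-001"):
        container = pod["spec"]["containers"][0]
        print(pod["metadata"]["name"], "->", container["image"])
```

Keeping each connector in its own image means a connector can be written in any language, versioned independently, and given its own resource limits, which suits a community-maintained connector ecosystem.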

Where Else?

I think that Kubernetes has the potential to change the data analytics stack beyond being a convenient deployment target, and that this is just the beginning of an interesting partnership between data engineers and DevOps engineers.