The CI/CD process for deploying and managing a data platform shows up in multiple places. Starting from infrastructure, there should be no room for manually created servers, networks or any other resources: tools like Terraform or Ansible should manage all of it. This part is explained further on the Infrastructure as Code page. The second CI/CD layer relates to the application part, where all changes to containers running on a cluster should be tested and prepared in an automated way. The final place for CI/CD is within the data flows that are executed on an orchestrator like Prefect, Apache Airflow or Dagster. This part shouldn't allow manual changes, with everything hosted in a repository like GitHub or GitLab. Here we will focus on a typical CI/CD setup for the application part, and also on Data Flow Automation.
Application Deployment Automation
In this chapter, we’ll dive into the technical details of setting up a robust CI/CD pipeline using GitHub workflows, Dockerfiles, and Helm charts for deployments on a Kubernetes cluster. By combining these tools, we can streamline the process of building, testing, and deploying applications efficiently. We’ll walk through how GitHub workflows automate tasks, Dockerfiles enable consistent application environments, and Helm charts facilitate scalable Kubernetes deployments—ultimately creating a seamless pipeline from code commit to production-ready deployments. This setup is essential for maintaining agility, reliability, and scalability in modern software development practices.
- Set up GitHub repository
We should start by creating or configuring a GitHub repository for the project. In our scenario we will use a protected `main` branch. A good starting branch protection rule for `main` looks like this: `Require a pull request before merging` with `Require approvals: 1` turned on. With such a configuration we are sure that a review process is established for the repository. `Require status checks to pass before merging` with `Require branches to be up to date before merging` should also be enabled, and the CI job we will create in the next steps should be added under `Status checks that are required`. This way we can be sure that our automation passes against the latest main code and that there is no way to merge a pull request without it; by default, GitHub does not force workflows to pass before merging.
There are more options worth considering, but they should be applied according to the project's needs.
- Create GitHub CI Workflow - code validation
All GitHub Actions workflows should be stored in the `.github/workflows` folder, where we can define multiple workflows. For our CI/CD pipeline we can start with the following `on` clause:

```yaml
on:
  pull_request:
    branches: [ main ]
    paths:
      - "path_with_code_changes/**"
  workflow_dispatch:
```

This way the workflow will be executed every time a change appears within a pull request. It is also possible to run the workflow manually via `workflow_dispatch`, where we can add additional inputs such as `debug` in case we need more information about a failing workflow.
Inside, we can put basic CI checks depending on the technology in use; below is a simple example that parses a dbt project:
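A minimal sketch of such a validation workflow; the Python version, dbt adapter (`dbt-postgres`) and project path (`dbt_project/`) are assumptions and should be adjusted to the repository layout:

```yaml
name: CI

on:
  pull_request:
    branches: [ main ]
  workflow_dispatch:

jobs:
  validate_dbt_project:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        # adapter choice (postgres) is an assumption
        run: pip install dbt-core dbt-postgres
      - name: Install dbt packages
        # project location is an assumption
        run: dbt deps --project-dir dbt_project
      - name: Parse dbt project
        # fails fast on invalid SQL/Jinja or broken project config
        run: dbt parse --project-dir dbt_project --profiles-dir dbt_project
```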
Such a workflow can be saved in a `ci.yml` file, and the `validate_dbt_project` job can be added to `Status checks that are required`, as explained in the previous step. One thing to watch out for here: with a `paths` filter, `ci.yml` might not be executed for every change. To make the check truly mandatory, it's better to run such CI for every pull request.
- Docker image build & push GitHub CI workflow
The CD part can be prepared in a separate workflow file, with a basic setup focused on building and pushing a Docker image. The `on` clause can be set to trigger only on the `main` branch, or we can have a separate job that builds the Docker image on pull requests, with pushing enabled only on `main`. The better approach is to verify the build process already on `pull_request`, which is why the pipeline below is used for both `pull_request` and `push`:
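A sketch of this dual-trigger build, where the build runs on every pull request but pushing to the registry happens only on `main`; the image tag is an assumption (the next paragraph describes a branch-based tagging refinement):

```yaml
name: Docker

on:
  pull_request:
    branches: [ main ]
  push:
    branches: [ main ]

jobs:
  docker:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # needed to push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - name: Log in to ghcr
        if: github.event_name == 'push'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build (and push only on main)
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name == 'push' }}
          tags: ghcr.io/${{ github.repository }}:latest
```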
On push to `main`, the image is pushed to GitHub Container Registry (ghcr.io). To use an image in a non-prod environment before merging the changes, there is an option to modify the tagging of the images: use the branch name as the tag on pull requests, while using proper versioned tags for changes pushed to `main`.
- Helm Chart Workflow
Similarly to Docker images, we can handle the lint, package and push process for Helm charts. In a pull request we can use the chart-testing (`ct`) tool to validate the correctness of our YAML files, and once we are sure everything works as expected, a workflow executed on the `main` branch should package and push the Helm chart.
When reusing a Helm chart across multiple projects/repositories, it is good practice to keep the chart in a dedicated repository, with only the Helm values file located in each project repository. This is less convenient for highly customized Helm charts, where even a small change requires upgrading the chart. In most cases, however, modifying only the values is enough, with bigger chart changes driven more by security issues than by feature requests.
Below is a workflow for Helm lint, package and push, where lint is executed on both pull request and push, while package and push run only on the `main` branch. `ct` (chart-testing) is used to validate the syntax of the Helm charts; it can optionally verify additional things, such as that the chart version is incremented every time a change is introduced:
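A sketch of the lint part using the official chart-testing action; the chart directory `charts/` is an assumption:

```yaml
name: Helm CI

on:
  pull_request:
    branches: [ main ]
  push:
    branches: [ main ]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # ct compares against the target branch to find changed charts
      - uses: azure/setup-helm@v4
      - uses: helm/chart-testing-action@v2
      - name: Lint charts
        # chart directory is an assumption; ct also checks version bumps by default
        run: ct lint --target-branch main --chart-dirs charts
```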
For the Helm push we can introduce a mechanism that pushes only the charts that were modified; the `dorny/paths-filter` GitHub Action can help with that. It is not a mandatory step, though, so the code below simply assumes that the Helm package needs to be prepared and pushed. In addition, it is assumed that ChartMuseum is used to store all Helm charts and that the `helm-push-action` GitHub Action is used to push to it:
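The text mentions `helm-push-action`; as a simpler, action-free sketch, the same result can be achieved with ChartMuseum's HTTP upload API. The chart name/path and the `CHARTMUSEUM_*` secrets are assumptions:

```yaml
name: Helm Push

on:
  push:
    branches: [ main ]

jobs:
  package_and_push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Package chart
        # chart path is an assumption; produces my-app-<version>.tgz
        run: helm package charts/my-app
      - name: Push to ChartMuseum
        # ChartMuseum accepts chart archives via POST /api/charts
        run: |
          curl --fail -u "${{ secrets.CHARTMUSEUM_USER }}:${{ secrets.CHARTMUSEUM_PASS }}" \
            --data-binary "@$(ls my-app-*.tgz)" \
            "${{ secrets.CHARTMUSEUM_URL }}/api/charts"
```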
- CD preparation
Once the Docker image and the Helm chart with values are prepared, it's good to create a CD workflow that deploys the application onto the Kubernetes cluster. We assume the whole environment is not accessible from the internet in any way other than via VPN, so our GitHub runner should be a self-hosted runner on a virtual machine with access to it, ideally within the same VPC or connected to the environment through VPN. Additionally, that virtual machine should have access to the Kubernetes cluster configured through a proper RBAC policy, or any other option that limits its cluster access to deployments only.
Once such a GitHub runner is configured, it is possible to prepare an additional workflow that handles deployment of the application to the DEV and PROD environments. In the example below, SOPS is additionally configured, which enables keeping secrets encrypted in the repository. To decrypt the `secrets.yaml` files it is necessary to use e.g. AWS KMS, which is why an access key and token are provided in the last step:
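A sketch of such a dev deployment job using the `helm-secrets` plugin (which wraps SOPS decryption) on a self-hosted runner. The runner labels, release name, chart reference, namespace and values paths are all assumptions:

```yaml
name: Deploy

on:
  push:
    branches: [ main ]

jobs:
  dev_deploy:
    runs-on: [ self-hosted, dev ]   # runner label is an assumption
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Install helm-secrets plugin
        run: helm plugin install https://github.com/jkroepke/helm-secrets || true
      - name: Deploy to dev
        env:
          # AWS credentials allow SOPS to decrypt secrets.yaml via KMS
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          helm secrets upgrade --install my-app chartmuseum/my-app \
            -n my-namespace \
            -f values/dev/values.yaml \
            -f values/dev/secrets.yaml
```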
Once the dev deployment is successful, a manual approval process can be implemented. The example below requires a GitHub Enterprise licence (for private repositories), but there are free alternatives on the market.
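A minimal sketch of such an approval gate: a job bound to a GitHub Environment named `prod_approve`, which pauses the workflow until a required reviewer approves it:

```yaml
  prod_approve:
    needs: dev_deploy
    runs-on: ubuntu-latest
    # binding the job to this environment triggers the required-reviewers gate
    environment: prod_approve
    steps:
      - name: Approval gate
        run: echo "Production deployment approved"
```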
In addition, the environment needs to be created in the repository settings: `Settings > Environments > New environment`. After providing the environment name (`prod_approve` in the case above), make sure that the `Required reviewers` checkbox is enabled and that people/teams are listed. Once this is configured, only a manual approval from the listed people/teams will enable the prod deployment. One downside of this approach appears when we decide not to release a particular version to production: the workflow will stay in the `Waiting` state for 30 days, after which the people on the list will receive a notification that the workflow has timed out.
The code for the production deployment should look almost identical to the `dev_deploy` job above. The only differences should be the secrets and values files, and a dedicated production GitHub runner should be used.
With this configuration we have prepared a basic CI/CD workflow for our application. Improvements that can be introduced:
- introduce a GitOps approach using ArgoCD or a similar tool
- extend the testing phase during the CI process
- introduce advanced alerting in case of an issue during deployment; currently only the person involved in the process receives a notification from GitHub
- prepare a workflow for rollback in case of issues with a new version
- add more CI checks, such as scanning for hardcoded secrets in the repository (this can be handled with GitHub's code scanning feature) or an additional lint script for aligning with the company's coding standards
- add a release job that shows the most recently prepared version on GitHub
- make further use of GitHub's Environments feature, which makes it possible to check the history of deployments directly from GitHub
Data Flow Automation
TBD