Migrating from DC/OS to Kubernetes: A Deep Dive into the Challenges and Opportunities
Back in 2016, when we started building a multitenant data platform for one of our customers, DC/OS (Datacenter Operating System) was the preferred choice in the big leagues of container orchestration technology.
DC/OS had a proven track record of running stateful workloads at scale, which made it a logical choice for large-scale deployments when building stateful data processing pipelines and platforms.
Kubernetes, an open-source project originating at Google, was initially considered suitable mostly for stateless workloads, like microservices. Over the years, though, Kubernetes emerged as the cool kid on the block, with a lively ecosystem and managed Kubernetes offerings from all major cloud providers.
When Mesosphere — the company behind DC/OS — rebranded to D2iQ in 2019 and began focusing on Kubernetes as its core product, it became evident that Kubernetes was in the lead. In 2021, DC/OS (as well as the underlying Apache Mesos) was marked as End of Life.
To make a (very) long story short: Kubernetes won the big-league container orchestration race. We tackled the technically challenging migration from DC/OS and gained valuable hands-on experience along the way; we even have an article to prove it. So buckle up and find out some of the things we learned.
The Platform: Context and Initial Setup
First, let’s talk about the platform: it’s designed for low-latency, real-time data processing and data sharing within an ecosystem of trusted parties. Some of them provide data sources, while others contribute their intellectual property (algorithms) to process and enrich the data.
From day one, the platform was designed to operate across both on-premises environments and multiple cloud providers. So initially, we implemented the platform using DC/OS for container orchestration, with Calico virtual networking for tenant isolation.
Platform resources are automatically provisioned and allocated to tenants, with resource quotas in place to guarantee fair resource distribution and prevent any tenant from hogging all resources.
Tenants have access to a web user interface that offers an overview of their deployed services, service status, and observability dashboards. An application catalog enables tenants to deploy various applications with a single click of a button, creating an optimal developer experience.
“There are so many interesting challenges when it comes to managing platforms, setting them up, and designing the workflows. It’s simply fascinating to think about and work on.” — Pieter, Lead SRE at Klarrio
The control plane is a separate stack to manage platform functions like environment management, logging, alerting, and identity management. In the event of a catastrophe, the entire environment can be rebuilt from the control plane with just a single command.
Through the control plane, platforms are continuously observed to ensure they remain stable in production. This includes performance monitoring, real-time alerting and security vulnerability management, which helps identify and mitigate potential threats before they can impact the system.
Key Technical Challenges
An exciting IT project poses plenty of interesting challenges and in that regard, this migration fully met our engineers’ expectations.
1. Preserving Platform Functionalities and Achieving 100% API Compatibility
One of our key challenges was to replace the underlying container orchestration technology while preserving all of the platform capabilities, ensuring there is no data loss and without impacting any of the tenant applications.
Our customers and tenants shouldn’t be able to notice the difference, meaning they retain the same user interface, keep the same functionalities, and have 100% API compatibility.
So how did we pull this off? The first thing that definitely made our lives easier was the fact that, back in the day, we had opted to build our own abstraction layer and define our own service definitions. This abstraction layer was designed to enforce security policies, rather than having tenants use the low-level Marathon API directly.
Thanks to this approach, we were able to map the abstraction layer API to the equivalent Kubernetes functionalities, and change the underpinnings entirely without our customers noticing.
2. Bridging the Gap Between Existing Tooling, Our Methodology, and Further Modernization
Nowadays, declarative management and GitOps are household terms, and modern projects are set up that way. In 2016, we already explored those ideas, but there were no off-the-shelf solutions that met the requirements. Therefore, we built those tools in-house.
Furthermore, we wanted to fully immerse ourselves in Kubernetes’ philosophy and its native features as much as possible, while also making sure to keep thinking outside the box to meet our customer requirements.
Upgrading Each Piece of the Puzzle
We initially had a custom scheduler for IaC (Infrastructure as Code) and configuration management. This in-house tool integrated Terraform for provisioning infrastructure and Ansible for managing software configurations and deployments.
With the DC/OS migration, we took the opportunity to leverage more up-to-date solutions, such as Flux and CI/CD pipelines integrated with GitHub Actions.
This shift allowed us to streamline the infrastructure and deployment processes, aligning with the industry’s best practices in orchestration and automation.
The control plane was set up in an older Kubernetes environment that didn’t use Flux. To modernize, we aligned it with our new Terraform and Flux codebase.
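As a rough illustration of that GitOps wiring (the repository URL, branch, and paths are placeholders, not our actual layout), a Flux-managed cluster can reconcile its configuration from Git roughly like this:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config                       # hypothetical repository name
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/platform-config   # placeholder URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./clusters/production                 # placeholder path within the repository
  prune: true                                 # remove resources that disappear from Git
```

Anything merged into the repository is picked up and applied automatically, which is exactly the behavior the custom scheduler used to provide for us.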
Redeploying Clusters as an Upgrade Strategy
Kubernetes operates on the principle of continuous uptime, where rolling updates ensure that systems are always available. However, there are interesting use cases for completely shutting down a cluster, such as rare, complex upgrade scenarios or disaster recovery as part of business continuity management.
This presents challenges since Kubernetes isn’t inherently designed for complete cluster shutdowns and restarts, leading to complexities when trying to recover a full state after such an event.
“Kubernetes doesn’t really support shutting down and restarting a cluster while preserving its configuration and state. We leverage Velero to make this possible.” — Tom, Software Architect at Klarrio
Velero is essential in our approach because it backs up essential resources, such as disk volumes and object storage buckets, before shutting down the cluster.
The key here is to restore all relevant states before the operators are initialized, ensuring the system resumes operations as intended. This approach ensures that existing resources are reused rather than new ones being created, allowing the system to recover as though it had never been down.
That way, all the assets we already had (disks, object storage buckets) continue to work.
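As a rough illustration of that backup step (the name, scope, and retention below are placeholders, not our actual configuration), a Velero backup that snapshots persistent volumes before a planned shutdown can look like this:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: pre-shutdown-backup       # hypothetical backup name
  namespace: velero
spec:
  includedNamespaces:
    - "*"                         # back up resources from all namespaces
  snapshotVolumes: true           # take snapshots of the persistent volumes
  ttl: 720h                       # keep the backup around for 30 days
```

After the cluster is rebuilt, restoring from that backup (for example with `velero restore create --from-backup pre-shutdown-backup`) brings the relevant state back before the operators start reconciling.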
3. Buckling Up with Object Storage Buckets
One of the resources the platform offers to the tenants is the allocation of object storage buckets. Therefore, during the migration, we had to ensure these were available on the new Kubernetes platform as well.
Making the object storage buckets accessible from the Kubernetes cluster involved significant changes for the buckets used by the platform services. Previously, on DC/OS, we managed static access keys ourselves.
In contrast, Kubernetes supports creating short-lived credentials, which is much safer. We implemented this for the object storage buckets of the platform services by creating them with Terraform, and accessing them using these short-lived credentials. However, this approach requires ensuring that the right buckets have the correct credentials.
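On EKS, for instance, this pattern typically relies on IAM Roles for Service Accounts (IRSA): the platform service’s Kubernetes service account is annotated with an IAM role, and the AWS SDK picks up short-lived credentials for it automatically. A minimal sketch, with a hypothetical service, namespace, and role ARN rather than our real ones:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingest-service                  # hypothetical platform service
  namespace: platform
  annotations:
    # IAM role granting access to exactly this service's buckets (placeholder ARN)
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ingest-service-buckets
```

Pods running under this service account receive temporary credentials instead of long-lived access keys; AKS offers a comparable mechanism through workload identity.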
For tenant object storage buckets, nothing has changed yet because Kubernetes doesn’t currently provide an API for creating and managing S3 buckets — they’re still working on their Container Object Storage Interface (COSI) API. We continue to use our own abstraction layer to create and access these buckets as before. In the long term, when Kubernetes provides native support, we can switch over.
4. Managing Configuration Complexity and Customization
Before transitioning to Kubernetes, we worked with Ansible in DC/OS to manage configurations. Ansible was run offline using custom-built tooling. Each time we rolled out a release, all configurations were applied at once.
Managing configurations in Kubernetes became more complex. Every action within Kubernetes requires defining separate resource objects; each component and its dependencies must be defined individually. Given the need for customization across different environments, templating and managing these configurations became a complex puzzle.
We carefully evaluated tools like Helm and Kustomize, which offer templating and customization capabilities for Kubernetes configurations.
Specifically, we build Helm configuration charts and use them as input for other Helm charts. When installing out-of-the-box components, you need to provide configuration values. The problem is that these values are often platform-specific — you need to use certain URLs or settings unique to your cluster environment. These input values cannot be easily templated directly.
To address this, we employ a two-stage approach, illustrated with a simplified sketch below:
- Generate Configuration Values with a Helm Chart: We first create a Helm chart dedicated to generating the necessary configuration values. This allows us to use templating effectively to account for environment-specific variations.
- Use Generated Configurations as Input: The output from this initial Helm chart — the templated configuration file — is then used as input for the other Helm charts that install the actual components.
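To make the first stage concrete, here is a minimal sketch of what such a values-generating chart template can look like. The chart name, keys, and URLs are illustrative assumptions, not our actual configuration:

```yaml
# templates/component-values.yaml in a hypothetical "platform-values" chart.
# "environment" and "baseDomain" are this chart's own per-environment values.
ingress:
  hostname: "dashboard.{{ .Values.environment }}.{{ .Values.baseDomain }}"
oidc:
  issuerUrl: "https://auth.{{ .Values.environment }}.{{ .Values.baseDomain }}/realms/platform"
```

The rendered output of this chart (obtained with `helm template`, for example) is then passed as a values file, via `-f`, when installing the off-the-shelf component’s chart for that environment.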
5. Dealing with the Quirks
Managed Clusters on AWS and Azure
In DC/OS, we managed and operated the entire cluster ourselves. With Kubernetes, we opted to leverage the managed Kubernetes services offered by the major cloud providers. We encountered some notable differences between these offerings.
In terms of networking and tenant isolation, while we continue to use Calico as before, the underlying networking layer in Kubernetes required a complete rebuild. The way Calico rules are defined has been entirely restructured.
Each cloud platform controls its own networking layer, leading to situations where something worked on one platform but behaved unexpectedly on the other. For example, internal routing posed some challenges, especially around how traffic is handled between external endpoints and the cluster.
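To give an idea of the kind of rules involved, tenant isolation conceptually boils down to policies like the standard Kubernetes NetworkPolicy below, which Calico enforces; the names are hypothetical, and the real rules use Calico’s richer policy model:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation          # hypothetical policy name
  namespace: tenant-bar           # hypothetical tenant namespace
spec:
  podSelector: {}                 # applies to all pods in the tenant namespace
  policyTypes:
    - Ingress
  ingress:
    # Only allow traffic from pods within the same tenant namespace
    - from:
        - podSelector: {}
```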
Also, the methods of installing new software and rolling out nodes differ across platforms. For example, deployment processes for instance type updates in AWS’s EKS and Azure’s AKS clusters are inherently different. These differences required additional testing and development to ensure smooth functionality across both environments.
Internal DNS Names
Internal DNS names in Kubernetes were different from the ones we had on DC/OS. On DC/OS, a service like foo in tenant bar had a DNS name like foo.bar.marathon.mesos. However, in Kubernetes, the DNS name follows a different pattern, such as foo.bar-default.svc.cluster.local.
Since much of the tenant code relies on those predictable DNS patterns, we came up with some DNS rewriting tricks to ensure compatibility. This was made possible thanks to CoreDNS, a crucial part of Kubernetes that allowed us to implement these DNS changes effectively and retain 100% API compatibility, even with changing service names.
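A simplified sketch of the idea, using CoreDNS’s rewrite plugin; the rule and server block below are illustrative, not our production Corefile:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        # Rewrite legacy DC/OS-style names to their Kubernetes equivalents,
        # e.g. foo.bar.marathon.mesos -> foo.bar-default.svc.cluster.local
        rewrite name regex (.*)\.(.*)\.marathon\.mesos {1}.{2}-default.svc.cluster.local answer auto
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
    }
```

With a rule along these lines, tenant code can keep resolving the old DC/OS-style names while the answers point at the new Kubernetes services.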
Retaining the Same URLs and IP Addresses
Another interesting thinking exercise: many IoT devices, often spread out across the field, had whitelist entries for the IPs and URLs associated with the platform, so one of the requirements was to maintain the same endpoints and IP addresses.
The main challenge in retaining the same IP addresses and URLs was the fact that the entire cluster was automated. Starting it from scratch would have automatically created new load balancers and IP addresses, so we had to ensure that the existing ones were reused, not replaced.
Since these resources can only be used in one place at a time, we had to release, transfer, and import them into Kubernetes after turning off DC/OS.
For URLs, we used an AWS Route 53 or Azure DNS entry pointing to a load balancer. Previously, it directed traffic to DC/OS, so we just updated it to point to the new load balancer. For IPs, we used AWS Elastic IPs or Azure public IP addresses, which we had to retain and attach to the correct load balancer.
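As an illustration of that reuse on the AWS side (with a hypothetical service name and placeholder allocation IDs), an existing Elastic IP can be pinned to the newly created network load balancer through Service annotations; on Azure, the Service can similarly be pointed at a pre-existing public IP:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: platform-ingress                  # hypothetical ingress service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    # Reuse the Elastic IPs that field devices already have whitelisted (placeholder IDs)
    service.beta.kubernetes.io/aws-load-balancer-eip-allocations: eipalloc-0abc123def456,eipalloc-0fed654cba321
spec:
  type: LoadBalancer
  selector:
    app: platform-ingress
  ports:
    - port: 443
      targetPort: 8443
```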
6. To Snapshot or Transfer Disks, That’s the Question
In our quest to ensure minimal disruption and optimal restoration during the migration, we had to choose between taking snapshots or transferring disks.
We encountered performance issues with restoring incremental snapshots. The snapshot and restoration process went smoothly, and the volumes could be used immediately, but there was a catch: the volumes operated in an optimization state for a period of time (up to hours), which impacted performance.
Transferring the Kafka Data Volumes
The performance was significantly impacted due to what we assumed was the underlying copy-on-write mechanism, where the first time each byte is accessed, it gets copied from the snapshot to the physical disk. The Kafka clusters couldn’t keep up with this.
“We were dealing with terabytes of Kafka data. Extended downtime meant increased delays in syncing the data, so minimizing downtime was critical” — Tom, Software Architect at Klarrio
As the snapshot-restore process was too slow for such large volumes, we opted for another solution: we transferred the actual Kafka disks instead of taking snapshots. So we automated the process of detaching the disks (with their unique IDs) from DC/OS and attaching them directly to the Kubernetes cluster.
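Conceptually, once a disk has been detached from DC/OS and attached to the new cluster’s account, making it usable comes down to statically provisioning a PersistentVolume that points at the existing cloud volume, plus a claim that binds to it. A minimal sketch with placeholder names, sizes, and volume ID (the real process was fully automated, and driver details differ per cloud):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kafka-broker-0-data               # hypothetical PV for one Kafka broker
spec:
  capacity:
    storage: 2Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # never delete the underlying disk
  storageClassName: ""                     # no dynamic provisioning for this volume
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0    # ID of the disk detached from DC/OS (placeholder)
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-kafka-0                       # name the Kafka StatefulSet expects (assumption)
  namespace: kafka
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""
  volumeName: kafka-broker-0-data          # bind explicitly to the pre-created PV
  resources:
    requests:
      storage: 2Ti
```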
Safer Snapshots
For migrating tenant volumes, we chose to use snapshots. Although slower, they were the safest choice for tenant data as we couldn’t directly manage their data. If something went wrong, we could always restore the original cluster.
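On the Kubernetes side, restoring a tenant volume from a snapshot can be sketched roughly as below, using the CSI snapshot API; names and sizes are hypothetical, and in practice the snapshots may equally be restored at the cloud-provider level before being imported:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tenant-data-restored               # hypothetical tenant volume
  namespace: tenant-bar
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: tenant-data-snapshot             # snapshot taken from the DC/OS volume (placeholder)
```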
Decisive Days: Migrating the Platforms
We were up for some heavy lifting: the migration involved 6 platforms across different cloud providers, and the production platform handled over 50 TB of data ingress and egress daily.
We had to migrate 548 applications, 338 vhosts, 66 volumes, 63 tenants, 35 buckets, 26 Kafka proxies, 3 databases, and 2 Flink clusters. We achieved the main production migration within a maintenance window of four and a half hours.
Preparing for the Migration
In the early stages of research and testing, everything was done manually. Over time, we automated the execution and validation of each step of the migration process, ensuring precision and eliminating the risk of human error.
“Just because it worked once doesn’t mean it will work every time. It needs to be done hundreds of times and automated before we can be reasonably sure to perform a production upgrade successfully.” — Dominique, Lead Architect at Klarrio
All the insights gathered during testing were consolidated into a detailed 76-step runbook, which was refined through repeated trials. Our preparation process focused on several key aspects:
- Local environment setup: Preparing local development and SRE environments to support a seamless migration.
- Source platform configuration: Running essential Ansible playbooks and ensuring all dependencies were correctly configured.
- Target platform preparation: Setting up the Kubernetes cluster on the target platform, configuring tools like kubectl, and validating Terraform and other critical configurations.
- Pre-migration testing: Conducting volume mount tests and verifying tenant volumes and databases by fetching data from the source platform.
- Volume mount test: Testing volume mounts to confirm compatibility across platforms.
In addition to the technical preparations, we notified tenants well in advance about the migration and provided them with best practices and tools to test whether their services were fully functional on Kubernetes before the migration.
Executing the Migration
We chose to migrate the internal development and validation platforms first, in case we encountered some unexpected issues.
Pre-migration steps on the big day:
- Ensure all replication factors are correct and the platforms are stable and ready for migration.
- Ensure any remaining platform setups, including monitoring and Kafka configurations, are completed.
Migration execution:
- Shut down applications on the source platform and update the public status.
- Execute the migration scripts, including database and volume migration, ensuring backups are taken and volumes are transferred to the target platform.
- Run jobs for creating and migrating Kubernetes manifests, including secrets, volumes, and namespaces.
- Monitor the migration process, verifying that Kafka and Zookeeper services are properly transferred and restarted.
Post-migration verification:
- Validate that services and applications are running correctly on the target platform.
- Conduct tests to ensure data integrity, performance, and system stability.
Our Advice to Fellow IT Adventurers
If you have a big upcoming migration, here’s some humble advice we can offer:
1. Carefully Consider the Lift-and-Shift Approach or Strike the Right Balance
Lift-and-shift is often viewed as the simplest migration path, but it’s a double-edged sword. In our case, we had to ensure backward compatibility with the customer’s tenants, who were managing their own business logic. This made lift-and-shift on a tenant level the only option, allowing us to preserve compatibility without disrupting tenant operations.
However, this approach can also lock you into legacy paradigms. For components that weren’t publicly exposed through an API, we took the opportunity to modernize and embrace Kubernetes-native features. If you aren’t bound by compatibility requirements, take a step back and identify areas where refactoring would allow you to fully leverage Kubernetes.
2. Fully Subscribe to the Kubernetes Ecosystem
Fully embracing Kubernetes requires adopting cloud native principles, from infrastructure management to deployment practices. This includes embracing microservices, automated deployment methods, and relying on Kubernetes-native tools instead of reinventing the wheel with custom-built solutions. Kubernetes has covered many scenarios through its ecosystem, and using established tools improves integration with the platform’s ecosystem.
3. Prepare for YAML Management Complexity
Kubernetes’ reliance on YAML configurations can become overwhelming. The right tool for managing these configurations will depend on the size and complexity of your platform. Investing in templating and automation tools early in the migration process is crucial to avoid configuration sprawl.
4. Adapt the Abstraction Layer or Rethink It
If your platform includes an abstraction layer, it can give you flexibility in handling migrations. While the top layer — exposed to tenants — remained consistent, we radically refactored the underlying systems to be more Kubernetes-native. An abstraction layer lets you “lift-and-shift” the public-facing side while refactoring the core systems for more efficiency. This approach worked well for us, but each decision depends on how much control you have over the components of your system.
5. Test Extensively (and Beyond) Across Environments
Consistency across environments is essential, especially in multicloud setups. However, testing should go beyond just “extensive.” In our case, we ran hundreds of test migrations and continually refined our approach, solving complexities along the way.
It’s only through this process that we learned how to handle the massive data transfers, DNS rewrites, and complex integrations needed for a smooth transition. The key lesson is that successfully migrating once doesn’t guarantee success every time. Automate everything and test until you can confidently run a production upgrade.
6. Disaster Recovery Testing Is Non-Negotiable
Frequent disaster recovery testing should be part of any operation. The longer a platform remains up, the more likely it is that circular dependencies or manual workarounds have been introduced. If you don’t regularly test full platform restarts, you risk running into issues that can only be resolved through manual interventions. Ensure the platform can be recovered without relying on ad-hoc fixes, and conduct disaster recovery drills as a routine.
Conclusion
Migrating from DC/OS to Kubernetes is a complex adventure that requires rethinking architecture, adopting new tools, and navigating the complexities of cloud native operations.
Rather than a straightforward transition, it’s an opportunity to evaluate legacy practices, adopt Kubernetes-native tools, and drive architectural improvements. While the process is challenging, the long-term benefits of running on Kubernetes — greater flexibility, scalability, and access to a robust ecosystem — make it a worthwhile investment.
If you’re considering a similar migration, the key takeaway is to approach it with a strategic mindset, embracing the opportunity to refactor and modernize, rather than simply replicating existing workflows in a new environment.
About Klarrio
At Klarrio, we design cloud native, cloud agnostic software solutions to empower our customers to control their data, limit cloud costs, and optimize performance. We ensure flexibility for scalable platform building across various cloud and on-premises infrastructures, prioritizing privacy, security, and resilience by design.
We are platform pioneers at heart, with a proven track record in building self-service data platforms, Internal Developer Platforms, log aggregation platforms, and other innovative software solutions across various domains: from Telecom, Transportation & Logistics, Manufacturing, the Public Sector, and Healthcare to Entertainment.
Beyond technology, we actively collaborate and share knowledge, both in-house and together with our customers. True impact is achieved together.