In this blog post we will go back 10 years to understand how OpenStack started at CERN. In Part 1 we explored the motivation, the first installation, the prototypes, the deployment challenges and the initial production architecture. Now, we will dive into the evolution of the service offering of the CERN cloud infrastructure over the last years, and how we scaled from 2 cells and a few hundred compute nodes to 75 cells with more than 8000 compute nodes.
Service Growth and Architecture
From the beginning the CERN cloud infrastructure has been very popular within the Organization. As the users started to migrate all their applications to run on top of the cloud infrastructure the number of virtual machines increased very rapidly. In just a few months we were managing several thousand virtual machines running production workloads. At the same time the cloud operations team was busy converting the existing physical servers in the data centre into OpenStack compute nodes.
Only 4 months after moving the cloud infrastructure into production we were managing more than 1000 Nova compute nodes. This rapid growth continued during the next months.
At the beginning we designed the infrastructure with only 2 Nova cells (one cell in the Geneva data centre and the other in the Hungary data centre). These were cells v1, and there was very little support and experience available in the community at that time.
One question that I get asked frequently is why we configured Nova cells instead of having a region per data centre. The main reason was that we wanted our users to move their workloads to the new infrastructure as easily as possible. Regions would have been another source of friction for all the users that weren't that enthusiastic about moving from dedicated physical servers (their pets!) to a shared cloud environment (remember, this was 7 years ago!).
After a few months converting the physical servers already in the data centre to Nova compute nodes and adding them to the Geneva cell we noticed the first flaws in our architecture design. We had a mixture of different hardware types, different retirement cycles and different network segments in the same cell. Also, we had a “zoo” of different workloads running in these compute nodes. They ranged from IT services to scientific data-processing applications to remote desktops. As an example, projects like “CMS Data Acquisition”, “Batch“, “IT Windows Terminal Service” and “Personal Belmiro” were sharing the same compute nodes. This was extremely challenging because of all the different use cases, service expectations and conflicting requirements from the users.
Another problem was that the Geneva cell was approaching 2000 compute nodes. Scaling up was becoming challenging with all the database connections from the compute nodes (this was before running nova-conductor!) and all the pressure on the RabbitMQ cluster. Also, we realised that we had too many eggs in one basket: any problem affecting the control plane of this large cell would affect almost the entire infrastructure.
Of course the control plane was set up with high availability in mind. Ironically, however, that very setup was one of the main causes of our infrastructure problems.
We had a RabbitMQ cluster with mirrored queues per cell. Unfortunately, RabbitMQ issues were constant, for 2 main reasons:
- The amount of compute nodes connected to the clusters;
- Network partitions.
In order to support a large number of compute nodes connected to the same RabbitMQ cluster, we needed to fine-tune the RabbitMQ configuration options. Today, there are several sources of information that can help with this task (for example the OpenStack Large Scale SIG), but 7 years ago it was very challenging and a continuous trial-and-error exercise in the production infrastructure, because it was very difficult to replicate the load in a QA environment.
Old RabbitMQ versions also didn't handle network partitions very well, leaving the cluster unusable in most cases. Whenever the RabbitMQ cluster suffered a network partition, the infrastructure was heavily affected. And increasing the number of RabbitMQ nodes in the cluster to deal with the growing number of compute nodes would only increase the exposure to network partitions.
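To give an idea of the kind of tuning involved, these are the relevant knobs in a modern sysctl-style rabbitmq.conf (option names are from the RabbitMQ documentation; the values are illustrative, not our production settings):

```ini
# rabbitmq.conf - illustrative settings for clusters exposed to
# network partitions (values are examples, not CERN's configuration)

# How the cluster reacts to a partition; 'ignore' was the old default
# and usually left the cluster unusable after a split.
cluster_partition_handling = autoheal

# Raise the Erlang "net tick" interval so that transient network
# hiccups are less likely to be interpreted as a partition.
net_ticktime = 120
```

`pause_minority` is the other common choice for `cluster_partition_handling`; which one is appropriate depends on whether availability or consistency matters more for the workload.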
The OpenStack projects' databases were also configured on top of a highly available solution. Again, they were not immune to network issues, and recovery was very challenging.
Finally, we were running the distributed control plane on physical servers. This meant that, to be truly distributed and fault tolerant, we needed several physical servers to spread and replicate the different OpenStack projects. But the servers available were expensive machines bought to run compute-intensive workloads, not the far less CPU-demanding OpenStack services.
After the lessons of the initial months running the OpenStack infrastructure in production, we decided it was time to upgrade the architecture to be more resilient and to reduce the control plane footprint.
The changes introduced continue to be the foundation of our architecture today:
- Have more but smaller cells;
- Run all the control plane in VMs on top of the infrastructure that they manage;
- No database clusters;
- No RabbitMQ clusters.
Let’s discuss all these different topics.
More but smaller cells
There are several different ways to deploy and scale OpenStack. In our quest to improve the design of the infrastructure architecture we concluded that having multiple cells with only a few hundred compute nodes was a proper fit for our use cases.
Cells offer several advantages that allow us to scale the infrastructure to thousands of compute nodes and at the same time have a resilient, flexible and configurable infrastructure. I would like to highlight:
- Single endpoint: scale horizontally and transparently between different data centres;
- Availability and resilience;
- Isolate failure domains;
- Dedicate cells to projects;
- Hardware types per cell;
- Different compute nodes configurations per cell.
To leverage all these advantages, we now don't have cells with more than 200 compute nodes. This number usually matches our procurement tenders, which eases the node distribution. Also, we no longer mix different hardware types in the same cell.
Having multiple cells, and consequently an independent control plane for each one, allows us to scale the infrastructure massively while keeping relatively small failure domains. Because the failure domain is small (the cell), we don't need a highly available control plane for each individual cell. In fact, each cell control plane (RabbitMQ, nova-conductor, nova-api) runs in a small, individual virtual machine on top of the same infrastructure.
Cells are also used to separate our different workloads. We now dedicate cells to very CPU-intensive workloads (LHC batch processing, …), to special projects that require a particular configuration (GPU compute nodes, hyper-converged compute nodes, …) and to the generic workloads (IT services, remote desktops, …). By being able to fine-tune the compute node configuration for each particular use case, we increase the overall efficiency and resource utilisation. For example, in all cells dedicated to very CPU-intensive workloads we don't allow CPU overcommit, and all compute nodes are configured to offer CPU passthrough and huge pages, and have KSM disabled.
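As a sketch, the per-cell tuning for the CPU-intensive cells can be expressed with standard Nova options like these (the values are illustrative, not our exact settings; KSM itself is disabled at the host OS level, outside Nova):

```ini
# nova.conf fragment for compute nodes in a CPU-intensive cell
[DEFAULT]
# No CPU overcommit: one vCPU per physical core/thread
cpu_allocation_ratio = 1.0

[libvirt]
# Expose the host CPU model directly to the guests
cpu_mode = host-passthrough
```

Huge pages are then typically requested per flavor, via the `hw:mem_page_size` extra spec.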
We also have very small cells with less than 10 compute nodes, configured for very specific research projects.
Cells enabled us to scale and constantly adapt to meet the computing requirements of the Organization.
Move the control plane from physical nodes to VMs
After the decision to deploy "more but smaller cells", it was clear that we would need to allocate even more physical servers just for the control plane to maintain the same level of fault tolerance in the infrastructure. Running the entire control plane on physical servers was proving to be inflexible, unscalable and, in the end, very expensive.
So, we decided to follow a controversial approach…
Run the entire OpenStack control plane (OpenStack projects, RabbitMQ servers, and later even most of the databases) in virtual machines hosted on the very infrastructure that it manages.
This allows us to have a very distributed and fault tolerant control plane, leveraging the different availability zones available in the infrastructure.
At the beginning we still kept a few physical servers running the minimum set of OpenStack services to bootstrap the infrastructure in case of a complete shutdown of the data centres. Today, with all the experience gained over the years and the procedures that we put in place, those physical servers have all been removed.
To give an example, we run the Keystone project in relatively small virtual machines (8GB of RAM, 4 vCPUs) that are spread across the 3 availability zones (distributed between 18 Nova cells) of the infrastructure. Currently, we run 16 virtual machines for the Keystone service. The same approach is used for all the other OpenStack projects: nova-api, nova-conductor, nova-scheduler, Glance, Horizon, Neutron… including the RabbitMQ clusters and most of the databases.
This approach has proven to be very effective and has allowed us to scale rapidly while keeping a small control plane footprint.
Clustering!? No Clustering!?
A typical OpenStack architecture design includes running the databases and the message broker in a highly available cluster configuration. Of course, when we started, we did the same!
In the CERN IT Department there is a team responsible for the management of the databases for the different IT services and the large experiment databases. At that time, the database team was using “Oracle Clusterware” to provide high availability for MySQL databases. All OpenStack projects used this solution.
For RabbitMQ we had a separate 3-node cluster with mirrored queues for each OpenStack project.
We had several issues with both highly available solutions, especially when they were affected by network partitions; they were the main cause of all our outages. And when a highly available solution fails, service recovery is not always easy.
From the beginning, the goal was to enrol all the computing capacity available in the data centres into OpenStack (> 8K compute nodes, ~300K cores). To scale the infrastructure to those numbers we needed to have a simple architecture, predictable and easy to recover in case of problems.
We made the decision not to run any of the OpenStack projects' databases on top of a highly available solution.
For each OpenStack project and each Nova cell we have a separate MySQL instance. The storage for the databases is backed by NetApp. In case of an issue in one of the MySQL instances, the database team simply creates another instance and attaches the storage. In fact, most of the databases (with the exception of the Keystone, Neutron and Nova API databases) also run in virtual machines on top of the cloud infrastructure.
The same decision to not run a clustered solution was made for RabbitMQ (with the exception of Neutron).
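In practical terms, this "no clusters" layout simply means that each cell's Nova configuration points at a single MySQL instance and a single RabbitMQ node. A hypothetical per-cell nova.conf fragment (hostnames and credentials are placeholders, not our real ones):

```ini
# nova.conf sketch for one cell's conductor: a dedicated MySQL
# instance and a single, non-clustered RabbitMQ node per cell
[DEFAULT]
transport_url = rabbit://nova:secret@rabbit-cell42.example.org:5672/

[database]
connection = mysql+pymysql://nova:secret@mysql-cell42.example.org/nova_cell42
```

With cells v2, the same two URLs are what gets registered in the cell mapping (via `nova-manage cell_v2 create_cell`), so losing one cell's broker or database never touches the others.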
This simple approach has allowed us to run a large infrastructure with a very small number of outages. And when they happen, recovery is easy and fast.
The service growth of the CERN cloud infrastructure is not only in the number of compute nodes, virtual machines hosted or users. Over the years we also increased our service offering. In 2013 we started with only 4 OpenStack projects (Keystone, Nova, Glance and Horizon) but today we run 14 OpenStack projects.
Let’s now dive through the deployment history and milestones of some of these OpenStack projects.
Network is a critical component of any infrastructure.
The CERN data centre was fully operational, running critical workloads to support the Organization and Experiments including the Large Hadron Collider, when we started to deploy the OpenStack infrastructure.
The cloud infrastructure needed to be deployed on top of the existing resources without compromising the existing workloads.
Any change in the network model was extremely risky. This meant that we needed to build the cloud infrastructure on top of a network model that was originally designed to support a few thousands static physical servers.
In 2013 OpenStack’s network component was nova-network (Quantum/Neutron was still in very early stages). Nova-network was very limited in terms of functionality but it offered everything that we needed: the simple “linux bridge” model.
The CERN network is a flat network with segmentation per subsets of subnets. This required the development of a new nova-network component integrated with the CERN network management database. By default, Nova first selects the compute node for a virtual machine and only then allocates the IP address. In our case, this IP address needs to be carefully selected based on the compute node, for the packets to be routable. Also, virtual machines need to be registered in the CERN network database in order to have connectivity, traceability for the security team and integration with all the other CERN management services. As a final curiosity, all the instances running in CERN's cloud infrastructure get a public IP address.
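The idea of allocating an address only after the compute node is known can be sketched as follows. This is a toy model, not CERN's actual code: the node-to-segment mapping, the pools and all addresses (RFC 5737 documentation ranges) are invented for illustration.

```python
# Toy sketch of compute-node-aware IP allocation: each network
# segment owns a pool of addresses, and the address for a new VM
# must come from the pool of the segment hosting the chosen node.
import ipaddress

# Hypothetical mapping: compute node -> network segment
NODE_SEGMENT = {"compute-001": "seg-a", "compute-002": "seg-b"}

# Hypothetical pools of free addresses per segment
FREE_IPS = {
    "seg-a": [ipaddress.ip_address("192.0.2.10")],
    "seg-b": [ipaddress.ip_address("198.51.100.20")],
}

def allocate_ip(compute_node: str) -> ipaddress.IPv4Address:
    """Pick an address that is routable on the chosen node's segment."""
    segment = NODE_SEGMENT[compute_node]
    return FREE_IPS[segment].pop(0)

# The scheduler has already picked compute-002; only now do we
# choose an IP, constrained by that node's segment.
ip = allocate_ip("compute-002")
print(ip)
```

In the real infrastructure this selection is of course driven by the CERN network management database rather than static dictionaries.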
We moved to Neutron a few years after running the infrastructure in production. With Neutron we continued to use the same network model (linux bridge) and similar patches to integrate with the CERN network management system.
All new cells since then have been configured to use Neutron. However, today we are still in the process of migrating all the old cells from nova-network to Neutron. More than 500 compute nodes and 4900 virtual machines are still managed by nova-network. We hope to finally complete the migration during this year.
The introduction and scaling of Neutron led us to an important architecture change: the introduction of regions.
Unlike Nova, Neutron doesn't support any sharding. All the neutron-agents connect to the same RabbitMQ cluster. With thousands of compute nodes this creates a scaling issue for the RabbitMQ cluster but, more importantly, a single point of failure in the infrastructure. Neutron-servers scale horizontally, but if an issue affects the RabbitMQ cluster dedicated to Neutron, the whole infrastructure is affected.
To mitigate this issue, we decided to split the infrastructure between different regions. This allows us to reduce the risk of an outage related to a RabbitMQ cluster issue.
Today, we have 3 main regions.
Storage as a Service
Storage as a service is a fundamental component in any cloud infrastructure. In our initial prototypes we had “nova-volume” and later Cinder configured with LVM. However, we didn’t feel comfortable offering this solution in a production environment. The CERN storage team had a Ceph prototype that looked very promising working as a Cinder backend for block storage.
After a few months running the production infrastructure, we deployed Cinder (2014) with Ceph as a backend. As expected, the service grew steeply and today it is one of the most popular services that we offer.
When CephFS became stable, we deployed the Manila project. With the rise of Kubernetes clusters, Manila is now a critical service providing shared storage to the containers workloads. More recently we also started to offer an S3 endpoint backed by Ceph.
Because we didn’t have a storage solution for the cloud infrastructure in 2013, we needed to get creative and find a reliable solution to store the cloud images. We configured Glance to use the old, but very popular at CERN, AFS. When Ceph was deployed, all the cloud images were migrated to Ceph.
Baremetal as a Service
Most of the users moved all their services into the cloud infrastructure; however, some workloads still require dedicated physical servers. For those, the old procedures remained in place, and it was very difficult to track the physical servers after they had been allocated to a project.
Ironic would allow users to manage all those requests using the same APIs that they were already familiar with for creating virtual machines. Actually, we had a more ambitious goal: manage all the physical servers available in the data centre (including the cloud compute nodes) using Ironic. This would allow us to manage the entire life cycle of all physical servers in the data centre with Ironic, simplifying the existing procedures.
We deployed Ironic in 2018. Since then we have been working with different CERN teams and Ironic upstream to improve the life cycle procedures (enrol, repair, retirement, …) of the physical servers. All new capacity added into the data centre is enrolled and managed by Ironic. Then, if these resources are allocated for the cloud infrastructure, they are instantiated as compute nodes.
We are now in the process of adopting into Ironic all the compute nodes that were already deployed and running production workloads. Currently, more than 1000 compute nodes have been adopted. In total we have 7000 physical servers managed by Ironic.
One of the challenges has been to scale the Ironic/Nova infrastructure. Recently, we sharded the Ironic/Nova infrastructure using conductor groups. This allows us to have failure domains in Ironic/Nova and speed up all the periodic tasks when having thousands of nodes managed by Ironic.
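The sharding relies on standard options introduced in the Rocky/Stein era of Ironic and Nova; a sketch, with placeholder group and host names:

```ini
# ironic.conf on the conductors of one shard
[conductor]
conductor_group = rack-a

# nova.conf of the nova-compute (Ironic driver) service for the same shard
[ironic]
partition_key = rack-a
peer_list = nova-compute-ironic-1,nova-compute-ironic-2
```

Each bare-metal node is then assigned to a group, for example with `openstack baremetal node set --conductor-group rack-a <node>`, so its periodic tasks are handled only by that shard's conductors.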
As you can see in Fig. 3, during the years we have introduced many other OpenStack projects into the production infrastructure (Ceilometer, Heat, Barbican, Mistral, Rally, EC2API, Magnum). They have specific use cases in the Organization.
One of the most successful projects has been Magnum. Introduced in 2016, it’s the easy way to create managed Kubernetes clusters in the CERN cloud infrastructure. Several services in the Organization have been migrated to run on top of Kubernetes clusters that are deployed by Magnum. CERN also has an important role in the upstream development of Magnum.
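From a user's perspective, getting a managed cluster is a couple of standard OpenStack CLI calls (the template name and sizes here are placeholders, not CERN's actual templates):

```shell
# Create a Kubernetes cluster from a pre-defined Magnum template
openstack coe cluster create my-cluster \
    --cluster-template kubernetes-template \
    --node-count 3

# Once the cluster is ready, fetch its kubeconfig
openstack coe cluster config my-cluster
```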
We deployed the Ceilometer project a few months after moving into production. Running Ceilometer at scale was a big challenge. It had a complex architecture and some design issues that made it very difficult to deploy, manage and actually retrieve data from. After 3 years, we decided to remove this project from the infrastructure and replace its functionality with small in-house tools. The project itself changed completely over the years, dropping the storage component.
Other projects never made it into the production infrastructure (Trove, Murano, Qinling, Watcher). After testing them we found that they didn’t fit our use cases or the community supporting them was very small. Most are now also removed from the OpenStack governance.
During all these years we learnt a lot about how to build and operate an OpenStack cloud infrastructure with thousands of everything.
Complex operations like upgrades are now “routine” operations, considering the procedures that we have in place and also the maturity of all OpenStack projects. However, the short release cycle and the number of projects means that we are always upgrading a project in the infrastructure.
We don’t run compute nodes with Hyper-V anymore. Few years ago all the instances running on top of Hyper-V were migrated to KVM, which simplifies the support and management of the infrastructure.
The data centre in Hungary was removed from the infrastructure in 2019. Some of the capacity was transferred to data centre containers next to the CERN main site. For the infrastructure this just meant that some cells were removed and others added.
Upgrades of the operating system to major releases continue to be very challenging on the compute nodes. Usually an in-place upgrade doesn't work and a reinstallation is necessary. Since 2013 we moved from SLC6 (Scientific Linux CERN 6) to CentOS 7, and we are now considering the upgrade to CentOS 8 / Stream or even a different operating system.
Security issues are extremely demanding, because of the tight schedule in which fixes usually need to be deployed, but also because of the disruption that they can cause to users. In 2018, because of the Meltdown/Spectre vulnerabilities, we needed to reboot all the shared infrastructure, disrupting thousands of users. At the same time we disabled SMT, which cut the number of exposed cores in half. This created performance issues that we needed to investigate on a few particular hardware types.
Rally has an important role in probing the infrastructure. Every day more than 5000 virtual machines and many other resources are created just to probe the infrastructure.
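This kind of probing is driven by Rally task definitions; a minimal sketch (the scenario name comes from the rally-openstack plugin set, while the flavor, image and counts are placeholders):

```yaml
# Boot-and-delete probe: repeatedly create a server and remove it,
# recording timings and failures (illustrative values)
NovaServers.boot_and_delete_server:
  - args:
      flavor:
        name: "m1.small"
      image:
        name: "cirros"
    runner:
      type: "constant"
      times: 10
      concurrency: 2
```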
When having a large infrastructure, automation is key.
Over the years we have been automating every possible procedure. From project lifecycle, quota and flavour requests, to repair procedures and user communication: everything is automated. For this automation we use Rundeck and Mistral. These projects have been helping us with all the repetitive and time-consuming tasks. However, in automation there is always something more that can be done.
Show me the Numbers
Let’s now explore some of the numbers of the CERN cloud infrastructure. Most of the important metrics can be seen in one of our monitoring dashboards (Fig. 4).
I would like to highlight the growth of the Magnum clusters and Manila shares during the last years. They show a shift in how some of the workloads are being deployed.
At the time of this writing we have 3 main regions with 75 Nova cells in total. All the OpenStack projects are supported by 115 MySQL instances.
Finally more than 75% of the computing capacity is reserved to crunch the LHC data.
During the last 10 years the computing model changed completely. From dedicated physical servers, to virtual machines, to container orchestration, to managed physical nodes.
OpenStack was from the beginning a key component in the CERN computing transformation. It allowed us to build a private cloud infrastructure that helps the Organization to achieve its scientific goals.
The infrastructure requirements continue to evolve, and we keep exploring new technologies and projects (GPUs, machine learning frameworks, Functions as a Service, …). We also continue to investigate new deployment models to facilitate operations and increase the infrastructure efficiency, for example running the OpenStack control plane on top of Kubernetes, or the development of preemptible instances.
We work very closely with the OpenStack upstream community. On the development side, CERN has contributed nearly 1000 code patches from 34 team members. But it is not just code: we regularly share our experience at the OpenStack summits and user meetings, and have several team members serving as project cores and in other important roles in the community, such as members of the technical committee and board of directors. I really believe this is the key to the success of the CERN cloud infrastructure. Winning the first OpenStack Superuser award in 2014 made us very proud, but it also increased our influence in the OpenStack community.
I can’t finish this blog post without giving credit to all the people that worked and collaborated in the development of the CERN cloud infrastructure during the last 7 years. It was a challenging journey!
Looking forward to the next years of the CERN cloud infrastructure.