Adoption of in-production physical nodes into Ironic and Nova

When Ironic is chosen to manage bare metal nodes, it is not unusual that there is already an existing in-production fleet of physical servers. This may leave the operator in the sub-optimal situation where tools and workflows need to handle both Ironic and pre-Ironic systems. To address this, Ironic supports an “adoption” feature: adoption allows operators to add nodes which Ironic should regard as in-use and which therefore take a slightly different path through the Ironic state machine. While this helps with direct or stand-alone use of Ironic, an additional complication arises when Ironic is used in conjunction with Nova (and Placement). These components do not support adoption and are therefore not aware when physical nodes are adopted into Ironic. Consequently, there is no way to manage such pre-Ironic nodes via openstack server commands. In this post, we will explain how we transparently adopted in-production nodes into Ironic and Nova/Placement to arrive at a situation where there is no difference between Ironic and pre-Ironic physical instances.

The initial situation and workflow overview

The CERN IT data centers host around 15’000 physical servers. While this number has been pretty constant over the past years, servers are of course constantly retired and replaced. Ironic moved to production in CERN IT in 2018 and all new deliveries since then have had their servers enrolled and managed via the OpenStack bare metal management tool. In addition, a few thousand servers needed to move from one data center to another and we used the opportunity to add them to Ironic as well. To this day, around 6’000 servers are therefore managed by Ironic. The goal, however, remains to manage (close to) all of the servers in CERN IT with Ironic in order to remove the duplication in the tool chain for Ironic and pre-Ironic nodes. This is why we started to look into adopting the pre-Ironic instances into Ironic and Nova/Placement.

Fig. 1 - CERN Ironic Dashboard

The basic idea for ending up with Nova instances which are connected to physical nodes in Ironic is relatively simple: we instantiate instances in Nova (so that all databases in the compute service have the correct instance information), but do not touch the underlying nodes in Ironic. The key to achieving this is the fake-hardware hardware type and the fake interface drivers. The fundamental steps in our procedure are:

  1. create a bare metal flavor
  2. create a hosting project
  3. enroll the nodes into Ironic
  4. change the hardware type and the interfaces to fake drivers
  5. one by one, add the nodes to the placement aggregate and create the instances
  6. change the hardware type and the interfaces back to the real ones

As you can see, this procedure does not rely on Ironic’s adoption feature at all, but instead treats production nodes as new nodes which can be instantiated – while the instantiation is cut short and replaced by no-ops. In the following, we will go over the individual steps in a little more detail.

Creating a bare metal flavor and a hosting project

In our deployment, we usually have a flavor per hardware type (and physical location). As usual, the flavors and the hardware type are linked by a custom resource class. To distinguish adopted instances from instances which were managed by Ironic from the start, we decided to have a special prefix (a1.) for the adoption flavors we created:

  • p1.dl0291174.S513-A-IP123 (normal physical flavor)
  • a1.dl7428883.S513-V-IP456 (adoption flavor)

This is of course not strictly necessary, but it may serve the jittery operator at some point. Besides creating a project to host the new physical instances, we need to pick a name for the resource class, set it on the flavor, and grant the project access:

$ openstack flavor set --property resources:$RESOURCE_NAME=1 $FLAVOR_NAME
$ openstack flavor set --project $PROJECT $FLAVOR_NAME
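
For completeness, the flavor itself could be created along the following lines. This is a sketch with placeholder sizes; zeroing out the standard resource classes (as recommended for bare metal flavors) makes scheduling depend solely on the custom resource class:

$ openstack flavor create --private --ram $RAM_MB --vcpus $CPUS --disk $DISK_GB $FLAVOR_NAME
$ openstack flavor set --property resources:VCPU=0 --property resources:MEMORY_MB=0 --property resources:DISK_GB=0 $FLAVOR_NAME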

At this point, we have a project with a flavor, but no physical nodes to use. We will add them in the next step.

Enrolling the underlying physical nodes and making them available

The (in-production) physical servers are enrolled the same way as completely new nodes would be:

$ openstack baremetal node create --conductor-group $CONDUCTOR_GROUP --resource-class $RESOURCE_CLASS --driver ipmi ...

The resource class needs to match the one from the flavor we created in the previous step.
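
As an illustration, a fuller (hypothetical) invocation could look like this, with the usual IPMI credentials passed as driver details:

$ openstack baremetal node create --name $NAME --conductor-group $CONDUCTOR_GROUP --resource-class $RESOURCE_CLASS --driver ipmi --driver-info ipmi_address=$BMC_ADDRESS --driver-info ipmi_username=$BMC_USER --driver-info ipmi_password=$BMC_PASSWORD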

Once enrolled, we need to create a port (which would usually be done by introspection) and move the nodes to the manageable state:

$ openstack baremetal port create --node $BM_NODE_UUID $MAC_ADDRESS
$ openstack baremetal node manage $NAME

Once this is done, we replace the hardware type and some of the interfaces (management, deploy, boot) with fake ones:

$ openstack baremetal node set --driver fake-hardware $NAME
$ openstack baremetal node set --management-interface fake $NAME
$ openstack baremetal node set --deploy-interface fake $NAME
$ openstack baremetal node set --boot-interface fake $NAME

Why not enroll the nodes with the fake hardware type and the fake drivers straight away? Using the fake components marks the resource provider’s inventory as reserved and therefore prevents allocations later on. The reason seems to be that the Ironic driver in Nova needs to be able to check the power state in order to mark the inventory as available (reserved=0). This does not work with the fake power interface, and since the fake hardware type does not support non-fake power interfaces, a slight detour is needed.
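
To see whether an inventory ended up reserved, the resource provider (whose UUID matches the Ironic node UUID) can be inspected:

$ openstack resource provider inventory list $BM_NODE_UUID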

Next, we set the software RAID configuration:

$ openstack baremetal node set --raid-interface agent $NAME
$ openstack baremetal node set --target-raid-config '{"logical_disks": ...

This is of course not necessary if you do not use Ironic’s software RAID support.
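
For illustration, a target RAID configuration for a software RAID 1 spanning the full disks could look like this (following the format Ironic documents; adapt it to your disk layout):

$ openstack baremetal node set --target-raid-config '{"logical_disks": [{"size_gb": "MAX", "raid_level": "1", "controller": "software"}]}' $NAME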

Finally, we provide the node to make it available for instantiation:

$ openstack baremetal node provide $NAME

As providing nodes usually triggers automatic cleaning (which means booting the node into a RAM disk image with the Ironic Python Agent, something we would like to avoid for these in-production nodes!), we protected this step with some safeguards in our scripts. So, make sure the drivers are really the fake ones before calling this command, as otherwise you may end up with a nice clean node …
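
A minimal sketch of such a safeguard (our actual scripts do more than this; $NAME is the node to be provided):

$ driver=$(openstack baremetal node show $NAME -f value -c driver)
$ test "$driver" = "fake-hardware" && openstack baremetal node provide $NAME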

Once the node has made it to available, we should be left with some allocation candidates in Placement (after the resource tracker has reported them). We can check this with:

$ openstack allocation candidate list --resource CUSTOM_$RESOURCE_CLASS='1'

We repeat this enrollment for all nodes we would like to instantiate.

When all nodes are reported by Placement, we still need to have Nova discover the new hosts and map them to a cell:

$ nova-manage cell_v2 discover_hosts

At this stage we should be ready to instantiate new physical instances!

Creating the instances

Since we may have enrolled multiple nodes in the previous step, there might now be multiple allocation candidates for the same custom resource class. The instances we are about to create, however, must have names which match the current names of the physical servers and must be scheduled onto the matching machines. How do we achieve this?

While we initially considered marking the inventory of all undesired resource providers as reserved with

$ openstack resource provider inventory set --resource CUSTOM_$RESOURCE_CLASS:reserved=1 ...

and only unreserving the desired one with

$ openstack resource provider inventory set --resource CUSTOM_$RESOURCE_CLASS:reserved=0 ...

we dropped this approach, mostly due to the lack of support for the required API microversion in our version of the OSC. Instead, we decided to control this via the mapping to the placement aggregate:

$ openstack resource provider aggregate set --generation $GENERATION --aggregate $AGGREGATE_ID $BM_NODE_UUID
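
The provider generation required by this call can be read off the resource provider itself, for instance with:

$ openstack resource provider show $BM_NODE_UUID -f value -c generation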

This aggregate-based approach reduced the number of relevant allocation candidates to one at any given moment and allowed us to create the physical instances with the usual command:

$ openstack server create ...

A few seconds later ACTIVE instances show up in our project.
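
Spelled out, such a creation call could look roughly like the following (flavor, image, and server names are placeholders); the server name is chosen to match the existing hostname of the physical server, and since the deploy interface is still fake at this point, the image is never actually written to the node:

$ openstack server create --flavor a1.dl7428883.S513-V-IP456 --image $IMAGE $SERVER_NAME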

Fig. 2 - The first 148 adopted nodes show up in conductor group 012

Changing the interfaces back to the normal ones

Now that we have active instances, we need to revert the drivers of the Ironic nodes back to the real ones:

$ openstack baremetal node maintenance set $BM_NODE_UUID
$ openstack baremetal node set --driver ipmi $BM_NODE_UUID
$ openstack baremetal node set --management-interface ipmi $BM_NODE_UUID
$ openstack baremetal node set --deploy-interface iscsi $BM_NODE_UUID
$ openstack baremetal node set --boot-interface pxe $BM_NODE_UUID
$ openstack baremetal node maintenance unset $BM_NODE_UUID

This will make openstack server commands fully functional again, and the physical instances will behave the same way as if they had been created in Nova and Ironic from the start.
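
As an additional sanity check (a suggestion, not strictly part of the procedure), Ironic can verify that the restored interfaces are set up correctly:

$ openstack baremetal node validate $BM_NODE_UUID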

That’s it, we are done :)

Didn’t you forget about … networking?

Yes and no: our Ironic deployment does not rely on Neutron for the allocation of IP addresses to bare metal instances. Addresses are assigned and managed by an external service with which Nova interacts when creating instances. In fact, in the bare metal case, this interaction is mostly an update of the host name and various metadata fields. Nonetheless, we needed to disable some sanity checks on creation and introduce an instance property field to signal that no interaction with or update of the network databases is desired when handling these new instances during adoption. As part of reverting the Ironic drivers to the real ones, we also removed this property from the instances. This property mechanism was particularly handy during testing and the adoption of the first nodes (where we still needed to adapt the procedure) … this time, the peculiarities of networking at CERN did make things easier.

What’s next?

Currently, we have enrolled ~150 nodes with this procedure, pushing the number of Ironic nodes in CERN IT just above 6’000. We will need to check that all our procedures, such as repair workflows, work with these adopted nodes, and will therefore wait a little before we start adopting more, but the plan is clearly to get the majority of the 15’000 nodes CERN IT manages into Ironic!

Acknowledgments

Many thanks to Daniel Abad, Surya Seetharaman, and Belmiro Moreira for their contributions to this work!

