SCL Management Architecture#
About this document#
This document provides a starting point for getting familiar with the core concepts underlying the Separation Control Layer (SCL) management system. Overall familiarity with the Concepts of the SCL is expected.
This document should answer the following questions:
- What are the initial key goals that shaped the SCL design?
- Which services does the SCL consist of? How do they interact?
- How are the services deployed in single-node and multi-node environments?
Recap of initial goals and requirements#
This is a rough summary of the goals that strongly influenced the SCL design.
- Functional:
  - Infrastructure as a Service software with strong tenant isolation
- Non-functional (in any order):
  - Declarative specification of resources and their state similar to the proven design of Kubernetes, to have a reliable and consistent representation of the intended system state and make developer / operator / user adoption easier
  - Scalable architecture with strong consistency
  - Security conscious engineering, for example:
    - Very small / auditable TCB
    - Separation of concerns
    - Secured inter-service communication
  - High maintainability of the code:
    - Rust (strikes a good balance between maintainability vs. safety vs. ergonomics)
  - Good operability of the services:
    - Observability
    - Fault-tolerant components with self-healing behaviour
  - Flexibility of service constellations: enable multiple deployment scenarios, e.g. single-node and multi-node.
  - Extensibility of the functionality: A growth-driven approach was chosen that enables adding features one after another, growing from an MVP to a production-ready system, while replacing stop-gap prototypes with components from partners (hypervisor, storage, networking, ...).
- Context:
  - The SCL is a system that integrates many other components.
Services#
Overview#
The service architecture of the SCL management system largely follows the core principles behind Kubernetes1, specifically due to its scalability, testability, and familiarity for users and operators:
The central idea is to iteratively transform the underlying system state into a user-declared target state using a set of individual control loops based on state machines. In the process, all state information is held in a central, replicated database, which can be seen as the "single source of truth" of the system, as it is the only place where the system state is stored persistently.
The system consists of the following core components:
- The Infrastructure Management API (IM-API) acts as a proxy for the SCL Management and performs OpenID Connect based authentication and authorization for separation contexts before it forwards requests to the SCL API server.
- The SCL API server is the central communication endpoint and controls all database (etcd) accesses.
- etcd is used as the database service. It offers a strongly consistent, replicated single source of truth. It stores a representation of the system state, where most SCL Objects are conceptually described by desired state (`spec`), actual state (`status`), and metadata.
- Various controllers drive the current state towards the desired state by invoking other / lower APIs. The controllers register themselves at the SCL API server, watch relevant SCL Object changes, and execute required actions. In addition, they perform periodic health checks.
- Node API instances are running on compute nodes and are used by most controllers to interact with the underlying system.
Every one of these components can be scaled horizontally.
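As an illustration of the SCL Object shape described above, the following Rust sketch models a generic object with desired state, actual state, and metadata. The field names and the VM example are hypothetical and do not reflect the actual SCL schema.

```rust
use serde::{Deserialize, Serialize};

/// Illustrative only: a generic SCL Object wrapper with the three conceptual parts.
#[derive(Debug, Serialize, Deserialize)]
struct SclObject<Spec, Status> {
    /// Metadata such as a unique name and the owning separation context.
    metadata: Metadata,
    /// Desired state, declared by the user or another controller.
    spec: Spec,
    /// Actual state, reported back by the responsible controller.
    status: Option<Status>,
}

#[derive(Debug, Serialize, Deserialize)]
struct Metadata {
    name: String,
    separation_context: String,
}

/// Hypothetical VM resource as an example instantiation.
#[derive(Debug, Serialize, Deserialize)]
struct VmSpec {
    vcpus: u32,
    memory_mib: u64,
}

#[derive(Debug, Serialize, Deserialize)]
struct VmStatus {
    node: Option<String>,
    phase: String, // e.g. "Scheduled", "Running"
}

type Vm = SclObject<VmSpec, VmStatus>;
```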
Notes regarding the components overview figure above:
- The dashed lines of the L3 Network Controller depict a transition state, in which the controller will be deployed on a single compute node, enabling that node to also act as a gateway node. In the future, the L3 controller will be deployed similarly to the other controllers.
- The dashed Volume Controller is a prototype designed for network storage but, due to the unavailability of that technology, implemented via local storage. It is kept as a prototype for developing multi-node storage for the SCL. Local storage functionality has been merged into the VM Controller and should be used instead unless network storage is to be implemented (see VM Controller for details).
- Other services attached to the Node API, e.g. an L3 Network API or a Storage API, might be added as needed.
Infrastructure Management API#
See also: OpenAPI specification of the IM-API.
The Infrastructure Management API (IM-API), realized by a particular configuration of the OpenAPI Proxy, controls external accesses to the SCL API. This is achieved via the following means:
- The offered API is a reduced subset of the SCL API intended for end-users (endpoints for operators and controllers are only available in the SCL API).
- The service integrates with an Identity Management system (such as Keycloak) for role-based access control.
- It makes it possible to integrate industry-standard-compatible APIs (e.g., Hashicorp Terraform2).
Clients present authorization as claims encoded in a JSON Web Token, submitted as a bearer token (RFC 6750, section 2.1) with each request. Tokens must be signed by a trusted authorization server (IDM) that exposes its public signing keys as a JWK Set. Access is then granted or rejected based on OAuth 2.0 scope and metadata identifying the accessible resources. If access is rejected, the server sends an appropriate error response to the client; if access is granted, the request is forwarded to a configured SCL API endpoint and the resulting response forwarded to the client.
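The following sketch illustrates the kind of token check described above, using the jsonwebtoken crate (v8+ API assumed). The claim names, scope handling, and function signatures are assumptions for illustration, not the IM-API's actual implementation; in particular, fetching the JWK Set from the IDM and selecting the key by `kid` is omitted.

```rust
use jsonwebtoken::{decode, decode_header, Algorithm, DecodingKey, Validation};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Claims {
    sub: String,   // subject (user identity)
    scope: String, // OAuth 2.0 scopes, space-separated
    exp: usize,    // expiry, validated by the library
}

/// Validate a bearer token against an RSA public key taken from the IDM's JWK Set.
/// `jwk_n` / `jwk_e` are the base64url-encoded RSA modulus and exponent of the
/// key whose `kid` matches the token header (hypothetical helper, illustration only).
fn check_bearer_token(
    token: &str,
    jwk_n: &str,
    jwk_e: &str,
) -> Result<Claims, jsonwebtoken::errors::Error> {
    // Inspect the header first, e.g. to select the matching JWK by `kid`.
    let _header = decode_header(token)?;

    let key = DecodingKey::from_rsa_components(jwk_n, jwk_e)?;
    let validation = Validation::new(Algorithm::RS256);

    let data = decode::<Claims>(token, &key, &validation)?;
    Ok(data.claims)
}

/// Check whether the required OAuth 2.0 scope is present in the token's claims.
fn has_scope(claims: &Claims, required: &str) -> bool {
    claims.scope.split_whitespace().any(|s| s == required)
}
```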
In addition to the IM-API, a Hashicorp Terraform2 provider is offered, which facilitates easy integration with state-of-the-art workflows that leverage this system.
In future iterations, the IM-API can be extended by further functionality such as enabling an audit trail by injecting appropriate identifiers into the requests and responses which can then be tracked throughout the system.
SCL API#
See also: OpenAPI specification of the SCL API.
The SCL API is a stateless, versioned, and declarative REST API offered via HTTPS.
It is the central endpoint over which all communication concerning the
system state (querying as well as changing it) happens. All accesses to the system
state and therefore requests to the SCL API must be authenticated via mTLS
and pass validation: Only specified SclObject types, states,
and state transitions can be stored in the persistent database,
which is ensured by the SCL API.
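To make the validation idea more concrete, the following is a minimal sketch of how an API server could reject invalid state transitions before anything is written to the database. The states and the transition table are hypothetical simplifications, not the actual SCL rules.

```rust
/// Hypothetical lifecycle states of an SCL Object (simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ObjectState {
    Pending,
    Active,
    Failed,
    Deleting,
}

/// Returns true if a requested transition is allowed and may be persisted.
fn transition_allowed(from: ObjectState, to: ObjectState) -> bool {
    use ObjectState::*;
    matches!(
        (from, to),
        (Pending, Active)
            | (Pending, Failed)
            | (Active, Failed)
            | (Pending, Deleting)
            | (Active, Deleting)
            | (Failed, Deleting)
    )
}

fn main() {
    // A request trying to move an object from Failed back to Active is rejected.
    assert!(!transition_allowed(ObjectState::Failed, ObjectState::Active));
    assert!(transition_allowed(ObjectState::Pending, ObjectState::Active));
}
```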
With this design, a consistent view is ensured across all services. Even though it may appear that the SCL API, being placed in front of the database, could become a single point of failure (e.g. under DoS attacks), this is not the case: thanks to the stateless design, the service can be replicated to an arbitrary number of instances.
The SCL API itself does not perform any sub-system interaction (e.g. with compute nodes) to realize the desired state. The API also does not have any user management (e.g., who is allowed to access which SC). This is done by the Infrastructure Management API.
Etcd#
etcd is used to store the system state of the SCL management. It is a database natively
built for scalability and strong consistency. Thus, every SCL API instance can be executed
with a backing etcd instance that runs in a cluster and synchronizes the DB state
with all other DB instances. Communication between the SCL API and etcd is secured via
mTLS, just like the communication between etcd instances themselves.
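For illustration, a minimal sketch of how an API server process might talk to etcd using the etcd-client crate is shown below. The key layout is made up, and the mTLS configuration that the SCL requires is omitted for brevity.

```rust
use etcd_client::{Client, GetOptions, WatchOptions};

#[tokio::main]
async fn main() -> Result<(), etcd_client::Error> {
    // In a real deployment this connection would be secured via mTLS.
    let mut client = Client::connect(["http://127.0.0.1:2379"], None).await?;

    // Persist an object under an illustrative key prefix.
    client
        .put("/scl/vms/example-vm", r#"{"spec":{},"status":null}"#, None)
        .await?;

    // Read back everything below the prefix.
    let resp = client
        .get("/scl/vms/", Some(GetOptions::new().with_prefix()))
        .await?;
    for kv in resp.kvs() {
        println!("{} = {}", kv.key_str()?, kv.value_str()?);
    }

    // Watch the prefix for changes; this is what powers push-based updates.
    let (_watcher, mut stream) = client
        .watch("/scl/vms/", Some(WatchOptions::new().with_prefix()))
        .await?;
    while let Some(watch_resp) = stream.message().await? {
        for event in watch_resp.events() {
            println!("change: {:?}", event.event_type());
        }
    }
    Ok(())
}
```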
Controllers#
General architecture and control flow#
The SCL controllers act on state changes reported via the SCL API and propagate them in a resource-specific manner to the underlying systems such as the hypervisors, SmartNICs, storage systems, and so on. Implementation-wise they are individual services executing a "control loop" that adheres to a resource-specific state machine. The core aspect of controller operation is a "current state vs. desired state" comparison to derive and implement necessary actions. The comparison of the two states can be triggered, e.g., by the service starting up, periodically, or - most importantly - by events sent from the SCL API indicating that a resource object changed.
The following figure depicts the overall operational loop that is executed by each SCL controller, regardless of its associated resource type.
This process is deliberately kept simple to make it as failure-safe as possible. Furthermore, the shown design achieves high testability of the resulting code, which the extensive SCL test suite makes use of.
Should the performance characteristics of any controller turn out to be insufficient for specific use-cases or resource types, a straightforward extension with state caching mechanisms (e.g., keeping the last SCL API state in the controller) is possible.
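The following sketch outlines the loop structure described above in Rust. The trait and method names are illustrative assumptions rather than the actual controller framework used in the SCL.

```rust
/// Illustrative trigger sources for a reconciliation run.
enum Trigger {
    Startup,
    PeriodicResync,
    WatchEvent { object_name: String },
}

/// Hypothetical interface of a resource-specific controller.
trait Controller {
    type Desired;
    type Actual;

    /// Fetch the desired state from the SCL API.
    fn desired_state(&self, name: &str) -> Self::Desired;
    /// Observe the actual state from the underlying system (e.g. via the Node API).
    fn actual_state(&self, name: &str) -> Self::Actual;
    /// Derive and execute the actions needed to close the gap, then report status.
    fn reconcile(&mut self, desired: &Self::Desired, actual: &Self::Actual);
}

/// The generic operational loop shared by all controllers (simplified, synchronous).
fn run<C: Controller>(controller: &mut C, triggers: impl Iterator<Item = Trigger>) {
    for trigger in triggers {
        let name = match &trigger {
            Trigger::WatchEvent { object_name } => object_name.clone(),
            // On startup or resync a real controller would list all relevant objects;
            // a single placeholder name keeps the sketch short.
            _ => "all-objects".to_string(),
        };
        let desired = controller.desired_state(&name);
        let actual = controller.actual_state(&name);
        controller.reconcile(&desired, &actual);
    }
}
```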
Scheduler#
The prototypical scheduler is responsible for identifying and assigning a suitable Node with sufficient resources to unassigned VMs. It is the only controller that does not perform any I/O apart from interacting with the SCL API. Scheduling is currently implemented via a first-fit approach: the scheduler determines a Node's remaining capacity by subtracting the total resources of all assigned VMs from the Node's advertised resources.
Warning
Due to the prototypical implementation, Nodes could be overbooked if multiple instances of the Scheduler are running. In order to prevent such behavior, a strict synchronization of a Node's booked resources has to be implemented. This is planned in the context of implementing redundancy and HA features for the SCL in general.
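A minimal sketch of the first-fit selection described above follows; the resource model and names are hypothetical.

```rust
#[derive(Debug, Clone, Copy)]
struct Resources {
    vcpus: u32,
    memory_mib: u64,
}

struct Node {
    name: String,
    advertised: Resources,
    /// Resources of all VMs already assigned to this Node.
    assigned: Vec<Resources>,
}

impl Node {
    /// Remaining capacity: advertised resources minus the sum of assigned VM resources.
    fn remaining(&self) -> Resources {
        let used = self.assigned.iter().fold(
            Resources { vcpus: 0, memory_mib: 0 },
            |acc, r| Resources {
                vcpus: acc.vcpus + r.vcpus,
                memory_mib: acc.memory_mib + r.memory_mib,
            },
        );
        Resources {
            vcpus: self.advertised.vcpus.saturating_sub(used.vcpus),
            memory_mib: self.advertised.memory_mib.saturating_sub(used.memory_mib),
        }
    }
}

/// First fit: return the first Node that can hold the requested VM resources.
fn schedule<'a>(nodes: &'a [Node], request: Resources) -> Option<&'a Node> {
    nodes.iter().find(|node| {
        let free = node.remaining();
        free.vcpus >= request.vcpus && free.memory_mib >= request.memory_mib
    })
}
```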
L2 Controller#
Every Separation Context has a dedicated layer 2 network. These networks are managed by the controller via the L2 network functionality of the Node API:
The controller has two responsibilities:
- VLAN management: The SC handler part of the controller ensures that the VLAN of every SC exists on all registered Nodes. If an SC is undergoing deletion, the controller cleans up the VLANs on all Nodes. The identifier used for the Node API interaction is the unique VLAN tag that every SC has (assigned by the SCL API).
- TAP management: The VM handler part of the controller creates tap devices designated for connecting VMs within the associated VLAN. The tap devices get created on the Node where a VM was scheduled. After that, the state machine of the VM SCL Object is updated to indicate the progress, so that other controllers can take action. The identifier used to derive the tap name is currently specified by the user.
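A sketch of the two responsibilities above, written against a hypothetical Node API client interface (all names are illustrative, not the actual Node API):

```rust
/// Hypothetical client for the L2 network part of the Node API.
trait L2NodeApi {
    fn ensure_vlan(&self, vlan_tag: u16);
    fn remove_vlan(&self, vlan_tag: u16);
    fn create_tap(&self, vlan_tag: u16, tap_name: &str);
}

struct SeparationContext {
    vlan_tag: u16, // unique tag assigned by the SCL API
    deleting: bool,
}

/// SC handler: make sure the SC's VLAN exists on all registered Nodes,
/// or clean it up everywhere if the SC is being deleted.
fn reconcile_sc(sc: &SeparationContext, nodes: &[&dyn L2NodeApi]) {
    for node in nodes {
        if sc.deleting {
            node.remove_vlan(sc.vlan_tag);
        } else {
            node.ensure_vlan(sc.vlan_tag);
        }
    }
}

/// VM handler: create the tap device on the Node where the VM was scheduled.
fn reconcile_vm_tap(sc: &SeparationContext, scheduled_node: &dyn L2NodeApi, tap_name: &str) {
    scheduled_node.create_tap(sc.vlan_tag, tap_name);
}
```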
The behavior of the controller is depicted in the following figure:
L3 Controller#
Note
In contrast to most controllers, the L3 Controller does not currently interact with Nodes via the Node API. Instead, it directly interacts with the system where the controller is deployed, strongly relying on system modifications done by the Node (L2 Network) API. This is expected to change in the future.
The next figure depicts the connection of separation contexts and contained VMs to an external network with the single-node L3 setup. A Router resource created and properly configured via the SCL API enables VMs to reach the internet via a configured gateway device, and optionally makes ports of VMs inside the SC available to the gateway network.
A Router consists of:
- an external IP that will be assigned by the controller to a veth device connected to the SCL bridge,
- an internal IP that will be assigned by the controller to the SC specific bridge and that should be used as gateway address within SC VMs,
- optional port forwarding rules that map ports of the external IP to user-specified TCP or UDP addresses (for instance, an SSH port of some SC VM) inside the SC specific network namespace.
The SCL bridge can be connected to either an external network (such as the internet) or an arbitrary network setup provided by independent components.
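For illustration, a Router resource could be modeled along these lines; the field names and types are assumptions, not the actual SCL schema.

```rust
use std::net::Ipv4Addr;

/// Transport protocol of a forwarded port.
#[derive(Debug, Clone, Copy)]
enum Protocol {
    Tcp,
    Udp,
}

/// One port forwarding rule: external port -> internal address inside the
/// SC specific network namespace.
#[derive(Debug)]
struct PortForward {
    protocol: Protocol,
    external_port: u16,
    internal_ip: Ipv4Addr,
    internal_port: u16,
}

/// Hypothetical Router spec as described above.
#[derive(Debug)]
struct RouterSpec {
    /// Assigned to a veth device connected to the SCL bridge.
    external_ip: Ipv4Addr,
    /// Assigned to the SC specific bridge; used as gateway address inside SC VMs.
    internal_ip: Ipv4Addr,
    /// Optional port forwarding rules.
    port_forwards: Vec<PortForward>,
}
```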
The core logic of the controller is depicted below:
Limitations
- There is no central management of external IPs yet. In the future, these have to be assigned by a trusted component (SCL API or another controller) to prevent conflicts. Currently, users need to ensure that external IPs a) are in the same network as the SCL bridge and b) are unique.
- There is currently no support for updating forwarded UDP or TCP ports. It is likely that at least appending new forwarding rules will be supported in the near future.
VM Controller#
The VM controller processes VM SCL Objects and acts on them by interacting with the Compute API part of the Node API.
The processes executed by the VM controller, based on the desired state specified via the SCL API and the actual state reported by the operating system as well as the hypervisor, are visualized in the following figure:
Based on this, an example process for the creation of a virtual machine via the IM-API from a user perspective is shown in the sequence diagram below.
The latter also encompasses the acquisition of an access token from the IDM, which is a mandatory first step performed by the user before accessing the IM-API.
Furthermore, the controller registration process is shown, in which the scheduler and the VM controller open a "watch" channel with the SCL API that, in turn, leverages the corresponding watch functionality of etcd to provide updates on modified resources in a push-based manner.
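To illustrate the "desired vs. actual" comparison for VMs, the following sketch derives an action from a desired power state and the state reported by the hypervisor. The state and action names are hypothetical simplifications of what the VM controller actually handles.

```rust
/// Desired power state taken from the VM object's spec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DesiredPower {
    Running,
    Stopped,
}

/// Actual state reported by the hypervisor / operating system.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActualVmState {
    NotCreated,
    Running,
    Stopped,
}

/// Action the VM controller would request via the Compute API of the Node API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    CreateAndStart,
    Start,
    Stop,
    Nothing,
}

fn derive_action(desired: DesiredPower, actual: ActualVmState) -> Action {
    match (desired, actual) {
        (DesiredPower::Running, ActualVmState::NotCreated) => Action::CreateAndStart,
        (DesiredPower::Running, ActualVmState::Stopped) => Action::Start,
        (DesiredPower::Running, ActualVmState::Running) => Action::Nothing,
        (DesiredPower::Stopped, ActualVmState::Running) => Action::Stop,
        (DesiredPower::Stopped, _) => Action::Nothing,
    }
}
```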
VM Boot Volume#
For each VM, a boot volume must be configured. This can either be a referenced SCL volume that is managed by the deprecated Local Volume Controller for single-node setups or a local storage volume directly associated with the VM and its lifetime.
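A sketch of the two boot volume variants described above (illustrative only; the field names are assumptions):

```rust
/// Hypothetical representation of the two supported boot volume variants.
#[derive(Debug)]
enum BootVolume {
    /// Reference to an SCL Volume object managed by the deprecated
    /// Local Volume Controller (single-node setups only).
    ReferencedVolume { volume_name: String },
    /// Local storage volume tied to the VM itself and to the Node
    /// where the VM was scheduled.
    LocalStorage { size_gib: u64, image_url: Option<String> },
}
```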
Historic Note
Initially, the SCL was designed to support only network storage volumes, which, as the name implies, are available to any compute node over the network and are thus ready to be attached as devices to VMs regardless of where a VM is scheduled.
Since this feature was not available at the very beginning of the implementation and the focus was initially set on single-node setups, a now deprecated Local Volume Controller was implemented as an interim solution; it provides file-based volumes using basic Linux system functions and has no Node awareness.
With the support of scenarios with multiple compute nodes, however, the Local Volume Controller can no longer be used because it lacks Node-awareness. Since proper network storage volumes are still not available, the SCL VM object was adapted to additionally support Local Storage Volumes directly associated with the VM itself and the Node where the VM was scheduled. This avoids an entire range of problems related to Node locality (initialization, migration and required key exchange, etc.) for the time being.
Node API#
The SCL controllers require the Node API to be available on compute nodes in order to interact with them.
Scalability and fault tolerance#
High availability and load balancing#
As sketched for the SCL API, to achieve high availability, multiple instances of each SCL service can be executed. In a production deployment, the services are intended to be distributed over physically-separate nodes as depicted in the following figure.
All state synchronization occurs via the distributed database cluster.
Synchronization and load distribution are ensured by the etcd database system and need no specific handling by the SCL management system.
The SCL API itself can be easily distributed as it is stateless and serves only as an (authenticating and validating) proxy in front of the database system.
This way, it is also straightforward to add an HTTP load balancer in front of the API servers.
The SCL controllers can be replicated as well. In a production deployment, most actions can be taken over by any one of a group of controllers. The synchronization between controllers to assign such work happens via the database, proxied through the SCL API. By that, the set of controllers able to act on a specific group of resources can also be extended and distributed among multiple nodes to increase availability and reduce the load on individual instances.
Self-healing behavior#
Due to the iterative behavior of the controllers, partial "self-healing" properties can be achieved: For example, if a node is lost, resources may be re-scheduled and re-started on other nodes. However, in some cases, such behavior is not desired, e.g., if it could incur data loss and manual action on failure is preferable. At a minimum, all controllers must cope with full and partial power outage scenarios as well as random hardware defects.
In general, the controller design is such that only volatile state is recovered automatically:
- Persistent side effects are not recovered; these incidents are logged and reported to the API. Example: The controller detects that a Volume with the reported status `Active` does not have any corresponding local file. The controller will update the status to `Failed`. Recreating the Volume from scratch could be a bad idea for various reasons, e.g. the content behind the URL could be different (possibly changed by an attacker).
- Volatile side effects are recovered. Example: Network namespaces and network-booted VMs are re-created if they are supposed to be present.
Deployment constellations#
In general, the SCL was designed in such a way that "single-node" setups are just a special case of "multi-node" setups. However, due to missing requirements or technologies, there are a few minor things to point out.
Multi-node setup#
In multi-node setups, we distinguish between management nodes and compute nodes.
Management nodes#
Management nodes run at least the following services:
- The SCL API,
- an etcd instance,
- the Scheduler,
- the VM Controller,
- the L2 Network Controller, and
- optionally the IM-API if there is no dedicated machine for that.
Compute nodes#
Compute nodes run at least the Node API.
As shown below, the L3 controller should only be deployed on a single compute node, which also assumes the role of a gateway node (you could register a Node without any resources to prevent tenant VM activity there, if desired). The same node must be reachable by clients so that they can access their services running in the SCL.
Single-node setup#
Single-node setups are execution environments with just a single machine: The machine conflates multiple roles (see the management node and compute node distinction from above) that are otherwise (for stronger security guarantees and better scalability) typically distributed across several machines. Thus, this deployment scenario combines all of the services listed below on a single physical node.