Design Session - Hyperlaunch
--------------------
Wednesday 26th May, at the Xen Design and Developer Summit 2021
Session Hosts: Christopher Clark & Daniel Smith
tl;dr:
- use cases for Hyperlaunch include supporting bare metal apps
- latency is a critical requirement for workloads
  - determines success/failure of the system
- scheduling is hard; Xen has options, including RTDS
- Zephyr in dom0 being explored in the Arm embedded community
- XSM Roles work is to support flexible deployment structure
- System Device Tree is important for Hyperlaunch to integrate
  - migration from dom0less to be supported
- Lopper tool translates SDT to traditional Device Trees for domains
  - Boot Domain could run Lopper
  - could be done as a unikraft unikernel
- US/EU Supply Chain SBOM need aligns with Hyperlaunch + Trenchboot
- options for funding to accelerate the work:
  - PCI passthrough, Recovery Domain, XSM framework improvements
- Design docs for Hyperlaunch available [patch posted to merge to Xen tree]
  https://lists.xenproject.org/archives/html/xen-devel/2021-05/pdfq6mIMNPNoM.p...
  https://lists.xenproject.org/archives/html/xen-devel/2021-05/pdfQlbS0F4suy.p...
Slides from the Hyperlaunch Keynote:
https://static.sched.com/hosted_files/xen2021/d7/Hyperlaunch%20-%20Keynote_%...
Video:
https://www.youtube.com/watch?v=Xwtq2Q0ylj0&list=PLYyw7IQjL-zGcRPN6EjiTu...

Slides from the XSM Roles presentation:
https://static.sched.com/hosted_files/xen2021/75/Tuesday_A%20new%20Role%20mo...
Video:
https://www.youtube.com/watch?v=j1fDn8ZbyVE&list=PLYyw7IQjL-zGcRPN6EjiTu...
Hyperlaunch at the Xen Project wiki: https://wiki.xenproject.org/wiki/Hyperlaunch
--------------------
Open Discussion:
- floor open for audience requirements, use cases for Hyperlaunch
Stefano:
- use case: fast unikernel boot (on embedded known as "bare metal applications")
  - boot up as quickly as possible
- difference between unikernels and bare metal applications: - a bare metal application is a tiny driver for a hardware block - ie. a hardware block in programmable logic so no existing driver
- a bare metal application: typically just a driver that executes as the "unikernel" - usually only a few of them
- latency is the biggest concern for bare metal apps
- hypervisor scheduling: a concern
  - priority reason: _must_ respond to hardware action in a very limited amount of time
  - ie. Latency more important than anything else
  - losing latency is software failure: disaster happens
- consequence: Adding a scheduler makes it a lot harder
  - not doing any scheduling is typically easier
  - also need to do cache partitioning, and more
- a bare metal app doesn't need any PV drivers since it doesn't communicate with any other software, just the hardware block. - access to mmio + an interrupt or two sufficient
Christopher: - use of unikernels aligns with what is wanted for the boot domain: - ie. use short, single-purpose domains for platform services to avoid turning the boot domain into another dom0 by continuing to add functionality - eg. Qubes OS Mirage firewall VM, or something similar from unikraft
Daniel:
- design: the hypervisor waits for the boot domain to exit, then completes the launch itself
  - enters 'finalization phase', finishes bringing everything up: eg. unpausing other domains not unpaused by boot domain
  - boot domain wiped from memory
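The finalization phase Daniel describes can be pictured with a small model. This is only a sketch in Python, not Xen code: the `Domain` class, its fields, and `finalize_launch` are hypothetical names made up for illustration, and the real design is in the Hyperlaunch design documents.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    # Hypothetical stand-in for a domain record; not a Xen structure.
    name: str
    paused: bool = True
    is_boot_domain: bool = False


def finalize_launch(domains):
    """Model of the finalization phase, run once the boot domain has exited.

    Drops the boot domain (standing in for wiping it from memory) and
    unpauses every remaining domain the boot domain left paused.
    """
    remaining = []
    for dom in domains:
        if dom.is_boot_domain:
            continue  # boot domain is torn down, not kept
        if dom.paused:
            dom.paused = False  # finish bringing the domain up
        remaining.append(dom)
    return remaining


domains = [
    Domain("boot", paused=False, is_boot_domain=True),
    Domain("dom-a", paused=False),  # already unpaused by the boot domain
    Domain("dom-b", paused=True),   # left paused by the boot domain
]
running = finalize_launch(domains)
print([(d.name, d.paused) for d in running])
```

The point of the model is the ordering: nothing here runs until the boot domain has exited, and afterwards no trace of the boot domain remains in the list of live domains.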
--- topic: Scheduling
Christopher: - For small, single-purpose domains: have a need to schedule these
Stefano: - Illustrative example: 2 domains: dom0 Linux, domU bare metal app - no scheduling, to make sure deadlines not broken - made domU pause dom0 during critical execution: - Interesting inversion of priority.
Point is that domU is the most critical thing on the entire system. ie. if domU meets deadlines and dom0 not present, system still functional.
Christopher:
- related: Connor's talk at this Summit re: moving scheduling out of Xen into dom0;
  https://xen2021.sched.com/event/jAEs/the-root-vm-a-new-xen-domain-species-co...
- also the Bromium architecture, and Daniel's HAT architecture
  https://xen2020.sched.com/event/baXt/design-session-talk-reliable-platform-s...
  - has concept of protection domains
- interested in DomU running the fundamental workload but not being Control Domain, doesn't have those permissions
For this use case -- domU pauses dom0 for domU to meet its deadlines -- permission model must have been changed.
Stefano:
- ad hoc provision of two hypercalls so domU could pause/unpause dom0
- not easy to make generic:
  - not just vcpu, must pause _everything_ except self
- 5 lines of code for a hack, 10 months to do it properly, upstream, etc!
- Critical section: an interrupt occurs, must act within a very limited amount of time; else the whole thing fails
- Critical section is way smaller than a slot of the scheduler
- Make sure everything else is paused, to get the full bandwidth of not just the CPU, but also DDR, no interrupts. Don't screw up those 15 microseconds
George: - how long does it take to pause all the other VMs on the system? - eg: a foreach domain, foreach vcpu, and just pause them, but: involves sending interrupts, waiting for the thing to finish, etc
Stefano: - I knew which event started the critical section, so I made that event the trigger for pausing dom0.
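George's "foreach domain, foreach vcpu, and just pause them" loop can be sketched as below. This is a toy Python model under stated assumptions: the dict layout, the state strings, and `pause_everything_except` are hypothetical, and the model ignores exactly the expensive parts George points out (sending interrupts and waiting for each vcpu to actually stop).

```python
def pause_everything_except(domains, caller):
    """Pause every vcpu of every domain other than `caller`.

    `domains` maps a domain name to a list of vcpu state strings.
    Returns how many vcpus had to be paused; in a real hypervisor each
    of these would cost an interrupt plus a wait, which this model omits.
    """
    paused_count = 0
    for name, vcpus in domains.items():
        if name == caller:
            continue  # the critical domain keeps running
        for i, state in enumerate(vcpus):
            if state != "paused":
                vcpus[i] = "paused"
                paused_count += 1
    return paused_count


domains = {
    "dom0": ["running", "running"],
    "domU-rt": ["running"],
    "domU-other": ["blocked"],
}
count = pause_everything_except(domains, "domU-rt")
print(count, domains)
```

The returned count gives a rough feel for why the generic version is slow: the work grows with the total number of vcpus on the system, while the critical section itself is only microseconds long.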
[via chat:] Demi Marie: What if we had a hard real-time scheduler like seL4 does?
[via chat:] Artem: RTDS? NULL?
[via chat:] Andy: yup - those
[via chat:] Julien: https://wiki.xenproject.org/wiki/RTDS-Based-Scheduler
[via chat:] Artem: also ARINC653
[via chat:] Scott: sounds like he wants the hypervisor to disable interrupt virtualization and sit in a tight loop running a single guest on certain cores
[via chat:] Artem: core pooling?
[via chat:] Demi Marie: IIRC seL4 can do this with the mixed-criticality scheduling work
[via chat:] Artem: RTDS can do that but AFAIK it cannot reschedule slack
Daniel:
- you need a scheduler that is aware of these critical interrupts: when one occurs, the owning domain has to have exclusivity over the system, and the scheduler takes care of ensuring it gets that exclusivity.
Stefano: responding to Demi, re: "seL4 can do this with the mixed-criticality scheduling work" - Yes, other domains in the past used this technique
George: you don't actually need to pause the other domains; - you just need to make sure that the other CPUs stop doing stuff.
Stefano: what I did: slept in Xen, not even pause the CPU: busy-looping Xen
George: in a sense is correct; similar to core scheduling, sibling cores switch to not doing anything
Daniel: yes, lots of academic papers on these problems, eg. implemented in seL4 and other kernels.
XSM Roles: was done to help more advanced Hyperlaunch scenarios
- (I don't like this idea but:) you could build a role-based scheduler
Christopher: ARINC653 scheduler mentioned - Artem, have you experience with it?
Artem: no, sticking with RTDS. Also used it with full preemption for Xen. - Really interested in RTDS. - want to explore using slack time for domains with best effort priorities - RTDS seems like the best option for future development.
Our scenarios, on Arm: - distinguish between: hardware-controlling domain, hardware domains, and controlling domain: - using dom0 as a controlling domain, able to recreate domains if needed
- using device tree and don't have ACPI: split hardware between domains - each domain can talk directly to some piece of hardware - ie. they all are, in a sense, hardware domains - each can be independently restarted to deal with faulty hardware drivers - eg. we can restart the GPU from dom0
- dom0 path to safety certifications: working on Zephyr as a dom0 - event channels working - an early draft implementation
- aims:
  - a small RTOS acting as a starter in dom0
  - don't put other domain kernels in dom0
  - instead: a bootloader
    - dom0 starts a domain, gives it a generic bootloader, common for all other domains, and then other domains have their own filesystems
    - guest domains know which kernel to use, so dom0 becomes very small and very generic, and not depending on other domains' kernels, etc.
  - ie. dom0 is purely for control functions
--- topic: how does Hyperlaunch help?
Stefano:
- domU should not be started from dom0
- two domains, no PV drivers at all
- a clear use case for dom0less
- more detailed XSM policies allow dom0 to not be fully privileged
- XSM policy can allow one domU to stop the other domU
--- topic: request to review the design doc
Daniel: - we want to make sure that we're good on this idea of the boot domain - that we understand how these handoffs are going - the roles work, the subtask to get that integrated in so that we can do these disaggregated boots.
New definitions for Roles within the Xen system: - get away from concepts of 'is_control_domain' and 'is_hardware_domain' - talk about what Role we're asking a domain to do and function as - want a common language for roles (eg. avoid (possibly unaware) misconceptions of current differences in views on what a Control Domain is and what a Hardware Domain is)
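One way to picture the move from two booleans to Roles is a set of combinable flags. The following is a minimal Python sketch, not the XSM Roles design: the `Role` members and the `has_role` helper are hypothetical names chosen only because the notes mention Control, Hardware and Boot Domains.

```python
from enum import Flag, auto


class Role(Flag):
    # Hypothetical role names for illustration; see the Roles design doc
    # for the actual definitions.
    NONE = 0
    CONTROL = auto()
    HARDWARE = auto()
    BOOT = auto()


def has_role(assigned, wanted):
    """True if every role in `wanted` is present in `assigned`."""
    return (assigned & wanted) == wanted


# A classic dom0 combines roles that Hyperlaunch lets you split apart.
classic_dom0 = Role.CONTROL | Role.HARDWARE
driver_domain = Role.HARDWARE
plain_guest = Role.NONE

print(has_role(classic_dom0, Role.CONTROL))   # True
print(has_role(driver_domain, Role.CONTROL))  # False
print(has_role(plain_guest, Role.HARDWARE))   # False
```

Expressed this way, a check asks "does this domain hold role X" rather than "is this domain dom0", which is the shift in vocabulary the session is asking reviewers to adopt.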
Review the design doc, give us some feedback; will be adding a design doc for the Roles work as well -- have a draft form of it and just want to flesh it out further, and hopefully we can get all of that adopted.
--- topic: Question from Julien: is the plan to completely remove dom0less or keep the two together?
Christopher: integrate, so no boundary between the two - Everything with dom0less should continue to work
Daniel: yes
- dom0less constructing domains from the hypervisor will continue, become common code, used by both Arm and x86.
- biggest difference: migration from dom0less to hyperlaunch trees; not sure what that migration period will be.
  - much broader Device Tree definition
  - trying to ensure aligned with System Device Tree
  - (dom0less today has own specific Device Tree configuration)
- for some period of time, the parser for the dom0less Device Tree is going to have to coexist with the Hyperlaunch one
--- topic: System Device Tree and Lopper
Stefano: System Device Tree: - very similar to Hyperlaunch and dom0less - defines 'domains': VMs for Xen, or could be bare metal things running on a coprocessor - next few months: finish cleaning up the definition of domains in System Device Tree, and cover VMs properly
Align Hyperlaunch with the System Device Tree domains. - already need migration from dom0less to System Device Tree - don't want to do two migrations
System Device Tree comes with a tool called 'Lopper': [ https://github.com/devicetree-org/lopper ]
Lopper takes a single System Device Tree and generates multiple traditional Device Trees, one for each domain in the System Device Tree. - Device Tree for VMs can be very different from the one on the host - Device Tree for bare metal domains can be much closer
Lopper supports python plugins - eg. a Lopper plugin to convert the System Device Tree format into dom0less format, so works with current Xen parsing - changing the Xen parsing eventually would be better
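The "one System Device Tree in, one traditional Device Tree per domain out" idea can be illustrated with plain Python dicts. This is a toy sketch only: it does not use Lopper's API or the System Device Tree bindings, and the tree layout, the `access` key, the device names, and the addresses are all made up for illustration.

```python
def split_by_domain(system_tree):
    """Produce one per-domain tree from a single system-wide description.

    Each domain receives only the devices listed under its own "access"
    key, so a domain never sees hardware assigned elsewhere.
    """
    devices = system_tree["devices"]
    per_domain = {}
    for dom_name, dom in system_tree["domains"].items():
        per_domain[dom_name] = {
            "memory_mb": dom["memory_mb"],
            "devices": {dev: devices[dev] for dev in dom["access"]},
        }
    return per_domain


# Hypothetical system description: a Linux VM and a bare metal app that
# drives a single block in programmable logic.
system_tree = {
    "devices": {
        "uart0": {"reg": 0xFF000000, "irq": 21},
        "eth0": {"reg": 0xFF0E0000, "irq": 63},
        "pl-block": {"reg": 0x80000000, "irq": 89},
    },
    "domains": {
        "linux": {"memory_mb": 1024, "access": ["uart0", "eth0"]},
        "baremetal": {"memory_mb": 16, "access": ["pl-block"]},
    },
}

trees = split_by_domain(system_tree)
for name, tree in trees.items():
    print(name, sorted(tree["devices"]))
```

The bare metal domain's output contains just its mmio region and interrupt, matching the earlier remark that such an app needs little more than that.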
Daniel: - with Hyperlaunch: could boot with the System Device Tree, and pass it into the Boot Domain, where Lopper runs, and Lopper can generate domain device trees for guest domains to start
Stefano: would be very cool! - System Device Tree (and Lopper, in python) so far always used at build time
Daniel: the unikraft project has its micropython unikernel - for embedding scripts as a unikernel - eg. a unikraft unikernel python domain with Lopper and Boot Domain logic - takes System Device Tree used to construct all the domains, and does the Device Tree generation - from a security standpoint: nice: hypervisor's not generating Device Trees - all at runtime in a clean, safe architecture
Christopher: interesting for CI looping as well
--- topic: Scope, Funding, Alignment of work
Rich: Q: You said that you were managing the scope, because it could become quite big: Could you talk about: - Some of the things that you have left out of scope? - Areas where funding would help? - Areas where other contributors would help? - How Trenchboot is connected to this or just launch integrity in general?
"In both the US and the EU, there is a top-down effort for supply chain security, powered by ransomware and bitcoin, so " [ money is available ] " to get more integrity in the software stack, and they're pushing 'Secure Bill Of Materials (SBOM)', which we saw at the Yocto event. So if you have a Secure Bill Of Materials, and your Hyperlaunch system with Trenchboot can prove that the thing running matches the manifests, people might want to pay money for that, and help drive your roadmap."
Daniel: Yes.
Trenchboot: Correct, the whole idea of this spawned out of the same thoughts that created Trenchboot [ https://trenchboot.org/ ]
- proposed back in May/June 2018, driven by: how do we want to use Trenchboot in a Xen launch system where we had the security properties that we're seeking, but without blowing up, in terms of size and code and responsibility, into the hypervisor - ref: talk at Trenchboot Developer Forum - [ https://www.youtube.com/watch?v=qWMRcfQdc6c ]
- standard pattern followed with Trenchboot: launch into a kernel that then launches into an integrity measurement system, a security engine
  - ie. for Xen: we do a DRTM launch into Xen, which starts a Boot Domain: our security engine, running in a restricted, protected environment, that takes measurements of the system and provides attestable evidence of what's in your system, to the degree that's possible.
- at the same time, not everybody wanted to have a capability specifically focussed on that, so there had already been discussions about a bootstrap domain, that we linked to when Daniel De Graaf did the original Hardware Domain
  - he posted an example Boot Domain capability,
- so building all of this as the foundation
Rich: are there things you have wanted to do but have postponed or are there tasks where you need external people to help, or external funding sources that would allow those features to be addressed?
Christopher: - PCI passthrough is the big one - really important - highly complementary to Hyperlaunch to passthrough PCI devices right from start of all of initial VMs - but complex
- the Recovery Domain - mentioned in the Design Document - ability to have a VM built, configured, and when failure is detected during host boot - eg. malfunction of a critical VM - can put rescue logic in there to enable recovery
Daniel: - For the Roles work, done the minimum Hyperlaunch needed - but could definitely go much further - get the XSM Framework cleaned up - get Flask in much better position - more advanced Roles - reevaluating all the XSM hooks in terms of Roles and everything - getting all of the security framework in a better state.
--------------------
The recording for this Design Session is available at:
https://www.youtube.com/watch?v=j75orDMXO2M&list=PLYyw7IQjL-zGcRPN6EjiTu...