X86 Lightweight Protection Domain Support for Contiki
=====================================================

Introduction
------------

The X86 port of Contiki implements a simple, lightweight form of
protection domains using a pluggable framework. Currently, there are
three plugins available:

- Flat memory model with paging.
- Multi-segment memory model with either hardware- or
  software-switched segments. The hardware-switched segments
  approach is based on Task-State Segment (TSS) structures.

For an introduction to paging and TSS and possible ways in which they
can be used, refer to the following resources:

- Intel Combined Manual (Intel 64 and IA-32 Architectures Software
  Developer's Manual), Vol. 3, Chapter 4
- Programming the 80386, by John H. Crawford and Patrick
  P. Gelsinger, Chapter 5

The overall goal of a protection domain implementation within this
framework is to define a set of resources that should be accessible to
each protection domain and to prevent that protection domain from
accessing other resources. The details of each implementation of
protection domains may differ substantially, but they should all be
guided by the principle of least privilege [1]. However, that
idealized principle is balanced against the practical objectives of
limiting the number of relatively time-consuming context switches and
minimizing changes to existing code. In fact, no changes were made to
code outside of the CPU- and platform-specific code directories for
the initial plugins.

Each protection domain can optionally be associated with a metadata
and/or MMIO region. The hardware can support additional regions per
protection domain, but that would increase complexity and is unneeded
for the existing protection domains.

After boot, all code runs in the context of some protection domain.
Two default protection domains are implemented:

- kern: Kernel protection domain that is more privileged than any
  other protection domain. As little code as possible should be
  placed in this protection domain.
- app: Application protection domain used whenever special privileges
  are not required.

Additional protection domains are defined as needed. For example,
each driver may reside in a separate protection domain, although not
all drivers require additional privileges beyond those available in
the relevant scheduling context in the app protection domain. The
Ethernet and UART drivers are assigned separate protection domains.
Non-driver protection domains can also be defined. Other drivers only
require access to programmed IO ports accessible via the IN* and OUT*
instructions, and such drivers do not require separate protection
domains. They run in the Contiki preemptive scheduling context and
the kernel protection domain, both of which are granted access to all
IO ports.

Each protection domain may have associated system calls. A system
call transfers control from a client protection domain to a defined
entrypoint in a server protection domain. As their name suggests,
system calls adhere to a synchronous call-return model (rather than
some alternative such as an asynchronous message-passing model). To
invoke a system call, the client provides two identifiers to the
system call dispatcher. The first identifies the server domain and
the second identifies the system call to be invoked. The protection
domain implementation should associate allowable system calls with
particular server protection domains and reject any system call
requests that are not within that set of allowable system calls. The
system call implementations do not restrict the clients that are
permitted to invoke each system call. No modifications that the
client can make to the server domain and system call identifiers can
open up new entrypoints into the server domain. The entrypoints are
fixed at boot time.

However, if the identifiers were stored in shared memory, it may be
possible for a protection domain to influence the system calls issued
by some other protection domain, which may be undesirable. Thus, the
server domain identifiers are stored in memory that can only be
written by the kernel protection domain, and the system call
identifiers are embedded in the code.

The system call dispatcher is responsible for reconfiguring the system
to enforce the appropriate resource access controls for the server
protection domain. It should then transfer control to the approved
entrypoint for the requested system call.

Contiki defines a process concept that is orthogonal to protection
domains [2]. A single Contiki process may run code in multiple
protection domains at various points in time. Contiki processes run
in a cooperative scheduling context. Contiki also defines a
preemptive scheduling context for interrupt handlers and real-time
timers. When protection domain support is enabled, interrupts are
only enabled when the application protection domain is active and is
running code in the cooperative scheduling context. Code running in
the preemptive context may also invoke multiple protection domains.
Contiki can also support preemptive multithreading, but support for
that has not yet been added to the X86 port, so we do not discuss it
further.

A single stack is shared by all code that runs in the cooperative
scheduling context in all protection domains, and separate stacks are
defined for short interrupt dispatchers in the preemptive scheduling
context and for exception handlers and software system call
dispatchers. Except for the interrupt dispatchers, code in the
preemptive scheduling context also shares the main stack with the
cooperative scheduling context. All protection domains also share a
main data section, so similar considerations apply to that section.

Introducing multi-core support would complicate things further, since
another core running a protection domain that the first core never
invoked could access data from the protection domain on the first
core. It may be possible to adequately address such concerns by
allocating per-core stacks.

Note that this stack arrangement means that a given protection domain
may read and write data written to the stack by some other protection
domain. For example, a protection domain B may push data onto the
stack and later pop that data off of the stack, but a protection
domain A that invoked protection domain B may still be able to read
the data that was pushed and popped to and from the stack, since
popping the data off of the stack does not automatically erase that
stack memory location. Another possibility is that protection domain
B may modify a stack entry pushed by protection domain A before it
invoked protection domain B, and protection domain A may later use the
modified value. Permitting legitimate accesses to callers' stacks is
in fact the primary motivation for this stack arrangement, in that it
makes it simple for A to pass data to and from B (on the shared stack)
when requesting services from B. A system call invocation is nearly
transparent to the developer, appearing almost identical to an
ordinary function call. However, B can access any data on the stack.
The third case is that A can read data placed on the stack by B after
B returns, unless B wipes that data from the stack before returning.
A related sub-case is that if an interrupt handler is invoked, it
pushes the current contents of the general-purpose registers onto the
stack, which may then be revealed to other protection domains besides
the one that was interrupted. However, interrupts are only actually
enabled in the application protection domain.

Similarly, register contents may be accessed and modified across
protection domain boundaries in some protection domain
implementations. The TSS task switching mechanism automatically saves
and restores many registers to and from TSS data structures when
switching tasks, but the other protection domain implementations do
not perform analogous operations.

For the reasons described above, each protection domain should only
invoke other protection domains that it trusts to properly handle data
on the stack.

Design
------

### Boot Process

The system boots in the following phases.

#### UEFI Bootstrap

Primary implementation sources:

- cpu/x86/uefi/bootstrap_uefi.c

When the OS is compiled as a UEFI binary, a short bootstrap phase that
is UEFI-compliant is run initially. It simply performs a minimal set
of functions to exit the UEFI boot services and then transfer control
to the Multiboot bootstrap phase.

#### Multiboot Bootstrap

Primary implementation sources:

- cpu/x86/bootstrap_quarkX1000.S

This phase disables interrupts, sets the stack pointer to the top of
the main stack, and then invokes boot stage 0.

#### Boot Stage 0

Primary implementation sources:

- cpu/x86/init/common/cpu.c
- cpu/x86/init/common/gdt.c

The UEFI firmware or Multiboot-compliant bootloader should have
configured an initial Global Descriptor Table (GDT) with flat segments
and configured the CPU to operate in protected mode with paging
disabled. Flat segments each map the whole 4GiB physical memory
space. This is the state of the system when the OS enters boot stage
0. This stage is responsible for setting up a new GDT and loading the
segment registers with the appropriate descriptors from the new GDT to
enable boot stage 1 to run. Code in stage 1 for multi-segment
protection domain implementations requires that the appropriate
segment-based address translations be configured.
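
As a concrete illustration of the flat segments described above, the
following is a minimal sketch of how a GDT entry encoding a flat 4GiB
segment can be constructed. The struct and helper names are
hypothetical; the actual descriptor handling lives in
cpu/x86/init/common/gdt.c.

```c
#include <stdint.h>

/* 8-byte x86 segment descriptor, fields in hardware-defined order. */
typedef struct {
  uint16_t limit_lo;   /* limit[15:0] */
  uint16_t base_lo;    /* base[15:0] */
  uint8_t  base_mid;   /* base[23:16] */
  uint8_t  access;     /* P, DPL, S, type */
  uint8_t  gran_limit; /* G, D/B flags and limit[19:16] */
  uint8_t  base_hi;    /* base[31:24] */
} __attribute__((packed)) segment_desc_t;

static segment_desc_t
make_desc(uint32_t base, uint32_t limit, uint8_t access)
{
  segment_desc_t d;
  d.limit_lo   = limit & 0xFFFF;
  d.base_lo    = base & 0xFFFF;
  d.base_mid   = (base >> 16) & 0xFF;
  d.access     = access;                   /* e.g. 0x9A code, 0x92 data */
  d.gran_limit = 0xC0 | ((limit >> 16) & 0xF); /* G=1 (4KiB units), D=1 */
  d.base_hi    = (base >> 24) & 0xFF;
  return d;
}

/* Flat ring-0 code and data segments covering the whole 4GiB space:
 * make_desc(0, 0xFFFFF, 0x9A) and make_desc(0, 0xFFFFF, 0x92). */
```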

#### Boot Stage 1

Primary implementation sources:

- cpu/x86/init/common/cpu.c
- cpu/x86/init/common/idt.c
- cpu/x86/mm/prot-domains.c

Boot stage 1 initializes the Interrupt Descriptor Table (IDT) and
installs a handler for double-fault exceptions. Handlers for
additional interrupts and exceptions are installed later in boot
stages 1 and 2.

This stage also initializes protection domain support and enters the
kernel protection domain.

#### Boot Stage 2

Primary implementation sources:

- cpu/x86/init/common/cpu.c
- platform/galileo/contiki-main.c

The entrypoint for the kernel protection domain is 'main'. Boot stage
2 initializes hardware devices and associated interrupts. It then
transfers control to the application protection domain. Note that
this is a transfer of control, not a call that would be matched with
some future return. This is an important distinction, because
protection domains are not reentrant. Thus, if the kernel protection
domain called the application protection domain, it would not be
possible to invoke any kernel system calls until the system is reset,
since the application protection domain never exits/returns while the
system is running. There are not actually any kernel system calls
provided in the initial implementation of protection domains, but they
may be added in the future.

The core protection domain configuration (e.g. allowable system calls
and entrypoints, registered protection domains, etc.) is frozen by the
conclusion of boot stage 2 to help prevent erroneous changes that
could reduce the robustness of the system. It is frozen in the sense
that no kernel system calls are provided that would permit changes to
the core protection domain configuration. Thus, once the kernel
protection domain has exited, the only way the core protection domain
configuration can change is through undesirable memory manipulations
(e.g. due to a faulty device driver).

#### Boot Stage 3

Primary implementation sources:

- platform/galileo/contiki-main.c

Boot stage 3 performs initialization procedures that are less
tightly-coupled to hardware. For example, it launches Contiki
processes and invokes Contiki configuration routines.

### Privilege Levels

When protection domain support is inactive, all code runs at
ring/privilege level 0. When protection domain support is active,
only exception handlers and system call dispatchers (including
dispatchers for system call returns) run at ring level 0. Code in the
preemptive scheduling context runs at ring level 2 and code in the
cooperative scheduling context runs at ring level 3. Ring levels with
higher numbers are less privileged than those with lower numbers.
Ring level 1 is unused.

### IO and Interrupt Privileges

The kernel protection domain cooperative scheduling context needs
access to IO ports for device initialization. Some other protection
domains also require such access. The IO Privilege Level (IOPL) that
is assigned to a protection domain using the relevant bits in the
EFLAGS field could be set according to whether IO port access is
required in that protection domain. This is straightforward for TSS,
which includes separate flags settings for each protection domain.
However, it would introduce additional complexity and overhead in the
critical system call and return dispatchers for the other plugins.
Instead, the IOPL is always set to block IO access from the
cooperative scheduling context. Port IO instructions in that context
will then generate general protection faults, and the exception
handler decodes and emulates authorized port IO instructions.
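
A minimal sketch of that decode-and-emulate path follows, handling
just the single-byte IN AL,DX and OUT DX,AL opcodes. The frame layout
and authorization check are hypothetical; the real dispatching code is
in cpu/x86/init/common/interrupt.h and related sources.

```c
#include <stdint.h>

/* Hypothetical saved-register frame provided by the exception
 * dispatcher; the real layout differs. */
typedef struct {
  uint32_t eax, edx;
  uint32_t eip; /* address of the faulting instruction */
} exc_frame_t;

static void
gp_fault_emulate_port_io(exc_frame_t *frame)
{
  uint8_t opcode = *(uint8_t *)(uintptr_t)frame->eip;
  uint16_t port = (uint16_t)frame->edx;

  /* A real handler must first verify that the faulting protection
   * domain is authorized to access this port. */
  switch(opcode) {
  case 0xEC: { /* IN AL, DX */
    uint8_t val;
    __asm__ __volatile__("inb %1, %0" : "=a"(val) : "d"(port));
    frame->eax = (frame->eax & ~0xFFu) | val;
    frame->eip += 1; /* skip the emulated instruction */
    break;
  }
  case 0xEE: /* OUT DX, AL */
    __asm__ __volatile__("outb %0, %1"
                         :: "a"((uint8_t)frame->eax), "d"(port));
    frame->eip += 1;
    break;
  default:
    /* Unauthorized or unsupported instruction: halt the system. */
    break;
  }
}
```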

Interrupts are handled at ring level 2, since they do not use any
privileged instructions. They do cause the interrupt flag to be
cleared as they are delivered. The interrupt flag can only be
modified by instructions executing at a ring level that is numerically
less than or equal to the IOPL. Each interrupt handler needs to set
the interrupt flag using the IRET instruction when it returns.
Protection domains that require access to port IO (currently just the
kernel protection domain) are configured with an IOPL of 3, whereas
others are configured with an IOPL of 2. That is why interrupts are
configured to run at ring level 2. Interrupts are only enabled in the
application protection domain.

Some interrupt handlers require access to port IO, and all are
permitted such access, since they need it anyway for restoring the
interrupt flag when returning. IO port access is a very powerful
privilege, since it can be used to remap MMIO regions of PCI devices,
reconfigure PCI devices, etc. Thus, further restricting access to IO
ports may improve the robustness of the system, but it would increase
complexity and space requirements and possibly necessitate additional
context switches, since IO port access is controlled by the combined
settings of the IOPL as well as an optional IO bitmap in the TSS.

### Interrupt and Exception Dispatching

Primary implementation sources:

- cpu/x86/init/common/interrupt.h

Separate stacks are allocated for dispatching interrupts and
exceptions. However, to save space, the main bodies of some interrupt
and exception handlers are run on the main stack. A handler may
expect to have access to data from the interrupt or exception stack,
so the interrupt or exception dispatcher copies that data prior to
pivoting to the main stack and executing the handler.

### Protection Domain Control Structures (PDCSes)

Each protection domain is managed by the kernel and privileged
functions using a PDCS. The structure of the PDCS is partially
hardware-imposed in the cases of the two segment-based plugins, since
the PDCS contains the Local Descriptor Table (LDT) and the TSS, if
applicable. In the paging plugin, the PDCS structure is entirely
software-defined. None of the initial protection domain plugins
support re-entrant protection domains, due both to hardware-imposed
limitations of TSS and to simplify the implementation of the other
plugins by enabling domain-specific information (e.g. the system call
return address) to be trivially stored in each PDCS.
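
To make that description concrete, the following is a minimal sketch
of what a software-defined PDCS might contain for the paging plugin.
The type and field names are hypothetical; the actual definitions live
in cpu/x86/mm/prot-domains.h and the per-plugin headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software-defined PDCS for the paging plugin: each
 * domain records its optional MMIO and metadata regions plus the
 * bookkeeping the dispatchers need. */
typedef struct paging_pdcs {
  uintptr_t mmio_phys;        /* physical base of optional MMIO region */
  size_t    mmio_sz;
  uintptr_t meta_phys;        /* physical base of optional metadata region */
  size_t    meta_sz;
  uintptr_t syscall_ret_addr; /* caller return address saved at dispatch */
  uint32_t  flags;            /* e.g. a "busy" flag, since protection
                               * domains are not re-entrant */
} paging_pdcs_t;
```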

### Paging-Based Protection Domains

Primary implementation sources:

- cpu/x86/mm/paging-prot-domains.c
- cpu/x86/mm/syscalls-int.c
- cpu/x86/mm/syscalls-int-asm.S

#### Introduction

Only a single page table is used for all protection domains. A flat
memory model is used. Almost all linear-to-physical address mappings
are identity mappings, with the exceptions being the MMIO and metadata
regions. The X86 port of Contiki currently supports at most one MMIO
and one metadata range per driver, and the paging-based protection
domain implementation always starts at particular linear addresses
when mapping an MMIO or metadata range. This may reduce overhead, due
to the way protection domain switches are implemented.

#### System Call and Return Dispatching

The system call dispatcher executes at ring level 0, since it uses the
privileged INVLPG or MOV CR3 instructions to invalidate TLB entries.
The dispatcher modifies page table entries to grant only the
permissions required by the protection domain being activated. It
then optionally uses the INVLPG instruction to invalidate the TLB
entries for the page table entries that were modified. If INVLPG is
not used to invalidate specific TLB entries, then CR3 is reloaded to
invalidate the entire TLB (global entries would be excluded, but they
are not used in this implementation).
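
The two invalidation strategies can be sketched as follows. This is a
minimal illustration in GCC inline assembly, not the actual code from
syscalls-int-asm.S.

```c
#include <stdint.h>

/* Selective invalidation: flush only the TLB entry covering one
 * linear address (the X86_CONF_USE_INVLPG option selects this). */
static inline void
tlb_invlpg(uintptr_t lin_addr)
{
  __asm__ __volatile__("invlpg (%0)" :: "r"(lin_addr) : "memory");
}

/* Full invalidation: reloading CR3 flushes all non-global TLB
 * entries, and this implementation uses no global entries. */
static inline void
tlb_flush_all(void)
{
  uintptr_t cr3;
  __asm__ __volatile__("mov %%cr3, %0\n\t"
                       "mov %0, %%cr3"
                       : "=r"(cr3) :: "memory");
}
```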

It is more efficient to always start at a particular linear address
when mapping an MMIO or metadata region, since the page table entries
for that region can be updated to unmap any previous region of that
type and map the new region, and then invalidated to cause the new
settings to take effect. The alternative, using an identity
linear-to-physical address mapping for these regions, would be to
unmap the previous region by editing one set of page table entries, to
map the new region by editing a different set of page table entries,
and to finally perform invalidations for both sets of page table
entries. Another drawback of such an identity address mapping is that
additional page tables may need to be allocated to represent the
various MMIO regions, since page tables are indexed by linear address
and MMIO regions are often at high physical addresses. Note that this
is specific to MMIO regions, since metadata regions are not at
particularly high physical addresses. Additionally, if different base
linear addresses are used, it is necessary to communicate those to the
system call handler code so that the regions can be accessed. This
would require care to prevent an adversary from manipulating the
addresses, and it may increase complexity.

The overall process of handling a system call can be illustrated at a
high level as follows. Some minor steps are omitted in the interest
of clarity and brevity.

```
== BEGIN Client protection domain ==========================================
-- BEGIN Caller ------------------------------------------------------------
 1. Call system call stub.
--
20. Continue execution...
-- END Caller --------------------------------------------------------------
-- BEGIN System call stub --------------------------------------------------
 2. Already in desired (server) protection domain?
    - No: Issue software interrupt #100 to request system call.
    - Yes: Jump to system call body.
-- END System call stub ----------------------------------------------------
== END Client protection domain ============================================
== BEGIN Ring level 0 ======================================================
-- BEGIN System call dispatcher ---------------------------------------------
 3. Check that the requested system call is allowed. Get entrypoint.
 4. Check that the server protection domain is available (not yet present
    in the protection domain call stack) and then mark it as busy.
 5. Save the caller return address from the main stack into the client
    PDCS.
 6. Overwrite the caller return address on the main stack to point to the
    system call return stub.
 7. Push server protection domain onto protection domain call stack.
 8. Update the interrupt return stack EIP to start of system call body.
 9. Update and invalidate page table entries to grant only the permissions
    required by the server protection domain.
10. Update interrupt flag to disable interrupts, since interrupts are only
    enabled in app protection domain, which exports no system calls.
11. Perform interrupt return (IRET).
-- END System call dispatcher ----------------------------------------------
-- BEGIN System call return dispatcher -------------------------------------
15. Mark protection domain on top of protection domain call stack as
    available.
16. Retrieve the caller return address from the kernel data structure for
    the client protection domain and use it to overwrite the EIP in the
    interrupt return stack.
17. Update and invalidate page table entries to grant only the permissions
    required by the client protection domain.
18. Update interrupt flag to only enable interrupts if returning to app
    protection domain cooperative scheduling context.
19. Perform interrupt return (IRET).
-- END System call return dispatcher ---------------------------------------
== END Ring level 0 ========================================================
== BEGIN Server protection domain ==========================================
-- BEGIN System call body --------------------------------------------------
12. Execute the work for the requested system call.
13. Return (to system call return stub, unless invoked from server
    protection domain, in which case return is to caller).
-- END System call body ----------------------------------------------------
-- BEGIN System call return stub -------------------------------------------
14. Issue software interrupt #101 to request system call return.
-- END System call return stub ---------------------------------------------
== END Server protection domain ============================================
```

The first step in performing a system call is to invoke a system call
stub that actually issues the software interrupt to request a system
call dispatch. This approach reduces disruption to existing code,
since macros are used to generate separate stubs and corresponding
system call bodies with a single system call signature definition.
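
The following is a minimal sketch of what such a generated stub might
expand to. The register choices, identifier value, and symbol names
are illustrative only; the real macros are defined in
cpu/x86/mm/syscalls-int.c and cpu/x86/mm/syscalls.h.

```c
#include <stdint.h>

/* The stub loads the server domain identifier (stored in
 * kernel-writable memory) and the system call identifier (embedded
 * in the code) and issues software interrupt #100.  Note that the
 * dispatcher redirects execution to the system call body and rewrites
 * the caller's return address to point at the system call return
 * stub, which issues interrupt #101. */
typedef uint32_t dom_id_t;
extern dom_id_t my_drv_dom_id;   /* hypothetical server domain ID */
#define MY_DRV_SEND_SYSCALL_ID 7 /* hypothetical syscall ID */

void
my_drv_send_stub(void)
{
  /* "Already in the server protection domain?" fast path omitted. */
  __asm__ __volatile__("int $100"
                       :: "a"(my_drv_dom_id),
                          "d"(MY_DRV_SEND_SYSCALL_ID)
                       : "memory");
}
```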

#### Memory Layout

The approximate memory layout of the system is depicted below,
starting with the highest physical addresses and proceeding to lower
physical addresses. Optional permissions are denoted with
parentheses. See cpu/x86/quarkX1000_paging.ld for details of how this
memory layout is implemented.

```
                                                | Kernel | App    | Other  |
 ...                                            +--------+--------+--------+
+------------------------------------------+   |        |        |        |
| Domain X MMIO                            |   |        |        | (RW)   |
+------------------------------------------+   |        |        |        |
 ...                                            |        |        |        |
+------------------------------------------+   |        |        |        |
| Domain X DMA-accessible metadata         |   |        |        | (RW)   |
| (section .dma_bss)                       |   |        |        |        |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Domain X metadata (section .meta_bss)    |   |        |        | (RW)   |
+------------------------------------------+   |        |        |        |
 ...                                            |        |        |        |
+------------------------------------------+   |        |        |        |
| Kernel-private data                      |   | RW     |        |        |
| (sections .prot_dom_bss, .gdt_bss, etc.) |   |        |        |        |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| System call data (section .syscall_bss)  |   | RW     | R      | R      |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Kernel-owned data (section .kern_bss)    |   | RW     | R      | R      |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Page-aligned, Kernel-owned data          |   | RW     | R      | R      |
| (section .page_aligned_kern_bss)         |   |        |        |        |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Common data                              |   | RW     | RW     | RW     |
| (sections .data, .rodata*, .bss, etc.)   |   |        |        |        |
+------------------------------------------+   |        |        |        |
       (not-present guard band page)            |        |        |        |
+------------------------------------------+   |        |        |        |
| Exception stack                          |   | RW     | RW     | RW     |
| (section .exc_stack)                     |   |        |        |        |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Interrupt stack                          |   | RW     | RW     | RW     |
| (section .int_stack)                     |   |        |        |        |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Main stack (section .main_stack)         |   | RW     | RW     | RW     |
+------------------------------------------+   |        |        |        |
       (not-present guard band page)            |        |        |        |
+------------------------------------------+   |        |        |        |
| Main code (.text)                        |   | RX     | RX     | RX     |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Bootstrap code (section .boot_text)      |   |        |        |        |
+------------------------------------------+   |        |        |        |
+------------------------------------------+   |        |        |        |
| Multiboot header                         |   |        |        |        |
+------------------------------------------+   |        |        |        |
 ...
```

The only protection domain that is permitted to access kernel-owned
data is the kernel protection domain. Some devices can also be
instructed to perform DMA to kernel-owned data, although that would be
an incorrect configuration.

Paging only differentiates between memory accesses from ring 3 (user
level) and those from rings 0-2 (supervisor level). To avoid granting
code running in the preemptive scheduling context supervisory write
access to kernel data structures (including the page tables), those
structures are marked read-only (except when the kernel protection
domain is active), and the Write Protect (WP) bit in Control Register
0 (CR0) is cleared only when it is necessary to update a
write-protected structure. Only ring 0 is allowed to modify CR0.
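
A minimal sketch of that CR0.WP toggle follows, assuming execution at
ring 0; the helper names are hypothetical.

```c
#include <stdint.h>

#define CR0_WP (1u << 16) /* Write Protect: honor R/O pages in rings 0-2 */

static inline uint32_t
read_cr0(void)
{
  uint32_t cr0;
  __asm__ __volatile__("mov %%cr0, %0" : "=r"(cr0));
  return cr0;
}

static inline void
write_cr0(uint32_t cr0)
{
  __asm__ __volatile__("mov %0, %%cr0" :: "r"(cr0) : "memory");
}

/* Temporarily lift write protection around an update to a
 * write-protected kernel structure: */
static void
kern_struct_update(uint32_t *field, uint32_t value)
{
  write_cr0(read_cr0() & ~CR0_WP);
  *field = value;
  write_cr0(read_cr0() | CR0_WP);
}
```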

Optional metadata for each protection domain is intended to only be
accessible from the associated protection domain and devices.

Read accesses to executable code have not been observed to be needed
in at least a limited set of tests, but they are permitted, since
paging does not support an execute-only permission setting. On the
other hand, the Execute-Disable feature is used to prevent execution
of non-code memory regions. All non-startup code is mapped in all
protection domains. Limiting the code that is executable within each
protection domain to just the code that is actually needed within that
protection domain could improve the robustness of the system, but it
is challenging to determine all code that may be needed in a given
protection domain (e.g. all needed library routines).

Stack accesses to non-stack memory are not needed, but they are
permitted. However, one page of unmapped linear address space is
placed above and below the stacks to detect erroneous stack accesses
to those linear address regions, which are the types of accesses most
likely to occur during a stack overflow or underflow condition. The
main stack is placed just below the interrupt stack, which is just
below the exception stack. Stack overflows are more common than stack
underflows, which motivates arranging the stacks such that an overflow
from a less-critical stack will not affect a more-critical stack.
Furthermore, the main stack is the most likely to overflow, since the
code that uses it is typically the most voluminous and difficult to
characterize. That provides additional motivation for positioning it
such that an overflow results in an immediate page fault. An
alternative design placing each stack on a separate group of
contiguous pages may improve the robustness of the system by
permitting the insertion of unmapped guard pages around them to
generate page faults in the event an overflow or underflow occurs on
any stack. However, that would consume additional memory.

Data in the .rodata sections is marked read/write, even though it may
be possible to improve the robustness of the system by marking that
data as read-only. Doing so would introduce additional complexity
into the system.

### Hardware-Switched Segment-Based Protection Domains

Primary implementation sources:

- cpu/x86/mm/tss-prot-domains.c
- cpu/x86/mm/tss-prot-domains-asm.S

#### Introduction

One TSS is allocated for each protection domain. Each one is
associated with its own dedicated LDT. The memory resources assigned
to each protection domain are represented as segment descriptors in
the LDT for the protection domain. Additional shared memory resources
are represented as segment descriptors in the GDT.

#### System Call and Return Dispatching

The system call dispatcher runs in the context of the server
protection domain. It is a common piece of code that is shared among
all protection domains. Thus, each TSS, except the application TSS,
has its EIP field initialized to the entrypoint for the system call
dispatcher so that will be the first code to run when the first switch
to that task is performed.

The overall process of handling a system call can be illustrated at a
high level as follows. Some minor steps are omitted from this
illustration in the interest of clarity and brevity.

```
== BEGIN Client protection domain ==========================================
-- BEGIN Caller ------------------------------------------------------------
 1. Call system call stub.
--
13. Continue execution...
-- END Caller --------------------------------------------------------------
-- BEGIN System call stub --------------------------------------------------
 2. Already in desired (server) protection domain?
    - No: Request task switch to server protection domain.
    - Yes: Jump to system call body.
--
12. Return to caller.
-- END System call stub ----------------------------------------------------
== END Client protection domain ============================================
== BEGIN Server protection domain ==========================================
-- BEGIN System call dispatcher ---------------------------------------------
 3. Check that the requested system call is allowed. Get entrypoint.
 4. Switch to the main stack.
 5. Pop the client return address off the stack to a callee-saved register.
 6. Push the address of the system call return dispatcher onto the stack.
 7. Jump to system call body.
--
10. Restore the client return address to the stack.
11. Request task switch to client protection domain.
-- END System call dispatcher ----------------------------------------------
-- BEGIN System call body --------------------------------------------------
 8. Execute the work for the requested system call.
 9. Return (to system call return stub, unless invoked from server
    protection domain, in which case return is to caller).
-- END System call body ----------------------------------------------------
== END Server protection domain ============================================
```

An additional exception handler is needed, for the "Device Not
Available" exception. The handler comprises just a CLTS and an IRET
instruction. The CLTS instruction is privileged, which is why it must
be run at ring level 0. This exception handler is invoked when a
floating point instruction is used following a task switch, and its
sole purpose is to enable the floating point instruction to execute
after the exception handler returns. See the TSS resources listed
above for more details regarding interactions between task switching
and floating point instructions.
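
A minimal sketch of that handler follows, written as top-level inline
assembly in a C file; the actual handler is implemented in
cpu/x86/mm/tss-prot-domains-asm.S, and the symbol name here is
hypothetical.

```c
/* "Device Not Available" (#NM) handler: clear CR0.TS so the faulting
 * floating point instruction can execute when IRET re-runs it. */
__asm__(
  ".global dev_not_avail_isr\n"
  "dev_not_avail_isr:\n\t"
  "clts\n\t"
  "iret\n"
);
```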

Each segment register may represent a different data region within
each protection domain, although the FS register is used for two
separate purposes at different times. The segments are defined as
follows:

- CS (code segment) maps all non-startup code with execute-only
  permissions in all protection domains. Limiting the code that is
  executable within each protection domain to just the code that is
  actually needed within that protection domain could improve the
  robustness of the system, but it is challenging to determine all
  code that may be needed in a given protection domain (e.g. all
  needed library routines). Furthermore, that code may not all be
  contiguous, and each segment descriptor can only map a contiguous
  memory region. Finally, segment-based memory addressing is
  relative to an offset of zero from the beginning of each segment,
  introducing additional complexity if such fine-grained memory
  management were to be used.
- DS (default data segment) typically maps the main stack and all
  non-stack data memory that is accessible from all protection
  domains. Limiting the data that is accessible via DS within each
  protection domain to just the subset of the data that is actually
  needed within that protection domain could improve the robustness
  of the system, but it is challenging for similar reasons to those
  that apply to CS. Access to the main stack via DS is supported so
  that code that copies the stack pointer to a register and attempts
  to access stack entries via DS works correctly. Disallowing access
  to the main stack via DS could improve the robustness of the
  system, but that may require modifying code that expects to be able
  to access the stack via DS.
- ES is loaded with the same segment descriptor as DS so that string
  operations (e.g. the MOVS instruction) work correctly.
- FS usually maps the kernel-owned data region. That region can only
  be written via FS in the kernel protection domain. FS contains a
  descriptor specifying a read-only mapping in all other protection
  domains except the application protection domain, in which FS is
  nullified. Requiring that code specifically request access to the
  kernel-owned data region by using the FS segment may improve the
  robustness of the system by blocking undesired accesses to the
  kernel-owned data region via memory access instructions within the
  kernel protection domain that implicitly access DS. The reason for
  granting read-only access to the kernel-owned data region from most
  protection domains is that the system call dispatcher runs in the
  context of the server protection domain to minimize overhead, and
  it requires access to the kernel-owned data region. It may improve
  the robustness of the system to avoid this by running the system
  call dispatcher in a more-privileged ring level (e.g. ring 1)
  within the protection domain and just granting access to the
  kernel-owned data region from that ring. However, that would
  necessitate a ring level transition to ring 3 when dispatching the
  system call, which would increase overhead. The application
  protection domain does not export any system calls, so it does not
  require access to the kernel-owned data region.
- FS is temporarily loaded with a segment descriptor that maps just
  an MMIO region used by a driver protection domain when such a
  driver needs to perform MMIO accesses (see the sketch after this
  list).
- GS maps an optional region of readable and writable metadata that
  can be associated with a protection domain. In protection domains
  that are not associated with metadata, GS is nullified.
- SS usually maps just the main stack. This may improve the
  robustness of the system by enabling immediate detection of stack
  underflows and overflows rather than allowing such a condition to
  result in silent data corruption. Interrupt handlers use a stack
  segment that covers the main stack and also includes a region above
  the main stack that is specifically for use by interrupt handlers.
  In like manner, exception handlers use a stack segment that covers
  both of the other stacks and includes an additional region. This
  is to support the interrupt dispatchers that copy parameters from
  the interrupt-specific stack region to the main stack prior to
  pivoting to the main stack to execute an interrupt handler body.
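
To illustrate the FS-based MMIO access pattern from the list above,
here is a minimal sketch using GCC inline assembly with an explicit
segment override. The accessor names are hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit device register at the given offset within the MMIO
 * region currently mapped by FS.  Because segment-based addressing
 * starts at offset zero from the segment base, 'offset' is relative
 * to the start of the driver's MMIO region. */
static inline uint32_t
mmio_read32(uintptr_t offset)
{
  uint32_t val;
  __asm__ __volatile__("movl %%fs:(%1), %0" : "=r"(val) : "r"(offset));
  return val;
}

static inline void
mmio_write32(uintptr_t offset, uint32_t val)
{
  __asm__ __volatile__("movl %0, %%fs:(%1)"
                       :: "r"(val), "r"(offset) : "memory");
}
```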

The approximate memory layout of the system is depicted below,
starting with the highest physical addresses and proceeding to lower
physical addresses. The memory ranges that are mapped at various
times by each of the segment registers are also depicted. Read the
descriptions of each segment above for more information about what
memory range may be mapped by each segment register at various times
with various protection domain configurations. Parenthetical notes
indicate the protection domains that can use each mapping. The suffix
[L] indicates that the descriptor is loaded from the LDT. Optional
mappings are denoted by a '?' after the protection domain label. The
'other' protection domain label refers to protection domains other
than the application and kernel domains.

```
 ...
+------------------------------------------+ \
| Domain X MMIO                            |  +- FS[L]
+------------------------------------------+ /   (other?)
 ...
+------------------------------------------+ \
| Domain X DMA-accessible metadata         |  +- GS[L] (other?)
| (section .dma_bss)                       |  |
+------------------------------------------+ /
+------------------------------------------+ \
| Domain X metadata (section .meta_bss)    |  +- GS[L] (other?)
+------------------------------------------+ /
 ...
+------------------------------------------+ \
| Kernel-private data                      |  |
| (sections .prot_dom_bss, .gdt_bss, etc.) |  +- FS[L] (kern)
+------------------------------------------+  |
+------------------------------------------+ \
| System call data (section .syscall_bss)  |  |
+------------------------------------------+  +- FS[L] (all)
+------------------------------------------+  |
| Kernel-owned data (section .kern_bss)    |  |
+------------------------------------------+ /
+------------------------------------------+ \
| Common data                              |  |
| (sections .data, .rodata*, .bss, etc.)   |  |
+------------------------------------------+  +- DS, ES
+------------------------------------------+ \ |  (all)
| Exception stack (section .exc_stack)     | | |
|+----------------------------------------+| \ |
|| Interrupt stack (section .int_stack)   || | |
||+--------------------------------------+|| \ |
||| Main stack (section .main_stack)     ||| +- SS (all) |
+++--------------------------------------+++ / /
+------------------------------------------+ \
| Main code (.text)                        |  +- CS (all)
+------------------------------------------+ /
+------------------------------------------+
| Bootstrap code (section .boot_text)      |
+------------------------------------------+
+------------------------------------------+
| Multiboot header                         |
+------------------------------------------+
 ...
```

This memory layout is more space-efficient than the layout that is
possible with paging-based protection domains, since segments have
byte granularity, whereas the minimum unit of control supported by
paging is a 4KiB page. For example, with paging, metadata may need to
be padded to a multiple of the page size. Paging may also permit
potentially-undesirable accesses to the padded areas of code and data
regions that do not entirely fill the pages that they occupy.

Kernel data structure access, including to the descriptor tables
themselves, is normally restricted to the code running at ring level
0, specifically the exception handlers and the system call and return
dispatchers. It is also accessible from the cooperative scheduling
context in the kernel protection domain. Interrupt delivery is
disabled in the kernel protection domain, so the preemptive scheduling
context is not used.

SS, DS, and ES all have the same base address, since the compiler may
assume that a flat memory model is in use. Memory accesses that use a
base register of SP/ESP or BP/EBP or that are generated by certain
other instructions (e.g. PUSH, RET, etc.) are directed to SS by
default, whereas other accesses are directed to DS or ES by default.
The compiler may use an instruction that directs an access to DS or ES
even if the data being accessed is on the stack, which is why these
three segments must use the same base address. However, it is
possible to use a lower limit for SS than for DS and ES for the
following reasons. Compilers commonly provide an option for
preventing the frame pointer, EBP, from being omitted and possibly
used to point to non-stack data. In our tests, compilers never used
ESP to point to non-stack data.

Each task switch ends up saving and restoring more state than is
actually useful to us, but the implementation attempts to minimize
overhead by configuring the register values in each TSS to reduce the
number of register loads that are needed in the system call
dispatcher. Specifically, two callee-saved registers are populated
with base addresses used when computing addresses in the entrypoint
information table as well as a mask corresponding to the ID of the
server protection domain that is used to check whether the requested
system call is exported by the server protection domain. Callee-saved
registers are used, since the task return will update the saved
register values.

Note that this implies that the intervening code run between the task
call and return can modify critical data used by the system call
dispatcher. However, this is analogous to the considerations
associated with sharing a single stack amongst all protection domains
and should be addressed similarly, by only invoking protection domains
that are trusted by the caller to not modify the saved critical
values. This consideration is specific to the TSS-based dispatcher
and is not shared by the ring 0 dispatcher used in the other plugins.

Data in the .rodata sections is marked read/write, even though it may
be possible to improve the robustness of the system by marking that
data as read-only. Doing so would introduce even more complexity into
the system than would be the case with paging-based protection
domains, since it would require allocating different segment
descriptors for the read-only vs. the read/write data.

#### Supporting Null-Pointer Checks

A lot of code considers a pointer value of 0 to be invalid. However,
segment offsets always start at 0. To accommodate the common software
behavior, at least the first byte of each segment is marked as
unusable. An exception to this is that the first byte of the stack
segments is usable.

#### Interrupt and Exception Dispatching

A distinctive challenge that occurs during interrupt and exception
dispatching is that the state of the segment registers when an
interrupt or exception occurs is somewhat unpredictable. For example,
an exception may occur while MMIO is being performed, meaning that FS
is loaded with the MMIO descriptor instead of the kernel descriptor.
Leaving the segment registers configured in that way could cause
incorrect interrupt or exception handler behavior. Thus, the
interrupt or exception dispatcher must save the current segment
configuration, switch to a configuration that is suitable for the
handler body, and then restore the saved segment configuration after
the handler body returns. Another motivation for this is that the
interrupted code may have corrupted the segment register configuration
in an unexpected manner, since segment register load instructions are
unprivileged. Similar segment register updates must be performed for
similar reasons when dispatching system calls.

### Software-Switched Segment-Based Protection Domains

Primary implementation sources:

- cpu/x86/mm/swseg-prot-domains.c

The requirement to allocate a TSS for each protection domain in the
hardware-switched segments plugin may consume a substantial amount of
space, since the size of each TSS is fixed by hardware to be at least
104 bytes. The software-switched segments plugin saves space by
defining a more compact PDCS. However, the layout and definitions of
the segments are identical to what was described above for the
hardware-switched segments plugin.

The system call and return procedure is mostly identical to that for
paging-based protection domains. However, instead of updating and
invalidating page tables, the dispatchers update the LDT and some of
the segment registers.
### Pointer Validation
|
|
|
|
Primary implementation sources:
|
|
- cpu/x86/mm/syscalls.h
|
|
|
|
At the beginning of each system call routine, it is necessary to check
|
|
that any untrusted pointer that could have been influenced by a caller
|
|
(i.e. a stack parameter or global variable) refers to a location above
|
|
the return address and to halt otherwise. This is to prevent a
|
|
protection domain from calling a different protection domain and
|
|
passing a pointer that references a location in the callee's stack
|
|
other than its parameters to influence the execution of the callee in
|
|
an unintended manner. For example, if an incoming pointer referenced
|
|
the return address, it could potentially redirect execution with the
|
|
privileges of the callee protection domain.
|
|
|
|
When the paging-based plugin is in use, it is also necessary to check
|
|
that the pointer is either within the stack region or the shared data
|
|
region (or a guard band region, since that will generate a fault) to
|
|
prevent redirection of data accesses to MMIO or metadata regions. The
|
|
other plugins already configure segments to restrict accesses to DS to
|
|
just those regions. Pointers provided as inputs to system calls as
|
|
defined above should never be dereferenced in any segment other than
|
|
DS.
|
|
|
|
The pointer is both validated and copied to a new storage location,
|
|
which must be within the callee's local stack region (excluding the
|
|
parameter region). This is to mitigate scenarios such as two pointers
|
|
being validated and an adversary later inducing a write through one of
|
|
the pointers to the other pointer to corrupt the latter pointer before
|
|
it is used.
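
A minimal sketch of such a validate-and-copy check follows. The macro
name, the use of the frame address as a stand-in for the return
address location, and the halt() routine are all illustrative; the
corresponding macro in cpu/x86/mm/syscalls.h plays this role in the
actual implementation.

```c
#include <stdint.h>

extern void halt(void); /* assumed fatal-error routine */

/* 'untrusted' is a caller-influenced pointer parameter; 'validated'
 * must be a local in the callee's own stack frame (not in the
 * parameter region).  The pointer must refer to a location above the
 * callee's return address; on x86 with frame pointers,
 * __builtin_frame_address(0) is the saved EBP, and the return
 * address sits one word above it. */
#define VALIDATE_PTR_SKETCH(validated, untrusted)                 \
  do {                                                            \
    if((uintptr_t)(untrusted) <=                                  \
       (uintptr_t)__builtin_frame_address(0) + sizeof(void *)) {  \
      halt();                                                     \
    }                                                             \
    (validated) = (untrusted);                                    \
  } while(0)
```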

Any pointer whose value is fixed at link or load time does not need to
be validated prior to use, since no adversary within the defined
threat model is able to influence the link or load process.

### DMA Restrictions

Primary implementation sources:

- cpu/x86/drivers/quarkX1000/imr.c
- cpu/x86/drivers/quarkX1000/imr-conf.c

The CPU is not the only agent with the ability to issue requests to
the interconnect within the SoC. For example, SoC peripherals such as
the Ethernet controller use DMA to efficiently access memory buffers.
This could introduce a risk that DMA could be used to bypass the
memory protections enforced on the CPU by segmentation or paging. For
example, a device driver could instruct a device to access a memory
region that the kernel has not granted the driver's protection domain
permission to access.

The Isolated Memory Region (IMR) feature is configured to restrict the
memory that can be accessed by system agents other than the CPU [3].
It only allows those system agents to access portions of the Contiki
memory space that are specifically intended to be used with DMA. The
source code for each protection domain specifies that its optional
metadata region needs to be accessible from other system agents
besides the CPU by using ATTR_BSS_DMA instead of ATTR_BSS_META when
allocating storage for the metadata.
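
For example, a driver's DMA-accessible metadata might be declared as
sketched below. The metadata type and its contents are hypothetical;
ATTR_BSS_DMA and ATTR_BSS_META are the attributes named above.

```c
#include <stdint.h>

/* Hypothetical driver metadata that a peripheral accesses via DMA,
 * so it is placed in the .dma_bss section (inside an IMR-permitted
 * range) rather than .meta_bss: */
typedef struct my_eth_meta {
  uint8_t rx_buf[1536]; /* e.g. descriptor rings, packet buffers */
} my_eth_meta_t;

static my_eth_meta_t ATTR_BSS_DMA my_eth_meta;

/* CPU-only metadata would instead use ATTR_BSS_META:
 * static my_meta_t ATTR_BSS_META my_meta; */
```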

Extending the Framework
-----------------------

### Adding a New Protection Domain

The following steps are required. See the existing device drivers for
examples of various types of protection domains and how they are
initialized, and see the sketch after this list for an outline of the
steps in code.

- Allocate storage for the PDCS and the corresponding
  client-accessible data structure using the PROT_DOMAINS_ALLOC
  macro.
- Apply the ATTR_BSS_META attribute to the metadata structure, if
  applicable. Apply the ATTR_BSS_DMA attribute instead if the
  metadata structure needs to be DMA-accessible. Pad the metadata
  structure to completely fill an integer multiple of the minimum
  page size, 4096, when paging-based protection domains are in use.
  See the definition of quarkX1000_eth_meta_t for an example.
- Perform the following steps during boot stage 2:
  - Initialize the protection domain ID in the client-accessible data
    structure using the PROT_DOMAINS_INIT_ID macro.
  - Register the domain. See prot-domains.c:prot_domains_init for an
    example of registering a non-driver protection domain. See
    cpu/x86/drivers/quarkX1000/eth.c:quarkX1000_eth_init for an
    example of registering a PCI driver protection domain with an
    MMIO region and a metadata region.
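
The following sketch outlines those steps for a hypothetical driver.
The macros are the ones listed above, but their exact argument lists
should be checked against the current headers; the driver type and
variable names here are invented.

```c
/* Hypothetical driver protection domain setup (paging plugin in use,
 * so the metadata is padded out to a full 4096-byte page): */
typedef struct my_drv_meta {
  uint8_t regs_shadow[64];
  uint8_t pad[4096 - 64];
} my_drv_meta_t;

static my_drv_meta_t ATTR_BSS_META my_drv_meta;

/* Allocates the PDCS and the client-accessible structure for the
 * domain (argument list per the current prot-domains.h): */
PROT_DOMAINS_ALLOC(my_drv_driver_t, my_drv);

void
my_drv_init(void) /* invoked during boot stage 2 */
{
  PROT_DOMAINS_INIT_ID(my_drv);
  /* Register the domain, e.g. with its MMIO and metadata regions;
   * see quarkX1000_eth_init for a PCI-driver registration example. */
}
```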

### Adding a New System Call

The following steps are required; a sketch follows the list.

- Define the system call procedure using the SYSCALLS_DEFINE or
  SYSCALLS_DEFINE_SINGLETON macro. See
  cpu/x86/drivers/legacy_pc/uart-16x50.c:uart_16x50_tx for an example
  of a non-singleton system call. See
  cpu/x86/drivers/quarkX1000/eth.c:quarkX1000_eth_send for an example
  of a singleton system call. A singleton system call is one for
  which at most one server protection domain will be associated with
  it.
- During boot phase 2, associate the system call with one or more
  server protection domains using the SYSCALLS_AUTHZ macro.
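
A sketch of those two steps for a hypothetical driver follows.
SYSCALLS_DEFINE and SYSCALLS_AUTHZ are the macros named above, but the
argument lists shown are only illustrative; consult
cpu/x86/mm/syscalls.h and the cited drivers for the exact signatures.

```c
/* Step 1: define the system call body.  Illustrative form, modeled
 * on uart_16x50_tx; the first argument names the system call and the
 * rest are its parameters: */
SYSCALLS_DEFINE(my_drv_send, my_drv_driver_t *drv, const uint8_t *buf)
{
  /* Validate 'buf' first (see Pointer Validation), then do the
   * work of the system call... */
}

/* Step 2: during boot stage 2, authorize the system call for the
 * server protection domain that implements it, e.g.:
 *   SYSCALLS_AUTHZ(my_drv_send, my_drv); */
```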

Usage
-----

To enable protection domain support, add "X86_CONF_PROT_DOMAINS=" to
the command line and specify one of the following options:

- paging
- tss
- swseg

For example: "make TARGET=galileo X86_CONF_PROT_DOMAINS=paging".

The paging option accepts a sub-option to determine whether the TLB is
fully or selectively invalidated during protection domain switches.
By default, full invalidation is selected. Set the
X86_CONF_USE_INVLPG variable to 1 to override the default.

References
----------

[1] J. H. Saltzer, "Protection and the Control of Information Sharing
    in Multics," Commun. ACM, vol. 17, no. 7, pp. 388-402, Jul. 1974.

[2] https://github.com/contiki-os/contiki/wiki/Processes

[3] "Intel(R) Quark(TM) SoC X1000 Secure Boot Programmer's Reference
    Manual,"
    http://www.intel.com/support/processors/quark/sb/CS-035228.htm