[RFC] MPU support for debugging


Boie, Andrew P
 

This is not a fully fleshed-out specification, but since there is a lot of interest in memory protection in Zephyr on multiple fronts I thought this would be a good way to get the conversation going on how we can use an MPU in Zephyr and what implications it has for the OS design. Hoping to get a consensus on the scope and capabilities of this before diving too deep into implementation details.

TL;DR SUMMARY

Memory protection could be used to greatly assist in kernel debugging. At this time I think it should be positioned as an optional debug feature and not a security feature. We want to target common MPU features found on MCUs. Systems that lack MPU hardware but have an MMU can use the MMU with identity page tables instead. We want to be able to create "user" threads with reduced privileges on what memory they can write to, so that if they overflow their stack, branch to non-code, or write to memory that doesn't belong to them, the system will trigger a fatal error instead of silently wreaking havoc. We want to implement this in a way that is not invasive to the current Zephyr driver/OS APIs.

PROBLEM STATEMENTS

As I'm sure we all know, debugging multi-threaded applications is a complex task, made much more complex if threads can silently corrupt each other or the OS. Thread stack overflows are also very difficult to debug, with the usual symptom being strange, unpredictable behavior until the system eventually crashes.

We'd like to introduce support for memory protection units in Zephyr to help with these debug issues. To me the following characteristics would be desirable:

- Introduce the notion of User/Supervisor threads. User threads would only have write access to their own stack, plus some additional memory regions parameterized at runtime.
- Illegal memory access by a user thread, stack overflow, or branching the PC to memory that is not code should result in a fatal system error, using SysFatalErrorHandler.
- Kernel and driver APIs should be the same regardless of whether memory protection is used. For example, during debugging the end user may enable memory protection to catch issues, but for the final production build it would be switched off for performance reasons.
- For at least the initial implementation of this feature, we should try to change existing public kernel and driver APIs as little as possible, although we will need to introduce some new ones.

MPU LIMITATIONS

It's important to keep in mind the capabilities of a typical MPU. In all the implementations I have seen, a region's size must be a power of two, and its base address must be aligned to that size. (ARM lets you further subdivide any given region into 8 sub-regions which can be selectively enabled.) Regions may overlap, but only a fixed number of regions can be active at any given time, with 8 or 16 being common.

Some arches like x86 don't have MPU hardware, but if an MMU is present it can be configured to act like an MPU using an identity page table.
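To illustrate, here is a minimal sketch of the validity check an arch layer would have to apply to a requested region; the function name is invented and no particular MPU driver is assumed:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper illustrating the constraints above: region size
 * must be a power of two, and the base address must be aligned to the
 * region size. An arch layer would reject anything that fails this. */
static bool mpu_region_is_valid(uint32_t base, uint32_t size)
{
    /* power-of-two check: exactly one bit set */
    if (size == 0 || (size & (size - 1)) != 0) {
        return false;
    }
    /* base must be a multiple of size */
    return (base & (size - 1)) == 0;
}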

THIS IS NOT A SECURITY FEATURE

If you look at all the kernel APIs in kernel.h, they all take pointers to various data structures owned by the caller. Let's take a Zephyr timer as an example. If you want to use a timer you need to declare a struct k_timer and then make kernel API calls passing its pointer value, which will involve a privilege elevation for the system to do its work. Some of the members of the k_timer struct are really private to the kernel and we could cause all kinds of havoc if garbage data is passed in, or the k_timer struct is corrupted in some way, or someone was trying to do something malicious.
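For readers unfamiliar with the pattern, typical use of the timer API of that era looked roughly like this (exact signatures and timeout units varied between releases, so treat this as a sketch):

#include <kernel.h>

/* The k_timer object lives in caller-owned memory. Some of its fields
 * are kernel-private bookkeeping, which is exactly the hazard described
 * above: the application can scribble over them. */
static struct k_timer my_timer;

static void my_expiry(struct k_timer *timer)
{
    /* called each time the timer fires */
}

void start_periodic(void)
{
    k_timer_init(&my_timer, my_expiry, NULL);
    /* first expiry after 500 ms, then every 1000 ms */
    k_timer_start(&my_timer, 500, 1000);
}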

We have a similar problem with drivers, where pointers to struct device are passed around and dereferenced in order to get at their API calls. At a fundamental level, Zephyr in its current state is designed to be application+kernel all rolled into one address space.

I do not see any good way around this problem unless we completely redesign all kernel and driver APIs:
- Data structures private to the kernel for kernel objects would need to be completely inaccessible to user threads.
- User threads wanting to work with some kernel object would not allocate memory for it and pass in a pointer. We'd need an API where the user thread requests an object (like a timer), the kernel allocates space for it somehow, and returns to the user thread a descriptor or handle to use when performing API calls.
- For situations where we only want to use memory protection for debugging, making all the APIs go through this layer of descriptor abstraction instead of pointers would impose a lot of unnecessary overhead when the MPU is disabled or not available.

Ideas are very welcome here, but my feeling is that "preventing threads from crashing the OS in all cases" or "preventing threads from doing malicious things" is probably not completely feasible in Zephyr unless we transform it into a remarkably different OS than it is now. Given the limitations of a typical MPU, I suspect doing it right would probably require an MMU. I feel there are a lot of things we can do to help with debugging if we assume good intent and are not trying to protect against malicious behavior or garbage being passed to kernel APIs.

In this proposal, threads can switch between supervisor and user privileges at will. This greatly simplifies the implementation:

- No need for system calls for privilege elevation. Through macro magic, if memory protection is turned on we can wrap kernel/driver APIs with some code that elevates privileges before calling into the kernel, checks that any supplied pointer values are actually owned by the calling thread, and drops privileges when returning to the caller (see the sketch after this list).
- Many drivers and subsystems allow for the registration of callbacks upon events. Some of these run in ISR context and should be left as-is. However, in the network stack for example, you can supply callbacks that supervisor network stack worker threads invoke to perform protocol parsing on the application side, and it would be great if these callbacks could run with a reduced set of privileges. The worker threads in the subsystem that make the callbacks could lower their privileges (granting access to buffer memory regions if necessary) when making the callback, and restore them when it returns.
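A minimal sketch of the first bullet's "macro magic", assuming a hypothetical CONFIG_MEMORY_PROTECTION switch and hypothetical arch primitives for raising/dropping privilege and checking ownership (none of these names exist in the tree):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical arch-provided primitives */
extern void _priv_elevate(void);
extern void _priv_drop(void);
extern bool _thread_owns(const void *ptr, size_t size);
extern void _fatal_owner_error(void);

#ifdef CONFIG_MEMORY_PROTECTION
/* Wrap a kernel API call made from user mode: validate the object
 * pointer, elevate, call, drop privileges again. */
#define K_USER_CALL(ret, obj, call)                        \
    do {                                                   \
        if (!_thread_owns((obj), sizeof(*(obj)))) {        \
            _fatal_owner_error();                          \
        }                                                  \
        _priv_elevate();                                   \
        (ret) = (call);                                    \
        _priv_drop();                                      \
    } while (0)
#else
/* With protection off, the wrapper compiles down to a plain call */
#define K_USER_CALL(ret, obj, call) do { (ret) = (call); } while (0)
#endif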

THREAD PERMISSIONS

By default all threads start their life in supervisor mode. I think what Vincenzo recently uploaded is a good policy for supervisor threads: https://gerrit.zephyrproject.org/r/#/q/topic:zephyr_mpu.

Threads may drop down to user mode with an API call. This would grant the thread write access to only its stack memory, plus additional memory regions parameterized at runtime. Threads may go back to supervisor mode with another API call; there are no restrictions on moving back and forth between these states.

Even though we allow threads to move between user and supervisor mode at will, we are trying to catch mistakes. So some kernel/driver APIs that are intended to be called from user mode will have an implicit privilege elevation, but others which are intended to be supervisor-only will not.

For globals (data/bss) I think we should separate the data used by the OS (libzephyr.a) from globals declared by the application, and put them in different memory regions. We should make it easy to grant any given thread write access to globals defined within the application. I'm thinking of a programming model similar to pthreads: each pthread's stack is private, and you can't touch OS data structures, but all the pthreads can access/modify globals within the process they run in. In Zephyr we more or less have many threads running in exactly one process.

Additionally, application globals that are never intended to be touched by user threads could be tagged with a __supervisor variable attribute which will roll them in with the other OS globals.
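A sketch of what that attribute could look like, assuming the linker script gains a dedicated section that gets grouped with the OS data (both names are invented):

/* Place the variable in a section the linker script groups with
 * kernel-owned data, outside any user-writable MPU region. */
#define __supervisor __attribute__((section(".supervisor_data")))

/* Writable from supervisor mode only; a user-mode write would fault. */
__supervisor static int driver_internal_state;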

ZEPHYR IMPLICATIONS

Rough to-do list to make this happen:
- Fully specify the relevant API calls for this feature
- Move the struct k_thread out of the thread stack where it currently lives. We don't want the thread clobbering this if too much stack is used, which is currently what happens since it's at the lowest memory addresses of the stack. May need to reserve some additional memory on the kernel side for this data to live in.
- Some APIs which assume use of the system heap for all allocations may need to be extended for user-supplied memory pools
- Separate out kernel globals (bss/data) from application globals. A typical user thread may need to have access to application globals, but should never need to write to kernel data structures in a non-supervisor state. Application globals tagged with __supervisor would be put with the OS data structures.
- Implement magic for implicit privilege elevation in kernel/driver APIs when calling them from user mode
- Define a memory protection API specification between the core kernel and arch/ code which actually talks to the MPU (or MMU)
- Possible adjustments to how the network stack does callbacks
- Test cases to prove that this all works


Benjamin Walsh <benjamin.walsh@...>
 

Some random thoughts on this, do with them as you wish. :)

To get more detailed feedback, I think you need to document how exactly
you intend on using the MPU.

Like:

- region 0 to protect the kernel text+rodata (supervisor, ro)
- region 1 to protect the kernel data and kernel thread stacks (supervisor, rw)
- region 2 for text+rodata for all user threads (user, ro)
- region 3 for data+stacks of threads in group A (user, rw, nx (if available))
- region 4 for data+stacks of threads in group B (user, rw, nx (if available))
- region 5 for shared memory between group A and group B (always accessible) (user, rw, nx (if available))
- regions are fixed
- privileges for region 3 change from user to supervisor when a thread in group B gets context switched in
- privileges for region 4 change from user to supervisor when a thread in group A gets context switched in
- privileges for region 3 change from supervisor to user when a thread in group A gets context switched in
- privileges for region 4 change from supervisor to user when a thread in group B gets context switched in

or

- region 0 to protect the kernel text+rodata (supervisor, ro)
- region 1 to protect the kernel data and kernel thread stacks (supervisor, rw)
- region 2 for text+rodata for all user threads (user, ro)
- region 3 for data+stack of the currently running user thread (user, rw, nx (if available))
- region 4 for shared memory between all threads (always accessible) (user, rw, nx (if available))
- region 3 base addr + size get reprogrammed on user thread context switches to match the incoming thread's data+stack section
- kernel can reprogram region 3 base addr + size to match a thread it wants to pass data to in a user-space buffer

or something else.
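Either layout could be captured in a static configuration table along these lines (all types and names are invented for the sketch; base/size would come from linker symbols at boot):

#include <stdbool.h>
#include <stdint.h>

enum region_perm {
    REGION_SUPERVISOR_RO,
    REGION_SUPERVISOR_RW,
    REGION_USER_RO,
    REGION_USER_RW_NX,
};

struct mpu_region_cfg {
    uint32_t base;        /* must be size-aligned */
    uint32_t size;        /* must be a power of two */
    enum region_perm perm;
    bool per_thread;      /* reprogrammed at context switch */
};

/* Second layout from above; base/size filled in at boot */
static struct mpu_region_cfg regions[] = {
    { 0, 0, REGION_SUPERVISOR_RO, false }, /* 0: kernel text+rodata */
    { 0, 0, REGION_SUPERVISOR_RW, false }, /* 1: kernel data+stacks */
    { 0, 0, REGION_USER_RO,       false }, /* 2: user text+rodata */
    { 0, 0, REGION_USER_RW_NX,    true  }, /* 3: current thread data+stack */
    { 0, 0, REGION_USER_RW_NX,    false }, /* 4: shared memory */
};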

On Fri, Mar 10, 2017 at 10:11:20PM +0000, Boie, Andrew P wrote:
This is not a fully-fleshed out specification, but since there is a
lot of interest in memory protection in Zephyr on multiple fronts I
thought this would be a good way to get the conversation going on how
we can use an MPU in Zephyr and what implications it has for the OS
design. Hoping to get a consensus on the scope and capabilities of
this before diving too deep into implementation details.

TL;DR SUMMARY

Memory protection could be used to greatly assist in kernel debugging.
At this time I think it should be positioned as an optional debug
feature and not a security feature. We want to target common MPU
features found on MCUs. Systems without MPU hardware, but have an MMU,
can use MMU with identity page tables instead. We want to be able to
create "user" threads with reduced privileges on what memory they can
write to, so that if they overflow their stack, branch to non-code, or
write to memory that doesn't belong to them, the system will trigger a
fatal error instead of silently wreaking havoc. We want to implement
this in a way that is not invasive to the current Zephyr driver/OS
APIs.

PROBLEM STATEMENTS

As I'm sure we all know, debugging multi-threaded applications is a
complex task, made much more complex if threads can silently corrupt
each other or the OS. Thread stack overflows are also very difficult
to debug, with the usual symptom being strange, unpredictable behavior
until the system eventually crashes.

We'd like to introduce support for memory protection units in Zephyr
to help with these debug issues. To me the following characteristics
would be desirable:

- Introduce the notion of User/Supervisor threads. User threads would
only have write access to their own stack, plus some additional memory
^^^^^^^^^^^^^^^
And whatever other stacks are in the same protection region: stack
overflows will only corrupt the current region, but they can still happen
and wreak havoc within the protection area. Or are you thinking of
rewriting the memory protection area boundaries on every context switch
to put bounds around the current thread's stack ? I don't mean only
changing the permissions, but the boundaries themselves ?

regions parameterized at runtime.

- Illegal memory access by a user thread, stack overflow, or branching
the PC to memory that is not code should result in a fatal system
error, using SysFatalErrorHandler.

- Kernel and driver APIs should be the same regardless of whether
memory protection is used. For example, during debugging the end user
may enable memory protection to catch issues, but for the final
production build it would be switched off for performance reasons.

- For at least the initial implementation of this feature we should
try to do it changing any existing public kernel or driver APIs as
little as possible, although we will need to introduce some new ones.

MPU LIMITATIONS

It's important to keep in mind the capabilities of a typical MPU unit.
In all the implementations I have seen, the memory regions need to be
aligned to a multiple of their size, and the size must be a power of
two. (ARM lets you further subdivide any given region into 8
sub-regions which can be selectively enabled). Regions may overlap,
but there can only be a fixed number of regions active at any given
time, with 8 or 16 being common.

Some arches like x86 don't have MPU hardware, but if an MMU is present
it can be configured to act like an MPU using an identity page table.

THIS IS NOT A SECURITY FEATURE
This bothers me a bit. If you ever want to turn this into a more
fleshed-out user-space model, would you introduce a third API/model or
modify them again ?

My gut feeling would be to design for the full "user-space" model, and
then have the reduced-capability MPU-model be a special case of it.

Then again, depends on the timeframe and how future-proof you want to
design this.

If you look at all the kernel APIs in kernel.h, they all take pointers
to various data structures owned by the caller. Let's take a Zephyr
timer as an example. If you want to use a timer you need to declare a
struct k_timer and then make kernel API calls passing its pointer
value, which will involve a privilege elevation for the system to do
its work. Some of the members of the k_timer struct are really private
to the kernel and we could cause all kinds of havoc if garbage data is
passed in, or the k_timer struct is corrupted in some way, or someone
was trying to do something malicious.

We have a similar problem with drivers, where pointers to struct
device are passed around and dereferenced in order to get at their API
calls. At a fundamental level, Zephyr in its current state is designed
to be application+kernel all rolled into one address space.

I do not see any good way around this problem unless we completely
redesign all kernel and driver APIs:
Yeah, I think you would need a different API. With pre-zephyr, when we
added user-space and allowed the microkernel APIs from user-space, we
were playing with IDs already (the microkernel API referenced IDs, not
addresses), and we still had to put a bunch of restrictions on what the
APIs could do from user-space, which made the API clunky to use.

You could supplement the API with user-space "wrappers" that allocate
and initialize objects and return and operate on an ID, rather than
playing with addresses.

e.g.

k_sem_id_t k_sem_create(int limit, int count)

or maybe

z_sem_id_t z_sem_create(int limit, int count)

since they are not really kernel APIs when called from user-space, and then

int z_sem_take(z_sem_id_t id, int32_t timeout)

Then you need a pool of objects or a real kernel memory allocator.
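A rough sketch of how those wrappers might hang together, with a fixed object pool standing in for a real allocator (everything here is invented; no locking shown):

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <kernel.h>

typedef int z_sem_id_t;

#define Z_SEM_POOL_SIZE 16

/* Kernel-side pool; user threads never see these addresses */
static struct k_sem sem_pool[Z_SEM_POOL_SIZE];
static bool sem_in_use[Z_SEM_POOL_SIZE];

z_sem_id_t z_sem_create(int limit, int count)
{
    for (int i = 0; i < Z_SEM_POOL_SIZE; i++) {
        if (!sem_in_use[i]) {
            sem_in_use[i] = true;
            k_sem_init(&sem_pool[i], count, limit);
            return i;
        }
    }
    return -ENOMEM; /* pool exhausted */
}

int z_sem_take(z_sem_id_t id, int32_t timeout)
{
    /* validate the descriptor instead of trusting a pointer */
    if (id < 0 || id >= Z_SEM_POOL_SIZE || !sem_in_use[id]) {
        return -EINVAL;
    }
    return k_sem_take(&sem_pool[id], timeout);
}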

- Data structures private to the kernel for kernel objects would need
to be completely inaccessible to user threads.

- User threads wanting to work with some kernel object would not
allocate memory for it and pass in a pointer. We'd need an API where
the user thread requests an object (like a timer), the kernel
allocates space for it somehow, and returns to the user thread a
descriptor or handle to use when performing API calls.
Oh well, I wrote the previous comment before getting to this part. :)

- For situations where we only want to use memory protection for
debugging, this layer of abstraction in all the APIs to use
descriptors instead of pointers would impose a lot of unnecessary
overhead for situations where MPU is disabled or not available.

Ideas are very welcome here, but my feeling is that "preventing
threads from crashing the OS in all cases" or "prevent threads from
doing malicious things" is probably not completely feasible in Zephyr
unless we transform it into a remarkably different OS than it is now.
I think this needs more of a "split" between kernel and "user-space"
APIs than a redesign. But then, kernel applications could not be rebuilt
as user-space applications. Applications using the user-space API could
probably be rebuilt as kernel applications, if we provided the
user-space API as a very thin layer around the current kernel APIs in
the kernel. Nothing would prevent removing the user-space layer and
running kernel-only.

Given the limitations of a typical MPU I suspect this would probably
require an MMU to do it right. I feel there are a lot of things we can
do to help with debugging if we assume good intent and are not trying
to protect against malicious behavior, or garbage being passed to
kernel APIs.

In this proposal, threads can switch between supervisor and user
privileges at will. This greatly simplifies the implementation:

- No need for system calls for privilege elevation. Through macro
magic, if memory protection is turned on we can wrap kernel/driver
What do you mean by "macro magic" ? There has to be a mechanism to
trigger the privilege elevation. On ARM, you have to use the SVC
instruction, handle the exception, write the CONTROL register, and come
back to the next instruction after SVC. That's basically one system
call, whose purpose is to elevate the privilege of the current thread.

APIs with some code that elevates privileges before calling into the
kernel, check that any supplied pointer values are actually owned by
the calling thread, and drop privileges when returning to the caller.

- Many drivers and subsystems allow for the registration of callbacks
upon events. Some of these run in ISR context and should be left
as-is. However for example in the network stack you can supply
callbacks invoked by supervisor network stack worker threads to
perform protocol parsing on the application side, and it would be
great if these callbacks could run in a reduced set of privileges. The
worker threads in the subsystem that make the callbacks could lower
their privileges (granting access to buffer memory regions if
necessary) when making the callback, restoring them when it returns.

THREAD PERMISSIONS

By default all threads start their life in supervisor mode. I think
what Vincenzo recently uploaded is a good policy for supervisor
threads: https://gerrit.zephyrproject.org/r/#/q/topic:zephyr_mpu.

Threads may drop down to user mode with an API call. This would grant
the thread write access to only its stack memory, plus additional
memory regions parameterized at runtime. Threads may go back to
supervisor mode with another API call, there's no restrictions on
moving back and forth between these states.

Even though we allow threads to move between user and supervisor mode
at will, we are trying to catch mistakes. So some kernel/driver APIs
that are intended to be called from user mode will have an implicit
privilege elevation, but others which are intended to be
supervisor-only will not.

For globals (data/bss) I think we should separate the data used by the
OS (libzephyr.a) and globals declared by the application, put them in
different memory regions. We should make it easy to grant any given
thread write access to globals defined within the application. I'm
thinking of a programming model similar to pthreads:

each pthread's stack is private
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's not really true: within a process, threads can stomp on each
other's stacks (that said, I haven't done a lot of pthread programming,
and I don't know if e.g. the Linux kernel allows protecting a thread's
stack from the other threads within a process).

and you can't touch OS data structures, but all the
pthreads can access/modify globals within the process they run in. In
Zephyr we more or less have many threads running in exactly one
process.

Additionally, application globals that are never intended to be
touched by user threads could be tagged with a __supervisor variable
attribute which will roll them in with the other OS globals.

ZEPHYR IMPLICATIONS

Rough to-do list to make this happen:

- Fully specify the relevant API calls for this feature

- Move the struct k_thread out of the thread stack where it currently
lives. We don't want the thread clobbering this if too much stack is
used, which is currently what happens since it's at the lowest memory
addresses of the stack. May need to reserve some additional memory on
the kernel side for this data to live in.
Normally, you need a kernel stack in addition to the user-space stack to
do this right. In pre-zephyr user-space, we tried doing everything on
the user-space stack to save space, but finally abandoned that approach
when facing too many issues.

- Some APIs which assume use of the system heap for all allocations may
need to be extended for user-supplied memory pools

- Separate out kernel globals (bss/data) from application globals. A
typical user thread may need to have access to application globals,
but should never need to write to kernel data structures in a
non-supervisor state. Application globals tagged with __supervisor
would be put with the OS data structures.

- Implement magic for implicit privilege elevation in kernel/driver
APIs when calling them from user mode

- Define a memory protection API specification between the core kernel
and arch/ code which actually talks to the MPU (or MMU)

- Possible adjustments to how the network stack does callbacks

- Test cases to prove that this all works


Boie, Andrew P
 

On Sat, 2017-03-11 at 17:21 -0500, Benjamin Walsh wrote:
Some random thoughts on this, do with them as you wish. :)

To get more detailed feedback, I think you need to document how
exactly
you intend on using the MPU.

- Introduce the notion of User/Supervisor threads. User threads
would
only have write access to their own stack, plus some additional
memory
                            ^^^^^^^^^^^^^^^
And whatever other stacks are in the same protection region: stack
overflows will only corrupt the current region, but they can still happen
and wreak havoc within the protection area. Or are you thinking of
rewriting the memory protection area boundaries on every context switch
to put bounds around the current thread's stack ? I don't mean only
changing the permissions, but the boundaries themselves ?
Yes, that's the intention. Every context switch _Swap() would involve
reprogramming the MPU (or MMU page tables) to appropriate values for
the incoming thread.
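A sketch of that per-thread reprogramming, with an invented arch call and per-thread record (the real hook would live in the swap path):

#include <stdint.h>

/* Hypothetical per-thread record, e.g. stored alongside struct k_thread */
struct thread_mem_cfg {
    uint32_t stack_base;  /* size-aligned */
    uint32_t stack_size;  /* power of two */
};

/* Hypothetical arch-level call that rewrites the per-thread region */
extern void arch_mpu_set_thread_region(uint32_t base, uint32_t size);

/* Invoked from the context switch path for the incoming thread: move
 * the region's boundaries, not just its permissions, so only the
 * incoming thread's stack is user-writable. */
static void mpu_configure_incoming(const struct thread_mem_cfg *cfg)
{
    arch_mpu_set_thread_region(cfg->stack_base, cfg->stack_size);
}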

THIS IS NOT A SECURITY FEATURE
This bothers me a bit. If you ever want to turn this into a more
fleshed-out user-space model, would you introduce a third API/model
or
modify them again ?

My gut feeling would be to design for the full "user-space" model,
and
then have the reduced-capability MPU-model be a special case of it.

Then again, depends on the timeframe and how future-proof you want to
design this.
My expectation was that we can implement this proposal without changing
our existing kernel APIs. Kernel APIs would be supplemented with a few
extra calls to drop a thread down to User mode and set up its memory
regions. I don't see much else changing.

To go to a full userspace model, my thought process was that we have to
re-do all the kernel APIs to work with descriptors instead of pointers,
introduce an allocator on the kernel side, etc. My thinking was that it
would fundamentally change the class of devices that we are targeting
with this OS, and I was loath to introduce such a sweeping change to
all the kernel APIs.

However, I think you may have convinced me, read on..


I think this needs more of a "split" between kernel and "user-space"
APIs than a redesign. But then, kernel applications could not be
rebuilt
as user-space applications. Applications using the user-space API
could
probably be rebuilt as kernel applications, if we provided the
user-space API as a very thin layer around the current kernel APIs in
the kernel. Nothing would prevent removing the user-space layer and
run
kernel-only.
I like this idea. Supplement all the k_* APIs with z_* APIs (or
whatever naming convention people agree on) which are intended to be
run from userspace context. They will handle allocation and descriptor
management and call into the kernel APIs. Let people working on
severely RAM-constrained devices like Quark D2000 keep using kernel
APIs like they are now. Everybody wins.

The real trick, of course, is the scenario where the protection is only
used during the debug phase of development: how to make the userspace
layer, with memory protection disabled, thin enough that it would be
roughly comparable in footprint and performance to an app that just
used kernel APIs directly.

- No need for system calls for privilege elevation. Through macro
magic, if memory protection is turned on we can wrap kernel/driver
What do you mean by "macro magic" ? There has to be a mechanism to
trigger the privilege elevation. On ARM, you have to use the SVC
instruction, handle the exception, write the CONTROL register, and
come
back to the next instruction after SVC. That's basically one system
call, which purpose is to elevate the privilege of the current
thread.
I am waving my arms here, but I was thinking this would be a relatively
small amount of code to do a generic privilege escalation to run
whatever kernel code is desired by the caller, rather than a carefully
controlled table of valid system call entry points.

But this is moot, I like the idea of supplementing the kernel APIs with
a set of parallel userspace APIs much better.
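For reference, a bare-bones sketch of the mechanism Ben describes on ARM Cortex-M (CONTROL and its nPRIV bit are real architecturally; the handler wiring is elided and the helper names are invented):

#include <stdint.h>

/* Dropping privilege can be done directly: set CONTROL.nPRIV. */
static inline void drop_to_user(void)
{
    __asm__ volatile(
        "mrs r0, CONTROL\n"
        "orr r0, r0, #1\n"   /* nPRIV = 1: unprivileged thread mode */
        "msr CONTROL, r0\n"
        "isb\n"
        ::: "r0", "memory");
}

/* Going back up requires an exception: issue an SVC, and the SVC
 * handler (running privileged) clears CONTROL.nPRIV on our behalf. */
static inline void elevate_to_supervisor(void)
{
    __asm__ volatile("svc #0" ::: "memory");
}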

Thanks for your feedback, I'm going to spend some more time thinking
about this, but now I'm leaning towards a full userspace solution (with
MPU as a special case with physical=virtual addresses) with a
supplemental API for userspace applications that handles descriptor
management and kernel object allocation. I'll send another proposal
with this taken into account.

Andrew


Jon Medhurst (Tixy) <tixy@...>
 

On Mon, 2017-03-13 at 23:08 +0000, Boie, Andrew P wrote:
Thanks for your feedback, I'm going to spend some more time thinking
about this, but now I'm leaning towards a full userspace solution (with
MPU as a special case with physical=virtual addresses) with a
supplemental API for userspace applications that handles descriptor
management and kernel object allocation.
And validation of all the arguments passed into each function call. No
point going to all this trouble if userspace code can pass in dodgy
pointers or lengths etc. to the kernel.

--
Tixy


Benjamin Walsh <benjamin.walsh@...>
 

On Tue, Mar 14, 2017 at 09:01:53AM +0000, Jon Medhurst (Tixy) wrote:
On Mon, 2017-03-13 at 23:08 +0000, Boie, Andrew P wrote:
Thanks for your feedback, I'm going to spend some more time thinking
about this, but now I'm leaning towards a full userspace solution (with
MPU as a special case with physical=virtual addresses) with a
supplemental API for userspace applications that handles descriptor
management and kernel object allocation.
And validation of all the arguments passed into each function call. No
point going to all this trouble if userspace code can pass in dodgy
pointers or lengths etc. to the kernel.
Agreed, but I think this can be made optional if the application is
trusted and user-space is used more as a debugging aid. Although, that's
Yet Another Kconfig configuration to verify.


Piotr Mienkowski
 

I would like to add one more point to the discussion. It is maybe not directly related to the topic but should likely be considered when designing MPU support.

Occasionally, mainly in the case of device drivers, on MCUs that have cache it is required to use so-called non-cacheable RAM regions, i.e. memory regions for which caching has been turned off. This task is typically done by the MPU/MMU, and the Zephyr MPU architecture should also support it. That is, as a developer I would like to have the possibility to place a specific variable / set of variables in a non-cacheable RAM region.
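A sketch of the developer-facing piece, assuming the linker script gains a ".nocache" output section covered by an MPU region configured non-cacheable (the attribute and section names are invented; Zephyr had no such attribute at the time):

#include <stdint.h>

/* Requires a matching linker output section plus an MPU region over it
 * configured with caching disabled. */
#define __nocache __attribute__((section(".nocache")))

#define RX_DESC_COUNT 8 /* arbitrary for the sketch */

/* DMA descriptor list shared with the Ethernet hardware: two 32-bit
 * words (address + status) per descriptor, never cached. */
__nocache static volatile uint32_t rx_desc[RX_DESC_COUNT * 2];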

Following is one real life example when using non-cacheable RAM region is necessary (for the interested):
The Atmel SAM Ethernet module, like most other Ethernet modules, uses a scatter-gather technique to exchange data with the Ethernet driver. To communicate the location of the data buffers, the driver sets up a so-called descriptor list. This is effectively a place in RAM containing a sequence of 32-bit words representing buffer locations and their status. If cache is enabled, every time the device driver updates data in the descriptor list it should theoretically also perform a cache clean operation. This would ensure that the data is written to RAM where it can be read by the hardware module. Unfortunately, at the same time the device driver is modifying one location in the descriptor list, the hardware module can be modifying one next to it. Since a cache clean operation works on a full cache line - 8 bytes in the case of the Atmel device - it would overwrite any such modifications done by the hardware module. As such, ensuring cache coherency using standard techniques is not possible here. One workaround would be to force a write-through cache policy, but this is really just a workaround.

Piotr


Boie, Andrew P
 

On Tue, Mar 14, 2017 at 09:01:53AM +0000, Jon Medhurst (Tixy) wrote:
On Mon, 2017-03-13 at 23:08 +0000, Boie, Andrew P wrote:
Thanks for your feedback, I'm going to spend some more time thinking
about this, but now I'm leaning towards a full userspace solution
(with MPU as a special case with physical=virtual addresses) with a
supplemental API for userspace applications that handles descriptor
management and kernel object allocation.
And validation of all the arguments passed into each function call. No
point going to all this trouble if userspace code can pass in dodgy
pointers or lengths etc. to the kernel.
Agreed, but I think this can be made optional if the application is trusted and
user-space is used more as a debugging aid. Although, that's Yet Another
Kconfig configuration to verify.
I had planned on checking user pointers unconditionally, since passing bad pointers is a common programming mistake. When running in supervisor mode on behalf of a user thread, we should make sure any buffers passed in are actually owned and writable by the caller.
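A minimal sketch of that check, assuming each thread records the writable regions it owns (all names hypothetical):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_region {
    uintptr_t start;
    size_t len;
};

#define MAX_THREAD_REGIONS 4

/* Hypothetical per-thread record of writable regions (stack + extras) */
struct thread_regions {
    struct mem_region region[MAX_THREAD_REGIONS];
    int count;
};

/* Return true if [buf, buf+len) falls entirely inside one region the
 * calling thread is allowed to write. */
bool thread_buf_writable(const struct thread_regions *tr,
                         const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;

    if (len == 0 || start + len < start) { /* reject wraparound */
        return false;
    }
    for (int i = 0; i < tr->count; i++) {
        const struct mem_region *r = &tr->region[i];

        if (start >= r->start && (start + len) <= (r->start + r->len)) {
            return true;
        }
    }
    return false;
}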

Andrew


Benjamin Walsh <benjamin.walsh@...>
 

On Tue, Mar 14, 2017 at 05:29:57PM +0100, Piotr Mienkowski wrote:
I would like to add one more point to the discussion. It is maybe not
directly related to the topic but should likely be considered when
designing MPU support.

Occasionally, mainly in case of device drivers, on MCUs that have cache
it is required to use the so called non-cacheable RAM regions. A memory
region for which caching has been turned off. This task is typically
done by MPU/MMU and Zephyr MPU architecture should also support it. I.e.
as a developer I would like to have a possibility to place a specific
variable / set of variables in a non-cacheable RAM region.

Following is one real life example when using non-cacheable RAM region
is necessary (for the interested):
Atmel SAM Ethernet module, like most other Ethernet modules, is using
scatter-gather technique to exchange data with the Ethernet driver. To
communicate the location of the data buffers the driver sets-up a so
called descriptor list. This is effectively a place in RAM containing a
sequence of 32-bit words representing buffer location and its status. If
cache is enabled every time the device driver updates data in the
descriptor list it should theoretically also call cache clean operation.
This would ensure that the data is written to RAM where it can be read
by the hardware module. Unfortunately, at the same time the device
driver is modifying one location in the descriptor list the hardware
module can be modifying one next to it. Since cache clean operation
works on a full cache line - 8 bytes in case of Atmel device - it would
overwrite any such modifications done by the hardware module. As such,
ensuring cache coherency using standard techniques is not possible here.
One workaround would be to force write-through cache policy but this is
really just a workaround.
Yet another example that shows the importance of laying out early what
the different MPU regions will be used for, as stated in the top comment
in my original reply to the RFC.


Boie, Andrew P
 

On Tue, 2017-03-14 at 17:29 +0100, Piotr Mienkowski wrote:
I would like to add one more point to the discussion. It is maybe not directly
related to the topic but should likely be considered when designing MPU
support.

Occasionally, mainly in case of device drivers, on MCUs that have cache it is
required to use the so called non-cacheable RAM regions. A memory region for
which caching has been turned off. This task is typically done by MPU/MMU and
Zephyr MPU architecture should also support it. I.e. as a developer I would
like to have a possibility to place a specific variable / set of variables in
a non-cacheable RAM region.
This is a great topic to bring up. In addition to an MPU policy to protect
threads for debugging, we do need to specify a system-level policy that gets
applied at boot, even if we are not protecting individual threads.

Are you thinking that this would be something declared at the SOC level? I think
because of the size and alignment constraints of MPU regions, we may want to
configure these regions in a central area. You may be interested to look at
Vincenzo's patches, which define MPU regions for a few ARM SOCs at boot:

https://gerrit.zephyrproject.org/r/#/q/topic:zephyr_mpu

This brings up another thing we need to consider: for thread-level debugging,
what is the minimum number of free MPU regions necessary to make it work?

For example, on an ARM Cortex-M, AFAIK you can only configure 8 MPU regions
total, each of which must be aligned to its size. If 5 or 6 of
them are already used for the system-wide MPU policy, it may not be feasible to
introduce all the thread-level protection ideas that have been discussed so far.

Looking at this another way, maybe we need to consider different levels of
memory protection support, each building on top of the previous level. What
level any given board target supports will be determined by the available memory
protection hardware and its capabilities, as well as how much extra RAM we can
waste to accommodate the alignment constraints of the MPU hardware:

1) No memory protection

2) System-wide memory protection policy, set at boot by board or SOC code.

3) Per-thread stack overflow protection. We configure the MPU, on a per-thread
basis, to trigger an exception if the thread tries to write past its available
stack space. I think this should only require 1 extra region, just a sentinel
area immediately before the thread's stack to catch writes, with the struct
k_thread stored elsewhere (a rough sketch appears at the end of this message).
I think this is something simple we can do which will make a lot of people
happy given how painful stack overflows can be to debug if you don't know
they are happening.

4) Per-thread memory protection. User threads can only write to their own stack
+ additional runtime-configurable memory regions. System calls to interact with
the kernel, whose memory is otherwise untouchable. Basically what we have been
talking about so far.

5) Virtualized memory (MMU). Application and kernel run in different virtualized
memory spaces. Introduce the possibility of different isolated zephyr processes.

We may never need or want to get as far as #5, although I think whatever design
we come up with ought to leave the door open for it.
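Here is the rough sketch of level 3 promised above: a small, MPU-aligned guard area below the stack, programmed no-access so the first write past the stack bottom faults (sizes, the struct layout, and the arch call are all invented):

#include <stdint.h>

#define STACK_GUARD_SIZE 32 /* must satisfy the MPU alignment rules */

/* With a descending stack, overflow writes land in the guard region */
struct protected_stack {
    uint8_t guard[STACK_GUARD_SIZE]
        __attribute__((aligned(STACK_GUARD_SIZE)));
    uint8_t stack[1024];
};

/* Hypothetical arch call: mark [base, base + size) as no-access */
extern void arch_mpu_set_guard(uintptr_t base, uint32_t size);

static struct protected_stack pstack;

void install_stack_guard(void)
{
    arch_mpu_set_guard((uintptr_t)pstack.guard, STACK_GUARD_SIZE);
}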

Andrew


Piotr Mienkowski
 


On 14.03.2017 20:08, Boie, Andrew P wrote:
On Tue, 2017-03-14 at 17:29 +0100, Piotr Mienkowski wrote:
I would like to add one more point to the discussion. It is maybe not directly
related to the topic but should likely be considered when designing MPU
support.

Occasionally, mainly in case of device drivers, on MCUs that have cache it is
required to use the so called non-cacheable RAM regions. A memory region for
which caching has been turned off. This task is typically done by MPU/MMU and
Zephyr MPU architecture should also support it. I.e. as a developer I would
like to have a possibility to place a specific variable / set of variables in
a non-cacheable RAM region.
This is a great topic to bring up. In addition to an MPU policy to protect
threads for debugging, we do need to specify a system-level policy that gets
applied at boot, even if we are not protecting individual threads.

Are you thinking that this would be something declared at the SOC level? I think
because of the size and alignment constraints of MPU regions, we may want to
configure these regions in a central area. You may be interested to look at
Vincenzo's patches, which define MPU regions for a few ARM SOCs at boot:

https://gerrit.zephyrproject.org/r/#/q/topic:zephyr_mpu
I was indeed thinking that non-cacheable RAM regions would be declared at the SoC level. It's simple and efficient. However, it's probably not compatible with per-thread memory protection as a security-feature model.

Thanks for the link. That looks interesting indeed, though to support non-cacheable RAM regions we would also need to modify the linker script. It would probably be best to do it at the same time as, or after, we touch the linker script to add support for all the features we are talking about here.
Looking at this another way, maybe we need to consider different levels of
memory protection support, each building on top of the previous level. What
level any given board target supports will be determined by the available memory
protection hardware and its capabilities, as well as how much extra RAM we can
waste to accommodate the alignment constraints of the MPU hardware:

1) No memory protection

2) System-wide memory protection policy, set at boot by board or SOC code.

3) Per-thread stack overflow protection. We configure the MPU, on a per-thread
basis, to trigger an exception if the thread tries to write past its available
stack space. I think this should only require 1 extra region, just a sentinel
area immediately before the thread's stack to catch writes, with the struct
k_thread stored elsewhere. I think this is something simple we can do which will
make a lot of people happy given how painful stack overflows can be to debug if
you don't know they are happening.

4) Per-thread memory protection. User threads can only write to their own stack
+ additional runtime-configurable memory regions. System calls to interact with
the kernel, whose memory is otherwise untouchable. Basically what we have been
talking about so far.

5) Virtualized memory (MMU). Application and kernel run in different virtualized
memory spaces. Introduce the possibility of different isolated zephyr processes.
That all sounds very reasonable. Each successive level of memory protection support builds upon the effort spent and experience gained at the previous one. The only danger with having multiple protection levels is that it may become opaque to the end user.

I have a question about the per-thread memory protection model. Maybe the answer is obvious. What about data passed between threads, like data buffers passed through FIFOs? E.g. our networking stack supports a zero-copy mechanism. A pointer to a data buffer that was filled in by a data link layer driver (typically working in interrupt context) is passed to the RX thread, a user application thread, maybe the TX thread. Are such data buffers going to live in a memory region that is accessible to all?

Regards,
Piotr