Re: RFC: return type of functions passed to DEVICE_INIT()


Thomas, Ramesh
 

On Fri, 2016-02-19 at 10:31 -0500, Benjamin Walsh wrote:
On Thu, Feb 18, 2016 at 10:49:01PM +0000, Thomas, Ramesh wrote:

On Thu, 2016-02-18 at 12:34 -0800, Dirk Brandewie wrote:

On 02/18/2016 10:03 AM, Benjamin Walsh wrote:

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.
How about a CONFIG option to determine if kernel should abort boot? The
What I was proposing in my last reply, too keep things simple, is one
Kconfig option to enable/disable handling of error codes by init
functions. If it's disabled, nothing is handled automatically and it
relies on everything working (when deployed) and __ASSERTs (when
debugging).

If it's enabled, some runtime handling is already needed, so you can
piggyback on it. One way is to standardize on a specific error code to
be returned by init functions to tell the kernel to abort boot. No need
for a Kconfig option for that case.
That should be fine. So the kernel decision to abort will be based on a
"fatal" error return from any of the device's init function.

We still need to decide what does "abort" mean for kernel. It could
mean -
1. Break out of _sys_device_do_config_level() when one device init
fails and continue boot. App takes necessary action when it finds device
inits had failed.
2. Kernel continues with all device inits and boot. Basically ignores
the error returned but devices have them logged so app will know when it
tries to use them.
3. Kernel itself aborts boot.

I prefer #2 or #1 (in that order) because that gives app flexibility of
dealing with the situation.

One more thing to note is that app will not get binding for all
required devices e.g. ioapic, loapic, sys_clock etc. So it might not
know if they failed unless it queries all devices explicitly which is
unlikely unless we provide some API for that. IMO these should never
fail and we need not consider any kernel action for this (so we can
still stay away from #3 above). Just wanted to bring this up as app's
ability to know device failure is being considered as part of the
design.


behavior should be determined by the kind of application. Some can live
with less features while for others it is all or nothing. In both cases
a logging feature that can be queried by the app would be useful.

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
+1 _sys_device_do_config_level() should at least pass on the error
returned even if one device fails. This as well as the
"resume_all/suspend_all" functions can be called by app in power
management case. It would be useful for the app to know that something
went wrong.

Question still remains whether _sys_device_do_config_level() should
continue with other devices if one fails. IMO it should continue but
should return error if one device had failed. At boot time, kernel can
save the error condition somewhere to be queried by the app.
The thing is that you can have several failures, one per-device object
basically. That's why I was proposing saving the errors in some field in
the device object and queuing them to be retrieved by the application.
Storing status in device object sounds good.

I was considering the possible use case of PM app calling
_sys_device_do_config_level(). In that case a return value of 0 or 1
would help it know some thing went wrong instead of having to query all
devices to find out. But I think this does not impact the main topic of
your RFC and can be addressed separately if necessary.


Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
The app would be anyway getting a failure when it tries to use the
device. It most probably cannot rectify any device error on the fly with
more detailed info. Just my thought

A simple status from _sys_device_do_config_level() is probably enough so
the app can decide whether to continue or abort.
This still has to be saved somewhere that can be retrieved by the app.
The kernel does not pass anything to main() currently, and in the
microkernel, there isn't necessarily a main() either.
Yes, passing in main() or task entry points are messy. Storing
somewhere for app to query is better. May be even provide an API to
help in the query if that helps.

Join devel@lists.zephyrproject.org to automatically receive all group messages.