RFC: return type of functions passed to DEVICE_INIT()


Benjamin Walsh <benjamin.walsh@...>
 

Folks,

For some reason, the signature of functions passed to the DEVICE_INIT()
<init_fn> parameter has a return type of 'int', but the return value is
never checked within _sys_device_do_config_level(). Some init functions
do return an error code, such as the ARC init code and the bluetooth
init routines, but that just gets ignored.

Question: should we have init functions of return type 'void' then ?
That would shave a few bytes in every init function if we don't have to
return a value.

Cheers,
Ben

--
Benjamin Walsh, SMTS
Wind River Rocket
Zephyr kernel maintainer
www.windriver.com


Mitsis, Peter <Peter.Mitsis@...>
 

See [Peter]

-----Original Message-----
From: Benjamin Walsh [mailto:benjamin.walsh(a)windriver.com]
Sent: February-12-16 1:37 PM
To: devel(a)lists.zephyrproject.org
Subject: [devel] RFC: return type of functions passed to DEVICE_INIT()

Folks,

For some reason, the signature of functions passed to the DEVICE_INIT()
<init_fn> parameter has a return type of 'int', but the return value is never
checked within _sys_device_do_config_level(). Some init functions do return an
error code, such as the ARC init code and the bluetooth init routines, but that
just gets ignored.

Question: should we have init functions of return type 'void' then ?
That would shave a few bytes in every init function if we don't have to return a
value.
[Peter] - We generally try to operate under the assumption that failures will not occur.
That being said, we do have some instances where we do check for errors, and some of these are considered fatal.

My take on it is that for Zephyr a failed device initialization should be considered a fatal event.
My expectation is that the Zephyr user will only be enabling relevant (and important) devices to their project.
If one of these devices should fail, then that is a serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to save a few bytes then I would think that it would be better to have the device initialization routines return a failure code and have _sys_device_do_config_level() check for it and invoke the fatal error handler upon the detection of failure. Otherwise we duplicate the overhead of calling the fatal error handler in each device initialization routine.

Just my two cents.

Cheers,
Ben

--
Benjamin Walsh, SMTS
Wind River Rocket
Zephyr kernel maintainer
www.windriver.com


Benjamin Walsh <benjamin.walsh@...>
 

For some reason, the signature of functions passed to the DEVICE_INIT()
<init_fn> parameter has a return type of 'int', but the return value is never
checked within _sys_device_do_config_level(). Some init functions do return an
error code, such as the ARC init code and the bluetooth init routines, but that
just gets ignored.

Question: should we have init functions of return type 'void' then ?
That would shave a few bytes in every init function if we don't have to return a
value.
We generally try to operate under the assumption that failures will
not occur. That being said, we do have some instances where we do
check for errors, and some of these are considered fatal.

My take on it is that for Zephyr a failed device initialization should
be considered a fatal event. My expectation is that the Zephyr user
will only be enabling relevant (and important) devices to their
project. If one of these devices should fail, then that is a serious
system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Drivers' initialization routines can use __ASSERT() if they want to catch
an error in a debug build.


Benjamin Walsh <benjamin.walsh@...>
 

For some reason, the signature of functions passed to the
DEVICE_INIT() <init_fn> parameter has a return type of 'int', but
the return value is never checked within
_sys_device_do_config_level(). Some init functions do return an
error code, such as the ARC init code and the bluetooth init
routines, but that just gets ignored.

Question: should we have init functions of return type 'void' then
? That would shave a few bytes in every init function if we don't
have to return a value.
We generally try to operate under the assumption that failures will
not occur. That being said, we do have some instances where we do
check for errors, and some of these are considered fatal.

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices
to their project. If one of these devices should fail, then that is
a serious system error and _NanoFatalErrorHandler() should be
invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal
error handler upon the detection of failure. Otherwise we duplicate
the overhead of calling the fatal error handler in each device
initialization routine.
Drivers' initialization routines can use __ASSERT() if they want to
catch an error in a debug build.
I have not received any other feedback w.r.t. this: silence implies
consent ? :)


Iván Briano <ivan.briano at intel.com...>
 

On Wed, 17 Feb 2016 14:35:24 -0500, Benjamin Walsh wrote:
For some reason, the signature of functions passed to the
DEVICE_INIT() <init_fn> parameter has a return type of 'int', but
the return value is never checked within
_sys_device_do_config_level(). Some init functions do return an
error code, such as the ARC init code and the bluetooth init
routines, but that just gets ignored.

Question: should we have init functions of return type 'void' then
? That would shave a few bytes in every init function if we don't
have to return a value.
We generally try to operate under the assumption that failures will
not occur. That being said, we do have some instances where we do
check for errors, and some of these are considered fatal.

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices
to their project. If one of these devices should fail, then that is
a serious system error and _NanoFatalErrorHandler() should be
invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal
error handler upon the detection of failure. Otherwise we duplicate
the overhead of calling the fatal error handler in each device
initialization routine.
Drivers' initialization routines can use __ASSERT() if they want to
catch an error in a debug build.
I have not received any other feedback w.r.t. this: silence implies
consent ? :)
Fine by me, on both accounts.


Dirk Brandewie <dirk.j.brandewie@...>
 

On 02/12/2016 11:38 AM, Mitsis, Peter wrote:
See [Peter]

-----Original Message-----
From: Benjamin Walsh [mailto:benjamin.walsh(a)windriver.com]
Sent: February-12-16 1:37 PM
To: devel(a)lists.zephyrproject.org
Subject: [devel] RFC: return type of functions passed to DEVICE_INIT()

Folks,

For some reason, the signature of functions passed to the DEVICE_INIT()
<init_fn> parameter has a return type of 'int', but the return value is never
checked within _sys_device_do_config_level(). Some init functions do return an
error code, such as the ARC init code and the bluetooth init routines, but that
just gets ignored.

Question: should we have init functions of return type 'void' then ?
That would shave a few bytes in every init function if we don't have to return a
value.
[Peter] - We generally try to operate under the assumption that failures will not occur.
That being said, we do have some instances where we do check for errors, and some of these are considered fatal.

My take on it is that for Zephyr a failed device initialization should be considered a fatal event.
My expectation is that the Zephyr user will only be enabling relevant (and important) devices to their project.
If one of these devices should fail, then that is a serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to save a few bytes then I would think that it would be better to have the device initialization routines return a failure code and have _sys_device_do_config_level() check for it and invoke the fatal error handler upon the detection of failure. Otherwise we duplicate the overhead of calling the fatal error handler in each device initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.

How we could/should report this type of error is an open question :-).



Just my two cents.

Cheers,
Ben

--
Benjamin Walsh, SMTS
Wind River Rocket
Zephyr kernel maintainer
www.windriver.com


Benjamin Walsh <benjamin.walsh@...>
 

For some reason, the signature of functions passed to the
DEVICE_INIT() <init_fn> parameter has a return type of 'int', but
the return value is never checked within
_sys_device_do_config_level(). Some init functions do return an
error code, such as the ARC init code and the bluetooth init
routines, but that just gets ignored.

Question: should we have init functions of return type 'void' then ?
That would shave a few bytes in every init function if we don't have
to return a value.
We generally try to operate under the assumption that failures will
not occur. That being said, we do have some instances where we do
check for errors, and some of these are considered fatal.

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.

How we could/should report this type of error is an open question :-).
Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.


Dirk Brandewie <dirk.j.brandewie@...>
 

On 02/18/2016 10:03 AM, Benjamin Walsh wrote:

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.


How we could/should report this type of error is an open question :-).
Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
How about having the driver return an error code saying whether the
failure is a fatal error or not. For the drivers that we have now where
we *know* that if it fails it is a hardware or configuration error
which is fatal. So we go with the _NanoFatalErrorHandler() error path.

If a non-fatal error occurred (may work at next reset) just ignore it
an move on. The application can detect if the device is dead/not
present by the return codes from the driver call(s). Then the
application can decide what and how to report the error to the user.



Thomas, Ramesh
 

On Thu, 2016-02-18 at 12:34 -0800, Dirk Brandewie wrote:

On 02/18/2016 10:03 AM, Benjamin Walsh wrote:

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.
How about a CONFIG option to determine if kernel should abort boot? The
behavior should be determined by the kind of application. Some can live
with less features while for others it is all or nothing. In both cases
a logging feature that can be queried by the app would be useful.

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
+1 _sys_device_do_config_level() should at least pass on the error
returned even if one device fails. This as well as the
"resume_all/suspend_all" functions can be called by app in power
management case. It would be useful for the app to know that something
went wrong.

Question still remains whether _sys_device_do_config_level() should
continue with other devices if one fails. IMO it should continue but
should return error if one device had failed. At boot time, kernel can
save the error condition somewhere to be queried by the app.

Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
The app would be anyway getting a failure when it tries to use the
device. It most probably cannot rectify any device error on the fly with
more detailed info. Just my thought

A simple status from _sys_device_do_config_level() is probably enough so
the app can decide whether to continue or abort.


Benjamin Walsh <benjamin.walsh@...>
 

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.

How we could/should report this type of error is an open question :-).
Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
How about having the driver return an error code saying whether the
failure is a fatal error or not. For the drivers that we have now where
we *know* that if it fails it is a hardware or configuration error
which is fatal. So we go with the _NanoFatalErrorHandler() error path.
That sounds good.

If a non-fatal error occurred (may work at next reset) just ignore it
an move on. The application can detect if the device is dead/not
present by the return codes from the driver call(s). Then the
application can decide what and how to report the error to the user.
That's another way of doing it. It's a bit less explicit than a list
of errors, but less overhead, and reuses what's already available.


Daniel Leung <daniel.leung@...>
 

On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.
For those non-fatal errors, what should we do for runtime driver behaviors?
Should the drivers themselves fail API calls? Or should we let
device_get_binding() return NULL?

How we could/should report this type of error is an open question :-).
Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
How about having the driver return an error code saying whether the
failure is a fatal error or not. For the drivers that we have now where
we *know* that if it fails it is a hardware or configuration error
which is fatal. So we go with the _NanoFatalErrorHandler() error path.
That sounds good.

If a non-fatal error occurred (may work at next reset) just ignore it
an move on. The application can detect if the device is dead/not
present by the return codes from the driver call(s). Then the
application can decide what and how to report the error to the user.
That's another way of doing it. It's a bit less explicit than a list
of errors, but less overhead, and reuses what's already available.
-----
Daniel Leung


Benjamin Walsh <benjamin.walsh@...>
 

On Thu, Feb 18, 2016 at 10:49:01PM +0000, Thomas, Ramesh wrote:

On Thu, 2016-02-18 at 12:34 -0800, Dirk Brandewie wrote:

On 02/18/2016 10:03 AM, Benjamin Walsh wrote:

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.
How about a CONFIG option to determine if kernel should abort boot? The
What I was proposing in my last reply, too keep things simple, is one
Kconfig option to enable/disable handling of error codes by init
functions. If it's disabled, nothing is handled automatically and it
relies on everything working (when deployed) and __ASSERTs (when
debugging).

If it's enabled, some runtime handling is already needed, so you can
piggyback on it. One way is to standardize on a specific error code to
be returned by init functions to tell the kernel to abort boot. No need
for a Kconfig option for that case.

behavior should be determined by the kind of application. Some can live
with less features while for others it is all or nothing. In both cases
a logging feature that can be queried by the app would be useful.

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
+1 _sys_device_do_config_level() should at least pass on the error
returned even if one device fails. This as well as the
"resume_all/suspend_all" functions can be called by app in power
management case. It would be useful for the app to know that something
went wrong.

Question still remains whether _sys_device_do_config_level() should
continue with other devices if one fails. IMO it should continue but
should return error if one device had failed. At boot time, kernel can
save the error condition somewhere to be queried by the app.
The thing is that you can have several failures, one per-device object
basically. That's why I was proposing saving the errors in some field in
the device object and queuing them to be retrieved by the application.

Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
The app would be anyway getting a failure when it tries to use the
device. It most probably cannot rectify any device error on the fly with
more detailed info. Just my thought

A simple status from _sys_device_do_config_level() is probably enough so
the app can decide whether to continue or abort.
This still has to be saved somewhere that can be retrieved by the app.
The kernel does not pass anything to main() currently, and in the
microkernel, there isn't necessarily a main() either.


Dirk Brandewie <dirk.j.brandewie@...>
 

On 02/18/2016 03:04 PM, Daniel Leung wrote:
On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.
For those non-fatal errors, what should we do for runtime driver behaviors?
Should the drivers themselves fail API calls? Or should we let
device_get_binding() return NULL?
How about something like this:

diff --git a/kernel/nanokernel/device.c b/kernel/nanokernel/device.c
index f86f95f..82774c4 100644
--- a/kernel/nanokernel/device.c
+++ b/kernel/nanokernel/device.c
@@ -58,7 +58,7 @@ struct device *device_get_binding(char *name)
struct device *info;

for (info = __device_init_start; info != __device_init_end; info++) {
- if (!strcmp(name, info->config->name)) {
+ if (!strcmp(name, info->config->name) && info->driver_api) {
return info;
}
}

if there is a non-fatal error the driver will not mark itself available
and we can get rid of the null checks in the API headers since the user of the
driver will never get a reference to it if init() did not complete successfully


How we could/should report this type of error is an open question :-).
Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
How about having the driver return an error code saying whether the
failure is a fatal error or not. For the drivers that we have now where
we *know* that if it fails it is a hardware or configuration error
which is fatal. So we go with the _NanoFatalErrorHandler() error path.
That sounds good.

If a non-fatal error occurred (may work at next reset) just ignore it
an move on. The application can detect if the device is dead/not
present by the return codes from the driver call(s). Then the
application can decide what and how to report the error to the user.
That's another way of doing it. It's a bit less explicit than a list
of errors, but less overhead, and reuses what's already available.
-----
Daniel Leung


Daniel Leung <daniel.leung@...>
 

On Fri, Feb 19, 2016 at 08:31:48AM -0800, Dirk Brandewie wrote:


On 02/18/2016 03:04 PM, Daniel Leung wrote:
On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.
For those non-fatal errors, what should we do for runtime driver behaviors?
Should the drivers themselves fail API calls? Or should we let
device_get_binding() return NULL?
How about something like this:

diff --git a/kernel/nanokernel/device.c b/kernel/nanokernel/device.c
index f86f95f..82774c4 100644
--- a/kernel/nanokernel/device.c
+++ b/kernel/nanokernel/device.c
@@ -58,7 +58,7 @@ struct device *device_get_binding(char *name)
struct device *info;

for (info = __device_init_start; info != __device_init_end; info++) {
- if (!strcmp(name, info->config->name)) {
+ if (!strcmp(name, info->config->name) && info->driver_api) {
return info;
}
}

if there is a non-fatal error the driver will not mark itself available
and we can get rid of the null checks in the API headers since the user of the
driver will never get a reference to it if init() did not complete successfully
This should work. We need to rework some of the drivers as they make
the driver_api assignment unconditionally.


----------
Daniel Leung


Thomas, Ramesh
 

On Fri, 2016-02-19 at 10:31 -0500, Benjamin Walsh wrote:
On Thu, Feb 18, 2016 at 10:49:01PM +0000, Thomas, Ramesh wrote:

On Thu, 2016-02-18 at 12:34 -0800, Dirk Brandewie wrote:

On 02/18/2016 10:03 AM, Benjamin Walsh wrote:

My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.
How about a CONFIG option to determine if kernel should abort boot? The
What I was proposing in my last reply, too keep things simple, is one
Kconfig option to enable/disable handling of error codes by init
functions. If it's disabled, nothing is handled automatically and it
relies on everything working (when deployed) and __ASSERTs (when
debugging).

If it's enabled, some runtime handling is already needed, so you can
piggyback on it. One way is to standardize on a specific error code to
be returned by init functions to tell the kernel to abort boot. No need
for a Kconfig option for that case.
That should be fine. So the kernel decision to abort will be based on a
"fatal" error return from any of the device's init function.

We still need to decide what does "abort" mean for kernel. It could
mean -
1. Break out of _sys_device_do_config_level() when one device init
fails and continue boot. App takes necessary action when it finds device
inits had failed.
2. Kernel continues with all device inits and boot. Basically ignores
the error returned but devices have them logged so app will know when it
tries to use them.
3. Kernel itself aborts boot.

I prefer #2 or #1 (in that order) because that gives app flexibility of
dealing with the situation.

One more thing to note is that app will not get binding for all
required devices e.g. ioapic, loapic, sys_clock etc. So it might not
know if they failed unless it queries all devices explicitly which is
unlikely unless we provide some API for that. IMO these should never
fail and we need not consider any kernel action for this (so we can
still stay away from #3 above). Just wanted to bring this up as app's
ability to know device failure is being considered as part of the
design.


behavior should be determined by the kind of application. Some can live
with less features while for others it is all or nothing. In both cases
a logging feature that can be queried by the app would be useful.

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
+1 _sys_device_do_config_level() should at least pass on the error
returned even if one device fails. This as well as the
"resume_all/suspend_all" functions can be called by app in power
management case. It would be useful for the app to know that something
went wrong.

Question still remains whether _sys_device_do_config_level() should
continue with other devices if one fails. IMO it should continue but
should return error if one device had failed. At boot time, kernel can
save the error condition somewhere to be queried by the app.
The thing is that you can have several failures, one per-device object
basically. That's why I was proposing saving the errors in some field in
the device object and queuing them to be retrieved by the application.
Storing status in device object sounds good.

I was considering the possible use case of PM app calling
_sys_device_do_config_level(). In that case a return value of 0 or 1
would help it know some thing went wrong instead of having to query all
devices to find out. But I think this does not impact the main topic of
your RFC and can be addressed separately if necessary.


Brainstorming:

If we want to let the application handle the initialization issues, we
probably need some kind of queue that gets filled by the init system
when init functions return errors, and that the application drains to
see what failed. We might want to queue the associated device objects,
and have an errno field in there, or something like that.
The app would be anyway getting a failure when it tries to use the
device. It most probably cannot rectify any device error on the fly with
more detailed info. Just my thought

A simple status from _sys_device_do_config_level() is probably enough so
the app can decide whether to continue or abort.
This still has to be saved somewhere that can be retrieved by the app.
The kernel does not pass anything to main() currently, and in the
microkernel, there isn't necessarily a main() either.
Yes, passing in main() or task entry points are messy. Storing
somewhere for app to query is better. May be even provide an API to
help in the query if that helps.


Thomas, Ramesh
 

On Fri, 2016-02-19 at 10:05 -0800, Daniel Leung wrote:
On Fri, Feb 19, 2016 at 08:31:48AM -0800, Dirk Brandewie wrote:


On 02/18/2016 03:04 PM, Daniel Leung wrote:
On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.
For those non-fatal errors, what should we do for runtime driver behaviors?
Should the drivers themselves fail API calls? Or should we let
device_get_binding() return NULL?
How about something like this:

diff --git a/kernel/nanokernel/device.c b/kernel/nanokernel/device.c
index f86f95f..82774c4 100644
--- a/kernel/nanokernel/device.c
+++ b/kernel/nanokernel/device.c
@@ -58,7 +58,7 @@ struct device *device_get_binding(char *name)
struct device *info;

for (info = __device_init_start; info != __device_init_end; info++) {
- if (!strcmp(name, info->config->name)) {
+ if (!strcmp(name, info->config->name) && info->driver_api) {
return info;
}
}

if there is a non-fatal error the driver will not mark itself available
and we can get rid of the null checks in the API headers since the user of the
driver will never get a reference to it if init() did not complete successfully
Why not create a dedicated status field in device object?

Additionally we can provide an API to query status of devices. This
will help query status of devices it does not bind to.


This should work. We need to rework some of the drivers as they make
the driver_api assignment unconditionally.


----------
Daniel Leung


Dirk Brandewie <dirk.j.brandewie@...>
 

On 02/19/2016 05:31 PM, Thomas, Ramesh wrote:
On Fri, 2016-02-19 at 10:05 -0800, Daniel Leung wrote:
On Fri, Feb 19, 2016 at 08:31:48AM -0800, Dirk Brandewie wrote:


On 02/18/2016 03:04 PM, Daniel Leung wrote:
On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.
For those non-fatal errors, what should we do for runtime driver behaviors?
Should the drivers themselves fail API calls? Or should we let
device_get_binding() return NULL?
How about something like this:

diff --git a/kernel/nanokernel/device.c b/kernel/nanokernel/device.c
index f86f95f..82774c4 100644
--- a/kernel/nanokernel/device.c
+++ b/kernel/nanokernel/device.c
@@ -58,7 +58,7 @@ struct device *device_get_binding(char *name)
struct device *info;

for (info = __device_init_start; info != __device_init_end; info++) {
- if (!strcmp(name, info->config->name)) {
+ if (!strcmp(name, info->config->name) && info->driver_api) {
return info;
}
}

if there is a non-fatal error the driver will not mark itself available
and we can get rid of the null checks in the API headers since the user of the
driver will never get a reference to it if init() did not complete successfully
Why not create a dedicated status field in device object?
It would take more RAM :-) Not sure exactly what you would want to
return in the status? If device init() fails there is nothing the
application code can do about it other that fail gracefully.

Additionally we can provide an API to query status of devices. This
will help query status of devices it does not bind to.
If you are not binding to a device why do you need the status? If
there is a good use case propose the API with the patch ;-)


This should work. We need to rework some of the drivers as they make
the driver_api assignment unconditionally.


----------
Daniel Leung


Dirk Brandewie <dirk.j.brandewie@...>
 

On 02/19/2016 05:20 PM, Thomas, Ramesh wrote:
I was considering the possible use case of PM app calling
_sys_device_do_config_level(). In that case a return value of 0 or 1
would help it know some thing went wrong instead of having to query all
devices to find out. But I think this does not impact the main topic of
your RFC and can be addressed separately if necessary.
_sys_device_do_config_level() is *not* a public API and is *not*
guaranteed be stable or even exist from release to release.

AFAIK calling _sys_device_do_config_level() again would not break any of
our current drivers but is the driver is is creating some state in
init() (e.g. security context, network connection) all hell is likely
to break loose. The driver could maintain whether init() has been
called to work around this but at the cost of RAM.

What is the use case? What functionality do you need?

--Dirk


Thomas, Ramesh
 

On Mon, 2016-02-22 at 07:39 -0800, Dirk Brandewie wrote:

On 02/19/2016 05:31 PM, Thomas, Ramesh wrote:
On Fri, 2016-02-19 at 10:05 -0800, Daniel Leung wrote:
On Fri, Feb 19, 2016 at 08:31:48AM -0800, Dirk Brandewie wrote:


On 02/18/2016 03:04 PM, Daniel Leung wrote:
On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization
should be considered a fatal event. My expectation is that the
Zephyr user will only be enabling relevant (and important) devices to
their project. If one of these devices should fail, then that is a
serious system error and _NanoFatalErrorHandler() should be invoked.

If this train of thought holds up to scrutiny, and if the aim is to
save a few bytes then I would think that it would be better to have
the device initialization routines return a failure code and have
_sys_device_do_config_level() check for it and invoke the fatal error
handler upon the detection of failure. Otherwise we duplicate the
overhead of calling the fatal error handler in each device
initialization routine.
Sorry for the slow response. I agree with Peter here I think we should
be checking the return value and doing something useful with the
result. Maybe not _NanoFatalErrorHandler() but something notifying
the application that something bad happened. A given device not
initializing may not be fatal to the the whole application, just one
feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the
application to handle devices not initializing correctly, being designed
so that parts are disabled if some parts of the initialization fail
(devices or others), or do we expect applications to require everything
to be present for them to function correctly ? I would have thought the
latter, but I can be convinced.
Delving into the realm of the hypothetical :-)

What about devices that have drivers in the system but are not present
(pluggable) or can't initialize because some resource external to the
device can't be contacted (network server).

The application may be able to still do useful work albeit with reduced
functionality.

Then, if the latter, do we expect the application catching the errors at
runtime when deployed or during development (basically catching software
errors mostly) not malfunctionning hardware. Here, I was thinking the
latter as well, which is why I was proposing __ASSERT() calls catching
initialization errors in debug loads only. And this fits with one of the
core values of the OS, which is small footprint.
Both models are useful for different reasons :-D

Any of those could be a valid approach I think, but we have to decide on
one. And right now, we have the worst since we return those error codes
which are meant for runtime handling, but they just go into the void.
Agreed we need to pick and stay with it for some amount of time until
we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is
standardizing on error codes that init routines must return if they want
the kernel init system to automatically trigger a fatal error.

Then, we could also allow configuring out the error handling if someone
needs to squeeze that last amount of space. One more Kconfig option! :)
The error handling would be enabled by default of course.
For those non-fatal errors, what should we do for runtime driver behaviors?
Should the drivers themselves fail API calls? Or should we let
device_get_binding() return NULL?
How about something like this:

diff --git a/kernel/nanokernel/device.c b/kernel/nanokernel/device.c
index f86f95f..82774c4 100644
--- a/kernel/nanokernel/device.c
+++ b/kernel/nanokernel/device.c
@@ -58,7 +58,7 @@ struct device *device_get_binding(char *name)
struct device *info;

for (info = __device_init_start; info != __device_init_end; info++) {
- if (!strcmp(name, info->config->name)) {
+ if (!strcmp(name, info->config->name) && info->driver_api) {
return info;
}
}

if there is a non-fatal error the driver will not mark itself available
and we can get rid of the null checks in the API headers since the user of the
driver will never get a reference to it if init() did not complete successfully
Why not create a dedicated status field in device object?
It would take more RAM :-) Not sure exactly what you would want to
return in the status? If device init() fails there is nothing the
application code can do about it other that fail gracefully.
I do agree with you that the app cannot do much in failures. But then
kernel need not do even what you propose with failing binding. Let the
app figure out when it tries to use the port. Also drivers can provide
apis for diagnosis if they need e.g. applications that can be remotely
diagnosed and fixed.


Additionally we can provide an API to query status of devices. This
will help query status of devices it does not bind to.
If you are not binding to a device why do you need the status? If
there is a good use case propose the API with the patch ;-)
The API proposal was based on your proposal using get_binding(). Binding
will not work for devices that do not have a name e.g. loapic/ioapic.

But, I guess first we need to see if there is a real need for an error
logging mechanism with kernel involvment :-) It should be noted that
some error logging can be achieved by driver api itself (without kernel
help) if the application needs.


This should work. We need to rework some of the drivers as they make
the driver_api assignment unconditionally.


----------
Daniel Leung


Thomas, Ramesh
 

On Mon, 2016-02-22 at 07:55 -0800, Dirk Brandewie wrote:

On 02/19/2016 05:20 PM, Thomas, Ramesh wrote:
I was considering the possible use case of PM app calling
_sys_device_do_config_level(). In that case a return value of 0 or 1
would help it know some thing went wrong instead of having to query all
devices to find out. But I think this does not impact the main topic of
your RFC and can be addressed separately if necessary.
_sys_device_do_config_level() is *not* a public API and is *not*
guaranteed be stable or even exist from release to release.

AFAIK calling _sys_device_do_config_level() again would not break any of
our current drivers but is the driver is is creating some state in
init() (e.g. security context, network connection) all hell is likely
to break loose. The driver could maintain whether init() has been
called to work around this but at the cost of RAM.

What is the use case? What functionality do you need?
Thanks for pointing out the facts and issues :-) I agree it cannot be
used as a replacement for proper handling of resume() by devices.


--Dirk