On Thu, Feb 18, 2016 at 05:54:56PM -0500, Benjamin Walsh wrote:
My take on it is that for Zephyr a failed device initialization should be considered a fatal event. My expectation is that the Zephyr user will enable only the devices that are relevant (and important) to their project. If one of these devices should fail, then that is a serious system error and _NanoFatalErrorHandler() should be invoked.
If this train of thought holds up to scrutiny, and if the aim is to save a few bytes then I would think that it would be better to have the device initialization routines return a failure code and have _sys_device_do_config_level() check for it and invoke the fatal error handler upon the detection of failure. Otherwise we duplicate the overhead of calling the fatal error handler in each device initialization routine.
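To make the idea concrete, here is a rough sketch of a central check. All names below (do_config_level(), fatal_error_handler(), the struct device layout) are simplified stand-ins I made up for illustration, not the actual kernel definitions, and the handler here just records the failing device so the sketch can be exercised on a host.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real kernel types. */
struct device;
typedef int (*init_fn_t)(struct device *dev);

struct device {
	const char *name;
	init_fn_t init;
};

/* Remembers which device tripped the handler. A real handler would not return. */
struct device *fatal_dev;

void fatal_error_handler(struct device *dev)
{
	fatal_dev = dev;
}

/*
 * Run every init routine of one level and check the result in ONE place,
 * so the individual drivers do not each carry a call to the handler.
 */
int do_config_level(struct device *devs, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		int rc = devs[i].init(&devs[i]);

		if (rc != 0) {
			fatal_error_handler(&devs[i]);
			return rc;
		}
	}
	return 0;
}

/* Two dummy init routines used to exercise the loop. */
int init_ok(struct device *dev) { (void)dev; return 0; }
int init_fail(struct device *dev) { (void)dev; return -1; }
```

The per-driver cost is then only the return statement; the compare and the call to the handler exist once, in the loop.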
Sorry for the slow response. I agree with Peter here: I think we should be checking the return value and doing something useful with the result. Maybe not _NanoFatalErrorHandler(), but something notifying the application that something bad happened. A given device not initializing may not be fatal to the whole application; it may just mean that one feature is currently unavailable.
For the kind of systems we are targeting, do we really expect the application to handle devices not initializing correctly, being designed so that parts are disabled if some parts of the initialization fail (devices or others)? Or do we expect applications to require everything to be present for them to function correctly? I would have thought the latter, but I can be convinced.
Delving into the realm of the hypothetical :-)
What about devices that have drivers in the system but are not present (pluggable), or that can't initialize because some resource external to the device can't be contacted (network server)?
The application may be able to still do useful work albeit with reduced functionality.
Then, if the latter, do we expect the application to catch the errors at runtime when deployed, or during development (basically catching software errors, mostly, not malfunctioning hardware)? Here, I was thinking the latter as well, which is why I was proposing __ASSERT() calls catching initialization errors in debug loads only. And this fits with one of the core values of the OS, which is small footprint.
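For reference, a minimal sketch of what "checked in debug loads only" could look like. INIT_ASSERT and INIT_DEBUG are hypothetical names standing in for __ASSERT() and whatever option enables it; this is not the kernel's macro, and it counts failures instead of halting so it can be run on a host.

```c
#include <stdio.h>

/* Hypothetical switch standing in for a debug-load configuration option. */
#ifndef INIT_DEBUG
#define INIT_DEBUG 1
#endif

int init_assert_failures;	/* lets the sketch observe a failed check */

#if INIT_DEBUG
/* Debug load: report the failed check (a real kernel would halt here). */
#define INIT_ASSERT(cond, msg)                                     \
	do {                                                       \
		if (!(cond)) {                                     \
			init_assert_failures++;                    \
			fprintf(stderr, "init assert: %s\n", msg); \
		}                                                  \
	} while (0)
#else
/* Production load: the check compiles to nothing, costing no code space. */
#define INIT_ASSERT(cond, msg) ((void)0)
#endif

/* Hypothetical hardware probe: returns 0 when the device responds. */
int probe_hw(int present)
{
	return present ? 0 : -1;
}

/* Init routine that reports problems only through the assertion. */
void sample_dev_init(int present)
{
	int rc = probe_hw(present);

	INIT_ASSERT(rc == 0, "sample_dev probe failed");
	(void)rc;	/* avoids an unused-variable warning when compiled out */
}
```

With INIT_DEBUG set to 0, sample_dev_init() carries no error-handling code at all, which is the footprint argument.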
Both models are useful for different reasons :-D
Any of those could be a valid approach I think, but we have to decide on one. And right now, we have the worst of them, since we return those error codes, which are meant for runtime handling, but they just go into the void.
Agreed, we need to pick one and stay with it for some amount of time until we see a few real uses/applications/platforms.
OK, following your proposal below, what we could put in place is standardizing on error codes that init routines must return if they want the kernel init system to automatically trigger a fatal error.
Then, we could also allow configuring out the error handling if someone needs to squeeze that last amount of space. One more Kconfig option! :) The error handling would be enabled by default of course.
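Something along these lines, as a sketch only. CONFIG_INIT_ERROR_HANDLING and INIT_ERR_FATAL are names invented for illustration (no such option or code exists today), and the handler just counts calls so the sketch can be exercised on a host.

```c
#include <stddef.h>

/*
 * Hypothetical option name. A Kconfig symbol would normally supply this;
 * it is defaulted here only so the sketch builds on its own.
 */
#ifndef CONFIG_INIT_ERROR_HANDLING
#define CONFIG_INIT_ERROR_HANDLING 1	/* enabled by default */
#endif

/* Hypothetical standardized code meaning "please trigger a fatal error". */
#define INIT_ERR_FATAL (-1)

int fatal_count;	/* stand-in for the effect of the fatal error handler */

void fatal_error_handler(void)
{
	fatal_count++;
}

#if CONFIG_INIT_ERROR_HANDLING
/* Check the standardized code and escalate. */
#define INIT_CHECK(rc)                         \
	do {                                   \
		if ((rc) == INIT_ERR_FATAL) {  \
			fatal_error_handler(); \
		}                              \
	} while (0)
#else
/* Error handling configured out: the init routine still runs, result dropped. */
#define INIT_CHECK(rc) ((void)(rc))
#endif

typedef int (*init_fn_t)(void);

void run_init_fns(const init_fn_t *fns, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		INIT_CHECK(fns[i]());
	}
}

/* Dummy init routines used to exercise the loop. */
int init_ok(void) { return 0; }
int init_fatal(void) { return INIT_ERR_FATAL; }
```

Both variants of INIT_CHECK evaluate their argument exactly once, so the init routine is called either way; only the compare and the handler call disappear when the option is off.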
For those non-fatal errors, what should we do for runtime driver behaviors? Should the drivers themselves fail API calls? Or should we let device_get_binding() return NULL?
How we could/should report this type of error is an open question :-).
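To compare the two options side by side, here is a small sketch. get_binding_checked() and sample_read() are hypothetical stand-ins (the first is not device_get_binding() itself), and the init_rc field is an assumption about where the init result would be kept.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device record that remembers the result of its init routine. */
struct device {
	const char *name;
	int init_rc;	/* 0 on success, negative errno-style value on failure */
};

/* Option 1: the lookup refuses to hand out a device whose init failed. */
struct device *get_binding_checked(struct device *devs, size_t count,
				   const char *name)
{
	for (size_t i = 0; i < count; i++) {
		if (strcmp(devs[i].name, name) == 0) {
			return devs[i].init_rc == 0 ? &devs[i] : NULL;
		}
	}
	return NULL;	/* no such device */
}

/* Option 2: the lookup succeeds, but the driver's API call fails. */
int sample_read(const struct device *dev, int *value)
{
	if (dev->init_rc != 0) {
		return -ENODEV;
	}
	*value = 42;	/* placeholder for a real register read */
	return 0;
}
```

Option 1 puts the check in one place but cannot distinguish "no such device" from "device failed to initialize"; option 2 keeps that distinction at the cost of a check in each driver call.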
Brainstorming:
If we want to let the application handle the initialization issues, we probably need some kind of queue that gets filled by the init system when init functions return errors, and that the application drains to see what failed. We might want to queue the associated device objects, and have an errno field in there, or something like that.
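A rough sketch of such a queue, assuming a small fixed-size ring buffer. The names (init_err_put(), init_err_get(), INIT_ERR_QUEUE_LEN) and the minimal struct device are made up for illustration, and a real version would need to consider locking if anything but the init thread fills it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical minimal device object. */
struct device {
	const char *name;
};

/* One queued failure: the device object plus an errno-style code. */
struct init_err {
	const struct device *dev;
	int err;
};

#define INIT_ERR_QUEUE_LEN 4	/* hypothetical capacity */

struct init_err init_errs[INIT_ERR_QUEUE_LEN];
size_t init_err_head;	/* index of the oldest entry */
size_t init_err_count;	/* number of entries currently queued */

/* Called by the init system when an init function returns an error. */
int init_err_put(const struct device *dev, int err)
{
	if (init_err_count == INIT_ERR_QUEUE_LEN) {
		return -ENOMEM;	/* queue full, failure is not recorded */
	}

	size_t tail = (init_err_head + init_err_count) % INIT_ERR_QUEUE_LEN;

	init_errs[tail].dev = dev;
	init_errs[tail].err = err;
	init_err_count++;
	return 0;
}

/* Called by the application to drain failures in the order they occurred. */
int init_err_get(struct init_err *out)
{
	if (init_err_count == 0) {
		return -ENOENT;	/* nothing (left) to report */
	}

	*out = init_errs[init_err_head];
	init_err_head = (init_err_head + 1) % INIT_ERR_QUEUE_LEN;
	init_err_count--;
	return 0;
}
```

The application would call init_err_get() in a loop at startup until it returns -ENOENT, and decide per entry whether it can continue with reduced functionality.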
How about having the driver return an error code saying whether the failure is fatal or not? For the drivers that we have now, we *know* that if one fails it is a hardware or configuration error, which is fatal, so we go with the _NanoFatalErrorHandler() error path.
That sounds good.
If a non-fatal error occurred (may work at next reset), just ignore it and move on. The application can detect if the device is dead/not present by the return codes from the driver call(s). Then the application can decide what to report to the user and how.
That's another way of doing it. It's a bit less explicit than a list of errors, but less overhead, and reuses what's already available.
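For what it's worth, the fatal/non-fatal split could be sketched like this. The enum values, run_inits(), and the usable flag are all hypothetical names for illustration, and the handler only counts calls so the sketch can be run on a host.

```c
#include <stddef.h>

/* Hypothetical result codes an init routine could return. */
enum init_result {
	INIT_OK = 0,
	INIT_ERR_NONFATAL,	/* e.g. device absent; may work at next reset */
	INIT_ERR_FATAL		/* hardware or configuration error */
};

struct device {
	const char *name;
	enum init_result (*init)(struct device *dev);
	int usable;	/* driver calls can consult this and return an error */
};

int fatal_calls;	/* stand-in for the effect of the fatal error handler */

void fatal_error_handler(void)
{
	fatal_calls++;
}

/* Returns the number of devices that initialized successfully. */
size_t run_inits(struct device *devs, size_t count)
{
	size_t usable = 0;

	for (size_t i = 0; i < count; i++) {
		enum init_result r = devs[i].init(&devs[i]);

		if (r == INIT_ERR_FATAL) {
			fatal_error_handler();	/* would not return in a real kernel */
			break;
		}

		/* Non-fatal: ignore it and move on, leaving the device unusable. */
		devs[i].usable = (r == INIT_OK);
		usable += (size_t)devs[i].usable;
	}
	return usable;
}

/* Dummy init routines used to exercise the loop. */
enum init_result init_ok(struct device *dev) { (void)dev; return INIT_OK; }
enum init_result init_absent(struct device *dev) { (void)dev; return INIT_ERR_NONFATAL; }
enum init_result init_broken(struct device *dev) { (void)dev; return INIT_ERR_FATAL; }
```

No list of errors is kept; the only extra state is one flag per device, which the driver's own calls can turn into a return code for the application.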