Re: Fibers Become Unrunnable in Nanokernel


Michael Rosen
 

I am working off of Zephyr 1.5 in the nanokernel configuration on a
custom Intel Curie board. We have a number of fibers running on both
cores, but seem to be having trouble with fibers on the ARC core. We
haven't been able to pin down why, but occasionally (or sometimes
rapidly), a number of fibers seem to simply stop running. Diving into
the issue in GDB reveals that while all the fibers have roughly the
form below, some fibers don't appear to be sleeping NOR are they on
the runnable list:

void fiber(int a, int b) {
    struct nano_timer timer;

    nano_timer_init(&timer, NULL);

    while (1) {
        nano_timer_start(&timer, MSEC(FIBER_PERIOD_MS));
        ... // Do work
        nano_timer_test(&timer, TICKS_UNLIMITED);
    }
}

As far as I can tell, the "timer" is expired and the struct tcs's for
the fibers are not in the runnable list. All other fibers in the
system on ARC seem to be in the runnable list as expected. Also, from
some basic stack analysis, it appears that the unrunnable fibers are
still in the nano_timer_test function. One thing worth noting is that
most fibers are just doing some math and storing the result in memory,
but two of them are accessing a SPI and an I2C device. When these fibers
are prevented from accessing the devices, the system seems to run
smoothly; otherwise it doesn't. Has anything like this ever been
encountered before?

Note also that moving to Zephyr 1.6 would be significant effort as we
have implemented a number of custom drivers and other features that
would take a significant time to port.

This does not really solve your problem, but Zephyr 1.6 contains a legacy layer that provides all the APIs of the old kernels on top of the new kernel. It's not a NOP to move to 1.6, since you might have some issues with e.g. stack sizes, or some other idiosyncrasies, but it might be less painful than you think.

About your issue: the first thing I always suspect with weird behaviour like this is stack smashing. There is a Kconfig option for ARC that enables stack overflow/underflow detection. Do you have that option enabled?

I do have CONFIG_ARC_STACK_CHECKING enabled, and I have also examined _nanokernel and the struct tcs for each fiber (I had a stack overflow on x86 before and noticed that the struct tcs for that fiber got trampled; this doesn't appear to be the case here). I am not getting the exception, and all appears to be in order for the structs.

Interestingly, I am now able to get it to happen reliably in my codebase, and what I see doesn't look very good. Two of the six fibers are no longer running, and the runnable list (I stop x86 first and then ARC, resulting in all ARC timers expiring) contains the four running fibers. The two fibers that are no longer running are actually linked together (fiber_not_running_1_tcs.link points to fiber_not_running_2_tcs, and fiber_not_running_2_tcs has a null link pointer). It's almost as if the two non-running fibers are inserted into the list at the same time that another fiber comes along and inserts itself into the list. That is, the fiber reads the _nanokernel.fiber pointer (and sees that it is null), then a timer interrupt comes along and inserts the two expired fibers into the list, and on return the original fiber inserts itself at _nanokernel.fiber. I have no idea how this could happen, though. As I mentioned before, the fibers doing work are using semaphores to wait on I2C and SPI transactions, and if we prevent these transactions from taking place, everything runs fine (though it's hard to tell with timing bugs...).
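
To make the suspected interleaving concrete, here is a minimal host-side sketch in plain C. It is not Zephyr code: struct fake_tcs, runnable_head, timer_isr() and simulate() are hypothetical stand-ins for struct tcs, _nanokernel.fiber, the timer interrupt and the fiber insertion path, and the real scheduler's priority ordering is ignored. It only shows that an unprotected read-modify-write of the list head would leave exactly the state seen in GDB (two fibers linked to each other but unreachable from the head), and that deferring the interrupt until the insert completes avoids it.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct tcs: only the list link is modelled. */
struct fake_tcs {
    struct fake_tcs *link;
};

/* Hypothetical stand-in for _nanokernel.fiber (head of the runnable list). */
static struct fake_tcs *runnable_head;

/* Simulated timer ISR: chains two expired fibers and appends them to the list. */
static void timer_isr(struct fake_tcs *b, struct fake_tcs *c)
{
    b->link = c;
    c->link = NULL;

    if (runnable_head == NULL) {
        runnable_head = b;
        return;
    }

    struct fake_tcs *tail = runnable_head;
    while (tail->link != NULL) {
        tail = tail->link;
    }
    tail->link = b;
}

/* Count the fibers reachable from the list head. */
static int count_runnable(void)
{
    int n = 0;
    for (struct fake_tcs *t = runnable_head; t != NULL; t = t->link) {
        n++;
    }
    return n;
}

/*
 * Make fibers A, B and C runnable and return how many are reachable.
 * irqs_locked == 0: the ISR fires between A's read and A's write (the race).
 * irqs_locked != 0: the ISR is deferred until A's insert has completed.
 */
int simulate(int irqs_locked)
{
    struct fake_tcs a = { NULL }, b = { NULL }, c = { NULL };

    runnable_head = NULL;

    /* Step 1: fiber A samples the head and sees an empty list. */
    struct fake_tcs *seen = runnable_head;

    /* Step 2: the timer interrupt lands here if nothing holds it off. */
    if (!irqs_locked) {
        timer_isr(&b, &c);
    }

    /* Step 3: A finishes its insert using the (possibly stale) value. */
    a.link = seen;
    runnable_head = &a;

    /* With interrupts held off, the pending ISR runs only now. */
    if (irqs_locked) {
        timer_isr(&b, &c);
    }

    int reachable = count_runnable();
    runnable_head = NULL; /* do not leave pointers to locals behind */
    return reachable;
}
```

In this model simulate(0) leaves only A on the list, with B still pointing at C and C holding a null link, while simulate(1) leaves all three fibers reachable. On the real target the equivalent protection would be whatever interrupt locking the kernel applies around its runnable-list updates, so the sketch says nothing about where such a window could actually open up.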

Mike
