(Big) problems achieving fair scheduling of a semaphore ISR producer - thread consumer case


Paul Sokolovsky
 

Hello,

This is a continuation of a discussion started in
https://gerrit.zephyrproject.org/r/12668 and on IRC yesterday.

The problem:
The recently contributed console_getchar() has an internal buffer which
easily overflows when a large enough chunk of data is received via
UART, which happens e.g.: a) when sending console input
programmatically from the host; b) when a clipboard is pasted into a
terminal app on the host.

Expected behavior:
A simple producer - consumer circular buffer synchronized using a
semaphore is used; the UART is slow (kHz), the consumer thread is fast
(MHz), hence no circular buffer overflow is expected.
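
For reference, a minimal sketch of that pattern on Zephyr's k_sem API
(buffer size and names are illustrative, not the actual getchar.c code):

#include <kernel.h>
#include <misc/printk.h>

#define BUFSIZE 16

static uint8_t buf[BUFSIZE];
static uint16_t head, tail;
static struct k_sem rx_sem;	/* k_sem_init(&rx_sem, 0, BUFSIZE) at startup */

/* Producer: called from the UART ISR for each received char. */
static void produce(uint8_t c)
{
	uint16_t next = (head + 1) % BUFSIZE;

	if (next == tail) {
		printk("Console buffer overflow - char dropped\n");
		return;
	}
	buf[head] = c;
	head = next;
	k_sem_give(&rx_sem);	/* wake the consumer thread */
}

/* Consumer: runs in thread context, cf. console_getchar(). */
static uint8_t consume(void)
{
	uint8_t c;

	k_sem_take(&rx_sem, K_FOREVER);	/* block until a char is available */
	c = buf[tail];
	tail = (tail + 1) % BUFSIZE;
	return c;
}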

Way to reproduce:
Go to samples/subsys/console/getchar. Build the example, deploy, run.
Can be tested with QEMU too. When started, select this paragraph from
the email (starting with "Go to ..."), copy to clipboard, paste. Behold,
a big number of "Console buffer overflow - char dropped" messages
appears, and not all characters have been pasted.

Discussion:
Even from the reproduction steps above, if you look at what *has* been
pasted, you'll see it's the initial chars of the content, so it looks
(and was later confirmed) that the following happens: instead of the
producer and consumer working in tandem, initially mostly the producer
gets run in a loop, overflowing the buffer, and only then the consumer
kicks in. The producer is a function run from an ISR:
https://gerrit.zephyrproject.org/r/gitweb?p=zephyr.git;a=blob;f=subsys/console/getchar.c;hb=refs/heads/master#l21 ,
the consumer is a function run in the Zephyr main thread below it.

Let's do the math: UART speed is 115200 baud / (1 + 8 + 1 bits) = 11520
bytes/s, so the ISR runs with an average frequency of 11.5 kHz. My test
hardware is e.g. BOARD=96b_carbon, running at 84MHz. For each ISR call,
there are 7304 instructions of scheduler and main thread to run on
average. That should be plenty to run the simple consumer above.
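
A back-of-the-envelope check of these figures (illustrative only; the
7304 figure comes from rounding the ISR rate to 11.5 kHz):

#include <stdio.h>

int main(void)
{
	const double baud = 115200.0;       /* UART line rate */
	const double bits_per_char = 10.0;  /* 1 start + 8 data + 1 stop */
	const double cpu_hz = 84e6;         /* 96b_carbon core clock */

	double chars_per_s = baud / bits_per_char;  /* 11520 chars/s */
	double budget = cpu_hz / 11.5e3;            /* ~7304 cycles per char */

	printf("%.0f chars/s, ~%.0f cycles of budget per char\n",
	       chars_per_s, budget);
	return 0;
}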

We should note the effects of a possible UART hardware buffer. The
sample above uses a rather small circular buffer, but in a real app
it's 256 bytes, and growing it further doesn't help. Even with a HW
buffer, the sustained char production rate is still 11.5 kHz, and the
SW buffer can't keep up with that due to the strong bias towards
production, with heavily throttled consumption.

History of investigation:
1. I initially thought that maybe Zephyr needs to be prodded to do
thread rescheduling explicitly, found k_yield, dropped it in, and
rejoiced at seeing the issue fixed for BOARD=qemu_x86! Later I noticed
it still doesn't fix it for BOARD=qemu_cortex_m3. To start a
discussion, I submitted the patch as is, even though k_yield is
documented as working on threads (not ISRs). The patch was merged,
later reverted; the discussion started either way, hopefully ;-).

2. One of the initial ideas was that maybe the buffer DoSing was an
artifact of QEMU. The behavior is however confirmed for 96b_carbon.

3. One of the early suggestions I got on IRC was to do flow control by
disabling/enabling UART IRQs. I wasn't happy about that suggestion, as
the code above is generic console buffering, and with several thousand
cycles available to process each char, everything should "just work".

4. I did a lot of investigation yesterday (with ARM, as that's the hard
case, which isn't "fixed" by the incorrect fix). There was a suspicion
that context switching, implemented with the PendSV interrupt, doesn't
always kick in. In the end, I had to discard my results, as I used
printk output for diagnosing it, which is slow and makes it a normal
case that if I report a received char with printk, then afterwards
there's another char in the buf, causing the ISR to run immediately
again, so the PendSV handler can't run. (To elaborate, the behavior in
the absence of printk still resembles PendSV not running often enough,
it just can't be proven in the presence of printk's.)

5. I briefly tried the "kernel event logger" facility. Well, "briefly"
doesn't say it right: I spent a couple of hours trying to make it
useful for my case and hunting down issues (patches submitted and
merged), so at one point I decided to not look for more bugs and
switched back to printk. Perhaps following this path further makes
sense though, to disambiguate the issues in p.4.

6. In the end I tried the suggestion of UART IRQ disabling (a sketch of
the idea follows after this list). If you read this far, you won't be
surprised that it didn't work. Even after I disable the IRQ, my ISR
routine still gets called one more time. This leaves the production vs
consumption rate at 2:1, still leading to buffer overflow, and it
confirms my concern with adding such code to a generic console routine:
it leaves it at the mercy of a particular hardware implementation (IRQ
disabling/enabling might not be implemented, for example).

7. With some creativity, I was able to find a USB-UART adapter to try
the sample on arduino_101. So, while the incorrect fix fixes the issue
completely for qemu_x86, on real hardware it doesn't - while there are
no buffer overflow messages, the majority of chars are dropped.
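
(Referring back to p.6: for concreteness, the IRQ-throttling idea would
look roughly like the sketch below. It uses Zephyr's uart.h IRQ API;
buf_full()/buf_put() and rx_sem are illustrative names, not the actual
console code.)

#include <uart.h>

static void console_isr(struct device *dev)
{
	uint8_t c;

	uart_irq_update(dev);
	while (uart_irq_rx_ready(dev)) {
		if (buf_full()) {
			/* Throttle the producer; the consumer re-enables the
			 * IRQ after draining. As observed in p.6, the ISR can
			 * still fire one more time after this. */
			uart_irq_rx_disable(dev);
			break;
		}
		uart_fifo_read(dev, &c, 1);
		buf_put(c);
		k_sem_give(&rx_sem);
	}
}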




So, it's a typical "nothing works" type of bug each of us has faced and
enjoyed. I really need a 2nd (3rd, 4th) opinion on this. My kind request
though is: please don't try to shift it all on the UART and talk about
the need for DTR/RTS, etc. The UART is a damn slow device by current
standards, which should "just work" with dozens-of-MHz CPUs without
any flow control on the hardware side.

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Daniel Thompson <daniel.thompson@...>
 

On 07/04/17 07:04, Paul Sokolovsky wrote:
Way to reproduce:
Go to samples/subsys/console/getchar. Build the example, deploy, run.
Can be tested with QEMU too. When started, select this paragraph from
the email (starting with "Go to ..."), copy to clipboard, paste. Behold,
a big number of "Console buffer overflow - char dropped" messages
appears, and not all characters have been pasted.
I did a quick test on a carbon and I can't reproduce with the getchar example.

I *am* seeing overflows but the overflows I observe are what would be expected based on the quantity of input buffering (16) and the depth of output buffering (1).

The getchar output emits ~16 characters for each one received so we expect the overflow to occur in the second line of output (which it does). Increasing the input buffer to 32 results in the overflow occurring in the third line of output.


So, it's a typical "nothing works" type of bug each of us has faced and
enjoyed. I really need a 2nd (3rd, 4th) opinion on this. My kind request
though is: please don't try to shift it all on the UART and talk about
the need for DTR/RTS, etc. The UART is a damn slow device by current
standards, which should "just work" with dozens-of-MHz CPUs without
any flow control on the hardware side.
If the UART TX side produces more characters than it receives in input
then flow control (or very large buffers) is the *only* way to avoid
losing characters when the input is running at maximum speed.

Anything with a character echo and a longer message on each '\n' will eventually overflow unless we use flow control (or bound the input).

One possible mitigation would be to provide an API to determine if the pull-console is backlogging. If that happens then an application might choose to disable its character echo to help it catch up.
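
Purely as a sketch - console_rx_backlog() below is a made-up name, no
such API exists today:

#define BACKLOG_HIGH_WATER 8	/* illustrative threshold */

void app_loop(void)
{
	while (1) {
		uint8_t c = console_getchar();

		/* console_rx_backlog() is hypothetical - it does not exist */
		if (console_rx_backlog() < BACKLOG_HIGH_WATER) {
			printk("%c", c);	/* echo only when caught up */
		}
	}
}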

I remember yesterday you said you had disabled all the local echoes... did you reach a point where the TX/RX ratio was less than 1:1?


Daniel.


Paul Sokolovsky
 

On Fri, 7 Apr 2017 09:06:14 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

So, it's a typical "nothing works" type of bug each of us has faced
and enjoyed. I really need a 2nd (3rd, 4th) opinion on this. My kind
request though is: please don't try to shift it all on the UART and
talk about the need for DTR/RTS, etc. The UART is a damn slow device
by current standards, which should "just work" with dozens-of-MHz
CPUs without any flow control on the hardware side.
If the UART TX side produces more characters than it receives in
input then flow control (or very large buffers) is the *only* way to
avoid losing characters when the input is running at maximum speed.
OMG, so do you want to say that the problem is that we use busy-polling
for UART TX??


--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Paul Sokolovsky
 

On Fri, 7 Apr 2017 09:06:14 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

So, it's a typical "nothing works" type of bug each of us has faced
and enjoyed. I really need a 2nd (3rd, 4th) opinion on this. My kind
request though is: please don't try to shift it all on the UART and
talk about the need for DTR/RTS, etc. The UART is a damn slow device
by current standards, which should "just work" with dozens-of-MHz
CPUs without any flow control on the hardware side.
If the UART TX side produces more characters than it receives in
input then flow control (or very large buffers) is the *only* way to
avoid losing characters when the input is running at maximum speed.

Anything with a character echo and a longer message on each '\n' will
eventually overflow unless we use flow control (or bound the input).

One possible mitigation would be to provide an API to determine
if the pull-console is backlogging. If that happens then an
application might choose to disable its character echo to help it
catch up.

I remember yesterday you said you had disabled all the local echoes...
did you reach a point where the TX/RX ratio was less than 1:1?
Ok, master, the following patch:

--- a/samples/subsys/console/getchar/src/main.c
+++ b/samples/subsys/console/getchar/src/main.c
@@ -9,6 +9,6 @@ void main(void)
 	while (1) {
 		uint8_t c = console_getchar();
 
-		printk("char: [0x%x] %c\n", c, c);
+//		printk("char: [0x%x] %c\n", c, c);
 	}
 }

...
Console buffer overflow - char dropped
Console buffer overflow - char dropped
...

So, kindly let me keep repeating it again - the problem is not with the
UART, the problem is with scheduling a thread which is woken up on a
sema signaled in an ISR. (Yes, it would be nice to reproduce that with
another IRQ than UART.)

The only way I was able to achieve a 1:1 ratio is by applying the
incorrect k_yield patch to the qemu_x86 build - that works flawlessly
and perfectly (printing "+" on k_sem_give, "-" on k_sem_take produces
the expected pattern of "+-+-+-+-+-", whereas in other cases the
pattern is like
"+++++++++++++++++++++++++++++++++++-+++++++++++-++++++++++++++-").





--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Daniel Thompson <daniel.thompson@...>
 

On 07/04/17 09:21, Paul Sokolovsky wrote:
On Fri, 7 Apr 2017 09:06:14 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

So, it's a typical "nothing works" type of bug each of us has faced
and enjoyed. I really need a 2nd (3rd, 4th) opinion on this. My kind
request though is: please don't try to shift it all on the UART and
talk about the need for DTR/RTS, etc. The UART is a damn slow device
by current standards, which should "just work" with dozens-of-MHz
CPUs without any flow control on the hardware side.
If the UART TX side produces more characters than it receives in
input then flow control (or very large buffers) is the *only* way to
avoid losing characters when the input is running at maximum speed.
OMG, so do you want to say that the problem is that we use busy-polling
for UART TX??
Yes, printk() eventually boils down to uart_poll_out() and will spin the thread that calls it until the message is sent to the UART. printk() is arguably also *correct*: if you buffer printk() messages you have problems getting reliable output from interrupt handlers.
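
For illustration, a simplified sketch of what such a busy-polled
poll-out amounts to (the register layout and flag are made up, not any
specific SoC's driver):

struct fake_uart {
	volatile unsigned int status;
	volatile unsigned int data;
};

#define TX_READY (1u << 0)	/* set when 'data' can take a new char */

/* The calling thread burns a full char-time (~87 us at 115200 baud)
 * per character it prints. */
static void poll_out_sketch(struct fake_uart *u, unsigned char c)
{
	while (!(u->status & TX_READY)) {
		/* spin until the transmit register is free */
	}
	u->data = c;
}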

In any case, adding push support to the console might improve the power/performance of console applications (which would spend less time spinning), but it cannot solve the console overflow problem when the TX/RX ratio is >1, since you would eventually exhaust the output buffer too.


Daniel.


Daniel Thompson <daniel.thompson@...>
 

On 07/04/17 09:31, Paul Sokolovsky wrote:
On Fri, 7 Apr 2017 09:06:14 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

So, it's a typical "nothing works" type of bug each of us has faced
and enjoyed. I really need a 2nd (3rd, 4th) opinion on this. My kind
request though is: please don't try to shift it all on the UART and
talk about the need for DTR/RTS, etc. The UART is a damn slow device
by current standards, which should "just work" with dozens-of-MHz
CPUs without any flow control on the hardware side.
If the UART TX side produces more characters than it receives in
input then flow control (or very large buffers) is the *only* way to
avoid losing characters when the input is running at maximum speed.

Anything with a character echo and a longer message on each '\n' will
eventually overflow unless we use flow control (or bound the input).

One possible mitigation would be to provide an API to determine
if the pull-console is backlogging. If that happens then an
application might choose to disable its character echo to help it
catch up.

I remember yesterday you said you had disabled all the local echoes...
did you reach a point where the TX/RX ratio was less than 1:1?
Ok, master, the following patch:
;-)

I've been very careful not to say you're wrong. I *only* said the example you brought up was not a reproduction of an incorrect thread schedule.


--- a/samples/subsys/console/getchar/src/main.c
+++ b/samples/subsys/console/getchar/src/main.c
@@ -9,6 +9,6 @@ void main(void)
 	while (1) {
 		uint8_t c = console_getchar();
 
-		printk("char: [0x%x] %c\n", c, c);
+//		printk("char: [0x%x] %c\n", c, c);
 	}
 }

...
Console buffer overflow - char dropped
Console buffer overflow - char dropped
...
Did you run this on carbon? I've not been able to reproduce on this board.

It *does* reproduce on qemu_cortex_m3 but the UART simulation looks a bit dubious. I can't directly test the input but the simulated UART is *very* obviously producing characters way faster than the normal baud rate... it appears to simply be running as fast as it can.


Daniel.


Paul Sokolovsky
 

On Fri, 7 Apr 2017 09:56:30 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

I remember yesterday you said you had disabled all the local
echoes... did you reach a point where the TX/RX ratio was less than
1:1?
Ok, master, the following patch:
;-)

I've been very careful not to say you're wrong. I *only* said the
example you brought up was not a reproduction of an incorrect thread
schedule.
Well, I'd love to be pointed at where I'm wrong and be shown how to do
it right. For the background, receiving a char in an interrupt or from
a callback, storing it in a buffer, and then letting the main
application pull chars from the buffer (or block on it) is a very
mundane task, with many systems working like that. In MicroPython, we
have that implemented on bare metal (well, via STM32Cube/CC3200 HAL
libs), in Xtensa's proprietary cooperative multitasking RTOS (well,
backed by bare-metal handling too), in STM32 USB-CDC, and all that
"just works". Zephyr is so far the only one (of those tried) which has
problems.

As the UART is a pretty simple device and you can do only so much about
handling it, and I assume all of that was already done in Zephyr, I
blame the threading/sync primitives, as the much more differentiating
matter. I may be very wrong (and am ready to "eat my hat").



--- a/samples/subsys/console/getchar/src/main.c
+++ b/samples/subsys/console/getchar/src/main.c
@@ -9,6 +9,6 @@ void main(void)
 	while (1) {
 		uint8_t c = console_getchar();
 
-		printk("char: [0x%x] %c\n", c, c);
+//		printk("char: [0x%x] %c\n", c, c);
 	}
 }

...
Console buffer overflow - char dropped
Console buffer overflow - char dropped
...
Did you run this on carbon? I've not been able to reproduce on this
board.
Ok, now I did. With the patch above, I can't reproduce the overflow
message. But now let's try to echo the input char, and for a
difference, set CONFIG_CONSOLE_GETCHAR_BUFSIZE=64 so we don't suspect
too short a buffer.

I still can't reproduce the overflow message, but now we're entering
the land of obvious UART problems. So, to keep a 1:1 rx/tx ratio, we
translate the input CR using picocom:

picocom -b115200 --omap crcrlf --emap "" /dev/ttyUSB0

Now paste the whole "Ok, now I did ..." para above. The behavior: the
first subchunk is pasted, then a noticeable pause, then the second
subchunk appears, at which point the board no longer reacts to input
and appears dead.

Blame "my" code? Well, fire up the classical samples/subsys/shell, and
paste there:

Attempt 1:
shell> Ok, now I id. With
(lockup)

Attempt 2:
shell> Ok, now Idid. With atch abov, I can't eproduce te overflo
mssage.But now le's try toecho the iput char, nd for a
dference, et CONFIG_ONSOLE_GECHAR_BUFSIE=64 so wedidn't susect

Attempt 3:
I could paste the whole thing 5 times, with wildly lost chars, before
it finally gave up and locked up.

All that is nothing new. See e.g.
https://jira.zephyrproject.org/browse/ZEP-467 and the patch which
"fixed" it. I write "fixed" because, well, the hang indeed seems to be
fixed on arduino_101 \o/, but it all still works, umm, in the expected
way:

shell> Now paste the whole "Ok, now I did ..." para above. The behavor: frst
sbchun bein past, th notieablepause thensecon subcunk
apears at wich pnt bord nolonge reacs to nput nd apeaToro msan

(But again, it passed the "UART ship it" test of being pasted into for
a minute without hangs.)

So, we can probably come to a conclusion: for reliable operation, both
UART rx *and* tx should be buffered and interrupt-driven.
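
(For the tx side, that could look roughly like the sketch below, on
Zephyr's uart.h IRQ API; tx_put()/tx_get()/tx_empty() are illustrative
ring-buffer helpers, and this is a sketch of the direction, not a
tested driver.)

#include <uart.h>

/* ISR side: drain the SW ring buffer into the UART while it has room. */
static void uart_isr(struct device *dev)
{
	uint8_t c;

	uart_irq_update(dev);
	while (uart_irq_tx_ready(dev)) {
		if (tx_empty()) {
			uart_irq_tx_disable(dev);	/* nothing left to send */
			break;
		}
		c = tx_get();
		uart_fifo_fill(dev, &c, 1);
	}
}

/* Thread side: enqueue and make sure the TX interrupt is armed. */
void uart_buffered_out(struct device *dev, uint8_t c)
{
	tx_put(c);
	uart_irq_tx_enable(dev);
}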

It *does* reproduce on qemu_cortex_m3 but the UART simulation looks a
bit dubious. I can't directly test the input but the simulated UART
is *very* obviously producing characters way faster than the normal
baud rate... it appears to simply be running as fast as it can.
So, we can come to a 2nd conclusion - the QEMU emulation can't be
trusted much. That's the saddest part, because I was interested first
and foremost in the local QEMU test automation. I mean, the saddest
part is that it's unclear what to do about it. Do you mean that you
looked at the source of the old-old QEMU version used by the Zephyr SDK
and saw this simplistic UART emulation? That "old-old" part is what
deterred me from doing the same; I guess I should try QEMU 2.8 after
all. But if it's not fixed there, I'm at a loss again - should I try
all the gazillion forks around, or try to fix it myself against
mainline (I didn't hack on qemu for a while)?..





--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Daniel Thompson <daniel.thompson@...>
 

On 07/04/17 11:21, Paul Sokolovsky wrote:
--- a/samples/subsys/console/getchar/src/main.c
+++ b/samples/subsys/console/getchar/src/main.c
@@ -9,6 +9,6 @@ void main(void)
 	while (1) {
 		uint8_t c = console_getchar();
 
-		printk("char: [0x%x] %c\n", c, c);
+//		printk("char: [0x%x] %c\n", c, c);
 	}
 }

...
Console buffer overflow - char dropped
Console buffer overflow - char dropped
...
Did you run this on carbon? I've not been able to reproduce on this
board.
Ok, now I did. With the patch above, I can't reproduce the overflow
message. But now let's try to echo the input char, and for a
difference, set CONFIG_CONSOLE_GETCHAR_BUFSIZE=64 so we don't suspect
too short a buffer.

I still can't reproduce the overflow message,
... funny you should say that.

I've been taking a quick glance at the vendor HAL here. It implements some kind of weird private locking scheme that looks like it doesn't play nicely with a pre-emptive RTOS (or even with interrupt handlers).

I think this means that when we are transmitting a character from a task it becomes unlikely that the system will emit an error message from the ISR (including the overflow message).

That doesn't mean overflow is happening. Just that the serial driver looks broken enough to make such testing unreliable.

but now we're entering the land of
obvious UART problems. So, to keep a 1:1 rx/tx ratio, we translate the
input CR using picocom:

picocom -b115200 --omap crcrlf --emap "" /dev/ttyUSB0

Now paste the whole "Ok, now I did ..." para above. The behavior: the
first subchunk is pasted, then a noticeable pause, then the second
subchunk appears, at which point the board no longer reacts to input
and appears dead.

Blame "my" code? Well, fire up the classical samples/subsys/shell, and
paste there:

Attempt 1:
shell> Ok, now I id. With
(lockup)

Attempt 2:
shell> Ok, now Idid. With atch abov, I can't eproduce te overflo
mssage.But now le's try toecho the iput char, nd for a
dference, et CONFIG_ONSOLE_GECHAR_BUFSIE=64 so wedidn't susect

Attempt 3:
I could paste the whole thing 5 times, with wildly lost chars, before
it finally gave up and locked up.

All that is nothing new. See e.g.
https://jira.zephyrproject.org/browse/ZEP-467 and the patch which
"fixed" it. I write "fixed" because, well, the hang indeed seems to be
fixed on arduino_101 \o/, but it all still works, umm, in the expected
way:

shell> Now paste the whole "Ok, now I did ..." para above. The behavor: frst
sbchun bein past, th notieablepause thensecon subcunk
apears at wich pnt bord nolonge reacs to nput nd apeaToro msan

(But again, it passed the "UART ship it" test of being pasted into for
a minute without hangs.)

So, we can probably come to a conclusion: for reliable operation, both
UART rx *and* tx should be buffered and interrupt-driven.
I don't understand how the evidence leads to this conclusion.

Personally I'm still rather concerned that the driver may not be robust (although, just as with you blaming the scheduler, I haven't collected much evidence to support this).

I've done a quick review of the driver and so far I haven't seen anything that explains the character loss (although it would be good to neutralize the private locking so we can see the output from the ISR).

I'm afraid that, for me, time is up thinking about this for a while. However, if I were looking at this further, I'd consider reinstating the old UART driver (the HAL is sufficiently complexified that it becomes hard to reason about) and see if it behaves the same way...


It *does* reproduce on qemu_cortex_m3 but the UART simulation looks a
bit dubious. I can't directly test the input but the simulated UART
is *very* obviously producing characters way faster than the normal
baud rate... it appears to simply be running as fast as it can.
So, we can come to a 2nd conclusion - the QEMU emulation can't be
trusted much. That's the saddest part, because I was interested first
and foremost in the local QEMU test automation. I mean, the saddest
part is that it's unclear what to do about it. Do you mean that you
looked at the source of the old-old QEMU version used by the Zephyr SDK
and saw this simplistic UART emulation?
No. I simply noticed that it created *pages* of output really quickly (<3ms).

That "old-old" part is what deterred me from
doing the same, I guess I should try QEMU 2.8 after all. But if it's
not fixed there, I'm at loss again - should I try all the gazillion
forks around, or try to fix it myself against mainline (didn't hack on
qemu for a while)?..
Fixing qemu would be awesome (you know I like to see upstream bug fixes rather than moaning). If you can reproduce in mainline I'd start there and then request a backport...


Daniel.


Andy Ross
 

Paul Sokolovsky <paul.sokolovsky@linaro.org> wrote (on Thursday, April 06, 2017 11:04PM):
My kind request though is: please don't try to shift it all on the UART
and talk about the need for DTR/RTS, etc. The UART is a damn slow device
by current standards, which should "just work" with dozens-of-MHz
CPUs without any flow control on the hardware side.
I'm not completely sure I buy that. If your consumer thread isn't
being scheduled due to all the UART interrupts, the CPU is clearly
*not* fast enough to handle the bytes arriving at the line rate (maybe
that's something we should fix, though -- have you looked to see where
the cycles are going?).

I mean... I know you don't want to hear this but the situation you
have is precisely what buffering and hardware flow control are
designed to solve. Make the buffer big enough to handle the sudden
flood, or flag your buffer-full condition to the other side with
RTS/CTS (I know this won't work on lots of console devices which don't
wire the flow control lines). Or if you can, adjust the protocol to
include a software flow control mechanism like an ACK with a max
"packet" size (not XON/XOFF though, that never works).

Andy


Paul Sokolovsky
 

Hello Daniel,

On Fri, 7 Apr 2017 12:38:34 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

...
Console buffer overflow - char dropped
Console buffer overflow - char dropped
...
Did you run this on carbon? I've not been able to reproduce on this
board.
Ok, now I did. With the patch above, I can't reproduce the overflow
message. But now let's try to echo the input char, and for a
difference, set CONFIG_CONSOLE_GETCHAR_BUFSIZE=64 so we don't
suspect too short a buffer.

I still can't reproduce the overflow message,
... funny you should say that.

I've been taking a quick glance at the vendor HAL here. It implements
some kind of weird private locking scheme that looks like it doesn't
play nicely with a pre-emptive RTOS (or even with interrupt handlers).
Ack, thanks, I see it.

[]

So, we can probably come to a conclusion: for reliable operation, both
UART rx *and* tx should be buffered and interrupt-driven.
I don't understand how the evidence leads to this conclusion.
Well, it's reasoning from the opposite direction; let's recount:

1. As was quoted, MicroPython has multiple implementations of that
handling for multiple environments, all "just working". Output
buffering is used at all times though.

2. Pure mathematical reasoning: if a char at a given baud rate is
received every X cycles, then busy-polling to send a char will take
those same X cycles. But while the hardware receives a char every X
cycles on its own, before we can spend X cycles to send it, we first of
all also need to spend some other cycles to handle the interrupt,
extract the received char, etc. So, the ratio is never 1:1, but instead
1:1.xx, so we will sustainably be getting late, and lose chars or
overflow eventually (stated compactly in the sketch after this list).

3. Simply due to a lack of better leads to the problem ("broken UART
drivers and handling" started to sound much more convincing than the
original "scheduling is broken" guess, which couldn't be proven).

Personally I'm still rather concerned that the driver may not be
robust (although, just as with you blaming the scheduler, I haven't
collected much evidence to support this).
"The driver"? It's "the drivers, and for a while". If you think that
the problem is peculiar to stm32, then there was example of
arduino_101 having had a big problem, and still having some, frdm_k64f
also had (== has) similar problems, etc.

I've done a quick review of the driver and so far I haven't seen
anything that explains the character loss (although it would be good to
neutralize the private locking so we can see the output from the ISR).

I'm afraid that, for me, time is up thinking about this for a while.
However, if I were looking at this further, I'd consider reinstating
the old UART driver (the HAL is sufficiently complexified that it
becomes hard to reason about) and see if it behaves the same way...
Well, thanks for helping the investigation, I appreciate that, as
usual! Reinstate the old driver - for one-off testing? If so, I'd find
hack-hacking the existing one to be more practical. And anything beyond
that borders on organizational and political issues. Suffice it to say
that others have provided feedback on that (p.1 at
https://youtu.be/XUJK2htXxKw?t=1885), I don't want to go there ;-).

[]

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Paul Sokolovsky
 

Hello Andy,

On Fri, 7 Apr 2017 08:29:08 -0700
Andy Ross <andrew.j.ross@intel.com> wrote:

Paul Sokolovsky <paul.sokolovsky@linaro.org> wrote (on Thursday,
April 06, 2017 11:04PM):
My kind request though is: please don't try to shift it all on the UART
and talk about the need for DTR/RTS, etc. The UART is a damn slow device
by current standards, which should "just work" with dozens-of-MHz
CPUs without any flow control on the hardware side.
I'm not completely sure I buy that. If your consumer thread isn't
being scheduled due to all the UART interrupts, the CPU is clearly
*not* fast enough
Thanks for the reply. The original hunch was that there was some kind
of issue with the scheduling implementation. I explained where it came
from - other explanations, like "a fast thread can't keep up with a
slow ISR" or "the problem is in the UART handling", seemed to be too
simple an answer and thus unrealistic to be true. It more and more
looks like I was wrong, and the issue lies on the UART handling side,
and is relatively obvious - of the kind to be looked at and ignored :-/.

to handle the bytes arriving at the line rate (maybe
that's something we should fix, though -- have you looked to see where
the cycles are going?).
Speculatively, yes - they go into the busy-polling implementation of
UART TX. And as mentioned in my previous email, we always spend more
cycles to send a char (because of the overheads we incur besides the
send) than to receive one, so with busy-polling send, we'll never get
reliable behavior.

I mean... I know you don't want to hear this but the situation you
have is precisely what buffering and hardware flow control are
designed to solve. Make the buffer big enough to handle the sudden
flood, or flag your buffer-full condition to the other side with
RTS/CTS (I know this won't work on lots of console devices which don't
wire the flow control lines).
So, either everyone does it wrong, or people save on wires because they
aren't needed in the prevalent number of modern usecases.

Or if you can, adjust the protocol to
include a software flow control mechanism like an ACK with a max
"packet" size (not XON/XOFF though, that never works).
Right, the last one is the 2nd most popular workaround nowadays -
almost all protocols are request-response, and require a response much
sooner than after sending a megabyte of data, but some do send many
kilobytes nonetheless. The 1st workaround is still just handling a UART
with a fast CPU such that flow control issues aren't hit in realistic
scenarios. "Realistic scenarios" include being able to paste a dozen
lines into an echoing application. That works in many systems - why
don't we target Zephyr being able to do that too?



Thanks,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Daniel Thompson <daniel.thompson@...>
 

On 08/04/17 10:58, Paul Sokolovsky wrote:
Hello Daniel,

On Fri, 7 Apr 2017 12:38:34 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

[]

...
Console buffer overflow - char dropped
Console buffer overflow - char dropped
...
Did you run this on carbon? I've not been able to reproduce on this
board.
Ok, now I did. With the patch above, I can't reproduce the overflow
message. But now let's try to echo the input char, and for a
difference, set CONFIG_CONSOLE_GETCHAR_BUFSIZE=64 so we don't
suspect too short a buffer.

I still can't reproduce the overflow message,
... funny you should say that.

I've been taking a quick glance at the vendor HAL here. It implements
some kind of weird private locking scheme that looks like it doesn't
play nicely with a pre-emptive RTOS (or even with interrupt handlers).
Ack, thanks, I see it.

[]

So, we can probably come to a conclusion: for reliable operation, both
UART rx *and* tx should be buffered and interrupt-driven.
I don't understand how the evidence leads to this conclusion.
Well, it's reasoning from the opposite direction; let's recount:

1. As was quoted, MicroPython has multiple implementations of that
handling for multiple environments, all "just working". Output
buffering is used at all times though.
Sorry, I was getting carried away with the specific analysis for character drop: if input characters are responded to in >70000 cycles and the microcontroller is otherwise idle then input buffering and output buffering are equivalent.

I agree *entirely* that micropython requires output buffering however, because neither of the above constraints is true there (assuming expression parsing consumes a lot of cycles).


2. Pure mathematical reasoning: if a char at a given baud rate is
received every X cycles, then busy-polling to send a char will take
those same X cycles. But while the hardware receives a char every X
cycles on its own, before we can spend X cycles to send it, we first of
all also need to spend some other cycles to handle the interrupt,
extract the received char, etc. So, the ratio is never 1:1, but instead
1:1.xx, so we will sustainably be getting late, and lose chars or
overflow eventually.
This would be true if there were 0 bytes of output buffering, but that should not be the case: there should be 1 byte of buffering (i.e. we poll *before* the subsequent TX, not after the current TX).
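
As a sketch (same made-up register layout as the earlier poll-out
sketch; TX_READY means the data register can accept a new char):

struct fake_uart {
	volatile unsigned int status;
	volatile unsigned int data;
};

#define TX_READY (1u << 0)

/* 0 bytes of effective buffering: the CPU waits the char out itself. */
static void poll_after(struct fake_uart *u, unsigned char c)
{
	u->data = c;
	while (!(u->status & TX_READY)) {
	}
}

/* 1 byte of effective buffering: the char drains while the CPU does
 * other work; we only wait if we return before it has gone out. */
static void poll_before(struct fake_uart *u, unsigned char c)
{
	while (!(u->status & TX_READY)) {
	}
	u->data = c;
}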


3. Simply due to a lack of better leads to the problem ("broken UART
drivers and handling" started to sound much more convincing than the
original "scheduling is broken" guess, which couldn't be proven).

Personally I'm still rather concerned that the driver may not be
robust (although, just as with you blaming the scheduler, I haven't
collected much evidence to support this).
"The driver"? It's "the drivers, and for a while".
However mature Zephyr gets, it will *always* be the driver you should
examine for bugs first. It is inevitable that drivers are less well
tested and less heavily reviewed than the core or arch code.

For example, on Zephyr only a handful of people care about the STM32
drivers at present, dozens care about arch/arm, and everyone in the
project cares about the core code...

If you think that
the problem is peculiar to stm32, then there was the example of
arduino_101 having had a big problem
... which *was* an APIC driver problem.


(and still having some); frdm_k64f
also had (== has) similar problems, etc.

I've done a quick review of the driver and so far I haven't seen
anything that explains the character loss (although it would be good to
neutralize the private locking so we can see the output from the ISR).

I'm afraid that, for me, time is up thinking about this for a while.
However, if I were looking at this further, I'd consider reinstating
the old UART driver (the HAL is sufficiently complexified that it
becomes hard to reason about) and see if it behaves the same way...
Well, thanks for helping the investigation, I appreciate that, as
usual! Reinstate the old driver - for one-off testing? If so, I'd find
hack-hacking the existing one to be more practical. And anything beyond
that borders on organizational and political issues. Suffice it to say
that others have provided feedback on that (p.1 at
https://youtu.be/XUJK2htXxKw?t=1885), I don't want to go there ;-).
No worries. I was thinking a "git revert" is fast and cheap (providing one is not afraid of solving merge conflicts) for a quick test... also it's always useful to distinguish regressions from this-never-worked.

Anyhow, excited to see your pull requests this morning. I'll take a look at that in a minute.

Daniel.


Erwan Gouriou
 

Hi Paul, Daniel,



On 10 April 2017 at 10:19, Daniel Thompson <daniel.thompson@...> wrote:
[]

About the stm32 driver and Cube HAL, I'd like to mention that a Low
Level Cube API (LL) is now available on almost all stm32 families. This
API intends to be lightweight and modular, and hence fits the Zephyr
architecture better.

I'd like to start implementing drivers with this LL API soon.
It would still be a HAL (hence I understand some people would still
complain), but it would allow a driver implementation closer to
Zephyr's needs, removing a part of the redundant functionality while
still allowing us to implement (as far as possible) stm32 generic
drivers.

This won't solve the current issue (as I understood it), but I hope it
could help to simplify things on the stm32 side.

Erwan



Paul Sokolovsky
 

Hello Erwan,

On Mon, 10 Apr 2017 17:18:13 +0200
Erwan Gouriou <erwan.gouriou@linaro.org> wrote:

[]

About the stm32 driver and Cube HAL, I'd like to mention that a Low
Level Cube API (LL) is now available on almost all stm32 families. This
API intends to be lightweight and modular, and hence fits the Zephyr
architecture better.

I'd like to start implementing drivers with this LL API soon.
It would still be a HAL (hence I understand some people would still
complain),
I appreciate the response.

We discussed the HAL issue at the OpenIoT Summit, and well, both sides
have good arguments. I appreciate the arguments of the SoC vendors, the
main ones being that it allows them to support entire families and
groups of families, and that it allows them to reuse code which has
already undergone prolonged development and (some level of) testing.
But it's indeed not a good situation when using a HAL causes additional
artifacts in Zephyr drivers, like the one arguably discovered by Daniel
in the STM32 UART poll-out implementation.

I had a look at a (few) LL functions, and saw they're indeed
lightweight, so using them sounds like a good plan.

but it would allow a driver implementation closer to
Zephyr's needs, removing a part of the redundant functionality while
still allowing us to implement (as far as possible) stm32 generic
drivers.

This won't solve the current issue (as I understood it), but I hope it
could help to simplify things on the stm32 side.
So, the current biggest issue, as I see it, is that Zephyr doesn't seem
to provide a well-established baseline for the behavior of various
subsystems - in the case discussed, the UART console. This in turn
leads to artifacts and deviations in the driver implementations for
different SoCs - simply because there's no good integration test for
all the different calls.

I also received a reply that the subsystem is known to work for large
amounts of data, pointing at e.g. tests/bluetooth/tester/ . Indeed, it
works with the current implementation, but only due to a specially
constructed protocol, so that testcase doesn't expose the
deficiencies. (Un)fortunately, exposing them is easy - with just a
clipboard paste into a terminal application, with a mere UART echo
running on the device side. (And that's just a testcase; the actual
issue is non-reliability with UART protocols more generic than the one
used by e.g. tests/bluetooth/tester.)

So, I propose to establish *that* as the UART console baseline, and
indeed, it will require changes down to the level of the SoC UART
drivers. An initial draft patch for STM32 was posted:
https://gerrit.zephyrproject.org/r/12821


Erwan
[]

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Daniel Thompson <daniel.thompson@...>
 

On 10/04/17 17:16, Paul Sokolovsky wrote:
Hello Erwan,

On Mon, 10 Apr 2017 17:18:13 +0200
Erwan Gouriou <erwan.gouriou@linaro.org> wrote:

[]

About the stm32 driver and Cube HAL, I'd like to mention that a Low
Level Cube API (LL) is now available on almost all stm32 families. This
API intends to be lightweight and modular, and hence fits the Zephyr
architecture better.

I'd like to start implementing drivers with this LL API soon.
It would still be a HAL (hence I understand some people would still
complain),
I appreciate the response.

We discussed the HAL issue at the OpenIoT Summit, and well, both sides
have good arguments. I appreciate the arguments of the SoC vendors, the
main ones being that it allows them to support entire families and
groups of families, and that it allows them to reuse code which has
already undergone prolonged development and (some level of) testing.
But it's indeed not a good situation when using a HAL causes additional
artifacts in Zephyr drivers, like the one arguably discovered by Daniel
in the STM32 UART poll-out implementation.

I had a look at a (few) LL functions, and saw they're indeed
lightweight, so using them sounds like a good plan.
Certainly there's a world of difference between providing low level register accessors and HAL based drivers. I'm glad to see things moving towards the LL stuff.


but it would allow a driver implementation closer to
Zephyr's needs, removing a part of the redundant functionality while
still allowing us to implement (as far as possible) stm32 generic
drivers.

This won't solve the current issue (as I understood it), but I hope it
could help to simplify things on the stm32 side.
So, the current biggest issue, as I see it, is that Zephyr doesn't seem
to provide a well-established baseline for the behavior of various
subsystems - in the case discussed, the UART console. This in turn
leads to artifacts and deviations in the driver implementations for
different SoCs - simply because there's no good integration test for
all the different calls.

I also received a reply that the subsystem is known to work for large
amounts of data, pointing at e.g. tests/bluetooth/tester/ . Indeed, it
works with the current implementation, but only due to a specially
constructed protocol, so that testcase doesn't expose the
deficiencies. (Un)fortunately, exposing them is easy - with just a
clipboard paste into a terminal application, with a mere UART echo
running on the device side. (And that's just a testcase; the actual
issue is non-reliability with UART protocols more generic than the one
used by e.g. tests/bluetooth/tester.)

So, I propose to establish *that* as the UART console baseline, and
indeed, it will require changes down to the level of the SoC UART
drivers. An initial draft patch for STM32 was posted:
https://gerrit.zephyrproject.org/r/12821
Having looked at the other FIFO fill functions I think that this was merely the STM32 driver not conforming to the existing driver interface. I don't think there is any change of API needed to go forward here.


Daniel.