Situation with net APIs testing


Paul Sokolovsky
 

Hello,

As I'm approaching the final steps of preparing the BSD Sockets
patchset for submission, I'm looking into a way to add some tests for
it. Testing networking functionality is inherently hard because, in
general, networking hardware is required, even if "virtual", e.g.
tunslip6 and other tools from the net-tools repo running on the host
to support QEMU networking. During prototyping I learnt that there are
loopback capabilities when binding and connecting to the same netif,
but that still requires net-tools running just to get QEMU to start up
with networking support.

Well, I took a look at tests/net and saw a whole bunch of tests, whoa!
I gave tests/net/tcp a try; some cases passed, some failed, hmm. But
then I killed the net-tools/loop-slip-tap.sh script and the test ran
just the same. Whoa, so we have a means to run networking tests
without any requirements on the host side, which means we can run them
as part of the sanitycheck testsuite! But 8 tests under tests/net/
have build_only=true; any wonder they're broken?

Anyway, I looked at what's involved in running without net-tools, and
figured it's CONFIG_NET_L2_DUMMY. I added it to my sockets test and
got only a segfault in return. After debugging, it turned out to be
the same issue already faced by me and other folks: if there are no
netifs defined, the networking code crashes (instead of printing a
clear error to the user): https://jira.zephyrproject.org/browse/ZEP-2105
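
For context, the experiment boils down to a prj.conf along these lines
(a sketch of what I was trying; the exact option set obviously depends
on the test, and the SLIP/TAP driver that QEMU networking normally
uses is simply left disabled):

# Hypothetical prj.conf fragment for a host-independent network test
CONFIG_NETWORKING=y
CONFIG_NET_IPV6=y
# Dummy L2, so nothing from net-tools needs to run on the host
CONFIG_NET_L2_DUMMY=y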

But how do the tests/net/ tests run then without crashing? Here's the answer:

zephyr/tests/net$ grep NET_DEVICE -r * | wc
     22      42    1532

So, almost each and every test defines its own test interface.
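
To illustrate, each of those tests carries a snippet roughly like the
one below (a paraphrased sketch from memory, not lifted from any
particular test; the exact NET_DEVICE_INIT parameters have varied
between Zephyr releases, and all names here are illustrative):

#include <device.h>
#include <net/net_if.h>
#include <net/net_pkt.h>
#include <net/net_core.h>

static int net_test_dev_init(struct device *dev)
{
	return 0;
}

static void net_test_iface_init(struct net_if *iface)
{
	/* Assign a fake link address, add the test's IP addresses, etc. */
}

static int net_test_send(struct net_if *iface, struct net_pkt *pkt)
{
	/* No real hardware: inspect the packet and/or feed it back into
	 * the stack here, then consume it.
	 */
	net_pkt_unref(pkt);
	return 0;
}

static struct net_if_api net_test_if_api = {
	.init = net_test_iface_init,
	.send = net_test_send,
};

NET_DEVICE_INIT(net_test_iface, "net_test_iface", net_test_dev_init,
		NULL, NULL, CONFIG_KERNEL_INIT_PRIORITY_DEFAULT,
		&net_test_if_api, DUMMY_L2, NET_L2_GET_CTX_TYPE(DUMMY_L2),
		127);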

One would think that if we have 22 repetitive implementations of test
interfaces, whose main purpose is to be just loopback interfaces, then
we'd have a loopback interface in the main codebase. But nope, as
confirmed by Jukka on IRC, we don't.

Summary:

1. Writing networking tests is hard, but in Zephyr it takes
extraordinary, agonizing effort. The most annoying part is that all
the needed pieces are there, but instead of presenting a nice picture,
they form a mess which greets you with crashes if you try to change
anything.

2. There are existing big (~20K each) tests which fail, apparently
because they aren't run, so they bitrot. Why do we need these huge,
detailed tests if we don't run them? (An alternative explanation is
that there's something wrong with my system, and yep, I'd be glad to
know what I still don't do right with Zephyr after working on it for a
year.)


I'd be glad if more experienced developers could confirm whether it's
really like the above, or whether I'm missing something. And I'll be
happy to work on the above issues, but in the meantime, I'll need to
submit BSD Sockets with rather bare and hard-to-run (not automated)
tests due to the situation above.



Thanks,

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


Jukka Rissanen
 

Hi Paul,

On Wed, 2017-06-14 at 18:10 +0300, Paul Sokolovsky wrote:
> Hello,
>
> As I'm approaching the final steps of preparing the BSD Sockets
> patchset for submission, I'm looking into a way to add some tests
> for it. Testing networking functionality is inherently hard because,
> in general, networking hardware is required, even if "virtual", e.g.
> tunslip6 and other tools from the net-tools repo running on the host
> to support QEMU networking. During prototyping I learnt that there
> are loopback capabilities when binding and connecting to the same
> netif, but that still requires net-tools running just to get QEMU to
> start up with networking support.
>
> Well, I took a look at tests/net and saw a whole bunch of tests,
> whoa! I gave tests/net/tcp a try; some cases passed, some failed,
> hmm. But then I killed the net-tools/loop-slip-tap.sh script and the
> test ran just the same. Whoa, so we have a means to run networking
> tests without any requirements on the host side, which means we can
> run them as part of the sanitycheck testsuite! But 8 tests under
> tests/net/ have build_only=true; any wonder they're broken?

When we had Gerrit and Jenkins, some of the net tests ran slightly
longer than what was desired, so they were marked as build only. Now
that the situation is different with GitHub and Shippable, we can
change this. So I will prepare a patch that activates those tests that
can be activated.

I looked through tests/net to see what the current status of the tests is:

ieee802154/crypto
* This cannot be run on qemu as it requires suitable hw

tcp
* Test does not pass, needs fixing

mld
* Test does not pass, needs fixing

ipv6
* Test does not pass, needs fixing

lib/mqtt_publisher
* Test requires real qemu to run. This needs to be converted 

lib/mqtt_subscriber
* Test requires real qemu to run. This needs to be converted 

buf
* This test runs ok so build_only=true can be removed.

all
* This is an intentional compile test that activates all network
config options and tries to compile the binary. The resulting binary
cannot be run, mostly because of memory requirements and the lack of a
suitable test environment. The only issue with this test is that we
should remember to add and enable new net config options in this test
case.

All other test programs (24 of them), which consist of quite many
individual tests, are run automatically by CI, so the situation is not
as bleak as you indicated here.

I will fix the relevant failing tests, as they have bit-rotted since
they were written. Converting the two mqtt tests to not use real qemu
requires a bit more work.



> Anyway, I looked at what's involved in running without net-tools,
> and figured it's CONFIG_NET_L2_DUMMY. I added it to my sockets test
> and got only a segfault in return. After debugging, it turned out to
> be the same issue already faced by me and other folks: if there are
> no netifs defined, the networking code crashes (instead of printing
> a clear error to the user):
> https://jira.zephyrproject.org/browse/ZEP-2105
>
> But how do the tests/net/ tests run then without crashing? Here's
> the answer:
>
> zephyr/tests/net$ grep NET_DEVICE -r * | wc
>      22      42    1532
>
> So, almost each and every test defines its own test interface.
>
> One would think that if we have 22 repetitive implementations of
> test interfaces, whose main purpose is to be just loopback
> interfaces,

No, the interface is not a loopback interface, although it might look
like one. The purpose of the interface created in each of the tests is
to simulate a real network, so that we do not have to connect to the
outside world and the test is self-contained. So it kind of looks like
a loopback interface, but in this case the source and destination IP
addresses are not the same (as would be the case with a loopback
interface), as typically we want to test some real behavior of the
system, so src/dest addresses should differ.

The loopback support actually has limited use cases, and we probably
need to make it optional (behind a Kconfig option) in the code, as
normally there should be no need to send anything back to oneself in
the real world.

> then we'd have a loopback interface in the main codebase. But nope,
> as confirmed by Jukka on IRC, we don't.
>
> Summary:
>
> 1. Writing networking tests is hard, but in Zephyr it takes
> extraordinary, agonizing effort. The most annoying part is that all
> the needed pieces are there, but instead of presenting a nice
> picture, they form a mess which greets you with crashes if you try
> to change anything.

I am not sure what kind of mess you mean here, but patches are welcome
as always to rectify this.


> 2. There are existing big (~20K each) tests which fail, apparently
> because they aren't run, so they bitrot. Why do we need these huge,
> detailed tests if we don't run them?

Some explanation is given above.

> (An alternative explanation is that there's something wrong with my
> system, and yep, I'd be glad to know what I still don't do right
> with Zephyr after working on it for a year.)

Hmm, I missed the point of your last sentence.



> I'd be glad if more experienced developers could confirm whether
> it's really like the above, or whether I'm missing something. And
> I'll be happy to work on the above issues, but in the meantime, I'll
> need to submit BSD Sockets with rather bare and hard-to-run (not
> automated) tests due to the situation above.

Cheers,
Jukka


Paul Sokolovsky
 

Hello Jukka,

On Thu, 15 Jun 2017 10:46:29 +0300
Jukka Rissanen <jukka.rissanen@...> wrote:

[]

> > as part of the sanitycheck testsuite! But 8 tests under tests/net/
> > have build_only=true; any wonder they're broken?
>
> When we had Gerrit and Jenkins, some of the net tests ran slightly
> longer than what was desired, so they were marked as build only. Now
> that the situation is different with GitHub and Shippable, we can
> change this. So I will prepare a patch that activates those tests
> that can be activated.

That explains it, thanks.

[]

> All other test programs (24 of them), which consist of quite many
> individual tests, are run automatically by CI, so the situation is
> not as bleak as you indicated here.

Great, thanks for clarifying this. Though I hope you'd agree that
seeing tests such as "context" or "tcp" fail does lead to concerns.

> I will fix the relevant failing tests, as they have bit-rotted since
> they were written.

Nice, thanks for finding time for this!

> Converting the two mqtt tests to not use real qemu requires a bit
> more work.

[]

> > One would think that if we have 22 repetitive implementations of
> > test interfaces, whose main purpose is to be just loopback
> > interfaces,
>
> No, the interface is not a loopback interface, although it might
> look like one. The purpose of the interface created in each of the
> tests is to simulate a real network, so that we do not have to
> connect to the outside world and the test is self-contained. So it
> kind of looks like a loopback interface, but in this case the source
> and destination IP addresses are not the same (as would be the case
> with a loopback interface), as typically we want to test some real
> behavior of the system, so src/dest addresses should differ.

I see. I can imagine they offer more functionality than just a
loopback interface. I can also imagine that each of the 22 test device
implementations is slightly different, to cater for particular
testcases. None of that helps someone wanting to write a new test,
unfortunately. How would one know that there's no standard device
implementation for dependency-free testing, and having figured that
out, how would one choose which of the 22 cases to use as a template?

> The loopback support actually has limited use cases, and we probably
> need to make it optional (behind a Kconfig option) in the code, as
> normally there should be no need to send anything back to oneself in
> the real world.

I absolutely agree that a loopback device would be mostly useful for
development and testing, not for production. It indeed offers only
limited testing. But it has one big advantage: tests using it can
easily be run with the existing sanitycheck framework. And if a
loopback device existed in the main codebase, such tests would also be
easy to write, unlike now. Summing up, I'd like to give implementing
one for the mainline a try.
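
To give an idea of what I have in mind, the core of such a driver
could be as small as the send hook sketched below (just my own rough
sketch with made-up names; a real implementation would likely need to
clone the packet, and perhaps adjust addresses, rather than re-inject
it as-is):

#include <net/net_if.h>
#include <net/net_pkt.h>
#include <net/net_core.h>

/* Loopback "send": push the outgoing packet straight back into the RX
 * path of the same interface, so the stack talks to itself without
 * any real (or emulated) networking hardware.
 */
static int loopback_send(struct net_if *iface, struct net_pkt *pkt)
{
	/* Cloning and error handling omitted for brevity. */
	return net_recv_data(iface, pkt);
}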


> > then we'd have a loopback interface in the main codebase. But
> > nope, as confirmed by Jukka on IRC, we don't.
> >
> > Summary:
> >
> > 1. Writing networking tests is hard, but in Zephyr it takes
> > extraordinary, agonizing effort. The most annoying part is that
> > all the needed pieces are there, but instead of presenting a nice
> > picture, they form a mess which greets you with crashes if you try
> > to change anything.
>
> I am not sure what kind of mess you mean here, but patches are
> welcome as always to rectify this.

Well, patches alone won't help here. Writing tests is always hard (for
various reasons, including "tests are code, so why not write 'real'
code instead?"), so it would be nice to think about how to facilitate
that. The specific proposal is to add a loopback netif; I assume
that's OK, so I will go for a patch.

[]

> > (An alternative explanation is that there's something wrong with
> > my system, and yep, I'd be glad to know what I still don't do
> > right with Zephyr after working on it for a year.)
>
> Hmm, I missed the point of your last sentence.

Well, it's the same issue: various things in Zephyr are "harder than
necessary", so one can never know whether something really broke or
whether one didn't do all the things needed to run it successfully. It
would be nice to think about making the default config of Zephyr
either run out of the box, or fail with clear error messages, rather
than crash or lock up. That's again a big meta-task, not something
which can be "fixed with a patch", but it would be nice to see whether
the different stakeholders of Zephyr agree that there's an issue which
needs attention.



Thanks for all the replies!

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog