Re: BSD Sockets in mainline, and how that affects design decisions for the rest of IP stack (e.g. send MTU handling)

Tomasz Bursztyka

On 26/10/2017 14:37, Paul Sokolovsky wrote:
Hello Tomasz,

Thanks for responding and bringing up this discussion - it got
backlogged (so I'm doing homework on it in the background).

On Wed, 25 Oct 2017 18:13:18 +0200
Tomasz Bursztyka <tomasz.bursztyka@...> wrote:

Hi guys,

It was
posted as .
Again, at that time, there was no consensus about the way to solve it,
so it was implemented only for the BSD Sockets API.

Much later, was posted.
It works in the following way: it allows an application to create an
oversized packet

There are many more details than presented above, and the devil is
definitely in the details; there's no absolutely "right" solution, it's
a compromise. I hope that Jukka and Tomasz, who are proponents of the
second (GH-1330) approach, can correct me on its benefits.
Actually, I missed the fact that PR 1330 was about MTU handling. It does
not sound generic enough.

In the end, I don't approve of either of the proposed solutions.
That sounds fresh, thanks ;-)

Let me
explain why:

First, let's not rush on this MTU handling just yet, though it is
much needed. We first need this:

Ack, that's a good thing to do...

it will greatly simplify how packets are allocated. I haven't touched
the MTU stuff since I did the net_pkt move because of this feature we'll

I foresee a lot of possible improvements once this issue is resolved:
certainly MTU handling and better memory management than the current
frag model, but also better behavior under low memory
... but I don't see how it directly relates to the topic of this RFC,
which is selecting a paradigm to deal with the fact that we have finite
units of buffering, and how that should affect user-facing API design.
I was indeed only responding about MTU handling (as both PRs do, in a way).

There's definitely a lot to improve and optimize in our IP stack, and
the issue you mention is one of them. But it's going to be just that:
an optimization. What we're discussing here is how to structure the API:

1. Accept that the amount of buffering we can do is very finite, and
make applications aware of that and ready to handle it - the POSIX-
inspired way. If done that way, we can just use a network packet as
the buffering unit and further optimize that handling.

2. Keep pretending that we can buffer a mini-infinite amount of data.
It's mini-infinite because we still won't be able to buffer more than
RAM allows (actually, more than the TX slab allows), and that's still
too little, so it won't work for "real" amounts of data, which will
still need to fall back to the handling of point 1 above. Packet buffers
are still used for buffering, but looking at Jukka's implementation,
they are used as generic data buffers and require pretty heavy
post-processing: first splitting oversized buffers into packet-friendly
sizes (#1330), then stuffing protocol headers in front (we already do
that, and it's pretty awful and not zero-copy at all), etc. Again, all
of that happens with no free memory available - it was already spent
buffering that "mini-infinite" amount of data.

You also say that you don't like either of these choices. Well, there
are only so many ways to do it. What do you have in mind?
As I am not using the user APIs at all, I can't tell what would be best.
But the issue seems to be found mostly in the API usage.

With sockets, you have one opaque type to access the network and read/write on it.
The data buffering is then split in two: the user manages his raw data in his own buffers
(and copies to/from them when receiving/sending), while the actual network data
(the encapsulated raw data) is handled behind the scenes by the socket on send/recv.
Both the sending and receiving logic are easy, as long as you have enough memory for the user's buffer.

In Zephyr, both are found directly in one net_pkt. The user does not have
to create his own buffers: he just picks from the net slabs, populates the packet, finalizes it, and is done.

From a memory usage point of view, the latter is easier and more efficient -
as long as the net stack does it well, obviously, like properly handling MTU, etc. -
but mostly on _tx_ only. On rx, as the data is scattered across buffer fragments, the user
has to add logic on his side to parse the received data (the encapsulated data).
Thus net_frag_read() and its derived functions, which can be a bit complicated to grasp, I guess.

About the "mini-infinite": it's up to the user to handle the amount
returned by net_pkt_append(), a bit like send() on a socket. Though the
difference here is, of course, that the data still needs to be sent.

When I changed net_nbuf to net_pkt, my point in doing it was exactly that a net_pkt should represent
one unique IP packet. So from that, I would say net_pkt_append() must not go over the MTU.

Note however that this MTU can be either the HW MTU or an IP-based one.
For instance, on 15.4/6lo you don't want to limit IPv6 packets to the HW MTU; instead you use the
IPv6 minimum MTU of 1280 bytes, which in turn is fragmented through 6lo into as many
15.4 frames as necessary. But when not using 6lo, it would need to use the HW MTU only.
(There is a possibility to handle bigger IPv6 packets through IPv6 fragmentation; I can't remember
whether that generates as many net_pkts as necessary, or whether it does it all in one, which would go
against the net_pkt usage.)

As I am not working on the user API, I don't have a good overview of how it's supposed to work.

Well, maybe my blabbering can help you guys decide which way would be best.
