Hi all,
we are currently having lots of discussions on (non-)blocking I/O. I'd like to put these thoughts out there for that discussion, because there seems to be a misunderstanding of terms.
Blocking never goes away; it is just shifted to other orders of magnitude. Types of "blocking":
- blocking or non-blocking pipes: will writing to a file or socket stop the program until the pipe is ready for writing? (basic "OS level" I/O)
- synchronous or asynchronous event handling: will the program stop until a remote side has responded, or can the program handle other events in the meantime? (one job queue, one worker == osmo_select_main())
- sequential or parallelized event handling: can events be handled concurrently, or just one after the other? (one or more job queues, more than one worker)
- concurrent access of resources: a given resource is not thread-safe, hence one thread needs to wait for the other to release a resource lock. (This is always present, the aim is to hit a sweet spot of least locking.)
In the osmo programs I have worked on, we do the first two, but not the other two. Asynchronous event handling is the bare minimum for a server program to be functional. Non-blocking pipes are a common addition that is easy to implement.
From there on we enter the world of parallelization, and things get very complex very quickly. It is possible to cause more blocking than before. It is possible to significantly increase the load instead of improving performance.
I am familiar with parallelized non-blocking event handling and I/O, from realtime audio+video+control hacking. We do not use any of these techniques in osmo programs I have worked on -- for good reasons, I thought.
The spectrum, from most blocking to most non-blocking:
- single-threaded, single queue with async defer;
- task queue with multiple worker threads;
- scheduling based on fairness or urgency;
- map/reduce across a cloud, in a functional language.
We're almost all the way to the blocking end of the parallelization spectrum. So far I thought that this was a conscious choice. Async-but-blocking is low complexity, with large benefits in maintainability and stability.
Example:
If we have pending, say, 10 incoming packets on three different links, we handle each packet one by one when it is its turn. If one subscriber's incoming measurement report triggers longish handover calculations, any events like an MGCP ACK or SCCP CC for some other subscribers will have to wait in line, even though they might take a thousandfold less time to complete.
OsmoBSC works well in that fashion, even for hundreds of cells and multiple MSCs: compared to audio+video+control, CNI signalling has huge tolerances on timing. This is why 3GPP separates the control plane from the user plane.
It is important to balance all of these aspects!
---
It was mentioned somewhere that our VTY is both asynchronous and non-blocking. I do not agree at all and would like to explain, as an example of the above.
Our vty server is NOT asynchronous. When a VTY request comes in, the vty function must directly vty_out() the response. We cannot defer the VTY response asynchronously like any other protocol can (see example below).
Our VTY structures, and the program-specific internal state that VTY manipulates and queries, are not thread-safe. The VTY server cannot be parallelized as it is now.
A contrived example:
Let's say we wanted to query nft counters from a VTY command:
* read VTY command from user,
* do some nft command asynchronously,
* and print back the result when nft is done.
Naively, we could store the struct vty * somewhere, and exit the vty handling function. When nft is done some time later, just vty_out() the result to that struct vty * that we still have from earlier. But there are problems:
If the user closed the telnet session in the meantime, this struct vty * is stale and the program will crash. We need a cancel mechanism to avoid that.
Also, when a VTY command function is done, we directly transmit the next VTY prompt. vty-test scripts (`expect`) won't function properly when more response data arrives after the prompt has been received; human users may be confused.
So our VTY server is *both* Synchronous and Blocking. It is not trivial to make it async (like all of our other protocols are) and non-blocking (which we have nowhere in osmo-cni yet).
---
These are the kinds of mechanisms I care about in our discussions:
- "blocking" on what time scale?
- tradeoff with code complexity and maintainability.
- tradeoff with code stability and determinism.
- tradeoff with system performance load due to additional management and caching.
One does not simply put things in threads. There are very non-trivial aspects that *always* come with it: one of them is super good, most of them are pretty bad.
~N
Hi Neels,
I'm in the middle of OsmoDevCon logistics preparations; some initial comments:
On Wed, May 01, 2024 at 08:10:39PM +0200, Neels Hofmeyr wrote:
> If we have pending, say, 10 incoming packets on three different links, we handle each packet one by one when it is its turn. If one subscriber's incoming measurement report triggers longish handover calculations, any events like an MGCP ACK or SCCP CC for some other subscribers will have to wait in line, even though they might take a thousandfold less time to complete.
that is not "blocking" in any definition or context I have worked in so far.
"blocking" to me is defined by an application making a system call that may put the task on a wait queue or some other mechanism that delays execution until something somewhere else happens.
In the kernel programming world, people tend to speak of "something that may sleep".
> When a VTY request comes in, the vty function must directly vty_out() the response.
yes, but that vty_out does *not* cause a blocking syscall. Hence it is not blocking.
Any generated I/O data is kept in a buffer inside our process, which is drained as fast as the kernel can accept the data on the given socket without blocking.
> We cannot defer the VTY response asynchronously like any other protocol can (see example below).
We are deferring those responses, see above. We are not deferring generating the response, but that doesn't matter, as generating the response is something we do internally, without performing syscalls that may block.
> Our VTY structures, and the program-specific internal state that VTY manipulates and queries, are not thread-safe. The VTY server cannot be parallelized as it is now.
nobody has claimed that. I don't think threading should be confused with blocking I/O.