Fostering Software Reliability in an Increasingly Hostile World
Greg Utas - Pentennea Inc.
Introduction. I believe that standards specifying the following
capabilities would help to foster the production of reliable software:
1. handling signals as C++ exceptions
2. cooperative scheduling
3. proportional scheduling
These capabilities are currently absent from both the POSIX standard
and the carrier-grade Linux initiative. However, they are important to
large, scalable servers, which are typically soft real-time systems
that must provide near-continuous availability.
Signal Handling in C++. POSIX discusses signals at some length, but
because it is language-independent, it does not address signal handling
in C++. Consequently, this capability would probably have to be
specified in a C++ standard.
The general requirement is the ability for signal handlers to integrate
signals with C++ by mapping them to exceptions. This is possible in
some environments, but it depends on the combination of compiler and
operating system. The issues include:
- the need to fall back on the calls setjmp and longjmp if the
environment does not support throwing C++ exceptions from signal
handlers
- the lack of C++ cleanup (principally destructors of objects such as
any auto_ptr) when longjmp bypasses stack frames instead of unwinding them
- the difficulty of mapping a signal caused by a stack overflow to a C++
exception (throwing an exception requires stack space, so this can
easily cause another overflow)
- the fact that some embedded operating systems do not detect stack
overflows
- the need to deal with both synchronous and asynchronous signals
- the fact that some C++ implementations map signals to exceptions
in a proprietary manner (such as the "structured exceptions" in
Microsoft Visual C++)
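The setjmp/longjmp fallback from the first issue can be sketched as follows. This is a minimal, single-threaded POSIX sketch, not a proposed standard: the names SignalException, installHandler, and runProtected are hypothetical, only synchronous signals (such as a SIGFPE raised by the faulting code itself) are handled, and the C++ exception is thrown from ordinary code after siglongjmp returns rather than from inside the handler. As the second issue warns, any objects with destructors created inside the protected function are abandoned, not cleaned up, when siglongjmp fires.

```cpp
#include <csignal>     // std::raise, std::signal, SIGFPE
#include <setjmp.h>    // sigsetjmp, siglongjmp (POSIX)
#include <signal.h>    // sigaction, sigemptyset (POSIX)
#include <stdexcept>
#include <string>

// Hypothetical exception type that records which signal was caught.
class SignalException : public std::runtime_error {
public:
    explicit SignalException(int sig)
        : std::runtime_error("signal " + std::to_string(sig)), sig_(sig) {}
    int signal() const { return sig_; }
private:
    int sig_;
};

// Jump target of the innermost runProtected call. A single global is
// enough because this sketch assumes one thread.
static sigjmp_buf* jumpTarget = nullptr;

static void onSignal(int sig) {
    if (jumpTarget != nullptr) {
        siglongjmp(*jumpTarget, sig);  // back to runProtected; no C++ unwinding
    }
    // No protected call is active: fall back to the default action.
    std::signal(sig, SIG_DFL);
    std::raise(sig);
}

// Routes the given signal to onSignal.
void installHandler(int sig) {
    struct sigaction sa {};
    sa.sa_handler = onSignal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(sig, &sa, nullptr);
}

// Runs fn. If a trapped signal arrives while fn runs, the handler jumps
// back here and the signal is rethrown as a C++ exception from ordinary
// code, where throwing is always legal.
void runProtected(void (*fn)()) {
    sigjmp_buf env;
    sigjmp_buf* const previous = jumpTarget;
    // Nonzero savemask: siglongjmp also restores the signal mask, so the
    // signal is unblocked again after leaving the handler.
    int sig = sigsetjmp(env, 1);
    if (sig != 0) {
        jumpTarget = previous;
        throw SignalException(sig);
    }
    jumpTarget = &env;
    try {
        fn();
    } catch (...) {
        jumpTarget = previous;  // ordinary exceptions also restore the target
        throw;
    }
    jumpTarget = previous;
}
```

A caller would invoke installHandler once per signal and then wrap risky work in runProtected, catching SignalException like any other exception. Asynchronous signals and stack overflow (which would additionally need an alternate signal stack, for example via sigaltstack) are outside this sketch.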
Cooperative Scheduling. POSIX
limits itself to preemptive, priority, and round-robin scheduling. This
was the outcome of adding a hard real-time requirement (priority
scheduling) to timesharing policies (round robin with preemption).
However, cooperative scheduling improves the reliability and capacity
of soft real-time systems by allowing threads, instead of the
scheduler, to decide when task switching can occur. This reduces the
amount of task switching. More importantly, by allowing transactions to
run without preemption, it eliminates the error-prone practice of having
to identify and protect critical regions at a fine-grained level.
The general requirement is the ability for a thread to lock, which
suppresses preemption until the thread "cooperates" by unlocking,
allowing other threads to run. This raises at least two issues:
- how to deal with a thread that fails to cooperate (by getting
into an infinite loop, for example)
- how to deal with a critical region shared by a thread and an interrupt
service routine (ISR), such as one that handles I/O
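The lock/unlock capability can be approximated in user space on a conventional preemptive OS, which conveys the programming model even though it is not the native facility called for here. In this minimal sketch (the RunLock class and its methods are hypothetical), the kernel still preempts threads, but only the holder of a single run lock executes application code, so a transaction between cooperation points never interleaves with another application thread. The overrun check is one assumed approach to the first issue, a thread that fails to cooperate; the ISR issue is not addressed.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical emulation of cooperative scheduling: only the thread that
// holds the run lock executes application code.
class RunLock {
public:
    using Clock = std::chrono::steady_clock;

    void lock() {
        mutex_.lock();
        lockedAt_.store(Clock::now().time_since_epoch().count());
        held_.store(true);
    }

    void unlock() {
        held_.store(false);
        mutex_.unlock();
    }

    // A cooperation point between transactions. std::mutex is not fair,
    // so this only gives other application threads a chance to run.
    void yield() {
        unlock();
        std::this_thread::yield();
        lock();
    }

    // For a watchdog thread: true if the current holder has kept the lock
    // longer than limit, e.g. because it is stuck in an infinite loop.
    bool overrun(std::chrono::milliseconds limit) const {
        if (!held_.load()) return false;
        Clock::duration lockedAt(lockedAt_.load());
        return Clock::now().time_since_epoch() - lockedAt > limit;
    }

private:
    std::mutex mutex_;
    std::atomic<bool> held_{false};
    std::atomic<Clock::rep> lockedAt_{0};
};
```

Because each transaction runs while the run lock is held, data touched only by application threads needs no finer-grained locking; the cost is that a thread which never calls yield or unlock stalls every other application thread, which is why some form of overrun detection is needed.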
Proportional Scheduling. This
scheduling discipline is also important in soft real-time systems, in
which one thread should rarely have absolute priority over another. For
example, say that a system assigns payload work absolute priority over
administrative work. If it then receives more payload work than it can
handle, administrative work starves, the system stops responding to
console commands, and its operators are apt to reboot it.
Here, the general requirement is the ability to assign a faction,
rather than a priority, to each thread. CPU time is then apportioned
among the factions by giving each one a subset of the timeslices that
recur over a broader time interval. This eliminates task starvation by
allowing payload work to receive most of the CPU time while still
allotting some time to administrative work. The details include:
- how to assign a faction to a thread
- how to assign timeslices to factions
- what to do when a faction is supposed to run but contains no
ready thread
- how to modify the number of timeslices received by a faction
while a system is in service
- how to deal with faction inversion (where a thread in a rarely
executed faction is blocking a thread in a frequently executed faction)
- how to handle situations in which absolute priority is actually
important
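A minimal sketch of apportioning timeslices among factions, assuming a fixed frame of timeslices that repeats indefinitely. The class ProportionalScheduler and its methods are hypothetical, and the sketch answers two of the open questions with one possible policy each: a slice whose faction has no ready thread passes to the next faction in the frame that has one, and shares can be changed in service by rebuilding the frame. Faction inversion and absolute priority are not handled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical proportional scheduler: a frame of timeslices recurs, and
// each slice in the frame is owned by one faction.
class ProportionalScheduler {
public:
    // shares[f] = number of timeslices per frame given to faction f.
    explicit ProportionalScheduler(const std::vector<int>& shares) {
        setShares(shares);
    }

    // Rebuilds the frame; may be called while the system is in service.
    // Slices are dealt out round-robin across factions until each faction
    // has received its share.
    void setShares(const std::vector<int>& shares) {
        frame_.clear();
        std::vector<int> remaining(shares);
        bool added = true;
        while (added) {
            added = false;
            for (std::size_t f = 0; f < remaining.size(); ++f) {
                if (remaining[f] > 0) {
                    frame_.push_back(f);
                    --remaining[f];
                    added = true;
                }
            }
        }
        if (frame_.empty()) throw std::invalid_argument("no timeslices");
        next_ = 0;
    }

    // Chooses the faction for the next timeslice. ready[f] tells whether
    // faction f has a ready thread. If the slice's owner has none, the
    // slice goes to the next faction in the frame that does. Returns -1
    // if no faction has a ready thread.
    int nextFaction(const std::vector<bool>& ready) {
        std::size_t slot = next_;
        next_ = (next_ + 1) % frame_.size();
        for (std::size_t i = 0; i < frame_.size(); ++i) {
            std::size_t f = frame_[(slot + i) % frame_.size()];
            if (f < ready.size() && ready[f]) return static_cast<int>(f);
        }
        return -1;
    }

private:
    std::vector<std::size_t> frame_;
    std::size_t next_ = 0;
};
```

With shares {8, 2}, payload work (faction 0) receives eight of every ten slices and administrative work (faction 1) receives two, so neither starves; if payload work has nothing ready, administrative work can use the whole frame.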
Conclusion. The most desirable
outcome would be to specify the above capabilities in a way that allows
various run-time environments to implement them in a native manner. An
alternative would be to specify a lower-level set of capabilities that
would allow wrapper classes to implement the higher-level capabilities
in a way that would be portable across various environments.