We claimed that local and remote message sending were the same. Obviously, we over-simplified the issue: the remote site may crash or become unreachable.
In the case of our ray tracer, the remote message sending is a synchronous one, i.e. a remote function call, which is performed by call_worker (section 4.2).
The definition of worker resides on a remote site. If the remote site fails and the failure is detected by the JoCaml runtime system, then the call to worker results in an exception being raised, and monitor.leave(v) will never execute. As a very untimely consequence, the image will never be completed.
To correct this misbehavior, it suffices to re-issue a failed task, as performed by the following new definition of call_worker.
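The listing itself is not part of this excerpt; as a hedged sketch, the new call_worker may look as follows. Only the names compute, agent, worker and monitor.leave are taken from the text; the body is a reconstruction.

```
(* Sketch only: re-issue the task on any failure of the remote call. *)
def call_worker(worker, x) =
  match (try Some (worker x) with _ -> None) with
  | Some v ->
      monitor.leave(v) ;   (* combine the result, as before *)
      agent(worker)        (* the slave answered: release it *)
  | None ->                (* any exception expresses a failure *)
      compute(x)           (* re-issue the task to other agents;
                              the failed worker is simply forgotten *)
```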
The re-issued task is made available to other agents by means of a new channel compute and a new, straightforward join-pattern. Additionally, the worker that failed is forgotten, since no agent(worker) message is re-emitted for it. Observe that all exceptions are caught, not only those signaling detected site failures.
Here, the master/slave protocol does not rely on exceptions and
we can thus consider any exception to express a failure. This can
occur in practice, for instance if the remote site consumes all
available memory (exception
Out_of_memory), since the JoCaml
runtime system transmits exceptions.
Unfortunately not all failures are detected.
More concretely, we cannot assume that
worker(x) will always
either return a value or raise an exception.
To solve the problem, we keep a record of all active tasks, i.e. of all tasks that are being computed.
Then, near the end of image computation, we re-issue active tasks
until the image is completed.
This technique requires a new kind of monitor, whose join-definition is as follows.
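The original listing does not survive in this excerpt; the following JoCaml sketch shows what such a monitor may look like. Only the channel names state, enter, leave, wait, is_active and get_active are taken from the text; finished (the channel on which the pool signals that no more tasks will be allocated), combine (the result-combining function) and all bodies are assumptions.

```
(* Reconstructed sketch of the new monitor, not the original listing. *)
def state(next_id, active, r) & enter(x) =
    state(next_id + 1, (next_id, x) :: active, r)
    & reply next_id to enter

 or state(next_id, active, r) & leave(id, v) =
    (if List.mem_assoc id active then
       (* first completion of task id: combine its result *)
       state(next_id, List.remove_assoc id active, combine v r)
     else
       (* id was already completed by another slave: ignore v *)
       state(next_id, active, r))
    & reply () to leave

 or state(next_id, active, r) & is_active(id) =
    state(next_id, active, r)
    & reply (List.mem_assoc id active) to is_active

 or state(next_id, active, r) & get_active() =
    state(next_id, active, r)
    & reply active to get_active

 or state(next_id, [], r) & finished() & wait() =
    state(next_id, [], r)   (* re-emit state, lest is_active block *)
    & reply r to wait
```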
The code above is a refinement of the previous monitor. The message on state is now a triple, made of an identifier (next_id, an integer), of a mapping from identifiers to task descriptions (active, an association list whose keys are identifiers), and of a partial result (r, as before). Identifiers permit the safe identification of task descriptions. They can be avoided when we are sure that task descriptions are pairwise distinct, which need not be the case with general enumerators.
The new monitor exports two additional synchronous channels: is_active, a predicate to test whether a given task is active, and get_active, which returns the list of active tasks. The guarded processes for these new channels are straightforward (List.mem_assoc is from the OCaml library and has obvious semantics).
The exported channels enter, leave and wait are still here, with a few changes. enter now takes a task description x as an argument and returns a fresh identifier. The counter increment performed by the previous monitor is now replaced by adding (next_id, x) to the internal association list. leave now takes an identifier id as an extra argument, which it uses to remove the completed task from the list of active tasks (by calling the library function List.remove_assoc).
Notice that, as a given task can now be computed by several
slaves, we take some care not to combine the result of a given task
more than once.
Finally the reaction rule for
wait undergoes a small, but important,
change: the message on
state is re-emitted.
Otherwise, subsequent calls to
is_active would block.
The pool is also modified. The crucial modification regards re-issuing tasks when iteration has come to an end. When iteration is over (step enum returns no further task), a message on the internal channel do_again is sent. The worker that has not been called is also released.
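The modified pool listing is not part of this excerpt; the following sketch suggests how it may read. Only agent, compute, do_again, again, step, enum and the monitor channels are named in the text; loop, finished, the arity of compute and all bodies are assumptions (call_worker is assumed here to take a task identifier, in line with the identifier-based monitor).

```
(* Sketch only. loop iterates the enumerator; compute carries re-issued
   tasks; do_again/again re-allocate active tasks near the end. *)
def agent(worker) & loop(enum) =
    (match step enum with
     | Some (x, next) ->
         let id = monitor.enter(x) in
         call_worker(worker, id, x) & loop(next)
     | None ->                        (* iteration is over *)
         agent(worker) & do_again())  (* release the unused worker *)

 or agent(worker) & compute(id, x) =  (* re-issued task *)
    call_worker(worker, id, x)

 or agent(worker) & do_again() =
    agent(worker) &                   (* slave not used yet: re-emit it *)
    (match monitor.get_active() with
     | [] -> monitor.finished()       (* no active task left *)
     | tasks -> again(tasks))         (* re-allocate active tasks *)

 or agent(worker) & again((id, x) :: rest) =
    again(rest) &
    (if monitor.is_active(id)         (* last check: id may be done now *)
     then call_worker(worker, id, x)
     else agent(worker))

 or again([]) = do_again()            (* scan over: start a new round *)
```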
The guarded process for
do_again is in charge of
retrieving active tasks from the monitor.
The synchronization on
agent(...) above is not necessary.
Nevertheless, it is clearly a good idea to wait for at least one slave to be
available before re-issuing active tasks.
The available slave is not used yet and
the message on
agent is re-emitted.
If there are no active tasks left (get_active() returns the empty list), then the pool informs the monitor that it will not allocate any additional tasks. Since all calls to enter are performed before this notification is issued for the first time, it can be deduced that the image is now complete. Hence, the join-pattern for wait in the monitor could have avoided testing that active is empty.
If there are some active tasks left, then again is in charge of re-allocating them to available slaves. The guarded process for again basically scans the list of active tasks. However, before calling call_worker, a last check is made: it can indeed be that the task id has been completed while again was scanning the list. Observe that when the scanning is over, do_again is called again, resulting in another re-allocation of active tasks to slaves, if there still are any.
It may seem that our solution is a waste of processing power. However, if we compute one image only, there is little waste: having n slaves computing the same subimage is no less efficient than having one slave computing the subimage and n−1 slaves being idle, up to communication costs. Furthermore, it can be more efficient on a heterogeneous network. If a slow slave is allocated a task at the end of the image, then other slaves will quickly be allocated the same task. As a result, image completion is delayed only by the fastest amongst the slaves that are working on the last subimages.
If there are several images to compute, one can lower the amount of useless work by having the master control the rendering of several images at a time. Namely, remember that the fold pool of section 4.2 can manage several exploiting agents. So as to control several images concurrently, we need to change the definition of render. The new definition of render simply stores the freshly computed scene in an instance of the buffer.
An exploiting agent is a simple asynchronous channel definition that repeatedly calls the function render_image. It remains to start several such agents, their number depending on some user setting amax.
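A possible way to define and start the agents, as a sketch (exploit is an assumed channel name; render_image and amax come from the text):

```
(* Sketch only: amax agents, each repeatedly rendering one image. *)
def exploit() =
    render_image () ;  (* synchronously render one image *)
    exploit()          (* then start over *)

let () =
  for _i = 1 to amax do spawn exploit() done
```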
An alternative is unconstrained concurrency: an exploiting agent is spawned as soon as an image is available.
Notice that, with respect to the previous definition of render, render_image is now called asynchronously.
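As a sketch, this unconstrained version may read as follows; the channel render_async and the argument convention of render_image are assumptions:

```
(* Sketch only: spawn one asynchronous rendering per parsed scene. *)
def render_async(scene) = render_image scene ; 0

let render scene = spawn render_async(scene)
```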
Now, we have three versions of the master, which respectively control the rendering of one image at a time, of at most amax images at a time, and of as many images as possible at a time.
Preliminary experiments show that setting
amax to be 2 or 3 is a reasonable choice.
However, we list all these possibilities to demonstrate the flexibility
of JoCaml. In particular, master termination is controlled by
the same counting monitor (see page ??)
in all cases.