We claimed that local and remote message sending were the same. Obviously, we over-simplified the issue: a remote site may crash or become unreachable.
In the case of our ray tracer, the remote message sending is a synchronous one, i.e. a remote function call, which is performed by call_worker (section 4.2).
The definition of worker resides on a remote site. If the remote site fails and the failure is detected by the JoCaml runtime system, then the call to worker will raise the exception Join.Exit, and monitor.leave(v) will never execute. As an unfortunate consequence, the image will never be completed.
To correct this misbehavior, it suffices to re-issue a failed task, as performed by the following new definition of call_worker.
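A minimal sketch of such a definition follows, written as two clauses of the pool's join definition; agent and monitor are those of section 4.2, and details may differ from the actual code.

  def compute(x) & agent(worker) = call_worker(worker, x)

  or call_worker(worker, x) =
      (* any exception counts as a failure of the remote call *)
      let v = try Some (worker x) with _ -> None in
      match v with
      | Some v ->
          let () = monitor.leave v in
          agent(worker)          (* the slave is available again *)
      | None ->
          compute(x)             (* re-issue the task to other agents *)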
The re-issued task is made available to other agents by means of a new channel compute and a new, straightforward join-pattern. Additionally, the worker that failed is forgotten, since no agent(worker) process is emitted when v is None.
Observe that all exceptions are caught, not only Join.Exit. Here, the master/slave protocol does not rely on exceptions, and we can thus consider any exception to express a failure. This can occur in practice, for instance if the remote site consumes all available memory (exception Out_of_memory), since the JoCaml runtime system transmits exceptions.
Unfortunately, not all failures are detected. More concretely, we cannot assume that worker(x) will always either return a value or raise an exception. To solve the problem, we keep a record of all active tasks, i.e. of all tasks that are being computed. Then, near the end of the image computation, we re-issue active tasks until the image is completed. This technique requires a new kind of monitor, whose join-definition is as follows.
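A sketch of such a monitor could read as follows; combine is the combining function of the previous monitor, and the exported channels are assumed to be packed in a record, as before.

  def state(next_id, active, r) & enter(x) =
      state(next_id + 1, (next_id, x) :: active, r) & reply next_id to enter

  or state(next_id, active, r) & leave(id, v) =
      let active, r =
        if List.mem_assoc id active
        then List.remove_assoc id active, combine v r  (* first completion *)
        else active, r in                              (* already combined *)
      state(next_id, active, r) & reply () to leave

  or state(next_id, active, r) & is_active(id) =
      state(next_id, active, r) & reply (List.mem_assoc id active) to is_active

  or state(next_id, active, r) & get_active() =
      state(next_id, active, r) & reply active to get_active

  or state(next_id, [], r) & finished() & wait() =
      (* state is re-emitted, see below *)
      state(next_id, [], r) & reply r to wait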
The code above is a refinement of the previous monitor (page ??). The message on state is now a triple of an identifier (next_id, an integer), a mapping from identifiers to task descriptions (active, an association list whose keys are identifiers), and a partial result (r, as before). Identifiers permit the safe identification of task descriptions. They can be avoided when we are sure that task descriptions are pairwise distinct, which need not be the case with general enumerators.
The new monitor exports two additional synchronous channels: is_active, a predicate to test whether a given task is active, and get_active, which returns the list of active tasks. The guarded processes for these new channels are straightforward (List.mem_assoc is from the OCaml library and has the obvious semantics).
The exported channels enter, leave, finished and wait are still here, with a few changes. Channel enter now takes a task description x as argument and returns a fresh identifier next_id. The counter increment performed by the previous monitor is now replaced by adding (next_id, x) to the internal association list. Channel leave now takes an identifier id as an extra argument, which it uses to remove the completed task from the list of active tasks (by calling the library function List.remove_assoc). Notice that, since a given task can now be computed by several slaves, we take some care not to combine the result of a given task more than once.
Finally, the reaction rule for wait undergoes a small but important change: the message on state is re-emitted. Otherwise, subsequent calls to is_active would block.
The pool is also modified. The crucial modification concerns re-issuing tasks when the iteration has come to an end.
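A sketch of the modified pool could read as follows; step and the enumerator enum are those of the fold pool of section 4.2, and, as an assumption of this sketch, call_worker and compute now also carry the task identifier.

  def loop(enum) & agent(worker) =
      (match step enum with
       | Some (x, next) ->                 (* one more task to allocate *)
           call_worker(worker, monitor.enter x, x) & loop(next)
       | None ->                           (* iteration is over *)
           do_again() & agent(worker))

  or call_worker(worker, id, x) =
      (match (try Some (worker x) with _ -> None) with
       | Some v ->
           let () = monitor.leave (id, v) in
           agent(worker)
       | None -> compute(id, x))           (* failure: re-issue the task *)

  or compute(id, x) & agent(worker) = call_worker(worker, id, x)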
When the iteration is over (step enum returns None), a message on the internal channel do_again is sent. The worker that has not been called is also released.
The guarded process for do_again is in charge of retrieving the active tasks from the monitor.
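Continuing the same join definition, the clause for do_again could read:

  or do_again() & agent(worker) =
      agent(worker) &                      (* the slave is not used yet *)
      (match monitor.get_active() with
       | [] -> monitor.finished()          (* no active task left *)
       | active -> again(active))          (* re-allocate active tasks *)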
The synchronization on agent(...) above is not necessary. Nevertheless, it is clearly a good idea to wait for at least one slave to be available before re-issuing active tasks. The available slave is not used yet, and the message on agent is re-emitted.
If there are no active tasks left (get_active() returns the empty list), then the pool informs the monitor that it will not allocate any additional task (by monitor.finished()). In fact, since all calls to enter are performed before do_again is called for the first time, it can be deduced that the image is now complete. Hence the join-pattern for wait in the monitor could have avoided testing that active is empty.
If there are some active tasks left, then channel again is in charge of re-allocating them to available slaves.
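The following sketch simplifies again so that it carries only the list of remaining (id, task) pairs; the definition discussed below takes an extra argument.

  or again((id, x) :: rest) & agent(worker) =
      again(rest) &
      (if monitor.is_active(id)
       then call_worker(worker, id, x)     (* still active: re-issue it *)
       else agent(worker))                 (* completed meanwhile *)

  or again([]) = do_again()                (* scanning is over *)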
The code above basically scans the list of active tasks. However, before calling call_worker, a last check is made: it may be that task id has been completed while again was scanning the list. Observe that when the scanning is over (join-pattern again(...,[])), do_again is called again, resulting in another re-allocation of active tasks to slaves, if there still are active tasks.
It may seem that our solution wastes processing power. However, if we compute one image only, there is little waste: having n slaves compute the same subimage is no less efficient than having one slave compute the subimage while n−1 slaves are idle, up to communication costs. Furthermore, it can be more efficient on a heterogeneous network. If a slow slave is allocated a task at the end of the image, then other slaves will quickly be allocated the same task. As a result, image completion is delayed only by the fastest amongst the slaves that are working on the last subimages.
If there are several images to compute, one can lower the amount of useless work by having the master control the rendering of several images at a time. Namely, remember that the fold pool of section 4.2 can manage several exploiting agents. So as to control several images concurrently, we need to change the function render of the module RenderMaster.
The new definition of render simply stores the freshly computed scene in an instance of the buffer of section 3.
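A sketch, assuming the buffer of section 3 is created by a function create_buffer and filled through an asynchronous channel put (both names assumed):

  let buffer = create_buffer ()

  (* store the freshly computed scene for the exploiting agents *)
  let render scene = spawn buffer.put(scene)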
An exploiting agent is a simple asynchronous channel definition that repeatedly calls the function render_image of page ??. It remains to start several such agents, their number depending on some user setting amax.
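A sketch of such agents follows; buffer.get is the (assumed) synchronous channel that extracts a scene from the buffer of section 3, and render_image is the function of page ??.

  def exploiting_agent() =
      let scene = buffer.get () in         (* wait for the next scene *)
      let () = render_image scene in       (* render it, synchronously *)
      exploiting_agent()                   (* then start over *)

  let () = for _i = 1 to amax do spawn exploiting_agent() done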
An alternative is unconstrained concurrency: an exploiting agent is spawned as soon as an image is available.
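A sketch of this variant, where render itself spawns one asynchronous call to render_image per freshly computed scene:

  def render_async(scene) =
      let () = render_image scene in 0     (* one agent per image *)

  let render scene = spawn render_async(scene)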
Notice that, with respect to the previous definition of render_images, the function render_image is now called asynchronously.
Now we have three versions of RenderMaster.render, which respectively control the rendering of one image at a time, of at most amax images at a time, and of as many images as possible at a time. Preliminary experiments show that setting amax to 2 or 3 is a reasonable choice. However, we list all these possibilities to demonstrate the flexibility of JoCaml. In particular, master termination is controlled by the same counting monitor (see page ??) in all cases.