
7  Failures

We claimed that local and remote message sending were the same. Obviously, we over-simplified the issue: the remote site may crash or become unreachable.

7.1  Detected failures

In the case of our ray tracer, remote message sending is synchronous, i.e. it is a remote function call, performed by call_worker (section 4.2).

and call_worker(monitor,x,worker) =
  let v = worker(x) in
  monitor.leave(v) & agent(worker)
…

The definition of worker resides on a remote site. If the remote site fails and the failure is detected by the JoCaml runtime system, then the call to worker raises the exception Join.Exit and monitor.leave(v) never executes. As an untimely consequence, the image will never be completed.

To correct this misbehavior, it suffices to re-issue a failed task, as performed by the following new definition of call_worker.

or agent(worker) & compute(monitor,x) = call_worker(monitor,x,worker)
and call_worker(monitor,x,worker) =
  let v = try Some (worker(x)) with _ -> None in
  match v with
  | None -> compute(monitor,x)
  | Some v -> monitor.leave(v) & agent(worker)
…

The re-issued task is made available to other agents by means of a new channel compute and a new, straightforward join-pattern. Additionally, the worker that failed is forgotten, since there is no agent(worker) process when v is None.

Observe that all exceptions are caught, not only Join.Exit. Here, the master/slave protocol does not rely on exceptions, and we can thus interpret any exception as expressing a failure. This can occur in practice, for instance when the remote site consumes all available memory (exception Out_of_memory), since the JoCaml runtime system transmits exceptions.
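The failure-as-option pattern above can be factored into a small helper. The following is a minimal sketch; the name protect is ours and does not appear in the original code.

(* Hypothetical helper: run f on x, turning any exception, including
   Join.Exit and exceptions transmitted from the remote site, into None *)
let protect f x = try Some (f x) with _ -> None

With such a helper, the body of call_worker boils down to matching on protect worker x.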

7.2  Undetected failures

Unfortunately, not all failures are detected. More concretely, we cannot assume that worker(x) will always either return a value or raise an exception. To solve the problem, we keep a record of all active tasks, i.e. of all tasks that are being computed. Then, near the end of image computation, we re-issue active tasks until the image is completed.

This technique requires a new kind of monitor, whose join-definition is as follows.

def state(next_id, active, r) & enter(x) =
    state(next_id+1, (next_id,x)::active, r) & reply next_id to enter
or state(next_id, active, r) & leave(id,v) =
    if List.mem_assoc id active then
      let active' = List.remove_assoc id active in
      state(next_id, active', combine v r)
    else
      state(next_id, active, r)
or state(next_id, [], r) & wait() & finished() =
    state(next_id, [], r) & reply r to wait
(* New channels: is_active and get_active *)
or state(next_id, active, r) & is_active(id) =
    state(next_id, active, r) & reply List.mem_assoc id active to is_active
or state(next_id, active, r) & get_active() =
    state(next_id, active, r) & reply active to get_active

The code above is a refinement of the previous monitor (page ??). The message on state is now a triple of an identifier (next_id, an integer), a mapping from identifiers to task descriptions (active, an association list whose keys are identifiers), and a partial result (r, as before). Identifiers permit the safe identification of task descriptions. They could be avoided if we were sure that task descriptions are pairwise distinct, which need not be the case with general enumerators.

The new monitor exports two additional synchronous channels: is_active, a predicate that tests whether a given task is active, and get_active, which returns the list of active tasks. The guarded processes for these new channels are straightforward (List.mem_assoc is from the OCaml library and has the obvious semantics). The exported channels enter, leave, finished and wait are still present, with a few changes. Channel enter now takes a task description x as argument and returns a fresh identifier next_id. The counter increment performed by the previous monitor is replaced by adding (next_id,x) to the internal association list. Channel leave now takes an identifier id as an extra argument, which it uses to remove the completed task from the list of active tasks (by calling the library function List.remove_assoc). Notice that, as a given task can now be computed by several slaves, we take some care not to combine the result of a given task more than once. Finally, the reaction rule for wait undergoes a small but important change: the message on state is re-emitted. Otherwise, subsequent calls to is_active would block.
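For concreteness, the new monitor can be packaged behind a record, as calls such as monitor.enter(x) and monitor.leave(id,v) suggest. The following is a sketch of ours: the names create_monitor, combine and init are assumptions carried over from the counting monitor of section 4.2, while the reaction rules are exactly those given above.

type ('a, 'b, 'r) monitor = {
  enter : 'a -> int ;                  (* register a task, get a fresh id *)
  leave : (int * 'b) Join.chan ;       (* report the result of a task *)
  wait : unit -> 'r ;                  (* wait for the combined result *)
  finished : unit Join.chan ;          (* no more tasks will be entered *)
  is_active : int -> bool ;
  get_active : unit -> (int * 'a) list ;
}

let create_monitor combine init =
  def state(next_id, active, r) & enter(x) =
      state(next_id+1, (next_id,x)::active, r) & reply next_id to enter
  or state(next_id, active, r) & leave(id,v) =
      if List.mem_assoc id active then
        state(next_id, List.remove_assoc id active, combine v r)
      else
        state(next_id, active, r)
  or state(next_id, [], r) & wait() & finished() =
      state(next_id, [], r) & reply r to wait
  or state(next_id, active, r) & is_active(id) =
      state(next_id, active, r) & reply List.mem_assoc id active to is_active
  or state(next_id, active, r) & get_active() =
      state(next_id, active, r) & reply active to get_active
  in
  spawn state(0, [], init) ;           (* no task yet, initial result *)
  { enter=enter ; leave=leave ; wait=wait ; finished=finished ;
    is_active=is_active ; get_active=get_active ; }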

The pool is also modified. The crucial modification regards re-issuing tasks when iteration has come to an end.

def loop(monitor,enum) & agent(worker) =
    match step enum with
    | Some(x, next) ->
        let id = monitor.enter(x) in
        loop(monitor,next) & call_worker(monitor, id, x, worker)
    | None -> do_again(monitor) & agent(worker)

When iteration is over (step enum returns None), a message on the internal channel do_again is sent. The worker, which has not been given any task, is also released. The guarded process for do_again is in charge of retrieving active tasks from the monitor.

or do_again(monitor) & agent(worker) =
    begin match monitor.get_active() with
    | [] -> monitor.finished()
    | xs -> again(monitor,xs)
    end & agent(worker)

The synchronization on agent(...) above is not strictly necessary. Nevertheless, it is clearly a good idea to wait for at least one slave to be available before re-issuing active tasks. The available slave is not used yet, and the message on agent is re-emitted. If there are no active tasks left (get_active() returns the empty list), then the pool informs the monitor that it will not allocate any additional task (by monitor.finished()). In fact, since all calls to enter are performed before do_again is called for the first time, it can be deduced that the image is now complete. Hence, the join-pattern for wait in the monitor could have avoided testing that active is empty.

If there are some active tasks left, then channel again is in charge of re-allocating them to available slaves.

or again(monitor,(id,x)::xs) & agent(worker) =
    again(monitor,xs) &
    if monitor.is_active(id) then call_worker(monitor,id,x,worker)
    else agent(worker)
or again(monitor,[]) = do_again(monitor)

The code above basically scans the list of active tasks. However, before calling call_worker (note 5), a last check is made. Indeed, it may be that the task id has been completed while again was scanning the list. Observe that, when the scanning is over (join-pattern again(...,[])), do_again is called again, resulting in another re-allocation of active tasks to slaves, if there still are active tasks.
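For reference, here is our reconstruction of the omitted call_worker (note 5), obtained from the definition of section 7.1 by threading the task identifier through compute and leave. It is a sketch, not the original code.

(* Our reconstruction: as in section 7.1, except that the task identifier
   id travels with the task and is passed to monitor.leave *)
or agent(worker) & compute(monitor,id,x) = call_worker(monitor,id,x,worker)
and call_worker(monitor,id,x,worker) =
  let v = try Some (worker(x)) with _ -> None in
  match v with
  | None -> compute(monitor,id,x)          (* detected failure: re-issue *)
  | Some v -> monitor.leave(id,v) & agent(worker)
…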

It may seem that our solution is a waste of processing power. However, if we compute one image only, there is little waste. Having n slaves computing the same subimage is no less efficient than having one slave computing the subimage and n−1 slaves sitting idle, up to communication costs. Furthermore, it can be more efficient on a heterogeneous network. If a slow slave is allocated a task at the end of the image, then other slaves will quickly be allocated the same task. As a result, image completion is delayed only until the fastest of the slaves working on the last subimages finishes.

If there are several images to compute, one can lower the amount of useless work by having the master control the rendering of several images at a time. Namely, remember that the fold pool of section 4.2 can manage several exploiting agents. So as to control several images concurrently, we need to change the function render of the module RenderMaster. The new definition of render simply stores the freshly computed scene in an instance of the buffer of section 3.

let buffer = create_buffer ()

let render sc = buffer.put sc
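For the sake of self-containedness, here is a minimal sketch of what the buffer of section 3 may look like; we assume a one-place buffer whose put and get both block when necessary.

type 'a buffer = { put : 'a -> unit ; get : unit -> 'a }

(* A one-place buffer: put blocks while the buffer is full,
   get blocks while it is empty *)
let create_buffer () =
  def put(x) & empty() = full(x) & reply () to put
  or  get() & full(x) = empty() & reply x to get in
  spawn empty() ;
  { put=put ; get=get ; }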

An exploiting agent is a simple asynchronous channel definition that repeatedly calls the function render_image of page ??.

def render_images() = render_image (buffer.get()) ; render_images()

It remains to start several such agents, their number depending on some user setting amax.

let () = for _k = 1 to amax do spawn render_images() done

An alternative is unconstrained concurrency: an exploiting agent is spawned as soon as an image is available.

def render_images() =
  let sc = buffer.get() in
  spawn begin render_image (sc) ; 0 end ;
  render_images()

let () = spawn render_images()

Notice that, by contrast with the previous definition of render_images, the function render_image is now called asynchronously. We thus have three versions of RenderMaster.render, which respectively control the rendering of one image at a time, of at most amax images at a time, and of as many images as possible at a time. Preliminary experiments show that setting amax to 2 or 3 is a reasonable choice. However, we list all these possibilities to demonstrate the flexibility of JoCaml. In particular, master termination is controlled by the same counting monitor (see page ??) in all cases.


5. We omit the code; it is almost the same as in the previous section.
