Results

8 Results

The performance of parallel programs is best illustrated by speedup measures. Speedup is the ratio of sequential execution time by parallel execution time. By sequential, we here mean the sequential ray tracer, not the concurrent ray tracer with one slave running. Of course, the measured times are wall-clock time. So as to minimize noise, any data we present is the median of three measures.

We perform most of our experiments on a cluster machine, the machine has 11 nodes, each node being a 2 × 2GHz AMD-64 bi-processor. Each node has 6 Gb of physical memory and the nodes are connected by a fast Gigabit dedicated network. Our main experiment consists in running two instances of the slave program on 1, 2, …, 11 nodes, yielding 2, 4, …, 22 participating slaves. Expressing speedup as a function of the number of slaves involved here makes sense, since cluster nodes are identical.

We also perform the additional experiment of running one slave per processor on 111 processors from the students computer lab at École polytechnique — we use 68 machines, some of which are bi-processors. The machines involved are of various CPU model and memory size, but all are rather modern. By contrast with the cluster, speedup as a function of the number of slaves involved does not make sense on such an heterogeneous network. Instead, we introduce processing power, a crude estimate computed by considering running times for a small image. We take one processor of the cluster as a base (1.0), cluster power is thus 22.0, while the local network power is about 80.0. Network conditions are standard for such a local network. All machines involved run Linux of one kind or another.

In all experiments, we run the master program on a distinct, “front-end”, machine, from which we launch the slave programs on the other machines by the means of ssh. More precisely, we start slaves in advance and measure the running time of the master.

We perform two sets of experiments, “urchin” (U) and “Sierpinski” (S) — see Appendix A. In both experiments, inputs to the ray tracer are scenes of identical structure and increasing complexity. By contrast, rendering parameters do not change: images are of size 1000 × 1000, the subimage unit is the default setting of one line, and adaptive antialiasing [21] is enabled, yielding between 2 and 27 primary rays per pixel.

Figure 3: Stylized sea urchin.

Speedups

Running times,
in seconds

Seq x22 x111

U8 768 35 20.3

U11 9120 416 127

Experiments U yield excellent results: here, we basically achieve linear speedup on the cluster (left of Figure 3). To provide an idea of the concrete benefits of concurrency, we also give some execution times (columns “Seq” and “x22” at the right of Figure 3). For instance, in the case of U11, image production time is reduced from more of 2 hours and a half to less than 7 minutes. It is still possible to go faster by using the local network (column “x111”), with a better result for the most difficult test U11.

Indeed, we here achieve a speedup of 71.8, which is surprisingly good, considering that the estimated processing power the network is 80.0 and that some of the machines involved are not idle at the time we use them. We shall thus assume that experiment U11 somehow validates our crude estimate of processing power. Namely, the (relatively) poor performance of U8 originates in the important setup times we observe while 111 slaves attempt their first connection to the master. The setup of ssh tunnels may partly explain this delay, but we cannot reject the hypothesis of a master overflow here.

We think that the good performance of experiment U stems from two main factors.

Rendering operations account for the vast majority of computations, almost the totality. They literally dwarf other operations which, by contrast, are executed sequentially by the master (parsing and image saving) or are executed once by every slave (scene building).
The subimage unit of one line is adequate.
1. It is large enough. That is, due to scene complexity and antialiasing, most of the 1000 lines computed represent sufficient work for messaging and task management costs to remain unnoticed. In particular, by observing instantaneous CPU load, we had the confirmation that slaves are seldom idle because they are waiting for the master to allocate a subimage to them.
2. It is small enough. Obviously, dividing the images into 1000 subimages to be computed by no more than 22 (or even 111) slaves allows a distribution of tasks that adapts well to varying task complexity and slave load. Besides, near the end, image completion is delayed by no more than a small fraction of the total work to be performed.

Figure 4: Decaying Sierpinski cube

Speedups

Running times,
in seconds

Seq x22 x111

S1 7.2 0.86 – ^a

S2 36.6 1.95 – ^a

S3 190 9.10 10.9

S4 759 40.6 32.05

S5 2620 204 262

a
Unstable results

Our second series (Figure 4) illustrates some of the limitations inherent to concurrent ray tracing. Series S includes simple scenes (S1 is made of a few cubes) and scenes that take time to compute (S5 is made of almost 580,000 cubes).

Experiments S1 and S2 are here for completeness, there is little point in attempting to compute such simple images faster than 7 s and 36 s. However, we achieve decent speedups, at least for S2. As regards the other experiments, significant sequential running times make relevant the idea of distributed execution. And indeed, we achieve good speedups on the cluster for experiments S3 and S4. However, the most demanding S5 experiment yields poor speedups. Those are easily explained once one knows that the sequential ray tracer starts to cast rays after about 100 s of computing time, which is mostly devoted to the interpretation of the Gml program. About 100 s is a small fraction (3.8%) of the sequential running time of 2620 s, but a significant fraction of 204 s. More precisely, we can approximate the expected speedup for N slaves as N/(0.038 × N + 0.962) (Amdahl’s law). For N=22 we have a theoretical speedup of 12.2, which is rather close to the observed speedup of 12.8.

Experiments S on the local network are not very rewarding. In particular, computing image S5 on the local network of 111 processors is actually slower than on the cluster. Here also, poor performance originates from non-parallelized work dominating parallelized work, but on a more dramatic scale. Direct measures showed us that slaves engage in computing subimages no sooner than 200 s after they start. The expected delay inferred from processing powers should be about 140–160 s. Obviously, the computers at École polytechnique are not as fast as we expect them to be, perhaps due to significant memory traffic. Furthermore, about 20 processors are so slow that they do not take any part to the concurrent computation. This is mostly due to the presence of other users⁶.

In spite of those unfavorable results, we claim to have demonstrated the efficiency of our concurrent ray tracer. Namely, for complex images of reasonable memory footprint and building time, we achieve dramatic speedups. Additionally, in a less favorable situation, we still make a decent profit out of the cluster.

6: Other causes are possible though: in one case, the machine was over-heating…