Alexey Lastovetsky

Tutorial

"PARALLEL COMPUTING
ON COMMON NETWORKS OF COMPUTERS
WITH mpC"

Contents

  1. What is mpC?
  2. First programs
  3. Networks
  4. Network type
  5. Network parent
  6. Synchronization of processes
  7. Network functions
  8. Subnetworks
  9. Vector computations
  10. Heterogeneous parallel computing

1. What is mpC?

The mpC programming language is an extension of the ANSI C language designed specially for programming parallel computations on common networks of heterogeneous computers. The main goal of parallel computing is to speed up the solution of problems. It is exactly this that distinguishes parallel computing from distributed computing, whose main goal is to make software inherently distributed over different computers work together. In the case of parallel computing, partitioning the entire program into components running on different computers is only a means to speed up execution of the program and not its inherent property. Therefore, in mpC the main attention is paid to means that facilitate writing efficient and portable applications that solve problems on common networks of computers.

2. First programs

A parallel program is a set of parallel processes interacting (that is, synchronizing their work and interchanging data) by means of message passing. The mpC programmer cannot determine how many processes constitute the program or which computers execute which processes. This is specified by means external to the mpC language. The source mpC code only determines what computations are performed by each of the processes constituting the parallel program.

To start with, let us consider the simplest program, p1.mpc, which does the same as the most famous C program, namely, outputs the text "Hello, world!" to the user's terminal.

The code of this mpC program differs very little from the code of the C program. The first difference is the [*] specifier before the name main in the definition of the main function. It specifies the kind of the function, saying that the code of the function shall be executed by all processes of the parallel program. Functions similar to the function main are called basic functions in mpC. Correct work of a basic function is only possible if all processes of the parallel program call it. The mpC compiler controls the correctness of basic function calls.

The second difference is the construct [host] before the name of the function printf in the expression where this standard C library function is called. Unlike the function main, the function printf does not need to be called in parallel by all processes of the parallel program in order to work correctly. Moreover, a call to the function in any single process of the parallel program makes sense and is correct. Such functions are called nodal in mpC. The mpC language allows a nodal function to be called both by any single process of the parallel program and by any group of processes in parallel. In the program p1.mpc, the function printf is only executed by a single process of the parallel program, namely, by the process associated with the user's terminal from which the program was started. The mpC keyword host is hard-wired to exactly this process.

So, execution of the program comes down to the host-process calling the function printf, while all the other processes do nothing.
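
The listing of p1.mpc is not reproduced here; a minimal sketch of what it might look like, given the two differences just described, is the following:

   #include <stdio.h>

   int [*] main()
   {
      [host]printf("Hello, world!\n");
   }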

The syntax of the next program, p2.mpc, differs even less from the syntax of the C "Hello, world!" program. Nevertheless, it describes more parallel computations than the program p1.mpc. Namely, its execution consists in all processes of the parallel program calling the function printf. The result of execution of the program depends on the operating environment. In some environments, the standard output of all the parallel processes will go to the user's terminal from which the program has been started. In that case, the user will see as many "Hello, world!" greetings on the terminal as there are parallel processes in the program - one greeting from each process. In other environments, you will only see greetings from the processes that run on the same computer as the host-process, or even a single greeting from the host-process.

Thus, the program p2.mpc can produce differing results in different environments. This means that the program is not portable, which is a serious disadvantage. The program p3.mpc, which outputs "Hello, world!" to the user's terminal from all processes of the parallel program, is free of this disadvantage. The mpC library function MPC_Printf guarantees that each process calling this function will output "Hello, world!" exactly to the user's terminal.
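
As a sketch (again, not the original listing), p2.mpc would simply drop the [host] construct so that every process calls printf, while p3.mpc might look as follows, assuming the header mpc.h declares MPC_Printf:

   #include <mpc.h>

   int [*] main()
   {
      MPC_Printf("Hello, world!\n");
   }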

The next program, p4.mpc, differs from our first program p1.mpc only in that the host-process sends a richer message to the user's terminal, while all the other processes still do nothing. Namely, in addition to "Hello, world!", the host-process outputs the name of the computer which runs it. To do this, we define the variable un allocated in the memory of the host-process (this is specified with the construct [host] before the name of the variable in its definition). After the host-process calls the nodal (library) function uname, the member nodename of the structure un will contain a pointer to the name of the computer running the process.
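
A hedged sketch of p4.mpc, assuming the POSIX header sys/utsname.h for the declarations of struct utsname and uname:

   #include <stdio.h>
   #include <sys/utsname.h>

   int [*] main()
   {
      struct utsname [host]un;
      [host]uname(&un);
      [host]printf("Hello, world! The host computer is \"%s\".\n", un.nodename);
   }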

The next program, p5.mpc, sends richer messages to the user's terminal from all processes of the parallel program. In addition to the greeting "Hello, world!", each process reports the name of the computer running it. To do this, the so-called distributed variable un is defined in the program.

The variable is called distributed because each process of the parallel program holds in its memory a copy of the variable, and, hence, the region of storage, represented by this variable, is distributed over the processes. Thus, the distributed variable un is nothing but a set of normal (undistributed) variables, each of which in turn is a projection of the distributed variable onto the corresponding process.

After each process of the parallel program p5.mpc calls the function uname, the member nodename of the corresponding projection of the distributed structure un will contain a pointer to the name of the computer running this process.
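
A sketch of what p5.mpc might look like, assuming that a variable defined in a basic function without a distribution construct is, as described above, distributed over all processes of the program:

   #include <mpc.h>
   #include <sys/utsname.h>

   int [*] main()
   {
      struct utsname un;
      uname(&un);
      MPC_Printf("Hello, world! I am on \"%s\".\n", un.nodename);
   }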

Values of distributed variables and distributed expressions, such as un.nodename or &un, are distributed over processes of the parallel program in the natural way and called distributed values.

The program p6.mpc extends the output of the program p5.mpc with information about the total number of processes of the parallel program. To do it, the program defines two integer distributed variables one and number_of_processes. First, all projections of the variable one are assigned 1 as a result of execution of assignment one=1. The result of applying the postfix operator [+] to the variable one will be a distributed value whose projection to any process will be equal to the sum of values of all projections of the variable one. In other words, the projection of the value of the expression one[+] to any process of the parallel program will be equal to the total number of processes. After assigning the distributed value to the distributed variable number_of_processes, all projections of the latter will hold the same value, namely, the total number of processes of the parallel program.

The definition of the distributed variable one contains the keyword repl (short for replicated). It informs the compiler that all projections of the value of the variable shall be equal to each other in any expression of the program. Such distributed variables are called replicated in mpC (correspondingly, the value of a replicated variable is called a replicated value). Replicated variables and expressions play an important role in mpC. The mpC compiler checks the replication property declared by the programmer and warns about all its possible violations.
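
Putting the pieces of the description together, p6.mpc might look roughly like the following sketch (the placement of the repl specifier is assumed by analogy with the declaration repl const int n discussed in section 7):

   #include <mpc.h>
   #include <sys/utsname.h>

   int [*] main()
   {
      struct utsname un;
      repl int one, number_of_processes;

      uname(&un);
      one = 1;                          /* every projection of one becomes 1 */
      number_of_processes = one[+];     /* sum of all projections = total number of processes */
      MPC_Printf("Hello, world! I am one of %d processes and run on \"%s\".\n",
                 number_of_processes, un.nodename);
   }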

Note that a simpler program, p7.mpc, achieves the same result as the program p6.mpc by using the mpC library function MPC_Total_nodes, which simply returns the total number of processes of the parallel program. Besides, the program p7.mpc is more efficient than the program p6.mpc, because unlike evaluation of the expression one[+], a parallel call of the function MPC_Total_nodes does not need data transfer between processes of the program.

On the contrary, the program p8.mpc, which is a result of slight modification of the program p6.mpc, is less efficient. But it demonstrates how assignment can be used for transferring data between processes in mpC. The variable local_one is held in the memory of the host-process and initialised to 1. The variable one is replicated over the processes of the parallel program. Execution of the assignment one=local_one consists in broadcasting the value of the variable local_one to all processes of the program, followed by its assignment to the projections of the variable one.
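
The key fragment of p8.mpc, as described, might look like this sketch:

   int [host]local_one = 1;       /* exists only in the memory of the host-process */
   repl int one, number_of_processes;

   one = local_one;               /* broadcast from the host, then assigned to every projection */
   number_of_processes = one[+];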

3. Networks

By now we have only considered mpC programs involving in computations either all processes or only the host-process. But often the number of processes involved in the parallel solution of a problem depends on the problem itself or/and the parallel algorithm of its solution and is determined by input data. For example, let a single process be used for a single group of bodies when simulating the evolution of N groups of bodies under the influence of Newtonian gravitational attraction. It means that exactly N processes will have to be involved in the corresponding parallel computations independently of the total number of processes constituting the parallel program. Remember that the parallel program is started up by means external to the mpC language. This means that the total number of processes constituting the program is not defined by the mpC programmer; the programmer only has language means to learn that number.

The program p9.mpc gives a first introduction to the language means that allow the programmer to describe parallel computations on the needed number of processes. The computations themselves remain quite simple - each participating process just outputs "Hello, world!" to the user's terminal. But the number of participating processes, N=3, is defined by the programmer and does not depend on the total number of processes constituting the parallel program.

In mpC, the notion of network corresponds to a group of processes jointly performing some parallel computations. In mpC, a network is an abstraction facilitating work with the actual processes of the parallel program (just as the notions of data object and variable in programming languages facilitate work with memory).

In the simplest case, the network is simply a set of virtual processors. To code computations executed by a given number of parallel processes, the programmer first of all should define a network consisting of the corresponding number of virtual processors, and only after the network is defined, the programmer can start describing parallel computations on the network.

Definition of a network causes creation of a group of processes representing the network, so that each virtual processor is represented by a single process of the parallel program. Description of parallel computations on the network causes execution of the corresponding computations exactly on those processes that represent virtual processors of the network. An important difference between actual processes and virtual processors is that at different moments of program execution the same process can represent different virtual processors of different mpC networks. In other words, definition of a network causes mapping of the virtual processors of this network to actual processes of the parallel program, and this mapping is constant during the lifetime of the network.

So, the program p9.mpc first defines the network mynet of N virtual processors and then calls the nodal library function MPC_Printf on the network. Execution of the program consists in parallel call of the function MPC_Printf by those N processes of the program onto which virtual processors of the network mynet are mapped. This mapping is performed by the mpC programming system at runtime. If the programming system cannot perform such mapping (for example, if N is greater than the total number of processes of the program), the program stops abnormally with the corresponding diagnostics.
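
A sketch of p9.mpc, relying on the network type SimpleNet that, as explained in section 4, comes with the header mpc.h:

   #include <mpc.h>
   #define N 3

   int [*] main()
   {
      net SimpleNet(N) mynet;                 /* a network of N virtual processors */
      [mynet]MPC_Printf("Hello, world!\n");   /* executed by the N processes implementing mynet */
   }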

Note the similarity of language constructs [mynet] and [host]. Indeed, the keyword host can be considered as the name of the pre-defined network consisting exactly of one virtual processor mapped to the host-process associated with the user's terminal.

The program p10.mpc outputs messages from those processes of the parallel program onto which the virtual processors of the network mynet are mapped. In addition to "Hello, world!", each involved process outputs the name of the computer executing the process. To do so, the program defines the variable un distributed over the network mynet. Only the processes implementing the virtual processors of mynet hold a copy of un in their memory. Only these processes call the function uname (which is specified with the construct [mynet] before the function name), and after this call the member nodename of each projection of the distributed structure un will contain a pointer to the name of the computer running the corresponding process.

The semantics of the program p11.mpc is completely equivalent to that of the program p10.mpc. But due to the use of a special distributed label, [mynet], it has a simpler syntax. Any statement labeled with such a label (in our case, a compound statement) will be executed completely by the virtual processors of the corresponding network.
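
Assuming the distributed-label form suggested by the description (the original listing is not shown here), p11.mpc might be sketched as:

   #include <mpc.h>
   #include <sys/utsname.h>
   #define N 3

   int [*] main()
   {
      net SimpleNet(N) mynet;
      [mynet]:                     /* the labeled block is executed on mynet */
      {
         struct utsname un;
         uname(&un);
         MPC_Printf("Hello, world! I am on \"%s\".\n", un.nodename);
      }
   }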

The next program, p12.mpc, demonstrates that the number of virtual processors of a network can be specified dynamically, that is, at runtime. This program treats its only external argument as such a number. The argument is specified by the user when starting up the program and is accessible at least to the host-process. The expression [host]atoi(argv[1]) is calculated by the host-process and then assigned to the integer variable n replicated over all processes of the parallel program. Execution of this assignment consists in broadcasting the calculated value to all processes of the program followed by its assignment to the projections of the variable n. Before defining (creating) a network of the user-specified number of virtual processors and executing the described computations on the network, the program checks the correctness of the input data. If the specified number of virtual processors is incorrect (less than 1 or greater than the total number of processes of the parallel program), the program outputs the corresponding diagnostics. Otherwise, the program defines the network mynet consisting of n virtual processors as well as the variable un distributed over the network. Then the nodal function uname is called on the network mynet, so that the member nodename of each projection of the distributed structure un comes to contain a pointer to the name of the computer where the mpC programming system has placed the corresponding virtual processor. Finally, the call of the nodal function MPC_Printf on the network mynet outputs "Hello, world!" from each virtual processor together with the name of the computer hosting the virtual processor.
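
The core of p12.mpc might be sketched as follows (how main receives its arguments is not shown here, and n is assumed to be declared as a replicated int):

   repl int n;

   n = [host]atoi(argv[1]);                  /* evaluated on the host, then broadcast */
   if (n < 1 || n > MPC_Total_nodes())
      [host]printf("Wrong number of virtual processors\n");
   else
   {
      net SimpleNet(n) mynet;                /* the size of the network is fixed at runtime */
      struct utsname [mynet]un;
      [mynet]uname(&un);
      [mynet]MPC_Printf("Hello, world! I am on \"%s\".\n", un.nodename);
   }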

Lifetime of both the network mynet and the variable un is limited by the block in which they are defined. When execution of the block ends, all processes of the program that have been taken for virtual processors of the network mynet are freed and can be used for other networks. Such mpC networks are called automatic.

Lifetime of static networks is only limited by the time of program execution. Programs p13.mpc and p14.mpc demonstrate the difference between static and automatic networks. The programs look almost identical. Both consist in cyclic execution of the block defining a network and executing already familiar computations on the network. The only but essential difference is that the first program defines an automatic network meanwhile the second one defines a static network.

During execution of the program p13.mpc, at the first loop iteration (n=Nmin=3) a network of 3 virtual processors is created on entry into the block, and this network is destroyed when execution of the block ends. At the second loop iteration (n=4) a new network of 4 virtual processors is created on entry into the block, and that network is also destroyed when execution of the block ends. So at the moment when the loop increment (the expression n++) is executed, the 4-processor network no longer exists. Finally, at the last iteration an automatic network of 5 virtual processors (n=Nmax=5) is created on entry into the block.

During execution of the program p14.mpc, at the first loop iteration a network of 3 virtual processors is also created on entry into the block, but this network is not destroyed when execution of the block ends. It simply becomes invisible. Thus in this case the block is not the region where the network exists but the region of its visibility. Therefore, at the time of the loop increment and the evaluation of the loop condition, the static 3-processor network still exists but is not accessible (because these points of the program are outside the scope of the network name mynet). On subsequent entries into the block at later loop iterations no new networks are created; instead, the static network, which was created on the first entry into the block, becomes visible again.

Thus, while in the program p13.mpc the same name mynet denotes completely different networks at different loop iterations, in the program p14.mpc this name denotes a single network existing from the first entry into the block in which it is defined until the end of program execution.
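
The contrast might be sketched as follows (the names anet and snet are illustrative; n is assumed to be a replicated int running from Nmin=3 to Nmax=5, and the block body is the familiar greeting code):

   /* p13.mpc style: a new automatic network on each entry into the block */
   for (n = Nmin; n <= Nmax; n++)
   {
      auto net SimpleNet(n) anet;
      [anet]MPC_Printf("Hello, world!\n");
   }

   /* p14.mpc style: one static network, created on the first entry only */
   for (n = Nmin; n <= Nmax; n++)
   {
      static net SimpleNet(n) snet;
      [snet]MPC_Printf("Hello, world!\n");
   }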

If the kind of a network is not specified explicitly by means of the keyword auto or static in its definition, the network is considered automatic if declared inside a function, and static if declared outside any function. So, all networks in the programs p9.mpc, p10.mpc, p11.mpc and p12.mpc are implicitly declared automatic.

4. Network type

By now, in our mpC programs all virtual processors of the same network have performed the same computations. Therefore, we did not need to distinguish between different virtual processors inside a network. But if the parallel algorithm to be coded requires different processes to execute differing computations, some means of singling out individual virtual processors inside the network are needed. The mpC language provides the programmer with such means. In particular, it allows the programmer to associate the virtual processors of any network with a coordinate system and to single out an individual virtual processor by specifying its coordinates.

Generally speaking, in mpC one cannot just define a network but only a network of some type. Type is the most important attribute of network. In particular, it determines how to access separate virtual processors of the network. The type specification is a mandatory part of any network definition. Therefore, any network definition should be preceded by the definition of the corresponding network type. In all examples that have been considered the definition of the used network type SimpleNet can be found among other standard definitions of the mpC language in the header file mpc.h and is included in these programs with the #include directive. The definition looks as follows:

nettype SimpleNet(int n) {
  coord I=n;
};

It introduces the name SimpleNet of the network type parameterised with the integer parameter n. The body of the definition declares the coordinate variable I ranging from 0 to n-1. The type SimpleNet is the simplest parameterised network type that describes networks consisting of n virtual processors well ordered by their positions on the coordinate line.

The program p15.mpc, obtained by slight modification of the program p9.mpc, demonstrates how execution of differing computations by different virtual processors can be coded. The program uses the binary operator coordof with the coordinate variable I and the network mynet as its left and right operands correspondingly. The result is an integer value distributed over the network mynet whose projection onto a virtual processor is equal to the value of the coordinate I of this virtual processor in that network. After execution of the assignment

   my_coordinate = I coordof mynet,

each projection of the variable my_coordinate will hold the coordinate of the corresponding virtual processor of the network mynet. As a result, virtual processors with even coordinates will output "Hello, even world!", meanwhile ones with odd coordinates will output "Hello, odd world!".
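
Based on the description above, p15.mpc might be sketched as follows (a sketch only; in particular, the way the even/odd branching is expressed is an assumption):

   #include <mpc.h>
   #define N 3

   int [*] main()
   {
      net SimpleNet(N) mynet;
      int [mynet]my_coordinate;

      my_coordinate = I coordof mynet;       /* each virtual processor learns its own coordinate */
      if (my_coordinate % 2 == 0)
         [mynet]MPC_Printf("Hello, even world!\n");
      else
         [mynet]MPC_Printf("Hello, odd world!\n");
   }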

The program p16.mpc demonstrates a network whose virtual processors are associated with a 2-coordinate system. Each virtual processor of the network outputs its coordinates and the name of the computer hosting it. Note that in this program the variable un, not a network, is used as the second operand of the operator coordof. In general, if the second operand of the operator coordof is an expression rather than a network, the expression is not evaluated but only used to determine the network over which the expression is distributed, and the operator is executed as if that network were its second operand.
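
A sketch of p16.mpc, with the two-parameter network type assumed to be declared along the lines of SimpleNet:

   #include <mpc.h>
   #include <sys/utsname.h>

   nettype Mesh(int m, int n) {
     coord I=m, J=n;
   };

   int [*] main()
   {
      net Mesh(2,3) mynet;
      struct utsname [mynet]un;

      [mynet]uname(&un);
      [mynet]MPC_Printf("I am virtual processor (%d, %d) on \"%s\".\n",
                        I coordof un, J coordof un, un.nodename);
   }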

5. Network parent

We have discussed that lifetime of an automatic network is limited by the block in which the network is defined. When execution of the block ends, the network ceases to exist, and all processes taken for virtual processors of the network are freed and can be used for other networks.

The question is how results of computations on automatic networks can be saved and used in further computations. Our previous programs did not raise this problem, because the only result of parallel computations on networks was the output of some messages to the user's terminal.

Actually, in mpC networks are not absolutely independent of each other. Every newly created network has exactly one virtual processor shared with already existing networks. That virtual processor is called the parent of the newly created network and is the connecting link through which results of computations are passed when the network ceases to exist. The parent of a network is always specified by the definition of the network, explicitly or implicitly.

So far, no network has been defined with explicit specification of its parent. The parent has been specified implicitly, and it was nothing but the virtual host-processor. This choice is natural, because at any moment of program execution the existence of only one network can be guaranteed, namely, the pre-defined network host consisting of a single virtual processor always mapped onto the host-process associated with the user's terminal.

The program p17.mpc is completely equivalent to the program p16.mpc except that in the definition of the network the explicit specification of the parent replaces the implicit one.

One more difference can be found in the definition of the network type: a line explicitly specifying the coordinates of the parent in networks of this type is added (the coordinates default to zeros). Should we for some reason need the parent of the network mynet to have not the least but the greatest coordinates, the specification parent [m-1,n-1] would have to be used in the definition of the network type Mesh instead of parent [0,0].
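
The relevant pieces might be sketched like this (the construct [host] before the network name is an assumed form of the explicit parent specification described above):

   nettype Mesh(int m, int n) {
     coord I=m, J=n;
     parent [0,0];                     /* the parent has the least coordinates */
   };

   int [*] main()
   {
      net Mesh(2,3) [host]mynet;       /* explicit parent: the virtual host-processor */
      /* the rest is as in p16.mpc */
   }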

So far, at any moment of mpC program execution no more than one network has existed. This is not a restriction of the mpC language. The mpC language allows the programmer to write programs with an arbitrary number of simultaneously existing (and visible) networks. The only limitation is the total number of processes constituting the parallel program.

In program p18.mpc, there simultaneously exist three networks - net1, net2 and net3. The parent of the network net1 is the virtual host-processor. The parent of the network net2 is the virtual processor of the network net1 with coordinate 1. The parent of the network net3 is the virtual processor of the network net1 with coordinate 2.

6. Synchronization of processes

We have already mentioned that the parallel program is a set of parallel processes synchronizing their work and interchanging data by means of message passing. The means of the mpC language introduced so far allow the programmer to specify the number of processes needed for the parallel solution of a problem as well as to distribute computations among the processes. In principle, these means are also enough for synchronization of the processes during execution of the parallel program.

The basic synchronization mechanism for parallel processes interacting via message passing is a barrier. A barrier is a point of the parallel program where a process waits for all the other processes with which it synchronizes its work. Only after all the processes synchronizing their work have reached the barrier can they continue further computations. If for some reason even one of the processes does not reach the barrier, all the other processes will "hang" at this point of the program and the program itself will never terminate normally.

Suppose we want to change the program p15.mpc in such a way that messages from virtual processors with odd coordinates come to the user's terminal only after messages from virtual processors with even coordinates. The program p19.mpc solves the problem as follows. In the block labeled Barrier, the array bs located on the virtual host-processor and the variable b distributed over the network mynet are defined. The number of elements in the array bs is equal to the number of virtual processors in the network mynet. Execution of the assignment bs[]=b consists in the following: each virtual processor of the network mynet sends the value of its projection of the variable b to the virtual host-processor, where this value is assigned to the element of the array bs whose index is equal to the coordinate I of the virtual processor. Execution of the assignment b=bs[] consists in sending the value of the i-th element of bs to the virtual processor of mynet with coordinate I=i, where this value is assigned to the corresponding projection of b. One can see that none of the virtual processors of mynet ends the execution of the block before all of them have entered the block and completed the first assignment. Only after that happens can the virtual host-processor complete the first assignment and start executing the second one, freeing one by one all the other virtual processors that were suspended at that point. Thus, this block is nothing but a barrier serialising the statements that output messages from even and odd virtual processors of the network mynet correspondingly.
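
In code, the barrier block of p19.mpc can be pictured with the same pattern that appears later in the body of MPC_Barrier (section 7):

   Barrier:
   {
      int [host]bs[N], [mynet]b = 1;
      bs[] = b;       /* every virtual processor of mynet sends b to the host */
      b = bs[];       /* the host sends bs[i] back to the processor with I==i */
   }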

In the program p20.mpc a similar barrier is implemented in a simpler and more concise way. At the same time, its efficiency is hardly any worse, because the most obvious implementation of the operator [+] only differs from the barrier in the program p19.mpc by an additional summation on the host-processor, whose execution time is negligibly short compared to the time of data transfer.

Actually, the programmer does not need to invent different ways to implement barriers. The mpC language provides two library functions that efficiently implement barrier synchronization for the operating environment where the parallel program runs. The first is the basic function MPC_Global_barrier, which synchronises the work of all processes of the parallel program. Its declaration in the header mpc.h looks as follows:

int [*]MPC_Global_barrier(void);

The program p21.mpc demonstrates the use of this function for serialisation of the two already familiar parallel statements. Note that unlike the two previous programs, in this program not only the processes implementing the network mynet but also all free processes participate in the barrier synchronization. Naturally, this implies a greater number of messages being transferred during execution of the barrier and hence some slowdown of program execution.
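
Schematically, the serialisation pattern used in p21.mpc looks like this (an illustrative fragment, not the original listing):

   [mynet]MPC_Printf("First group of messages\n");
   MPC_Global_barrier();                    /* every process of the program waits here */
   [mynet]MPC_Printf("Second group of messages\n");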

7. Network functions

The library function MPC_Barrier allows synchronising the work of the virtual processors of any network. The program p22.mpc demonstrates the use of this function. In this program, like in programs p19.mpc and p20.mpc, only the processes implementing the network mynet participate in the barrier synchronization.

In the program p22.mpc the call of the function MPC_Barrier looks a little bit unusual. Indeed, this function differs in principle from all the functions we have met before; it represents so-called network functions. Unlike basic functions, which are always executed by all processes of the parallel program, network functions are executed on networks and hence can be executed in parallel with other network or nodal functions. Unlike nodal functions, which can also be executed in parallel by all processes of one or another network, the virtual processors of a network executing a network function can transfer data among themselves, and this makes network functions a bit similar to basic functions.

The declaration of the function MPC_Barrier is in the header file mpc.h and looks as follows:

int [net SimpleNet(n) w] MPC_Barrier( void );

Any network function has a special network formal parameter, which represents the network executing the function. In the declaration of the network function, a specification of that parameter is in brackets just before the name of the function and looks like normal network definition. In the case of the function MPC_Barrier, the specification of the network parameter looks as follows:

net SimpleNet(n) w

In addition to the formal network w executing the function MPC_Barrier, this declaration introduces the parameter n of this network. Like normal formal parameters, this parameter is available in the body of the function as if it was declared with specifiers repl and const. Since in accordance with the definition of the network type SimpleNet the parameter n is of the type int, one can say that the parameter n is treated in the body of the function MPC_Barrier as if it were a normal formal parameter declared as follows:

repl const int n

All normal formal parameters are considered distributed over the formal network parameter.

Thus the integer constant parameter n, replicated over the network w, determines the number of virtual processors of the network.

If the function MPC_Barrier were not a library one, it could be defined as it is done in the program p23.mpc:

   int [net SimpleNet(n) w] MPC_Barrier( void ) {
      int [w:parent]bs[n], [w]b=1;
      bs[]=b;
      b=bs[];
   }

In the body of this function the automatic array bs of n elements (the mpC language supports dynamic arrays) is defined. This array is located on the parent of the network w that is specified with the construct [w:parent] before the name of the array in its definition. In addition, the variable b distributed over the network w is also defined there. A couple of statements following the definition implement a barrier for virtual processors of the network w.

In the programs p22.mpc and p23.mpc, calls to the network function MPC_Barrier pass the actual network parameter mynet as well as the actual value of the only parameter of the network type SimpleNet. At first glance, the latter looks redundant. But it should be taken into account that networks of any type, and not only of the SimpleNet type, can be passed to this function as an actual network parameter. For example, in the program p24.mpc a network of the type Mesh is such an actual parameter. Actually, the function MPC_Barrier simply treats the group of processes on which it is called as a network of the type SimpleNet.

In general, the actual network parameter may be such that it can be correctly interpreted for various values of the parameters of the network type used in the definition of the called network function. Therefore, the values of those parameters must be explicitly specified in the function call. The program p25.mpc demonstrates just such a situation, when the actual values of the network type parameters passed to the network function are not the only possible ones.

8. Subnetworks

Let us remember again that the parallel program is a set of parallel processes synchronizing their work and interchanging data by means of message passing. The means of the mpC language that have been introduced allow the programmer to specify the number of processes needed for parallel solution of the problem, distribute computations among the processes as well as synchronise their work during execution of the parallel program. But the means are obviously not sufficient for specification of data transfer among processes.

Indeed, so far either all processes of the parallel program or all virtual processors of one or another network have taken part in data transfer, and the data transfer itself has mainly consisted in either broadcasting some value to all participating processes or gathering values from all participating processes on one of them. The language means introduced so far are not sufficient to describe more complicated data transfers, for example, transfers between groups of virtual processors of a network or parallel data exchange between neighbouring virtual processors of a network.

The basic means of the mpC language for describing complicated data transfers are subnetworks. Any subset of the virtual processors of a network is a subnetwork of this network. For example, any row or column of a network of the type Mesh(m,n) is a subnetwork of the network.

In the program p26.mpc each virtual processor of the network mynet, which has the type Mesh(2,3), outputs to the user's terminal not only the name of the computer hosting this virtual processor but also the name of the computer hosting the closest virtual processor in the neighbouring row. To do this, the program defines two subnetworks row0 and row1 of the network mynet. The subnetwork row0 consists of all virtual processors of the network mynet whose coordinate I is equal to zero, that is, it corresponds to the zero row of the network mynet. This is specified with the construct [mynet:I==0] before the name of the subnetwork in its definition. Similarly, the subnetwork row1 corresponds to the first row of the network mynet. In general, the logical expressions describing the virtual processors of subnetworks can be quite complex and allow specifying very sophisticated subnetworks. For example, the expression I<J && J%2==0 specifies the virtual processors of the network above the main diagonal in even columns.
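
The corresponding definitions in p26.mpc might be sketched as follows (the subnet keyword and the array sizes are assumptions):

   net Mesh(2,3) mynet;
   subnet [mynet:I==0]row0;                    /* the zero row of mynet */
   subnet [mynet:I==1]row1;                    /* the first row of mynet */
   char [mynet]me[32], [mynet]neighbour[32];   /* computer names, one copy per virtual processor */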

Execution of the assignment [row0]neighbour[]=[row1]me[] consists in parallel transferring the corresponding projection of the distributed array me from each j-th virtual processor of the row row1 to each j-th virtual processor of the row row0, followed by its assignment to the corresponding projection of the array neighbour. Similarly, execution of the assignment [row1]neighbour[]=[row0]me[] consists in parallel transferring the content of the corresponding projection of the distributed array me from each j-th virtual processor of the row row0 to each j-th virtual processor of the row row1, followed by its assignment to the corresponding projection of the array neighbour. As a result, the projection of the distributed array neighbour on the virtual processor (0,j) contains the name of the computer hosting the virtual processor (1,j), and the projection of this array on the virtual processor (1,j) contains the name of the computer hosting the virtual processor (0,j).

The only difference of the program p27.mpc from the program p26.mpc is that the row subnetworks are not defined explicitly. In this case, the usage of implicitly defined subnetworks is justified because it simplifies the program code without loss of program efficiency or functionality. But there are situations when you cannot avoid explicit definition of subnetworks. For example, network functions can be called only on explicitly defined subnetworks.

9. Vector computations

In addition to subnetworks, the previous two programs have for the first time used some mpC means for specifying vector computations. Like another well-known parallel programming language, HPF (High Performance Fortran), mpC extends most scalar operators so that their operands may also be expressions designating sets of scalars, such as arrays or array segments. In the programs p26.mpc and p27.mpc the expressions neighbour[] and me[] designate the arrays neighbour and me as a whole, and not simply pointers to their initial elements. The assignment neighbour[]=me[] means assigning the value of the expression me[], a vector of values of the type char, to the array neighbour.

On the one hand, vector expressions allow the programmer to simplify the description of array-based computations; on the other hand, they allow the compiler to generate code that makes more efficient use of both intra-processor parallelism and the memory hierarchy. In the program p28.mpc the single expression [+](v1[]*v2[]) describes the computation of the dot product of two vectors contained in the arrays v1 and v2. Execution of the vector binary operator * consists in element-wise multiplication of its vector operands, and the result of the prefix unary operator [+] is the sum of the elements of its vector operand.
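
For instance, a dot product along the lines of p28.mpc could be written as the following sketch (placing the arrays on the host is an assumption):

   #define N 100
   double [host]v1[N], [host]v2[N], [host]d;
   /* ... v1 and v2 are filled in ... */
   d = [+](v1[] * v2[]);     /* element-wise product, then sum of all elements */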

The program p29.mpc implementing LU decomposition of a square matrix (Gaussian elimination) shows the usage of array segments in vector computations. For example, the expression a[i][i:n-1] designates the segment of the i-th row of the array a that includes all elements from a[i][i] to a[i][n-1].
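
The elimination kernel of such a program might be sketched as follows (assuming, as in other array languages, that a scalar operand combines element-wise with a vector operand):

   for (i = 0; i < n-1; i++)
      for (k = i+1; k < n; k++)
      {
         a[k][i] /= a[i][i];                         /* multiplier, stored in the L part */
         a[k][i+1:n-1] -= a[k][i] * a[i][i+1:n-1];   /* update the trailing segment of row k */
      }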

10. Heterogeneous parallel computing

We have discussed that definition of a network causes mapping virtual processors of the network to actual processes of the parallel program, and this mapping is constant during the lifetime of this network. But we have not discussed how the programming system performs that mapping and how the programmer can manage it.

We have emphasized that the main goal of parallel computing is to speed up the solution of problems. It is exactly this that distinguishes parallel computing from distributed computing. Therefore, it is natural that minimization of the running time of the parallel program is the main target while mapping virtual processors of the network to actual processes. While performing the mapping, the programming system relies, on the one hand, on information about the configuration and performance of the components of the parallel computer system executing the program, and on the other hand, on information about the relative volumes of computations that will be performed by different virtual processors of the defined network.

We have not specified volumes of computations in our programs yet. Therefore, the programming system has considered all virtual processors of a network to perform the same volume of computations. Proceeding from this assumption, it has tried to perform the mapping so as to keep the number of virtual processors mapped to an actual processor approximately proportional to the performance of that processor (naturally, taking into account the maximum number of virtual processors that can be hosted by one or another real processor). Such a mapping ensures that all processes representing virtual processors of the network execute computations at approximately the same speed. Therefore, if the volumes of computations performed by different virtual processors between points of synchronisation or data transfer are approximately the same, the parallel program as a whole will be balanced in the sense that the processes will not wait for each other at those points of the program.

Such a mapping proved acceptable in all our programs because, indeed, the computations performed by different virtual processors of the network were approximately the same and, in addition, of very small volume. But in the case of substantial differences in the volumes of computations performed by different virtual processors, it can lead to very low speed of program execution, because processes performing computations of small volume will wait at synchronisation points and points of data transfer for processes executing computations of larger volume. In that case, a mapping that makes the speed of each process proportional to the volume of computations it performs leads to a more balanced and faster parallel program.

The mpC language provides means for specifying the relative volumes of computations performed by different virtual processors of one or another network. The mpC programming system uses this information to map the virtual processors of the network to the processes of the parallel program in such a way that each virtual processor performs its computations at a speed proportional to their volume.

The program p30.mpc introduces these means. The program defines the network type HeteroNet parameterised with two parameters. The integer scalar parameter n determines the number of virtual processors of the corresponding network. The vector parameter v consists of n elements of the type double and is used just for specification of relative volumes of computations performed by different virtual processors. The definition of the network type HeteroNet contains an unusual declaration,

node { I>=0: v[I] },

saying the following: for any I>=0, the relative volume of computations performed by the virtual processor with coordinate I is equal to v[I].
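
Put together, the network type definition might look like this sketch (the exact syntax of the vector parameter declaration is an assumption):

   nettype HeteroNet(int n, double v[n]) {
     coord I=n;
     node { I>=0: v[I] };
   };

The network mynet in p30.mpc would then be defined along the lines of net HeteroNet(N, volumes) mynet;, with the replicated array volumes holding the relative volumes of computations.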

The program p30.mpc calculates the mass of a metallic construction welded from N heterogeneous rails. For parallel computation of the total mass of the metallic "hedgehog", it defines the network mynet consisting of N virtual processors, each calculating the mass of one of the rails. The calculation is performed by numerical 3-dimensional integration of the density function Density with a constant integration step. Obviously, the volume of computations needed to calculate the mass of a rail is proportional to the volume of the rail. Therefore, the replicated array volumes, whose i-th element contains the volume of the i-th rail, is used as the second actual parameter of the network type HeteroNet in the definition of the network mynet. Thus the program specifies that the volume of computations performed by the i-th virtual processor of the network mynet is proportional to the volume of the rail whose mass this virtual processor computes.

Along with the results of the calculations, the program p30.mpc outputs the wall time elapsed to execute the calculations. To do this, the program uses the library nodal function MPC_Wtime, which returns the astronomical time in seconds elapsed from some unspecified moment in the past that is fixed for the process calling the function. Without going into philosophical speculations about the relativity of time on different processes, note that, in spite of its seeming simplicity, the wall time elapsed to solve the problem in parallel, measured on the host-process from data input until output of the results, is the most objective temporal characteristic of the program and the one most interesting to the end user. As a matter of fact, minimisation of this characteristic is the main goal of parallel computing.

The program p31.mpc is equivalent to the program p30.mpc except that it does not specify explicitly the relative volumes of computations performed by different virtual processors of the network mynet. Therefore, while mapping the virtual processors to actual processors, the programming system assumes that they perform equal volumes of computations. As a rule, this leads to a non-optimal mapping and hence to a longer time of solving the problem compared to the program p30.mpc. The slowdown is especially visible in the case of heterogeneous networks including processors that differ significantly in performance. Thus, while executing the program p31.mpc it is quite possible that the mass of the biggest rail will be calculated on the weakest processor, resulting in a multi-fold slowdown compared to execution of the same calculations by the program p30.mpc, which ensures that the most powerful processor computes the mass of the biggest rail.

Having stated that the mapping of virtual processors to real processors is based on information about the performance of the latter, we have said nothing about what processor performance is and how the mpC programming system obtains this information. The issue is not as simple as it may seem at first glance. Indeed, what does one mean by saying that computer A is twice as powerful as computer B? Strictly speaking, the claim makes little sense unless one is talking about computers of the same architecture and configuration differing only in processor clock rate. Otherwise, the relative performance of computers, that is, the relative speed of executing computations, depends very essentially on exactly which computations are executed. Often, a computer showing the best performance when executing one program turns out to be the slowest when executing another. This is clearly seen when one analyses the published results of performance measurements of different computers obtained with a fairly wide range of specialised benchmark packages.

By default, the mpC programming system uses the same estimation of the performance of the participating real processors, obtained once as a result of execution of a special parallel program during initialization of the system in the particular parallel environment. We have already noted that such an estimation is very rough and can differ significantly from the real performance demonstrated by the same processors while executing code substantially different from the code of this special test program. Therefore, the mpC language contains means that allow the programmer to change the default performance estimation, tuning it to the computations that will actually be executed by the processors.

The program p32.mpc illustrates the use of these means. The program differs from the program p30.mpc mainly by a new statement, recon, executed just before the definition of the network mynet. Execution of this statement consists in all physical processors running the program executing in parallel the code provided by the statement (in our case, a call of the function RailMass with actual parameters 20.0, 4.0, 5.0 and 0.5), and the time elapsed by each of the real processors to execute this code is used to refresh the estimation of its performance. The main part of the total volume of computations performed by each virtual processor of the network mynet falls exactly on execution of calls to the function RailMass. Therefore, while creating the network mynet, the programming system relies on an estimation of the performance of the real processors that is very close to the actual performance they show while executing this program.
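
Based on this description, the statement itself might be sketched as:

   recon RailMass(20.0, 4.0, 5.0, 0.5);   /* every physical processor runs this call; the measured
                                             times refresh the performance estimation */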

It is very important that the recon statement allows the estimation of processor performance to be refreshed dynamically, at runtime, just before the estimation is used by the programming system. This is especially important when the parallel computer system executing the mpC program is used for other computations as well. In this case, the real performance of the processors can change dynamically depending on their load by computations external to the mpC program. The use of the recon statement allows writing parallel programs that are sensitive to such dynamic variation of the load on the underlying computer system. In such programs, computations are distributed over the real processors in accordance with their actual performance at the moment the computations are executed.