dispy : Python Framework for Distributed and Parallel Computing

dispy

A dispy client program should first create a cluster for a computation, which can be either a Python function or a standalone program. The cluster can also specify which nodes can execute it. Once the cluster is created, the computation can be evaluated as many times as necessary, each time with different data, by invoking the submit method on the cluster. Each call to submit returns a DispyJob instance, or job, which can later be examined for the result of execution, output or error messages, etc.

While dispy and its components have many options covering rather comprehensive use cases, which may make them seem complex, most options have default values that work for common cases. For example, starting the 'dispynode.py' program on each node of a local network and creating a JobCluster with just the computation, and possibly the depends parameter, may be sufficient.

There are two ways to create clusters with dispy: JobCluster and SharedJobCluster. If only one instance of dispy is running at any time, JobCluster is simpler to use; it contains a scheduler that schedules jobs to nodes running 'dispynode'. If, however, multiple programs using dispy may be running simultaneously, JobCluster cannot be used - the scheduler in each instance would assume the nodes are under its exclusive control, causing conflicts. Instead, SharedJobCluster must be used. In this case, dispyscheduler must also be running on some computer and SharedJobCluster must set the scheduler_node parameter to the node running dispyscheduler (the default is the host that calls SharedJobCluster).

JobCluster

JobCluster(computation, nodes=['*'], depends=[], callback=None, ip_addr=None, ext_ip_addr=None, port=51347, node_port=51348, fault_recover=False, dest_path=None, loglevel=logging.WARNING, cleanup=True, pulse_interval=None, ping_interval=None, reentrant=False, secret='', keyfile=None, certfile=None)
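For example, a minimal client might look like the following sketch (the compute function here is only a placeholder):

import dispy

def compute(n):
    # executed on a node; any modules the function needs must be
    # imported inside the function itself
    import time
    time.sleep(n)
    return n * n

if __name__ == '__main__':
    # default options: use all nodes on the local network that run dispynode
    cluster = dispy.JobCluster(compute)
    jobs = [cluster.submit(n) for n in range(5)]
    for job in jobs:
        print('result: %s' % job())   # job() waits for the job to finish
    cluster.stats()
    cluster.close()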

SharedJobCluster

SharedJobCluster has almost the same syntax, except as noted below.

SharedJobCluster(computation, nodes=['*'], depends=[], ip_addr=None, port=None, scheduler_node=None, scheduler_port=None, ext_ip_addr=None, dest_path=None, loglevel=logging.WARNING, cleanup=True, reentrant=False, secret='', keyfile=None, certfile=None) where all arguments common to JobCluster have the same meaning.
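For example, assuming dispyscheduler is running on a host named 'scheduler_host' (a placeholder), a SharedJobCluster client might look like this sketch:

import dispy

def compute(n):
    import time
    time.sleep(n)
    return n

if __name__ == '__main__':
    # 'scheduler_host' is a placeholder for the host running dispyscheduler
    cluster = dispy.SharedJobCluster(compute, scheduler_node='scheduler_host')
    jobs = [cluster.submit(n) for n in range(5)]
    for job in jobs:
        print('result: %s' % job())
    cluster.close()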

DispyJob

The submit call of a cluster returns an instance of DispyJob (see dispy.py), which can be used to examine the status of job execution, retrieve job results, etc. The job instance has an id field that can be set to any appropriate value (the rest of the fields are either read-only or not meant for user programs). For example, the id field can be set to a unique value to distinguish one job from another.

A job's status field is read-only; its value is one of Created, Running, Finished, Cancelled or Terminated, indicating the current status of the job. If a job is created with SharedJobCluster, status is not updated to Running when the job is actually running.

When a submitted job is called with job(), it returns that job's execution result, possibly waiting until the job is finished. After a job is complete, its result, stdout and stderr fields hold the execution result and any output or error messages.

A job's result, stdout and stderr should not be large - these are buffered in memory (not stored on disk). Moreover, like args and kwargs, result should be serializable (a picklable object). If result is (or contains) an instance of a Python class, that class may have to provide a __getstate__ method so the object can be serialized.
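For example, a hypothetical result class that holds an unpicklable member could provide __getstate__ along these lines (this class is only illustrative, not part of dispy):

import threading

class Result(object):
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()  # locks are not picklable

    def __getstate__(self):
        # return a copy of the instance's state without the unpicklable
        # lock, so the object can be serialized and sent to the client
        state = self.__dict__.copy()
        del state['lock']
        return state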

After jobs are submitted, cluster.wait() can be used to wait until all submitted jobs for that cluster have finished. If necessary, the results of execution can then be retrieved with either job() or job.result, as described above.
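Putting these pieces together, a sketch (assuming 'cluster' and 'compute' are created as in the earlier examples) might be:

jobs = []
for n in range(10):
    job = cluster.submit(n)
    job.id = n          # id can be set to any value to identify the job
    jobs.append(job)
cluster.wait()          # block until all submitted jobs of this cluster finish
for job in jobs:
    if job.status == dispy.DispyJob.Finished:
        print('job %s result: %s' % (job.id, job.result))
    else:
        print('job %s status: %s, stderr: %s' % (job.id, job.status, job.stderr))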

Fault Recovery

As noted above, if the 'fault_recover' option is used when creating a cluster, dispy stores information about scheduled but unfinished jobs in a file. If the user program then terminates unexpectedly, the nodes that execute those jobs can't send the results back to dispy. In such cases, the results for those jobs can be retrieved from the nodes with the following function in dispy:

fault_recover_jobs(fault_recover_file, ip_addr=None, secret='', node_port=51348, certfile=None, keyfile=None)

This function reads the information about jobs in fault_recover_file, retrieves a DispyJob instance (containing result, stdout, stderr, status, etc.) for each job that was scheduled for execution but unfinished at the time of the crash, and returns them as a list. If a job has finished executing by the time 'fault_recover_jobs' is called, the information about that job is deleted from both the node and fault_recover_file, so the results for finished jobs can't be retrieved more than once. However, if a job is still executing, the status field of its DispyJob is DispyJob.Running and the results for this job can be retrieved again (until that job finishes) by calling 'fault_recover_jobs'. Note that 'fault_recover_jobs' is a standalone function - it doesn't need a JobCluster or SharedJobCluster instance. In fact, 'fault_recover_jobs' must not be used while a cluster that uses the same recover file is running.
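For example, assuming the crashed client created its cluster with fault_recover='job_recovery.dat' (a hypothetical file name), the pending results could be collected with a sketch like:

import dispy

# 'job_recovery.dat' is a placeholder for the fault_recover file the
# crashed client was created with
jobs = dispy.fault_recover_jobs('job_recovery.dat')
for job in jobs:
    if job.status == dispy.DispyJob.Running:
        # still executing; its result can be retrieved again later
        print('job is still running on a node')
    else:
        print('recovered result: %s' % str(job.result))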

Note that dispy sends only the given computation and its dependencies to the nodes; the program itself is not transferred. So if the computation is a Python function, it must import all the modules it uses, even if the program imported those modules before the cluster was created.
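For example, even if the client program itself imports hashlib, a computation that uses it must import it again inside the function (a sketch; the function and its argument are only illustrative):

def checksum(path):
    # modules used by the computation must be imported here, inside the
    # function, because only the function (not the program) is sent to nodes
    import hashlib
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()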

Provisional/Intermediate Results

The dispy_provisional_result function can be used in computations (Python functions) to send provisional or intermediate results back to the client. For example, in optimization computations there may be many (sub)optimal results that the computations can report to the client. The client may ignore them, cancel the computations, or create additional computations based on a provisional result. When a computation calls dispy_provisional_result(result), the Python object result (which must be serializable) is sent back to the client and the computation continues to execute. The client should use the callback option to process this information, as shown in the example:

import random, dispy

def compute(n, threshold):
    import random, time, socket
    name = socket.gethostname()
    for i in xrange(0, n):
        r = random.uniform(0, 1)
        if r <= threshold:
            # possible result
            dispy_provisional_result((name, r))
        time.sleep(0.1)
    # final result
    return None

def job_callback(job):
    if job.status == dispy.DispyJob.ProvisionalResult:
        if job.result[1] < 0.005:
            # acceptable result; terminate jobs
            print '%s computed: %s' % (job.result[0], job.result[1])
            # global jobs, cluster
            for j in jobs:
                if j.status in [dispy.DispyJob.Created, dispy.DispyJob.Running,
                                dispy.DispyJob.ProvisionalResult]:
                    cluster.cancel(j)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, callback=job_callback)
    jobs = []
    for n in xrange(4):
        job = cluster.submit(random.randint(50,100), 0.2)
        if job is None:
            print 'creating job %s failed!' % n
            continue
        job.id = n
        jobs.append(job)
    cluster.wait()
    cluster.stats()
    cluster.close()
  

In the above example, computations send a provisional result if the computed number is <= the threshold (0.2). If the computed number is < 0.005, job_callback deems it acceptable and terminates the computations.

NAT/Firewall Forwarding

By default the dispy client uses UDP and TCP ports 51347, dispynode uses UDP and TCP ports 51348, and dispyscheduler uses UDP and TCP ports 51347 and TCP port 51348. If the client/node/scheduler is behind a NAT firewall/gateway, these ports must be forwarded appropriately and the 'ext_ip_addr' option must be used. For example, if the dispy client is behind a NAT firewall/gateway, JobCluster/SharedJobCluster must set 'ext_ip_addr' to the NAT firewall/gateway address and UDP and TCP ports 51347 must be forwarded to the IP address where the client is running. Similarly, if dispynode is behind a NAT firewall/gateway, the 'ext_ip_addr' option must be used.
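For example, if the dispy client is behind a gateway whose public address is 'a.b.c.d' (a placeholder), and UDP and TCP ports 51347 on the gateway are forwarded to the client machine, the cluster could be created along these lines:

# 'a.b.c.d' is a placeholder for the public address of the NAT firewall/gateway
cluster = dispy.JobCluster(compute, ext_ip_addr='a.b.c.d')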

Cloud Computing with Amazon EC2

The ext_ip_addr option can also be used to work with the Amazon EC2 cloud computing service. With EC2, a node has a private IP address (called 'Private DNS Address') on a private network of the form 10.x.x.x and a public address (called 'Public DNS Address') of the form ec2-x-x-x-x.x.amazonaws.com. After launching instance(s), one can copy the dispy files to the node(s) and run dispynode as dispynode.py --ext_ip_addr ec2-x-x-x-x.x.amazonaws.com (this address can't be used with the '-i'/'--ip_addr' option, as the network interface is configured with the private IP address only). Such a node can then be used by a dispy client outside the EC2 network by specifying ec2-x-x-x-x.x.amazonaws.com in the 'nodes' list (thus using EC2 servers to augment local processing units). Roughly, dispy uses 'ext_ip_addr' like NAT - it announces 'ext_ip_addr' to other services instead of the configured 'ip_addr', so that external services send requests to 'ext_ip_addr'; if the firewall/gateway forwards them appropriately, dispy processes them.
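Once dispynode is running on the EC2 instance as described, a client outside EC2 could use it with a sketch like the following (the amazonaws.com name is a placeholder for the instance's 'Public DNS Address'):

cluster = dispy.JobCluster(compute, nodes=['ec2-x-x-x-x.x.amazonaws.com'])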

dispy can also be used as a command line tool; in this case the computations should only be programs and dependencies should only be files.

dispy.py -f /some/file1 -f file2 -a "arg11 arg12" -a "arg21 arg22" -a "arg3" /some/program
will distribute '/some/program' along with its dependencies '/some/file1' and 'file2' and then execute '/some/program' in parallel three times: once with arguments arg11 and arg12 (two arguments to the program), once with arguments arg21 and arg22 (two arguments), and once with argument arg3 (one argument).