Multi Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2

By , June 24, 2014 8:24 pm

Recently I needed to run some experiments with the Montage astronomical image mosaic engine, using Pegasus as the workflow management system. This involves setting up a Condor cluster and Pegasus on the submit host, plus several other steps to run Montage in such an environment. After an extensive search on the Internet, I found no good documentation on how to accomplish this complicated task with my favorite Linux distribution, Ubuntu 12.04. I decided to write a tutorial on the topic, in the hope that it might save someone else’s time in the future.

This tutorial includes the following three parts:

Single Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2
Multi Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2
Running Montage with Pegasus on AWS EC2

In the previous tutorial we set up a single node Condor cluster with Pegasus. Now we will expand the Condor cluster to include multiple worker nodes. Keep the previous EC2 instance running; we will call this instance the Master Node. The other Condor nodes will receive tasks from the Master Node, so we will call them Worker Nodes.

In this tutorial, we will show how to add one Worker Node to the cluster.

[STEP 1: Updating Security Group Settings]

The Master Node and the Worker Node must be able to communicate with each other. The easiest way to achieve this is to run both nodes in the same VPC and use the same security group for both. Edit the inbound rules of the security group and add a rule that allows all traffic from within the security group.
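
If you prefer the command line to the EC2 console, the inbound rule described above could be added with the AWS CLI, roughly as sketched here. This assumes the AWS CLI is installed and configured with credentials; the function name and the group ID sg-0123abcd are placeholders of mine, not values from this setup.

```shell
# Hypothetical helper: allow all traffic between members of one security group.
# $1 is the security group ID shared by the Master Node and the Worker Node.
allow_intra_group_traffic() {
    group_id="$1"
    # IpProtocol=-1 means all protocols and ports; the source is the group itself.
    aws ec2 authorize-security-group-ingress \
        --group-id "$group_id" \
        --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$group_id}]"
}
```

A call would look like `allow_intra_group_traffic sg-0123abcd`. Because the rule's source is the group itself, any instance launched later into the same group is covered without further changes.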

[STEP 2: Install Condor]

Launch a new EC2 instance for the Worker Node and SSH to it. Then, as in the previous tutorial, download the latest version of HTCondor (native package) for Ubuntu 12.04 from the following URL. The file I downloaded is condor-8.1.6-247684-ubuntu_12.04_amd64.deb; the actual filename might change over time.

http://research.cs.wisc.edu/htcondor/downloads/

Install Condor using the following commands:

$ sudo dpkg -i condor-8.1.6-247684-ubuntu_12.04_amd64.deb
$ sudo apt-get update
$ sudo apt-get install -f
$ sudo apt-get install chkconfig
$ sudo chkconfig condor on
$ sudo service condor start

Now Condor should be up and running, and it should start automatically when the system boots. Check the status of Condor using the following commands:

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@ip-10-0-5-11 LINUX      X86_64 Unclaimed Benchmar  0.060 1862  0+00:00:04
slot2@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:05
slot3@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:06
slot4@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0

$ condor_q


-- Submitter: ip-10-0-5-114.ec2.internal :  : ip-10-0-5-114.ec2.internal
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

[STEP 3: Config Condor Master Node]

On the Master Node, use a text editor to open /etc/condor/condor_config and add the following line to the end of the file:

ALLOW_WRITE = *

Then restart Condor with the following command:

$ sudo service condor restart

Also, find the private IP address of the Master Node; you will need it to configure the Worker Node.
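
One way to look up the private IP address on EC2 is the instance metadata service. This is a minimal sketch, assuming the instance answers plain (IMDSv1-style) metadata requests; the function name is hypothetical, and this is not necessarily the command originally used here.

```shell
# Hypothetical helper: print this instance's private IPv4 address using the
# EC2 instance metadata service (plain IMDSv1-style request, an assumption).
master_private_ip() {
    curl -s http://169.254.169.254/latest/meta-data/local-ipv4
}
```

Run `master_private_ip` on the Master Node and note the result. If the metadata service is not reachable, `hostname -I` also lists the addresses assigned to the instance, and the EC2 console shows the private IP as well.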

[STEP 4: Config Condor Worker Node]

Now we configure the Worker Node. On the Worker Node, use a text editor to open /etc/condor/condor_config.local and find the following line:

CONDOR_HOST = $(FULL_HOSTNAME)

and update it with the IP address of the Master Node. Assuming the IP address of the Master Node is 192.168.1.1, the line should look like this:

CONDOR_HOST = 192.168.1.1

Then restart Condor using the following command:

$ sudo service condor restart
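
The manual edit above can also be scripted, which helps if you configure several Worker Nodes by hand. The sketch below assumes GNU sed (as shipped with Ubuntu) and that the file already contains a CONDOR_HOST line; the function name is hypothetical.

```shell
# Hypothetical helper: point CONDOR_HOST in a Condor config file at the Master Node.
# $1 is the config file path, $2 is the Master Node IP address.
set_condor_host() {
    config_file="$1"
    master_ip="$2"
    # Replace the whole CONDOR_HOST line, whatever its current value is.
    sed -i "s|^CONDOR_HOST[[:space:]]*=.*|CONDOR_HOST = $master_ip|" "$config_file"
}
```

In a root shell on the Worker Node, this would be called as `set_condor_host /etc/condor/condor_config.local 192.168.1.1`, followed by the restart command shown above. Lines other than CONDOR_HOST are left untouched.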

Now, running condor_status on either the Master Node or the Worker Node should show both nodes. In the following example, both the Master Node and the Worker Node are c3.xlarge instances. Each c3.xlarge instance has 4 vCPUs, so we see 8 slots in the cluster.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:04:36
slot2@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:05:05
slot3@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:05:06
slot4@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:05:07
slot1@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.040 1862  0+00:04:36
slot2@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:05:05
slot3@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:05:06
slot4@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:05:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     8     0       0         8       0          0        0

               Total     8     0       0         8       0          0        0
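
A new node can take a little while to show up, so it may be handy to check the slot count from a script instead of reading the table by eye. The sketch below uses the `-format` option of condor_status to print one slot name per line; the function names are hypothetical.

```shell
# Hypothetical helper: count the slots currently advertised to the pool.
# Prints one slot name per line with condor_status, then counts the lines.
count_slots() {
    condor_status -format '%s\n' Name | grep -c .
}

# Hypothetical helper: succeed only if at least $1 slots are visible.
cluster_has_slots() {
    expected="$1"
    [ "$(count_slots)" -ge "$expected" ]
}
```

For the two c3.xlarge instances in this example, `cluster_has_slots 8 && echo "both nodes joined"` would print the message once all slots have registered.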

[STEP 5: Add More Worker Nodes]

To add more Worker Nodes, you can create an AMI from the first Worker Node, then launch as many Worker Nodes as needed from that AMI. Since the AMI already contains the configuration described above, the new nodes should join the cluster automatically once they are running.
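
Creating the AMI and launching more Worker Nodes could also be done with the AWS CLI, roughly as follows. This is a sketch under assumptions: the AWS CLI is installed and configured, and the function names and all IDs are placeholders of mine.

```shell
# Hypothetical helper: create an AMI from the configured Worker Node.
# $1 is the Worker Node instance ID, $2 is a name for the new AMI.
# Prints the ID of the new image.
create_worker_ami() {
    aws ec2 create-image --instance-id "$1" --name "$2" \
        --query ImageId --output text
}

# Hypothetical helper: launch more Worker Nodes from that AMI.
# $1 AMI ID, $2 number of nodes, $3 security group ID, $4 subnet ID.
launch_workers() {
    aws ec2 run-instances --image-id "$1" --count "$2" \
        --instance-type c3.xlarge \
        --security-group-ids "$3" --subnet-id "$4"
}
```

Note that create-image reboots the instance by default, and the AMI must reach the available state before instances can be launched from it (`aws ec2 wait image-available --image-ids <ami-id>` blocks until then). Add `--key-name` to the run-instances call if you want SSH access to the new nodes. Use the same security group and VPC subnet as the Master Node so the new Worker Nodes can reach it.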

Single Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2

By , June 24, 2014 6:46 pm

[STEP 1: Create an EC2 Instance for the Master Host]

Create an EC2 instance with an Ubuntu 12.04 AMI. For testing purposes you might wish to use a spot instance to save money. Use one of the compute optimized (C3) instance types so that you will see multiple Condor slots on a single EC2 instance. In this document I use c3.xlarge, which has 4 vCPUs.

For a single node setup, all you need to do with the security group setting is open up port 22 to your IP address so that you can SSH to the instance when it is up.

When the instance is up and running, SSH to the instance.

$ ssh -i yourkey.pem ubuntu@ip_of_the_instance

[STEP 2: Install Condor]

Download the latest version of HTCondor (native package) for Ubuntu 12.04 from the following URL. The file I downloaded is condor-8.1.6-247684-ubuntu_12.04_amd64.deb; the actual filename might change over time.

http://research.cs.wisc.edu/htcondor/downloads/

Install Condor using the following commands:

$ sudo dpkg -i condor-8.1.6-247684-ubuntu_12.04_amd64.deb
$ sudo apt-get update
$ sudo apt-get install -f
$ sudo apt-get install chkconfig
$ sudo chkconfig condor on
$ sudo service condor start

Now Condor should be up and running, and it should start automatically when the system boots. Check the status of Condor using the following commands:

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@ip-10-0-5-11 LINUX      X86_64 Unclaimed Benchmar  0.060 1862  0+00:00:04
slot2@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:05
slot3@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:06
slot4@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0

$ condor_q


-- Submitter: ip-10-0-5-114.ec2.internal :  : ip-10-0-5-114.ec2.internal
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

[STEP 3: Install Pegasus]

Pegasus needs Java (1.6 or higher) and Python (2.4 or higher). Ubuntu 12.04 comes with Python 2.7 but not Java, so we need to install Java first. Pegasus can optionally use Globus for grid support; we will take care of Globus later.

$ sudo apt-get install openjdk-7-jdk

Then we configure the Pegasus repository and install Pegasus.

$ gpg --keyserver pgp.mit.edu --recv-keys 81C2A4AC
$ gpg -a --export 81C2A4AC | sudo apt-key add -  

Add the following line to /etc/apt/sources.list:

deb http://download.pegasus.isi.edu/wms/download/debian wheezy main
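
If you script this step, it is worth making it safe to run more than once so the repository line is not duplicated. Here is a minimal sketch; the function name is hypothetical, and it must run with enough privileges to write to the target file.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
# $1 is the file, $2 is the exact line to add.
add_line_once() {
    file="$1"
    line="$2"
    # -x matches whole lines, -F treats the line as a fixed string.
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

In a root shell this would be called as `add_line_once /etc/apt/sources.list "deb http://download.pegasus.isi.edu/wms/download/debian wheezy main"`. Running it a second time leaves the file unchanged.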

Update the repository and install Pegasus:

$ sudo apt-get update
$ sudo apt-get install pegasus

Now we should have Pegasus installed on the system. Check the installation with the following command. If you see similar output, congratulations!

$ pegasus-status
(no matching jobs found in Condor Q)

Pegasus comes with some examples; we will use these examples to test the installation further.

$ cd ~
$ cp -r /usr/share/pegasus/examples .
$ cd examples/hello-world
$ ls
dax-generator.py  hello.sh  pegasusrc  submit  world.sh

Run the hello-world example:

$ ./submit 
2014.06.24 10:34:00.455 UTC:   Submitting job(s). 
2014.06.24 10:34:00.460 UTC:   1 job(s) submitted to cluster 1. 
2014.06.24 10:34:00.465 UTC:    
2014.06.24 10:34:00.471 UTC:   ----------------------------------------------------------------------- 
2014.06.24 10:34:00.476 UTC:   File for submitting this DAG to Condor           : hello_world-0.dag.condor.sub 
2014.06.24 10:34:00.481 UTC:   Log of DAGMan debugging messages                 : hello_world-0.dag.dagman.out 
2014.06.24 10:34:00.487 UTC:   Log of Condor library output                     : hello_world-0.dag.lib.out 
2014.06.24 10:34:00.492 UTC:   Log of Condor library error messages             : hello_world-0.dag.lib.err 
2014.06.24 10:34:00.497 UTC:   Log of the life of condor_dagman itself          : hello_world-0.dag.dagman.log 
2014.06.24 10:34:00.503 UTC:    
2014.06.24 10:34:00.508 UTC:   ----------------------------------------------------------------------- 
2014.06.24 10:34:00.513 UTC:    
2014.06.24 10:34:00.519 UTC:   Your workflow has been started and is running in the base directory: 
2014.06.24 10:34:00.524 UTC:    
2014.06.24 10:34:00.530 UTC:     /home/ubuntu/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000 
2014.06.24 10:34:00.535 UTC:    
2014.06.24 10:34:00.540 UTC:   *** To monitor the workflow you can run *** 
2014.06.24 10:34:00.546 UTC:    
2014.06.24 10:34:00.551 UTC:     pegasus-status -l /home/ubuntu/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000 
2014.06.24 10:34:00.556 UTC:    
2014.06.24 10:34:00.562 UTC:   *** To remove your workflow run *** 
2014.06.24 10:34:00.567 UTC:    
2014.06.24 10:34:00.572 UTC:     pegasus-remove /home/ubuntu/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000 
2014.06.24 10:34:00.578 UTC:    
2014.06.24 10:34:01.024 UTC:   Time taken to execute is 1.109 seconds

Check the status of the Pegasus jobs and the Condor queue using the pegasus-status and condor_q commands:

$ pegasus-status
STAT  IN_STATE  JOB                                               
Run      01:05  hello_world-0                                     
Summary: 1 Condor job total (R:1)

$ condor_q


-- Submitter: ip-10-0-5-114.ec2.internal :  : ip-10-0-5-114.ec2.internal
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   ubuntu          6/24 10:34   0+00:01:28 R  0   0.0  pegasus-dagman -f 

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

After the hello-world workflow has finished, a trace file (jobstate.log) can be found in the working directory. Workflow-related information is buried several sub-directories deep; in my case the directory is ~/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000. Note that the last sub-directory is a timestamp that depends on when you submit the workflow.
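
Since the timestamped directory name changes with every submission, it can be easier to search for the trace file than to type the path. A minimal sketch, with a hypothetical function name and a deliberately loose file name pattern:

```shell
# Hypothetical helper: list trace logs below a Pegasus work directory.
# $1 is the work directory, for example ~/examples/hello-world/work.
find_trace_logs() {
    find "$1" -type f -name 'jobstat*.log'
}
```

Calling `find_trace_logs ~/examples/hello-world/work` prints one path per workflow run, so earlier runs show up alongside the latest one.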

As a bonus, I have prepared an AMI with the setup described above and made it publicly available to the community. If all you need is a single node Pegasus + Condor configuration, you don’t need to repeat any of the steps above; just launch an EC2 instance with AMI ami-5ee01b36 in the US-EAST-1 (N. Virginia) region. To run it in another region, copy the AMI to that region and launch an instance from the copy.
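
Copying the AMI to another region could be done with the AWS CLI, roughly as follows. This is a sketch, assuming the AWS CLI is installed and configured and that your account is permitted to copy the image; the function name, the destination region, and the AMI name in the example call are placeholders of mine.

```shell
# Hypothetical helper: copy the public AMI from us-east-1 into another region.
# $1 is the destination region, $2 is a name for the copied AMI.
copy_tutorial_ami() {
    aws ec2 copy-image --source-region us-east-1 \
        --source-image-id ami-5ee01b36 \
        --region "$1" --name "$2"
}
```

For example, `copy_tutorial_ami us-west-2 condor-pegasus-single-node` would request a copy in us-west-2. The command returns the ID of the new AMI, which is the one to launch in that region.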
