Category: English

Getting Started with AWS SDK for Java (4)

By , July 29, 2016 7:46 am

The following is an example of using the AWS SimpleDB service along with AWS KMS. Since SimpleDB does not natively integrate with KMS, we have to encrypt the data before storing it in SimpleDB, and decrypt the data after retrieving it from SimpleDB.


import java.nio.*;
import java.util.*;
import java.nio.charset.*;

import com.amazonaws.regions.*;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.simpledb.*;
import com.amazonaws.services.simpledb.model.*;
import com.amazonaws.services.kms.*;
import com.amazonaws.services.kms.model.*;


public class SDB
{

	public AmazonSimpleDBClient client;
	public AWSKMSClient kms;

	public String keyId = "arn:aws:kms:ap-southeast-2:[aws-account-id]:key/[aws-kms-key-very-long-id-here]";
	public static Charset charset = Charset.forName("ASCII");
	public static CharsetEncoder encoder = charset.newEncoder();
	public static CharsetDecoder decoder = charset.newDecoder();

	public SDB()
	{
		client = new AmazonSimpleDBClient();
		client.configureRegion(Regions.AP_SOUTHEAST_2);

		kms = new AWSKMSClient();
		kms.configureRegion(Regions.AP_SOUTHEAST_2);

	}


	public void createDomain(String domain)
	{
		try
		{
			CreateDomainRequest request = new CreateDomainRequest(domain);
			client.createDomain(request);
		} catch (Exception e)
		{
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
	}

	public void deleteAttribute(String domain, String item)
	{
		try
		{
			DeleteAttributesRequest request = new DeleteAttributesRequest(domain, item);
			client.deleteAttributes(request);
		} catch (Exception e)
		{
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
	}

	public void putAttribute(String domain, String item, String name, String value)
	{
		try
		{
			ReplaceableAttribute attribute = new ReplaceableAttribute(name, value, true);
			List<ReplaceableAttribute> list = new ArrayList<ReplaceableAttribute>();
			list.add(attribute);

			PutAttributesRequest request = new PutAttributesRequest(domain, item, list);
			client.putAttributes(request);

		} catch (Exception e)
		{
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
	}

	public String getAttribute(String domain, String item, String name)
	{
		String value = "Empty Result";
		try
		{
			GetAttributesRequest request = new GetAttributesRequest(domain, item);
			GetAttributesResult result = client.getAttributes(request);
			List<Attribute> list = result.getAttributes();
			for (Attribute attribute : list)
			{
				if (attribute.getName().equals(name))
				{
					return attribute.getValue();
				}
			}

		} catch (Exception e)
		{
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
		return value;
	}

	public String encrypt(String message)
	{
		String result = "Encryption Error.";
		try
		{
			ByteBuffer plainText = encoder.encode(CharBuffer.wrap(message));
			EncryptRequest req = new EncryptRequest().withKeyId(keyId).withPlaintext(plainText);
			ByteBuffer cipherText = kms.encrypt(req).getCiphertextBlob();
			byte[] bytes = new byte[cipherText.remaining()];
			cipherText.get(bytes);
			result =  Base64.getEncoder().encodeToString(bytes);

			System.out.println("\nEncryption:");
			System.out.println("Original Text: " + message);
			System.out.println("Encrypted Text: " + result);
		} catch (Exception e)
		{
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
		return result;
	}

	public String decrypt(String message)
	{
		String result = "Decryption Error.";
		try
		{
			byte[] encryptedBytes = Base64.getDecoder().decode(message);
			ByteBuffer ciphertextBlob = ByteBuffer.wrap(encryptedBytes);
			DecryptRequest req = new DecryptRequest().withCiphertextBlob(ciphertextBlob);
			ByteBuffer plainText = kms.decrypt(req).getPlaintext();
			result = decoder.decode(plainText).toString();

			System.out.println("\nDecryption:");
			System.out.println("Encrypted Text: " + message);
			System.out.println("Decrypted Text: " + result);
		} catch (Exception e)
		{
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
		return result;
	}

	public static void main(String[] args) 
	{
		String domainName = "demo-domain";    
		String itemName   = "demo-item";
		String attributeName    = "test-attribute";
		String attributeValue = "This is the information to be stored in SimpleDB.";

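		SDB test = new SDB();
		test.createDomain(domainName);	// CreateDomain is idempotent, so this is safe when the domain already exists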
		String value = test.encrypt(attributeValue);
		test.putAttribute(domainName, itemName, attributeName, value);

		try
		{
			Thread.sleep(3000);	// Sleep for some time to make sure we can get the result
		} catch (Exception e) {}

		value = test.getAttribute(domainName, itemName, attributeName);
		test.decrypt(value);
	}


}
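The Base64 step used in encrypt and decrypt above can be tried out without any AWS credentials. The following is only a minimal sketch: the class name Base64RoundTrip and its two helper methods are my own names for illustration, and the byte array stands in for a real KMS ciphertext blob.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip
{
	// Copy the remaining bytes of a buffer and encode them as Base64 text,
	// which is safe to store as a SimpleDB attribute value (a plain string).
	public static String toBase64(ByteBuffer buffer)
	{
		byte[] bytes = new byte[buffer.remaining()];
		buffer.duplicate().get(bytes);	// duplicate() leaves the caller's position untouched
		return Base64.getEncoder().encodeToString(bytes);
	}

	// Decode Base64 text back into a buffer that can be passed on as a ciphertext blob.
	public static ByteBuffer fromBase64(String text)
	{
		return ByteBuffer.wrap(Base64.getDecoder().decode(text));
	}

	public static void main(String[] args)
	{
		// Stand-in for a KMS ciphertext blob; real ciphertext is arbitrary binary data.
		ByteBuffer blob = ByteBuffer.wrap("not really ciphertext".getBytes(StandardCharsets.US_ASCII));
		String stored = toBase64(blob);
		ByteBuffer restored = fromBase64(stored);
		System.out.println("Stored form:   " + stored);
		System.out.println("Restored form: " + StandardCharsets.US_ASCII.decode(restored));
	}
}
```

The ciphertext returned by KMS is binary, so it has to be turned into text before it goes into a SimpleDB attribute; Base64 is one common choice for that.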

Spark vs Hadoop – Updated Version

By , October 12, 2014 9:42 am

Databricks recently posted a blog entry, “Spark Breaks Previous Large-Scale Sort Record”, claiming some breakthroughs in large scale sorting. The blog post said:

The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes. Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.

The blog post by Databricks did not list the full configuration of the computing nodes. I spent some time reading through their test environment, as well as the test environment used by the Yahoo team. Then I came up with the following table with detailed information about each test:

Item | Hadoop | Spark

Single Node Configuration
System | Dell R720xd | AWS EC2 i2.8xlarge
CPU | Intel Xeon E5-2630 | Intel Xeon E5-2670 v2
Total CPU Cores | 12 (2 physical CPUs) | 32 (vCPU)
Memory | 64 GB | 244 GB
Storage | 12 x 3 TB SATA | 8 x 800 GB SSD
Single Disk Random Read IOPS (4 KB blocks) | ~50 (measured) | N/A
RAID0 Random Read IOPS (4 KB blocks) | 600 (max) | 365,000 (minimum)
Single Disk Random Write IOPS (4 KB blocks) | ~110 (measured) | N/A
RAID0 Random Write IOPS (4 KB blocks) | 1,320 (max) | 315,000 (minimum)
Single Disk Sequential Read Throughput (128 KB blocks) | 120 MB/s (measured) | 400 MB/s (measured)
Single Disk Sequential Write Throughput (128 KB blocks) | 120 MB/s (measured) | 400 MB/s (measured)
RAID0 Sequential Read Throughput (128 KB blocks) | 1,440 MB/s (estimated max) | 3,200 MB/s (measured)
RAID0 Sequential Write Throughput (128 KB blocks) | 1,440 MB/s (estimated max) | 2,200 MB/s (measured)
Networking | 10 Gbps | 10 Gbps

Cluster Configuration
Number of Nodes | 2,100 | 206
Number of CPU Cores | 25,200 | 6,592
Total Memory | 134,400 GB | 50,264 GB
Total Random Read IOPS (4 KB blocks) | 1,260,000 (max) | 75,190,000 (minimum)
Total Random Write IOPS (4 KB blocks) | 2,772,000 (max) | 64,890,000 (minimum)
Total Sequential Read Throughput (128 KB blocks) | 3,024,000 MB/s (estimated max) | 659,200 MB/s (estimated)
Total Sequential Write Throughput (128 KB blocks) | 3,024,000 MB/s (estimated max) | 453,200 MB/s (estimated)

100 TB Sorting Results
Time | 72 minutes | 23 minutes

It should be noted that the reference performance data for a 3 TB SATA drive is that of the Seagate Barracuda XT 3TB. For the 800 GB SSD drives, the IOPS figures are taken from the AWS documentation for the i2.8xlarge instance, while the single disk and RAID0 throughput figures are the result of a benchmark run in my own AWS account. (After reading the comments posted by rxin, I realized that I had mis-estimated the sequential IO performance of the i2.8xlarge instance, so I did some testing to get new data and updated this post.)

As we all know, large scale sorting is a typical IO intensive application. Because we do not have enough memory to hold all the data to be sorted, the sorting has to be done in multiple batches. In each batch, data is read from disk for processing, and the result is written back to disk. If we observe the CPU utilization during the sorting process, we will see that most of the time the CPU is waiting for IO. In the above-mentioned configurations, the Spark cluster has a great advantage over the Hadoop cluster in terms of random read (IOPS, 60X) and random write (IOPS, 23X). However, the Spark cluster falls well short of the Hadoop cluster in terms of sequential read (throughput, 0.22X) and sequential write (throughput, 0.15X). The performance of large scale sorting depends more on sequential IO than on random IO. Although the Spark cluster uses SSD disks and the Hadoop cluster uses SATA disks, the Spark cluster is not in an advantageous position in terms of storage. (I made a mistake in my original post by using the SSD performance data for the Intel SSD 910 Series, resulting in a significant over-estimation of the sequential IO performance of the AWS i2.8xlarge instance. This mistake has been corrected in this updated version.)
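For readers who want to check the arithmetic, the cluster-level ratios can be recomputed from the per-node figures in the table. This is only a small sketch: the class and method names are mine, and it assumes (as the table does) that cluster totals are simply the per-node RAID0 figures multiplied by the number of nodes.

```java
public class ClusterIoRatios
{
	static final int HADOOP_NODES = 2100;
	static final int SPARK_NODES = 206;

	// Cluster total = per-node figure multiplied by the number of nodes.
	public static long clusterTotal(long perNode, int nodes)
	{
		return perNode * nodes;
	}

	// Ratio of the Spark cluster total to the Hadoop cluster total.
	public static double sparkToHadoop(long sparkPerNode, long hadoopPerNode)
	{
		return (double) clusterTotal(sparkPerNode, SPARK_NODES)
			/ clusterTotal(hadoopPerNode, HADOOP_NODES);
	}

	public static void main(String[] args)
	{
		// Per-node RAID0 figures taken from the table: (Spark, Hadoop).
		System.out.printf("Random read IOPS:      %.2fX%n", sparkToHadoop(365_000, 600));
		System.out.printf("Random write IOPS:     %.2fX%n", sparkToHadoop(315_000, 1_320));
		System.out.printf("Sequential read MB/s:  %.2fX%n", sparkToHadoop(3_200, 1_440));
		System.out.printf("Sequential write MB/s: %.2fX%n", sparkToHadoop(2_200, 1_440));
	}
}
```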

In terms of CPU cores, the Spark cluster has about 1/4 of the CPU cores of the Hadoop cluster. However, the Spark cluster uses the Intel E5-2670 v2 and the Hadoop cluster uses the Intel E5-2630. I got the following CPU benchmark data from www.cpubenchmark.net. The per-core performance of the CPUs used in the two clusters is comparable. (In large scale sorting, the pressure on CPU is not as heavy as the pressure on IO.)

CPU Comparison
Item | Hadoop | Spark
CPU | Intel Xeon E5-2630 | Intel Xeon E5-2670 v2
Passmark | 14033 | 22134
Number of Cores per CPU | 6 | 10
Passmark per Core | 2338 | 2213
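The per-core figures in the table are just the Passmark score divided by the number of cores, rounded down. A tiny sketch of that calculation (the class and method names are mine, for illustration only):

```java
public class PassmarkPerCore
{
	// Whole-number Passmark score per core (integer division rounds down).
	public static int perCore(int passmark, int cores)
	{
		return passmark / cores;
	}

	public static void main(String[] args)
	{
		int hadoop = perCore(14033, 6);		// Intel Xeon E5-2630
		int spark = perCore(22134, 10);		// Intel Xeon E5-2670 v2
		System.out.println("Hadoop per core: " + hadoop);
		System.out.println("Spark per core:  " + spark);
		System.out.printf("Spark / Hadoop per core: %.2f%n", (double) spark / hadoop);
	}
}
```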

In terms of memory, the Spark cluster has about 1/3 of the memory of the Hadoop cluster.

So, after revisiting this topic, I tend to believe that Spark does perform much better than Hadoop. (I apologize to the Databricks team for not carrying out a careful analysis when I posted the original message. Thank you rxin for pointing out my mistakes.)

Pegasus / Montage workflow on Amazon Web Services

By , September 16, 2014 8:06 pm

I took some notes while going through Mats Rynge’s tutorial on “Pegasus / Montage workflow on Amazon Web Services”. The tutorial is officially available at the following URL, but the content is far from complete. I managed to finish this tutorial, and thought that my experience might be valuable for someone else out there in the dark.

https://confluence.pegasus.isi.edu/display/pegasus/2013+-+Montage+workflow+using+Amazon+Web+Services

Step 1. Launch an EC2 instance

In the Oregon region, launch an EC2 instance with AMI ami-f4e47cc4. It is recommended that you use the same security group for all EC2 instances to be launched in the Condor cluster. Also, in the security group, allow all communication between EC2 instances that use the same security group.

SSH into the instance and set up key-based login with the following commands:

ssh -i keypair.pem montage@IP
ssh-keygen
cd .ssh
cat id_rsa.pub >> authorized_keys
cd ~

Step 2. Configure Condor

Edit /etc/condor/config.d/20_security.conf and update ALLOW_WRITE and ALLOW_READ to the IP range of your VPC. For example, if the IP range of your VPC is 172.31.0.0/16, then you can set “ALLOW_WRITE = 172.31.*” and “ALLOW_READ = 172.31.*”. Then you need to restart Condor with the following command:

sudo service condor restart
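If you are unsure how to turn your VPC range into the wildcard form used in Step 2, the following sketch shows the idea. It is only an illustration under my own assumptions: the class and method names are made up, and it handles only IPv4 ranges whose prefix length is 8, 16 or 24, because only those map cleanly onto whole octets.

```java
public class CondorAllowPattern
{
	// Convert an IPv4 CIDR block such as "172.31.0.0/16" into a wildcard
	// pattern such as "172.31.*". Only prefixes of 8, 16 or 24 are supported.
	public static String fromCidr(String cidr)
	{
		String[] parts = cidr.trim().split("/");
		if (parts.length != 2)
		{
			throw new IllegalArgumentException("Expected a.b.c.d/prefix but got: " + cidr);
		}
		String[] octets = parts[0].split("\\.");
		int prefix = Integer.parseInt(parts[1]);
		if (octets.length != 4 || prefix < 8 || prefix > 24 || prefix % 8 != 0)
		{
			throw new IllegalArgumentException("Unsupported range: " + cidr);
		}
		StringBuilder pattern = new StringBuilder();
		for (int i = 0; i < prefix / 8; i++)
		{
			pattern.append(octets[i]).append('.');	// keep the fixed octets
		}
		return pattern.append('*').toString();		// wildcard for the rest
	}

	public static void main(String[] args)
	{
		String pattern = fromCidr("172.31.0.0/16");
		System.out.println("ALLOW_WRITE = " + pattern);
		System.out.println("ALLOW_READ = " + pattern);
	}
}
```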

Step 3. Create a Montage workflow

mkdir workflow
cd workflow
mDAG 2mass j M17 0.5 0.5 0.0002777778 . file://$PWD file://$PWD/inputs
generate-montage-replica-catalog 

Step 4. Pegasus Related

cd ~/etc
cp ../workflow/replica.catalog .
cp ../workflow/dag.xml .

Step 5. Update site.xml with the following content

<?xml version="1.0" encoding="UTF-8"?>
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-4.0.xsd" version="4.0">
    <site  handle="local" arch="x86" os="LINUX">
        <directory type="shared-scratch" path="/home/montage/scratch">
            <file-server operation="all" url="file:///home/montage/scratch"/>
        </directory>
        <directory type="local-storage" path="/var/www/html/outputs">
            <file-server operation="all" url="file:///var/www/html/outputs"/>
        </directory>
        <profile namespace="env" key="SSH_PRIVATE_KEY">/home/montage/.ssh/id_rsa</profile>
    </site>
    <site  handle="condor_pool" arch="x86_64" os="LINUX">
        <directory type="shared-scratch" path="/home/montage/scratch">
            <file-server operation="all" url="scp://127.0.0.1/home/montage/scratch"/>
        </directory>
        <profile namespace="pegasus" key="style" >condor</profile>
        <profile namespace="condor" key="universe" >vanilla</profile>
        <profile namespace="env" key="MONTAGE_HOME" >/opt/montage/v3.3</profile>
    </site>
</sitecatalog>

Step 6. Plan the workflow

pegasus-plan --conf pegasus.conf --dax dag.xml

Step 7. Run the workflow

pegasus-run  /home/montage/etc/montage/pegasus/montage/run0001

Step 8. Monitor the workflow

pegasus-status -l /home/montage/etc/montage/pegasus/montage/run0001
condor_status
condor_q
tail montage/pegasus/montage/run0001/jobstate.log

The last command is very nice in that you can see what is currently being run. Please note that you need to replace the path with the actual path given to you by Pegasus.

That’s all. After spending weeks searching over the Internet, everything now seems to be simple.

Virtualization, Cloud Computing, Open Source and Beyond…

By , October 15, 2012 10:12 pm

A. Virtualization

(Figure omitted. It was found through a Google search and originally came from VMWare.)

Virtualization refers to the practice of simulating multiple virtual machines on a single physical computer. Logically, each virtual machine has its own virtualized CPU, memory, storage, and networking. Through virtualization the underlying hardware resources can be utilized more efficiently: applications can run on the same physical computer, each in its own runtime environment that is isolated from the others.

There exist different levels of virtualization, for example hardware virtualization and software virtualization. Hardware virtualization means providing a virtual computer by simulating the underlying hardware, and the virtual computer is capable of running a full copy of an operating system. Within hardware virtualization there are different implementations, such as full virtualization (simulating a full set of the underlying hardware, such that most operating systems can run on top of the virtual machine without modification), partial virtualization (simulating only some key hardware components, so operating systems might need modifications to run in such an environment) and paravirtualization (not simulating the underlying hardware, but rather sharing it through a virtual machine manager, so most operating systems need modifications to run in such an environment). Virtualization at the software level usually refers to the practice of providing multiple isolated runtime environments on top of a single operating system instance, and it is often called container technology.

With hardware virtualization, most modern virtualization technologies (such as VMWare, Xen and KVM) are a combination of full virtualization and paravirtualization. Virtual machines provided by hardware virtualization technologies usually run a full copy of an operating system, therefore there exist a large number of similar (or even identical) processes and memory pages on the same host machine. Currently memory pages with the same content can be consolidated by techniques such as KSM, but there is so far no good method to handle similar (or even identical) processes. Therefore hardware virtualization is usually referred to as heavy-weight virtualization, and the number of virtual machines that can run on a single host machine is relatively limited.

With software virtualization, the overhead of running multiple operating system instances does not exist. Therefore software virtualization is usually referred to as light-weight virtualization, and the number of virtual runtime environments that can be present on a single host machine is relatively large. For example, in theory Solaris can support 8000 containers on a single operating system instance (the actual number of supported containers is limited by hardware resources and system work load). Similarly, LXC on Linux can easily provide a large number of virtualized runtime environments.

In terms of virtualization technologies, most companies in China seem to focus more on hardware virtualization, and deploy hardware virtualization in development and production environments. Taobao (a subsidiary of Alibaba Inc) is one of the first to study and deploy software virtualization in a production environment. Their experience showed that replacing Xen with cgroup could result in better resource utilization.

For a specific application scenario, the decision between hardware virtualization and software virtualization should rely on whether the end user needs control over the operating system (such as kernel upgrades). If the end user only needs control over the runtime environment (as with various App Engine services), software virtualization might be a better choice.

For those who want to know more about virtualization technologies, the VMWare white paper “Understanding Full Virtualization, Paravirtualization, and Hardware Assist” is a great reference.

Generally speaking, the number of users that can access virtualization technology directly is very small. On the Linux operating system, the user with virtual machine life cycle privileges is usually the user with libvirt access. In a company or other entity, these users are usually system administrators.

B. Virtualization Management

In the early days, virtualization technologies solved the problem of providing multiple isolated runtime environments on a single physical computer. When the number of physical computers is small, system administrators can manually log in to different servers to carry out the virtual machine life cycle management tasks. When the number of physical computers becomes large, some kind of script or application is needed to increase the degree of automation and relieve system administrators of tedious work. Applications that enable system administrators to manage multiple physical and virtual computers from a single location are called virtualization management tools. Such tools can usually accomplish the following tasks: (1) manage the life cycles of multiple virtual machines on multiple physical computers; (2) query and monitor all physical and virtual computers; and (3) establish a mapping between the names of virtual machines and the actual virtual machine instances on different computers, such that virtual machine identification and management becomes easier. On the Linux operating system, VirtManager is a simple virtualization management tool. Among the VMWare product family, VMWare vSphere is a powerful virtualization management tool.

Virtualization management tools are direct extensions of virtualization technology. The purpose of a simple virtualization management tool is to rescue system administrators from the tedious, repetitive work induced by an increasing number of physical and virtual machines. At this level, the scope of a virtualization management tool is usually limited to a cluster. In many cases, the virtualization management tool needs to have the user name and password to access different physical computers in order to perform virtual machine life cycle management. To make the management work easier, the system administrator might need to set up a common management user for all physical computers in the cluster.

Virtualization management tools provide convenience for the system administrators, but do not delegate virtual machine life cycle management rights to other users.

C. Data Center Virtualization

In a data center, system administrators need to look after a large amount of different hardware and applications. Compared to a small cluster, a data center is significantly more complex. A simple virtualization management tool is no longer capable of satisfying the needs of system administrators. Therefore people developed data center virtualization management software to meet these new challenges. On the hardware layer, data center virtualization management software created the concept of “resource pools” to reorganize hardware resources, where a pool is usually a group of servers with similar configuration and purpose. Computing resources are now exposed to the end user in the form of virtual infrastructure, rather than separate servers. On the software layer, data center virtualization software created different roles for system administrators and regular end users, or more fine-grained role based access control (RBAC) based on the needs of a specific application scenario. System administrators have the right to manage all the physical servers and virtual machines, but usually do not interfere with virtual machines that are running normally. Regular end users can only carry out virtual machine life cycle management tasks within the resource pools that are assigned to them, and do not have the right to manage the physical servers. In the extreme case, regular end users can only see the resource pools that are assigned to them, without any knowledge of the details of those resource pools.

Before data center virtualization technology, the creation and management of virtual machines was usually carried out by system administrators. In data center virtualization software, the virtual machine life cycle management rights are delegated through RBAC to so-called “regular users”, which relieves the pressure on system administrators (to some degree). However, for security considerations not all employees in a company can have such a “regular user” account, which is usually assigned to managers or team leads. It is safe to assume that in data center virtualization the life cycle of virtual machines is still managed centrally.
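To make the role separation above a bit more concrete, here is a minimal sketch of how such a check could look. It does not reflect the API of any real product: the class ResourcePoolAccess, the two roles and the pool names are all hypothetical, chosen only to mirror the rules described in this section.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ResourcePoolAccess
{
	// Hypothetical roles mirroring the two kinds of users described above.
	public enum Role { SYSTEM_ADMINISTRATOR, REGULAR_USER }

	private final Map<String, Role> roles = new HashMap<>();
	private final Map<String, Set<String>> assignedPools = new HashMap<>();

	// Register a user with a role and the resource pools assigned to that user.
	public void addUser(String user, Role role, String... pools)
	{
		roles.put(user, role);
		assignedPools.put(user, new HashSet<>(Arrays.asList(pools)));
	}

	// Administrators may manage virtual machines anywhere; regular users
	// only inside the resource pools assigned to them.
	public boolean canManageVirtualMachines(String user, String pool)
	{
		Role role = roles.get(user);
		if (role == null)
		{
			return false;	// unknown users get no rights at all
		}
		if (role == Role.SYSTEM_ADMINISTRATOR)
		{
			return true;
		}
		return assignedPools.get(user).contains(pool);
	}

	// Only administrators may manage the physical servers.
	public boolean canManagePhysicalServers(String user)
	{
		return roles.get(user) == Role.SYSTEM_ADMINISTRATOR;
	}

	public static void main(String[] args)
	{
		ResourcePoolAccess access = new ResourcePoolAccess();
		access.addUser("admin", Role.SYSTEM_ADMINISTRATOR);
		access.addUser("team-lead", Role.REGULAR_USER, "web-pool");

		System.out.println(access.canManageVirtualMachines("team-lead", "web-pool"));
		System.out.println(access.canManageVirtualMachines("team-lead", "db-pool"));
		System.out.println(access.canManagePhysicalServers("team-lead"));
	}
}
```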

Data center virtualization management software is a further extension of virtualization management tools. It solved the problem of system complexity which is introduced by the increasing number of hardware devices and applications. When specific physical hardware is presented in the form of an abstracted “resource pool”, managers only need to worry about the size, work load, and health status of the various resource pools, while end users only need to know about the status of the resource pool that is assigned to them. Only system administrators need to know by heart the configuration, work load and status of each and every physical server. However, with the concept of resource pools, all physical devices can be reorganized in a relatively logical way, which makes the life of system administrators easier.

Modern data center virtualization management software usually provides a lot of IT ops automation functionality. This includes (1) fast deployment of a number of identical or similar runtime environments based on virtual machine templates, (2) monitoring, reporting, notification, and accounting, and (3) high availability, dynamic workload management, backup and recovery. Some data center virtualization management software even provides open APIs that allow system administrators to develop and integrate additional functionality based on the actual application scenarios.

Among the VMWare product family, VMWare vCenter is a powerful data center virtualization management software package. Other good data center virtualization management software includes Convirt, XenServer, Oracle VM and OpenQRM.

D. Cloud Computing

Cloud computing is a further abstraction of data center virtualization. In cloud computing management software, we still have different roles such as cloud managers and regular users, and have different access rights associated with different roles. Cloud managers have the right to manage all the physical servers and virtual machines, but usually do not interfere with virtual machines that are running normally. Regular users can carry out virtual machine life cycle management tasks through a web browser, or through computer programs that talk to the cloud via web services.

In cloud computing, virtual machine life cycle management rights are fully delegated to regular users. At the same time, cloud computing hides the concepts of resource pools and physical servers from regular users. Regular users are capable of obtaining computing resources without needing to know about the underlying physical infrastructure. It may seem that cloud computing is simply a way of providing computing resources remotely, similar to Amazon EC2/S3. In fact, cloud computing represents a change in computing resource management: end users no longer need the help of system administrators to obtain and manage computing resources.

For cloud managers, delegating virtual machine life cycle management rights to regular users does not save them from being grilled over the fire. Rather, now they have more trouble to handle. In traditional IT infrastructure, each application has its own computing resources, and troubleshooting is relatively easy because physical isolation exists between applications. When upgrading to cloud computing, multiple applications might share the same underlying physical infrastructure, and troubleshooting becomes difficult when multiple applications compete for resources. Therefore, cloud managers usually expect a full set of data center virtualization management functionality in cloud computing management software. For cloud managers, critical functionality includes (1) monitoring, reporting, notification, and accounting, (2) high availability, dynamic workload management, backup and recovery, and (3) live migration, which can be used in troubleshooting or local maintenance.

We can see that from virtualization to cloud computing, the degree of encapsulation of physical resources increases, while virtual machine life cycle management rights are gradually delegated.

Among the VMWare product family, VMWare vCloud is a cloud computing management software package. Other cloud computing management software includes OpenStack, OpenNebula, Eucalyptus and CloudStack. Although OpenStack, OpenNebula, Eucalyptus and CloudStack are all cloud computing management software, they differ significantly in functionality, which can be traced to the difference in their design. Originally OpenNebula and CloudStack were designed to be data center virtualization management software, therefore they have a good set of data center virtualization management functionality. When the concept of cloud computing became popular, OpenNebula added OCCI and Amazon EC2 support, while CloudStack provided an additional Amazon EC2 compatible module called CloudBridge (CloudBridge has been integrated into CloudStack since version 4.0). On the contrary, Eucalyptus and OpenStack were designed to be Amazon EC2 compatible cloud computing management software, and they are not yet that capable in terms of data center virtualization management functionality. Between Eucalyptus and OpenStack, Eucalyptus has some first-mover advantage, since its developers have realized the importance of data center virtualization management functionality based on feedback from the market.

E. Private Cloud and Public Cloud

The so-called “cloud computing” as described in section D is only a narrow definition, namely Amazon EC2 like cloud computing. Broader definitions of cloud computing usually refer to the various practices of obtaining and utilizing computing resources (such as compute and storage) remotely, which includes both data center virtualization as described in section C and cloud computing as described in section D. In both cases, computing resources are provided to the end user in the form of virtual machines, and the end user does not need to have any knowledge of the underlying physical infrastructure. If the scope of a cloud platform is to provide service within the corporation, then it can be called a “private cloud”. If the scope of a cloud platform is to provide service to the public, then it can be called a “public cloud”. Generally speaking, a private cloud emphasizes the ability to create virtual machines with different configurations (such as the number of vCPUs, memory and storage), because it needs to satisfy the needs of different applications within the enterprise. On the contrary, public cloud service providers do not have much knowledge about the applications running on top of their platforms, therefore they tend to provide standardized virtual machine products with fixed configurations, and end users can only purchase virtual machines with these fixed configurations.

For public cloud service providers, the business model is similar to Amazon EC2. Therefore, most of them will choose to use cloud computing management software as described in section D. For private cloud service providers, the decision should be made according to the computing resource management model within the enterprise. If the enterprise still wishes to manage computing resources centrally, and delegate virtual machine life cycle management rights only to managers and team leaders, data center virtualization management software as described in section C is more appropriate. However, if the enterprise wishes to delegate virtual machine life cycle management rights to the end user, then cloud computing management software as described in section D is more appropriate.

Traditionally, people think that a private cloud should be built upon hardware owned by the enterprise and inside a data center managed by the enterprise. However, as hardware vendors join the game, the border between private cloud and public cloud is becoming blurred. Recently Rackspace announced private cloud services where customers can choose between their own hardware and data center, or hardware and a data center owned by Rackspace. Oracle also announced private cloud services that are owned by Oracle and managed by Oracle. With such a new business model, a private cloud for a particular customer might be just an isolated resource pool for a public cloud service provider (you got private cloud in my public cloud). For the public cloud service provider, its public cloud service infrastructure might in turn be part of its own bigger infrastructure (private cloud), or even a resource pool from a hardware vendor’s infrastructure (you got public cloud in my private cloud).

For customers it is financially reasonable to use a private cloud provided by a cloud service provider. This means the CapEx needed for data center construction and hardware purchases can be converted into OpEx, while the precious cash can be used to cultivate more business opportunities. Even if in the long term the total cost of working with this kind of private cloud is higher than that of alternatives based on a self-owned data center and hardware, the return from new business might be greater than the cost difference between the two options. In the extreme case, even if the company is not successful in the end, company owners don’t need to look at a large amount of newly purchased hardware and cry. Unless the real estate market grows rapidly in the short term, a failing company usually won’t feel sorry for not building its own data center. (Ahh, I should mention that for a company that has been running long enough, it is still feasible to earn money through real estate. For example, before Sun Microsystems Inc was acquired by Oracle, it did successfully make one of its financial reports look much better by selling one of its major engineering campuses.)

Then, what is the role of hardware vendors in this game? When the customer’s CapEX becomes OPEX, wouldn’t it take more time for hardware vendors to collect payment?

In 1865 William Jevons (1835-1882), a British economist, wrote a book entitled “The Coal Question”, in which he presented data on the depletion of coal reserves yet, seemingly paradoxically, an increase in the consumption of coal in England throughout most of the 19th century. He theorized that significant improvements in the efficiency of the steam engine had increased the utility of energy from coal and, in effect, lowered the price of energy, thereby increasing consumption. This is known as the Jevons paradox, the principle that as technological progress increases the efficiency of resource utilization, consumption of that resource will increase. During the past 150 years, similar over-consumption was observed in many other areas such as major industrial materials, transportation, energy, and the food industry.

The core value of public cloud services is that fixed assets (such as servers, networking equipment, and storage), which end users previously had to purchase with a huge budget, now become public resources that are charged by usage. Virtualization technologies improve the efficiency and, in effect, lower the price of computing resources, which will eventually increase the consumption of computing resources. Once we understand this logic, we can understand why HP launched HP Cloud Services in a hurry on top of OpenStack while OpenStack was still immature for commercial deployment. It is true that HP Cloud Services might not be able to save HP in the next round of competition, but HP will certainly lose if it does not even join the competition. Similarly, we can understand why Oracle has now become a cloud computing evangelist, while it was sneering at cloud computing two years ago. When Oracle acquired Sun Microsystems Inc in 2009, it suddenly became one of the major players in the hardware market. At that time the concept of cloud computing was relatively new, and Oracle’s response to cloud computing showed that it had not yet become familiar with its new role. Now that cloud computing is a lot more than just a new concept, it would be very silly if Oracle, as one of the major hardware vendors, did not pursue its share in the game.

According to the Jevons paradox, over-consumption is a result of a decrease in price. Then, how should cloud computing resources be priced?

Currently, most public cloud service providers set price tags according to the configuration of the virtual machines. Take Amazon EC2 for example: its Medium virtual machine (3.75 GB memory, 2 ECUs, 410 GB storage, $0.16 per hour) is twice as large, and twice as expensive, as its Small virtual machine (1.7 GB memory, 1 ECU, 160 GB storage, $0.08 per hour). Newcomers to the competition, such as HP Cloud Services, Grand Cloud (in China), and Aliyun (in China), seem to be copying Amazon EC2’s pricing strategy. The problem is that when the size of the virtual machine gets larger (with more computing resources such as vCPU, memory and storage), the performance of the virtual machine does not increase in the same proportion. A number of performance tests on Amazon EC2, HP Cloud Services, Grand Cloud, and Aliyun suggested that, for a wide range of applications, the performance-to-price ratio of virtual machines actually decreases as the size of the virtual machine increases. It is safe to say that such a pricing strategy will not encourage users to use more computing resources.

It might be more reasonable to determine the price of virtual machines according to their performance. For example, a soap manufacturer sells its products in two different packages: the smaller package has one piece and the bigger package has two pieces. Customers are willing to buy the bigger package not because it looks bigger, but because it can do twice the work of the smaller package. Similarly, virtual machine products from the same public cloud service provider should maintain a similar performance-to-price ratio. The problem is that different applications have different requirements for processor, memory and storage resources, which results in significant differences in the performance-configuration curve. Therefore, the public cloud needs a comprehensive virtual machine performance evaluation suite, which can be used to evaluate the overall performance of a virtual machine rather than just one of its components such as processor, memory or storage. Based on such a comprehensive benchmark framework, we can compare not only virtual machine products from one public cloud service provider, but also virtual machine products across different public cloud service providers.
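The performance-to-price comparison described above can be sketched in a few lines of Java. This is only an illustration: the nominal ECU and price figures are the ones quoted earlier for the Amazon EC2 Small and Medium instances, while the class name, the method name, and the two benchmark scores are hypothetical placeholders that stand in for the output of a real benchmark suite.

```java
import java.util.Locale;

// Sketch: compare virtual machine sizes by performance bought per dollar.
class PerfPerDollar {

    // Units of performance (nominal ECUs or a benchmark score) per dollar per hour.
    static double perDollar(double performance, double pricePerHour) {
        if (pricePerHour <= 0) {
            throw new IllegalArgumentException("pricePerHour must be positive");
        }
        return performance / pricePerHour;
    }

    public static void main(String[] args) {
        // Nominal figures quoted in the text:
        // Small = 1 ECU at $0.08/hour, Medium = 2 ECUs at $0.16/hour.
        double smallNominal = perDollar(1.0, 0.08);
        double mediumNominal = perDollar(2.0, 0.16);

        // HYPOTHETICAL benchmark scores, NOT measurements. They only
        // illustrate performance growing more slowly than price.
        double smallScore = 100.0;
        double mediumScore = 170.0;
        double smallMeasured = perDollar(smallScore, 0.08);
        double mediumMeasured = perDollar(mediumScore, 0.16);

        System.out.printf(Locale.ROOT, "nominal ECU per dollar: small=%.2f medium=%.2f%n",
                smallNominal, mediumNominal);
        System.out.printf(Locale.ROOT, "score per dollar:       small=%.2f medium=%.2f%n",
                smallMeasured, mediumMeasured);
    }
}
```

On paper the two sizes offer the same number of ECUs per dollar. The ratio only drops for the larger size when a measured score, rather than the nominal configuration, is used as the numerator, which is why a comprehensive benchmark is needed before prices can follow performance.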

F. Open Source

In recent years, we have been observing a rule in the information industry. When a proprietary solution becomes successful in the market, one or more followers, either open source or proprietary, quickly appear with similar functionality or services. (The opposite case, where open source solutions come before proprietary followers, is rare.) In operating systems, Linux has become as good as, and even better than, Unix, and has overtaken the market share of Unix. In virtualization, Xen and KVM have become comparable to VMware solutions, and are nibbling at VMware’s market share. In cloud computing, the proprietary solution Enomaly appeared after Amazon EC2, followed by the open source Eucalyptus and OpenStack. At the same time, traditionally proprietary vendors are showing a friendlier attitude towards open source projects and the open source community. For example, Microsoft established a subsidiary called Microsoft Open Technologies in April, with the goal of promoting investments in interoperability, open standards, and open source software.

The business environment today is a lot different from the 1980s, when the Free Software Movement was started. In fact, since the term “open source” was adopted in 1998, following Netscape’s decision to release its browser source code, to differentiate the approach from free software, open source has become a new business model for software R&D, marketing, and sales, rather than the opposite of proprietary software. Compared to the traditional proprietary business model, the open source business model exhibits the following characteristics:

(1) In the initial phase, use buzz words such as open source and free software to gain the attention of potential customers and business partners. For potential customers, the interest is the possibility of getting (part of) the functionality of the competing proprietary software, either free or at a relatively low price. For business partners, the interest might be that they can sell an enhanced version of the open source software (such as an enterprise version), provide solutions based on the open source software, or use the open source software to promote the sales of their own products.

(2) In the growth phase, major R&D resources usually come from the founding members (businesses) that initiated the project and their business partners. It is true that there are independent contributors who contribute code out of personal interest; however, the number of such individual contributors is relatively small. People promoting open source software frequently use the phrase “developed by community”. In fact, during the past 10 years, the major R&D resources of most (if not all) major open source projects have come from enterprise partners. However, some open source projects intentionally downplay the importance of enterprise partners, and even mislead the audience into believing that individual contributors constitute the major part of the above-mentioned community.

(3) In the harvest phase, the founding members (businesses) and their partners might sell enhanced versions of the open source software, or solutions based on the open source software. Although other vendors can also sell similar products or services, major contributors to the software obviously have more authority and reputation in the market. Regarding how businesses can make a profit from open source software, Marten Mickos (currently the CEO of Eucalyptus Systems Inc) said during his tenure as the CEO of MySQL (in 2007) that success in open source requires you to serve (1) those who spend time to save money, and (2) those who spend money to save time. From a financial point of view, success means that revenue from software sales and services should exceed the expense of R&D and marketing. In that sense, some users are able to use open source software for free because (1) their usage is in itself a kind of participation in the open source project, which helps the marketing of the open source software and, in some cases, its testing and bug fixing, and (2) paying customers might also be paying for those who are not paying.

Then why are open source solutions usually cheaper than their proprietary competitors? Generally speaking, proprietary solutions opened up a whole new area from nothing and experienced many challenges in market research, product design, engineering, marketing and sales. Open source solutions, as followers of the proprietary solution, can take the proprietary solution as a reference in market research and product design, and can even take advantage of the proprietary solution’s previous work in opening the market. In terms of R&D effort, open source solutions usually appear several years after proprietary solutions became successful. During that period, technology advancements in related areas lower the bar for entering the competition. Furthermore, open source solutions might have some outstanding features that are far better than those of proprietary solutions, but generally speaking the functionality, user experience, stability, and reliability of open source solutions might not be as good as those of proprietary solutions. This is why open source solutions often promote price advantages such as “30% of the price, 80% of the functionality”. Besides price advantages, the ability to add customized functionality is very attractive for some customers.

In China, IT companies are usually those who are willing to spend time to save money, while traditional (non-IT) companies are those who are willing to spend money to save time. It should be mentioned that most of the traditional non-IT companies in China do not care about open source, but a lot of them are very interested in the ability to make customizations.

Open source as a new business model is not morally loftier than the traditional proprietary business model. Similarly, it is not appropriate to make moral judgements about different approaches to open source practice. In the initial phase of the OpenStack project, Rackspace made public announcements saying that “OpenStack is the only fully open source cloud computing software available in the market”. Competing open source projects such as CloudStack, Eucalyptus, and OpenNebula were labeled as “not truly open source” either because they had an additional enterprise version (based on the open source version) which was not open source, or because they had a more advanced installation package (based on 100% open source smaller packages) that was provided to paying customers only. (Both Eucalyptus and CloudStack had separate enterprise versions until April 2012. OpenNebula maintains a Pro version with all open source components for paying customers.) Similar advertisements continued for almost 2 years, until Rackspace launched its own OpenStack-based Rackspace Private Cloud software, which is very similar to OpenNebula Pro in nature. The major difference is that Rackspace Private Cloud software is free to download for all, while OpenNebula Pro is provided to paying customers only. The problem is that when the number of nodes exceeds 20 in Rackspace Private Cloud software, cloud administrators need to seek help from Rackspace, probably generating leads for fee-based customer support. Let’s leave aside, for the time being, the question of whether the code that sets the 20-node limit is open source or not. It is very difficult to explain from a moral perspective why Rackspace, as a founding member of the OpenStack project, adds functionality to limit the usage of OpenStack. Rather, it is quite reasonable if we look at such a practice from a business perspective.
It is fair to say that during the past two years the measures taken by the OpenStack project in R&D, marketing and community building are outstanding examples of the open source business model.

As mentioned before, there might exist multiple competing open source projects in a particular area of application. For example, in the broader sense of cloud computing we have the Amazon EC2-like CloudStack, Eucalyptus, OpenNebula, and OpenStack, and other options such as Convirt, XenServer, Oracle VM, and OpenQRM. For a particular application scenario, how can a business make a decision among so many open source options? In my experience, the software selection process can be divided into 3 different phases: requirement analysis, technical analysis, and business analysis.

(1) During requirement analysis, we need to determine the real needs of the project and why it needs a cloud solution. In China, many decision makers’ understanding of cloud computing stops at “improve efficiency, lower operation cost, provide convenience”. They do not realize that most open source solutions can already satisfy such requirements in one way or another. Furthermore, many decision makers refer to VMware vCenter when talking about functionality requirements, and do not want to discuss why they need a specific functionality. Therefore, it is very important to investigate the actual application scenario in detail, understand whether this is a data center virtualization management project or an Amazon EC2-like cloud computing project, and explore functionality requirements as much as possible. In some cases, both data center virtualization and Amazon EC2-like solutions can satisfy the needs of the customer; then it is up to the sales person to introduce the customer to their own solution (a technique called expectation management). By carrying out requirement analysis, we can filter out a significant portion of the available options.

(2) During technical analysis, compare the reference architecture of each open source solution, with a focus on how difficult it would be to implement the reference architecture in the actual application scenario. Then compare the open source solutions in terms of functionality, and separate must-have functionality from good-to-have functionality. Furthermore, we can also compare the difficulty of installation and configuration, user experience, documentation, and customization. By carrying out technical analysis, we can rank the open source solutions, and remove the last one from the list.

(3) During business analysis, determine whether the decision maker is willing to pay for open source solutions. If yes, this is a “spend money to save time” scenario. If not, this is a “spend time to save money” scenario. For those who are willing to spend time to save money, the open source community is the major place to seek technical support, so the activeness of the corresponding community is a very important reference. Those who are willing to spend money to save time usually rely on service providers for technical support. Therefore it is very important to know the reputation of the service provider, and whether service is readily available locally. The activeness of the open source community is less important in such a scenario.

In China, for application scenarios where the customer is willing to spend money to save time, CloudStack and Eucalyptus are relatively better options. These two projects started relatively early, have better stability and reliability, have a good reputation in the industry, and have teams in China to provide support and services. We are seeing some startup teams in China trying to provide OpenStack-based solutions. However, these teams are too young and they still need time to accumulate the necessary experience. The Sina App Engine (SAE) team has a lot of first-hand experience with OpenStack, but it does not yet have permission to provide support and services for other commercial customers. There are also some teams in China working with OpenNebula, but they are too small to provide support and services to others in the short term.

For application scenarios where the customer is willing to spend time to save money, CloudStack and OpenStack are better options, because their user and developer communities seem to be more active. Of these two options, CloudStack offers more functionality and has more success stories, which makes it a better choice in the short term. In the long term, OpenStack is becoming more popular, but other options are making progress too. It would be very difficult for one piece of software to dominate in the coming 3 years. I would say that from a business perspective, CloudStack and Eucalyptus will move faster than the others.

G. Additional Notes

Some friends would like me to add some more information about China. Frankly speaking, I don’t have enough data to elaborate on this topic. Liming Liu recently posted a blog entry, which serves as a very good reference. The blog entry can be accessed from this URL, but it is in Chinese.

Regarding the activeness of different open source projects, readers can refer to my recent blog post CY12-Q3 Community Analysis — OpenStack vs OpenNebula vs Eucalyptus vs CloudStack for more information. Regarding performance testing on public cloud service providers, readers can refer to my other blog post HP Cloud Services Performance Tests for more information.

All the figures in this blog entry came from Google search. Many of the concepts mentioned in this blog entry came from Wikipedia, with modifications by the author.

 

CY12-Q3 Community Analysis — OpenStack vs OpenNebula vs Eucalyptus vs CloudStack

By , October 2, 2012 10:59 am

This article is an updated version of my previous article CY12-Q2 Community Analysis — OpenStack vs OpenNebula vs Eucalyptus vs CloudStack. Readers who are interested in further discussion are welcome to contact me via email at the above-mentioned address.

A Chinese version of this article is published at the same time, which can be found at CY12-Q3 OpenStack, OpenNebula, Eucalyptus, CloudStack社区活跃度比较.

The objective of this article is to compare the OpenStack, OpenNebula, Eucalyptus and CloudStack user and developer communities, based on the communications between community members in the form of mailing lists or public forum discussions. The data being discussed include the total number of topics (threads), messages (posts), and participants (unique email addresses or registered members). To obtain the above-mentioned data, a Java program was written to retrieve all the forum posts and mailing list messages into a MySQL database for further processing. The analysis results are presented in the form of graphs generated by LibreOffice.

In CY12-Q3, we are adding the long-neglected https://answers.launchpad.net/openstack and http://lists.openstack.org/pipermail/*/ to the analysis. It turns out that these two sources contain a huge amount of data that has a significant impact on the analysis result.

Also, when the CY12-Q2 report was published, some people questioned the inclusion of the incubator-cloudstack-dev mailing list. This particular mailing list contains a lot of messages that are automatically generated by JIRA. In CY12-Q3, we set up a filter to reject all messages with the identifier “[jira]” in the subject.
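A filter of this kind could look roughly like the sketch below. It is not the code used for the report: the class and method names are hypothetical, the sample subjects are made up, and the decision to ignore upper and lower case when looking for “[jira]” is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch: drop mailing list messages whose subject carries the "[jira]" marker.
class JiraFilter {

    // True when the subject contains "[jira]". Ignoring case is an assumption;
    // the report only states that the identifier appears in the subject.
    static boolean isJiraGenerated(String subject) {
        return subject != null
                && subject.toLowerCase(Locale.ROOT).contains("[jira]");
    }

    // Returns the subjects that were not generated by JIRA, in the original order.
    static List<String> keepHumanSubjects(List<String> subjects) {
        List<String> kept = new ArrayList<>();
        for (String subject : subjects) {
            if (!isJiraGenerated(subject)) {
                kept.add(subject);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up subjects, for illustration only.
        List<String> subjects = Arrays.asList(
                "[jira] [Created] Example issue title",
                "Re: How do I add a second zone?",
                "[DISCUSS] Release schedule");
        for (String subject : keepHumanSubjects(subjects)) {
            System.out.println(subject);
        }
    }
}
```

In a setup like the one described in this article, such a check would run before a message is written to the database, so that the thread, post, and participant counts are computed on human-written messages only.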

Figures 1 and 2 show the monthly number of topics (threads) and posts (messages). It can be seen that

(1) the volume of OpenStack and CloudStack related discussions is much higher than that of Eucalyptus and OpenNebula; and

(2) the Eucalyptus and OpenNebula clubs are exhibiting similar behaviors, with only minor differences.

Generally speaking, the number of replies to a specific topic represents the attention being received, and the depth of discussion for that particular topic. When the number of master posts (the original posts that start a topic) is more than the number of replies, it is safe to conclude that the participation in the forum or mailing list is very low. Therefore, the ratio between “the number of posts” and “the number of topics” represents the participation rate of an online community. In this study we call this ratio the Participation Ratio.
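As a small illustration, the Participation Ratio defined above is a single division. The class and method names below are hypothetical, and so are the counts in the example. Assuming the post count includes the master post of each topic, a ratio below 2 corresponds to the low-participation case where replies are fewer than master posts.

```java
// Sketch: Participation Ratio = number of posts / number of topics.
class ParticipationRatio {

    static double of(long posts, long topics) {
        if (topics <= 0) {
            throw new IllegalArgumentException("topics must be positive");
        }
        return (double) posts / topics;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL monthly counts, for illustration only.
        long topics = 100;
        long posts = 400;
        System.out.println(of(posts, topics)); // prints 4.0
    }
}
```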

In the past the OpenStack project had a much higher participation ratio than the others. However, the participation ratio of CloudStack is climbing steadily. Currently CloudStack and OpenStack have the best participation ratios, which are close to 4. OpenNebula and Eucalyptus have similar participation ratios, which are close to 3.

Figure 4 shows the number of monthly participants of the four projects being discussed. It can be seen that the number of active participants of CloudStack and OpenStack is much higher than that of OpenNebula and Eucalyptus. However, during the past 3 months, the number of participants for both CloudStack and OpenStack has decreased slightly.

It should be noted that although the number of active participants of CloudStack is somewhat lower than that of OpenStack, the volume of discussion (in terms of the monthly number of threads and messages) of the two projects is on the same level. This indicates that the active members in the CloudStack club are talking more than those in the OpenStack club (on average).

Accumulated Community Population refers to the total number of users and developers who have participated in forum or mailing list discussions. (This number does not include those who have registered for discussion forums or mailing lists but have never participated in any open discussion.) These are people who have tested or used a specific product for a while, but are not necessarily active users at present.

Figure 5 shows the accumulated community populations of the four projects being discussed. The Eucalyptus project still has the biggest population, but OpenStack is quickly catching up. It is expected that the OpenStack population will exceed that of Eucalyptus in CY12-Q4. In our CY12-Q2 report we predicted that the CloudStack population would exceed the OpenNebula population in a very short period. It only took CloudStack a month to accomplish that!

If you compare the CY12-Q3 report with the CY12-Q2 report, you will find that the population curve for OpenStack has changed a lot. This is due to the inclusion of the https://answers.launchpad.net/openstack and http://lists.openstack.org/pipermail/*/ data sources. It should be noted that Launchpad Answers and the mailing lists share the same registration database, but display different names for the same person. Therefore, it is very possible that a large number of users were counted twice in the OpenStack population. We have carried out some basic de-duplication to eliminate the obvious duplicates, but there is still a lot of room for improvement. A rough estimate is that the real OpenStack population would be about 85% of the numbers shown in this analysis.

There might exist a certain level of duplication in the community populations of CloudStack and Eucalyptus. We did look into the data and found some duplicates. However, the level of duplication seems to be so small for both projects that it does not have much impact on the analysis results.

Figure 6 shows the monthly population growth of the four projects being discussed. During the past 3 months, the populations of OpenStack and CloudStack are growing at the same pace.

The populations of Eucalyptus and OpenNebula are growing at a relatively slow pace, compared with those of CloudStack and OpenStack.

Figure 7 is a combination of Figure 4 and Figure 6. The solid lines represent the monthly participants, while the dashed lines represent the monthly new members.

For OpenStack and OpenNebula, around 30% of their monthly participants are new members.  For CloudStack and Eucalyptus, around 50% of their monthly participants are new members. This indicates OpenStack and OpenNebula communities are more “sticky” than CloudStack and Eucalyptus communities.

For each of the projects being discussed, the monthly population growth is somewhat “synchronous” with its monthly participants. That is to say, the population growth of a community is somewhat related to the “activeness” of the community. This also suggests that both the population growth and the “activeness” of a community might be event-driven. A new software release, a technical conference, or a marketing event might be the cause of the growth in population and “activeness” of the respective community.

Figure 8 shows the total community population, active participants of the past quarter, and active participants of the past month, of the four projects being discussed. It can be seen that

(1) Eucalyptus has the largest total population, followed by OpenStack, CloudStack, and OpenNebula;

(2) OpenStack has the largest active population during the past quarter, followed by CloudStack, Eucalyptus, and OpenNebula;

(3) OpenStack has the largest active population during the past month, followed by CloudStack, Eucalyptus, and OpenNebula.

Occasionally I come across people saying “Hey, you are talking too much! Why don’t you tell me which one is THE most active project in this area?” I agree that this is an important question, and I guess there are many more who do not ask simply because they know that I don’t know the answer.

For quite some time I have been looking for a magic number to indicate the “relative activeness” of a community as compared to other alternatives. This magic number should be a combination of the following parameters:

(1) monthly messages, which represents the volume of the discussions;

(2) participation ratio, which represents the average number of answers to a question;

(3) active population of the past quarter, which represents the possibility to get help from community in the long term; and

(4) active population of the past month, which represents the possibility to get help from the community in the short term.

In this analysis, we choose the average values of these parameters as the reference data set, and compare the corresponding parameters of each community with the reference data set. Then we call the sum of the relative values of a community the “community activeness index” of the community. Now we can say the project with the highest “community activeness index” is THE most active project in this area.
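One way to read the definition above is sketched below. Each parameter of a community is divided by the average of that parameter across all communities, and the four relative values are added up with equal weight. The class name, the method name, and the numbers in the example are hypothetical, and the exact algorithm behind Figure 9 may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: "community activeness index" as a sum of values relative to the average.
class ActivenessIndex {

    // metrics maps a community name to its parameters, in a fixed order:
    // {monthly messages, participation ratio, active population (quarter),
    //  active population (month)}.
    static Map<String, Double> compute(Map<String, double[]> metrics) {
        if (metrics.isEmpty()) {
            throw new IllegalArgumentException("no communities given");
        }
        int n = metrics.values().iterator().next().length;

        // Reference data set: the average of each parameter over all communities.
        double[] average = new double[n];
        for (double[] row : metrics.values()) {
            if (row.length != n) {
                throw new IllegalArgumentException("rows must have equal length");
            }
            for (int i = 0; i < n; i++) {
                average[i] += row[i] / metrics.size();
            }
        }
        for (double a : average) {
            if (a == 0.0) {
                throw new IllegalArgumentException("a parameter averages to zero");
            }
        }

        // Index: sum of each parameter relative to its average.
        Map<String, Double> index = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : metrics.entrySet()) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += e.getValue()[i] / average[i];
            }
            index.put(e.getKey(), sum);
        }
        return index;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL numbers, not taken from the report.
        Map<String, double[]> metrics = new LinkedHashMap<>();
        metrics.put("Project A", new double[] {200, 4, 60, 30});
        metrics.put("Project B", new double[] {100, 2, 20, 10});
        compute(metrics).forEach((name, value) ->
                System.out.println(name + ": " + value));
    }
}
```

With this definition, a community that sits exactly at the average on all four parameters scores 4, so values above 4 mean “more active than average”. Changing the weight of a parameter, as mentioned later in this article, would amount to multiplying each relative value by a weight before summing.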

As can be seen from Figure 9, OpenStack is currently THE most active project (with obvious advantage), followed by CloudStack, Eucalyptus, and OpenNebula.

The above-mentioned concept of a “community activeness index” is still very primitive, with a lot of room for improvement. However, it is an attempt to replace the old-fashioned “I think”, “I believe” and “I guess” practices with quantitative analysis. In our future community analyses, we will continue to use this concept to provide a quarterly ranking for OpenStack, OpenNebula, Eucalyptus, and CloudStack. Improvements to the algorithm (such as adding or removing parameters, or changing the weight of different parameters) will be made when necessary.

For many cloud computing professionals, the dramatic growth achieved by the CloudStack project during the past 6 months was quite unexpected. Therefore we conducted an email interview with Sheng Liang, the CTO of Cloud Platforms at Citrix. Below is Sheng’s explanation for CloudStack’s success in building a highly active open source community:

“Apache CloudStack has flourished under the Apache Software Foundation which kept us from having to waste efforts coming up with a new open source governance model. Developers have responded well to the Apache Way with contributions flowing in from our rapidly growing community of over 35,000 individuals. We also are pleased with the organic way technology providers and open source projects are integrating their software with Apache CloudStack. Leadership of the project has also shifted from Citrix to a number of other individual committers who have been driving an aggressive development schedule. The upcoming 4.0 release is very exciting as it’s the first major release under Apache including code from numerous  production users of CloudStack who developed features based on their experience running live cloud computing environments. Anecdotally we are seeing CloudStack deployments popping up everywhere from financial institutions and gaming companies to universities (we understand a CloudStack cluster even helped crunch research data for the Higgs Boson discovery).  I am sure the excitement around CloudStack will continue given the incredible progress in under six short months.”

From an end user’s perspective, it is good to see the competition heating up, because that means more choices with better quality. Cloud computing is still an evolving market that is highly immature, and we expect more competition to come in the future.

For your convenience, a PDF version of this presentation can be downloaded from here. Please kindly keep the author information if you want to redistribute the content.
