Category: Technical

Getting Started with AWS SDK for Java (1)

By , June 3, 2015, 10:48 AM

This is an entry level tutorial on how to use the AWS SDK for Java to interact with the various AWS services. Although we will cover a little bit of Java programming basics, this is not a tutorial on the Java programming language itself. If you want to learn the Java programming language, I strongly recommend that you go through The Java Tutorial, which was developed by Sun Microsystems in the very early days (and improved and refined by Oracle later on).

To avoid exposing your AWS credentials in your code, all the examples in this tutorial use the credentials from IAM roles to authenticate with the AWS services. To run these examples, you will need to launch an EC2 instance (in this tutorial, we use Ubuntu 14.04 as the testing environment) with an IAM role. The IAM role should have sufficient permission to access the various AWS services you would like to test. For more information on this topic, please refer to the AWS documentation on IAM Roles for EC2.

[Java 8 SDK, AWS SDK for Java, Demo Code]

Assuming that you have launched an EC2 instance with a Ubuntu AMI, let’s SSH into the EC2 instance and install the Java 8 SDK:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

$ javac -version
javac 1.8.0_45
$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Now let’s download and configure the AWS SDK for Java:

$ cd ~
$ wget http://sdk-for-java.amazonwebservices.com/latest/aws-java-sdk.zip
$ sudo apt-get install unzip
$ unzip aws-java-sdk.zip

Now you have a folder with the name aws-java-sdk-1.x.xx in your home folder. The AWS libraries reside in the lib sub-folder, while third-party dependencies reside in the third-party sub-folder. In order to use these libraries, you will need to add these JAR files to your CLASSPATH one by one. However, for the convenience of this tutorial, we simply copy everything into our JRE's lib/ext sub-folder. With this approach, you don't need to worry about the CLASSPATH at all. (If you are using a different version of the Java SDK, replace the paths in the following commands with the actual location of your Java installation.)

$ sudo cp aws-java-sdk-*/lib/*.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/
$ sudo cp aws-java-sdk-*/third-party/*/*.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/

The demo code for this tutorial is available in my GitHub repository. The demo project uses Maven as the project management tool, so we will need to install Maven and git, then clone the code from the repository.

$ cd ~
$ sudo apt-get install maven git
$ git clone https://github.com/qyjohn/aws-sdk-java-demo

Now we try to build the project and run a test application:

$ cd aws-sdk-java-demo
$ mvn compile
$ mvn package
$ java -cp target/demo-1.0-SNAPSHOT.jar net.qyjohn.aws.App
Hello World!

At this point, you have successfully configured your development environment and are ready to move forward to the rest of this tutorial.

[Amazon EC2 Client]

In this section, we use the AmazonEC2Client to accomplish some basic tasks such as launching an EC2 instance, listing all EC2 instances in a particular region, and terminating a particular EC2 instance. The related source code for this demo is DemoEC2.java (you can click on the link to view the source code in a separate browser tab). You should also take a look at the Java docs for the AmazonEC2Client to get yourself familiar with the various properties and methods.

We create an instance of the AmazonEC2Client in the constructor. At the same time, we specify which region we are going to use. You should always specify a region when manipulating AWS resources, unless the resource you are operating on is global (for example, IAM).

public class DemoEC2 
{
	public AmazonEC2Client client;

	/**
	 *
	 * Constructor
	 *
	 */

	public DemoEC2()
	{
		// Create the AmazonEC2Client
		client = new AmazonEC2Client();
		// Set the region to ap-southeast-2
		client.setRegion(Regions.AP_SOUTHEAST_2);
	}

To launch an EC2 instance, you will need to create a RunInstancesRequest object, then pass it to the runInstances() method of the AmazonEC2Client, which returns a RunInstancesResult object. As shown in the following demo code, the RunInstancesRequest object contains information such as the AMI, the instance type, the key pair, the subnet, the security group, and the number of instances to be launched. Many other options can be supplied to the RunInstancesRequest object; refer to the API docs when needed.

	public String launchInstance()
	{
		System.out.println("\n\nLAUNCH INSTANCE\n\n");

		try
		{
			// Construct a RunInstancesRequest.
			RunInstancesRequest request = new RunInstancesRequest();
			request.setImageId("ami-fd9cecc7");	// the AMI ID, ami-fd9cecc7 is Amazon Linux AMI 2015.03 (HVM)
			request.setInstanceType("t2.micro");	// instance type
			request.setKeyName("desktop");		// the keypair
			request.setSubnetId("subnet-2dc0d459");	// the subnet
			ArrayList<String> list = new ArrayList<String>();
			list.add("sg-efcc248a");			// security group, call add() again to add more than one
			request.setSecurityGroupIds(list);
			request.setMinCount(1);	// minimum number of instances to be launched
			request.setMaxCount(1);	// maximum number of instances to be launched

			// Pass the RunInstancesRequest to EC2.
			RunInstancesResult  result  = client.runInstances(request);
			String instanceId = result.getReservation().getInstances().get(0).getInstanceId();
			
			// Return the first instance id in this reservation.
			// So, don't launch multiple instances with this demo code.
			System.out.println("Launching instance " + instanceId);
			return instanceId;
		} catch (Exception e)
		{
			// Simple exception handling by printing out error message and stack trace
			System.out.println(e.getMessage());
			e.printStackTrace();
			return "ERROR";
		}
	}

To list all the EC2 instances in a region, we simply call the describeInstances() method with no argument, which returns a DescribeInstancesResult. In the DescribeInstancesResult, we traverse through all the Reservation objects, which are stored in a List. Each Reservation includes one or more EC2 Instance objects, also stored in a List. You can get the information for each EC2 instance from the corresponding Instance object.

The concept of Reservation seems to be confusing, and many people mistakenly think that it is the same as Reserved Instances, but in fact it is not. According to the boto documentation, a reservation corresponds to a command to start instances. If you launch two EC2 instances in one batch (for example, by specifying the number of instances in the EC2 Console), this particular Reservation will have two EC2 instances. You can stop and start the EC2 instances to change their state, but this will not change their Reservation.

	public void listInstances()
	{
		System.out.println("\n\nLIST INSTANCE\n\n");
        	try 
		{
			// DescribeInstances
			DescribeInstancesResult result = client.describeInstances();

			// Traverse through the reservations
			List<Reservation> reservations = result.getReservations();
			for (Reservation reservation: reservations)
			{
				// Print out the reservation id
				String reservation_id = reservation.getReservationId();
				System.out.println("Reservation: " + reservation_id);
				// Traverse through the instances in a reservation
				List<Instance> instances = reservation.getInstances();
				for (Instance instance: instances)
				{
					// Print out some information about the instance
					String id = instance.getInstanceId();
					String state = instance.getState().getName();
					System.out.println("\t" + id + "\t" + state);
				}
			}

	        } catch (Exception e) 
		{
			// Simple exception handling by printing out error message and stack trace
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
	}
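
The describeInstances() call above, with no argument, returns every instance in the region. If you only need a subset, you can pass in a DescribeInstancesRequest with one or more filters. The following method is not part of the demo project, just a minimal sketch (it assumes the usual com.amazonaws.services.ec2.model imports plus java.util.Arrays) that lists only the instances in the running state:

	public void listRunningInstances()
	{
		try
		{
			// Filter on the "instance-state-name" attribute
			DescribeInstancesRequest request = new DescribeInstancesRequest();
			request.withFilters(new Filter("instance-state-name", Arrays.asList("running")));

			// Only reservations containing matching instances are returned
			DescribeInstancesResult result = client.describeInstances(request);
			for (Reservation reservation : result.getReservations())
			{
				for (Instance instance : reservation.getInstances())
				{
					System.out.println(instance.getInstanceId() + "\t" + instance.getInstanceType());
				}
			}
		} catch (Exception e)
		{
			e.printStackTrace();
		}
	}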

To terminate an EC2 instance, you will need to create a TerminateInstancesRequest object. The TerminateInstancesRequest object accepts a List of EC2 instance ids through the setInstanceIds() method. Then you pass the TerminateInstancesRequest to the AmazonEC2Client's terminateInstances() method, which returns a TerminateInstancesResult object. In the TerminateInstancesResult object, you have a List of InstanceStateChange, and each InstanceStateChange object contains information about the EC2 instance id, its previous state, and its current state.

	public void terminateInstance(String instanceId)
	{
		System.out.println("\n\nTERMINATE INSTANCE\n\n");
		try
		{
			// Construct the TerminateInstancesRequest
			TerminateInstancesRequest request = new TerminateInstancesRequest();
			ArrayList<String> list = new ArrayList<String>();
			list.add(instanceId);			// instance id
			request.setInstanceIds(list);

			// Pass the TerminateInstancesRequest to EC2
			TerminateInstancesResult result = client.terminateInstances(request);
			List<InstanceStateChange> changes = result.getTerminatingInstances();
			for (InstanceStateChange change : changes)
			{
				String id = change.getInstanceId();
				String state_prev = change.getPreviousState().toString();
				String state_next = change.getCurrentState().toString();
				System.out.println("Instance " + id + " is changing from " + state_prev + " to " + state_next + ".");
			}
		} catch (Exception e)
		{
			// Simple exception handling by printing out error message and stack trace
			System.out.println(e.getMessage());
			e.printStackTrace();
		}
	}
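
Stopping an EC2 instance (mentioned earlier in the discussion of Reservation) follows exactly the same pattern, using a StopInstancesRequest and the stopInstances() method. The method below is not part of the demo code, only a sketch to illustrate the call; the method name stopInstance is just for illustration:

	public void stopInstance(String instanceId)
	{
		try
		{
			// StopInstancesRequest accepts a list of instance ids, just like TerminateInstancesRequest
			StopInstancesRequest request = new StopInstancesRequest();
			request.setInstanceIds(Arrays.asList(instanceId));

			// Each InstanceStateChange reports the previous and the current state
			StopInstancesResult result = client.stopInstances(request);
			for (InstanceStateChange change : result.getStoppingInstances())
			{
				System.out.println("Instance " + change.getInstanceId() + " is changing from "
					+ change.getPreviousState().getName() + " to " + change.getCurrentState().getName() + ".");
			}
		} catch (Exception e)
		{
			e.printStackTrace();
		}
	}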

Now you can make modifications to the demo code (in DemoEC2.java) with your preferred AWS region, AMI, instance type, along with other information. You can run the demo code using the following commands:

$ mvn compile
$ mvn package
$ java -cp target/demo-1.0-SNAPSHOT.jar net.qyjohn.aws.DemoEC2

This demo will launch one EC2 instance, list all the EC2 instances in the region, terminate the EC2 instance we just launched, and list all the EC2 instances in the region again. In order to demonstrate the state changes, we do a sleep of 10 seconds between each API call. After you have completed this exercise, you should intentionally introduce some errors in the code to observe the various exceptions thrown by the demo code. For example, you can point your AmazonEC2Client to use the us-east-1 region, but specify a subnet id or a security group id in another region.
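
When such an error occurs, the SDK throws an AmazonServiceException (for requests that reached AWS but were rejected) or an AmazonClientException (for client-side problems such as network failures). Instead of the generic catch block in the demo code, you can catch these exceptions separately to see exactly why a call failed. This is a minimal sketch, not part of the demo project:

		try
		{
			client.describeInstances();
		} catch (AmazonServiceException ase)
		{
			// The request reached AWS but was rejected; the service tells you why
			System.out.println("Error code:    " + ase.getErrorCode());
			System.out.println("Status code:   " + ase.getStatusCode());
			System.out.println("Request id:    " + ase.getRequestId());
			System.out.println("Error message: " + ase.getMessage());
		} catch (AmazonClientException ace)
		{
			// The request never reached AWS, for example a network or credentials problem
			System.out.println("Client-side error: " + ace.getMessage());
		}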

[Logging Considerations]

When you encounter an error while making an API call with the AWS SDK for Java, chances are that you have made a mistake in the AWS resource you specified, while the code itself is completely valid. In order to debug such issues, you will need to know what information is being sent to the AWS endpoint, and what information is returned from the AWS endpoint. In the dark ages, debugging such an application could only be done by digging into the Java objects containing the request and the result, then printing out the information one field at a time with System.out.println(). Fortunately you don't need to do this with the AWS SDK for Java, because the AWS SDK for Java is instrumented with Apache Commons Logging. All you need to do is grab a recent copy of log4j and set up the proper CLASSPATH, then ask your application to use the proper log4j configuration file.

In this demo, the JAR files for log4j version 2.3 are provided in the third-party folder. We simply copy everything into our JRE's lib/ext sub-folder. With this approach, you don't need to worry about the CLASSPATH at all.

$ cd third-party
$ sudo cp *.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/

The log4j configuration file is log4j2.xml, which is located in the top-level folder of the demo code. You should refer to the log4j manual to understand how log4j works. To enable log4j for our demo application, you simply need to uncomment the Logger line in DemoEC2.java, as shown below:

public class DemoEC2 
{
	public AmazonEC2Client client;
	final static Logger logger = Logger.getLogger(DemoEC2.class);

After that, you will need to compile and package the demo code again. When running the Java application, you will need to pass the log4j configuration file to the java command:

$ mvn compile
$ mvn package
$ java -cp target/demo-1.0-SNAPSHOT.jar -Dlog4j.configurationFile=log4j2.xml net.qyjohn.aws.DemoEC2

As you can see, when you run the application, the request you send to the AWS endpoint, as well as the response from the AWS endpoint, are now displayed on your screen. With this information, it is a lot easier to debug your API calls to AWS endpoints.

This completes the first chapter of this "Getting Started with AWS SDK for Java" tutorial. In the future I will publish more on this topic on an irregular basis. Hopefully I will be able to cover the majority of the AWS services that we use on a daily basis. So, please stay tuned for future updates.

Distributed File System on Amazon Linux — GlusterFS

By , May 13, 2015, 9:35 AM

[Introduction]

This article provides a quick start guide on how to set up and configure GlusterFS on Amazon Linux. Two EC2 instances are launched to accomplish this goal. On both EC2 instances, there is an instance-store volume serving as the shared storage.

Edit /etc/hosts on both EC2 instances with the following entries (assuming that the private IP addresses are 172.31.0.11 and 172.31.0.12).

172.31.0.11	gluster01
172.31.0.12	gluster02

Create a repository definition /etc/yum.repos.d/gluster-epel.repo with the following content:

# Place this file in your /etc/yum.repos.d/ directory

[glusterfs-epel]
name=GlusterFS is a clustered file-system capable of scaling to several petabytes.
baseurl=http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/epel-6/$basearch/
enabled=1
skip_if_unavailable=1
gpgcheck=0

[glusterfs-noarch-epel]
name=GlusterFS is a clustered file-system capable of scaling to several petabytes.
baseurl=http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/epel-6/noarch
enabled=1
skip_if_unavailable=1
gpgcheck=0

[glusterfs-source-epel]
name=GlusterFS is a clustered file-system capable of scaling to several petabytes. - Source
baseurl=http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/epel-6/SRPMS
enabled=0
skip_if_unavailable=1
gpgcheck=0

Use the following commands to install the necessary packages and start services.

sudo yum update
sudo yum install fuse fuse-libs nfs-utils
sudo yum install glusterfs glusterfs-fuse glusterfs-geo-replication glusterfs-server

sudo chkconfig glusterd on
sudo chkconfig glusterfsd on
sudo chkconfig rpcbind on
sudo service glusterd start
sudo service rpcbind start

[Configurations]

On both EC2 instances, create a file system on the instance-store volume and mount it at /glusterfs.

sudo mkdir /glusterfs
sudo mkfs.ext4 /dev/xvdb
sudo mount /dev/xvdb /glusterfs
sudo mkdir -p /glusterfs/brick

On gluster01, probe the other EC2 instance and create the GlusterFS volume. The name of the volume is “test-volume”.

sudo gluster peer probe gluster02
sudo gluster volume create test-volume replica 2 transport tcp gluster01:/glusterfs/brick gluster02:/glusterfs/brick
sudo gluster volume start test-volume

On both EC2 instances, mount the GlusterFS volume "test-volume". Then change the ownership of the /mnt/glusterfs folder to "ec2-user:ec2-user" so that the default "ec2-user" account can use the shared folder directly.

sudo mkdir -p /mnt/glusterfs
sudo mount -t glusterfs gluster01:test-volume /mnt/glusterfs
cd /mnt
sudo chown -R ec2-user:ec2-user glusterfs
mount
df -h

Now the shared file system has been set up. You can create a text file under /mnt/glusterfs and observe that it appears on both EC2 instances. Please bear in mind that this is only a quick start guide; you should not use this configuration directly in a production system without further tuning.

Distributed File System on Amazon Linux — MooseFS

By , May 13, 2015, 6:40 AM

[Introduction]

This article provides a quick start guide on how to set up and configure MooseFS on Amazon Linux. Two EC2 instances are launched to accomplish this goal. On both EC2 instances, there is an instance-store volume serving as the shared storage. One of the EC2 instances is used as the master server, while both EC2 instances are chunk servers (storage servers).

[Master Server Installation]

Edit /etc/hosts, add the following record (assuming that the private IP of the master node is 172.31.0.10):

172.31.0.10    mfsmaster

Then run the following commands to add the MooseFS repository:

wget http://ppa.moosefs.com/stable/yum/RPM-GPG-KEY-MooseFS
sudo cp RPM-GPG-KEY-MooseFS /etc/pki/rpm-gpg/
wget http://ppa.moosefs.com/stable/yum/MooseFS.repo
sudo cp MooseFS.repo /etc/yum.repos.d/

Install MooseFS master and CLI:

sudo yum update
sudo yum install moosefs-ce-master
sudo yum install moosefs-ce-cli
cd /etc/mfs

In the /etc/mfs folder, you should see mfsmaster.cfg and mfsexports.cfg. If they don’t exist, copy mfsmaster.cfg.dist to mfsmaster.cfg and copy mfsexports.cfg.dist to mfsexports.cfg.

Modify mfsexports.cfg to add permission for the 172.31.0.0/16 subnet. Add this line to the end of the file:

172.31.0.0/16        /    rw,alldirs,maproot=0

One more CRITICAL step: provide the initial (empty) metadata file for the master:

cd /var/lib/mfs
sudo cp metadata.mfs.empty metadata.mfs

Start the MooseFS master:

sudo service mfsmaster start

At this point, the MooseFS master node is running successfully. You can use the following MFS CLI to do a quick check:

mfscli -SIN

[Chunk Server Installation]

Edit /etc/hosts, add the following record (assuming that the private IP of the master node is 172.31.0.10):

172.31.0.10    mfsmaster

Then run the following commands to add the MooseFS repository:

wget http://ppa.moosefs.com/stable/yum/RPM-GPG-KEY-MooseFS
sudo cp RPM-GPG-KEY-MooseFS /etc/pki/rpm-gpg/
wget http://ppa.moosefs.com/stable/yum/MooseFS.repo
sudo cp MooseFS.repo /etc/yum.repos.d/

Install MooseFS chunk server and CLI:

sudo yum update
sudo yum install moosefs-ce-chunkserver
sudo yum install moosefs-ce-cli
cd /etc/mfs

In the /etc/mfs folder, you should see mfschunkserver.cfg and mfshdd.cfg. If they don’t exist, copy mfschunkserver.cfg.dist to mfschunkserver.cfg and copy mfshdd.cfg.dist to mfshdd.cfg.

Assume that you have a second EBS volume (or instance-store volume), /dev/xvdb, on your EC2 instance that you would like to use as the storage for MooseFS. Create a file system on it and mount it:

sudo mkfs.ext4 /dev/xvdb
sudo mkdir /mfs
sudo mount /dev/xvdb /mfs
sudo chown -R mfs:mfs /mfs

Then edit /etc/mfs/mfshdd.cfg, add one line to the end of the file:

/mfs

Create a file /etc/default/moosefs-ce-chunkserver, with the following content:

MFSCHUNKSERVER_ENABLE=true

Start MooseFS chunk server with the following command:

sudo service mfschunkserver start

Now you can check the status of the whole storage system using MooseFS CLI:

mfscli -h
mfscli -SCS

[Client Installation]

Edit /etc/hosts, add the following record (assuming that the private IP of the master node is 172.31.0.10):

172.31.0.10    mfsmaster

Then run the following commands to add the MooseFS repository:

wget http://ppa.moosefs.com/stable/yum/RPM-GPG-KEY-MooseFS
sudo cp RPM-GPG-KEY-MooseFS /etc/pki/rpm-gpg/
wget http://ppa.moosefs.com/stable/yum/MooseFS.repo
sudo cp MooseFS.repo /etc/yum.repos.d/

Install MooseFS client and CLI:

sudo yum update
sudo yum install fuse
sudo yum install moosefs-ce-client
sudo yum install moosefs-ce-cli

Mount the MooseFS shared storage on the local machine:

sudo mkdir -p /mnt/mfs
sudo mfsmount /mnt/mfs -H mfsmaster
df -h

Now the shared file system has been set up. You can create a text file under /mnt/mfs and observe that it appears on every machine that mounts the MooseFS volume. Please bear in mind that this is only a quick start guide; you should not use this configuration directly in a production system without further tuning.

HP Public Cloud — Internal Server Error

By , May 5, 2015, 6:35 PM

[Screenshot 2014-05-06, 9:46:21 AM]

This is a screen capture taken off my computer on May 06, 2014. The next day, this screenshot, along with a detailed explanation, was sent to a couple of folks working at HP Cloud. Today is the first anniversary of this screen capture, and I think it is a good time to tell the story behind it.

In February 2014, my family relocated from China to Australia. As you can imagine, I had a new local bank account with a new bank card, and my old bank card in China eventually expired. I managed to update my payment information with most of the online services that I use. With HP Cloud, however, each and every time I got this unfriendly message when submitting my payment information through the Horizon console. On May 06, 2014 I received an email titled "Action Required – Payment Declined" from HP Cloud, which was very frustrating for a customer who had every intention of paying the bill and had tried his best to give the vendor the required payment information. I ended up initiating an online chat session with the HP Cloud support team, and the agent happily updated my payment information and resolved the issue. I understand that there is usually a gap between the support and product teams, so I took the extra effort to send the screen capture to a couple of folks with HP Cloud email addresses from my address book. A director-level manager replied to say thank you, and pointed out that the issue had been escalated to the public cloud billing team.

Since then, I got this “Action Required – Payment Declined” email notification almost every month (but not every month). Each time I got this message, I attempted to update my payment information again, but was greeted with the same “Internal Server Error”. Then I ended up initiating another chat session with the HP Cloud support team, who happily resolved the issue for me.

On April 08, 2015, I was in another chat with the HP Cloud support team. I told the agent that I was upset by the situation, and the agent told me that "I'm sorry for the incontinence. Some cards you used are declined by your bank and you get an automatic notice from our system. This is a bank issues." I was very unhappy about this accusation – I had tried my best to give HP Cloud my payment information, and every month I reminded HP Cloud of the correct payment method, but HP Cloud insisted on using the wrong payment information. So I asked the agent for a transcript of this particular chat. I also told the agent that should this happen again in May, I was going to write a story and publish it to the Internet. The next day, after obtaining approval from management, the agent sent me the transcript of the chat.

Today, I have a new "Action Required – Payment Declined" message waiting in my inbox. As promised last month, I am writing down this story, and I will present it to the next HP Cloud support agent. I hope that he/she will enjoy reading it, and will help me resolve the issue I have.

A Brief Discussion of Public Cloud Development in the "China" Context

By , May 4, 2015, 3:57 PM

The purpose of this article is to briefly lay out some personal opinions on the development of the public cloud in the "China" context.

1. The Scale of the Public Cloud

Simply put, a public cloud provides computing resources to the public as a service. Within the scope of this article, computing resources mainly mean compute (virtual machines), extended to storage and network resources where necessary. In the jargon that practitioners can recite in their sleep, users consume computing resources the way they consume water and electricity: acquired on demand, billed by usage. Based on this definition, a genuine public cloud needs a certain scale before it can meet the basic requirement of serving the "public". [Within the scope of this article, managed cloud is treated as a special case of public cloud.]

According to Gartner's statistics, between 2006 and 2014 annual shipments in the global server hardware market fluctuated steadily around 10,000,000 units. The Asia-Pacific region accounted for roughly 1/4 of that, or 2,500,000 units. The share of servers shipped within China is not known, but a conservative estimate of 1/5 of the Asia-Pacific figure gives 500,000 units. Assuming a 3-year depreciation cycle, there are at least 1,500,000 physical servers in active service nationwide. For an industrial-scale public cloud provider serving "China", suppose that once its business matures it holds 2% of the nation's computing resources; that is 30,000 physical servers. With a virtualization ratio of 1:3 to 1:4, that translates to roughly 100,000 virtual machines. As a new type of service, the public cloud market still has considerable room for organic growth, so the scale a public cloud can reach five years from now will only be larger than this number.

According to AWS's recently published financial results, sales revenue in the first quarter of 2015 reached US$1.56 billion. Assuming that EC2 and the other services built on EC2 contribute 50% of that revenue, and estimating with a mid-range m3.large instance (2 vCPU cores, 7.5 GB of memory, US$0.14 per hour), this is equivalent to 2,500,000 EC2 instances. Estimates based on Rackspace's financial reports over the years suggest that in 2014 Rackspace used somewhere between 20,000 and 30,000 physical servers for its public cloud services, which also translates to about 100,000 virtual machines. Therefore, taking 100,000 virtual machines as a baseline target is not an unrealistic ambition.

Based on these estimates, we can judge the growth stage of a public cloud startup by its scale:

  • Concept stage: fewer than 5,000 virtual machines. The company's ultimate goal is relatively vague, wavering between being a private cloud solution provider and a public cloud service provider. At the tactical level, it lacks a clear technical roadmap; the product is relatively primitive and has no clear technical metrics.
  • Prototype stage: fewer than 10,000 virtual machines. The company has essentially positioned itself as a public cloud service provider. Because of the huge differences between public and private cloud, it necessarily gives up the identity of private cloud solution provider. At the tactical level, a relatively clear technical roadmap has taken shape; the basic product (cloud servers) is largely finalized, with explicit technical metrics for both downtime and performance. On top of cloud servers, it offers peripheral products such as load balancers, databases, and caches that can handle low to medium workloads.
  • Growth stage: fewer than 50,000 virtual machines. The basic product (cloud servers) can meet the requirements of high-performance computing, and a series of modular peripheral products have been developed. Ordinary users can build large-scale, scalable applications entirely from the modules offered by the cloud service provider, with no intervention from the provider.
  • Maturity stage: fewer than 100,000 virtual machines. On the technology side, resource utilization starts to improve and economies of scale begin to appear. On the market side, customer loyalty starts to improve and the Matthew effect begins to appear. This indicates that the company has gained a meaningful share of the public cloud market, and that its products and technology are widely recognized in one or more market segments.
  • Industrial stage: more than 100,000 virtual machines. Only at this stage can a service provider be considered to have a firm foothold and be able to run the public cloud as an industry. How big it can eventually become depends partly on the overall environment in China and partly on the company's own strategy.

By this staging, most domestic public cloud startups are still in the concept stage, and at most one startup has entered the prototype stage. Alibaba Cloud cannot be treated as a startup, but if we only count its ECS business, it is probably in the early part of the growth stage. My personal estimate is that five years from now the public cloud may hold 3% to 5% of the nation's computing resources. This means the market can accommodate one large and one small public cloud provider that reach the industrial stage, plus two or three providers in the growth or maturity stage cultivating specific market segments.

This is why I keep emphasizing that cloud computing is a blue ocean that has only just come into view. The domestic public cloud companies are currently fighting each other to the death, and it may already look like a sea of blood. In my view, this is all an illusion. If a public cloud startup lacks this big picture, I have only one suggestion: admitting defeat and cutting your losses is a virtue.

2. Public Cloud Products

As a public cloud service provider, your product portfolio will inevitably be diverse. But for the public cloud to succeed, it cannot be a dispensable supplement to the private data center; it must be capable of completely replacing it. This means the public cloud has to meet the requirements of high-performance computing, and ordinary users must be able to build large-scale, scalable applications entirely from the modules offered by the cloud service provider, with no intervention from the provider. Migrating 12306's query function to Alibaba Cloud barely counts as a case study; the problem is that the migration required deep involvement from Alibaba Cloud's own engineers, so it is not a good one.

Given this diversity, here we discuss the characteristics of public cloud products using only block storage, load balancing, and auto scaling as entry points.

Disk IO metrics for block storage are a hot topic among practitioners. Most of the discussion focuses on what level of IOPS or throughput a cloud server's disks should reach, but that focus is entirely misplaced. For a public cloud provider, what matters is not the average IO metrics a cloud server can reach, but how to allocate overall IO capacity according to customer needs. For a low-traffic corporate home page that needs 10 IOPS, offering 100 IOPS is unnecessary. For an enterprise application that needs 1,000 IOPS, offering 100 IOPS is far from enough. Applying the "on-demand acquisition, usage-based billing" idea of cloud services, IO capability needs to become a commodity that can be acquired on demand and billed by usage. For users who need high capacity but low performance, you can sell storage space; for users who need low capacity but high performance, you can sell IOPS. For example, AWS offers three kinds of EBS volumes: magnetic EBS volumes provide on average 100 IOPS regardless of volume size; GP2 SSD EBS volumes guarantee 3 IOPS per GB while allowing bursts of up to 3,000 IOPS; and Provisioned IOPS SSD EBS volumes guarantee the IOPS figure specified by the user when the volume is created. With such a design, users can purchase the disk space or the IOPS they actually need. Although such purchases are still limited by the provider's overall IO capacity, this is far better than giving every cloud server a similar "average performance metric". Obviously, designing products like these requires the cloud provider to have very fine-grained control over its computing resources.

Load balancing is similar. For a typical web application, load can usually be divided into three levels: long-term average load, long-term peak load, and short-term burst load. When there are only a few hundred requests per second, a load balancer capable of handling ten thousand requests per second is unnecessary. When there are tens of thousands of requests per second, a load balancer that can only handle ten thousand requests per second is far from enough. If users buy load balancing for peak load, resource utilization will be low; if they buy for average load, service quality degrades at peak times; if they manually switch load balancers based on actual load, they will never dare to use the public cloud again. Therefore, load balancing should also be designed around "on-demand acquisition, usage-based billing": it automatically scales down when load decreases and scales up when load increases. This property is auto scaling.

Applying the concept of auto scaling to a fleet of cloud servers gives you AWS's Auto Scaling Group (ASG). An ASG contains a group of cloud servers with identical functionality. When application load drops, the ASG automatically terminates surplus servers to save cost; when load rises, the ASG automatically launches more servers to handle the pressure. Users purchase computing resources according to the actual load on their system, so there is neither a shortage of processing capacity nor a waste of computing resources.

The examples above are all technologies AWS implemented early in its development, and their core idea is "on-demand acquisition, usage-based billing". More importantly, through concepts like auto scaling, customers do not spend money unnecessarily while their load requirements are still met. A while ago I wrote an entry-level tutorial titled "Building a scalable web application from ground zero", which roughly reflects the computing resource requirements of a mid-sized web application. Those of you building public clouds might want to check how similar requirements could be implemented on your own platforms. AWS may not be the ultimate model for the public cloud, but it is at least a relatively advanced one, and its products are highly instructive for peers. A startup in the public cloud space that does not know and understand AWS's products is, frankly, working with its eyes closed.

Some people may say that AWS's products are good, but domestic users do not accept them. This comes down to whether a startup wants to serve today's market or tomorrow's. If you serve today's market, you must cater to current demand and design your products according to customer requirements. If you serve tomorrow's market, you must innovate technically and guide customers to design their applications according to your way of thinking. In recent years, the acceptance of the ideas advocated by AWS in the domestic market (especially among Internet companies) has been steadily increasing. Comparing the international public cloud providers, the current situation is that AWS dominates, and the combined capacity of the rest (including Rackspace, Windows Azure, Google Compute Engine, and HP Cloud) is close to an order of magnitude smaller than AWS. The underlying reason is that, for various reasons, the others did not embrace the "on-demand acquisition, usage-based billing" philosophy advocated by AWS, and instead built public clouds with a traditional data center mindset. Against this background, a relatively safe path for a domestic startup in its early days (say, the concept stage) is to become familiar with AWS's products, imitate them, and try to innovate on top of them.

3. The Growth of the Public Cloud

The growth of a public cloud involves two questions: user growth and financial return.

On the user growth side, Alibaba Cloud currently has two approaches: migrating its existing users (HiChina hosting customers and Tmall merchants) onto the cloud, and developing government customers. Both kinds of customers are characterized by modest load requirements (Tmall's overall load is very high, but most individual merchants' loads are not) and no obvious need for "on-demand acquisition, usage-based billing". In other words, they basically use the public cloud as a replacement for traditional server hosting. Given Alibaba Cloud's current situation, serving these two groups well is only a matter of time. In terms of scale, doing so should carry Alibaba Cloud from the maturity stage into the industrial stage. The problem is that serving these two groups only gives Alibaba Cloud the appearance of a public cloud, not its essence. This is similar to the problems Rackspace ran into while transitioning to the public cloud. Rackspace was founded in 1998 and started with server rental, adding on average about 10,000 servers per year. Influenced by AWS, Rackspace began building a public cloud in 2008, but its approach has always been to replace physical servers with virtual machines rather than to design public cloud products around "on-demand acquisition, usage-based billing". A careful look at Rackspace's financial reports from 2006 to 2014 shows that total revenue and server count grew roughly linearly together. In other words, Rackspace was merely offering a substitute for physical servers, and the public cloud part did not significantly change its business. Another question worth exploring is whether, in the "China" context, an AWS-style "on-demand acquisition, usage-based billing" public cloud is really needed; or rather, how much weight this requirement carries among all requirements. From my personal observation, the "on-demand, pay-per-use" philosophy still needs further promotion even within the domestic Internet industry, and its acceptance in other industries is clearly lower. Driven by policy, government procurement of computing resources will tilt heavily toward the public cloud over the next three to five years, but these customers only care whether the vendor's name contains the word "cloud"; what is behind that word is not even a consideration. More than once I have heard peers doing IT in government departments say that their leadership required the project to use Alibaba Cloud, with no requirement whatsoever as to what to use it for. So every time a friend asks me whether Alibaba Cloud is worth joining, I say its prospects are bright, and that they should certainly go if they can.

According to Dr. Wang's earlier vision, Alibaba Cloud was also supposed to serve the Alibaba Group itself. During Dr. Wang's tenure at the helm of Alibaba Cloud, people inside Alibaba treated this as a joke, loathing it privately and resisting it openly. (For the story of Dr. Wang, see a short piece I wrote two years ago, 《从王博士说起》.) Now that Zhang Wensong and others have become the backbone of Alibaba Cloud, this joke has a chance of becoming reality. How large that chance is depends on Alibaba Cloud's development over the next two to three years. Once Alibaba Cloud is capable of serving the Alibaba Group, serving other Internet companies will be trivial. At that point, Alibaba Cloud may well become the undisputed leader of the domestic public cloud market. In May 2012, in a talk at the 4th China Cloud Computing Conference, I said that "Alibaba Cloud's technology is good, but its cloud computing product design is still rather amateur", which caused quite a controversy among practitioners at the time. Three years later, in a side-by-side comparison within the industry, Alibaba Cloud's basic products still lag considerably behind those of some startups. This gap does not come from a difference in technology but from a difference in understanding; in other words, it is not that Alibaba Cloud's engineers are not technically capable, but that Alibaba Cloud still does not design its products from a public cloud perspective.

Compared with Alibaba Cloud, startups are basically "three withouts": no existing user base, no government connections, and no established brand. A startup's user growth typically relies on the founders' personal reputation in phase one, technical evangelism in phase two, and targeted sales in phase three. Consequently, a startup's users generally fall into two categories: users in a specific vertical, and other startups. For this reason, startups are better positioned to educate their early users according to their own development philosophy, guiding them to design applications around the startup's own approach and product roadmap. Such investments may seem useless in the early days, but when customers' businesses grow and the public cloud does not become their load or performance bottleneck, those customers turn into long-term customers and success stories. When Netflix moved wholesale to AWS in 2009, nearly everyone in the industry was waiting to see it fail; today Netflix is the largest application running on a public cloud, and also AWS's most persuasive technical evangelist. The public cloud helps customers cope with load fluctuations so they can focus on their own business. Customer success naturally leads to increased spending, and the demonstration effect brings in more customers. Accumulated over time, this forms a virtuous cycle. From the perspective of resource provisioning, offering "on-demand acquisition, usage-based billing" requires the cloud provider to reserve part of its computing resources for customers' burst demand. Only at a certain scale can a provider accurately forecast customer demand and keep idle resources below a financially acceptable proportion. In other words, customer success precedes public cloud success, and scale precedes public cloud profitability.

A couple of days ago I read a recent article by Chen Shake, 《一个做了15年运帷的老兵对公有云的深度剖析》, which opens by asking whether the public cloud startups of 2014 are profitable. The problem is that the public cloud market is not a short-term market; it is a market with ample room to grow over the next decade. At present, China's public cloud market is still in its early days and should focus on product development and customer education. If a public cloud startup is profitable at the concept stage, that profitability is very likely unsustainable. Here I also want to clear up a widely circulated story, namely that "because its e-commerce business had a large amount of idle computing resources, Amazon thought of monetizing those idle resources through retail and built its public cloud services on top of them". The story sounds plausible but is a complete fabrication. The reason for clarifying this is to point out that AWS, too, had to deal with customer education, market cultivation, demand forecasting, and so on early in its development. After nearly 10 years of effort, AWS has largely solved these problems and achieved a dominant position in the international public cloud market. Lacking historical data, we cannot know in which year AWS became profitable, but judging from the exponential growth curve of the S3 business, AWS is unlikely to have reached profitability by the end of its fourth year (2010).

When it comes to financial return, we have to talk about the public cloud's billing model and pricing strategy. In 《从微观经济学看云计算发展》 I analyzed the supply and demand of the enterprise computing resource market from a microeconomics perspective. The analysis shows that, compared with the traditional server sales and server rental businesses, what the public cloud changes is not merely the business model of computing resources, but the supply-demand relationship of the computing resource market. For server sales and server rental, customer demand is inelastic: customers generally purchase computing resources according to their business plans and are not sensitive to price fluctuations. For the public cloud, customer demand is elastic: customers are relatively sensitive to price fluctuations and tend to increase consumption when prices fall. Comparing AWS and Rackspace, only AWS exhibits this property; Rackspace's cloud business does not. I therefore use whether customer demand is inelastic or elastic as the criterion that distinguishes virtual machine rental from an "on-demand acquisition, usage-based billing" public cloud. If your customers' demand is inelastic, you are merely running a virtual machine rental business with a traditional data center mindset; if your customers' demand is elastic, you are running an "on-demand, pay-per-use" public cloud business. In terms of business growth, a traditional data center grows roughly linearly, while an "on-demand, pay-per-use" public cloud grows exponentially.

The emergence of an economic phenomenon is inseparable from the behavior of its participants. In other words, just because elastic demand has been observed at AWS does not mean elastic demand will necessarily appear in China. Rackspace and HP Cloud probably understand this deeply, because so far they have not observed elastic demand. In China, if startups build public clouds with a traditional data center mindset, the only outcome is homogeneous products in a red-ocean market. Conversely, if they innovate around "on-demand acquisition, usage-based billing", it may be harder at first, but only by persisting do they stand a chance of entering the blue ocean of the public cloud.

To outsiders, Alibaba Cloud has money, talent, strategy, and tactics; it is the public's archetype of a deep-pocketed player, and its only deficiency is the lack of product managers with a deep understanding of cloud computing. Relying on the Alibaba brand and HiChina's sales capability, Alibaba Cloud currently has the largest scale in China. But from the Internet industry's point of view, its user experience is poor. Many people assume that because Alibaba's technology is good, Alibaba Cloud must be reassuring to use. The problem is that Alibaba is not Alibaba Cloud, just as Google is not Google Compute Engine and Microsoft is not Windows Azure. Within the Internet industry, engineers hold QingCloud and UCloud in higher regard. Although both are still in the concept stage, their products and operations are fairly consistent with my understanding of the public cloud. Of the two, QingCloud appears more aggressive and shows signs of coming from behind to take the lead. UnitedStack is widely known for fully embracing OpenStack, but it is still wavering between being a private cloud solution provider and a public cloud service provider. Private cloud and public cloud are both fine businesses, but pursued in depth they are two entirely different directions. Startups need focus, so UnitedStack needs to decide between these two roles sooner rather than later. If it decides to go in the public cloud direction, I suggest taking some time to look at the world outside OpenStack. (A side note: both Rackspace and HP use OpenStack for their public clouds, and both are in a rather awkward position. Domestic startups using OpenStack for public cloud might want to think about what is still missing when you build a public cloud on OpenStack. My personal intuition is that OpenStack is fine as a foundation, but the foundation alone is certainly not enough.)

Further Reading

从王博士说起

Data Source on the Economics of Computing Resource Market

从微观经济学看云计算的发展

Building a scalable web application from ground zero

 

The full text of this article was published in the electronic edition of issue 2015-11B of Programmer (《程序员》) magazine. Download the full text

Using JConsole to Monitor Hadoop Processes

By , April 20, 2015, 5:32 PM

Many friends ask how to use JConsole to look at what Hadoop is doing in real time. This is actually quite easy. Assuming that the IP address of your master node is 192.168.10.1, all you need to do is add the following configuration to etc/hadoop/hadoop-env.sh:

export JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false  -Djava.rmi.server.hostname=192.168.10.1 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port"

Assuming that you would like to monitor the NameNode and the DataNode, modify HADOOP_NAMENODE_OPTS and HADOOP_DATANODE_OPTS with JMX_OPTS and the desired port number. In the example below, we use port 8006 for the NameNode and port 8007 for the DataNode.

export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS $JMX_OPTS=8006"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS $JMX_OPTS=8007"

Then start HDFS using start-dfs.sh as normal, and you will be able to use JConsole to connect remotely to the JVMs running the NameNode and the DataNode on ports 8006 and 8007, respectively. It should be noted that if you are running Hadoop on AWS EC2 there is a catch: if you are running JConsole from outside of AWS EC2, then -Djava.rmi.server.hostname needs to be the Elastic IP (EIP) of the EC2 instance. If you use the private IP, you will only be able to connect from within your VPC.
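
If you would rather query the JMX endpoint programmatically instead of with JConsole, the standard javax.management.remote API works against the same port. The following is a minimal sketch, assuming the NameNode JMX port 8006 configured above; the class name JmxProbe is just for illustration:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe
{
	public static void main(String[] args) throws Exception
	{
		// Connect to the NameNode JMX endpoint configured in hadoop-env.sh (port 8006)
		JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://192.168.10.1:8006/jmxrmi");
		JMXConnector connector = JMXConnectorFactory.connect(url);
		MBeanServerConnection mbsc = connector.getMBeanServerConnection();

		// Read the heap usage of the remote NameNode JVM
		MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
			mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
		System.out.println("NameNode heap used: " + memory.getHeapMemoryUsage().getUsed() + " bytes");

		connector.close();
	}
}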

If you want to use JConsole to monitor a particular Hadoop application such as WordCount, you can then temporarily modify etc/hadoop/hadoop-env.sh with the following configuration:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true $JMX_OPTS=8010"

Then you run your Hadoop application as normal, and you will be able to use JConsole to connect remotely to the JVM running your Hadoop application on port 8010. Note that you should not make this configuration persistent. The reason is that this HADOOP_OPTS will be reused every time you run a Hadoop command, and the new Hadoop process will try to listen on the same port. (Also, HADOOP_NAMENODE_OPTS and HADOOP_DATANODE_OPTS mentioned above are extensions of HADOOP_OPTS. If we set a persistent JMX_OPTS in HADOOP_OPTS, then HADOOP_NAMENODE_OPTS and HADOOP_DATANODE_OPTS will try to use the same JMX_OPTS too.)

CY15-Q1 Community Analysis — OpenStack vs OpenNebula vs Eucalyptus vs CloudStack

By , April 4, 2015, 12:19 PM

This article is an updated version of my previous article CY14-Q1 Community Analysis — OpenStack vs OpenNebula vs Eucalyptus vs CloudStack. Readers who are interested in further discussion can contact me via email at the above-mentioned address.

This community analysis project was initiated in CY11-Q4, and this is the 11th quarterly report published since then (I skipped CY14-Q2, CY14-Q3 and CY14-Q4 for personal reasons). Traditionally I also publish a Chinese version of the report along with the English version. Unfortunately I don't have the capacity to do a Chinese translation for this particular report. (Sorry to my friends back in China.)

It should be noted that the opinion presented in this report belongs strictly to the author rather than any current or previous employer of the author.

The objective of this quarterly report is to compare the OpenStack, OpenNebula, Eucalyptus and CloudStack user and developer communities, based on the communications between community members in the form of mailing list or public forum discussions. The data discussed include the total number of topics (threads), messages (posts), and participants (unique email addresses or registered members). To obtain this data, a Java program was written to retrieve all the forum posts and mailing list messages into a MySQL database for further processing. The analysis results are presented in the form of graphs generated with LibreOffice.

During the past several years, some of the early forums and mailing lists became EOL’ed and were no longer accessible. The MySQL database that was built at the beginning of this project, as well as the previous versions of this quarterly report, make it possible to carry out analysis since the beginning of each project.

For projects with multiple membership systems (such as a forum and a mailing list), extensive efforts were carried out to eliminate membership double counting (counting one person twice or more in the statistics).

There have been many significant changes in the open source IaaS community since my CY14-Q1 report. Eucalyptus was acquired by HP, Sheng Liang left Citrix and started RancherOS, CloudScaling was acquired by EMC, MetaCloud was acquired by Cisco, and eNovance and Inktank were acquired by Red Hat. And, sadly, Nebula shut down just a few days ago. These events have changed, and will continue to change, the horizon of the open source IaaS community.

It should be noted that in January 2015 the OpenNebula community moved to https://forum.opennebula.org, and the original mailing lists became inactive. This new data source has not been added to this analysis. As a result, the data presented here does not represent the actual status of the OpenNebula community. I will add the new data source to my next report.

Figures 1 and 2 show the monthly number of topics (threads) and posts (messages). It can be seen that during the past 12 months, OpenStack-related discussions continued to exhibit strong (close to linear) growth, while CloudStack-related discussions declined at a rapid rate. The volume of discussion around OpenNebula and Eucalyptus was still very small, and both exhibited a tendency to decline.

Generally speaking, the number of replies to a specific topic represents the attention being received, and the depth of discussion for that particular topic. When the number of master posts (the original post that started a particular topic) is more than the number of replies, it is safe to conclude that the participation of the forum or mailing list is very low. Therefore, the ratio between “the number of posts” and “the number of topics” represents the participation rate of an online community. In this study we call this ratio the Participation Ratio.

As can be seen from Figure 3, during the past 12 months the participation ratios of CloudStack and Eucalyptus were relatively higher, close to 4; the participation ratios of OpenStack and OpenNebula were relatively lower, a little bit higher than 3. (OpenNebula exhibited a significant decline in this respect during the past 12 months. The reason is that the OpenNebula community is moving to a new forum, https://forum.opennebula.org/, which is not yet included in this report. Thank you Tim Bell for pointing this out.)

Figure 4 shows the active participants of the four projects being discussed. It can be seen that the number of active participants of OpenStack is much higher than that of the other three projects. The number of active participants of CloudStack is also significantly higher than OpenNebula and Eucalyptus. Looking at the breakdown figures, the number of active participants for OpenStack was growing steadily, while the numbers for CloudStack, Eucalyptus and OpenNebula decreased significantly.

To understand the development activities of these four open source IaaS projects, we carry out git log analysis to extract information about contributing developers and organizations, as well as the frequency of commit activity. We take advantage of the fact that all four projects use git as the SCM for their source code. Therefore, we use "git log --no-merges" to obtain log information from the git repositories. The extracted log information was dumped into a MySQL database for further analysis. It should be noted that for the OpenStack project, the data source includes all the sub-projects under openstack (137 sub-projects) and openstack-infra (114 sub-projects) on github.com.

It should be pointed out that git is a distributed version control system. With git, developers work with their own local repositories. When a developer executes a commit operation, the code changes are made to the local repository, and will not be reflected in the master repository until such commits are pushed to and merged with the master repository. It is common practice for developers to accumulate many commits before they feel comfortable making a push. Therefore, some recent commits might not be counted in this analysis. Based on our observations, there is an underestimation of about 50% in the number of commits for the previous month, and about 20% for the month before that.
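
To give an idea of what this git log processing looks like, here is a minimal, self-contained Java sketch. It is not the actual GitAnalysis code linked at the end of this post; the JDBC connection string and the commits table schema are assumptions made for illustration:

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class GitLogDump
{
	public static void main(String[] args) throws Exception
	{
		// args[0] is the path to a local clone, args[1] is the project name (e.g. "nova")
		Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/analysis", "username", "password");
		PreparedStatement ps = conn.prepareStatement(
			"INSERT INTO commits (project, author_email, commit_date) VALUES (?, ?, ?)");

		// Run "git log --no-merges" with a machine-friendly format: author email | commit date
		ProcessBuilder pb = new ProcessBuilder("git", "log", "--no-merges", "--pretty=format:%ae|%ad", "--date=short");
		pb.directory(new File(args[0]));
		Process proc = pb.start();

		BufferedReader reader = new BufferedReader(new InputStreamReader(proc.getInputStream()));
		String line;
		while ((line = reader.readLine()) != null)
		{
			String[] parts = line.split("\\|");
			if (parts.length < 2) continue;
			ps.setString(1, args[1]);	// project name
			ps.setString(2, parts[0]);	// author email
			ps.setString(3, parts[1]);	// commit date (YYYY-MM-DD)
			ps.executeUpdate();
		}
		conn.close();
	}
}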

Figure 10 shows the monthly number of commit operations for these four projects. Generally speaking, the commit frequency of OpenStack is much higher than that of the other three projects. This is because the data source for OpenStack includes a total of 251 sub-projects, far more than the other three projects. The commit frequency of CloudStack is slightly higher than Eucalyptus and OpenNebula.

Figure 11 shows the monthly number of commit operations for the 7 major sub-projects of OpenStack (Cinder, Glance, Horizon, Keystone, Nova, Neutron, Swift). Generally speaking, the commit frequency of the Nova sub-project is about 2 to 3 times as high as that of the other sub-projects. It should be noted that although the commit frequencies of these sub-projects are different, they exhibit similar time-series curves, and their highs and lows occur in the same periods. This indicates that although these sub-projects are relatively independent, they work around the same development plan and the same release schedule, which is an indicator that the OpenStack project is well organized in terms of sub-project management.

Figure 12 shows the monthly number of contributors (identified by unique email addresses) for these projects. Generally speaking, the number of OpenStack contributors is much higher than that of the other three projects. Looking at the breakdown figures, the number of active contributors for OpenStack was growing steadily, while the numbers for CloudStack and Eucalyptus decreased significantly. For OpenNebula, the number of active contributors seemed quite stable, but the size of the whole developer community was relatively small.

Figure 13 shows the monthly number of contributors (identified by unique github.com accounts) for the 7 major sub-projects of OpenStack (Cinder, Glance, Horizon, Keystone, Nova, Neutron, Swift). During the past 12 months, the number of active contributors for Nova was decreasing, while the numbers for Neutron, Horizon, and Cinder were increasing. There was not much change observed in Glance, Keystone, and Swift.

People usually try to identify the institute to which a contributor belongs by his/her email address. It is true that this method is flawed by nature (different institutes have different policies regarding contributions to open source projects, and some even encourage their employees to contribute with their personal accounts), but this parameter can still be used to show the contributions of certain institutes to certain open source projects. Figure 14 shows the monthly number of unique institutes (identified by the domain name of the contributor's email address) contributing to these projects. We can see that the number of contributing institutes for OpenStack is much larger than for the other three projects, and is growing rapidly. During the same period, the number of contributing institutes for CloudStack, Eucalyptus, and OpenNebula did not exhibit any growth. For CloudStack, the number of active contributing organizations seemed to be decreasing.

Figure 15 shows the monthly number of institutes contributing to the 7 major sub-projects of OpenStack (Cinder, Glance, Horizon, Keystone, Nova, Neutron, Swift). During the past 12 months, the number of active contributing organizations for Nova was decreasing, while the numbers for Neutron, Horizon, and Cinder were increasing. There was not much change observed in Glance, Keystone, and Swift.

The following table lists the organizations that made the most contributions to these projects during CY15-Q1, according to the number of commit operations, along with the percentage of their commit operations. It can be seen that both Eucalyptus and OpenNebula are open source projects dominated by a single institute, while CloudStack and OpenStack receive contributions from multiple institutes. For the CloudStack project, the influence from Citrix has faded: in CY15-Q1, only 5.8% of the contributions came from citrix.com, compared with 44% in CY14-Q1 (combined contribution from citrix.com and cloud.com). For the OpenStack project, redhat.com contributed 7.3% of the commits, while ibm.com contributed 5.0%, followed by mirantis.com (4.7%), hp.com (4.6%), rackspace.com (1.6%), intel.com (1.4%), yahoo-inc.com (1.2%), doughellmann.com (1.1%), and cisco.com (0.8%).

[Table: top contributing organizations and their share of commits, CY15-Q1]

The following table lists the organizations that made the most contributions to the major sub-projects of OpenStack during CY15-Q1, according to the number of commit operations, along with the percentage of their commit operations.

[Tables: top contributing organizations for the major OpenStack sub-projects, CY15-Q1]

Accumulated Developer Population refers to the total number of developers who have contributed code to a particular project (as reflected in git commits). Figure 16 shows the growth of the accumulated developer populations of these 4 projects. Currently OpenStack has the largest accumulated developer population, about 10 times larger than the distant number two, CloudStack.

Accumulated Contributing Organizations refers to the total number of organizations (as reflected in unique domain names associated with developer email addresses) that have contributed code to a particular project (as reflected in git commits). Figure 17 shows the growth of the accumulated contributing organizations of these 4 projects. Currently OpenStack has the largest number of contributing organizations, 5 times larger than CloudStack. Eucalyptus and OpenNebula have only a very small number of contributing organizations.

For your convenience, a PDF version of this presentation can be downloaded from here. Please kindly keep the author information if you want to redistribute the content.

Further Information

The Java program being used to dump git logs into MySQL database is now available on github:

https://github.com/qyjohn/GitAnalysis

Safe Harbor Statement

Qingye Jiang (John) is Senior Member of IEEE. He is currently a full-time graduate student (Master of Philosophy) in the School of Information Technologies at the University of Sydney. His research interests include parallel and distributed computing, high performance computing, open source community, as well as the impact of technology advancements on human society. This report is part of his on-going research on the growth of open source communities (started in 2011).

Qingye Jiang (John) is at the same time a full-time employee of Amazon Web Services (AWS). However, this report is not part of his duties with AWS. The opinions presented in this report strictly belong to the author himself, and do not reflect the opinions of his employer.

If you want to quote this report, please refer to the author as “Qingye Jiang (John) from the University of Sydney”.

Acknowledgements

The author would like to thank the following people:

– Young Choon Lee (Lecturer, Macquarie University), Joseph Davis (Professor, University of Sydney), and Albert Y. Zomaya (Professor, University of Sydney), for their guidance and insightful discussions.

– Randy Bias (VP of Technology, EMC), for reminding me to come up with an updated version of this community analysis.

fdisk and lsblk report different disk size

By , February 5, 2015, 7:00 AM

Create an i2.4xlarge instance, SSH into the instance and run “sudo fdisk -l” and then “lsblk”. Compare the outputs:

[ec2-user@ip-172-31-5-22 ~]$ sudo fdisk -l
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/xvda: 8589 MB, 8589934592 bytes, 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt


#         Start          End    Size  Type            Name
 1         4096     16777182      8G  Linux filesyste Linux
128         2048         4095      1M  BIOS boot parti BIOS Boot Partition

Disk /dev/xvdb: 800.2 GB, 800165027840 bytes, 1562822320 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/xvdc: 800.2 GB, 800165027840 bytes, 1562822320 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/xvdd: 800.2 GB, 800165027840 bytes, 1562822320 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/xvde: 800.2 GB, 800165027840 bytes, 1562822320 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

[ec2-user@ip-172-31-5-22 ~]$ df -m
Filesystem     1M-blocks  Used Available Use% Mounted on
/dev/xvda1          7934  1062      6774  14% /
devtmpfs           61475     1     61475   1% /dev
tmpfs              61483     0     61483   0% /dev/shm

As we can see, the size reported by fdisk is calculated using the following equation:

size = sectors x bytes per sector

1562822320 sectors x 512 bytes per sector = 800,165,027,840 bytes = 800.2 GB

[ec2-user@ip-172-31-5-22 ~]$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
xvda    202:0    0     8G  0 disk 
└─xvda1 202:1    0     8G  0 part /
xvdb    202:16   0 745.2G  0 disk 
xvdc    202:32   0 745.2G  0 disk 
xvdd    202:48   0 745.2G  0 disk 
xvde    202:64   0 745.2G  0 disk 

With lsblk, the equation is:

size = bytes / 1024 / 1024 / 1024

800,165,027,840 bytes / 1024 / 1024 / 1024 = 745.2 GiB (displayed by lsblk as 745.2G)

In other words, fdisk reports the size in decimal gigabytes (1 GB = 10^9 bytes), while lsblk reports it in binary gibibytes (1 GiB = 2^30 bytes). Both tools describe the same underlying disk; they simply use different units.
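
To double-check the conversion, here is a small self-contained Java snippet (the class name DiskSize is just for illustration) that prints the same byte count in both unit conventions:

public class DiskSize
{
	public static void main(String[] args)
	{
		long bytes = 800165027840L;
		// fdisk-style decimal gigabytes (1 GB = 10^9 bytes)
		System.out.printf("%.1f GB%n", bytes / 1e9);
		// lsblk-style binary gibibytes (1 GiB = 2^30 bytes)
		System.out.printf("%.1f GiB%n", bytes / (1024.0 * 1024 * 1024));
	}
}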

Building a scalable web application from ground zero

By , February 4, 2015, 5:48 AM

This is an entry level tutorial for engineers who wish to build a scalable web application but do not know how to get started. The demo application takes advantage of various services offered by AWS, such as Elastic Compute Cloud (EC2), Elastic Load Balancing (ELB), Relational Database Service (RDS), ElastiCache, Simple Storage Service (S3), and Identity and Access Management (IAM), as well as CloudWatch and Auto Scaling. By using AWS we are making things easier, so that an engineer with a little Linux experience can easily finish this tutorial in a couple of hours. If you are building a scalable web application on your own infrastructure or on another public cloud, the theory should be the same but the actual implementation might be somewhat different.

Please note that  the code provided here is for demo only and is not intended for production usage.

[Executive Summary]

I want to build a scalable web application that can serve a large number of users. I really don't know how many users I will get, so the technical specification is "as many as possible". As shown in the above design, this is a photo sharing website with the following functions:

(1) User authentication (login / logout).

(2) Logged-in users can upload photos.

(3) The default page displays the latest N uploaded photos.

In this tutorial, we will accomplish this goal through the following five levels:

(0) A basic version with all the components deployed on one single server. The web application is developed in PHP, using Apache as the web server, with MySQL as the database to store user upload information.

(1) Based on the basic version we developed in level (0), scale the application to two or more servers.

(2) Use S3 to provide a shared storage for user uploads.

(3) Dynamically scale the size of your server fleet according to the actual traffic to your web application.

(4) Implement a cache layer between the web server and the database server.

[LEVEL 0]

In this level, we will build a basic version with all the components deployed on one single server. The web application is developed in PHP, using Apache as the web server, with MySQL as the database to store user upload information. You do not need to write any code, because I have a set of demo code prepared for you. You just need to launch an EC2 instance, carry out some basic configuration, then deploy the demo code.

Log in to your AWS Console and navigate to the EC2 Console. Launch an EC2 instance with a Ubuntu 14.04 AMI. Make sure that you allocate a public IP to your EC2 instance. In your security group settings, open port 80 for HTTP and port 22 for SSH access. After the instance becomes "running" and passes health checks, SSH into your EC2 instance to set up software dependencies and download the demo code from GitHub to the default web server folder:

$ sudo apt-get update
$ sudo apt-get install apache2 php5 mysql-server php5-mysql php5-curl git
$ cd /var
$ sudo chown -R ubuntu:ubuntu www
$ cd /var/www/html
$ git clone https://github.com/qyjohn/web-demo

Then we create a MySQL database and a MySQL user for our demo. Here we use “web_demo” as the database name, and “username” as the MySQL user.

$ mysql -u root -p
mysql> CREATE DATABASE web_demo;
mysql> CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON web_demo.* TO 'username'@'localhost';
mysql> quit

In the code you clone from Github, we have prepopulated some demo data as examples. We use the following command to import the demo data in web_demo.sql to the web_demo database:

$ cd /var/www/html/web-demo
$ mysql -u username -p web_demo < web_demo.sql

The LEVEL 0 demo code is implemented in two PHP files, index.php and config.php. Before we can make it work, there are some minor modifications needed:

(1) Use a text editor to open config.php, then change the username and password for your MySQL installation.

(2) Change the ownership of folder "uploads" to "www-data" so that Apache can upload files to this folder.

$ cd /var/www/html/web-demo
$ sudo chown -R www-data:www-data uploads

In your browser, browse to http://<ip-address>/web-demo/index.php. You should see that our application is now working. You can log in with your name, then upload some photos for testing. (You might have noticed that this demo application does not ask you for a password. This is because we would like to keep things as simple as possible. Handling user passwords properly is a fairly complicated issue, which is beyond the scope of this entry level tutorial.) Then I suggest that you spend 10 minutes reading through the demo code in index.php. The demo code has reasonable documentation in the form of comments, so I am not going to explain the code here.

[LEVEL 1]

In this level, we will expand the basic version we have in LEVEL 0 and deploy it to multiple servers. We already have a working web server, so we launch another EC2 instance and follow the steps in LEVEL 0 to get a second web server (you can skip mysql-server and the database setup, because we don't need multiple MySQL servers). Also, we know that these two web servers must write upload data to the same database server, so we launch an RDS instance and use it as a shared database server. In front of the two web servers, we create an ELB to distribute the workload to the two web servers.

(1) Launch a second web server as described in LEVEL 0.

(2) Launch an RDS instance running MySQL. When launching the RDS instance, create a default database named "web_demo". When the RDS instance becomes available, use the following command to import the demo data in web_demo.sql to the web_demo database on the RDS database:

$ mysql -h [endpoint-of-rds-instance] -u username -p web_demo < web_demo.sql

(3) On both web servers, modify config.php with the new database server hostname, username, password, and database name.

(4) Create an ELB and add the two web servers to the ELB. Since we do have Apache running on both web servers, you might want to use HTTP as the ping protocol with 80 as the ping port and "/" as the ping path for the health check parameter for your ELB.

(5) In your browser, browse to http://elb-endpoint/web-demo/index.php. As you can see, our demo seems to be working on multiple servers. This is so easy!

But there are issues. Sometimes your browser asks you to log in, but sometimes not. Also, some newly uploaded images seem to be missing, then they come back from time to time while other images go missing!

You must have noticed that the IP address in the top left position in your browser changes from time to time. We use this trick to let you know which web server is processing your request. Although the two web servers are using the same set of code and the same RDS database, they do not share your session information. When you are being served by server A and log in on server A, you can upload an image. But server B is not aware of the fact that you have already logged in on server A, so when you are being served by server B, it will ask you to log in again. Also, when you upload an image to server A, it is only available on server A. If you are being served by server B, although the database has the information about that particular upload (because both web servers write to, and read from, the same database server), server B cannot give you that image because it is stored on server A.

With ELB, you can configure session stickiness so that the ELB always routes traffic from the same session to the same web server. However, for a large-scale application with a dynamic workload, web servers are added to, or removed from, the fleet according to workload requirements. In the worst case, a particular web server might be removed from the fleet due to a fault in the server itself. In such cases, existing sessions are routed to other web servers, which have no record of the login. Therefore, it is desirable to share session information among all web servers.

We use ElastiCache to solve the session-sharing issue between multiple web servers. In the ElastiCache console, launch an ElastiCache cluster running Memcached and obtain the endpoint information. On both web servers, install php5-memcached and configure php.ini to use memcached for session storage.

$ sudo apt-get install php5-memcached

Then edit /etc/php5/apache2/php.ini and make the following modifications:

session.save_handler = memcached
session.save_path = "[endpoint-to-the-elasticache-instance]:11211"

Then you need to restart Apache on both web servers to make the new configuration effective.

$ sudo service apache2 restart

Now go back to your browser and do some testing. As you can see, your session is now shared across the two web servers. You only need to log in once, and your login status remains the same regardless of which back-end web server serves the request.
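
If you want to see the shared session more directly, you can deploy a small test page like the following on both web servers (a hypothetical session_test.php, not part of the demo) and refresh it through the ELB; the counter should keep increasing even as the serving IP address changes:

<?php
// session_test.php -- hypothetical test page (not part of the demo) to verify
// that sessions are now stored in ElastiCache. Deploy it on both web servers
// and refresh it through the ELB: the counter keeps increasing no matter which
// back-end server handles the request.
session_start();

if (!isset($_SESSION['counter']))
{
        $_SESSION['counter'] = 0;
}
$_SESSION['counter']++;

echo 'Server:  ' . $_SERVER['SERVER_ADDR'] . "\n";
echo 'Counter: ' . $_SESSION['counter'] . "\n";
?>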

To solve the image upload issue, you need shared storage between your two web servers. One simple solution is to use one of the web servers as an NFS server, exporting the /var/www/html/web-demo folder to the subnet. The second web server acts as the NFS client, mounting the remote NFS share as /var/www/html/web-demo. As long as the permissions are set up properly, the two web servers can write to, and read from, the same folder.

On one of your web servers, use the following command to install NFS server:

$ sudo apt-get install nfs-kernel-server

Then edit /etc/exports to export the /var/www/html/web-demo folder (assume that 172.31.0.0/16 is the CIDR range of your subnet):

/var/www/html/web-demo       172.31.0.0/16(rw,fsid=0,insecure,no_subtree_check,async)

Then you need to restart the NFS server for the new export to take effect:

$ sudo service nfs-kernel-server restart

On the other web server, use the following commands to install the NFS client (assuming that 172.31.0.11 is the private IP address of your NFS server):

$ sudo apt-get install nfs-common
$ cd /var/www/html
$ sudo rm -Rf web-demo
$ sudo mkdir web-demo
$ sudo chown -R ubuntu:ubuntu web-demo
$ sudo mount 172.31.0.11:/var/www/html/web-demo web-demo

In your browser, again browse to http://<dns-name-of-elb>/web-demo/index.php. You should see that our application is now working on multiple web servers with a load balancer as the front end, without any code changes.
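
If you would like to confirm that the NFS share is doing its job, you can also place a tiny PHP page like the following (illustrative only, not part of the demo) on both web servers; after uploading a photo through either server, both copies of the page should list the same file names:

<?php
// Illustrative check (not part of the demo): list the shared uploads folder.
// With the NFS share mounted, both web servers should report identical contents.
$files = scandir('/var/www/html/web-demo/uploads');
foreach ($files as $file)
{
        if ($file != '.' && $file != '..')
        {
                echo $file . "\n";
        }
}
?>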

[LEVEL 2]

Using a shared file system is probably OK for web applications with reasonably limited traffic, but it becomes problematic when the traffic to your site increases. You can scale out your front end to as many web servers as you want, but your application is then limited by the capacity of the shared file system running on a single server. In this level, we resolve this issue by moving the shared storage from one web server to S3. This way, the web servers only handle the application logic, while the images are served by S3.

We first terminate the web server running the NFS client. Then, on the web server running the NFS server, we edit /etc/exports to disable (comment out) the NFS export and stop the nfs-kernel-server service. We no longer need a shared file system in this level.

$ sudo service nfs-kernel-server stop

Next, you will need to edit config.php and make some minor changes. In this configuration file, $s3_bucket is the name of the S3 bucket used for shared storage, and $s3_baseurl is the URL of the S3 endpoint in the region hosting your S3 bucket. You can find the S3 endpoints at http://docs.aws.amazon.com/general/latest/gr/rande.html. You can also identify the endpoint in the S3 Console by viewing the properties of any object in the S3 bucket.

$storage_option = "s3";
$s3_bucket  = "my_uploads_bucket";
$s3_baseurl = "https://s3-ap-southeast-2.amazonaws.com/";

Next we create an AMI from the running instance. Then we create an EC2 role in the IAM Console (IAM Console -> Role -> Create New Role -> Amazon EC2 -> Amazon S3 Full Access). Then we launch a new web server using the AMI and the IAM role we just created. After the instance becomes "running" and passes its health checks, remove the original web server from the ELB (you can terminate it now) and add the new web server to the ELB. In your browser, again browse to http://<dns-name-of-elb>/web-demo/index.php. The only web server behind the ELB is now this newly launched instance, but it still has your login information, and newly uploaded images go to S3 instead of the local disk. As the workload of your application increases, you can launch more EC2 instances using the AMI and IAM role we just created and add these new instances to the ELB. When the workload decreases, you can remove some instances from the ELB and terminate them to save cost.

The reason we use an IAM role in this tutorial is that with an IAM role you do not need to put your AWS Access Key and Secret Key in your code. Instead, your code assumes the role assigned to the EC2 instance and accesses the AWS resources that the instance is allowed to access. Today many people and organizations host their source code on github.com or other public repositories. By using IAM roles you no longer hard-code your AWS credentials in your application, eliminating the possibility of leaking them to the public.
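
To give you an idea of what this looks like in code, here is a minimal sketch (not the demo's actual upload code) of writing a file to S3 with the AWS SDK for PHP, assuming the SDK (version 2) has been installed via Composer; the bucket name, key, and file path are illustrative. Note that no access key or secret key appears anywhere, because the SDK automatically retrieves temporary credentials from the IAM role attached to the instance:

<?php
// Minimal sketch (not the demo's actual upload code), assuming the AWS SDK for
// PHP v2 is installed via Composer. No credentials are supplied here: the SDK
// picks up temporary credentials from the IAM role attached to the instance.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$client = S3Client::factory(array('region' => 'ap-southeast-2'));

$client->putObject(array(
        'Bucket'     => 'my_uploads_bucket',    // the bucket named in $s3_bucket
        'Key'        => 'uploads/example.jpg',  // illustrative object key
        'SourceFile' => '/tmp/example.jpg',     // illustrative local file
        'ACL'        => 'public-read'
));
?>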

[LEVEL 3]

Now you have a scalable web application with two web servers, and you know that you can scale the fleet in and out when needed. Is it possible to scale your fleet automatically, according to the workload of your application?

In our demo code, we have a setting to simulate workloads. If you look at config.php, you will find this piece of code:

$latency = 0;

And, in index.php, there is a corresponding statement:

sleep($latency);

That is, when a user requests index.php, PHP sleeps for $latency seconds. As we know, when the workload increases, the CPU utilization (along with other factors) increases, resulting in higher latency. By manually manipulating the $latency setting, we can simulate a heavy workload on the web application, which is reflected in the average latency. Note that you can set the latency simulation independently on each web server: with two web servers, if you set $latency = 0 on one server and $latency = 1 on the other, the average latency will be roughly 0.5 seconds.

With AWS, you can use AutoScaling to scale your server fleet dynamically, according to the workload of the fleet. In this tutorial, we use average latency as the trigger for scaling actions. You can achieve this with the following steps:

(1) In your EC2 Console, create a launch configuration using the AMI and the IAM Role that we created in LEVEL 2.

(2) Create an AutoScaling group using the launch configuration we created in step (1), and make sure that the AutoScaling group receives traffic from your ELB. Also, change the health check type from EC2 to ELB. (This way, when the ELB determines that an instance is unhealthy, the AutoScaling group will terminate it.) You don't need to specify any scaling policy at this point.

(3) Click on your ELB and create a new CloudWatch alarm (ELB -> Monitoring -> Create Alarm) that fires when the average latency is greater than 1000 ms for at least 1 minute.

(4) Click on your AutoScaling group and create a new scaling policy (AutoScaling -> Scaling Policies) using the CloudWatch alarm you just created. The scaling action can be "add 1 instance and then wait 300 seconds". This way, if the average latency of your web application exceeds 1 second, AutoScaling will add one more instance to your fleet.

You can test this by adjusting the $latency value on your existing web servers. Please note that to reach an average latency of 1000 ms, the average of the $latency settings on all your web servers needs to be greater than 1. For example, keep $latency = 0 on one of your web servers and set $latency = 3 on the other; the average latency will then be around 1500 ms, which triggers the CloudWatch alarm and hence the scaling policy.

When you are done with this step, you can experiment with scaling down by creating another CloudWatch alarm and a corresponding scaling policy. This alarm should fire when the average latency is smaller than 500 ms for at least 1 minute, and the scaling action can be "remove 1 instance and then wait 300 seconds".

[LEVEL 4]

For many web applications, the database can be a serious bottleneck. In our photo sharing demo, the number of view requests is usually much greater than the number of upload requests, and for many view requests the most recent N images are actually the same. However, we connect to the database to fetch the records for the most recent N images for each and every view request. It would be reasonable to refresh the images shown on the index page incrementally, for example every 1 or 2 minutes.

In this level, we add a cache layer between the web servers and the database. When we fetch the records for the most recent N images, we store them in the cache. When a new view request comes in, we no longer connect to the database but return the cached result to the user. When a new image is uploaded, we invalidate the cache so that the next request refreshes it. This way the cached version is always accurate.

The demo code supports database caching through ElastiCache, using the same ElastiCache instance we used for session sharing. This caching behavior is not enabled by default. To enable it, edit config.php on all web servers with the details of the cache server:

$enable_cache = true;
$cache_server  = "dns-name-of-your-elasticache-instance";

Refresh the demo application in your browser and you will see that the "Fetching N records from the database." message is gone, indicating that the information you are seeing comes from ElastiCache. When you upload a new image, you will see the message again, indicating that the cache is being refreshed.

The following code is responsible for handling the cache logic:

// Get the most recent N images
if ($enable_cache)
{
	// Attempt to get the cached records for the front page
	$mem = open_memcache_connection($cache_server);
	$images = $mem->get("front_page");
	if (!$images)
	{
		// If there is no such cached record, get it from the database
		$images = retrieve_recent_uploads($db, 10);
		// Then put the record into cache
		$mem->set("front_page", $images, time()+86400);
	}
}
else
{
	// This statement gets the most recent 10 records from the database
	$images = retrieve_recent_uploads($db, 10);
}
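
The open_memcache_connection() helper is defined elsewhere in the demo repository; presumably it simply wraps the php5-memcached client, roughly like this illustrative sketch:

<?php
// Illustrative sketch of what open_memcache_connection() might look like; the
// actual helper is defined in the demo repository.
function open_memcache_connection($cache_server)
{
        $mem = new Memcached();
        $mem->addServer($cache_server, 11211);  // default memcached port
        return $mem;
}
?>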

Also pay attention to the following code, which handles image uploads. We delete the cached record after the user uploads an image. This way, when the next request comes in, we fetch the latest records from the database and put them into the cache again.

	if ($enable_cache)
	{
		// Delete the cached record, the user will query the database to 
		// get an updated version
		$mem = open_memcache_connection($cache_server);
		$mem->delete("front_page");
	}

[Others]

In this tutorial, we built a scalable web application using various AWS services including EC2, RDS, S3, ELB, CloudWatch, AutoScaling, IAM, and ElastiCache. It demonstrates how easy it is to build a web application that scales reasonably well using the various AWS building blocks. It should be noted that the code used in this tutorial is for demo purposes only and should not be used in a production system.

The readers are encouraged to explore the following topics:

(1) How to troubleshoot issues when things do not work in this demo? For example, your application is unable to connect to the RDS instance, or to ElastiCache. What is needed to make things work?

(2) How to identify the bottleneck when there is a performance issue? How to enhance the performance of this demo application?

(3) How to make the deployment process easier?

(4) How to make this demo application more secure?

(5) Any other topic that you would like to explore.

Full Text Search: Sphinx on Ubuntu 14.04

By , December 5, 2014, 4:08 am

Below are some quick notes on how to set up a simple PHP-based website with full text search capability. The content to be searched is stored in a MySQL database, and the full text search engine is Sphinx.

Step 1 – Launch an EC2 instance on AWS, with Ubuntu 14.04 as the operating system.

Step 2 – SSH into the EC2 instance, install MySQL server, and set up a test database. In this demo, our test database has one table, “document”, with three columns: id, uuid, and content. Sphinx will create a search index based on the information stored in the “document” table.

$ sudo apt-get update
$ sudo apt-get install mysql-server
$ mysql -u root -p
Your MySQL connection id is 43
Server version: 5.5.40-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database test;
Query OK, 1 row affected (0.00 sec)

mysql> use test;
Database changed
mysql> create table document (id int, uuid varchar(80), content text);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into document (id, uuid, content) values (1, 1234567890, 'cloud computing');
Query OK, 1 row affected (0.00 sec)

mysql> insert into document (id, uuid, content) values (2, 1234567891, 'expensive cloud computing');
Query OK, 1 row affected (0.01 sec)

mysql> insert into document (id, uuid, content) values (3, 1234567892, 'green expensive cloud computing');
Query OK, 1 row affected (0.00 sec)

mysql> insert into document (id, uuid, content) values (4, 1234567893, 'scheduling algorithm green expensive cloud computing');
Query OK, 1 row affected (0.00 sec)

mysql> exit
Bye

Step 3 – Install Sphinx

$ sudo apt-get install sphinxsearch
$ cd /etc/sphinxsearch
$ sudo cp sphinx.conf.sample sphinx.conf

Edit /etc/sphinxsearch/sphinx.conf to reflect your MySQL username and password. Also, find the following SQL statement

       sql_query               = \
               SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
               FROM documents

and replace it with the following SQL statement (because in our test database, the “document” table only contains three columns: id, uuid, and content):

        sql_query               = \
                SELECT id, uuid, content FROM document

Also, comment out the following two lines in /etc/sphinxsearch/sphinx.conf, because we do not have these two columns (group_id, date_added) in our test database:

# sql_attr_uint         = group_id
# sql_attr_timestamp    = date_added

Then, edit /etc/default/sphinxsearch and change “START=no” to “START=yes”. Then build the index and start Sphinx:

$ sudo indexer --all 
$ sudo service sphinxsearch start

Step 4 – Install Apache and PHP

$ sudo apt-get install apache2 php5 php5-mysql
$ cd /var/www
$ sudo chown -R ubuntu:ubuntu html

Step 5 – A Quick Demo PHP Page

Create a PHP page /var/www/html/sphinx.php with the following content:

<?php
// Connect to the Sphinx searchd daemon through its SphinxQL (MySQL protocol)
// listener; this assumes the default SphinxQL listener on 127.0.0.1 port 9306.
$conn = new mysqli('127.0.0.1', '', '', '', 9306);
if ($conn->connect_error)
{
        throw new Exception('Connection Error: ['.$conn->connect_errno.'] '.$conn->connect_error, $conn->connect_errno);
}
 
$resource = $conn->query("select * from test1 where match('green computing')");
$results = array();
while ($row = $resource->fetch_assoc()) 
{
        $results[] = $row;
}
$resource->free_result();
 
var_dump($results);
?>
