Archive for the ‘Webtechnology’ Category

There’s a ton of documentation available if you want to do template handling in PHP. This article is only about documenting the simple approach I use myself. I’m not going to enter the arena by stating PHP is a template language itself etc …, that’s just plain boring.

So what’s the intention here? The objective is to have a plain HTML file and replace content at certain places where you want PHP-driven output to show. But in and of itself the HTML file is just that: plain HTML with inclusion of CSS and JS where necessary.

My standard HTML file is shown below:

<!DOCTYPE html>
<html>
<head>
	<title>{title}</title>
	<meta http-equiv="content-type" content="text/html; charset=utf-8" />
	<meta http-equiv="content-language" content="{language}" />
	<meta name="author" content="M.E. Post" />
	<meta name="copyright" content="Copyright (c) M.E. Post 2008" />
	<link rel="stylesheet" href="{includepath}/css/include.css" type="text/css" media="screen" />
	<script type="text/javascript">
		var path = '{includepath}';
	</script>
	<script type="text/javascript" src="{includepath}/js/jquery-1.3.2.min.js"></script>
	<script type="text/javascript" src="{includepath}/js/include.js"></script>
</head>

<body>
	<div id="rap">
	  <div id="headwrap">
		  <div id="header">
			  <a href="{path}/">{title}</a>
		  </div>
		  <div id="desc">
			  <a href="{path}/">{subtitle}</a>
		  </div>
	  </div>
	  <div id="content">
		  <div class="storycontent">
		    {replace_content}
		  </div>
	  </div>
	</div>
</body>
</html>

As you can see, it’s a very minimal file, and there are some elements in there like {includepath} and {replace_content} which are not regular HTML. These are the placeholders where content will be replaced.

Replacing the content is done by the function below. It receives the content through the variable $content; if $content is empty, it returns FALSE and aborts. After that it checks whether the template has already been loaded by inspecting the static variable $template: if it’s empty the template file is loaded, otherwise the previously loaded template is reused. All the template placeholders are replaced in a loop using mb_ereg_replace to keep the text Unicode compliant. The merged template is returned as the output of the function. Items like PATH et al. are constants that are defined beforehand; you can take them out or add them to the function call if you want.

/**
* Merge the page template with the content
*
* @param string $content
* @return string|false The merged page, or FALSE if $content is empty
*/
function mergeContentWithTemplate($content='') {
	if (empty($content)) {
		return FALSE;
	}
	/* Static keyword is used to ensure the file is loaded only once */
	static $template = NULL;
	/* If no instance of $template has occurred load the template file */
	if (is_null($template)) {
		$template_file = dirname(__FILE__) . '/../html/template.html';
		$template = file_get_contents($template_file);
	}
	/* Work on a copy so the cached template keeps its placeholders */
	$template_file_content = $template;
	mb_regex_encoding('utf-8');
	$pattern = array('{path}', '{includepath}', '{language}', '{title}', '{subtitle}', '{replace_content}');
	$replacement = array(PATH, INCLUDE_PATH, LANGUAGE, TITLE, SUBTITLE, $content);
	$pattern_size = sizeof($pattern);
	for ($i = 0; $i < $pattern_size; $i++) {
		$template_file_content = mb_ereg_replace($pattern[$i], $replacement[$i], $template_file_content);
	}
	return $template_file_content;
}

So that’s my simple little template thingy, hope it is of some use to you.

I keep forgetting this so for my own feeble memory here is the correct invocation to rsync between two EC2 instances:

rsync -avz --port=22 root@<privateDNS name remote server>:/var/www/html/<directory>/ -e "ssh -i /home/<user>/<pem file>" /var/www/html/<local directory>/

Works like a charm and much faster than using your home workstation as an intermediary to copy stuff between server instances.

Now that’s a big claim but I can assure you it’s true for all three aspects. It doesn’t even require heavy customisation, and the approach is based on standard plugins available on the WordPress plugin site. However, like everything, there’s a trade-off with the approach, and in this case it’s the loss of flexibility and dynamic behaviour. This isn’t an issue with static websites, but if you’re running a blog then this solution isn’t for you (stuff like comments won’t work as this requires connectivity and feedback from WordPress). It’s up to you to decide whether my approach has merits for your use case. I offer no guarantees other than that I have applied the approach below to my own systems and for me it works. It’s very rough around the edges: I have been hacking some files and I haven’t rolled my changes into a nice shrink-wrapped form. Enough with the disclaimers, let’s get going with an actual explanation of what I’m offering.

Security

WordPress suffers from the same problem that almost all Content Management Systems (CMS) suffer from: it has a unified code base for both content publication and content management. With WordPress (and similar systems) that share the same code base it is possible to hack the content management system through the content publication system. The content publication system is the aspect of the CMS that generates the pages when a visitor hits the site. By its very nature it is an open interface to the outside world and can therefore be hacked. Because it shares code with the CMS, it is inevitable that the CMS can also be compromised in an attack on the content publication system. These hacks occur time and again and are endemic to the shared code approach, so they will never go away.

The only way of ensuring your CMS is not hacked through your content publication system is by separating the two. Separation in a physical (code) sense is possible but requires a huge amount of effort and in effect means a different version of WordPress through a fork. This is not what I want to achieve: I have limited time and I can’t maintain my own version of WordPress and keep up with all the new functionality that the WordPress team cranks out all the time. Therefore I mean separation in a logical sense, and this I achieve through the use of WP Super Cache.

WP Super Cache turns your WordPress site/blog into a collection of static pages and it uses a .htaccess mod_rewrite approach to serve customers the static pages. It also has an option to serve page components like JS, CSS and images from a Content Delivery Network (CDN). My approach to separating the CMS from content publication is that I turn the WP Super Cache cache (pardon the pun) into its own virtual host in Apache and serve content in its static form from that virtual host. My visitors don’t need to access the WordPress installation to get to the content, so the CMS and the content publication are logically separated. There are a couple of tricks required for getting this up and running and I’ll explain these later in this post.

Speed

The approach of moving your page components into a CDN is well known and relatively straightforward to achieve with solutions like WP Super Cache or W3 Total Cache. Going one step further and moving your entire site, including your HTML, is a little less usual but that is what I have achieved. My test site (not this one), based on the standard twentyten theme, now loads in 1.223 seconds, of which 0.252 seconds is spent on the DNS lookups. The HTML and all other page components are served through Amazon Cloudfront using Origin Pull (but any other CDN can do the same, there is no Cloudfront specific trickery involved).

How it works

There are a couple of code changes involved and some Apache and DNS configuration changes. What you need:

  • LAMP platform and WordPress. I used the most recent version of WordPress (3.1.2) at the time of writing. Hosting is done on Amazon EC2 with a CentOS 5.6 based system
  • WP Super Cache plugin installed
  • A CDN, I used Amazon Cloudfront
  • Access to DNS for setting CNAME records

I’m assuming you have a functioning LAMP server. The following steps need to be executed:

  • Create a virtual host in Apache for the WordPress site
  • Install WordPress and WP Super Cache plugin
  • Configure the WP Super Cache plugin
  • Code hacks to the WP Super Cache plugin
  • Set up your CDN
  • Configure your DNS
  • Test

We’re going to put the WordPress site in a directory called “wordpress” located in /var/www/html (CentOS/Fedora default) and create a special virtual host called cms.example.com:

<VirtualHost *:80>
ServerName cms.example.com
ServerAdmin admin@example.com
DocumentRoot /var/www/html/wordpress
LogLevel info
ErrorLog logs/error_log
TransferLog logs/access_log
</VirtualHost>

Install WordPress in the /var/www/html/wordpress directory and configure it with the cms.example.com home/site URL. Check that the installation completed successfully and that you can access the admin interface at http://cms.example.com/wp-admin/. Install the WP Super Cache plugin as explained by the documentation.

Configure the WP Super Cache plugin as follows:

  • Advanced settings:
    • Cache hits to this website for quick access
    • Use PHP to serve cache files
    • 304 Not Modified browser caching. Indicate when a page has not been modified since last requested
    • Cache rebuild. Serve a supercache file to anonymous users while a new file is being generated
  • CDN settings:
    • Enable CDN Support
    • Off-site URL: http://cdn.example.com (where example.com is your own domain)
  • Preload settings:
    • Preload mode (garbage collection only on legacy cache files)

Create a new directory in your webroot, e.g. “cache”:

mkdir /var/www/html/cache

Set this up as a new virtual host in Apache, let’s call this new site cache.example.com:

<VirtualHost *:80>
ServerName cache.example.com
ServerAdmin admin@example.com
DocumentRoot /var/www/html/cache/supercache/cms.example.com
ErrorLog logs/error_log
TransferLog logs/access_log
</VirtualHost>

Restart Apache to get the new virtual hosts activated. Copy the wp-content/themes/[theme-name] folder over to your cache directory (/var/www/html/cache/supercache/cms.example.com), but only the CSS, JS and image files. You don’t need to copy the PHP files as only the web page resources are required. The same applies to the wp-includes directory if your theme uses JavaScript files from its js subdirectory. Check if the pages come up OK when you access http://cache.example.com. If they do you’re fine; if not, troubleshoot what the issue is, e.g. by looking at the Apache logs/error_log file.

After this we need to do some small code wrangling. It’s going to be ugly but small, and we need the absolute path of the directory that we just created. Navigate to the plugin directory of your WordPress installation and enter the wp-super-cache directory. Open file “wp-cache-phase1.php” and at the top of the file, just after the include( WPCACHEHOME . 'wp-cache-base.php'); instruction, add the $cache_path line:

include( WPCACHEHOME . 'wp-cache-base.php');
$cache_path = "/var/www/html/cache/";

Save the file and open file “wp-cache-phase2.php”. Near the top of the file, add the same line:

$cache_path = "/var/www/html/cache/";

In the same file look for function wp_cache_get_ob(&$buffer) and in this function look for this sequence (around line 504):

 } else {
                $buffer = apply_filters( 'wpsupercache_buffer', $buffer );
                // Append WP Super Cache or Live page comment tag
                wp_cache_append_tag($buffer);

After this sequence add:

$buffer = str_replace("http://cms.example.com", "http://www.example.com", $buffer);

The reason for this is that WP Super Cache will generate pages based on its own site/home URL (cms.example.com) and we need to replace this URL with the actual site URL (www.example.com). Hence the clumsy find and replace whilst the pages are generated by the Preload section of the WP Super Cache plugin. I’m sure it can be done nicer but I’m just proving a concept, not winning prizes for clean code.

Set up your CDN so that it has two Distribution Points / Pull Zones or whatever your CDN provider calls them. One should be listening to www.example.com and have cache.example.com as its origin server, and the other should be listening to cdn.example.com and also have cache.example.com as its origin server. Note the CNAME records the CDN generates for you; let’s assume the following:

  • xyz.cloudfront.net –> www.example.com
  • abc.cloudfront.net –> cdn.example.com

Go to your DNS setup and set up the following changes:

  • Have the www subdomain (I’m assuming you already have this set up otherwise create a www CNAME record) refer to xyz.cloudfront.net
  • Create a CNAME record for cdn.example.com and have this point at abc.cloudfront.net

Apply the DNS changes and wait for the changes to propagate. If you can do a successful dig on www.example.com and cdn.example.com and you get to see something like this you should be ok:

www.example.com.         3044   IN CNAME  xyz.cloudfront.net.
xyz.cloudfront.net.      60     IN CNAME  xyz.ams1.cloudfront.net.
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.28
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.54
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.64
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.115
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.207
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.216
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.220
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.254

Access your site at http://www.example.com/ and see if it’s working. If so, start doing your performance tests and do some investigations with HTTP analysis tooling like HTTP Fox.

After you’ve established everything works fine you can make cms.example.com accessible only to yourself or your content editors. There is no real-time dependency on WordPress anymore and the installation can be used purely for content management activities.

Imagine the complete global population connected to the Internet. Imagine billions of people using web and mobile applications, your web and mobile applications. That’s a staggering amount of traffic. Now visualize today’s “classic” infrastructure setup with a DMZ consisting of (two brands of) firewalls (two for the security conscious), load balancers and proxies, and put this infrastructure between your customers and your web/mobile applications. It’s like drinking water from a fire hydrant. That “classic” infrastructure will have a very difficult time keeping up. Sure, there are solutions like wire-speed firewalls, but the fact of the matter remains that any piece of infrastructure that you put between your application and your customers will have to cope with the load and therefore needs to be scaled up or out, adding costs in the process.

So why not skip them? Don’t use separate physical firewalls, load balancers or proxies. Integrate those functions with the web application hosting platforms. Put your application in a multitude of data centers, set up your hosts and hook up the big Internet pipes. If you’re in a Public Cloud you probably don’t have any firewalls, load balancers or proxies that you control anyway, so it’s better to get used to this mode of thinking.

Let’s see if it is feasible to abandon network firewalls, load balancers and proxies; implementation in this case is not left as an exercise to the reader. I’ll be using open source and open standard solutions in my examples, so any time I’m not specifically referring to a technology assume I mean stuff like Linux, BSD etc.

Remove the network firewall

There’s no shortage of platform-based firewalls; it’s how firewalls started out in the early 90s before they became dedicated appliances. If you have a whole farm of servers serving the same application it is relatively straightforward to distribute firewall configuration files across a multitude of machines. In a web farm scenario the access ports are set initially and rarely change; ideally you only allow access on TCP ports 80 and 443 (HTTP/HTTPS). Any other traffic tends to be of a more administrative nature and will be routed over different NICs with a different firewall ruleset.

The concept of bringing the firewall back to the end host rather than at the network perimeter is known as a distributed firewall [Bellovin]. The important aspect of a distributed firewall is that the management of policy is still centralized, but the enforcement of the policy is distributed (to the end hosts). Bellovin lists three components to implement a distributed firewall:

  • Policy language: A language that states what sort of connections are permitted and prohibited (filtering rules)
  • System management: A management tool that changes and enforces the security policy
  • Safe distribution: A security mechanism that safely distributes the security policy

Implementation

This can be implemented in many ways, but the easiest choice would be to use netfilter and its filter rules as the policy language, manage the filter rules as a text file and use rsync over SSH to securely distribute the policy rules. The traffic between master and slave hosts will be minimal due to the nature of rsync (only sending the changed parts) and the fact that changes will hardly ever be necessary, as you’re only allowing traffic over TCP ports 80 and 443 (HTTP/HTTPS). An alternative to rsync is a message-based approach with guaranteed delivery, something like AMQP.

Remove the load balancers

A load balancer distributes workloads evenly across two or more hosts. Positioning this on the host level will not work, as one host will quickly be overwhelmed before it can offload to other hosts (in essence becoming the same choke point as the load balancer), so this function needs to sit outside of the hosts serving your application. The function can’t be positioned on the hosts or in front of the hosts, so the only remaining place for it is the client. The client needs to be able to load balance requests across several hosts. This requires that the client is in some form or shape aware of the hosts. A naive implementation could be based on providing the client with a list of hosts (for instance in the form of a JSON message) and letting it pick a host at random, in round-robin fashion, or deterministically (with a CARP-like algorithm). However, this becomes unwieldy very quickly, especially when you start thinking in hundreds or thousands of servers, and it doesn’t offer a way to guide host selection (for example when taking hosts out of service for maintenance).
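The naive client-side selection described above could look roughly like the sketch below. It is only an illustration under assumptions of my own: the host names, the function names (pickRandomHost, pickDeterministicHost) and the JSON shape are made up, and the deterministic variant uses a simple "highest hash wins" scheme in the spirit of CARP rather than the actual CARP algorithm.

```javascript
// Hypothetical host list, as it might arrive on the client in a JSON message.
const hosts = JSON.parse('["app1.example.com", "app2.example.com", "app3.example.com"]');

// Random selection: every request goes to an arbitrary host.
// The random source is injectable so the behaviour can be checked.
function pickRandomHost(hostList, random = Math.random) {
  if (hostList.length === 0) {
    throw new Error('no hosts available');
  }
  return hostList[Math.floor(random() * hostList.length)];
}

// Small non-cryptographic string hash (FNV-1a style, 32 bit, unsigned).
function hashString(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Deterministic selection: score every host against the request key and
// take the highest score, so the same key always maps to the same host.
function pickDeterministicHost(hostList, key) {
  if (hostList.length === 0) {
    throw new Error('no hosts available');
  }
  let best = hostList[0];
  let bestScore = -1;
  for (const host of hostList) {
    const score = hashString(host + '|' + key);
    if (score > bestScore) {
      best = host;
      bestScore = score;
    }
  }
  return best;
}
```

A call such as pickDeterministicHost(hosts, '/cart') returns the same host every time for the same list and key, which keeps related requests together; pickRandomHost(hosts) simply spreads requests. Neither variant knows about hosts that are down for maintenance, which is exactly the limitation noted above.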

A similar problem exists when determining the association between host names and IP addresses, and this has been elegantly solved with a distributed computing solution: the Domain Name System. DNS is a distributed database with a standardized protocol. A similar approach can be devised for our situation, where we need to find a suitable host for our client. Unfortunately JavaScript can’t execute DNS queries by itself, and invoking a server-side component defeats the purpose of this exercise, so we need to come up with something similar but just a bit different. We need a client that can query a DNS-like system that returns a list of usable hosts in a format that client-side JavaScript can process.

Implementation

From an implementation perspective this can be achieved by having the ability to query the DNS system from JavaScript. This means that the DNS server needs to support an HTTP(S) interface and return information in a format that JavaScript can interpret, for instance JSON messages. We need a DNS server with a REST/JSON interface. Such interfaces are already available, like REST-DNS or JSON DNS, or can be created quite easily by yourself (use an existing DNS server implementation and add HTTP(S)/JSON capabilities). The JavaScript logic on the client will contain a number of root servers (comparable to DNS) that may be queried. After selecting a root server the JavaScript logic can subsequently query for the service it is looking for. The root server looks up which hosts can service the request and responds by offering the best matching hosts in the form of a JSON message (= service discovery). The client can then select a host and request the service. Hosts can be taken in and out of service by managing the host entries in the root servers.

A difference between this implementation and regular DNS is that there is no technical limitation on the number of published root servers. With a regular DNS setup the client can usually only configure two or three name servers. The JavaScript implementation doesn’t have this limitation. At the time of writing there is no standard for a JSON-based DNS query, but it would be relatively straightforward to take the current DNS protocol and reflect that in JSON. It would only have to cover lookups; zone transfers can still be based on the normal DNS protocol.
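To make the lookup flow a bit more concrete, here is a minimal sketch of what the client-side discovery logic might look like. Everything specific in it is an assumption for illustration: the root server URLs, the /lookup?service= endpoint and the response shape { "service": ..., "hosts": [...] } are hypothetical, not part of any existing standard or of the interfaces named above. The transport is passed in as a function so the logic can be exercised without a network.

```javascript
// Hypothetical root servers; in the design described above they would be
// part of the JavaScript delivered with the first HTML page.
const ROOT_SERVERS = [
  'https://root1.example.com',
  'https://root2.example.com',
  'https://root3.example.com',
];

// Default transport: the browser's fetch API. Assumed response format
// (not a standard): { "service": "shop", "hosts": ["app1.example.com"] }
async function fetchJson(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error('lookup failed with status ' + response.status);
  }
  return response.json();
}

// Ask the root servers, one after another, which hosts offer the service.
// A root server that fails or returns no hosts is skipped.
async function discoverHosts(service, rootServers = ROOT_SERVERS, getJson = fetchJson) {
  for (const root of rootServers) {
    try {
      const url = root + '/lookup?service=' + encodeURIComponent(service);
      const answer = await getJson(url);
      if (answer && Array.isArray(answer.hosts) && answer.hosts.length > 0) {
        return answer.hosts;
      }
    } catch (error) {
      // This root server is unreachable or misbehaving: try the next one.
    }
  }
  throw new Error('no root server could resolve service ' + service);
}
```

This sketch tries the root servers in their listed order to keep the behaviour predictable; a real client would probably shuffle the list first so that the root servers themselves share the load. Once discoverHosts('shop') resolves, the client picks one of the returned hosts and sends its actual request there.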

Now there’s one little problem left, how do we get to the first HTML page containing the JavaScript initialization code without using load balancing to distribute these initial requests? The JavaScript is embedded in the first HTML page that the client receives when accessing the web application. This first web page contains all the (JavaScript) logic to get going. It is a static resource and it can be hosted on a Content Delivery Network (CDN). The CDN itself can be accessed through DNS-based request routing, making it resilient and scalable. By using a CDN it is not necessary to have a load balancing capability for servicing the initial static web page containing the JavaScript logic. You can decide to use a readily available CDN service for this or roll your own if you are the size of Google or Facebook. If you decide to roll your own CDN pay attention to your client proximity issues.

Remove the proxies

Proxies are versatile constructions and it’s wise to clarify what types of proxies exist (before removing them):

  • Forward proxy
  • Reverse proxy:
    1. Caching proxy
    2. Load balancing proxy
    3. SSL offloading proxy
    4. Security proxy (authentication/filtering)

The forward proxy sits within the client environment and will not be impacted by the architecture proposed in this article, so we’ll keep it out of scope. Reverse proxies are used in the host environment and will be examined next.

  • The caching proxy is used to capture dynamically generated resources and turn them into (temporary) static resources through caching. This saves on host compute resources as the same page doesn’t need to be generated with each and every request.
  • The load balancing proxy basically does the same thing as a load balancer, i.e. distribute load over two or more hosts only more specifically for the HTTP protocol, sometimes using advanced features like url and content rewriting to change location and content on the fly.
  • The SSL offloading proxy handles all SSL traffic in front of the web server and thereby offloads all SSL traffic from the webserver (SSL can be quite compute intensive).
  • The Security proxy can carry out authentication (identifying the user) or security filtering (checking requests for anomalies like SQL injection, XSS) before allowing traffic to the web server.

All these Reverse Proxy functionalities have alternative implementations that are host based and can therefore be distributed horizontally across all hosts.

Implementation

  • Caching proxy: By using a CDN for static resources and using application and database caching techniques for dynamically generated resources the need for a caching proxy can be removed. Semi-dynamic resources (e.g. generated at specific intervals) can be automatically uploaded to the CDN.
  • Load balancing proxy: the functionality of load balancers has been resolved in the “Remove the load balancers” section.
  • SSL offloading proxy: Without an SSL offloading proxy this function has to be carried out on the web server itself. However the host can benefit significantly from hardware SSL accelerators.
  • Security proxy: authentication can be done at the application or web server level. Security filtering can be done through host modules like the Apache mod_security module.

Conclusion

All mentioned components (firewalls, load balancers and proxies) can be completely avoided with a well-thought-out architecture. This avoids significant upfront costs, improves scalability by orders of magnitude and reduces management complexity.

All these elements, and a couple more, lead to an architecture that can process trillions of interactions per day because it is completely distributed and horizontally scalable. It is not constrained by infrastructure components requiring large upfront investments like load balancers, firewalls and proxies. I call this concept the LARG architecture, short for “Linked Architecture for Resource Groups” and it will be the topic of a following article.

After I’ve been working on code for a while I become a bit blind to the nice things I’ve accomplished and tend to focus on what I’m not happy with. Let’s make this posting about something simple I’m happy with and which looks very nice: declaring and verifying constants. I’m a big fan of constants (not so much of magic constants but that’s a different story) and I use them frequently in my code. One thing that’s always important is to check whether you’ve actually already set the constant, otherwise you get a warning/error dependent on the strictness setting of your error reporting. So here’s a nice way to set and verify whether you’ve actually set the constant already:

defined('LANGUAGE') or define('LANGUAGE', 'en-us');

If that ain’t a thing of beauty I don’t know what is :-)

Connecting to my Amazon EC2 image (from which this site is running) from Mac OS X took ages to figure out and turned out to be relatively simple with the correct information (isn’t that always the case). At first I didn’t think the built-in Mac OS X ssh could cut it, so I started looking into various Mac OS X ssh clients (Fugu, RBrowser, CyberDuck etc.) but none of those could handle the Amazon public/private key encryption. Then I started looking into using Putty on Mac OS X, even though that’s not available for Mac OS X (but with a little help from MacPorts). That bombed on problems with GTK1. Dang, what to do?

Continue Reading

For the website of my wife’s company, www.exportmanagement.nu, I needed a simple approach to direct traffic to the proper pages based on the language preference setting of the visiting browser. It’s a very simple approach: any browser with Dutch as its language setting will be directed to the main site and any other language will be directed to a smaller, English-language website. Luckily the Swiss Army chainsaw named mod_rewrite came to the rescue and the following little code fragment will do just that (placed in a .htaccess file).

Continue Reading

Media files in Django are served through the web server and they can be served with a different url than the Django content itself. By spreading requests across multiple urls you can speed up your site because the browser will execute requests in parallel. The rule of thumb seems to be a maximum of 2-3 hostnames otherwise the added DNS requests negate the speed up effect.

Continue Reading

I finally succumbed to ease of use and switched from my bespoke PivotLog installation to WordPress. I thoroughly enjoyed Pivot but when switching from Textdrive to Amazon EC2 I had to change and migrate so many things that I settled for the easier solution; WordPress.

Continue Reading

As explained in one of the first posts on this blog, this site is basically just one big Atom feed that gets transformed into this blog by using a bit of Apache content negotiation and client-side XSLT. Besides some issues with browsers ignoring client-side XSLT in a feed and forcing their own rendition of my feed, which was fixed by inserting 512 bytes of crud to throw off the feed sniffing, this approach has worked fine for the last four years.

Continue Reading