Now that’s a big claim but I can assure you it’s true for all three aspects. It doesn’t even require heavy customisation, and the approach is based on standard plugins available on the WordPress plugin site. However, like everything, there’s a trade-off with the approach, and in this case it’s the loss of flexibility and dynamic behaviour. This isn’t an issue with static websites, but if you’re running a blog then this solution isn’t for you (stuff like comments won’t work as this requires connectivity and feedback from WordPress). It’s up to you to decide whether my approach has merits for your use case. I offer no guarantees other than that I have applied the approach below to my own systems and for me it works. It’s very rough around the edges: I have been hacking some files and I haven’t rolled my changes into a nice shrink-wrapped form. Enough with the disclaimers, let’s get going with an actual explanation of what I’m offering.

Security

WordPress suffers from the same problem that almost all Content Management Systems (CMS) suffer from: it has a unified code base for both content publication and content management. With WordPress (and similar systems) that share the same code base, it is possible to hack the content management system through the content publication system. The content publication system is the aspect of the CMS that generates the pages when a visitor hits the site. By its very nature it is an open interface to the outside world and can therefore be hacked. Because it shares code with the content management side, an attack on the content publication system can also compromise the CMS. These hacks occur time and again and are endemic to the shared code approach, so they will never go away.

The only way of ensuring your CMS is not hacked through your content publication system is by separating the two. Separation in a physical (code) sense is possible but requires a huge amount of effort and in effect means a different version of WordPress through a fork. This is not what I want to achieve: I have limited time and I can’t maintain my own version of WordPress and keep up with all the new functionality that the WordPress team cranks out all the time. Therefore I mean separation in a logical sense, and this I achieve through the use of WP Super Cache. WP Super Cache turns your WordPress site/blog into a collection of static pages and it uses a .htaccess mod_rewrite approach to serve customers the static pages. It also has an option to serve page components like JS, CSS and images from a Content Delivery Network (CDN).

My approach to separating the CMS from content publication is that I turn the WP Super Cache cache (pardon the pun) into its own virtual host in Apache and serve content in its static form from that virtual host. My visitors don’t need to access the WordPress installation to get to the content; the CMS and the content publication are logically separated. There are a couple of tricks required for getting this up and running and I’ll explain these later in this post.

Speed

The approach of moving your page components into a CDN is well known and relatively straightforward to achieve with solutions like WP Super Cache or W3 Total Cache. Going one step further and moving your entire site, including your HTML, is a little less usual, but that is what I have achieved. My test site (not this one), based on the standard twentyten theme, now loads in 1.223 seconds, of which 0.252 seconds is spent on the DNS lookups. The HTML and all other page components are served through Amazon Cloudfront using Origin Pull (but any other CDN can do the same, there is no Cloudfront specific trickery involved).

How it works

There are a couple of code changes involved and some Apache and DNS configuration changes. What you need:

  • LAMP platform and WordPress. I used the most recent version of WordPress (3.1.2) at the time of writing. Hosting is done on Amazon EC2 with a CentOS 5.6 based system
  • WP Super Cache plugin installed
  • A CDN, I used Amazon Cloudfront
  • Access to DNS for setting CNAME records

I’m assuming you have a functioning LAMP server. The following steps need to be executed:

  • Create a virtual host in Apache for the WordPress site
  • Install WordPress and WP Super Cache plugin
  • Configure the WP Super Cache plugin
  • Code hacks to the WP Super Cache plugin
  • Set up your CDN
  • Configure your DNS
  • Test

We’re going to put the WordPress site in a directory called “wordpress” located in /var/www/html (CentOS/Fedora default) and create a special virtual host called cms.example.com:

<VirtualHost *:80>
ServerName cms.example.com
ServerAdmin admin@example.com
DocumentRoot /var/www/html/wordpress
LogLevel info
ErrorLog logs/error_log
TransferLog logs/access_log
</VirtualHost>

Install WordPress in the /var/www/html/wordpress directory and configure it with the cms.example.com home/site URL. Check that the installation completed successfully and that you can access the admin interface at http://cms.example.com/wp-admin/. Install the WP Super Cache plugin as explained by the documentation.

Configure the WP Super Cache plugin as follows:

  • Advanced settings:
    • Cache hits to this website for quick access
    • Use PHP to serve cache files
    • 304 Not Modified browser caching. Indicate when a page has not been modified since last requested
    • Cache rebuild. Serve a supercache file to anonymous users while a new file is being generated
  • CDN settings:
    • Enable CDN Support
    • Off-site URL: http://cdn.example.com (where example.com is your own domain)
  • Preload settings:
    • Preload mode (garbage collection only on legacy cache files)

Create a new directory in your webroot, e.g. “cache”:

mkdir /var/www/html/cache

Set this up as a new virtual host in Apache, let’s call this new site cache.example.com:

<VirtualHost *:80>
ServerName cache.example.com
ServerAdmin admin@example.com
DocumentRoot /var/www/html/cache/supercache/cms.example.com
ErrorLog logs/error_log
TransferLog logs/access_log
</VirtualHost>

Restart Apache to get the new virtual hosts activated. Copy the wp-content/themes/[theme-name] folder over to your cache directory (/var/www/html/cache/supercache/cms.example.com), but only the CSS, JS and image files. You don’t need to copy the PHP files as only the web page resources are required. The same applies to the wp-includes directory if your theme uses JavaScript files from its js subdirectory. Check that the pages come up OK when you access http://cache.example.com. If they do you’re fine; if not, troubleshoot the issue, e.g. by looking at the Apache logs/error_log file.
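The selective copy described above can be scripted. What follows is only a rough sketch in Python, not something I have rolled into the setup itself. The list of file extensions and the function name copy_static_assets are my own assumptions, and it assumes the relative paths have to stay the same so that the URLs in the cached pages still resolve.

```python
import shutil
from pathlib import Path

# File types treated as static page resources (an assumption; extend as needed).
STATIC_SUFFIXES = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico"}


def copy_static_assets(src_root: Path, dst_root: Path) -> list:
    """Copy only static files from src_root to dst_root, keeping relative paths."""
    copied = []
    for src in sorted(src_root.rglob("*")):
        # Skip directories and anything that is not a static resource (e.g. .php).
        if src.is_file() and src.suffix.lower() in STATIC_SUFFIXES:
            dst = dst_root / src.relative_to(src_root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied
```

For the twentyten theme you would call it with the theme directory under /var/www/html/wordpress/wp-content/themes/ as the source and the matching wp-content/themes/ path below /var/www/html/cache/supercache/cms.example.com as the destination.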

After this we need to do some small code wrangling. It’s going to be ugly but small, and we need the absolute path of the directory that we just created. Navigate to the plugin directory of your WordPress installation and enter the wp-super-cache directory. Open the file “wp-cache-phase1.php” and at the top of the file, just after the include( WPCACHEHOME . 'wp-cache-base.php'); instruction, add the $cache_path line:

include( WPCACHEHOME . 'wp-cache-base.php');
$cache_path = "/var/www/html/cache/";

Save the file and open the file “wp-cache-phase2.php”. Near the top of the file, add the same assignment:

$cache_path = "/var/www/html/cache/";

In the same file look for the function wp_cache_get_ob(&$buffer) and in this function look for this sequence (around line 504):

 } else {
                $buffer = apply_filters( 'wpsupercache_buffer', $buffer );
                // Append WP Super Cache or Live page comment tag
                wp_cache_append_tag($buffer);

After this sequence add:

$buffer = str_replace("http://cms.example.com", "http://www.example.com", $buffer);

The reason for this is that WP Super Cache will generate pages based on its own site/home URL (cms.example.com) and we need to replace this URL with the actual site URL (www.example.com). Hence the clumsy find and replace whilst the pages are generated by the Preload section of the WP Super Cache plugin. I’m sure it can be done nicer but I’m just proving a concept, not winning prizes for clean code.
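To make clear what the find and replace does to a page, here is the same idea as a small Python sketch (the actual hack stays the PHP str_replace line shown earlier). The function name rewrite_origin is made up for illustration. The second replace, for JSON-escaped URLs in inline scripts, is not part of my hack; it is an assumption about what a theme or plugin might output.

```python
def rewrite_origin(buffer: str, cms_url: str, public_url: str) -> str:
    """Replace the CMS base URL with the public one in a generated page."""
    # Plain occurrences, e.g. href="http://cms.example.com/about/".
    buffer = buffer.replace(cms_url, public_url)
    # JSON-escaped occurrences, e.g. "http:\/\/cms.example.com" in inline scripts.
    # (Assumption: not handled by the original one-line PHP replace.)
    escaped_cms = cms_url.replace("/", "\\/")
    escaped_public = public_url.replace("/", "\\/")
    return buffer.replace(escaped_cms, escaped_public)
```

Called as rewrite_origin(page, "http://cms.example.com", "http://www.example.com"), it returns the page with every link pointing at the public host, so a visitor never gets sent to the CMS host by a stray link.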

Set up your CDN so that it has two Distribution Points / Pull Zones or whatever your CDN provider calls them. One should be listening to www.example.com and have cache.example.com as its origin server, and the other should be listening to cdn.example.com and also have cache.example.com as its origin server. Note the CNAME records the CDN generates for you; let’s assume the following:

  • xyz.cloudfront.net –> www.example.com
  • abc.cloudfront.net –> cdn.example.com

Go to your DNS setup and set up the following changes:

  • Have the www subdomain (I’m assuming you already have this set up otherwise create a www CNAME record) refer to xyz.cloudfront.net
  • Create a CNAME record for cdn.example.com and have this point at abc.cloudfront.net

Apply the DNS changes and wait for the changes to propagate. If you can do a successful dig on www.example.com and cdn.example.com and you get to see something like this you should be ok:

www.example.com.         3044   IN CNAME  xyz.cloudfront.net.
xyz.cloudfront.net.      60     IN CNAME  xyz.ams1.cloudfront.net.
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.28
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.54
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.64
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.115
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.207
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.216
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.220
xyz.ams1.cloudfront.net. 60     IN A      216.137.59.254
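If you want to check the result automatically rather than by eye, the answer lines can be parsed. This is a minimal Python sketch that assumes the five-column layout shown above (owner, TTL, class, type, value). The function name cname_chain is hypothetical.

```python
def cname_chain(dig_answer: str, name: str) -> list:
    """Follow CNAME records in dig answer lines, starting at `name`."""
    cnames = {}
    for line in dig_answer.splitlines():
        fields = line.split()
        # Answer lines look like: <owner> <ttl> IN <type> <value>
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "CNAME":
            cnames[fields[0]] = fields[4]
    chain = [name]
    # Stop when there is no further CNAME, or when a loop would be formed.
    while chain[-1] in cnames and cnames[chain[-1]] not in chain:
        chain.append(cnames[chain[-1]])
    return chain
```

Fed the output above and the name www.example.com. (with the trailing dot, as dig prints it), the chain should pass through the CNAME your CDN gave you. If the chain only contains the name you started with, the DNS change has not propagated yet.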

Access your site at http://www.example.com/ and see if it’s working. If so, start doing your performance tests and do some investigations with HTTP analysis tooling like HTTP Fox.

After you’ve established that everything works fine you can make cms.example.com accessible only to yourself or your content editors. There is no real-time dependency on WordPress anymore and the installation can be used purely for content management activities.

Imagine the complete global population connected to the Internet. Imagine billions of people using web and mobile applications, your web and mobile applications. That’s a staggering amount of traffic. Now visualize today’s “classic” infrastructure setup with a DMZ consisting of firewalls (two brands of them for the security conscious), load balancers and proxies, and put this infrastructure between your customers and your web/mobile applications. It’s like drinking water from a fire hydrant. That “classic” infrastructure will have a very difficult time keeping up. Sure, there are solutions like wire-speed firewalls, but the fact of the matter remains that any piece of infrastructure that you put between your application and your customers will have to cope with the load and therefore needs to be scaled up or out, adding costs in the process.

So why not skip them? Don’t use separate physical firewalls, load balancers or proxies. Integrate those functions with the web application hosting platforms. Put your application in a multitude of data centers, set up your hosts and hook up the big Internet pipes. If you’re in a Public Cloud you probably don’t have any firewalls, load balancers or proxies that you control anyway, so it’s better to get used to this mode of thinking.

Let’s see if it is feasible to abandon network firewalls, load balancers and proxies; implementation in this case is not left as an exercise to the reader. I’ll be using open source and open standard solutions in my examples, so any time I’m not specifically referring to a technology, assume I mean stuff like Linux, BSD etc.

Remove the network firewall

There’s no shortage of platform-based firewalls; it’s how firewalls started out in the early 90s before they became dedicated appliances. If you have a whole farm of servers serving the same application it is relatively straightforward to distribute firewall configuration files across a multitude of machines. In a web farm scenario the access ports are set initially and rarely change; ideally you only allow access on TCP ports 80 and 443 (HTTP/HTTPS). Any other traffic tends to be of a more administrative nature and will be routed over different NICs with a different firewall ruleset.

The concept of bringing the firewall back to the end host rather than at the network perimeter is known as a distributed firewall [Bellovin]. The important aspect of a distributed firewall is that the management of policy is still centralized, but the enforcement of the policy is distributed (to the end hosts). Bellovin lists three components to implement a distributed firewall:

  • Policy language: A language that states what sort of connections are permitted and prohibited (filtering rules)
  • System management: A management tool that changes and enforces the security policy
  • Safe distribution: A security mechanism that safely distributes the security policy

Implementation

This can be implemented in many ways, but the easiest choice would be to use netfilter and its associated filter rules as the policy language, manage the filter rules as a text file and use rsync over SSH to securely distribute the policy rules. The traffic between master and slave hosts will be minimal due to the nature of rsync (it only sends the parts that changed) and the fact that changes will hardly ever be necessary, as you’re only allowing traffic on TCP ports 80 and 443 (HTTP/HTTPS). An alternative to rsync is a message-based approach with guaranteed delivery, something like AMQP.
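As a rough illustration of the rsync over SSH variant, the Python sketch below only builds the commands a master host would run and does not execute anything. The host names, the root login and the rules file path are assumptions for the example. It also assumes the rules file is in the format that iptables-restore reads.

```python
import hashlib
from pathlib import Path


def policy_digest(policy_file: Path) -> str:
    """SHA-256 of the rules file, handy for checking which hosts run which policy."""
    return hashlib.sha256(policy_file.read_bytes()).hexdigest()


def push_commands(policy_file: str, hosts: list, remote_path: str) -> list:
    """Build (not run) the rsync and reload commands for each host."""
    commands = []
    for host in hosts:
        # rsync over SSH only transfers the differences in the rules file.
        commands.append(
            ["rsync", "-az", "-e", "ssh", policy_file, f"root@{host}:{remote_path}"]
        )
        # iptables-restore reads a rules file from standard input on the remote host.
        commands.append(["ssh", f"root@{host}", f"iptables-restore < {remote_path}"])
    return commands
```

Each command in the returned list can be handed to subprocess.run(cmd, check=True). Comparing the digest of the master copy with the digest reported by each host is one simple way to confirm that the policy really is the same everywhere.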

Remove the load balancers

A load balancer distributes workloads evenly across two or more hosts. Positioning this on the host level will not work, as one host will be quickly overwhelmed before it can offload to other hosts (in essence it becomes the same choke point as the load balancer), so this function needs to sit outside of the hosts serving your application. The function can’t be positioned on the hosts or in front of the hosts, so the only place remaining for this function is the client. The client needs to be able to load balance requests across several hosts. This requires that the client is in some form or shape aware of the hosts. A naive implementation could be based on providing the client with a list of hosts (for instance in the form of a JSON message) and having it pick a host at random, in round-robin order, or deterministically (CARP-like algorithm). However, this becomes unwieldy very quickly, especially when you start thinking in hundreds or thousands of servers, and it doesn’t offer a way to guide host selection (for example when taking hosts out of service for maintenance).
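The naive client-side selection can be sketched in a few lines. The real client would be JavaScript in the browser; Python is used here purely for illustration. The deterministic variant below is a highest-score hash in the spirit of CARP, not the actual CARP hash function, and the names are made up.

```python
import hashlib
import random


def pick_random(hosts: list) -> str:
    """Naive selection: any host with equal probability."""
    return random.choice(hosts)


def pick_deterministic(hosts: list, client_key: str) -> str:
    """CARP-like selection: score every host against the key, take the highest."""

    def score(host: str) -> int:
        digest = hashlib.sha256(f"{host}|{client_key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(hosts, key=score)
```

With the deterministic variant the same client key maps to the same host for as long as that host stays in the list, and removing some other host does not move the client. It still needs the full host list on every client, which is exactly the part that gets unwieldy.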

A similar problem exists when determining the association between host names and IP addresses, and this has been elegantly solved with a distributed computing solution: the Domain Name System. DNS is a distributed database solution with a standardized protocol. A similar approach can be devised for our situation where we need to find a suitable host for our client. Unfortunately JavaScript can’t execute DNS queries by itself and invoking a server side component defeats the purpose of this exercise, so we need to come up with something similar but just a bit different. We need a client that can execute a query against a DNS-like system that returns a list of usable hosts in a format that can be processed by client-side JavaScript.

Implementation

From an implementation perspective this can be achieved by having the ability to query the DNS system from JavaScript. This means that the DNS server needs to support an HTTP(S) interface and return information in a format that JavaScript can interpret, for instance JSON messages. We need a DNS server with a REST/JSON interface. Such interfaces are already available, like REST-DNS and JSON DNS, or can be created quite easily by yourself (use an existing DNS server implementation and add HTTP(S)/JSON capabilities). The JavaScript logic on the client will contain a number of root servers (comparable to DNS) that may be queried. After selecting a root server the JavaScript logic can subsequently query for the service it is looking for. The root server looks up which hosts can service the request and responds by offering the best matching hosts in the form of a JSON message (= Service Discovery). The client can then select a host and request the service. Hosts can be taken in and out of service by managing the host entries in the root servers.

A difference between this implementation and regular DNS is that there is no technical limitation on the number of published root servers. With a regular DNS setup the client can usually only configure two or three name servers. The JavaScript implementation doesn’t pose this limitation. There is no standard for a JSON-based DNS query yet, but it would be relatively straightforward to take the current DNS protocol and reflect that in JSON. It would only have to cover lookups; zone transfers can still be based on the normal DNS protocol.
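The lookup flow described above could look roughly like this. As there is no standard yet, the /lookup endpoint and the {"hosts": [...]} reply are purely hypothetical, and the sketch is in Python rather than the browser JavaScript it would really be. The HTTP call is passed in as a function so the logic can be tried without a network.

```python
import json
import random
from typing import Callable
from urllib.parse import quote


def discover_hosts(root_servers: list, service: str, fetch: Callable[[str], str]) -> list:
    """Ask root servers in random order until one returns hosts for `service`."""
    for root in random.sample(root_servers, len(root_servers)):
        try:
            # Hypothetical endpoint and reply shape: {"hosts": ["host1", "host2"]}
            reply = json.loads(fetch(f"{root}/lookup?service={quote(service)}"))
        except (OSError, ValueError):
            continue  # root unreachable or reply not valid JSON: try the next one
        hosts = reply.get("hosts", []) if isinstance(reply, dict) else []
        if hosts:
            return hosts
    return []
```

A real fetch function could be a thin wrapper around urllib.request.urlopen. Because the roots are tried in random order, the lookups themselves are spread across all published root servers, and a root that is down simply gets skipped.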

Now there’s one little problem left: how do we get to the first HTML page containing the JavaScript initialization code without using load balancing to distribute these initial requests? The JavaScript is embedded in the first HTML page that the client receives when accessing the web application. This first web page contains all the (JavaScript) logic to get going. It is a static resource and it can be hosted on a Content Delivery Network (CDN). The CDN itself can be accessed through DNS-based request routing, making it resilient and scalable. By using a CDN it is not necessary to have a load balancing capability for servicing the initial static web page containing the JavaScript logic. You can decide to use a readily available CDN service for this or roll your own if you are the size of Google or Facebook. If you decide to roll your own CDN, pay attention to client proximity issues.

Remove the proxies

Proxies are versatile constructions and it’s wise to clarify what types of proxies exist (before removing them):

  • Forward proxy
  • Reverse proxy:
    1. Caching proxy
    2. Load balancing proxy
    3. SSL offloading proxy
    4. Security proxy (authentication/filtering)

The forward proxy sits within the client environment and will not be impacted by the architecture proposed in this article, so we’ll keep it out of scope. Reverse proxies are used in the host environment and are examined below.

  • The caching proxy is used to capture dynamically generated resources and turn them into (temporary) static resources through caching. This saves on host compute resources as the same page doesn’t need to be generated with each and every request.
  • The load balancing proxy basically does the same thing as a load balancer, i.e. distribute load over two or more hosts, only more specifically for the HTTP protocol, sometimes using advanced features like URL and content rewriting to change location and content on the fly.
  • The SSL offloading proxy handles all SSL traffic in front of the web server and thereby offloads all SSL traffic from the webserver (SSL can be quite compute intensive).
  • The security proxy can carry out authentication (identifying the user) or security filtering (checking requests for anomalies like SQL injection and XSS) before allowing traffic to the web server.

All these Reverse Proxy functionalities have alternative implementations that are host based and can therefore be distributed horizontally across all hosts.

Implementation

  • Caching proxy: By using a CDN for static resources and using application and database caching techniques for dynamically generated resources the need for a caching proxy can be removed. Semi-dynamic resources (e.g. generated at specific intervals) can be automatically uploaded to the CDN.
  • Load balancing proxy: the functionality of load balancers has been resolved in the “Remove the load balancers” section.
  • SSL offloading proxy: Without an SSL offloading proxy this function has to be carried out on the web server itself. However, the host can benefit significantly from hardware SSL accelerators.
  • Security proxy: authentication can be done at the application or web server level. Security filtering can be done through host modules like the Apache mod_security module.
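To give an idea of what application-level caching in place of a caching proxy can look like, here is a minimal Python sketch of an in-memory page cache with a time-to-live. The class name PageCache and the render callback are made up for illustration. A real setup would more likely use the caching facilities of the application framework or a shared cache.

```python
import time
from typing import Callable


class PageCache:
    """Keep rendered pages in memory for `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # path -> (expiry time, rendered page)

    def get(self, path: str, render: Callable[[str], str]) -> str:
        now = self.clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip rendering
        # Missing or expired: render again and remember the result.
        page = render(path)
        self._entries[path] = (now + self.ttl, page)
        return page
```

Every host keeps its own cache, so this scales out with the hosts instead of sitting in front of them. The price is that each host renders a page once per interval rather than once for the whole farm.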

Conclusion

All mentioned components (firewalls, load balancers and proxies) can be completely avoided with a well thought out architecture. This avoids significant upfront costs, improves scalability by orders of magnitude and reduces management complexity.

All these elements, and a couple more, lead to an architecture that can process trillions of interactions per day because it is completely distributed and horizontally scalable. It is not constrained by infrastructure components requiring large upfront investments like load balancers, firewalls and proxies. I call this concept the LARG architecture, short for “Linked Architecture for Resource Groups”, and it will be the topic of a following article.