Imagine the complete global population connected to the Internet. Imagine billions of people using web and mobile applications, your web and mobile applications. That’s a staggering amount of traffic. Now visualize today’s “classic” infrastructure setup with a DMZ consisting of firewalls (two of them, from two different brands, for the security conscious), load balancers and proxies, and put this infrastructure between your customers and your web/mobile applications. It’s like drinking water from a fire hydrant. That “classic” infrastructure will have a very difficult time keeping up. Sure, there are solutions like wire-speed firewalls, but the fact remains that any piece of infrastructure that you put between your application and your customers has to cope with the load and therefore needs to be scaled up or out, adding costs in the process.
So why not skip them? Don’t use separate physical firewalls, load balancers or proxies. Integrate those functions with the web application hosting platforms. Put your application in a multitude of data centers, set up your hosts and hook up the big Internet pipes. If you’re in a public cloud you probably don’t have any firewalls, load balancers or proxies that you control anyway, so it’s better to get used to this mode of thinking.
Let’s see if it is feasible to abandon network firewalls, load balancers and proxies; implementation in this case is not left as an exercise to the reader. I’ll be using open source and open standard solutions in my examples, so any time I’m not specifically referring to a technology, assume I mean stuff like Linux, BSD etc.
Remove the network firewall
There’s no shortage of platform-based firewalls; that is how firewalls started out in the early 90s, before they became dedicated appliances. If you have a whole farm of servers serving the same application, it is relatively straightforward to distribute firewall configuration files across a multitude of machines. In a web farm scenario the access ports are set once and rarely change; ideally you only allow access on TCP ports 80 and 443 (HTTP/HTTPS). Any other traffic tends to be of a more administrative nature and will be routed over different NICs with a different firewall ruleset.
The concept of moving the firewall back to the end host rather than keeping it at the network perimeter is known as a distributed firewall [Bellovin]. The important aspect of a distributed firewall is that the management of policy is still centralized, but the enforcement of the policy is distributed (to the end hosts). Bellovin lists three components to implement a distributed firewall:
- Policy language: A language that states what sort of connections are permitted and prohibited (filtering rules)
- System management: A management tool that changes and enforces the security policy
- Safe distribution: A security mechanism that safely distributes the security policy
Implementation
This can be implemented in many ways, but the easiest choice is to use netfilter and its filter rules as the policy language, manage the filter rules as a text file and use rsync over SSH to distribute the policy rules securely. The traffic between the master and the web hosts will be minimal because rsync only sends the parts of a file that changed, and because changes will hardly ever be necessary when you only allow traffic on TCP ports 80 and 443 (HTTP/HTTPS). An alternative to rsync is a message-based approach with guaranteed delivery, something like AMQP.
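As a rough illustration of the policy-language and distribution steps, the sketch below renders a default-deny ruleset in the text format read by iptables-restore and builds the rsync-over-SSH command that would push it to a host. The port list, the host names and the remote path /etc/iptables/rules.v4 are assumptions made for this example, and the rsync command is only printed, not executed.

```python
import shlex

# Hypothetical central policy: the only public services are HTTP and HTTPS.
ALLOWED_TCP_PORTS = [80, 443]


def render_ruleset(allowed_tcp_ports):
    """Render a default-deny inbound policy in iptables-restore format."""
    lines = [
        "*filter",
        ":INPUT DROP [0:0]",      # drop inbound traffic unless a rule accepts it
        ":FORWARD DROP [0:0]",    # a web host does not route traffic
        ":OUTPUT ACCEPT [0:0]",
        "-A INPUT -i lo -j ACCEPT",
        "-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(allowed_tcp_ports)):
        lines.append(f"-A INPUT -p tcp --dport {port} -j ACCEPT")
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"


def rsync_command(local_path, host, remote_path):
    """Build (but do not run) the rsync-over-SSH command for one host."""
    return ["rsync", "-az", "-e", "ssh", local_path, f"{host}:{remote_path}"]


if __name__ == "__main__":
    print(render_ruleset(ALLOWED_TCP_PORTS), end="")
    for host in ["web01.example.com", "web02.example.com"]:  # hypothetical hosts
        cmd = rsync_command("rules.v4", host, "/etc/iptables/rules.v4")
        print(shlex.join(cmd))
```

On each host the received file would still have to be loaded, for instance with iptables-restore; how that step is triggered is left open here.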
Remove the load balancers
A load balancer distributes workloads evenly across two or more hosts. Positioning this function on the host level will not work: a single host will be overwhelmed before it can offload to other hosts, and in essence becomes the same choke point as the load balancer. So the function needs to sit outside the hosts serving your application. Since it can’t be positioned on the hosts or in front of the hosts, the only place remaining is the client. The client needs to be able to load balance requests across several hosts, which requires that the client is in some form aware of the hosts. A naive implementation could provide the client with a list of hosts (for instance in the form of a JSON message) and let it pick a host at random, in turn (round robin) or deterministically (with a CARP-like algorithm). However, this becomes unwieldy very quickly, especially when you start thinking in hundreds or thousands of servers, and it doesn’t offer a way to guide host selection (for example when taking hosts out of service for maintenance).
A similar problem exists when determining the association between host names and IP addresses, and this has been elegantly solved with a distributed computing solution: the Domain Name System. DNS is a distributed database solution with a standardized protocol. A similar approach can be devised for our situation, where we need to find a suitable host for our client. Unfortunately JavaScript can’t execute DNS queries by itself, and invoking a server-side component defeats the purpose of this exercise, so we need to come up with something similar but just a bit different. We need a client that can execute a query to a DNS-like system that returns a list of usable hosts in a format that can be processed by client-side JavaScript.
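To make the naive host-list approach concrete, here is a minimal sketch of two selection strategies. It is written in Python for brevity; the same logic would run as client-side JavaScript. The deterministic variant uses rendezvous (highest-random-weight) hashing as a stand-in for a CARP-like algorithm, which is an assumption of this example and not the CARP hash itself. The host names and the session key are hypothetical.

```python
import hashlib
import random


def pick_random(hosts, rng=random):
    """Pick any host with equal probability; no state is kept on the client."""
    return rng.choice(hosts)


def pick_deterministic(hosts, key):
    """Pick the host with the highest hash score for this key.

    Rendezvous hashing: every (host, key) pair gets a score and the best
    score wins, so the same key maps to the same host on every client.
    """
    def score(host):
        digest = hashlib.sha256(f"{host}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(hosts, key=score)


if __name__ == "__main__":
    hosts = ["web01.example.com", "web02.example.com", "web03.example.com"]
    print(pick_random(hosts))
    print(pick_deterministic(hosts, "session-42"))
```

A property of the deterministic variant is that removing a host only remaps the keys that were assigned to that host; all other keys keep their host.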
Implementation
From an implementation perspective this can be achieved by making the DNS system queryable from JavaScript. This means that the DNS server needs to support an HTTP(S) interface and return information in a format that JavaScript can interpret, for instance JSON messages. In other words, we need a DNS server with a REST/JSON interface. Such interfaces are already available, like REST-DNS or JSON DNS, or you can create one quite easily yourself (take an existing DNS server implementation and add HTTP(S)/JSON capabilities). The JavaScript logic on the client contains a number of root servers (comparable to DNS) that may be queried. After selecting a root server, the JavaScript logic asks it for the service it is looking for. The root server looks up which hosts can service the request and responds with the best matching hosts in the form of a JSON message (service discovery). The client can then select a host and request the service. Hosts can be taken in and out of service by managing the host entries in the root servers. A difference between this implementation and regular DNS is that there is no technical limitation on the number of published root servers. With regular DNS a client can usually only be configured with two or three name servers; the JavaScript implementation doesn’t have this limitation. There is no standard for a JSON-based DNS query yet, but it would be relatively straightforward to take the current DNS protocol and reflect it in JSON. It would only have to cover lookups; zone transfers can still be based on the normal DNS protocol.
Now there’s one little problem left: how do we get to the first HTML page containing the JavaScript initialization code without using load balancing to distribute these initial requests? The JavaScript is embedded in the first HTML page that the client receives when accessing the web application. This first web page contains all the (JavaScript) logic to get going. It is a static resource and it can be hosted on a Content Delivery Network (CDN). The CDN itself can be accessed through DNS-based request routing, making it resilient and scalable. By using a CDN it is not necessary to have a load balancing capability for serving the initial static web page containing the JavaScript logic. You can use a readily available CDN service for this, or roll your own if you are the size of Google or Facebook. If you decide to roll your own CDN, pay attention to client proximity issues.
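The lookup flow described above could look roughly like the following sketch, again in Python for brevity although the real client would be JavaScript. Since no standard JSON format exists, the /lookup endpoint, the "service" query parameter and the {"hosts": [...]} reply shape are all invented for this example. The fetch function is passed in so that the sketch contains no real network calls.

```python
import json
import random
from urllib.parse import quote


def discover(service, root_servers, fetch, rng=random):
    """Ask root servers, in random order, which hosts offer `service`.

    `fetch` is any callable that takes a URL and returns the response body
    as text; injecting it keeps the sketch free of real network calls.
    """
    roots = list(root_servers)
    rng.shuffle(roots)  # spread the lookups across all published roots
    for root in roots:
        # Hypothetical endpoint and parameter name.
        url = f"https://{root}/lookup?service={quote(service)}"
        try:
            reply = json.loads(fetch(url))
            hosts = reply["hosts"]  # hypothetical reply: {"hosts": ["..."]}
        except (OSError, ValueError, KeyError, TypeError):
            continue  # root unreachable or reply malformed: try the next one
        if hosts:
            return hosts
    raise LookupError(f"no root server returned hosts for {service!r}")
```

Shuffling the root list on every lookup is one simple way to avoid all clients hitting the first published root; taking a host out of service then only means removing it from the replies of the root servers.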
Remove the proxies
Proxies are versatile constructions and it’s wise to clarify which types of proxies exist (before removing them):
- Forward proxy
- Reverse proxy:
  - Caching proxy
  - Load balancing proxy
  - SSL offloading proxy
  - Security proxy (authentication/filtering)
The forward proxy sits within the client environment and is not impacted by the architecture proposed in this article, so we’ll keep it out of scope. Reverse proxies are used in the host environment and are examined one by one:
- The caching proxy is used to capture dynamically generated resources and turn them into (temporary) static resources through caching. This saves on host compute resources as the same page doesn’t need to be generated with each and every request.
- The load balancing proxy basically does the same thing as a load balancer, i.e. it distributes load over two or more hosts, but specifically for the HTTP protocol, sometimes using advanced features like URL and content rewriting to change location and content on the fly.
- The SSL offloading proxy handles all SSL traffic in front of the web server and thereby offloads all SSL processing from the web server (SSL can be quite compute intensive).
- The security proxy can carry out authentication (identifying the user) or security filtering (checking requests for anomalies like SQL injection or XSS) before allowing traffic to the web server.
All these reverse proxy functions have alternative implementations that are host based and can therefore be distributed horizontally across all hosts.
Implementation
- Caching proxy: By using a CDN for static resources and using application and database caching techniques for dynamically generated resources the need for a caching proxy can be removed. Semi-dynamic resources (e.g. generated at specific intervals) can be automatically uploaded to the CDN.
- Load balancing proxy: The functionality of load balancers is covered in the “Remove the load balancers” section.
- SSL offloading proxy: Without an SSL offloading proxy, this function has to be carried out on the web server itself. However, the host can benefit significantly from hardware SSL accelerators.
- Security proxy: Authentication can be done at the application or web server level. Security filtering can be done through host modules like the Apache mod_security module.
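For the caching proxy replacement, application-level caching can be as simple as keeping a generated page in memory for a short time. The sketch below is one possible shape of such a cache, not a recommendation of a specific library; the class name, the 30-second lifetime and the render callback are assumptions of this example, and a real deployment would also need a size limit.

```python
import time


class TTLCache:
    """Keep generated pages in memory for `ttl_seconds` before regenerating."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}  # key -> (expires_at, value)

    def get_or_render(self, key, render):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive render
        value = render()
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30)
    page = cache.get_or_render("/home", lambda: "<html>generated page</html>")
    print(page)
```

Because every host keeps its own cache, this scales out with the hosts, at the price of each host rendering a page once per lifetime.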
Conclusion
All the components mentioned (firewalls, load balancers and proxies) can be eliminated completely with a well thought out architecture. This avoids significant upfront costs, improves scalability by orders of magnitude and reduces management complexity.
All these elements, and a couple more, lead to an architecture that can process trillions of interactions per day because it is completely distributed and horizontally scalable. It is not constrained by infrastructure components that require large upfront investments, like load balancers, firewalls and proxies. I call this concept the LARG architecture, short for “Linked Architecture for Resource Groups”, and it will be the topic of a follow-up article.