May 18 2016

Members of the British Council Digital team were delighted to receive the RITA2016 award last Thursday for the huge change in IT cloud infrastructure that Ixis delivered in the summer of 2015.

The award for "Infrastructure as an Enabler" reflected the innovative change in the way the British Council undertook their hosting requirement for the initial 120 sites operating in over 100 countries across the globe and delivering a clear business benefit. Moving away from dedicated infrastructure to virtual containers provided the ability to tightly control and guarantee server resources to an individual site and a quick and easy way to duplicate an environment for QA, testing, staging and feature branch development.

Ixis partnered with Drupal container expert Platform.sh to provide the underlying infrastructure and API. We'll publish further details on our integration as a case study.

Congratulations also go to another of our clients, Westminster City Council, for their award and two further highly commended positions in this year's awards.

Photo courtesy of Chaudhry Javed Iqbal on Twitter.

May 18 2016

In this blog post I'll discuss some methods of ensuring that your software is kept up to date, and some recent examples of why you should consider security to be among your top priorities instead of viewing it as an inconvenience or hassle.

Critics often attack the stability and security of Open Source due to the frequent releases and updates as projects evolve through constant contributions to their code from the community. They claim that open source requires too many patches to stay secure, and too much maintenance as a result.

This is easily countered with the explanation that by having so many individuals working with the source code of these projects, and so many eyes on them, potential vulnerabilities and bugs are uncovered much faster than with programs built on proprietary code. It is difficult for maintainers to ignore or delay the release of updates and patches with so much public pressure and visibility, and this should be seen as a positive thing.

The reality is that achieving a secure open source infrastructure and application environment requires much the same approach as with commercial software. The same principles apply, with only the implementation details differing. The most prominent difference is the transparency that exists with open source software.

Making Headlines

Open Source software often makes headlines when it is blamed for security breaches or data loss. The most recent high profile example would be the Mossack Fonseca “Panama Papers” breach, which was blamed on either WordPress or Drupal. It would be more accurate to blame the firm itself for having poor security practices, including severely outdated software throughout the company and a lack of even basic encryption.

Mossack Fonseca were using an outdated version of Drupal: 7.23. This version was released on 8 Aug 2013, almost 3 years ago as of the time of writing. That version has at least 25 known vulnerabilities. Several of these are incredibly serious, and were responsible for the infamous “Drupalgeddon” event which led to many sites being remotely exploited. Drupal.org warned users that “anyone running anything below version 7.32 within seven hours of its release should have assumed they’d been hacked”.
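If you are not sure which core version a site is running, Drush can tell you in seconds (a minimal check; the @mysite alias is a hypothetical placeholder):

    # Report the core version of a Drupal site; @mysite is a placeholder alias.
    drush @mysite status | grep -i 'drupal version'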

Protection by Automation

Probably the most effective way to keep your software updated is to automate and enforce the process. Don’t leave it in the hands of users or clients to apply or approve updates. The complexity of this will vary depending on what you need to update, and how, but it can often be as simple as enabling the built-in automatic updates that your software may already provide, or scheduling a daily command to apply any outstanding updates.

Once you've got it automated (the easy part) you will want to think about testing these changes before they hit production systems. Depending on the impact of the security exploits that you're patching, it may be more important to install updates even without complete testing; a broken site is often better than a vulnerable site! You may not have an automated way of testing every payment permutation on a large e-commerce site, for example, but that should not dissuade you from applying a critical update that exposes credit card data. Just be sure you aren't using this rationale as an excuse to avoid implementing automated testing.

The simple way

As a very common example of how simple the application of high priority updates can be, most Linux distributions will have a tried and tested method of automatically deploying security updates through their package management systems. For example, Ubuntu/Debian have the unattended-upgrades package, and Redhat-based systems have yum-cron. At the very least you will be able to schedule the system’s package manager to perform nightly updates yourself. This will cover the OS itself as well as any officially supported software that you have installed through the package manager. This means that you probably already have a reliable method of updating 95% of the open source software that you're using with minimal effort, and potentially any third-party software if you're installing from a compatible software repository. Consult the documentation for your Linux distro (or Google!) to find out how to enable this, and you can ensure that you are applying updates as soon as they are made available.
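As a minimal sketch on Debian or Ubuntu, enabling unattended security upgrades looks like this (the package and file names are standard, but check your release's documentation):

    # Install and activate automatic security updates (Debian/Ubuntu).
    sudo apt-get install unattended-upgrades
    sudo dpkg-reconfigure -plow unattended-upgrades

    # Equivalently, ensure /etc/apt/apt.conf.d/20auto-upgrades contains:
    #   APT::Periodic::Update-Package-Lists "1";
    #   APT::Periodic::Unattended-Upgrade "1";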

The complex way

For larger or more complex infrastructure where you may be using configuration management software (such as Ansible, Chef, or Puppet) to enforce state and install packages, you have more options. Config management software will allow you to apply updates to your test systems first, and report back on any immediate issues applying these updates. If a service fails to restart, a service does not respond on the expected port after the upgrade, or anything goes wrong, this should be enough to stop these changes reaching production until the situation is resolved. This is the same process that you should already be following for all config changes or package upgrades, so no special measures should be necessary.
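The health check after an upgrade does not need to be sophisticated to be useful. Here is a rough shell sketch of the idea, assuming a Red Hat-based test host; the hostname, the URL, and the use of the yum security plugin are placeholder assumptions:

    #!/bin/bash
    # Apply pending security updates on a test host, then verify the service
    # still responds before the same updates are promoted to production.
    ssh test01 'sudo yum -y --security update' || exit 1   # needs yum-plugin-security
    sleep 10
    if ! curl -sf -o /dev/null http://test01/; then
        echo "test01 not responding after upgrade - halting rollout" >&2
        exit 1
    fi
    echo "test01 healthy - safe to promote updates"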

Whether to make security updates a separate scheduled task or to implement them directly in your config management process will depend on your implementation, and it would be impossible to cover every possible method here.

Risk Management

Automatically upgrading software packages on production systems is not without risks. Many of these can be mitigated with a good workflow for applying changes (of any kind) to your servers, and confidence can be added with automated testing.

Risks

  • You need to have backups of your configuration files, or be enforcing them with config management software. You may lose custom configuration files if they are not flagged correctly in the package, or the package manager does not behave how you expect when updating the software.
  • Changes to base packages like openssl, the kernel, or system libraries can have an unexpected effect on many other packages.
  • There may be bugs or regressions in the new version. Performance may be degraded.
  • Automatic updates may not complete the entire process needed to make the system secure. For example, a kernel update will generally require a reboot, or multiple services may need to be restarted. If this does not happen as part of the process, you may still be running unsafe versions of the software despite installing upgrades.

Reasons to apply updates automatically

  • The server is not critical and occasional unplanned outages are acceptable.
  • You are unlikely to apply updates manually to this server.
  • You have a way to recover the machine if remote access via SSH becomes unavailable.
  • You have full backups of any data on the machine, or no important data is stored on it.

Reasons to NOT apply updates automatically

  • The server provides a critical service and has no failover in place, and you cannot risk unplanned outages.
  • You have custom software installed manually, or complex version dependencies that may be broken during upgrades. This includes custom kernels or kernel modules.
  • You need to follow a strict change control process on this environment.

Reboot Often

Most update systems will also be able to automatically reboot for you if this is required (such as a kernel update), and you should not be afraid of this or delay it unless you're running a critical system. If you are running a critical system, you should already have a method of hot-patching the affected systems, performing rolling/staggered reboots behind a load-balancer, or some other cloud wizardry that does not interrupt service.

Decide on a maintenance window and schedule your update system to use it whenever a reboot is required. Have monitoring in place to alert you in the event of failures, and schedule reboots within business hours wherever possible.
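On Debian/Ubuntu, for example, unattended-upgrades can handle the reboot itself, or a cron job can reboot only when the system requests it (a sketch; the option names come from the stock configuration file, and the 2am window is a placeholder):

    # /etc/apt/apt.conf.d/50unattended-upgrades:
    #   Unattended-Upgrade::Automatic-Reboot "true";
    #   Unattended-Upgrade::Automatic-Reboot-Time "02:00";

    # Or via cron: reboot at 2am, but only if a reboot is actually required.
    # 0 2 * * * [ -f /var/run/reboot-required ] && /sbin/shutdown -r now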

Drupal and Other Web-based Applications

Most web-based CMS software, such as Drupal and WordPress, offers automated updates, or at least notifications. Drupal security updates for both core and contributed modules can be applied with Drush, which can in turn be scheduled easily using cron or a task runner like Jenkins. This may not be a workable solution if you follow anything but the most basic deployment workflow, or if you rely on a version control system such as Git for your development (which is where these updates should go, not straight to the web server). Having your production site automatically update itself means that it no longer matches what you deployed or what is in your version control repository, and it bypasses any CI/testing that you have in place. It is still an option worth considering if you lack all of these things, or if you just want to guarantee that your public-facing site is getting patches as a priority over all else.

You could make this approach work by serving the Git repo as the document root, updating Drupal automatically (using Drush in 'security only' upgrade mode on cron), then committing those changes (which should not conflict with your custom code/modules) back to the repo. Not ideal, but better than having exploitable security holes on your live servers.
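A minimal sketch of that approach, assuming the document root is a Git clone with Drush available (the path, branch, and schedule are placeholders):

    #!/bin/bash
    # Nightly cron: apply security-only updates to a live Drupal docroot,
    # then commit the result back so the repo still matches production.
    cd /var/www/html || exit 1
    drush pm-update --security-only -y
    git add -A
    # Commit and push only if the update actually changed something.
    git diff --cached --quiet || git commit -m "Automated Drupal security updates $(date +%F)"
    git push origin master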

If your Linux distribution (or the CMS maintainers themselves) provide the web-based software as a package, and security updates are applied to it regularly, you may even consider using their version of the application. You can treat Drupal as just another piece of software in the stack, and the only thing that you're committing to version control and deploying to servers is any custom modules to be layered on top of the (presumably) secure version provided as part of the OS.

Some options that may fit better into the common CI/Git workflows might be:

  • Detect, apply, and test security patches off-site on a dedicated server or container. If successful, commit them back to version control to your dev/integration branch.
  • Check for security updates as part of your CI system. Apply, test and merge any updates into your integration branch.
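For the CI-driven variants above, the build step only needs to detect whether any security updates exist. A sketch, again assuming Drush; failing the build so that a human or a downstream job applies and tests the updates is our convention, not a Drush feature:

    #!/bin/bash
    # Fail the CI build if Drupal reports pending security updates.
    PENDING=$(drush pm-updatestatus --security-only 2>/dev/null)
    if [ -n "$PENDING" ]; then
        echo "Security updates pending:"
        echo "$PENDING"
        exit 1
    fi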

Third-party Drupal Modules (contrib)

Due to the nature of contrib Drupal modules (i.e., those provided by the community), it can be difficult to update them without also bringing in other changes, such as new features (and bugs!) that the author may have introduced since the version you are currently running. Best practice is to keep all of the contrib modules that the site uses up to date where possible, and to treat this with the same care and testing as you would updates to Drupal itself. Contrib modules often receive important bug fixes and performance improvements that you will miss out on if you only ever update in the event of a security announcement.

Summary

  • Ensure that updates are coming from a trusted and secure (SSL) source, such as your Linux distribution's packaging repositories or the official Git repositories for your software.
  • If you do not trust the security updates enough to apply them automatically, you should probably not be using the software in the first place.
  • Ensure that you are alerted in the event of any failures in your automation.
  • Subscribe to relevant security mailing lists, RSS feeds, and user groups for your software.
  • Prove to yourself and your customers that your update method is reliable.
  • Do not allow your users, clients, or boss to postpone or delay security updates without an incredibly good reason.

You are putting your faith in the maintainers' ability to provide timely updates that will not break your systems when applied. This is a risk you will have to take if you automate the process, but it can be mitigated through automated or manual testing.

Leave It All To Somebody Else

If all this feels like too much responsibility and hard work, it's something Ixis have many years of experience in. We have dedicated infrastructure and application support teams to keep your systems secure and updated. Get in touch to see how we can ensure you're secure now and in the future whilst enjoying the use and benefits of open source software.

Jan 28 2015

As the largest bicycling club in the country with more than 16,000 active members and a substantially larger community across the Puget Sound, Cascade Bicycle Club requires serious performance from its website. For most of the year, Cascade.org serves a modest number of web users as it furthers the organization’s mission of “improving lives through bicycling.”

But a few days each year, Cascade opens registration for its major sponsored rides, which results in a series of massive spikes in traffic. Cascade.org has in the past struggled to keep up with demand during these spikes. During the 2014 registration period for example, site traffic peaked at 1,022 concurrent users and >1,000 transactions processed within an hour. The site stayed up, but the single web server seriously struggled to stay on its feet.

In preparation for this year’s event registrations, we implemented horizontal scaling at the web server level as the next logical step forward in keeping pace with Cascade’s members. What is horizontal scaling, you might ask? Let me explain.

[Ed Note: This post gets very technical, very quickly.]

Overview

We had already set up hosting for the site in the Amazon cloud, so our job was to build out the new architecture there, including new Amazon Machine Images (AMIs) along with an Autoscale Group and Scaling Policies.

Here is a diagram of the architecture we ended up with. I’ll touch on most of these pieces below.

[Diagram: Cascade.org autoscaling architecture]

Web Servers as Cattle, Not Pets

I’m not the biggest fan of this metaphor, but it’s catchy: The fundamental mental shift when moving to automatic scaling is to stop thinking of the servers as named and coddled pets, but rather as identical and ephemeral cogs–a herd of cattle, if you will.

In our case, multiple web server instances are running at a given time, and more may be added or taken away automatically at any given time. We don’t know their IP addresses or hostnames without looking them up (which we can do either via the AWS console, or via AWS CLI — a very handy tool for managing AWS services from the command line).

The load balancer is configured to enable connection draining. When the autoscaling group triggers an instance removal, the load balancer will stop sending new traffic, but will finish serving any requests in progress before the instance is destroyed. This, coupled with sticky sessions, helps alleviate concerns about disrupting transactions in progress.

The AMI for the “cattle” web servers (3) is similar to our old single-server configuration, running Nginx and PHP tuned for Drupal. It’s actually a bit smaller of an instance size than the old server, though — since additional servers are automatically thrown into the application as needed based on load on the existing servers — and has some additional configuration that I’ll discuss below.

As you can see in the diagram, we still have many “pets” too. In addition to the surrounding infrastructure like our code repository (8) and continuous integration (7) servers, at AWS we have a “utility” server (9) used for hosting our development environment and some of our supporting scripts, as well as a single RDS instance (4) and a single EC2 instance used as a Memcache and Solr server (6). We also have an S3 instance for managing our static files (5) — more on that later.

Handling Mail

One potential whammy we caught late in the process was handling mail sent from the application. Since the IP of the given web server instance from which mail is sent will not match the SPF record for the domain (IP addresses authorized to send mail), the mail could be flagged as spam or mail from the domain could be blacklisted.

We were already running Mandrill for Drupal’s transactional mail, so to avoid this problem, we configured our web server AMI to have Postfix route all mail through the Mandrill service. Amazon Simple Email Service could also have been used for this purpose.
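The Postfix side of that is only a few settings (a sketch; the Mandrill SMTP endpoint was correct at the time of writing, and the credentials are placeholders):

    # Route all outbound mail through Mandrill's SMTP service.
    sudo postconf -e 'relayhost = [smtp.mandrillapp.com]:587'
    sudo postconf -e 'smtp_sasl_auth_enable = yes'
    sudo postconf -e 'smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd'
    sudo postconf -e 'smtp_use_tls = yes'
    # /etc/postfix/sasl_passwd should contain (placeholder credentials):
    #   [smtp.mandrillapp.com]:587 MANDRILL_USERNAME:MANDRILL_API_KEY
    sudo postmap /etc/postfix/sasl_passwd
    sudo service postfix restart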

Static File Management

With our infrastructure in place, the main change at the application level is the way Drupal interacts with the file system. With multiple web servers, we can no longer read and write from the local file system for managing static files like images and other assets uploaded by site editors. A content delivery network or networked file system share lets us offload static files from the local file system to a centralized resource.

In our case, we used Drupal's S3 File System module to manage our static files in an Amazon S3 bucket. S3FS adds a new "Amazon Simple Storage Service" file system option and stream wrapper. Core and contributed modules, as well as file fields, are configured to use this file system. The AWS CLI provided an easy way to initially transfer static files to the S3 bucket, and to iteratively sync new files to the bucket as we tested and proceeded towards launch of the new system.
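The transfer itself is a one-liner with the AWS CLI (the bucket name is a placeholder), and re-running it picks up only new or changed files:

    # Bulk copy Drupal's static files into the S3 bucket, repeatably.
    aws s3 sync sites/default/files s3://cascade-static-files/ --acl public-read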

In addition to static files, special care has to be taken with aggregated CSS and Javascript files. Drupal’s core aggregation can’t be used, as it will write the aggregated files to the local file system. Options (which we’re still investigating) include a combination of contributed modules (Advanced CSS/JS Aggregation + CDN seems like it might do the trick), or Grunt tasks to do the aggregation outside of Drupal during application build (as described in Justin Slattery’s excellent write-up).

In the case of Cascade, we also had to deal with complications from CiviCRM, which stubbornly wants to write to the local file system. Thankfully, these are primarily cache files that Civi doesn’t mind duplicating across webservers.

Drush & Cron

We want a stable, centralized host from which to run cron jobs (which we obviously don’t want to execute on each server) and Drush commands, so one of our “pets” is a small EC2 instance that we maintain for this purpose, along with a few other administrative tasks.

Drush commands can be run against the application from anywhere via Drush aliases, which requires knowing the hostname of one of the running server instances. This can be achieved most easily by using AWS CLI. Something like the bash command below will return the running instances (where ‘webpool’ is an arbitrary tag assigned to our autoscaling group):

$ aws ec2 describe-instances --filters "Name=tag-key,Values=webpool" | grep ^INSTANCE | awk '{print $14}' | grep 'compute.amazonaws.com'

We wrote a simple bash script, update-alias.sh, to update the ‘remote-host’ value in our Drush alias file with the hostname of the last running server instance.

Our cron jobs execute update-alias.sh, and then the application (both Drupal and CiviCRM) cron jobs.
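A minimal sketch of what update-alias.sh does; the alias file path and the exact format of the alias entry are assumptions for illustration:

    #!/bin/bash
    # update-alias.sh (sketch): point the Drush alias at one running instance.
    HOST=$(aws ec2 describe-instances --filters "Name=tag-key,Values=webpool" \
      | grep ^INSTANCE | awk '{print $14}' | grep 'compute.amazonaws.com' | tail -n1)
    # Rewrite the remote-host value in the alias file (path and format assumed).
    sed -i "s/'remote-host' => '[^']*'/'remote-host' => '$HOST'/" \
      ~/.drush/cascade.aliases.drushrc.php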

Deployment and Scaling Workflows

Our webserver AMI includes a script, bootstrap.sh, that either builds the application from scratch — cloning the code repository, creating placeholder directories, symlinking to environment-specific settings files — or updates the application if it already exists — updating the code repository and doing some cleanup.

A separate script, deploy-to-autoscale.sh, collects all of the running instances similar to update-alias.sh as described above, and executes bootstrap.sh on each instance.

With those two utilities, our continuous integration/deployment process is straightforward. When code changes are pushed to our Git repository, we trigger a job on our Jenkins server that essentially just executes deploy-to-autoscale.sh. We run update-alias.sh to update our Drush alias, clear the application cache via Drush, tag our repository with the Jenkins build ID, and we’re done.

For the autoscaling itself, our current policy is to spin up two new server instances when CPU utilization across the pool of instances reaches 75% for 90 seconds or more. New server instances simply run bootstrap.sh to provision the application before they’re added to the webserver pool.

There’s a 300-second grace time between additional autoscale operations to prevent a stampede of new cattle. Machines are destroyed when CPU usage falls beneath 20% across the pool. They’re removed one at a time for a more gradual decrease in capacity than the swift ramp-up that fits the profile of traffic.

More Butts on Bikes

With this new architecture, we’ve taken a huge step toward one of Cascade’s overarching goals: getting “more butts on bikes”! We’re still tuning and tweaking a bit, but the application has handled this year’s registration period flawlessly so far, and Cascade is confident in its ability to handle the expected — and unexpected — traffic spikes in the future.

Our performant web application for Cascade Bicycle Club means an easier registration process, leaving them to focus on what really matters: improving lives through bicycling.


Oct 20 2014

As I mentioned in my recent post, I got a chance to upgrade the drupal.org ELK stack last week. In doing so, I took a fresh look at a Logstash configuration that I created over a year ago, cleaned up some less-than-optimal choices based on a year's worth of experience, and simplified the configuration file a great deal.

The Drupal.org Logging Setup

Drupal.org is served by a large (and growing) number of servers. They all ship their logs to a central logging server for archival, and around a month’s worth are kept in the ELK stack for analysis.

Logs for Varnish, Apache, and syslog are forwarded to a centralized log server for analysis by Logstash. Drupal messages are output to syslog using Drupal core's syslog module so that logging does not add writes to Drupal.org's busy database servers. (@TODO: Check if these paths can be published.) Apache logs end up in /var/log/apache_logs/$MACHINE/$VHOST/transfer/$DATE.log, Varnish logs end up in /var/log/varnish_logs/$MACHINE/varnishncsa-$DATE.log, and syslog logs end up in /var/log/HOSTS/$MACHINE/$DATE.log. All types of logs get gzipped one day after they are closed to save disk space.
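The compression step is nothing exotic; a nightly cron job along these lines does it (a sketch, not the actual drupal.org job):

    # Compress any log file that has not been written to for a day.
    find /var/log/apache_logs /var/log/varnish_logs /var/log/HOSTS \
      -name '*.log' -mtime +1 -exec gzip {} \;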

Pulling Contextual Smarts From Logs

The Varnish and Apache logs contain nothing in the log lines themselves to identify which machine they came from, but the file input sets a path field that can be matched with grok to pull the machine name out of the path and put it into the logsource field, the same field that grok's SYSLOGLINE pattern sets when analyzing syslog logs.

Filtering on the logsource field can be quite helpful in the Kibana web UI if a single machine is suspected of behaving weirdly.

Using Grok Overwrite

Consider this snippet from the original version of the Varnish configuration. As I mentioned in my presentation, Varnish logs are nice in that they include the HTTP Host header, so you can see exactly which hostname or IP was requested. This makes sense for a daemon like Varnish, which does not necessarily have a native concept of virtual hosts (vhosts), whereas nginx and Apache default to logging by vhost.

Each Logstash configuration snippet shown below assumes that Apache and Varnish logs have already been processed using the COMBINEDAPACHELOG grok pattern, like so.

    filter {
      if [type] == "varnish" or [type] == "apache" {
        grok {
          match => [ "message", "%{COMBINEDAPACHELOG}" ]
        }
      }
    }

The following snippet was used to normalize Varnish's request headers to strip the https?:// scheme and the Host header, so that the request field in Apache and Varnish logs is exactly the same and any filtering of web logs can be performed with the vhost and logsource fields.

    filter {
      if [type] == "varnish" {
        grok {
          # Overwrite host for Varnish messages so that it's not always "loghost".
          match => [ "path", "/var/log/varnish_logs/%{HOST:logsource}" ]
        }
        # Grab the vhost and a "request" that matches Apache from the "request" variable for now.
        mutate {
          add_field => [ "full_request", "%{request}" ]
        }
        mutate {
          remove_field => "request"
        }
        grok {
          match => [ "full_request", "https?://%{IPORHOST:vhost}%{GREEDYDATA:request}" ]
        }
        mutate {
          remove_field => "full_request"
        }
      }
    }

As written, this snippet copies the request field into a new field called full_request, unsets the original request field, uses a grok filter to parse both the vhost and request fields out of the synthesized full_request field, and finally deletes full_request.

The original approach works, but it takes a number of steps and mutations. The grok filter has a parameter called overwrite that allows this configuration stanza to be considerably simplified. The overwrite parameter accepts an array of fields that grok should overwrite if it finds matches. By using overwrite, I was able to remove all of the mutate filters from my configuration, and the entire thing now looks like the following.

    filter {
      if [type] == "varnish" {
        grok {
          # Overwrite host for Varnish messages so that it's not always "loghost".
          # Grab the vhost and a "request" that matches Apache from the "request" variable for now.
          match => {
            "path" => "/var/log/varnish_logs/%{HOST:logsource}"
            "request" => "https?://%{IPORHOST:vhost}%{GREEDYDATA:request}"
          }
          overwrite => [ "request" ]
        }
      }
    }

Much simpler, isn’t it? 2 grok filters and 3 mutate filters have been combined into a single grok filter with two matching patterns and a single field that it can overwrite. Also note that this version of the configuration passes a hash into the grok filter. Every example I’ve seen just passes an array to grok, but the documentation for the grok filter states that it takes a hash, and this works fine.

Ensuring Field Types

Recent versions of Kibana have also gotten the useful ability to do statistics calculations on the current working dataset. So for example, you can have Kibana display the mean number of bytes sent or the standard deviation of backend response times (if you are capturing them – see my DrupalCon Amsterdam slides for more information on how to do this and how to normalize it between Apache, nginx, and Varnish.) Then, if you filter down to all requests for a single vhost or a set of paths, the statistics will update.

Kibana will only show this option for numerical fields, however, and by default any data that has been parsed with a grok filter will be a string. Converting string fields to other types is a much better use of the mutate filter. Here is an example of converting the bytes and the response code to integers using a mutate filter.

    filter {
      if [type] == "varnish" or [type] == "apache" {
        mutate {
          convert => {
            "bytes" => "integer"
            "response" => "integer"
          }
        }
      }
    }

Lessons Learned

Logstash is a very powerful tool, and small things like the grok overwrite parameter and the mutate convert parameter can help make your log processing configuration simpler and result in more usefulness out of your ELK cluster. Check out Chris Johnson’s post about adding MySQL Slow Query Logs to Logstash!

If you have any other useful Logstash tips and tricks, leave them in the comments!

Apr 07 2014

The first revision control system I ever used was called RCS. It was the precursor to CVS and stored all revision data locally. It was nifty but very limited and not suited for group development. CVS was the first shared revisioning system I used. It was rock solid, IMHO. But it had a few big problems, like the inability to rename or move files. Everything had to be deleted and re-added.

Since those days, I've used several other revisioning systems: Perforce, BitKeeper, ClearCase, Subversion and Git.

I'm tired of learning yet another system. I just want to know which horse is going to win the race for the foreseeable future and go all in.

That's where Google Trends comes in very handy. It quickly reveals that I need to bet on Git.

I just hope I can make it through the next 5 years or more before having to learn the next greatest solution to our shared problem of tracking code revisions.

Nov 09 2011

At Pantheon, when we look at challenges Drupal projects face, we don't only do what's worked well enough in existing deployments. We ask ourselves how we can transcend some of the challenges entirely. How Drupal stores files is no exception. So, when we started building the next generation of Pantheon (now launched), we looked at our options -- and then built a solution entirely focused on the needs of Drupal developers.

Existing options

We started with a survey of existing technology, both to see if something off-the-shelf would work for us and to inform the design of any system we might implement.

  • Local filesystem: This is how we delivered the original Pantheon. It's fast, simple, and reliable, but there's no way to have additional application servers share the same set of files or deliver good high-availability; this option was almost immediately off the table.
  • Enterprise SAN: We collected quotes for a few options, and they were almost all unscalable (for what we need to bring an awesome Drupal platform to everyone) or overpriced (almost all the capacity has to be bought up-front).
  • Cloud block storage: Amazon's Elastic Block Store (EBS) has notorious reliability issues in addition to common problems with "successful" snapshots actually working. EBS volumes are limited in scale and only get good performance when striped (which also breaks the ability to use snapshotting at all).
  • Cloud file storage (like S3): There's no reliable way to mount these and provide multi-level directories. They're also very high-latency and prone to brief access problems when used from a different datacenter.
  • GlusterFS: This was the most promising option, but it makes tradeoffs (ones that Drupal projects don't need) for fidelity to traditional filesystem semantics, like random block I/O, locking, and optimization for deep filesystem hierarchies. GlusterFS buildouts require adding a logical volume manager or a special filesystem to get snapshot capability. Client machines in a GlusterFS cluster have to have UIDs and GIDs synchronized, generally meaning use of LDAP (another possible point-of-failure) or other system management tools. Providing access to a Gluster filesystem on a server without it being part of the cluster requires exporting access over something like NFS, which requires making that export service itself highly available (HA), too. Resolving a split-brain across a cluster sometimes requires manual administrator intervention. Geo-distributed replication is only possible with a master/slave configuration; this complicates fail-over and limits the capability of datacenters housing the "slave" instances. There's some effort to simplify setups where one cluster serves multiple, isolated customers, like the GlusterFS derivative HekaFS, but they're quite immature.

What Drupal actually needs

If we set out to build yet another totally generic clustered filesystem, we'd be fools. Fortunately, Drupal's needs for files are pretty specific.

  • Most write operations create entire files. Drupal doesn't usually modify a byte range (in contrast to, say, database servers).
  • Most read operations hit the edge cache, like Varnish or a CDN. Read performance directly off the filesystem isn't critical.
  • Most files, once written, never change. (They might, however, get deleted.)
  • Availability is critical. If the edge cache misses, site users will see broken images, CSS, Javascript, and other problems if the filesystem goes down.
  • Consistency is less important than availability. It's better to allow access to the latest known version of a file (especially given how little a single file changes once created in Drupal) than fail.
  • When a Drupal site has multiple environments (dev, test, live, etc.), the vast majority of files will be identical between them. Changes, however, to dev or test should not affect live.
  • Most files are small (under 5MB), especially because the uploads tend to happen through Drupal.
  • Files are numerous. It's not totally uncommon for a directory in Drupal to have 10,000 files or more -- often images.

How storage works in Valhalla

Valhalla's storage architecture is more similar to systems like Amazon S3 than traditional filesystems backed by block devices.

Volumes

In filesystems, a volume is a unique namespace for managing directories and files. Everything Valhalla stores is broken into a series of volumes. Each environment (dev, test, live) of each site on Pantheon gets its own Valhalla volume. Valhalla volumes exist as wide rows in Cassandra that map individual paths to metadata, including a SHA-512 hash of the file content. We can pack all of this data into a single row because, in Cassandra, a row can scale to over two billion entries if the columns are tiny like our file and directory entries.

File content

Valhalla also manages the content of files by creating a row for each content hash. Each row contains a series of columns (each up to 5MB) named after the offsets into the content of the file. A file under 5MB simply has one column named "0". Because file content is addressed by its hash, multiple references to the same content (whether from the same or different volumes) are able to use the same content. This hash addressing automatically prevents duplicate storage (other than in Cassandra, where we keep three copies of everything already).
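The content-addressing idea is easy to see in miniature from a shell (illustrative only; Valhalla does this inside Cassandra, naming columns by byte offset rather than numbering chunks):

    # Name content by its SHA-512 hash, stored as 5MB chunks.
    HASH=$(sha512sum myfile.jpg | awk '{print $1}')
    mkdir -p "store/$HASH"
    split -b 5M -d myfile.jpg "store/$HASH/chunk-"
    # A second upload of identical content yields the same $HASH,
    # so the bytes are stored only once.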

Valhalla also uses a copy-on-write strategy for when files change; writing different content to an existing file causes the new content to get its own content row and the entry on the volume to be pointed at the new content. To clean up old content that isn't used in any volumes, Valhalla asynchronously counts references (using a special strategy to avoid race conditions in Cassandra) and deletes content that has achieved a stable state of zero references.

Cloning and snapshotting volumes

Valhalla doesn't just use content hash addressing and copy-on-write to save disk space. It also allows rapid cloning of volumes. When a developer on Pantheon uses the dashboard to "sync" files for an environment, Valhalla simply replaces the target volume (usually dev or test) with a cloned version of the source volume (usually live). Developers with a history of waiting on rsync will be happy to know this takes Valhalla under five seconds for ten thousand files. We can also use this functionality for snapshotting volumes by cloning to a destination that does not already exist.

Providing Drupal's files directory

But all this fancy storage on Pantheon is useless if Drupal can't read and write as it expects to its "files" directory. Instead of using the FUSE driver + re-export model of GlusterFS, the Valhalla server directly provides a WebDAV server written in Twisted Python. The server authenticates access and encrypts data by using Pantheon's platform-wide certificate infrastructure. Application servers running Pantheon sites mount each environment's volume using davfs2, which also caches the file content locally so that Drupal servers don't need to download a fresh copy if the file in Valhalla hasn't changed. A load-balancer fronts the whole Valhalla cluster to provide HA and distribute requests to each Valhalla server.
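Mounting a WebDAV volume with davfs2 is a one-liner on the client (a generic example, not Pantheon's actual endpoint):

    # Mount a WebDAV share as a local filesystem; davfs2 caches file
    # content locally. The URL here is a placeholder.
    sudo mount -t davfs https://valhalla.example.com/dev/files /mnt/files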

Creating backups

While Valhalla distributes three copies of every asset (whether volume entry or file content) internally, Pantheon still provides off-site file backups for ultimate assurance. We run backups on a server we call "Ellis Island" (because it handles imports to Pantheon, too) using Jenkins. When Jenkins is performing a backup, it mounts the target volume, creates a compressed tarball, and ships it off for storage in Amazon S3. Pantheon makes the archives available for developer download from the dashboard by using S3's "signed URL" facility.

What's next?

While we're glad we can provide the projects on Pantheon with a reliable, scalable solution to providing Drupal with a "files" directory across multiple application servers, there's more work to be done. Here are some ideas we're looking at. (These aren't formally on our roadmap, especially near-term.)

  • PHP streams support: Right now, Drupal 7 sites on Pantheon access Valhalla using the local filesystem, which transparently back-ends to Valhalla using davfs2. It would be more efficient to provide direct PHP stream access to Valhalla and skip using WebDAV when possible.
  • Edge integration: Currently, we route requests that miss Varnish through a custom node.js proxy called Styx (named because it takes requests to their final destination), then to nginx on an application server. If the request is for a static file, nginx then accesses the filesystem mounted using davfs2. It would be more efficient to have Styx directly access Valhalla.
  • CDN integration: A lot of the difficulty around CDN integration is synchronizing local assets with the CDN. Because Valhalla knows about file changes at a high level, it would provide a great integration point for auto-synchronizing changes on a volume to a CDN.
  • Desktop access: It's already possible to import archives of files and install various file-management modules on Drupal sites running on Pantheon, but we'll be looking into proper desktop access, probably with WebDAV or SFTP/SSHFS.
Oct 27 2011

It's alive! Today at Noon PDT we released a major update to the Pantheon dashboard and infrastructure. If you saw us demoing this at BADCamp last weekend and got an invite, you're welcome to use it now. If you are already on our sign-up list, you won't have to wait too much longer. We're finally going to be able to deliver Pantheon at scale. The new version is accessible at: dashboard.getpantheon.com.

[Screenshot: Sites overview]

This release represents a complete re-write of the Dashboard in Drupal 7, taking into account the feedback we've gotten from 100s of users this year, as well as extensive paper and in-person testing sessions with the new design. It's not perfect yet, but it's a big step forward, and we are excited to start adding new features.


It's also a whole new infrastructure, a "Drupal Borg" we call it, designed to run literally 1000s of sites on a next-generation grid-style architecture. That means we can handle everything from small personal sites up to large use-cases that would traditionally require a multi-server cluster, all without the need to manage complex hardware arrangements or go through painful infrastructure migrations.

Users spinning up new instances on the v2 system are free to start development now, and we should be ready for live launches within a month. For those still waiting to get access, take heart: with this new foundation, the rate of sending invite codes is going to start picking up quickly.

Feb 16 2008

The importance of project management tools is almost never fully appreciated. I am shocked at how common it is for a group of developers to be working without version control, ticket tracking, development documentation and so on. The very first thing I do when working with a new client is to make sure that they get these tools in place if they haven't already.

Those who are used to working without a complete set of project management tools never fail to appreciate the benefits of them once they are introduced. I consider it next to impossible for a team to work together without managing code and tasks in an efficient and highly organized way.

Hopefully you do not need to be sold on this idea and are using CVS or SVN to manage your project already. You likely have some sort of ticket system. It is a little less likely that you have both of these components integrated with each other.

When it comes to choosing a solution for project management software, a die-hard Drupal user has a dilemma. On one hand, Drupal seems as though it should be the perfect solution. It's fully customizable, has lots of nifty project management related modules and, most importantly, it's Drupal! Why would you not use it? "Eating your own dogfood" is the way to go, right? Meh...

Drupal is generally considered a content management system. Personally, I like to refer to it as a website management system. It is great at managing website related stuff like users, posts, permissions, categorization, and so on. Using contrib modules, you can customize and enhance this core functionality to almost no end. But at the end of the day, Drupal is designed to handle web content and the users that are accessing it. That's what a content management system is (and if content is king, that would make Drupal... well... God).

Managing a project, on the other hand, is a much different business from managing a website. Yes, you have many shared properties such as content and users. But the essence of project management involves things that have nothing to do with website management such as a revision controlled code base edited by multiple users, a need for efficient ticket management, and ideally full integration of everything. Essentials also include stuff like a nice repository browser, user management interface for repository access, fancy reporting for tickets, organization of tasks by milestone, date, person, severity, etc...

It's a very tall order. Yes, you can do all this in Drupal, but not very well. You can piece together something that sorta kinda resembles a project management solution, but in the end, you need to invest a relatively large amount of time to create something that is less than ideal and will require ongoing tweaking and modification. Unless your business is creating an effective project management solution in Drupal (something I dream of!), you should not be using Drupal for project management.

I'm a one man shop, and I do not have time to spare. I cannot justify spending any time at all kludging together a project management solution for a client when there are already far superior solutions available at low cost. I would much rather pay someone a few bucks a month and be done with it. Let them deal with SVN administration and enhancements; let me focus on my primary task which is building cool sites with Drupal.

While there are numerous project management related service providers out there (FogBugz, Basecamp, Beanstalk, to name a few), I want to talk about my personal favorite, Unfuddle. Unfuddle has taken obvious inspiration from the folks over at 37signals, innovators of the simple, clean, effective, it-just-works web application. Unfuddle is an instant project management solution that takes minutes to set up and costs a few dollars a month. The time you'll save in not having to set up SVN and manage SVN users alone makes it worth every penny.

What you get with a solution such as Unfuddle is a ready-to-use repository with integrated documentation, ticketing and reporting. It takes seconds to set up a new user account with permission levels fit for everyone from a developer (gimme root!) to a suit (look but don't touch).

From a single interface, you can browse code, tickets and documentation. Every component integrates with the others. You can even resolve a ticket with an SVN commit message, saving you the trouble of having to go and edit the ticket after your commit! Users can individually subscribe to whatever level of email notification they would like to receive, and how often. The developer can shut off all notifications while the manager gets a nice daily summary each morning of milestone completion progress, new tickets, added documentation and so on. The project manager can glance over one of the ticket reports and group tickets into milestones for reasonable short- vs long-term goals.
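For instance, a commit message along these lines can close the ticket as part of the commit (the exact keyword syntax depends on your Unfuddle configuration):

    svn commit -m "Fix login redirect loop. Closes #42."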

SVN comments link back to the tickets they are related to. Tickets contain links to the changesets that resolved them. Viewing these changesets, you can see a beautiful code diff and quickly see what fixed the problem. Senior team members can quickly and easily review code changes submitted by junior staff.

With tools like this available these days, it's just not worth spending any effort whatsoever on a lesser solution.
