Jul 31 2019

Pantheon is an excellent hosting service for both Drupal and WordPress sites. But to make their platform work and scale well, they have a number of limits built into the platform. These include process time limits and memory limits that are large enough for the vast majority of projects, but that can run you into trouble on large jobs from time to time.

For data loading and updates their official answer is typically to copy the database to another server, run your job there, and copy the database back onto their server. That’s fine if you can afford to freeze updates to your production site, set up a process to mirror changes into your temporary copy, or absorb some other project overhead that can be limiting and challenging. But sometimes that’s not an option, or the data load takes too long for that to be practical on a regular basis.

I recently needed to do a very large import of records into a Drupal database and so started to play around with solutions that would allow me to ignore those time limits. We were looking at about 50 million data writes, and initially the job took over a week to complete.

Since Drupal’s batch system was created to solve this exact problem it seemed like a good place to start. For this solution you need a file you can load and parse in segments, like a CSV file, which you can read one line at a time. It does not have to represent the final state, you can use this to actually load data if the process is quick, or you can serialize each record into a table or a queue job to actually process later.

One quick note about the code samples, I wrote these based on the service-based approach outlined in my post about batch services and the batch service module I discussed there. It could be adapted to a more traditional batch job, but I like the clarity the wrapper provides for breaking this back down for discussion.

The general concept here is that we upload the file and then progressively process it from within a batch job. The code samples below provide two classes to achieve this: first is a form that provides a managed file field, which creates a file entity that can be reliably passed to the batch processor. From there the batch service takes over and uses a bit of basic PHP file handling to load the file into a database table. If you need to do more than load the data into the database directly (say create complex entities or other tasks) you can set up a second phase to run through the values and do that heavier lifting.

To get us started the form includes this managed file:

   $form['file'] = [
     '#type' => 'managed_file',
     '#name' => 'data_file',
     '#title' => $this->t('Data file'),
     '#description' => $this->t('CSV format for this example.'),
     '#upload_location' => 'private://example_pantheon_loader_data/',
     '#upload_validators' => [
       'file_validate_extensions' => ['csv'],
     ],
   ];

The managed file form element automagically gives you a file entity, and the value in the form state is the id of that entity. This file will be temporary and have no references once the process is complete, so depending on your site setup the file will eventually be purged. Which all means we can pass all the values straight through to our batch processor:

$batch = $this->dataLoaderBatchService->generateBatchJob($form_state->getValues());

When the data file is small enough, a few thousand rows at most, you can load them all right away without the need of a batch job. But that runs into both time and memory concerns, and the whole point of this exercise is to avoid those limits. With this approach we are only limited by Pantheon’s upload file size. If the file size is too large you can upload the file via sftp and read directly from there, so while this is an easy way to load the file you have other options.

As we set up the file for processing in the batch job, we really need the file path, not the ID. The main reason to use the managed file is that we can reliably get the file path on a Pantheon server without really needing to know anything about where they have things stashed. Since we’re about to use generic PHP functions for file processing we need to know that path reliably:

$fid = array_pop($data['file']);
$fileEntity = File::load($fid);
$ops = [];

if (empty($fileEntity)) {
  $this->logger->error('Unable to load file data for processing.');
  return [];
}

$filePath = $this->fileSystem->realpath($fileEntity->getFileUri());
$ops = ['processData' => [$filePath]];

Now we have a file, and since it’s a CSV we can load a few rows at a time, process them, and then start again.

Our batch processing function needs to track two things in addition to the file: the header values and the current file position. So in the first pass we initialize the position to zero and then load the first row as the header. For every pass after that we need to find the point where we left off. For this we use generic PHP file functions for loading and seeking to the current location:

// Old-school file handling.
$path = array_pop($data);
$file = fopen($path, "r");
fseek($file, $filePos);

// Each pass we process 100 lines, if you have to do something complex
// you might want to reduce the run.
for ($i = 0; $i < 100; $i++) {
  $row = fgetcsv($file);
  if (!empty($row)) {
    $data = array_combine($header, $row);
    $rowData = [
      'col_one' => $data['field_name'],
      'data' => serialize($data),
      'timestamp' => time(),
    ];
    $row_id = $this->database->insert('example_pantheon_loader_tracker')
      ->fields($rowData)
      ->execute();

    // If you're setting up for a queue you include something like this.
    // $queue = $this->queueFactory->get('example_pantheon_loader_remap');
    // $queue->createItem($row_id);
  }
  else {
    break;
  }
}

$filePos = (float) ftell($file);
$context['finished'] = $filePos / filesize($path);
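Stripped of the Drupal specifics, the seek-and-resume pattern above is plain PHP. Here is a self-contained sketch of just that pattern; the function name and chunk size are my own for illustration, not from the original module:

```php
<?php

// Read one chunk of a CSV file, resuming from a saved byte offset.
// Assumes the first row of the file is a header. Returns the parsed
// rows, the new offset, and whether the whole file has been consumed.
function process_chunk(string $path, int $filePos, array &$header, int $limit = 100): array {
  $file = fopen($path, 'r');
  if ($filePos === 0) {
    // First pass: capture the header row.
    $header = fgetcsv($file);
  }
  else {
    // Later passes: jump back to where the last pass stopped.
    fseek($file, $filePos);
  }

  $rows = [];
  for ($i = 0; $i < $limit; $i++) {
    $row = fgetcsv($file);
    if (empty($row)) {
      break;
    }
    $rows[] = array_combine($header, $row);
  }

  // Remember where we stopped so the next pass can resume.
  $filePos = ftell($file);
  fclose($file);
  return [$rows, $filePos, $filePos >= filesize($path)];
}
```

Each call hands back the new offset and a done flag, which map directly onto the saved file position and $context['finished'] fraction in the batch job.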

The example code just dumps this all into a database table. This can be useful as a raw data loader if you need to add a large data set to an existing site that’s used for reference data or something similar.  It can also be used as the base to create more complex objects. The example code includes comments about generating a queue worker that could then run over time on cron or as another batch job; the Queue UI module provides a simple interface to run those on a batch job.

I’ve run this process for several hours at a stretch. Pantheon does have issues with system errors if a batch job is left to run for extreme lengths of time (I ran into problems on some runs after 6-8 hours of run time), so a prep load into the database followed by processing from a queue, or something else that’s easier to restart, has been more reliable.


Jun 25 2019

I recently had reason to switch over to using Docksal for a project, and on the whole I really like it as a good easy solution for getting a project specific Drupal dev environment up and running quickly. But like many dev tools the docs I found didn’t quite cover what I wanted because they made a bunch of assumptions.

Most assumed either I was starting a generic project or that I was starting a Pantheon specific project – and that I already had Docksal experience. In my case I was looking for a quick emergency replacement environment for a long-running Pantheon project.

Fairly recently Docksal added support for a project init command that helps set up projects hosted on Acquia, Pantheon, and Platform.sh, but pull init isn’t really well documented and requires a few preconditions.

Since I had to run a dozen Google searches and ask several friends for help to make it work, I figured I’d write it up.

Install Docksal

First follow the basic Docksal installation instructions for your host operating system. Once that completes, if you are using Linux as the host OS, log out and log back in (the installer just added your user to a group and you need that access to start up Docker).

Add Pantheon Machine Token

Next you need to have a Pantheon machine token so that terminus can run within the new container you’re about to create. If you don’t have one already follow Pantheon’s instructions to create one and save it someplace safe (like your password manager).

Once you have a machine token you need to tell Docksal about it. There are instructions for that (though they aren’t in the instructions for setting up Docksal with pull init): basically you add the token to your docksal.env file:
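The original post showed the variable here; I believe the name Docksal’s Pantheon integration expects is SECRET_TERMINUS_TOKEN (the value below is a placeholder for your own token):

```shell
# $HOME/.docksal/docksal.env
SECRET_TERMINUS_TOKEN="your-pantheon-machine-token"
```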


Also, if you are using Linux you should note that those instructions linked above say the file goes in $HOME/docksal/docksal.env, but you really want $HOME/.docksal/docksal.env (note the dot in front of docksal to hide the directory).

Setup SSH Key

With the machine token in place you are almost ready to run the setup command, just one more precondition. If you haven’t been using Docker or Docksal they don’t know about your SSH key yet, and pull init assumes it’s around. So you need to tell Docksal to load it by running:
fin ssh-key add  

If the whole setup is new, you may also need to create your key and add it to Pantheon. Once you have done that, if you are using a default SSH key name and location it should pick it up automatically (I have not tried this yet on Windows so mileage there may vary – if you know the answer please leave me a comment). It is also a good idea to make sure the key itself is working right by getting the git clone command from your Pantheon dashboard and trying a manual clone on the command line (delete the clone once it’s done; this is just to prove you can get through).

Run Pull Init

Now finally you are ready to run fin pull init: 

fin pull init --hostingplatform=pantheon --hostingsite=[site-machine-name] --hosting-env=[environment-name]

Docksal will now set up the site, maybe ask you a couple of questions, and clone the repo. It will leave out a couple of things you may need: database setup and .htaccess.

Add .htaccess as needed

Pantheon uses nginx. Docksal’s formula uses Apache. If you don’t keep a .htaccess file in your project (and while there is no reason not to, some Pantheon setups don’t keep extra stuff around) you need to put it back. If you don’t have a copy handy, copy and paste the content from the Drupal project repo: https://git.drupalcode.org/project/drupal/blob/8.8.x/.htaccess

Finally, you need to tell Drupal where to find the Docksal copy of the database. For that you need a settings.local.php file. Your project likely has a default version of this, which may contain things you may or may not want so adjust as needed. Docksal creates a default database (named default) and provides a user named…“user”, which has a password of “user”.  The host’s name is ‘db’. So into your settings.local.php file you need to include database settings at the very least:

$databases = array(
  'default' => array(
    'default' => array(
      'database' => 'default',
      'username' => 'user',
      'password' => 'user',
      'host' => 'db',
      'port' => '',
      'driver' => 'mysql',
      'prefix' => '',
    ),
  ),
);

With the database now fully linked up to Drupal, you can now ask Docksal to pull down a copy of the database and a copy of the site files:

fin pull db

fin pull files

In the future you can also pull down code changes:

fin pull code

Bonus points: do this on a server.

On occasion it’s useful to have all this setup on a remote server not just a local machine. There are a few more steps to go to do that safely.

First you may want to enable Basic HTTP Auth just to keep away from the prying eyes of Googlebot and friends. There are directions for that step (you’ll want the Apache instructions). Next you need to make sure that Docksal is actually listening to the host’s requests and that they are forwarded into the containers. Lots of blog posts say DOCKSAL_VHOST_PROXY_IP= fin reset proxy. But it turns out that fin reset proxy has been removed; instead you want:

DOCKSAL_VHOST_PROXY_IP= fin system reset.  

Next you need to add the vhost to the docksal.env file we were working with earlier:
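The exact variable was not shown in the original post; based on Docksal’s documentation the one you want should be VIRTUAL_HOST (the domain below is a placeholder for whatever DNS name you plan to point at the server):

```shell
# In the project's .docksal/docksal.env file.
VIRTUAL_HOST=mysite.example.com
```

Once fin up picks this up, requests for that hostname should be routed into the project’s containers.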


Run fin up to get Docksal to pick up the changes (this section is based on these old instructions).

Now you need to add either a DNS entry someplace, or update your machine’s /etc/hosts file to look in the right place (the public IP address of the host machine).

Anything I missed?

If you think I missed anything feel free to let me know. Windows users in particular, feel free to let me know about changes related to doing things there. I’ll try to work those in if I don’t get to figuring that out on my own in the near future.

Feb 18 2019

For the SC DUG meeting this month Will Jackson from Kanopi Studios gave a talk about using Docksal for local Drupal development. Will has the joy of working with some of the Docksal developers and has become an advocate for the simplicity and power Docksal provides.

[embedded content]

We frequently use these presentations to practice new presentations, try out heavily revised versions, and test out new ideas with a friendly audience. If you want to see a polished version check out our group members’ talks at camps and cons. So if some of the content of these videos seems a bit rough, please understand we are all learning all the time and we are open to constructive feedback.

If you would like to join us please check out our upcoming events on Meetup for meeting times, locations, and connection information.

Jan 28 2019

This fall the South Carolina Drupal User’s Group started using Zoom as part of all our meetings. Sometimes the technology has worked better than others, but when it works in our favor we record the presentations and share them when we can.

Chris Zietlow presented back in September about using Machine Learning to Improve UX.

[embedded content]

We frequently use these presentations to practice new presentations and test out new ideas. If you want to see a polished version hunt group members out at camps and cons. So if some of the content of these videos seems a bit rough please understand we are all learning all the time and we are open to constructive feedback.

If you would like to join us please check out our upcoming events on Meetup for meeting times, locations, and connection information.

Jan 20 2019

For this month’s South Carolina Drupal User Group I gave a talk about creating Batch Services in Drupal 8. As a quick side note we are trying to include video conference access to all our meetings so please feel free to join us even if you cannot come in person.

[embedded content]

Since Drupal 8 was first released I have been frustrated by the fact that Drupal 8 batch jobs were basically untouched from previous versions. There is nothing strictly wrong with that approach, but it has never felt right to me, particularly when doing things in a batch job that I might also want to do in another context – that work really should live in a service, and I should write those core pieces first. After several frustrating experiences trying to find a solution I like, I finally created a module that provides an abstract class that can be used to create a service that handles this problem more elegantly. The project also includes an example module to provide a sample service.

Some of the text in the slides got cut off by the Zoom video window, so I uploaded them to SlideShare as well:

Quick Batch Overview

If you are new to Drupal batches there are lots of articles around that go into details of traditional implementations, so this will be a super quick overview.

To define a batch you generate an array in a particular format – typically as part of a form submit process – and pass that array to batch_set(). The array defines some basic messages, a list of operations, a function to call when the batch is finished, and optionally a few other details. The minimal array would be something like:

  <?php
  // Setup final batch array.
  $batch = [
    'title' => 'Page title',
    'init_message' => 'Opening message',
    'operations' => [],
    'finished' => '\some\class\namespace\and\name::finishedBatch',
  ];
  batch_set($batch);

The interesting part should be in that operations array, which is a list of tasks to be run, but getting all your functions set up and the batch array generated can often be its own project.

Each operation is a function that implements callback_batch_operation(), plus the data to feed that function. The callbacks are just functions whose final parameter is an array reference, typically called $context. The function can either perform all the needed work on the provided parameters, or perform part of that work and update the $context['finished'] value to a number between 0 and 1. Once finished reaches 1 (or isn’t set at the end of the function) batch declares that task complete and moves on to the next one in the queue. Once all tasks are complete it calls the function provided as the finished value of the array that defined the batch.
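A minimal sketch of such an operation callback, in plain PHP with hypothetical names (a real one would be listed in the batch’s operations array), might look like this:

```php
<?php

// Hypothetical callback_batch_operation() implementation: processes
// $items fifty at a time, using $context['sandbox'] to persist its
// position between passes and $context['finished'] to report a 0-1
// completion fraction back to the batch runner.
function example_batch_op(array $items, array &$context) {
  if (!isset($context['sandbox']['position'])) {
    $context['sandbox']['position'] = 0;
    $context['results']['processed'] = 0;
  }

  $chunk = array_slice($items, $context['sandbox']['position'], 50);
  foreach ($chunk as $item) {
    // Real work on $item happens here.
    $context['results']['processed']++;
  }

  $context['sandbox']['position'] += count($chunk);
  $context['finished'] = $context['sandbox']['position'] / count($items);
}
```

The batch runner keeps calling the same operation until finished reaches 1, then moves on to the next operation in the list.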

The finish function implements callback_batch_finish() which means it accepts three parameters: $success, $results, and $operations: $success is true when all tasks completed without error; $results is an array of data you can feed into the $context array during processing; $operations is your operations list again.

Those functions are all expected to be static methods on classes or, more commonly, a function defined in a procedural code block imported from a separate file (which can be provided in the batch array).

My replacement batch service

It’s those blocks of procedural code and classes of nothing but static methods that bug me so much. Admittedly the batch system is convenient and works well enough to handle major tasks for lots of modules. But in Drupal 8 we have a whole suite of services and plugins that are designed to be run in specific contexts that batch does not provide by default. While we can access the Drupal service container and get the objects we need the batch code always feels clunky and out of place within a well structured module or project. What’s more I have often created batches that benefit from having the key tasks be functions of a service not just specific to the batch process.

So after several attempts to force batches and services to play nice together I finally created this module to force a marriage. There are places that required a bit of compromise, but I think I have most of that contained in the abstract class so I don’t have to worry about it on a regular basis. That makes my final code with complex logic and processing far cleaner and easier to maintain.

The Batch Service Interface module provides an interface and an abstract class that implements parts of it: abstract class AbstractBatchService implements BatchServiceInterface. The developer extending that class only needs to define a service that handles generating a list of operations that call local methods of the service and the finish batch function (also as a local method). Nearly everything else is handled by the parent class.

The implementation I provided in the example submodule ends up being four simple methods. Even in more complex jobs all the real work could be contained in a method that is isolated from the oddities of batch processing.


namespace Drupal\batch_example;

use Drupal\node\Entity\Node;
use Drupal\batch_service_interface\AbstractBatchService;

/**
 * Class ExampleBatchService logs the name of nodes with id provided on form.
 */
class ExampleBatchService extends AbstractBatchService {

  /**
   * Must be set in child classes to be the service name so the service can
   * bootstrap itself.
   *
   * @var string
   */
  protected static $serviceName = 'batch_example.example_batch';

  /**
   * Data from the form as needed.
   */
  public function generateBatchJob($data) {
    $ops = [];
    for ($i = 0; $i < $data['message_count']; $i++) {
      $ops[] = [
        'logMessage' => ['MessageIndex' => $i + 1],
      ];
    }

    return $this->prepBatchArray($this->t('Logging Messages'), $this->t('Starting Batch Processing'), $ops);
  }

  public function logMessage($data, &$context) {
    // Assumed implementation: the actual logging call was elided in the
    // original post.
    \Drupal::logger('batch_example')->info($this->getRandomMessage());

    if (!isset($context['results']['message_count'])) {
      $context['results']['message_count'] = 0;
    }
    $context['results']['message_count']++;
  }

  public function doFinishBatch($success, $results, $operations) {
    drupal_set_message($this->t('Logged %count quotes', ['%count' => $results['message_count']]));
  }

  public function getRandomMessage() {
    $messages = [
      // List of messages to select from.
    ];

    return $messages[array_rand($messages)];
  }

}

There is the oddity that you have to tell the service its own name so it can bootstrap itself. If there is a way around that I’d love to know it. But really only one line of code is a bit strange; everything else is now fairly clear call and response.

One of the nice upsides to this solution is you could write tests for the service that look and feel just like any other service’s tests. The methods can all be called directly, and you are not trying to run tests against a procedural code block or a class that is nothing but static methods.

I would love to hear ideas about ways I could make this solution stronger. So please drop me a comment or send me a patch.

Related core efforts

There is an effort to try to do similar things in core, but they look like they have some distance left to travel. Obviously once that work is complete it is likely to be better than what I have created, but in the meantime my service allows for a new level of abstraction without waiting for core’s updates to be complete.

Nov 30 2018

In software just about all project management methodologies get labeled one of two things: Agile or Waterfall. There are formal definitions of both labels, but in practice few companies stick to those definitions particularly in the world of consulting. For people who really care about such things, there are actually many more methodologies out there but largely for marketing reasons we call any process that’s linear in nature Waterfall, and any that is iterative we call Agile.

Classic cartoon of a tree swing built poorly because every team saw it differently. Failure within project teams leading to disasters is so common and basic that not only is there a cartoon about it, but there is a web site dedicated to generating your own versions of that cartoon (http://projectcartoon.com/).

Among consultants I have rarely seen a company that is truly 100% agile or 100% waterfall. In fact I’ve rarely seen a shop that’s close enough to the formal structures of those methodologies to really accurately claim to be one or the other. Nearly all consultancies are some kind of blend of a linear process with stages (sometimes called “a waterfall phase” or “a planning phase”) followed by an iterative process with lots of non-developer input into partially completed features (often called an “agile phase” or “build phase”). Depending on the agency they might cut up the planning into the start of each sprint or they might move it all to the beginning as a separate project phase. Done well it can allow you to merge the highly complex needs of an organization with the predefined structures of an existing platform. Done poorly it can look like you tried to force a square peg into a round hole. You can see evidence of this around the internet in the articles trying to help you pick a methodology and in the variations on Agile that have been attempted to try to adapt the process to the reality many consultants face.

In 2001 the Agile Manifesto changed how we talk about project management. It challenged standing doctrine about how software development should be done and moved away from trying to mirror manufacturing processes. As the methodology around agile evolved, and proved itself impressively effective for certain projects, it drew adherents and advocates who preach Agile and Scrum structures as rigid rules to be followed. Meanwhile older project methodologies were largely relabeled “Waterfall” and dragged through the mud as out of date and likely to lead to project failure.

But after all this time Agile hasn’t actually won as the only truly useful process because it doesn’t actually work for all projects and all project teams. Particularly among consulting agencies that work on complex platforms like Drupal and Salesforce, you find that regardless of the label the company uses they probably have a mix of linear planning and iterative development – or they fail a lot.

Agile works best when you start from scratch and you have a talented team trying to solve a unique problem. Anytime you are building on a mature software platform you are at least a few hundred thousand hours into development before you have your first meeting. These platforms have large feature sets that deliver lots of the functionality needed for most projects just through careful planning and basic configuration – that’s the whole point of using them. So on any enterprise scale data system you have to do a great deal of planning before you start creating the finished product.

If you don’t plan ahead enough to have a generalized, but complete, picture of what you’re building you will discover very large gaps after far too many pieces have been built to elegantly close them, or your solution will have been built far more generically than needed – introducing significant complexity for very little gain. I’ve seen people re-implement features of Drupal within other features of Drupal just to deal with changing requirements or because a major feature was skipped in planning. So those early planning stages are important, but they also need to leave space for new insights into how best to meet the client’s need and discovery of true errors after the planning stage is complete.

Once you have a good plan the team can start to build. But you cannot simply hand a developer the design and say “do this” because your “this” is only as perfect as you are and your plan does not cover all the details. The developer will see things missed during planning, or have questions that everyone else knows but you didn’t think to write down (and if you wrote down every answer to every possible question, you wrote a document no one bothered to actually read). The team needs to implement part of the solution, check with the client to make sure it’s right, adjust to mistakes, and repeat – a very agile-like process that makes waterfall purists uncomfortable because it means the plan they are working from will change.

In all this you also have a client to keep happy and help make successful – that’s why they hired someone in the first place. Giving them a plan that shows you know what they want reassures them early in the project that you share their vision for the final solution. Being able to see that plan come together while having chances to refine the details allows you to deliver the best product you are able.

Agile was supposed to fix all our problems, but didn’t. The methodologies used before were supposed to prevent all the problems that agile was trying to fix, but didn’t. But using waterfall-like planning at the start of your project with agile-ish implementation you can combine the best of both approaches giving you the best chances for success.  We all do it, it is about time we all admit it is what we do.

Cartoon of a developer reviewing all the things he’s done: check technical specs, unit tests, configuration, permissions, API updates, and then saying “Just one small detail: I need to code it.” Cartoon from CommitStrip.
Aug 29 2017

Putting this here because I didn’t see it mentioned elsewhere and it might be useful for others. Thinking about the history of the Islandora solution packs for different media types, the Basic Image Solution Pack was probably the first one written. Displaying a JPEG image, after all, is — well — pretty basic. I’m working on an Islandora project where I wanted to add a viewer to Basic Image objects, but I found that the solution pack code didn’t use them. Fortunately, Drupal has some nice ways for me to intercede to add that capability!

Step 1: Alter the /admin/islandora/solution_pack_config/basic_image form

The first step is to alter the solution pack admin form to add the Viewers panel. Drupal gives me a nice way to alter forms with hook_form_FORM_ID_alter().
/**
 * Implements hook_form_FORM_ID_alter().
 *
 * Add a viewers panel to the basic image solution pack admin page.
 */
function islandora_ia_viewers_form_islandora_basic_image_admin_alter(&$form, &$form_state, $form_id) {
  module_load_include('inc', 'islandora', 'includes/solution_packs');
  $form += islandora_viewers_form('islandora_image_viewers', 'image/jpeg', 'islandora:sp_basic_image');
}

Step 2: Insert ourselves into the theme preprocess flow

The second step is a little trickier, and I’m not entirely sure it is legal. We’re going to set a basic image preprocess hook and in it override the contents of $variables['islandora_content']. We need to do this because that is where the viewer sets its output.
/**
 * Implements hook_preprocess_HOOK(&$variables).
 *
 * Inject ourselves into the islandora_basic_image theme preprocess flow.
 */
function islandora_ia_viewers_preprocess_islandora_basic_image(array &$variables) {
  $islandora_object = $variables['islandora_object'];
  module_load_include('inc', 'islandora', 'includes/solution_packs');
  $params = array();
  $viewer = islandora_get_viewer($params, 'islandora_image_viewers', $islandora_object);
  if ($viewer) {
    $variables['islandora_content'] = $viewer;
  }
}

I have a sneaking suspicion that the hooks are called in alphabetical order, and since islandora_ia_viewers comes after islandora_basic_image it all works out. (We need our function to be called after the Solution Pack’s preprocess function so our 'islandora_content' value is the one that is ultimately passed to the theming function.) Still, it works!
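If you would rather not depend on alphabetical ordering, hook_theme_registry_alter() lets a module reorder the 'preprocess functions' list for a theme hook explicitly. The reordering itself is simple array juggling; this is a sketch with a helper name of my own, which a real module would call from its hook_theme_registry_alter() implementation:

```php
<?php

// Sketch: move one preprocess function to the end of a theme hook's
// 'preprocess functions' list so it runs after the solution pack's.
// In a real module, $registry is the array handed to
// hook_theme_registry_alter().
function example_move_preprocess_last(array &$registry, string $hook, string $function) {
  $functions = &$registry[$hook]['preprocess functions'];
  $key = array_search($function, $functions);
  if ($key !== FALSE) {
    unset($functions[$key]);
    $functions[] = $function;
    $functions = array_values($functions);
  }
}
```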

Jul 05 2014


Even though Drupal 7 core fell short of a proper way of handling its brand new entity system (we currently rely on the great Entity module for that), it did give us EntityFieldQuery. For those of you who don’t know, EntityFieldQuery is a very powerful querying class used to search Drupal entities programmatically (nodes, users, etc).

It provides a number of methods that make it easy to query entities based on conditions such as field values or class properties. If you don’t know how it works, feel free to check out this documentation page or this great tutorial on the subject.

In this article I am going to talk about what we have in Drupal 8 for querying entities. There is no more EntityFieldQuery, but there’s an entity.query service that will instantiate a query object for a given entity type (and that implements the \Drupal\Core\Entity\Query\QueryInterface). We can access this service statically through the \Drupal namespace or using dependency injection.

First up, we’ll look at querying node entities and then we’ll see how to load them. The same techniques will work with other content entities as well (users, comments etc), but also with configuration entities, and that’s really cool.

The entity query service

As mentioned, there are two ways we can access the entity.query service that we use for querying entities. Statically, we can do this:

$query = \Drupal::entityQuery('node');

Instead of node, we can specify any other entity type machine name and what we get inside the $query variable is the query object for our entity type. The entityQuery() static method on the \Drupal namespace is a shortcut for doing so using the entity.query service.

Alternatively (and the highly recommended approach) is to use dependency injection.

If you have access to the container, you can load the service from there and then get the right query object:

$entity_query_service = $container->get('entity.query');
$query = $entity_query_service->get('node');

As you can see, we use the get() method on the entity.query service to instantiate a query object for the entity type whose machine name is passed as a parameter.

Querying entities

Let’s illustrate a couple of examples of querying for node entities using this object.

A very simple query that returns the published nodes:

$query = \Drupal::entityQuery('node')
    ->condition('status', 1);
$nids = $query->execute();

$nids will be an array of entity ids (in our case node ids) keyed by the revision ids (if there is revisioning enabled for the entity type) or the entity ids if not. Let’s see an example in which we add more property conditions as well as field conditions:

$query = \Drupal::entityQuery('node')
    ->condition('status', 1)
    ->condition('changed', REQUEST_TIME, '<')
    ->condition('title', 'cat', 'CONTAINS')
    ->condition('field_tags.entity.name', 'cats');

$nids = $query->execute();

In this query, we retrieve the node ids of all the published nodes that have been last updated before the current time, that have the word cat inside their title and that have a taxonomy term called cats as a reference in the field_tags.

As you can see, there is no more distinction between propertyCondition and fieldCondition (as there is in D7 with EntityFieldQuery). Additionally, we can include conditions based on referenced entities by tacking entity.(column) onto the entity reference field name.

An important thing to note is that we also have the langcode parameter in the condition() method by which we can specify what translation of the node should be included in the query. For instance, we can retrieve node IDs that contain a specific value inside of a field in one language but another value inside the same field for another language.
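As a quick sketch of that (the French langcode and the search string here are hypothetical, assuming the site has French translations enabled), the langcode is passed as the fourth argument to condition():

```php
// Illustrative only: match published nodes whose French
// title translation contains the string 'chat'.
$query = \Drupal::entityQuery('node')
    ->condition('status', 1)
    ->condition('title', 'chat', 'CONTAINS', 'fr');
$nids = $query->execute();
```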

For more information on the condition() method you should consult the API documentation.

The next thing we are going to look at is using condition groups (both AND and OR) for more powerful queries:

$query = \Drupal::entityQuery('node')
    ->condition('status', 1)
    ->condition('changed', REQUEST_TIME, '<');

$group = $query->orConditionGroup()
    ->condition('title', 'cat', 'CONTAINS')
    ->condition('field_tags.entity.name', 'cats');

$nids = $query->condition($group)->execute();

Above, we altered our previous query so as to retrieve nodes that either have the cat string in their title or have a reference to the term called cats in their field_tags field. We did so by creating an orConditionGroup object that we then pass to the query as a condition. We can also group multiple conditions together within an andConditionGroup.

There are many other methods on the QueryInterface that can extend the query (such as for sorting, range, etc). I encourage you to check them out in the documentation and experiment with them. For now, though, let’s take a quick look at what to do with the result set.
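As a hedged sketch of a couple of those QueryInterface methods (sort() and range()) combined with what we have seen so far:

```php
// The ten most recently created published nodes.
$nids = \Drupal::entityQuery('node')
    ->condition('status', 1)
    ->sort('created', 'DESC')
    ->range(0, 10)
    ->execute();
```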

Loading entities

As I mentioned above, the execute() method on the query object we’ve been working with returns an array of entity IDs. We now have to load those entity objects and work with them. How do we do that?

In Drupal 7 we had the entity_load() function to which we passed an array of IDs and that would return an array of objects. In Drupal 8, this helper function is maintained and you can use it pretty much in the same way, except that it loads only one entity at a time:

$node = entity_load('node', $nids[1]);

And the return value is a node object. To load multiple nodes, you can use the entity_load_multiple() function:

$nodes = entity_load_multiple('node', $nids);

Which then returns an array of entity objects keyed by their ids.
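Once loaded, each value in that array is a full entity object, so an illustrative loop over the results could look like this:

```php
$nodes = entity_load_multiple('node', $nids);
foreach ($nodes as $id => $node) {
  // label() returns the human-readable label (the title, for nodes).
  drupal_set_message($node->label());
}
```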

A bonus nugget of information is that both of these functions are wrappers for the storage manager of the entities in question. They basically retrieve the storage manager statically and then call the load() and loadMultiple() methods, respectively, on it:

Statically, you could do the same like this:

$node_storage = \Drupal::entityManager()->getStorage('node');

// Load multiple nodes.
$nodes = $node_storage->loadMultiple($nids);

// Load a single node.
$node = $node_storage->load($nids[1]);

But better yet, you could use dependency injection and retrieve the storage class from the container:

$node_storage = $container->get('entity.manager')->getStorage('node');

And then proceed with the loading. Using dependency injection is usually the recommended way to go when it’s possible, i.e. when working within a class. This makes it easier to test your class and better decouples it from the rest of the application.


In this article we’ve seen how to work with querying and loading entities in Drupal 8. There has been an overhaul of the D7 EntityFieldQuery class that turned into a robust API for querying both content and configuration entities. We’ve looked at querying content entities but the system works just the same with config entities. And that is a bit of a win for the new Drupal 8 entity system.

We’ve also seen how to load entities based on the IDs resulting from these queries and what is actually behind the wrapper functions that perform these operations. Next up, we are going to look at defining our own content entity type in Drupal 8. For a refresher on how we do it in Drupal 7, you can check out these Sitepoint articles on the subject.

Jun 23 2014
Jun 23

In this article we will continue exploring the powers of Views and focus on how to use relationships, contextual filters and rewrite field outputs. In a previous tutorial I showed you how to create a new View and perform basic customizations for it. We’ve seen how to select a display format, which fields to show and how to filter and sort the results.

In this article we will go a bit further and see what relationships and contextual filters are – the two most important options found under the Advanced fieldset at the right of the View edit page. Additionally, we’ll rewrite the output of our fields and combine their values into one.

To begin with, I have a simple article View that just shows the titles. Very easy to set up if you want to follow along. And there are three things I want to achieve going forward:

  1. Make it so that the View also shows the username of the article author
  2. Make it so that the View shows only articles authored by the logged-in user
  3. Make it so that the author username shows up in parentheses after the title


First, let’s have the View include the author of the articles. If the View is displaying fields (rather than view modes or anything else), all we have to do is find the field with the author username, right? Wrong. The problem is the following: the node table only contains a reference to the user entity that created the node (in the form of a user ID – uid). So that’s pretty much all we will find if we look for user related fields: Content: Author uid.

What we need to do is use a relationship to the user entity found in the user table. Relationships are basically a fancy way of saying that table A (in our case node) will join with table B (in our case user) in order to retrieve data related to it from there (such as the name of the user and many others). And the join will happen in our case on the uid field which will match in both tables.

So let’s go ahead and add a new relationship of the type Content: Author. Under Identifier, we can put a descriptive name for this relationship like Content Author. The rest we can leave as default.

Now if you go and add a new field, you’ll notice many others that relate to the user who authored the content. Go ahead and add the User: Name field. In its settings, you’ll see a Relationship select list at the top where the relationship identifier we just specified is automatically selected. That means this field is being pulled in using that relationship (or table join). Saving the field will now add the username of the author, already visible in the View preview.


You can also chain relationships. For instance, if the user entity has a reference to another table using a unique identifier, you can add a second relationship. It will use the first one and bring in fields from that table. The end result is that the View can show fields related to the node through the user who authored it, pulled not from the user table itself but from another table connected to the author. And you can keep joining tables like this.

Contextual filters

Contextual filters are similar to regular filters in that you can use mainly the same fields to filter the records on. Where contextual filters differ greatly is that you do not set the filtering value when you create the View, but it is taken from context.

There are many different contexts a filter value can come from, but mainly it comes from the URL. However, you can instruct Views to look elsewhere for contexts as well – such as the ID of the logged in user.

What we’ll do now is add a contextual filter so that the View shows only the articles authored by the logged in user. So go ahead and add a new contextual filter of the type Content: Author uid. Next, under the WHEN THE FILTER VALUE IS NOT IN THE URL fieldset, select the Provide default value radio. Our goal here is to have Views look elsewhere if it does not find the user ID in the URL.

contextual filters

You then have some options under the Type select list, where you should choose User ID from logged in user. This will make Views take the ID of the user that is logged in and pass it to the View as a filter. The rest you can leave as is and save the filter. You’ll immediately notice in your preview that only articles authored by you show up. The filtering is taking place dynamically. If you log in with another user account, you should see only the articles authored by that user account.

A great thing about contextual filters is that if you are displaying a View programmatically in a custom module, you can pass the filtering value in code, which opens the door to many possibilities.
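For example, a minimal sketch of that (the View machine name articles and the display ID default are hypothetical here) using the views_embed_view() helper:

```php
// Arguments after the display ID are passed to the View as
// contextual filter values -- here, a hard-coded author uid of 1.
$output = views_embed_view('articles', 'default', 1);
```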

Rewriting fields

The last thing we will do in this tutorial is look at rewriting fields in order to concatenate their values. We will illustrate this technique by changing the title field to include the author username in parentheses.

We’ll start by rearranging the order of the fields and move the title to be the last one showing. The reason we want to do this is that when you rewrite fields, you can use tokens that get values only from fields that are added before the one being rewritten. And since we want to rewrite the title field, we want the token for the username value to be present so we need to move it before the title field.

Now that the title field is last, edit the author username field and uncheck the box Create a label and then check the box Exclude from display. You can now save the field. The reason we are excluding this field from being displayed in our View is so that we don’t duplicate it once we concatenate it to the title field.

rewriting fields

Next, edit the title field and under REWRITE RESULTS, check the box Rewrite the output of this field. A new textarea should appear below where we will write the new contents of this field. If you write some gibberish in there and save the field, you’ll notice the title gets replaced by that gibberish.

Below this textarea, you’ll notice also some REPLACEMENT PATTERNS. These represent tokens of all the fields in the View loaded before this one (and including this one as well). So if you followed along, you’ll see there [name] and [title], among others.

What we need to do now is put these tokens in this box, wrapped with the text or markup we want. Since we want the username to appear in parentheses after the node title, we can add the following to the text box to achieve this:

[title] ([name])

Save the field and check out the result. Now the author username should appear in parentheses. However, it’s still not perfect. We left the title field’s Link this field to the original piece of content box checked, and this breaks the output a bit because the username also links to the user profile page. What we want is a clean link on the node title and, in parentheses (which themselves do not link to anything), the username linking to the user profile page.

So first up, add a new field called Content: Path (the path to the node). Make sure you exclude it from display, remove its label and move it before the title field. Then, edit the title field, uncheck the Link this field to the original piece of content box and replace the REWRITE RESULTS text with this:

<a href="[path]">[title]</a> ([name])

The [path] token is available from the new field we just added. After you save, the preview should already show a much cleaner display of node titles with usernames in parentheses.


In this tutorial we’ve looked at three main aspects of building Views in Drupal 7: relationships, contextual filters and rewriting fields. We’ve seen how with the use of relationships we can use information also from related entities, not just those on the base table a View is built on. Contextual filters are great for when the View needs to display content dynamically depending on various contextual conditions (such as a URL or logged-in user). Lastly, we’ve learned how to rewrite fields and build more complex ones with values taken from multiple fields. As you can see, this technique is very powerful for theming Views as it allows us to output complex markup.

Views is pretty much the most popular Drupal module and it is highly complex. Despite its complexity, building views as a site administrator is very easy. All you need to understand is a few basic concepts and you are good to go. Developing for Views to extend its functionality or expose data to it is also an enjoyable experience. If you’d like to know more about that, you can read my tutorial on exposing your own custom module table to Views right here on Sitepoint.com.

Jun 18 2014
Jun 18

How to Build a Drupal 8 Module

In the previous article on Drupal 8 module development, we’ve looked at creating block types and forms. We’ve seen that blocks are now reusable and how everything we need to do for defining block types happens in one single class. Similarly, form generation functions are also grouped under one class with specific methods performing tasks similar to what we are used to in Drupal 7.

In this tutorial, I will continue where we left off. I will illustrate how we can turn our DemoForm into a form used to store a value through the Drupal 8 configuration system. Following that, we will talk a bit about the service container and dependency injection by way of illustration.

Don’t forget that you can check out this repository if you want to get all the code we write in this tutorial series.

When we first defined our DemoForm, we extended the FormBase class which is the simplest implementation of the FormInterface. However, Drupal 8 also comes with a ConfigFormBase that provides some additional functionality which makes it very easy to interact with the configuration system.

What we will do now is transform DemoForm into one which will be used to store the email address the user enters. The first thing we should do is replace the extended class with ConfigFormBase (and of course use it):

use Drupal\Core\Form\ConfigFormBase;

class DemoForm extends ConfigFormBase {

Before we move on to changing other things in the form, let’s understand a bit how simple configuration works in Drupal 8. I say simple because there are also configuration entities that are more complex and that we will not cover today. As it stands now, configuration provided by modules (core or contrib) is stored in YAML files. On enabling a module, this data gets imported into the database (for better performance while working with it). Through the UI we can change this configuration which is then easily exportable to YAML files for deployment across different sites.

A module can provide default configuration in a YAML file located in the config/install folder in the module root directory. The convention for naming this file is to prefix it with the name of the module. So let’s create one called demo.settings.yml. Inside this file, let’s paste the following:

demo:
  email_address: [email protected]

This is a nested structure (like an associative array in PHP). Under the demo key, we have another key|value pair. To access such nested values we usually use a dot (.) — in our case, demo.email_address.
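To make the dot notation concrete, this is roughly how the nested value gets read back out later (here via the global \Drupal class; inside our form we will use the injected config() helper instead):

```php
// 'demo.settings' names the configuration object (the file);
// 'demo.email_address' is the nested key inside it.
$email = \Drupal::config('demo.settings')->get('demo.email_address');
```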

Once we have this file in place, an important thing you need to remember is that this file gets imported only when the module is installed. So go ahead and reinstall it. And now we can turn back to our form and go through the methods that need adapting one by one.

This is how the buildForm() method should look like now:

public function buildForm(array $form, array &$form_state) {
  $form = parent::buildForm($form, $form_state);
  $config = $this->config('demo.settings');
  $form['email'] = array(
    '#type' => 'email',
    '#title' => $this->t('Your .com email address.'),
    '#default_value' => $config->get('demo.email_address'),
  );
  return $form;
}

First of all, as opposed to FormBase, the ConfigFormBase class implements this method as well in order to add elements to the form array (a submit button). So we can use what the parent did before adding our own elements.

Now for the configuration part. Drupal 8 provides a Config object that we can use to interact with the configuration. Some classes already have it available through dependency injection. ConfigFormBase is one such class.

As you can see, we are using the config() method of the parent class to retrieve a Config object populated with our demo.settings simple configuration. Then, for the #default_value of the email form element, we use the get() method of the Config object to retrieve the value of the email address.

Next, we only need to change the submit handler because the validateForm() method can stay the same for now:

public function submitForm(array &$form, array &$form_state) {
  $config = $this->config('demo.settings');
  $config->set('demo.email_address', $form_state['values']['email']);
  $config->save();
  return parent::submitForm($form, $form_state);
}

In this method we first retrieve the Config object for our configuration (like we did before). Then, we use its set() method to change the value of the email_address to the value the user submitted. Then we use the save() method to save the configuration. Lastly, we extend the parent submit handler because it does contain some functionality (in this case it sets a Drupal message to the screen).

And that’s pretty much it. You can clear the cache and try it out. By submitting a new email address, you are storing it in the configuration. The module demo.settings.yml file won’t change of course, but you can go and export the demo.settings configuration and import it into another site.

The service container and dependency injection

The next thing we are going to look at is the service container. The idea behind services is to split functionality into reusable components. Therefore a service is a PHP class that performs some global operations and that is registered with the service container in order to be accessed.

Dependency injection is the way through which we pass objects to other objects in order to ensure decoupling. Each service needs to deal with one thing and if it needs another service, the latter can be injected into the former. But we’ll see how in a minute.

Going forward, we will create a very simple service and register it with the container. It will only have one real method that returns a simple value. Then, we will inject that service as a dependency to our DemoController and make use of the value provided by the service.

In order to register a service, we need to create a demo.services.yml file located in the root of our module, with the following contents:

services:
    demo.demo_service:
        class: Drupal\demo\DemoService

The file naming convention is module_name.services.yml.

The first line creates an array of services. The second line defines the first service (called demo_service, prefixed by the module name). The third line specifies the class that will be instantiated for this service. Next, we create the DemoService.php class file in the src/ folder of our module. This is what my service looks like (it does nothing much; it’s just to illustrate how to use it):

<?php

/**
 * @file
 * Contains Drupal\demo\DemoService.
 */

namespace Drupal\demo;

class DemoService {

  protected $demo_value;

  public function __construct() {
    $this->demo_value = 'Upchuk';
  }

  public function getDemoValue() {
    return $this->demo_value;
  }

}
No need to explain anything here as it’s very basic. Next, let’s turn to our DemoController and use this service. There are two ways we can do this: accessing the container globally through the \Drupal class or using dependency injection to pass an object of this class to our controller. Best practice says we should do it the second way, so that’s what we’ll do. But sometimes you will need to access a service globally. For that, you can do something like this:

$service = \Drupal::service('demo.demo_service');

And now $service is an object of the class DemoService we just created. But let’s see how to inject our service in the DemoController class as a dependency. I will explain first what needs to be done, then you’ll see the entire controller with all the changes made to it.

First, we need access to the service container. With controllers, this is really easy. We can extend the ControllerBase class which gives us that in addition to some other helpers. Alternatively, our Controller can implement the ContainerInjectionInterface that also gives us access to the container. But we’ll stick to ControllerBase so we’ll need to use that class.

Next, we need to also use the Symfony 2 ContainerInterface as a requirement of the create() method that instantiates another object of our controller class and passes to it the services we want.

Finally, we’ll need a constructor to get the passed service objects (the ones that create() returns) and assign them to properties for later use. The order in which the objects are returned by the create() method needs to be reflected in the order they are passed to the constructor.

So let’s see our revised DemoController:

<?php

/**
 * @file
 * Contains \Drupal\demo\Controller\DemoController.
 */

namespace Drupal\demo\Controller;

use Drupal\Core\Controller\ControllerBase;
use Symfony\Component\DependencyInjection\ContainerInterface;

/**
 * DemoController.
 */
class DemoController extends ControllerBase {

  protected $demoService;

  /**
   * Class constructor.
   */
  public function __construct($demoService) {
    $this->demoService = $demoService;
  }

  /**
   * {@inheritdoc}
   */
  public static function create(ContainerInterface $container) {
    return new static(
      $container->get('demo.demo_service')
    );
  }

  /**
   * Generates an example page.
   */
  public function demo() {
    return array(
      '#markup' => t('Hello @value!', array('@value' => $this->demoService->getDemoValue())),
    );
  }

}
As you can see, all the steps are there. The create() method creates a new instance of our controller class passing to it our service retrieved from the container. And in the end, an instance of the DemoService class gets stored in the $demoService property, and we can use it to call its getDemoValue() method. And this value is then used in the Hello message. Clear your cache and give it a try. Go to the demo/ path and you should see Hello Upchuk! printed on the page.

I’m sure you can see the power of the service container as we can now write decoupled functionality and pass it where it’s needed. I did not show you how, but you can also declare dependencies when you register services. This means that when Drupal instantiates a service object, it will do so for all its dependencies as well, and pass them to its constructor. You can read more about how to do that on this documentation page.
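As a quick hedged sketch of what such a declaration could look like in demo.services.yml (the demo.other_service entry and its class are hypothetical, added only to illustrate the arguments key):

```yaml
services:
    demo.demo_service:
        class: Drupal\demo\DemoService
    demo.other_service:
        class: Drupal\demo\OtherService
        # The '@' prefix tells the container to inject the
        # demo.demo_service service into OtherService's constructor.
        arguments: ['@demo.demo_service']
```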


In this article we’ve looked at a lot of cool stuff. We’ve seen how the configuration system manages simple configuration and what we have available form-wise for this. I do encourage you to explore how the ConfigFormBase is implemented and what you have available if you extend it. Additionally, you should play around in the UI with importing/exporting configuration between sites. This will be a great improvement for the deployment process from now on.

Then, we looked at services, what they are and how they work. A great way of maintaining reusable and decoupled pieces of functionality accessible from anywhere. And I do hope the concept of dependency injection is no longer so scary (if it was for you). It is basically the equivalent of passing parameters to procedural functions, but done using constructor methods (or setters), under the hood, by Symfony and its great service container.

Jun 16 2014
Jun 16

How to Build a Drupal 8 Module

In the first installment of this article series on Drupal 8 module development we started with the basics. We’ve seen what files were needed to let Drupal know about our module, how the routing process works and how to create menu links programmatically as configuration.

In this tutorial we are going to go a bit further with our sandbox module found in this repository and look at two new important pieces of functionality: blocks and forms. To this end, we will create a custom block that returns some configurable text. After that, we will create a simple form used to print out user submitted values to the screen.

Drupal 8 blocks

A cool new change to the block API in D8 has been a switch to making blocks more prominent, by making them plugins (a brand new concept). What this means is that they are reusable pieces of functionality (under the hood) as you can now create a block in the UI and reuse it across the site – you are no longer limited to using a block only one time.

Let’s go ahead and create a simple block type that prints to the screen Hello World! by default. All we need to work with is one class file located in the src/Plugin/Block folder of our module’s root directory. Let’s call our new block type DemoBlock, and naturally it needs to reside in a file called DemoBlock.php. Inside this file, we can start with the following:

<?php

namespace Drupal\demo\Plugin\Block;

use Drupal\block\BlockBase;
use Drupal\Core\Session\AccountInterface;

/**
 * Provides a 'Demo' block.
 *
 * @Block(
 *   id = "demo_block",
 *   admin_label = @Translation("Demo block")
 * )
 */
class DemoBlock extends BlockBase {

  /**
   * {@inheritdoc}
   */
  public function build() {
    return array(
      '#markup' => $this->t('Hello World!'),
    );
  }

  /**
   * {@inheritdoc}
   */
  public function access(AccountInterface $account) {
    return $account->hasPermission('access content');
  }

}
Like with all other class files we start by namespacing our class. Then we use the BlockBase class so that we can extend it, as well as the AccountInterface class so that we can get access to the currently logged in user. Then follows something you definitely have not seen in Drupal 7: annotations.

Annotations are a PHP discovery tool located in the comment block of the same file as the class definition. Using these annotations we let Drupal know that we want to register a new block type (@Block) with the id of demo_block and the admin_label of Demo block (passed through the translation system).

Next, we extend the BlockBase class into our own DemoBlock, inside of which we implement two methods (the most common ones you’ll implement). The build() method is the most important as it returns a renderable array the block will print out. The access() method controls access rights for viewing this block. The parameter passed to it is an instance of the AccountInterface class which will be in this case the current user.

Another interesting thing to note is that we are no longer using the t() function globally for translation but we reference the t() method implemented in the class parent.

And that’s it, you can clear the caches and go to the Block layout configuration page. The cool thing is that you have the block types on the right (that you can filter through) and you can place one or more blocks of those types to various regions on the site.

Drupal 8 block configuration

Now that we’ve seen how to create a new block type to use from the UI, let’s tap further into the API and add a configuration form for it. We will make it so that you can edit the block, specify a name in a textfield and then the block will say hello to that name rather than the world.

First, we’ll need to define the form that contains our textfield. So inside our DemoBlock class we can add a new method called blockForm():

/**
 * {@inheritdoc}
 */
public function blockForm($form, &$form_state) {
  $form = parent::blockForm($form, $form_state);
  $config = $this->getConfiguration();

  $form['demo_block_settings'] = array(
    '#type' => 'textfield',
    '#title' => $this->t('Who'),
    '#description' => $this->t('Who do you want to say hello to?'),
    '#default_value' => isset($config['demo_block_settings']) ? $config['demo_block_settings'] : '',
  );

  return $form;
}

This form API implementation should look very familiar from Drupal 7. There are, however, some new things going on here. First, we retrieve the $form array from the parent class (so we are building on the existing form by adding our own field). Standard OOP stuff. Then, we retrieve and store the configuration for this block. The BlockBase class defines the getConfiguration() method that does this for us. And we place the demo_block_settings value as the #default_value in case it has been set already.

Next, it’s time for the submit handler of this form that will process the value of our field and store it in the block’s configuration:

/**
 * {@inheritdoc}
 */
public function blockSubmit($form, &$form_state) {
  $this->setConfigurationValue('demo_block_settings', $form_state['values']['demo_block_settings']);
}

This method also goes inside the DemoBlock class and all it does is save the value of the demo_block_settings field as a new item in the block’s configuration (keyed by the same name for consistency).

Lastly, we need to adapt our build() method to include the name to say hello to:

/**
 * {@inheritdoc}
 */
public function build() {
  $config = $this->getConfiguration();

  if (isset($config['demo_block_settings']) && !empty($config['demo_block_settings'])) {
    $name = $config['demo_block_settings'];
  }
  else {
    $name = $this->t('to no one');
  }

  return array(
    '#markup' => $this->t('Hello @name!', array('@name' => $name)),
  );
}
By now, this should look fairly easy. We are retrieving the block’s configuration and, if the value of our field is set, we use it in the printed statement. If not, we use a generic one. You can clear the cache and test it out by editing the block you assigned to a region and adding a name to say hello to. One thing to keep in mind is that you are still responsible for sanitizing user input upon printing to the screen. I have not included these steps for brevity.

Drupal 8 forms

The last thing we are going to explore in this tutorial is how to create a simple form. Due to space limitations, I will not cover the configuration management aspect of it (storing configuration values submitted through forms). Rather, I will illustrate a simple form definition, the values submitted being simply printed on the screen to show that it works.

In Drupal 8, form definition functions are all grouped together inside a class. So let’s define our simple DemoForm class inside src/Form/DemoForm.php:


<?php

/**
 * @file
 * Contains \Drupal\demo\Form\DemoForm.
 */

namespace Drupal\demo\Form;

use Drupal\Core\Form\FormBase;

class DemoForm extends FormBase {

  /**
   * {@inheritdoc}
   */
  public function getFormId() {
    return 'demo_form';
  }

  /**
   * {@inheritdoc}
   */
  public function buildForm(array $form, array &$form_state) {
    $form['email'] = array(
      '#type' => 'email',
      '#title' => $this->t('Your .com email address.'),
    );
    $form['show'] = array(
      '#type' => 'submit',
      '#value' => $this->t('Submit'),
    );
    return $form;
  }

  /**
   * {@inheritdoc}
   */
  public function validateForm(array &$form, array &$form_state) {
    if (strpos($form_state['values']['email'], '.com') === FALSE) {
      $this->setFormError('email', $form_state, $this->t('This is not a .com email address.'));
    }
  }

  /**
   * {@inheritdoc}
   */
  public function submitForm(array &$form, array &$form_state) {
    drupal_set_message($this->t('Your email address is @email', array('@email' => $form_state['values']['email'])));
  }

}

Apart from the OOP side of it, everything should look very familiar to Drupal 7. The Form API has remained pretty much unchanged (except for the addition of some new form elements and this class encapsulation). So what happens above?

First, we namespace the class and use the core FormBase class so we can extend it with our own DemoForm class. Then we implement 4 methods, 3 of which should look very familiar. The getFormId() method is new and mandatory, used simply to return the machine name of the form. The buildForm() method is again mandatory and it builds up the form. How? Just like you are used to from Drupal 7. The validateForm() method is optional and its purpose should also be quite clear from D7. And finally, the submitForm() method does the submission handling. Very logical and organised.

So what are we trying to achieve with this form? We have an email field (a new form element in Drupal 8) we want users to fill out. By default, Drupal checks whether the value input is in fact an email address. But in our validation function we make sure it is a .com email address and if not, we set a form error on the field. Lastly, the submit handler just prints a message on the page.

One last thing we need to do in order to use this form is provide a route for it. So edit the demo.routing.yml file and add the following:

demo.form:
  path: '/demo/form'
  defaults:
    _form: '\Drupal\demo\Form\DemoForm'
    _title: 'Demo Form'
  requirements:
    _permission: 'access content'

This should look familiar from the previous article in which we routed a simple page. The only big difference is that instead of _content under defaults, we use _form to specify that the target is a form class. And the value is therefore the class name we just created.

Clear the caches and navigate to demo/form to see the form and test it out.

If you are familiar with drupal_get_form() and are wondering how to load a form like we used to in Drupal 7, the answer is in the global Drupal class. Thus to retrieve a form, you can use its formBuilder() method and do something like this:

$form = \Drupal::formBuilder()->getForm('Drupal\demo\Form\DemoForm');

Then you can return $form which will be the renderable array of the form.


In this article we’ve continued our exploration of Drupal 8 module development with two new topics: blocks and forms. We’ve seen how to create our own block type we can use to create blocks in the UI. We’ve also learned how to add a custom configuration to it and store the values for later use. On the topic of forms, we’ve seen a simple implementation of the FormBase class that we used to print out to the screen the value submitted by the user.

In the next tutorial we will take a quick look at configuration forms. We will save the values submitted by the user using the Drupal 8 configuration system. Additionally, we will look at the service container and dependency injection and how those work in Drupal 8. See you then.

Jun 13 2014
Jun 13

How to Build a Drupal 8 Module

Drupal 8 brings about a lot of changes that seek to enroll it in the same club other modern PHP frameworks belong to. This means the old PHP 4 style procedural programming is heavily replaced with an object oriented architecture. To achieve this, under the initiative of Proudly Found Elsewhere, Drupal 8 includes code not developed specifically for Drupal.

One of the most important additions to Drupal 8 is the Symfony components, which have two major implications for Drupal developers. First, they have the potential to greatly increase the number of devs that will now want to develop for Drupal. And second, they give quite a scare to some current Drupal 7 developers who do not have much experience with modern PHP practices. But that’s ok, we all learn, and lessons taken from frameworks like Symfony (and hopefully Drupal 8) will transfer easily to other PHP frameworks out there.

In the meantime, Drupal 8 is in a late stage of its release cycle, the current version at the time of writing being alpha11. We will use this version to show some of the basic changes to module development Drupal 7 devs will first encounter and should get familiar with. I set up a Git repo where you can find the code I write in this series so you can follow along like that if you want.

How do I create a module?

The first thing we are going to look at is defining the necessary files and folder structure to tell Drupal 8 about our new module. In Drupal 7 we had to create at least 2 files (.info and .module), but in Drupal 8, the YAML version of the former is enough. And yes, .info files are now replaced with .info.yml files and contain similar data but structured differently.

Another major change is that custom and contrib module folders now go straight into the root modules/ folder. This is because all of the core code has been moved into a separate core/ folder of its own. Of course, within the modules/ directory, you are encouraged to separate modules between custom and contrib like in Drupal 7.

Let’s go ahead and create a module called demo (very original) and place it in the modules/custom/ directory. And as I mentioned, inside of this newly created demo/ folder, all we need to begin with is a demo.info.yml file with the following required content:

name: Drupal 8 Demo module
description: 'Demo module for Drupal 8 alpha11'
type: module
core: 8.x

Three out of four you should be familiar with (name, description and core). The type key is now also a requirement, as you can have .yml files for themes as well. Another important thing to keep in mind is that whitespace in .yml files is significant, and proper indentation is used to organize data in array-like structures.

You can check out this documentation page for other key|value pairs that can go into a module .info.yml file and the change notice that announced the switch to this format.
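As a hedged sketch of what that documentation describes (package and dependencies are illustrative optional keys, not required for the demo module), a fuller demo.info.yml might look like:

```yaml
name: Drupal 8 Demo module
description: 'Demo module for Drupal 8 alpha11'
type: module
core: 8.x
# Optional keys, shown for illustration only:
package: Custom
dependencies:
  - node
```

Note again the indentation: the two-space indent under dependencies is what makes it a list belonging to that key.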

And that’s it, one file. You can now navigate to the Extend page, find the Demo module and enable it.

As I mentioned, we are no longer required to create a .module file before we can enable the module. And architecturally speaking, the .module files will be significantly reduced in size due to most of the business logic moving to service classes, controllers and plugins, but we’ll see some of that later.

What is ‘routing’ and what happened to hook_menu() and its callbacks?

In Drupal 7, hook_menu() was probably the most implemented hook because it was used to define paths to Drupal and connect these paths with callback functions. It was also responsible for creating menu links and a bunch of other stuff.

In Drupal 8 we won’t need hook_menu() anymore as we make heavy use of the Symfony2 components to handle the routing. This involves defining the routes as configuration and handling the callback in a controller (the method of a Controller class). Let’s see how that works by creating a simple page that outputs the classic Hello world!.

First, we need to create a routing file for our module called demo.routing.yml. This file goes in the module root folder (next to demo.info.yml). Inside this file, we can have the following (simple) route definition:

demo.demo:
  path: '/demo'
  defaults:
    _content: '\Drupal\demo\Controller\DemoController::demo'
    _title: 'Demo'
  requirements:
    _permission: 'access content'

The first line marks the beginning of a new route called demo for the module demo (the first is the module name and the second the route name). Under path, we specify the path we want this route to register. Under defaults, we have two things: the default page title (_title) and the _content which references a method on the DemoController class. Under requirements, we specify the permission the accessing user needs to have to be able to view the page. You should consult this documentation page for more options you can have for this routing file.

Now, let’s create our first controller called DemoController that will have a method named demo() getting called when a user requests this page.

Inside the module directory, create a folder called src/ and one called Controller/ inside of it. This will be the place to store the controller classes. Go ahead and create the first one: DemoController.php.

The placement of the Controllers and, as we will see, other classes, into the src/ folder is part of the adoption of the PSR-4 standard. Initially, there was a bigger folder structure we had to create (PSR-0 standard) but now there is a transition phase in which both will work. So if you still see code placed in a folder called lib/, that’s PSR-0.

Inside of our DemoController.php file, we can now declare our class:

<?php

/**
 * @file
 * Contains \Drupal\demo\Controller\DemoController.
 */

namespace Drupal\demo\Controller;

/**
 * DemoController.
 */
class DemoController {

  /**
   * Generates an example page.
   */
  public function demo() {
    return array(
      '#markup' => t('Hello World!'),
    );
  }

}

This is the simplest and minimum we need to do in order to get something to display on the page. At the top, we specify the class namespace and below we declare the class.

Inside the DemoController class, we only have the demo() method that returns a Drupal 7-like renderable array. Nothing big. All we have to do now is clear the caches and navigate to http://example.com/demo and we should see a Drupal page with Hello World printed on it.

In Drupal 7, when we implement hook_menu(), we can also add the registered paths to menus in order to have menu links showing up on the site. This is again no longer handled with this hook but we use a yml file to declare the menu links as configuration.

Let’s see how we can create a menu link that shows up under the Structure menu of the administration. First, we need to create a file called demo.menu_links.yml in the root of our module. Inside this yml file we will define menu links and their position in existing menus on the site. To achieve what we set out to do, we need the following:

demo.demo:
  title: Demo Link
  description: 'This is a demo link'
  parent: system.admin_structure
  route_name: demo.demo

Again we have a yml structure based on indentation in which we first define the machine name of the menu link (demo) for the module demo (like we did with the routing). Next, we have the link title and description followed by the parent of this link (where it should be placed) and what route it should use.

The value of parent is the parent menu link (appended by its module) and to find it you need to do a bit of digging in *.menu_links.yml files. I know that the Structure link is defined in the core System module so by looking into the system.menu_links.yml file I could determine the name of this link.

The route_name is the machine name of the route we want to use for this link. We defined ours earlier. And with this in place, you can clear the cache and navigate to http://example.com/admin/structure where you should now see a brand new menu link with the right title and description and that links to the demo/ path. Not bad.


In this article we began exploring module development in Drupal 8. At this stage (alpha11 release), it is time to start learning how to work with the new APIs and port contrib modules. To this end, I am putting in writing my exploration of this new and exciting framework that will be Drupal 8 so that we can all learn the changes and hit the ground running when release day comes.

For starters, we looked at some basics: how you start a Drupal 8 module (files, folder structure etc), all compared with Drupal 7. We’ve also seen how to define routes and a Controller class with a method to be called by this route. And finally, we’ve seen how to create a menu link that uses the route we defined.

In the next tutorial, we will continue building this module and look at some other cool new things Drupal 8 works with. We will see how we can create blocks and how to work with forms and the configuration system. See you then.

Apr 07 2014
Apr 07

The first revision control system I ever used was called RCS. It was the pre-cursor to CVS and stored all revision data locally. It was nifty but very limited and not suited for group development. CVS was the first shared revisioning system I used. It was rock solid, IMHO. But it had a few big problems, like the inability to rename or move files. Everything had to be deleted and re-added. 

Since those days, I've used several other revisioning systems: Perforce, Bitkeeper, Clearcase, Subversion and GIT.

I'm tired of learning yet another system. I just want to know which horse is going to win the race for the foreseeable future and go all in.

That's where Google Trends comes in very handy. It quickly reveals that I need to bet on GIT. 

I just hope I can make it through the next 5 years or more before having to learn the next greatest solution to our shared problem of tracking code revisions.

Apr 25 2013
Apr 25

I recently began working on some PHP code for resolving HTML5 entities into their Unicode codepoints. According to the code, it had been optimized for performance. The code was moderately complex, and the authors appeared to have gone through great pains to build a specialized lookup algorithm. But when I took a closer look, I doubted. I decided to compare the "optimized" version with what I would call a naive version -- the simplest solution to the problem.

Here I show the two solutions, and then benchmark them for both memory and speed.

Up front, I want to state that I am not poking fun at or deriding the original authors. Their solution has merits, and in a compiled language it might actually be a faster implementation. But to optimize for PHP requires both an understanding of the language and an appreciation for opcode-based interpretation.

UPDATE: Jeff Graham made a very astute observation on Twitter:

@technosophos Nice writeup looks like the diff is on the order O(log n) vs. O(1) for ~2000 entries. Knuth put it best c2.com/cgi/wiki?Prema…— Jeff Graham (@jgraham909) April 25, 2013

UPDATE: In the initial version of this article, I claimed that the tree was a b*tree. It's actually not. It's just a standard node tree. As such, there is no way to make it outperform a hash table. Based on this, the conclusion of the article is patently obvious. However, it is good to see how damaging mis-application of an algorithm can be to overall performance.

The Code: HTML5 Character Reference Resolving

HTML5 defines over 2,000 character references (expressed as entities). There is no computable way to match the string-based entity names to their Unicode code points. So to solve the problem, one must build a lookup mechanism that maps the character reference's string name to a Unicode character.

For example, the character reference name of ampersand (&) is amp. Likewise, Á is Aacute. A lookup tool should be able to take that string name (amp, Aacute) and return the right Unicode character. PHP has an older function that does this, but it only supports a subset of the characters in HTML5, so full standards support requires work.
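To illustrate the gap (the older function alluded to is presumably html_entity_decode(); this is a sketch assuming a PHP build where the full HTML5 table is opt-in via the ENT_HTML5 flag):

```php
// The classic HTML 4.01 translation table resolves the old, common names.
echo html_entity_decode('&amp;', ENT_QUOTES | ENT_HTML401), "\n"; // &

// HTML5-only names such as &angmsd; (measured angle) pass through untouched
// unless the ENT_HTML5 table is requested explicitly.
echo html_entity_decode('&angmsd;', ENT_QUOTES | ENT_HTML401), "\n"; // &angmsd;
echo html_entity_decode('&angmsd;', ENT_QUOTES | ENT_HTML5), "\n";   // resolves to a Unicode character
```

At the time this article was written, relying on that flag was not an option everywhere, hence the hand-rolled lookup tables discussed below.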

The Optimized Code

The original code solved the problem as follows:

  • The string names were broken down into characters and then packed into a data structure designed to work like a node tree.
  • The node tree was serialized to a nice compact 178k file (see the end of the article for a sample).
  • At runtime, the serialized file is read into memory once.
  • To lookup a character, a function walks the node tree. When it finds the full reference, it returns a codepoint.

In theory, this sounds very good. But does it perform? I took the code, optimized it as much as possible without changing the underlying algorithm, and wrote this simple test harness:

class Data {
  protected static $refs;

  public static function getRefs() {
    // A serialized node tree of entities is stored on disk.
    // It takes 178k of disk space. It is only loaded once.
    if (empty(self::$refs)) {
      self::$refs = unserialize(file_get_contents('./named-character-references.ser'));
    }
    return self::$refs;
  }
}

function test($lookup) {
  // Load the entities tree.
  $refs = Data::getRefs();

  // Split the string into an array.
  $stream = str_split($lookup);
  $chars = array_shift($stream);

  // Dive into the tree and look for a match.
  // The tree is structured like this:
  // array(a => array(m => array(p => 38)))
  $codepoint = false;
  $char = $chars;
  while ($char !== false && isset($refs[$char])) {
    $refs = $refs[$char];
    if (isset($refs['codepoint'])) {
      $id = $chars;
      $codepoint = $refs['codepoint'];
    }
    $chars .= $char = array_shift($stream);
  }

  // Return the codepoint.
  return chr($codepoint);
}

$r = test('amp');
printf("Lookup %s >Current mem: %d Peak mem: %d\n", $r, memory_get_usage(), memory_get_peak_usage());

(Note that using static classes seems to improve memory usage in PHP5. It saves around 0.5M of runtime memory compared to a global or local variable.)

Several things stood out to me, though:

  1. While a serialized file might seem logical, it's actually a burden on PHP. Unserializing is not fast.
  2. The node tree isn't really a node tree. It's actually an n-depth hash table. Tree traversal is not going to be very fast.
  3. Not only will tree traversal be slow, but the node tree-like table is going to require a lot of memory.

The Naive Code

Based on my observations, I decided to compare it to a naive implementation: I stored all of the entities in a single hashtable (in PHP parlance, this is an array). Rather than serialize, I just put the entire thing into a PHP file.

So my code looked like this:

class Entities {
  public static $byName = array(
    'Aacute' => 'Á',
    'Aacut' => 'Á',
    'aacute' => 'á',
    'aacut' => 'á',
    // Snip a few thousand lines.
    'zwj' => '',
    'zwnj' => '',
  );
}

function test($lookup) {
  if (!isset(Entities::$byName[$lookup])) {
    return FALSE;
  }
  return Entities::$byName[$lookup];
}

$r = test('amp');
printf("Lookup %s >Current mem: %d Peak mem: %d\n", $r, memory_get_usage(), memory_get_peak_usage());

Compared to the previous code, this should be self-explanatory. We take a string to look up, and we look it up. (We could actually remove the isset() check by adding @ on the second call to Entities.) The total size of this file is 47k, so it's still smaller than the serialized node tree.
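As a hedged sketch of that aside (a standalone variant, not the article's code; the class and function names here are illustrative), the @ operator silences the undefined-index notice that isset() was guarding against:

```php
class TinyEntities {
  // A hypothetical two-entry slice of the real table.
  public static $byName = array(
    'amp' => '&',
    'Aacute' => 'Á',
  );
}

function tiny_lookup($name) {
  // @ suppresses the notice for a missing key; NULL then signals "not found".
  $value = @TinyEntities::$byName[$name];
  return $value === NULL ? FALSE : $value;
}
```

Whether the marginally shorter code is worth the cost of error suppression is a matter of taste; the explicit isset() version reads more clearly.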

My hypothesis coming out of this was that the lookup would be substantially faster, and the memory usage would be a little lower. The real results surprised me, though.


Memory Usage

The above examples are both tooled already to report memory usage. Let's compare:

$ php test-btree.php
Lookup & >Current mem: 5289928 Peak mem: 5575336

To do a single entity lookup, this code took around 5MB of memory. How does that compare to the naive version?

$ php test-hash.php
Lookup >Current mem: 1263200 Peak mem: 1276768

The naive version used a little over a fifth of the memory that the node tree version used.

Speed Test

So the node tree version takes a little extra memory... that's not a deal breaker, is it? Probably not. In fact, if they perform about the same, then it's probably inconsequential.

To test performance, I did the following:

  • Put the test script on a local webserver
  • Warmed the server by requesting each script ten times
  • Ran ab -n 1000 http://localhost/Perf/test-SCRIPT.php

The output of ab (The Apache Benchmarking Tool) is verbose, but here's the pertinent bit for the node tree version:

Concurrency Level:      1
Time taken for tests:   10.564 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    94.66 [#/sec] (mean)
Time per request:       10.564 [ms] (mean)
Time per request:       10.564 [ms] (mean, across all concurrent requests)
Transfer rate:          24.13 [Kbytes/sec] received

The easiest number to zoom in on is Time taken for tests: 10.564 seconds, which conveniently averages to about 10.6 msec per request.

Let's compare with the hashtable version:

Concurrency Level:      1
Time taken for tests:   2.541 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    393.61 [#/sec] (mean)
Time per request:       2.541 [ms] (mean)
Time per request:       2.541 [ms] (mean, across all concurrent requests)
Transfer rate:          100.33 [Kbytes/sec] received

Again, the important number: 2.541 seconds, or about 2.5 msec per request.

The naive version took only one quarter of the time that the optimized version took.

Now here's an interesting additional piece of data: This was without opcode caching. Would opcode caching make a difference?

Speed Test with Opcode Caching

To test opcode caching, I installed APC, restarted Apache, and then re-ran the battery of tests.

Here's how the optimized version fared:

Concurrency Level:      1
Time taken for tests:   10.636 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    94.02 [#/sec] (mean)
Time per request:       10.636 [ms] (mean)
Time per request:       10.636 [ms] (mean, across all concurrent requests)
Transfer rate:          23.96 [Kbytes/sec] received

For all intents and purposes, there was no real change.

Compare that to the naive version:

Concurrency Level:      1
Time taken for tests:   2.025 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    493.81 [#/sec] (mean)
Time per request:       2.025 [ms] (mean)
Time per request:       2.025 [ms] (mean, across all concurrent requests)
Transfer rate:          125.86 [Kbytes/sec] received

The opcode cache has trimmed a little over half a millisecond off of the request time. That makes the naive version more than 5x faster than the optimized version.

Why did it make a difference for the naive version, but not the node tree-based version? The reason is that the .ser file, introduced ostensibly to speed things up, cannot be cached, as it's not code. So on each request, it must be re-loaded into memory.

Meanwhile, all 2,000 entities in the hashed version are conveniently stored in-memory. Assuming the server has sufficient cache space, that data will not need to be re-read and re-interpreted on each subsequent request.
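The difference can be sketched in miniature (file names here are illustrative): a PHP source file is compiled once and can be served from the opcode cache on later requests, while serialized data must be re-read and unserialized every time:

```php
$dir = sys_get_temp_dir();

// Hash version: the table ships as PHP source, so an opcode cache
// (APC, OPcache) can reuse the compiled result across requests.
file_put_contents("$dir/entities.php", "<?php return array('amp' => '&');");
$byName = include "$dir/entities.php";

// Tree version: the table ships as plain data, which is not opcode-cacheable;
// every request repeats the file read and the unserialize() work.
file_put_contents("$dir/entities.ser", serialize(array('a' => array('m' => array('p' => array('codepoint' => 38))))));
$refs = unserialize(file_get_contents("$dir/entities.ser"));
```

Both loads produce an in-memory array, but only the first one gets cheaper after the first request.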

One Additional Strength of the Optimized Code

While I opted to take the naive version of the code, there is one additional strength of the optimized code: Under certain conditions, this sort of algorithm can become more fault-tolerant. The optimized version can sometimes find a codepoint for a reference that was not well formed, because it traverses until it finds a match, and then it stops. The problem with this algorithm, though, is that given the input string 'foobar' and an entity map that contains 'foo' and 'foobar', the matched candidate will be 'foo'.

The naive version of the code does not correct for encoding errors. If the entity name isn't an exact match, it is not resolved.

Appendix: XHProf Stack

Want to know where all of that time is spent? Here's an xhprof dump of the two call stacks.

Using xhprof_enable(XHPROF_FLAGS_CPU + XHPROF_FLAGS_MEMORY);, I gathered the following stats:

Node Tree Version

    [Data::getRefs==>file_get_contents] => Array
            [ct] => 1
            [wt] => 219
            [cpu] => 0
            [mu] => 183888
            [pmu] => 192392
    [Data::getRefs==>unserialize] => Array
            [ct] => 1
            [wt] => 9083
            [cpu] => 8001
            [mu] => 4644728
            [pmu] => 4729336
    [test==>Data::getRefs] => Array
            [ct] => 1
            [wt] => 9346
            [cpu] => 8001
            [mu] => 4648312
            [pmu] => 4921728
    [test==>str_split] => Array
            [ct] => 1
            [wt] => 4
            [cpu] => 0
            [mu] => 2072
            [pmu] => 0
    [test==>array_shift] => Array
            [ct] => 4
            [wt] => 10
            [cpu] => 0
            [mu] => 480
            [pmu] => 0
    [test==>chr] => Array
            [ct] => 1
            [wt] => 3
            [cpu] => 0
            [mu] => 1128
            [pmu] => 0
    [main()==>test] => Array
            [ct] => 1
            [wt] => 9435
            [cpu] => 8001
            [mu] => 4654408
            [pmu] => 4921728
    [main()==>xhprof_disable] => Array
            [ct] => 1
            [wt] => 0
            [cpu] => 0
            [mu] => 1080
            [pmu] => 0
    [main()] => Array
            [ct] => 1
            [wt] => 9454
            [cpu] => 8001
            [mu] => 4657560
            [pmu] => 4921728

Hash Version

    [main()==>test] => Array
            [ct] => 1
            [wt] => 139
            [cpu] => 0
            [mu] => 1072
            [pmu] => 0
    [main()==>xhprof_disable] => Array
            [ct] => 1
            [wt] => 1
            [cpu] => 0
            [mu] => 1080
            [pmu] => 0
    [main()] => Array
            [ct] => 1
            [wt] => 178
            [cpu] => 0
            [mu] => 4160
            [pmu] => 0

Appendix 2: Sample SER data

    [A] => Array
            [E] => Array
                    [l] => Array
                            [i] => Array
                                    [g] => Array
                                            [;] => Array
                                                    [codepoint] => 198
                                            [codepoint] => 198
            [M] => Array
                    [P] => Array
                            [;] => Array
                                    [codepoint] => 38
                            [codepoint] => 38
Sep 30 2012
Sep 30

I recently worked on a project which required an updated version of the jQuery library. While there is the jQuery Update module, it only allows you to upgrade Drupal 6 to jQuery 1.3. If you really know what you're doing and want to upgrade beyond that version, you can either hack core or create your own simple module to do it. While hacking core is certainly the easier approach (simply overwriting misc/jquery.js with a newer version), it is very bad practice. You do not want to get yourself in the habit of altering Drupal core unless you want to kill your upgrade path and deal with a new slew of bugs and unpredictable behavior.

Let's start by creating an admin area for configuring the version of jQuery we want to use.

/**
 * Module configuration admin form.
 */
function mymodule_admin_form() {
  $form['mymodule_custom_jquery'] = array(
    '#type' => 'checkbox',
    '#title' => t('Override jQuery'),
    '#description' => t('Replace the version of jQuery that ships with Drupal
      (jQuery 1.2.6) with the jQuery library specified below. You will need to
      flush your cache when turning this feature on or off.'),
    '#default_value' => variable_get('mymodule_custom_jquery', 0),
  );
  $form['mymodule_custom_jquery_file'] = array(
    '#type' => 'textfield',
    '#title' => t('jQuery file location'),
    '#description' => t('Specify the location of the updated jQuery library
      you would like to use. The location is relative to the docroot and
      so should probably begin with "sites/".'),
    '#size' => 128,
    '#default_value' => variable_get('mymodule_custom_jquery_file', NULL),
  );
  return system_settings_form($form);
}



Next, we're going to write a validation hook that simply makes sure the file exists.

/**
 * Admin form validation callback.
 */
function mymodule_admin_form_validate($form, &$form_state) {
  $file = trim($form_state['values']['mymodule_custom_jquery_file']);
  $form_state['values']['mymodule_custom_jquery_file'] = $file;
  // Only validate this value if the js override is turned on.
  if (variable_get('mymodule_custom_jquery', 0)) {
    if (!file_exists($file)) {
      form_set_error('mymodule_custom_jquery_file', t('The file you specified does not exist: !file.', array('!file' => $file)));
    }
  }
}

And lastly, we use hook_preprocess_page() to safely replace the jQuery that ships with Drupal core with our own version.

/**
 * Implementation of hook_preprocess_page().
 */
function mymodule_preprocess_page(&$variables) {
  if (variable_get('mymodule_custom_jquery', 0)) {
    $file = variable_get('mymodule_custom_jquery_file', 0);
    $scripts = drupal_add_js();
    // Remove core jQuery and add our own.
    unset($scripts['core']['misc/jquery.js']);
    $add_scripts['core'][$file] = array(
      'cache' => TRUE,
      'defer' => FALSE,
      'preprocess' => TRUE,
    );
    $scripts['core'] = array_merge($add_scripts['core'], $scripts['core']);
    $variables['scripts'] = drupal_get_js('header', $scripts);
  }
}

Make note of the line:

$scripts['core'] = array_merge($add_scripts['core'], $scripts['core']);

We are careful to add our new jQuery include at the beginning of the array (where the original jquery.js include was) in order to meet any jquery dependencies in the scripts that follow.
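The ordering trick relies on array_merge() preserving insertion order for string keys, with the first argument's keys coming first. A minimal sketch (file paths are illustrative, not from the article):

```php
// Our replacement jQuery, keyed by path.
$add_scripts['core'] = array(
  'sites/all/libraries/jquery/jquery.js' => array('cache' => TRUE),
);

// What core might already have queued up.
$scripts['core'] = array(
  'misc/drupal.js' => array('cache' => TRUE),
  'misc/collapse.js' => array('cache' => TRUE),
);

$scripts['core'] = array_merge($add_scripts['core'], $scripts['core']);

// The replacement file now sits first in the list, so every script
// rendered after it can rely on jQuery being loaded.
print implode("\n", array_keys($scripts['core']));
```

If the same key appeared in both arrays, the later array's value would win, which is another reason the core jquery.js entry is removed before merging.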

Sep 01 2012
Sep 01

Read on for the longer explanation.

Over the course of QueryPath's life, several people have requested that I change one major thing about QueryPath:

  • In QueryPath 2, find() and related functions return THE SAME object, instead of following jQuery and returning A NEW object.

Based on some recent discussions, I decided to try changing this behavior for QueryPath 3.

  • In the experimental version, each query method (find(), next(), top(), etc.) returns a NEW object.
  • Or, to put it another way, find() now works like branch() used to work. All other methods work like that, too.
  • branch() is still there for backward compatibility
  • findInPlace() has been added. It acts like the old find().

Please try the experimental version and give me some feedback. Is this better or worse than the old QueryPath?

Let me know by September 8.


  • This new version is about the same speed as QueryPath 3.x alpha 2.
  • This new version will take up a little more memory, but not noticeably more in most cases.


You can find this code in the branch named feature/find-v2 in GitHub. It's a branch off of 3.x. If people like this behavior, I will merge it into 3.x and it will be part of QueryPath 3.

Aug 07 2012
Aug 07

The first two articles in a series about the HP Cloud PHP bindings are available on Cloud Matters, the official HP Cloud blog. Matt Farina is writing this series.

  • More will come this week and next.

The HP Cloud library, which is developed on Github, is a project Matt and I started. It wasn't intended to be part of HP's offerings. Instead, we started the project because we wanted to build our own tools to work with our cloud services. We have a number of specialized internal tools that we use for things like debugging the Identity Service catalog or taking snapshots of DBaaS instances.

We also wanted to be able to store Drupal assets inside of Object Storage, and building a PHP library was the first step. (The second, of course, was to build an HP Cloud Drupal module).

The high point so far has been launching our new blog entirely inside of our own cloud architecture, powered in a large part by this PHP code. We'd been dipping our toes in the water, but this was a head-first plunge.

It's been rewarding to see this library go from "scratching our own itch" to being generally useful for others. The library still has a way to go, and we've just about pushed it to 1.0, but I'm encouraged to see how far it has gone already.

Aug 01 2012

In the last 24 hours, I have had three glimpses into emergency plans. First, my local coffee shop -- the true source of my productivity -- experienced a water main break. Second, a site I manage experienced a server failure. Third, I came across Netflix's recently open sourced Chaos Monkey tool.

The emergencies I am talking about don't involve physical risk and are certainly not life-or-death circumstances. But they deal with a threat to a business: the loss of customers and revenue. Reflecting on these provides some insight into prevention and practice.

The Coffee Shop

When I walked into my corner coffee shop for the morning Joe, the barista informed me that due to a water main break, they were unable to make most of their drinks. I got lucky -- they still had some drip coffee. I ordered and sat down to drink it.

I watched the team deal with customers who came in, deal with the equipment, and engage in problem solving. After an hour, the talk turned from carrying on to closing shop. My curiosity was piqued, so I asked about the protocol for handling situations like this. They explained that they have a procedure to follow for emergencies, and that they had stepped through it to the best of their abilities. But there was a problem: The water main break was sudden, and the protocol didn't adapt well to cases where there was an unexpected and sudden loss of water.

The team did an admirable job, and took measures to encourage customers to come back. But the emergency plan had a flaw. (With a little inventiveness, I think they managed to avoid closing for the day.)

The Website

I admit that in most cases I am not good at planning for emergencies. But in the case of the website that failed, we had a plan. We had thought through enough of the possibilities that even the case that occurred was one we had anticipated. When the site failed, we implemented the plan, and it worked.

We could pat ourselves on the back, but here's the thing: Until our outage, we didn't know whether the emergency plan would work. We got lucky -- and even in our luck, there was a fair amount of flapping as we tried to implement our untested recovery plan.

If only we were more proactive in testing our emergency plan.

Chaos Monkey

Netflix takes a different approach to their emergency plan: they simulate emergencies. In fact, they built a tool to simulate emergencies for them, the aptly named Chaos Monkey.

In a nutshell, Chaos Monkey causes servers to break. Yes, they intentionally break their servers. Then the emergency recovery process kicks in. If some part of the emergency process is broken, they'll know and they will be prepared to react (I presume that it's not 2 AM on a Sunday when they run this thing).

Because of this testing -- somewhat chaotic, but in a controlled environment -- the Netflix engineers can test and improve emergency plans.

An emergency plan is a must-have. But to play its role, it's got to work.

Jul 30 2012

HP Cloud has migrated its blog site into Drupal. This makes the fourth Drupal migration for HP Cloud. But it is the first one to be running entirely inside of our own cloud.

We're using HP Cloud Compute instances, our Relational Database in-cloud MySQL server, Object Storage for all static files, and CDN to seamlessly serve public files out of a content distribution network. What is more, we now have a Stackato-based architecture for rapidly deploying Drupal sites into the cloud.

DISCLAIMER: The opinions expressed here are my own, and not those of my employer or my two rather irritable cats.

Jun 26 2012

tl;dr: FIG is a welcome force in PHP standardization. But I believe their two most recent standards have undercut their credibility. By choosing contentious ground, issuing an arbitrary standard that competes with existing conventions, and doing so in an area that does not actually improve the interoperability of code, they have weakened their position as a standards body. I suggest that they remedy this by downgrading PSR-1/2 from "standard" to "recommendation."

In mid-June, the relatively young Framework Interoperability Group (FIG) proposed a pair of new standards called PSR-1 and PSR-2. FIG's ostensible mission is to provide PHP-related standards to bring some level of interoperability between the plethora of PHP frameworks and applications.

But these two standards, and PSR-2 in particular, have in my mind undercut the larger goal of the standards body. That is, releasing these two documents as formal standards harms the credibility of what had looked like a very promising standards body. Here I explain why I think this, and suggest a remedy.

In Praise of PSR-0

FIG's first standard was called PSR-0, and it was a much-needed standard, a breath of fresh air.

The developers of the PHP language itself have taken what might be considered a laissez-faire attitude toward the way PHP is used. For many common cases (tasks like reading a file off of the file system), there are many ways of doing essentially the same task, and yet PHP itself does not dictate which is preferred, which is best.

In many cases this is a boon. Projects can choose the method best suited for the task or for the community. But sometimes this "openness" substantially harms interoperability. The growth of truly independent libraries in PHP has been hampered in no small part because of the fact that "everyone does it their own way".

Here is a case in point: loading PHP files. For many years, one needed to manually include all of the files one wanted to use. There was no standard way of mapping, say, a class to a file name. When PHP introduced an autoloader -- a feature allowing developers to map classes to files in user-space code -- the various PHP projects still implemented autoloaders in grossly divergent ways. To use two different libraries, one might have to register (or write) two different autoloaders. This was cumbersome for both the developer and the system.

Along came FIG with a solution. The PSR-0 standard defined a convention for mapping class names to file names, and then provided a reference autoloader.
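The heart of that convention is a simple class-name-to-path mapping. Here is a minimal sketch of it (simplified from the reference autoloader; the helper name and base-directory handling are my own):

```php
<?php
// Sketch of the PSR-0 mapping: namespace separators become directory
// separators, underscores in the CLASS name (not the namespace) also
// become directory separators, and '.php' is appended.
function psr0_class_to_path($class, $baseDir = '')
{
  $class = ltrim($class, '\\');
  $fileName = '';
  if ($pos = strrpos($class, '\\')) {
    $namespace = substr($class, 0, $pos);
    $class = substr($class, $pos + 1);
    $fileName = str_replace('\\', DIRECTORY_SEPARATOR, $namespace) . DIRECTORY_SEPARATOR;
  }
  $fileName .= str_replace('_', DIRECTORY_SEPARATOR, $class) . '.php';
  return $baseDir . $fileName;
}

// Registering it as an autoloader might look like:
// spl_autoload_register(function ($class) {
//   $path = psr0_class_to_path($class, __DIR__ . '/src/');
//   if (is_file($path)) { require $path; }
// });

echo psr0_class_to_path('Doctrine\\Common\\IsolatedClassLoader');
```

Both namespaced classes (`Doctrine\Common\IsolatedClassLoader`) and legacy underscore-style names (`Zend_Mail_Message`) resolve to predictable file paths, which is what lets one autoloader serve every compliant library.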

This changed the PHP culture. Large frameworks jumped on the bandwagon, and projects like Composer provided unification in the dependency management layer. PHP, long in arrears on the basics of library management, suddenly closed the gap with the likes of Java, Python, Perl, and Ruby.

PSR-0 may have saved PHP.

Why is PSR-0 So Successful?

PSR-0's success is easy to explain: It meets an acute technical need. It makes the programmer's life easier, and it is a needed first step toward actual program-level interoperability.

Just a week ago, I used components from three different frameworks to build an application. And it was trivially easy because all of them supported PSR-0. Frankly, I never read the code to these libraries -- I just used the public APIs according to the documentation. Two years ago I would have spent more hours just loading all of the right stuff than I did building the entire application. I would have had to dive into the code of each project just to figure out what I needed to load and when.

FIG was successful with PSR-0 because they introduced a standard that made sense, that helped solve a technical problem, and that made the average programmer's life easier.

Along Came a Spider

Then came PSR-1 and PSR-2. These two standards -- they are called standards, not recommendations or conventions -- dictate coding style.

PSR-1 is reportedly a "basic" set of coding conventions, standardizing practices like capitalization of class names and general layout of a class file. In many ways, it's not dissimilar from what many programming languages have introduced in their basic documentation.

PSR-2 is the detailed coding standard, covering everything from where every parenthesis and curly brace ought to go, to how many spaces an indent should have, to how long each line of code may be (120 columns!).

First let me state one thing emphatically: I am all in favor of coding conventions. I have seen some code formatting atrocities that would make Jackson Pollock vomit. And on the converse, I have witnessed how convenient it is when all of the code is visually the same. Coding styles are convenient, though clearly not necessary.

But I think FIG made a gross misstep in dictating a standard.

A Good Standard

Each standards group has a purpose. W3 seeks to build standards for the web. Java and Python have standards groups who churn out standard interfaces and processes, and even develop the features of the next generation of the language.

And FIG's purpose is to promote framework interoperability.

PSR-0 is a great example of how to increase interoperability, for it makes disparate standards-compliant libraries work together better.

But do coding conventions increase interoperability? No. In fact, they have almost no impact on interoperability. The language itself enforces the essential syntax constraints. If it runs in a PHP interpreter, it doesn't matter how many spaces one indented nor whether or not there was whitespace between the namespace declaration and the class declaration. And when I use two libraries, the spacing around an = operator makes absolutely no difference in how the libraries function in tandem.

But isn't it nice to have all the code formatted the same way? Yes, it is. It is convenient. But this is the happy tip of a malevolent iceberg.

Rational Disagreement

How many characters should be used to indent in code? 2? 4? 8? And should tabs be allowed? Required? Everyone has their opinions, and for the most part these opinions are rational. There isn't necessarily a right answer.

How about this: should there be a space between if and the opening parenthesis? Again, there is room for rational disagreement.

But in both of these cases, people assiduously draw their lines. They have their preferences, and they stick to them. And there are already several popular coding conventions for PHP.

Now enter the new standards body, FIG.

A Rock and a Hard Place

FIG simply dove into the fray and chose to dictate standards to a wide variety of already developed applications. PSR-1/2 were derived from a survey of 20 or so PHP projects, but the FIG standard doesn't necessarily match any of the existing coding standards (though it comes close to Symfony's). This means the standard begins by demanding that all of these projects change, but with no subsequent technological improvement. Inertia is already against PSR-1/2.

By claiming that these are standards, they have also placed equal importance on PSR-1/2 as they did on PSR-0.

What they have unintentionally done is undermined their position on two fronts:

  • By dictating that coding conventions are on equal level with their other standards, they have matched the crucial (real, problem-solving standards) with the superficial (consistent-looking code).
  • By taking a hard line on an area of broad rational disagreement, they have intentionally positioned themselves in the midst of a controversy, while simultaneously not offering any justification for taking PSR-1/2's standard over any of the myriad competing standards.

Combined, these two have an unfortunate result: Developers will dismiss PSR-1/2 -- especially PSR-2 -- as overzealous and optional, and this will change people's perception of the standards group.

This will most certainly impact FIG's future standards efforts, as developers will take their suggestions less seriously and immediately view proposals with a jaundiced eye: "is this arbitrary lawmaking or real innovation?"

What FIG Could Have Done (and Still Should Do)

A group dedicated toward real interoperability should have stayed away from dictating superficial trappings. They should avoid, at all cost, diluting the power of well-reasoned problem-solving technical standards with arbitrary conventions -- especially when their doing so is competing with other more established conventions.

They could have made recommendations or conventions for coding style -- in fact, I still think they ought to. PSR-1/2 is still valuable, but as a recommendation of obviously lesser importance than PSR-0. But they should absolutely not call it a standard.

Moreover, FIG's biggest boon is in filling a much-felt gap in PHP: The fact that interoperability is damaged by technically incompatible (or, more properly, compatibility-agnostic) solutions.

I, for one, would love to see Composer-style package declarations become standardized. I'd like to see a reasonable namespace convention, too, since that aids toolchain, programmer, and runtime alike. And I'd definitely welcome some interfaces for oft-implemented patterns. In short, I'd like to see standards that improve the actual interoperability of PHP libraries and frameworks, and that do so in a clearly reasoned manner.

Other standards bodies manage this well. They must. Their survival depends on it. FIG's does, too.

Update: Engineered Web also has a blog post on PSR-2 that takes a novel approach: PHP coders rarely work in a vacuum. We work with other languages like JavaScript. Shouldn't the PHP coding standard be influenced by the standards of those other languages?

**Update:** Corrected one instance of SPR to PSR. Thx @DamienMcKenna.

Jun 20 2012

Pronto.js is designed to be a high performance asynchronous application framework that makes it simple to chain together components to build sophisticated application logic. It's the JS equivalent of the PHP Fortissimo framework.

One characteristic that makes both Pronto.js and Fortissimo stand apart is that they provide an alternative to the MVC pattern. They use the Chain-of-Command pattern, which takes a route name and maps it to a series of "commands", each of which is responsible for a different part of the processing. Well-written commands become highly reusable, which makes application development rapid and yet still reliable.

When you build an application the components get chained together into routes with code like this (Pronto.js):

      .does(InitializeSearchService, 'initialization')
      .does(QueryRemoteSearchService, 'do-search')
      .does(SearchTheme, 'format-search-results')

(Fortissimo code looks similar: $register->route('search')->does(/*…*/)) The simple example above registers the route search to a series of commands that each perform part of the overall task of running a search and formatting the response.

Commands (InitializeSearchService, QueryRemoteSearchService, and so on) are short pieces of object-oriented code (prototypes in JS, classes in PHP) that take predefined input, perform a simple task, and then return data. My typical command is around 20 lines of code.
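The idea can be sketched in a few lines of self-contained PHP. Note that the class and method names here are hypothetical illustrations of the pattern, not the actual Fortissimo or Pronto.js API:

```php
<?php
// Toy sketch of the Chain-of-Command pattern: a route name maps to an
// ordered list of small commands, each of which reads from and writes to
// a shared context.
interface Command {
  public function execute(array &$context);
}

class UppercaseQuery implements Command {
  public function execute(array &$context) {
    // One small job: normalize the query string.
    $context['query'] = strtoupper($context['query']);
  }
}

class WrapResult implements Command {
  public function execute(array &$context) {
    // Another small job: format the final result.
    $context['result'] = '[' . $context['query'] . ']';
  }
}

class Router {
  private $routes = array();

  public function route($name, array $commands) {
    $this->routes[$name] = $commands;
    return $this;
  }

  // Run every command registered for a route, in order.
  public function handle($name, array $context = array()) {
    foreach ($this->routes[$name] as $command) {
      $command->execute($context);
    }
    return $context;
  }
}

$router = new Router();
$router->route('search', array(new UppercaseQuery(), new WrapResult()));
$out = $router->handle('search', array('query' => 'apples'));
echo $out['result']; // [APPLES]
```

Because each command only touches the shared context, commands compose freely: reordering, reusing, or swapping them changes the route's behavior without touching the commands themselves.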

I know this is just a brief teaser. We're working on several cool applications built on these technologies. With Pronto.js, we've been able to integrate with a wide variety of NPM packages, while Fortissimo and the Symfony (and other) PHP libraries can be easily combined. In the future, I'll blog some more about Fortissimo and Pronto.js.

Jun 18 2012

The PHP CURL library is the most robust library for working with HTTP (and other protocols) from within PHP. As a maintainer of the HPCloud-PHP library, which makes extensive use of REST services, I've been tinkering around with ways of speeding up REST interactions.

What I've found is a way to cut off nearly 70% of the processing time for a typical usage scenario. For example, our unit tests used to take four minutes to run, and we're now down to just over a minute, while our Drupal module's network time has been cut by over 75%.

This article explains how we accomplished this with a surprisingly simple (and counter-intuitive) modification.

The typical way to work with CURL is to use the curl_init() call to create a new CURL handle. After suitable configuration has been done, that CURL handle is typically executed with curl_exec(), and then closed with curl_close(). As might be expected, this builds a new connection for each CURL handle. If you create two handles (calling curl_init() twice, and executing each handle with curl_exec()), the unsurprising result is that two connections to the remote server are created.

But for our library, a pattern emerges quickly: Many requests are sent to the same servers. In some cases, several hundred requests may go to the same server, and even the same URL (though with different headers and bodies). If we use the method above of creating one CURL handle per request, the network overhead for each request can really slow things down. This is compounded by the fact that almost all requests are done over SSL, which means each request has not only network overhead, but also SSL negotiation overhead.

This is hardly a new problem, and HTTP has several methods for dealing with this. Unfortunately, CURL, as I've used it above, cannot make use of any of these. Why not? Because each CURL handle does its own connection management. But there were hints in the PHP manual that there may be ways to share connections. And when looking at CURL's raw output, I could see it leaving connections open for re-use. But how could I make use of those?

Reuse a CURL handle

The first method was to call curl_init() once, and then call curl_exec() multiple times before calling curl_close(). This method is described (sparsely) in a Stack Overflow discussion.

I gave this method a try, but immediately ran up against issues. While I suspect that this method works for simple configurations, our library is not simple. It makes deep use of CURL's configuration API, passing input and output streams around, and conditionally setting many options depending on the type of operation being performed. We use GET, HEAD, POST, PUT, and COPY requests, sometimes in rapid succession. Sometimes we provide only scant data to the server, while other times we are working with large objects. Re-using the same CURL handle did not work well in this situation. While it is easy to set an option, it is not possible to unset or reset an option.

After trying several methods of resetting options, I abandoned this approach and began digging again.

CURL Multi is not just for parallel processing

The hint that changed everything came from this entry in the CURL FAQ:

"curl and libcurl have excellent support for persistent connections when transferring several files from the same server. Curl will attempt to reuse connections for all URLs specified on the same command line/config file, and libcurl will reuse connections for all transfers that are made using the same libcurl handle.
When you use the easy interface, the connection cache is kept within the easy handle. If you instead use the multi interface, the connection cache will be kept within the multi handle and will be shared among all the easy handles that are used within the same multi handle. "

It took me a moment to realize that the easy interface was curl_exec, but once I caught on, I knew what I needed to do.

The CURL multi library is typically used for running several requests in parallel. But as you can see from the FAQ entry above, it has another virtue: It caches connections. As long as the CURL multi handle is re-used, CURL connections will automatically be re-used as long as possible.

This method provides the ability to set different options on each CURL handle, but then to run each CURL handle through the CURL multi handler, which provides the connection caching. While this particular chunk of code never executes requests in parallel, CURL multi still provides a huge performance boost.

A quick test of this revealed instant results. Running a series of requests that took 14 seconds on the original configuration took only five seconds with CURL multi. (How does all of this compare to the built-in PHP HTTP Stream? It took 22 seconds to run the same tests, and it takes over seven minutes to run the same batch of tests that takes CURL multi 1.5 minutes.)

An Example in Code

While the HP Cloud library is object oriented, here is a simple procedural example that shows (basically) what my starting code looked like and what the finished code looked like.

Initially, we were using a simple method of executing CURL like this:

   function get($url) {
     // Create a handle.
     $handle = curl_init($url);
     // Set options...
     // Do the request.
     $ret = curl_exec($handle);
     // Do stuff with the results...
     // Destroy the handle.
     curl_close($handle);
   }
While our actual code does a lot of options configuring and then does a substantial amount with $handle after the curl_exec() call, this code illustrates the basic idea.

Refactoring to make use of CURL multi, the final code looked more like this:

   function get2($url) {
     // Create a handle.
     $handle = curl_init($url);
     // Set options...
     // Do the request.
     $ret = curlExecWithMulti($handle);
     // Do stuff with the results...
     // Destroy the handle.
     curl_close($handle);
   }

   function curlExecWithMulti($handle) {
     // In real life this is a class variable.
     static $multi = NULL;
     // Create a multi handle if necessary.
     if (empty($multi)) {
       $multi = curl_multi_init();
     }
     // Add the handle to be processed.
     curl_multi_add_handle($multi, $handle);
     // Do all the processing.
     $active = NULL;
     do {
       $ret = curl_multi_exec($multi, $active);
     } while ($ret == CURLM_CALL_MULTI_PERFORM);
     while ($active && $ret == CURLM_OK) {
       if (curl_multi_select($multi) != -1) {
         do {
           $mrc = curl_multi_exec($multi, $active);
         } while ($mrc == CURLM_CALL_MULTI_PERFORM);
       }
     }
     // Remove the handle from the multi processor.
     curl_multi_remove_handle($multi, $handle);
     return TRUE;
   }
Now, instead of using curl_exec(), we supply a method called curlExecWithMulti(). This function keeps a single static $multi instance (again, our actual implementation is more nuanced and less... Singleton-ish). This $multi instance is shared for all requests, and doing this allows us to make use of CURL multi's connection caching.

In each call to curlExecWithMulti(), we add $handle to the $multi request handler, execute it using CURL multi's execution style, and then remove the handle once we are done.

There is nothing particularly fancy about this implementation. It is actually even more complicated than it needs to be (I eventually want to make curlExecWithMulti() be able to take an array of handles for parallel processing). But it certainly does the trick.

Using that pattern for the HPCloud PHP library, I re-ran our unit tests. The unit test run typically takes between four and five minutes to handle several hundred REST requests. But with this pattern, the same tests took under a minute and a half -- and made over 300 requests over the same connection.

We will continue to evolve the HPCloud PHP library to improve performance even more. Parallel and asynchronous processing is one performance item on the roadmap. And we have others as well. If you've got some tricks you'd like to share, feel free to drop them in the issue queue at GitHub and let us know.

May 09 2012

I recently found myself faced with an interesting little web dev challenge. Here's the scenario. You've got a site that's powered by a PHP CMS (in this case, Drupal). One of the pages on this site contains a number of HTML text blocks, each of which must be user-editable with a rich-text editor (in this case, TinyMCE). However, some of the HTML within these text blocks (in this case, the unordered lists) needs some fairly advanced styling – the kind that's only possible either with CSS3 (using, for example, nth-child pseudo-selectors), with JS / jQuery manipulation, or with the addition of some extra markup (for example, some first, last, and first-in-row classes on the list item elements).

Naturally, IE7+ compatibility is required – so, CSS3 selectors are out. Injecting element attributes via jQuery is a viable option, but it's an ugly approach, and it may not kick in immediately on page load. Since the users will be editing this content via WYSIWYG, we can't expect them to manually add CSS classes to the markup, or to maintain any markup that the developer provides in such a form. That leaves only one option: injecting extra attributes on the server-side.

When it comes to HTML manipulation, there are two general approaches. The first is Parsing HTML The Cthulhu Way (i.e. using Regular Expressions). However, you already have one problem to solve – do you really want two? The second is to use an HTML parser. Sadly, this problem must be solved in PHP – which, unlike some other languages, lacks an obvious tool of choice in the realm of parsers. I chose to use PHP5's built-in DOMDocument library, which (from what I can tell) is one of the most mature and widely-used PHP HTML parsers available today. Here's my code snippet.

Markup parsing function

/**
 * Parses the specified markup content for unordered lists, and enriches
 * the list markup with unique identifier classes, 'first' and 'last'
 * classes, 'first-in-row' classes, and a prepended inside element for
 * each list item.
 *
 * @param $content
 *   The markup content to enrich.
 * @param $id_prefix
 *   Each list item is given a class with name 'PREFIX-item-XX'.
 *   Optional.
 * @param $items_per_row
 *   For each Nth element, add a 'first-in-row' class. Optional.
 *   If not set, no 'first-in-row' classes are added.
 * @param $prepend_to_li
 *   The name of an HTML element (e.g. 'span') to prepend inside
 *   each list item. Optional.
 *
 * @return
 *   Enriched markup content.
 */
function enrich_list_markup($content, $id_prefix = NULL, $items_per_row = NULL, $prepend_to_li = NULL) {
  // Trim leading and trailing whitespace, DOMDocument doesn't like it.
  $content = preg_replace('/^ */', '', $content);
  $content = preg_replace('/ *$/', '', $content);
  $content = preg_replace('/ *\n */', "\n", $content);
  // Remove newlines from the content, DOMDocument doesn't like them.
  $content = preg_replace('/[\r\n]/', '', $content);

  $doc = new DOMDocument();
  $doc->loadHTML($content);

  foreach ($doc->getElementsByTagName('ul') as $ul_node) {
    $i = 0;
    foreach ($ul_node->childNodes as $li_node) {
      $li_class_list = array();

      if ($id_prefix) {
        $li_class_list[] = $id_prefix . '-item-' . sprintf('%02d', $i+1);
      }
      if (!$i) {
        $li_class_list[] = 'first';
      }
      if ($i == $ul_node->childNodes->length-1) {
        $li_class_list[] = 'last';
      }
      if (!empty($items_per_row) && !($i % $items_per_row)) {
        $li_class_list[] = 'first-in-row';
      }

      $li_node->setAttribute('class', implode(' ', $li_class_list));

      if (!empty($prepend_to_li)) {
        $prepend_el = $doc->createElement($prepend_to_li);
        $li_node->insertBefore($prepend_el, $li_node->firstChild);
      }

      $i++;
    }
  }

  $content = $doc->saveHTML();

  // Manually fix up HTML entity encoding - if there's a better
  // solution for this, let me know.
  $content = str_replace('&acirc;&#128;&#147;', '&ndash;', $content);

  // Manually remove the doctype, html, and body tags that DOMDocument
  // wraps around the text. Apparently, this is the only easy way
  // to fix the problem:
  // http://stackoverflow.com/a/794548
  $content = mb_substr($content, 119, -15);

  return $content;
}
This is a fairly simple parsing routine, that loops through the li elements of the unordered lists in the text, and that adds some CSS classes, and also prepends a child node. There's some manual cleanup needed after the parsing is done, due to some quirks associated with DOMDocument.

Markup parsing example

For example, say your users have entered the following markup:

<ul>
  <li>Apples</li>
  <li>Bananas</li>
  <li>Boysenberries</li>
  <li>Peaches</li>
  <li>Lemons</li>
  <li>Grapes</li>
</ul>
And your designer has given you the following rules:

  • List items to be laid out in rows, with three items per row
  • The first and last items to be coloured purple
  • The third and fifth items to be coloured green
  • All other items to be coloured blue
  • Each list item to be given a coloured square 'bullet', which should be the same colour as the list item's background colour, but a darker shade

You can ready the markup for the implementation of these rules, by passing it through the parsing function as follows:

$content = enrich_list_markup($content, 'fruit', 3, 'span');

After parsing, your markup will be:

<ul>
  <li class="fruit-item-01 first first-in-row"><span></span>Apples</li>
  <li class="fruit-item-02"><span></span>Bananas</li>
  <li class="fruit-item-03"><span></span>Boysenberries</li>
  <li class="fruit-item-04 first-in-row"><span></span>Peaches</li>
  <li class="fruit-item-05"><span></span>Lemons</li>
  <li class="fruit-item-06 last"><span></span>Grapes</li>
</ul>

You can then whip up some CSS to make your designer happy:

#fruit ul {
  list-style-type: none;
}

#fruit ul li {
  display: block;
  width: 150px;
  padding: 20px 20px 20px 45px;
  float: left;
  margin: 0 0 20px 20px;
  background-color: #bbddfb;
  position: relative;
}

#fruit ul li.first-in-row {
  clear: both;
  margin-left: 0;
}

#fruit ul li span {
  display: block;
  position: absolute;
  left: 20px;
  top: 23px;
  width: 15px;
  height: 15px;
  background-color: #191970;
}

#fruit ul li.first, #fruit ul li.last {
  background-color: #968adc;
}

#fruit ul li.fruit-item-03, #fruit ul li.fruit-item-05 {
  background-color: #7bdca6;
}

#fruit ul li.first span, #fruit ul li.last span {
  background-color: #4b0082;
}

#fruit ul li.fruit-item-03 span, #fruit ul li.fruit-item-05 span {
  background-color: #00611c;
}
Your finished product is bound to win you smiles on every front:

  • Apples
  • Bananas
  • Boysenberries
  • Peaches
  • Lemons
  • Grapes

Obviously, this is just one example of how a markup parsing function might look, and of the exact end result that you might want to achieve with such parsing. Take everything presented here, and fiddle liberally to suit your needs.

In the approach I've presented here, I believe I've managed to achieve a reasonable balance between stakeholder needs (i.e. easily editable content, good implementation of visual design), hackery, and technical elegance. Also note that this article is not at all CMS-specific (the code snippets work stand-alone), nor is it particularly parser-specific, or even language-specific (although code snippets are in PHP). Feedback welcome.

May 05 2012

The Four Kitchens blog is running a story on how they used QueryPath and the Migrate module to migrate over 10,000 pages of content, in many different languages, into Drupal. I love to hear stories about the creative ways developers use QueryPath to accomplish complex tasks. A huge thanks to Mark Theunissen for the detailed write-up.

In related news, the new QueryPath 3 engine is just about done, and will make monster imports like this much faster.

Mar 21 2012

The video and slides for my DrupalCon Denver 2012 session are already available.

The slides and video can be found at the official site. Kudos to the DrupalCon Denver organizers, who are in the midst of running a fantastic conference.

Feb 29 2012

Your task: In PHP code, open a file compressed with BZ2, convert its contents from one character set to another, convert the entire contents to uppercase, run ROT-13 over it, and then write the output to another file. And do it as efficiently as possible.

Oh, and do it without any loops. Just for fun.

Actually, this task is exceptionally easy to do. Just make use of an often overlooked feature of PHP stream API: stream filters. Here's how.

Stream Filters In Theory

PHP uses the concept of "streams" as an abstraction layer for IO. Reading from and writing to files can be done with streams. Sockets, too, can be read from and written to as streams. FTP and HTTP servers have native stream support. That's why you can write this:

$contents = file_get_contents('http://example.com');

The above gets the entire contents of the webpage at the destination URL, and reads it as if it were a file on the local filesystem.

You can also open compressed or archive files like Phar, BZip2, and Gzip files, having the content decompressed on the fly.

Stream filters provide one more layer on top of this: They allow you to open a stream, and then have one or more tasks (filters) run on the stream as the data is read from or written to the stream.

For example, you can open a stream to a remote URL to a gzipped file, and have the file uncompressed as it is read.
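To make that concrete, here's a minimal sketch of on-the-fly gzip decompression using the zlib.inflate filter (the file name example.txt.gz is made up; the snippet writes the file first so it can run stand-alone, and the 'window' option requires PHP 7 or later):

```php
<?php
// Write a small gzip file so this sketch is self-contained.
file_put_contents('example.txt.gz', gzencode('Hello World.'));

$in = fopen('example.txt.gz', 'rb');
// window => 15 + 32 tells zlib to auto-detect gzip or zlib headers.
stream_filter_append($in, 'zlib.inflate', STREAM_FILTER_READ, ['window' => 15 + 32]);
$contents = stream_get_contents($in);
fclose($in);

echo $contents; // Hello World.
```

The decompression happens inside the stream layer as the bytes are read; the reading code never sees compressed data.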

Stream Filters In Practice

By way of reminder, here's our task:

We want to read a file compressed with BZip2 and transform it into a file where the data is capitalized, and run through the ROT-13 obfuscator (which "rotates" each character 13 places in the alphabet).

As we go, we will also re-encode the file from ISO-8859-1 to UTF-8.

Broken down into a sequence, we will do the following:

  • Open a stream for reading
  • Open a (plain) stream for writing
  • Uncompress the input stream
  • Transcode from ISO-8859-1 to UTF-8
  • Convert the contents of the stream to uppercase
  • Rotate the characters by ROT-13
  • Write the file out to a plain text file
  • Clean up

With stream filters, this is accomplished by creating a pair of streams, and then assigning filters to each stream. When we copy the data from one stream to another, the filters will be run internally. Other than assigning the filters, we do not have to intervene.

We will begin with the file test.txt.bz2, which is a bzip2-compressed text file whose contents are Hello World.. And we will generate a file called test-uppercase.txt.

Here's how we do it:

/**
 * Example of stream filtering.
 */

// Open two file handles.
$in = fopen('test.txt.bz2', 'rb');
$out = fopen('test-uppercase.txt', 'wb');

// Add a decompression filter to the first.
stream_filter_prepend($in, 'bzip2.decompress', STREAM_FILTER_READ);

// Change the charset from ISO-8859-1 to UTF-8.
stream_filter_append($out, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_WRITE);

// Uppercase the entire string.
stream_filter_append($out, 'string.toupper', STREAM_FILTER_WRITE);

// Run ROT-13 on the output.
stream_filter_append($out, 'string.rot13', STREAM_FILTER_WRITE);

// Now copy. All of the filters are applied here.
stream_copy_to_stream($in, $out);

// Clean up.
fclose($in);
fclose($out);

Now if we take a look at test-uppercase.txt, we will see that its contents look like this:

URYYB JBEYQ.
What the code does

The code above basically does the following:

  • Open an input and an output file.
  • On the input file...
    • Assign the bzip2.decompress filter in READ mode, which will decompress the input stream as it is read.
  • On the output file...
    • Use iconv to transcode from ISO-8859-1 to UTF-8
    • Use the string.toupper filter to transform the data to uppercase where applicable.
    • Use the string.rot13 filter to obfuscate the contents.
  • Then copy the input stream ($in) to the output stream ($out) in one step.
  • Finally, close the files.

It is important to note that none of the filters are actually applied until the streams are processed. So it is only when stream_copy_to_stream() is executed that all four filters are applied.

This method is far more efficient than performing the same operations in a loop because the copying is done at a lower level, where data does not have to be passed into and out of user space. So in addition to being easier to code, it is also faster and less memory intensive.

Some Important Details

Why doesn't stream filtering get used more often? One reason is that the documentation is sparse. To figure out how to use it, in fact, I had to read part of the C source code for PHP. (The unit tests helped a lot, too).

Here are some useful tips, though:

  • To find out (roughly) what filters are supported, you can use stream_get_filters().
  • The order of filters can be managed using stream_filter_append(), stream_filter_prepend(), and stream_filter_remove().
  • You can even write your own filters, should you so desire.
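As a sketch of that last point: a user-defined filter is a class extending php_user_filter, registered under a name with stream_filter_register(). The filter name and behavior below are invented for illustration (it swaps the case of ASCII letters):

```php
<?php
// A custom stream filter that swaps the case of ASCII letters.
class CaseSwapFilter extends php_user_filter {
  public function filter($in, $out, &$consumed, $closing): int {
    while ($bucket = stream_bucket_make_writeable($in)) {
      $bucket->data = strtr($bucket->data,
        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz');
      $consumed += $bucket->datalen;
      stream_bucket_append($out, $bucket);
    }
    return PSFS_PASS_ON;
  }
}
stream_filter_register('case.swap', 'CaseSwapFilter');

// Attach it to an in-memory stream and watch it run on write.
$fp = fopen('php://memory', 'r+');
stream_filter_append($fp, 'case.swap', STREAM_FILTER_WRITE);
fwrite($fp, 'Hello World.');
rewind($fp);
$swapped = stream_get_contents($fp);
echo $swapped; // hELLO wORLD.
```

Once registered, the custom filter name can be used anywhere the built-in filter names can.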

But one of the most frustrating aspects of the filtering library was figuring out which particular filters are supported. Running stream_get_filters() returns data like this:

    [0] => zlib.*
    [1] => bzip2.*
    [2] => convert.iconv.*
    [3] => string.rot13
    [4] => string.toupper
    [5] => string.tolower
    [6] => string.strip_tags
    [7] => convert.*
    [8] => consumed
    [9] => dechunk
    [8] => consumed
    [9] => dechunk

But what do we do with zlib.*? Here's what I found:

ZLib Filters

These support GZip compressing and decompressing.

  • zlib.inflate
  • zlib.deflate

BZip2 Filters

These support reading from and writing to a BZip2-compressed stream.

  • bzip2.decompress
  • bzip2.compress

Convert Filters

Base-64 and Quoted Printable seem to be the two formats supported by the convert filters:

  • convert.base64-encode
  • convert.base64-decode
  • convert.quoted-printable-encode
  • convert.quoted-printable-decode
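A quick sketch of the base64 filter in action, using PHP's php://memory wrapper so there is no file to clean up afterwards:

```php
<?php
// Base64-encode data as it is written to an in-memory stream.
$fp = fopen('php://memory', 'r+');
stream_filter_append($fp, 'convert.base64-encode', STREAM_FILTER_WRITE);
fwrite($fp, 'Hello World.');
rewind($fp);
$encoded = stream_get_contents($fp);
echo $encoded; // SGVsbG8gV29ybGQu
```

The corresponding convert.base64-decode filter reverses the process.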

Convert.Iconv Filters

The filter format for these is different than the others. It is something like this:

convert.iconv.<input-charset>/<output-charset>
Thus, convert.iconv.ISO-8859-13/ISO-8859-15 would convert from ISO-8859-13 into ISO-8859-15.

Presumably, any charactersets recognized by Iconv are supported by the filter.
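For instance, here's a sketch that transcodes UTF-8 to ISO-8859-1 on write (again using an in-memory stream; the sample string is arbitrary):

```php
<?php
// Transcode UTF-8 to ISO-8859-1 as data is written.
$fp = fopen('php://memory', 'r+');
stream_filter_append($fp, 'convert.iconv.UTF-8/ISO-8859-1', STREAM_FILTER_WRITE);
fwrite($fp, "caf\xC3\xA9"); // "café" in UTF-8 (two bytes for the é)
rewind($fp);
$latin1 = stream_get_contents($fp);
echo bin2hex($latin1); // 636166e9 -- the é is now the single byte 0xE9
```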

String Filters

These perform simple string manipulations:

  • string.toupper
  • string.tolower
  • string.rot13
  • string.strip_tags (removes HTML-like tags)


Other Filters

  • dechunk

This decodes an HTTP body that was sent with chunked transfer encoding.

  • consumed

I am not sure what this filter is for. The C code looks like it counts the number of bytes consumed during a particular filter run, but I'm not sure what this is used for. Testing it returns nothing.

If you know what this one is for, let me know in the comments.

Feb 01 2012
Feb 01

It's been a few years, now, since I gave up using PHPDocumentor to document my PHP projects. I switched to Doxygen, an automated documentation tool that supports a wide variety of languages, including PHP. While PHPDocumentor enjoys broad support in the PHP community, Doxygen, too, is well entrenched. (Drupal uses it.)

I recently began a new project from scratch, and it gave me an opportunity to once again turn a hard gaze upon Doxygen. After some careful reflection on my experiences developing this new medium-sized library and documenting it with Doxygen, here are what I see as Doxygen's strong and weak points when it comes to PHP API documentation.

The Pros


Proofing documentation is enough of a pain as it is. But waiting minutes for phpdoc to run was just downright aggravating. I often found myself kicking off a process and then going to do something else, and in the process forgetting all about my documentation tasks. It was a colossal waste of time.

Doxygen is blazingly fast. In my current setup, it generates graphics (class diagrams) along with the main documentation, and still does this in just a few seconds. Usually, regenerating documentation is so quick that I can't hit my browser's reload button before the generation has finished.


One of the most time-saving features in Doxygen is its autolinking support. As it analyzes comments, it looks for specific patterns -- camelCasing, parens(), and name::spaces to name a few -- and attempts to determine whether those strings are actually references to existing classes, functions, namespaces, variables, etc. If it determines that they are, it automatically generates a link to the appropriate generated documentation.

Autolinking and Code Highlighting

This is a fantastic feature, saving developers the time it would otherwise take to make this relation explicit.

While Doxygen sometimes gets false positives (and, occasionally, misses what seems to be an obvious case), this feature saves me copious amounts of time. I'm willing to accept a few misses.

Rich Command (Tag) Library

Doxygen has an astounding number of available commands. You're not limited to things like @param and @author. It has support for syntax-highlighted code blocks (@code and @endcode), subgrouping (@group, @ingroup, and so on), callouts (@todo, @attention, @remark, @bug), and extra stand-alone documentation that is not bound to a piece of source (@mainpage, @page, and so on). If you have a few hours to kill, you can peruse the entire list here: http://www.stack.nl/~dimitri/doxygen/commands.html.

In addition to this command goodness, Doxygen supports a subset of HTML tags, too. And this includes not just bold, italics, and fixed fonts, but also tables and images.
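By way of illustration, a doc block exercising a few of these commands might look like the sketch below (the function and group names are made up, and the commented-out documentation has no effect on the code, which runs as plain PHP):

```php
<?php
/**
 * @defgroup version_tools Version Tools
 * @{
 */

/**
 * Determines whether one version string is newer than another.
 *
 * @code
 * $newer = version_newer('2.0', '1.9'); // TRUE
 * @endcode
 *
 * @todo Support suffixes like "beta1".
 *
 * @param string $a
 *   The first version.
 * @param string $b
 *   The second version.
 * @retval bool
 *   TRUE if $a is newer than $b.
 */
function version_newer($a, $b) {
  // Delegate the comparison to PHP's built-in version_compare().
  return version_compare($a, $b, '>');
}

/** @} */

var_dump(version_newer('2.0', '1.9')); // bool(true)
```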

The Source Should Be as Readable as the Generated Docs

I like HTML, PDF, and man page documentation. I use these frequently. But when I'm actively working on a piece of code, it's far more likely that I will read the documentation in the source, rather than in an external web browser. So in my mind it is important that the source code be readable.

How many times do you find yourself writing lists -- ordered or unordered -- in your documentation? I do it quite frequently. And I don't want to have to write HTML tags for my lists. I want these lists to be just as easy to read in the source.

Doxygen makes writing lists as easy as it is in Markdown:

- this
- will
- be
- a 
- bullet
- list
-# this
-# will
-# be
-# a
-# numbered
-# list

Advanced Support for Objects and Classes

So much PHP is now done using its bolted-on object and class structure. Whether or not this feels bolted on in the language itself (and it does), it should feel natural in the documentation.

Doxygen accomplishes this with some great features:

  • When a class inherits from another class or interface, documentation is inherited too. No need to explain the same method twice, even if you did override a superclass method.
  • Classes can be navigated (in generated documentation) by their class hierarchy, by class members, by namespace, and alphabetically by name.
  • Methods are sorted into categories -- constructors/destructors, public, private, protected, static, and so on. Properties and constants are sorted thus, too.
  • AND Doxygen can generate graphical class diagrams. I know, it's eye candy. But beautiful functional eye candy.

Elegant UI Out-of-the-Box

Okay, I know this is kinda whiny, but I hate ugly documentation. I want API documentation to be attractive and I want findability to be top priority.

Doxygen has a very nice default theme (you can change it, of course). The colors are attractive, the fonts are clear, there aren't numerous cases of markup overflow... it's nice. Customizing colorscheme, logo, and so on is very easy -- it's done via configuration parameters. As I understand it, if you would like to write a full set of HTML templates to replace the defaults, you can do this too.

But findability is clearly the top priority, as it should be.

  • There's a built-in JavaScript-based search engine.
  • Navigation is done with a JavaScript-enabled tree structure.
  • There are no less than eight different paths to navigate into the documentation, ranging from by-file to full alphabetical index.

The Cons

The Dastardly Backslash

One of the worst decisions I think the PHP community has ever made is adopting the backslash (\) character as a namespace separator. Already used for escape sequences and Windows paths, backslash has become a source of frustration in more than one aspect of my PHP coding (for starters, my IDE keeps thinking I'm trying to escape stuff). I abhor using it.

But it gets worse! Doxygen, too, assigns special meaning to the backslash: Doxygen recognizes it as initiating a command. Yes, either @ or \ can be used to declare a command. @param and \param are semantically equivalent to Doxygen. And this is the root of much documentation confusion. Consider any case where you want to reference a PHP namespace in your documentation.

You write:

 * @see \Foo\Bar

And Doxygen sees:

 * @see @Foo @Bar

This not only causes Doxygen to emit errors, but it also munges up the output. The generated documentation will display something like "See: Foo" or sometimes just "See:".

The simple solution to the backslash problem

Doxygen does have a simple solution. It abstracts the concept of namespacing at a fairly high level, so you can use alternate namespacing separators instead of the backslash. Both :: (double-colon) and # (hash) seem to work in this capacity. Thus I have now developed the habit of documenting namespace references like this:

 * @see Foo::Bar

PHP Ambiguity

Let's be honest, PHP isn't exactly a first-class citizen in the Doxygen world, and the language's ambiguities sometimes keep it from working in exactly the expected way.

The place where this really shows is in the way @param and @return are processed. Sometimes PHP developers include type information in these directives, and sometimes they don't.

Here is documentation with type information:

 * @param string $foo
 *  An input string.
 * @return string
 *  An output string.

Here is the same documentation without:

 * @param $foo
 *  An input string.
 * @return
 *  An output string.

From a parsing perspective, the @return tag is problematic, for there are no clear lexical markers indicating whether there is type information. (Some of us conventionally use a newline, but this is not a standard).

To solve problems like this, Doxygen adds extra commands. For example, for typed return information in PHP, you should use @retval instead of @return.

 * @param string $foo
 *  An input string.
 * @retval string
 *  An output string.

Doxygen will correctly parse @retval in such a way that it preserves the type info.

A related issue is that Doxygen limits its types to the PHP primitives, including resource, object, and array. But it doesn't support setting namespaced class names as the return type.


Doxygen is feature rich. We didn't even cover some of its advanced options, like generating PDF files or man pages. Nor did we look at its ability to provide supplemental documentation. But it should be clear that it is an amazing tool for API documentation generation.

It has its glitches and drawbacks, but in my mind these are outweighed by its benefits.

Jan 10 2012
Jan 10

The following is a guest post by Mitchel Xavier

One of the challenges of developing with Drupal is to understand Drupal's structure. Until now, when working with the DOM structure, the DOM inspector has been the best tool for viewing the structure. A new tool has been created to make the visualization of the DOM structure much easier to interpret. It is a Firefox add-on called Tilt 3D. It creates a 3-dimensional interactive representation of the DOM elements as a layered visual image.

A requirement to use Tilt 3D is that your browser supports WebGL. WebGL is a JavaScript API which allows for the creation of 3D graphics very quickly and without the requirement for additional plugins. Currently Firefox is the only browser to support this tool. Firefox has supported WebGL since version 4. The other requirement for Tilt 3D is a capable graphics card.

Tilt 3D is extremely useful for many reasons. When you click on each node, you can see the structure of each element. You can view the HTML and CSS of each node. It is great for debugging HTML structure issues. It also provides the option to refresh in real time during all changes made in Firebug.

Tilt 3D was created by Victor Porof and Rob Campbell, who created Firebug. One of the advantages of Tilt 3D is its smooth interface. It is very intuitive to use and creates a completely new way of interacting and obtaining information about your Drupal website. If you are starting work on an existing Drupal project, Tilt 3D would be a great way to understand the structure of that particular project.


Mitchel Xavier is a Drupal website designer and Drupal 7 specialist.

Jan 10 2012
Jan 10

Vim (VI Improved) is a powerful text editor that comes standard on most versions of Linux, OS X, BSD, and other UNIXes. With thousands of add-ons, console and GUI versions, and a fully scriptable environment, you can transform a humble text editor into a powerful development tool. In fact, there are several Drupal add-ons for vim.

In this article, I explain how to turn on syntax checking for PHP, adding code style validation along with error checking. We do this with three tools: The Syntastic Vim plugin, the PHP CodeSniffer PEAR package, and the Drupal Code Sniffer project from Drupal.org.


Syntastic

The first thing to do is add a Vim module that provides generic syntax checking for programming languages. The syntastic Vim plugin includes support for a few dozen languages out of the box, including PHP, CSS, Ruby, (X)HTML, and several other common web-oriented languages.

The installation instructions for Syntastic suggest using another Vim plugin (pathogen) to install Syntastic. I use the Janus Vim bundle, which comes with Syntastic.

PHP CodeSniffer

Once Syntastic is installed and working, we can extend it a little. Syntastic uses the php command to perform error checking, but it can also perform code style checks, giving you an inline warning when your code does not conform to convention.

To gain this extra level of support, we need the PHP CodeSniffer, which provides the commandline client phpcs. This can be installed easily using PHP's pear package manager:

$ pear install PHP_CodeSniffer

(You may need to execute the above with sudo or as the root user.)

Once this is installed, you should have a program called phpcs available on the commandline. You can test it on a PHP file by running phpcs foo.php, where foo.php is some existing PHP file.

By default, PHP CodeSniffer enforces a particular coding standard: The PEAR standard. This is not the same coding standard that Drupal uses. In fact, since PEAR requires a 4-space indent, chances are that it will give you a LOT of warnings.

What we need are Drupal-specific code sniffing rules. And those will come from a different package.

The Drupal Code Sniffer

The Drupal Code Sniffer is a package maintained at Drupal.org. It provides grammars and syntaxes for PHP Code Sniffer.

To install it, you can either download the project snapshot or retrieve it from Drupal's Git repository.

$ git clone drupal://drupalcs

You will then need to follow the instructions on the project page to install DrupalCS. In most cases, this merely requires creating a symbolic link or moving a directory.

Note: One installation suggestion is to use a Bash alias for phpcs. This will not work with Syntastic and Vim without some extra legwork on your part. I suggest sticking with the standard installation instructions.

Once you are done, you can test whether phpcs can see your new format:

phpcs -i
The installed coding standards are DrupalCodingStandard, MySource, PEAR, PHPCS, Squiz and Zend

If DrupalCodingStandard is missing from the list above, you have not successfully installed the DrupalCS package.

Editing Your .vimrc (or .vimrc.local)

The last thing to do is tell Vim and Syntastic about your preferred syntax. The easiest way to do this is to add the following line in the file called .vimrc (note the leading dot). This should be in your home directory.

$ echo "let g:syntastic_phpcs_conf='--standard=DrupalCodingStandard'" >> ~/.vimrc

The above will add the following line to the end of your .vimrc:

let g:syntastic_phpcs_conf='--standard=DrupalCodingStandard'

From that point on, anytime you edit a file, VIM will be able to check your code formatting against the Drupal coding standards.

Working with Syntastic

There are many options available with Syntastic. To get to know them, you can execute :h syntastic inside of Vim. This will show Syntastic's built-in help. Janus, the bundle of Vim plugins that I use, automatically configured Syntastic to run each time the file is saved. But by default you must manually execute Syntastic. Here are a few commands that will come in helpful:


:SyntasticCheck

This causes Syntastic to check the file and report any errors. As I have Syntastic configured, it will flag each line with a left-marginal red box for an error, and a yellow box for a warning.


To see the verbose error report from Syntastic, you can execute :Errors. In the screenshot above, you can see the error pane at the bottom of the window. For each error, it prints out the file (not shown), the line and character numbers, and the message.

To close the error console, switch to that pane (CTRL-W, CTRL-W) and close it (:q).

For most practical cases, that's really all there is to it.

Syntax errors are nice, but code formatting errors are SO ANNOYING. How do I turn it off?

As we have it configured, two checks are run on your code:

  • PHP itself is used to check for syntax errors (php -l).
  • PHP CS is used to check your code style.

But the code style "errors" are not compilation errors, but merely violations of some coding standard's formatting guidelines. As helpful as this can be at times, it can also be a nuisance.

Want to disable format checking? There's a flag for that:

let g:syntastic_phpcs_disable=1

The flag above disables the format checking for phpcs, but leaves the PHP syntax checking on. So you will only get real syntax errors, not formatting errors.

Customizing PHPCS Output

Want to tune and tweak the output of phpcs? The best way to do this is to test it by hand, running phpcs and trying out various options.

The command phpcs --help will print out a detailed list of available options.
Updated: Ryan Johnston's fixes have been incorporated.

Oct 29 2011
Oct 29

The Drupal codebase upon which I work is now over a million lines of code (excluding whitespace and comments). It sounds impressive. But the reality of the matter is that the combination of lots of code and the Drupal way of doing things makes it not impressive, but a maintenance nightmare. Nobody on the current team knows what all of this code does or what it is for. Even limiting things to the custom modules, there still is no longer any member of the team who knows the code well. This, of course, isn't a criticism of the team or even of the platform, but a reflection on what happens when a codebase balloons over the years.

Reading Steve Yegge's post entitled Code's Worst Enemy hit home the concern I have with our code -- and with Drupal in general. (Update 10/29/2011: Steve Yegge's the guy who accidentally posted the Amazon/Services rant on Google+, and who unintentionally "quit his job" in the middle of his presentation at OSCON.)

I suggest reading the entire blog post on its own, but here are several salient details that need explicit mention, and that have a Drupal context:

  • While some languages (Java) may exacerbate the problem, clearly ballooning code can happen in any language. And with a semi-opaque execution sequence (as we have in Drupal), the problem can be compounded by the fact that one cannot determine at a glance what code might be executed on a given execution. To know what code will be executed on a given request, you must know not just core and your own modules, but all of the installed modules.
  • Design Patterns might deserve a measure of skepticism. Steve's point is that relying upon them can introduce needless complexity. He uses Dependency Injection as an example. Too often, design patterns are introduced for their own sake or because they look similar to what we want to accomplish. But then the need to (re-)architect in terms of the pattern sometimes overshadows the original goal of accomplishing a task.
  • Copy-and-Paste (CAP) code is bad. Obviously. But because all of Drupal is a public API, I often see developers choosing to CAP code from function body to function body because they think that is more elegant than providing highly-contextual stand-alone functions that might be mistaken by other developers as "generally useful". (No, prefixing functions with underscores is NOT a good alternative. Lately, I've been encouraging developers to underscore all functions that aren't hooks or constructed callbacks because it's too easy to get hook/namespace collisions otherwise.)
  • Unfortunately, Steve doesn't talk about YAGNI ("You Ain't Gonna Need It") as a good design principle, but the converse of YAGNI -- that tendency to attempt to solve all possible cases before there are any actual cases -- is a dangerous tendency in software development that must be countered in the name of simplicity and maintainability.

(This post was written in July, 2011.)

Oct 29 2011
Oct 29

In hindsight, I'm surprised how long it took me to develop a strong appreciation of code formatting standards. It's not that I haven't followed them all along (most IDEs and editors do the lion's share of that for you). What surprises me is that I never really appreciated the value of following them. But managing a codebase of over a million lines makes it readily apparent that coding standards are a big boon -- and that lapses in those standards adversely impact the entire team.

The primary reason for coding standards is this: humans are worse at syntax parsing than machines are. Coding standards exist to make the code easier for humans to work with, and they do this by making the code more amenable to visual scanning.

There are four benefits to be gained from following coding standards: reduction of bugs, preventing new bugs, lowering the learning curve, and easing long term maintenance. I discuss them below.

Clarity is the enemy of bugs. Where the code is murky, bugs flourish.

Here's an example of a code practice that seems to creep into PHP code from time to time, further obfuscated in a very real-world way by Drupal-ish data structures:

function mymodule_get_version($data) {
  return $data['field_version_number'][0]['value'] ? $data['field_version_number'][0]['value'] : $data['field_old_version_number'][0]['value'] ? $data['field_old_version_number'][0]['value'] : 0;
}

The code above is supposed to return the version number if it exists. If no version number exists, it returns the old version number if it exists. And if neither a new nor an old version number exists, it returns 0.

There is a bug in that code. What is it? To find it, let's reformat the code to follow good code formatting practices. Specifically, we will apply the following:

  • Use variables instead of repeated array dereferencing.
  • Remove the nested ternary.
  • Eliminate (at least some of) the ambiguity of the tests being done in conditionals.

And here's what we get:

function mymodule_get_version($data) {
  $version = $data['field_version_number'][0]['value'];
  $old_version = $data['field_old_version_number'][0]['value'];
  $final_version = 0;
  if (!empty($version)) {
    $final_version = $version;
  }
  elseif (!empty($old_version)) {
    $final_version = $old_version;
  }
  return $final_version;
}

But wait! The bug is gone! Why? Because what we were actually doing in the original code is this:

function mymodule_get_version($data) {
  $version = $data['field_version_number'][0]['value'];
  $old_version = $data['field_old_version_number'][0]['value'];
  $final_version = 0;
  // THIS is the first check run...
  if (!empty($old_version)) {
    $final_version = $old_version;
  }
  // And the new version is only checked if an old version wasn't found.
  elseif (!empty($version)) {
    $final_version = $version;
  }
  return $final_version;
}

But that was obscured by the poorly formatted code. (If you don't believe me, test it.)
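If you want to test it without the Drupal data structures, here's a minimal stand-alone sketch (the sample values are hypothetical). PHP groups nested ternaries from the left -- recent PHP versions actually reject the unparenthesized form outright, so the grouping is written out explicitly below -- and that is why the old version "wins" even when a new version exists:

```php
<?php
$version = '2.0';      // The new version number exists...
$old_version = '1.0';  // ...and so does the old one.

// The original one-liner is grouped by PHP like this:
$result = ($version ? $version : $old_version) ? $old_version : 0;

echo $result; // 1.0 -- the old version, even though $version is set!
```

The left-hand group evaluates to a truthy string, so the outer ternary always picks $old_version whenever either version exists.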

I see this simple mistake all the time, and I've even accidentally committed it several times:

Original code:

if ($foo)
  do_something();

Now a new feature request comes in... and we need to add another step:

if ($foo)
  do_something();
  _do_something();

Oops! Now _do_something() is run regardless of whether $foo is TRUE.

Maybe it's Python syntax haunting me, or maybe I really am that obtuse... but I've been guilty of introducing bugs by simply not noticing that there are no curly braces. And since that code is perfectly valid, I'm not going to get a syntax error or anything.

Of course, the easiest way to prevent this is to follow the standard that any if statements should always use curly braces, even for simple tests:

if ($foo) {
  do_something();
}

Drupal coding standards require this, and I believe that this is the reason why.

Learning a new codebase can be daunting. One must learn new patterns, overall architecture, classes, methods, and functions, terminology, and so on.

Why make it harder by requiring that the new developer also spend time figuring out what a particular piece of code is supposed to be doing?

Here's an example of code whose formatting significantly decreases understandability:

$i = 0;
$total = array();
$list = array(1, 7, NULL, 4);
if ($list) while ($val = $list[$i++]) if ($val) $total += range(
      0,
      $val
    );

Yes, this is taken from several real-world examples of code formatting that I have seen. I've seen variations of the above that span hundreds of columns and (I kid you not) 20+ lines.

There are several things that are confusing about the above:

  • Superficially, it looks like a simple conditional. (Imagine scanning down the left-hand side of a large piece of code. Would you notice that there are three control structures here?)
  • The if/while/if all on one line requires the reader to pay close attention.
  • There are multiple operations all happening on that one line.
  • The range function is split onto four lines (presumably to keep the other line from getting too long), making a standard (short) function call jarringly long.
  • The indentation is actually misleading, as it makes the function arguments appear to be the body of the outer if's block. In fact, the code should be indented 3-4 indentations -- one for each control structure plus one for the function call args.

Because of all of this, a very simple operation is being obfuscated. The above could actually be done far more efficiently in other ways, but in order to even see that, one must cut through the formatting issues. Why make something hard that could be so much easier?

Finally, when you follow coding standards, you're doing yourself (and your team) a favor. You're making it easier on future-you. When you come back to the code six months later, you won't be thrown off by your own past-cleverness. Your life will be easier. And your team (should you have one) won't curse you for making their lives harder.

Few people like maintaining code. So why make that part of your job any harder than it is?

These are arguments against coding standards that I have heard:

"Well formatted" code introduces bloat

Presumably, "bloat" here is supposed to mean non-essential curly braces, whitespace, or line count.

But that's not what we usually mean by bloat. This code isn't adversely impacting either the readability or the execution time of the code. If files are longer because of coding standards, they are longer in a good way. They are longer so that the code is more manageable and less prone to bugs. That is a standard goal in software development, and most certainly is not "needless bloat".

All that extra stuff slows down the parser

Do a few lines of poorly formatted code make the code parse faster?

I have never seen benchmarks that support this theory. When I tested some using compressed and uncompressed versions of QueryPath (and we're talking about removing ALL nonessentials from tens of thousands of lines of PHP code) I found no substantial differences in loading/execution time.

Furthermore, since most sites use opcode caches, even if there was a runtime slowdown the first time a file was parsed, it would vanish after that initial load. Subsequent loads would use the cached opcodes.

Writing complex lines of code shows that I am a better developer than others

I suspect plenty of developers think this, but several years ago a developer actually said it to me. That was his justification for writing massive series of nested ternaries.

I came away from the conversation (and the code) thinking precisely the opposite -- that he wasn't a terribly good coder, and that he tried to hide this fact with needless complexity. He wanted others to see the code and say "hmmm... that looks complex. It must have been written by a master." Instead, the code was just frustrating.

Sometimes it feels like a needless pain. Sometimes it feels like wasted effort. Sometimes all those extra spaces and parens feel somehow wasted. But they're not. Code is an intermediate language -- one designed for both humans and machines. Making the code machine-friendly is only half of the equation. Paying attention to the other half isn't a waste. It's a long-term investment in continued quality.

Oct 27 2011
Oct 27

In years past, I used to do my development on a local machine, and then push my work to a remote server for testing. About two years ago, though, I switched my environment. I began using virtual machines instead of physical servers. Configuring them for Drupal, I could do my Drupal development locally, and then do advanced testing on my virtual machine.

In this article, I give five reasons why I believe Drupal development can be enhanced through using VMs.

1. Replicate the Real Server(s)

What I am advocating is using a VM that acts like a (production) server. If the server has only one site on it, the VM should have only one. If the server hosts six sites and you are developing on all six, the VM should also host all six.

Why all the bother? Why not just create a single VM for all projects and toss them all in?

Because the idea is to make the VM as close to production as possible, so that you catch the same sorts of bugs and nuances that are going to crop up on the server. Many configuration nuances can be caught and dealt with in this way. Bugs that show up on, say, a Linux server but not on a Mac can be caught earlier when you run your tests on a Linux VM instead of the local workstation OS.

Here are the things I tend to try to replicate:

  • The host OS: I try to get exactly the same version
  • The app stack: I try to install the same versions of Apache, PHP, MySQL, and so on.
  • Custom configurations: To unearth those config file bugs, I try to copy php.ini and the relevant Apache configs, too.
  • Supporting libraries: PEAR and PECL packages play a crucial role in PHP development. I install the same ones the server has.

Things I sometimes try to replicate:

  • RAM or disk space: On apps where memory or storage plays a crucial role, I try to mirror this in the VM. Normally, though, I set these properties based on host OS limitations.
  • Other servers: Sometimes I need to test out database replication, multi-server caching, or proxying. In such cases, I sometimes set up other VMs to mimic these facilities. Sometimes I will configure a VM to do both parts (e.g. running Varnish on my VM, even though in production Varnish has its own server cluster). Typically, though, I do these only as I need to.

2. Catch Stealth Dependencies

In the fabulous book The Productive Programmer, Neal Ford discusses the problem of software dependencies that "creep into" a project through the IDE or other workstation tools. The problem, in a nutshell, is that developers tend to install all sorts of things on their local workstations -- many of which are not directly related to the software being produced. Yet once they are installed, developers will use them.

What happens when a developer has some PEAR package installed, uses the API, and then deploys the resulting code to the production server (without first installing the same PEAR package)?

The developer isn't necessarily being lax in pushing this code to a server. After all, he or she may not even realize that a dependency has been added. The function was just there! All the developer did was use it!

Drupal sometimes makes this problem worse, as modules do not have a way of uniformly declaring a dependency on a PEAR or PECL package. It may "just work" on a local workstation that already has the library, and perhaps totally fail in production.

Using a VM for development will help catch those stealth dependencies (after all, the code won't work on the VM until you install the necessary libraries, at which time you file a ticket to install those dependencies on production). It can also help identify subtle differences between different versions of core app software (like Apache or PHP).

3. Isolation of resources

Sometimes I crash my servers. I'm glad that that no longer crashes my workstation. Sometimes I configure Varnish to eat up all my memory. I'm glad that only eats up all the VM's memory. Sometimes I get MySQL thrashing.

Since the VM's resources are isolated from the underlying OS, I can still work on my machine even if the server is pulling a massive DB load. And I can kill the server if I have to -- without the inconvenience of restarting the workstation. In fact, it's nice to be able to run a kernel upgrade without having to drop out of IRC. The two are separated.

4. Easy rollbacks

Most VM software provides tools for "snapshot and rollback." Take a picture of the current state of the VM. Then, at a later time, revert to that snapshot.

I've found this tremendously useful for various benchmarking tests, where I can reliably reset test environments.

Of course, it's also a nice feature when an install goes haywire, a database corrupts, or you accidentally run rm -rf /bin. (Oh, yes I did!)

Finally, all of the VM layers I have used recently allow the creation of images from which clone VMs can be built. I've used this to create a base system and share it with other developers or quickly ramp up a new project.

5. Mimic Deployment and Other Maintenance Tasks

The final nicety is that VMs provide a way to test deployment and maintenance tools. Often, deployment scripts or configurations will be written for just one task: Push from staging to production. Few eyes see the code, and since the code is run relatively rarely, bugs often remain unnoticed over several releases (especially on a large site where small problems don't always present themselves in an obvious way).

Use Drush to deploy to production? Then configure your host and VM for Drush deployment and run the same procedure. Run a custom script? Retool the script to work on the VM, too. It's better to find out about flaws during routine development than during the 2AM maintenance window.

Aug 19 2011
Aug 19

What is "good" code? We toss around all kinds of high-level definitions, but these are usually either hopelessly vague or hopelessly complex. This month's Pragmatic Programmer magazine offers a fantastic definition, summarized on this card:
Seven Virtues of Code
(Apparently originally from Agile in a Flash)

Tim Ottinger and Jeff Langr's PragProg article is fantastic -- an absolute must read for anyone who regularly works on software. I tell you. I beg you. I entreat you. (I'd bribe you if I could afford it.) Please read it.

Aug 18 2011
Aug 18

There are very good uses for Switch statements. They can be great for nesting logic. But sometimes they are used as a way to (essentially) map a name to a value. In such a scenario, the body of the case is just a simple assignment.

After noticing this pattern a lot recently, I thought I'd benchmark it against the obvious replacement candidate: a hash table lookup. PHP arrays are more like ordered hash tables. For that reason, they provide very fast random access. Are they faster than a switch statement?

I ran two tests. In the first case, I assumed no default. In the second, I assumed a default. The conclusion: Arrays are faster. Read on for the details.

Benchmarking Switch vs. Array index

I took the basic case from what I see as the most typical use of a switch statement: About five different case statements, each of which has a distinct value. In the first test, there is no default case.

No default case:

  $iterations = 10000;
  $options = array('apple', 'banana', 'carrot', 'date', 'endive');
  $color = NULL;
  // Fill an array with random keys. This ensures
  // that (a) we use the same keys, and (b)
  // slowness in the randomizer doesn't impact the
  // loops (which can happen if entropy collection kicks in).
  $samples = array();
  for ($i = 0; $i < $iterations; ++$i) {
    $samples[] = $options[rand(0, 4)];
  }
  // Test a switch statement.
  $start_switch = microtime(TRUE);
  for ($i = 0; $i < $iterations; ++$i) {
    $option = $samples[$i];
    switch ($option) {
      case 'apple':
        $color = 'red';
        break;
      case 'banana':
        $color = 'yellow';
        break;
      case 'carrot':
        $color = 'orange';
        break;
      case 'date':
        $color = 'brown';
        break;
      case 'endive':
        $color = 'green';
        break;
    }
  }
  $end_switch = microtime(TRUE);
  $total_switch = $end_switch - $start_switch;
  printf("Switch:\t%0.6f sec to process %d" . PHP_EOL, $total_switch, $iterations);
  // Test an array lookup.
  $start_map = microtime(TRUE);
  $map = array(
    'apple' => 'red',
    'banana' => 'yellow',
    'carrot' => 'orange',
    'date' => 'brown',
    'endive' => 'green',
  );
  for ($i = 0; $i < $iterations; ++$i) {
    $option = $samples[$i];
    $color = $map[$option];
  }
  $end_map = microtime(TRUE);
  $total_map = $end_map - $start_map;
  printf("Map:\t%0.6f sec to process %d" . PHP_EOL, $total_map, $iterations);

I compare two methods.

First, I test the basic switch. Again, the point of comparison is evaluating an average case of using switch to do assignments.

Second, I test doing the same lookups against a PHP array, which is very quick at random key access because it works like a hash table.


I ran numerous iterations of the test, and the average indicates that using a PHP array is about twice as fast as using a switch. Here's the output of a representative run of the script above:

  Switch: 0.004895 sec to process 10000
  Map:    0.002009 sec to process 10000

Benchmarking with a default value

But wait! One nicety of a switch statement is the ability to set a default. The first benchmark doesn't measure that. Let's try it and see if the array approach still wins.

Default case:

  $iterations = 10000;
  $options = array('apple', 'banana', 'carrot', 'date', 'endive');
  $color = NULL;
  // Fill an array with random keys. This ensures
  // that (a) we use the same keys, and (b)
  // slowness in the randomizer doesn't impact the
  // loops (which can happen if entropy collection kicks in).
  $samples = array();
  for ($i = 0; $i < $iterations; ++$i) {
    $samples[] = $options[rand(0, 4)];
  }
  // Test a switch statement.
  $start_switch = microtime(TRUE);
  for ($i = 0; $i < $iterations; ++$i) {
    $option = $samples[$i];
    switch ($option) {
      case 'apple':
        $color = 'red';
        break;
      case 'banana':
        $color = 'yellow';
        break;
      case 'carrot':
        $color = 'orange';
        break;
      case 'date':
        $color = 'brown';
        break;
      // The 'endive' case is now handled by the default.
      default:
        $color = 'green';
        break;
    }
  }
  $end_switch = microtime(TRUE);
  $total_switch = $end_switch - $start_switch;
  printf("Switch:\t%0.6f sec to process %d" . PHP_EOL, $total_switch, $iterations);
  // Test an array lookup.
  $start_map = microtime(TRUE);
  $map = array(
    'apple' => 'red',
    'banana' => 'yellow',
    'carrot' => 'orange',
    'date' => 'brown',
    // 'endive' is now handled by the fallback below.
  );
  for ($i = 0; $i < $iterations; ++$i) {
    $option = $samples[$i];
    if (isset($map[$option])) {
      $color = $map[$option];
    }
    else {
      $color = 'green';
    }
  }
  $end_map = microtime(TRUE);
  $total_map = $end_map - $start_map;
  printf("Map:\t%0.6f sec to process %d" . PHP_EOL, $total_map, $iterations);


Unsurprisingly, the additional if/else and isset() added a little bit of time to the array-based method. But not enough.

Somewhat more surprising was the fact that changing one value from a case to a default made the switch seem slightly faster. Over the numerous runs I did, the switch was consistently faster with the default than it was without. Usually, it wasn't a lot faster (it tended to hover around 0.0038), but... it was faster. Yet I never produced a case where the switch was faster than the array lookup.

  Switch:   0.003257 sec to process 10000
  Map:    0.002677 sec to process 10000


Matt Farina suggested that the test isn't totally fair. The $map really should be declared outside the timer, as that more accurately reflects the fact that the switch is parsed and built outside of the timer. That might offer a tiny performance boost to the array version.

My aversion to using switch statements to assign values has always been at a higher level. I don't like the fact that they take up both horizontal and vertical space. They look ugly, and they provide more "brain overhead" to read.

Switch statements also seem to be somewhat error prone. Developers sometimes forget break statements, which can sometimes result in hard-to-locate bugs.
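To illustrate that failure mode, here's a small contrived sketch (the function name is made up for the example) showing how one forgotten break silently changes the result:

```php
<?php
// A forgotten break after the 'apple' case lets execution
// fall through into the 'banana' case.
function fruit_color($fruit) {
  switch ($fruit) {
    case 'apple':
      $color = 'red';
      // Oops: no break here.
    case 'banana':
      $color = 'yellow';
      break;
    default:
      $color = 'unknown';
  }
  return $color;
}

// Prints 'yellow' -- not the intended 'red'.
print fruit_color('apple') . PHP_EOL;
```

The code runs without any complaint from PHP, which is exactly why these bugs are hard to locate.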

For the record, when it comes to switch vs if/elseif/else, the two are about the same: https://gist.github.com/d1fe59a23daa33aaf6fe (Note that I tested against a very small set of options, as that is the common use case.)
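For reference, the if/elseif form of the same fruit-to-color lookup would look something like this (a sketch, not the exact code from the gist):

```php
<?php
// An if/elseif chain doing the same lookup as the switch above;
// this form benchmarks about the same as a switch.
function lookup_color($option) {
  if ($option == 'apple') {
    return 'red';
  }
  elseif ($option == 'banana') {
    return 'yellow';
  }
  elseif ($option == 'carrot') {
    return 'orange';
  }
  elseif ($option == 'date') {
    return 'brown';
  }
  // Fallback, matching the switch's default.
  return 'green';
}

print lookup_color('carrot') . PHP_EOL;
```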

Finally, I will state once again that I am not claiming that switch statements are no good, worthless, or should never be used. Rather, I'm pointing out that for a very common set of circumstances switch statements should not be used.

Jul 26 2011
Jul 26

I'm continually surprised by PHP programmers who argue tooth and nail that a PHP "array" is a "real array". These same programmers, who often program in at least one other language (JavaScript), seem confused over what an array is and how it ought to work. (One even referred to collections classes in other languages as "needless bloat," a telling symptom of this misunderstanding.) Conversely, new PHP developers who come from other languages are often confused by the fact that PHP arrays don't work as expected. Strange things happen to array ordering and such. The confusion can be cleared up at the terminological level. The simple fact is that PHP arrays are not arrays in the traditional sense. They are ordered hash tables. I will explain with several examples.

Among PHP's built-in types, there is a widely-used type called array. It is the only native collection-like type. It's constructed like this: array(). It comes with a wide variety of supporting functions, such as array_walk(), array_values(), array_diff() and array_combine().

At face value, the syntax of an array is similar to other languages:

// Create an array and assign the first two values.
$foo = array();
$foo[0] = 'First slot';
$foo[1] = 'Second slot';
// Create an array with five elements, and then iterate over the list.
$bar = array(1, 2, 3, 4, 5);
for ($i = 0; $i < count($bar); ++$i) {
  // Outputs '12345'.
  print $bar[$i];
}

Yet despite its trappings, a PHP array is not at all an array. It's an ordered hash table (ordered hash map, order-preserving dictionary). For that reason, you can do things like this:

$foo = array();
$foo['a'] = 'First slot';
$foo['b'] = 'Second slot';
$foo['c'] = 'Third slot';
// This will result in 'abc', because order is preserved.
foreach ($foo as $key => $value) {
  print $key;
}

I have heard people argue "No, it's both! Here's why: The keys can be ints or any other scalar. When they're ints, the data structure works like an array."

That is not exactly true. It is true that if you supply only values, integer keys will be automatically assigned. And it's true that there are a wide variety of shortcuts to make PHP arrays act like arrays. But they are never really arrays at all. (Nor are they linked lists or array lists.) Here's a simple example illustrating why this is the case:

$bar = array();
for ($i = 7; $i >= 0; --$i) {
  $bar[$i] = $i;
}
// Print the array as a space-separated string of values.
print implode(' ', $bar);

The above loop creates an array with eight elements, assigning them by integer key. However, it assigns values in reverse order. What should the final contents of the array be (and in what order)?

If it were an array, then the final line should print this:

0 1 2 3 4 5 6 7

In fact, it prints this:

7 6 5 4 3 2 1 0

Why? Because this is not an array. It's an ordered hash map. The "real" index of the ordering list is (apparently) inaccessible to us, but it assigns the pair 7 => 7 to the first (0) position, and the pair 0 => 0 to the eighth position (7). Write the same code in JavaScript, Python, Ruby, Perl, Java, or C# using arrays or indexed lists and you will get the first result, a list from 0 to 7.

In case you think I've pulled out some odd edge case, here's another example that creates an array and initializes it with values, missing only the value for index 4, which is added later:

$bar = array(0 => 'a', 1 => 'a', 2 => 'a', 3 => 'a', 5 => 'a', 6 => 'a', 7 => 'a');
$bar[4] = 'b';
print implode(' ', $bar);

What we would expect the above to print is 'a a a a b a a a'. What it actually prints is 'a a a a a a a b'. Why? Because the index position of 4 is occupied by 5, and when we inserted $bar[4], it was actually slotted into the eighth spot. Even though we've ordered the keys numerically, this is still a hash table.

The confusion between array-ness and the PHP array type can cause some quirky bugs. I recently found code that effectively did this:

$foo = array();
$foo[0] = 'First';
$foo[1] = 'Second';
$foo[2] = 'Third';
$foo[3] = 'Last';
// Under some conditions, an item was deleted like this:
unset($foo[1]);
// Under other conditions, an item was changed like this.
$foo[1] = 'New second';
print implode(', ', $foo);

It should now be unsurprising that the final order of the array above was '0 2 3 1'. Yet for those used to working with arrays in other languages, that final order is surprising. To get things back in order, you must do a ksort() (sort by key) to re-order the list.
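Continuing the example above, a quick sketch of the ksort() fix:

```php
<?php
// After the unset-and-reassign dance, the array is ordered by
// insertion: keys 0, 2, 3, 1.
$foo = array(0 => 'First', 2 => 'Third', 3 => 'Last', 1 => 'New second');
// ksort() re-orders the hash by its keys.
ksort($foo);
// Prints 'First, New second, Third, Last'.
print implode(', ', $foo) . PHP_EOL;
```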

If you're going to implement only one collection type in the language, I think an ordered hash table is a pretty darn good choice... I just don't think it should have been called an array.

Nov 20 2010
Nov 20

Over the past several years that I’ve been working with Drupal as a vendor I’ve become enamored by its power, its flexibility, its support and perhaps most of all with its awesome community.  There is one thing however I’ve personally never quite gotten around.  End Users.

See, the sad thing is most people I’ve met are…well…dummies.  The kind of people who are so tech illiterate they have no idea how illiterate they are.  And, typically I have to support these technological nincompoops.  So, I want to appeal to the Drupal community that is so great and so awesome, and ask how the other site providers among us deal with making Drupal — especially the content creation side — less “Drupal” for end users.

Less Drupal as in, a no-nonsense UI with few to no options. Very task oriented and much less “powerful”, but much more self-describing. Does anyone have any idea on how to do this with existing modules, themes, what have you? And, I promise when the dust settles to write another post documenting everything on simplifying the Drupal interface.  Deal?  Deal :)


Jun 30 2010
Jun 30

One of the most used functions in Drupal's database abstraction layer is db_query, which allows passing an SQL string and corresponding arguments to send a query to the database. I'll give you a quick overview of how db_query works before showing you how to drupalize a query such as:

SELECT field1, field2 FROM table_name WHERE field1 IN (value1,value2,...)

Drupal database abstraction layer and db_query

Like many other content management systems and web development frameworks, Drupal implements a database abstraction layer that you, the developer, can use to write code that works with different database servers. For this to work you need to follow certain rules, and that starts with the way you pass queries and arguments to the db_query function.

Arguments in a db_query call need to use sprintf-style placeholders, such as %d for integers and %s for strings. This allows Drupal to prevent SQL injection attacks and perform other security checks.

Let's review a simple example:

$sql = "SELECT v.vid FROM {vocabulary} v WHERE v.name = '%s'";
$vid = db_result(db_query($sql, $vocabulary_name));

This code will get a vocabulary's id by searching for its name, which is passed as a string. Notice the curly braces around the table's name: they need to be there if you want Drupal to provide table prefixing. This is important, so get used to always doing it this way.

Many Drupal beginners may opt for what they consider a simpler approach:

$sql = "SELECT v.vid FROM vocabulary v WHERE v.name = '" . $vocabulary_name . "'";
$vid = db_result(db_query($sql));

It may look simpler, but now you're bypassing Drupal's security checks, and your query breaks up into a series of concatenated strings, which is bad for code readability and gets even more confusing with more complex queries.

Passing arguments to db_query as an array

Arguments can be passed one by one or contained in an array. Let's slightly modify our example query:

$sql = "SELECT v.vid FROM {vocabulary} v WHERE v.name = '%s' AND v.vid = %d";

Now we're being more specific, looking up the vocabulary by both v.name and v.vid (pay attention to the placeholders). We could get our result by passing each argument to db_query like this:

$vid = db_result(db_query($sql, $vocabulary_name, $vid));

or we could build an array with both arguments like this:

$args = array($vocabulary_name, $vid);
$vid = db_result(db_query($sql, $args));

I prefer the array approach for queries where I have to pass more than a few arguments and that's often the case when we use the SQL IN operator.

The SQL IN operator and Drupal

Every good Drupal developer has to be highly skilled in writing SQL and that means there will be times when you need the SQL IN operator, which compares a field to a list of values. Let's say you want to get all nodes of types page and blog, you're looking for a query like this:

$sql = "SELECT n.nid, n.title FROM {node} n WHERE n.type IN ('page', 'blog') AND n.status = 1";

If you've tried this before you may have experienced escape-quote hell and opted for the non-Drupal way of concatenating arguments into the query. At least I did, until I read about db_placeholders (http://api.drupal.org/api/function/db_placeholders/6).

This is how I build my Drupal queries with the SQL IN operator now:

$types = array('blog', 'embedded_video', 'list', 'node_gallery_gallery');
$args = array();
$args[] = $tid;
$args = array_merge($args, $types);
$args[] = $status;
$args[] = $limit;
$sql = "SELECT n.nid, n.title, n.type, c.comment_count FROM {node} n INNER JOIN {term_node} tn
ON n.nid = tn.nid INNER JOIN {term_data} td ON tn.tid = td.tid LEFT JOIN
{node_comment_statistics} c ON n.nid = c.nid WHERE td.tid = %d AND
n.type IN (" . db_placeholders($types, 'varchar') . ")
AND n.status = %d ORDER BY n.created DESC LIMIT %d";
$result = db_query($sql, $args);

This is a bigger query with more arguments and I'm not only looking for nodes of certain types (listed in the $types array) but also a specific term in the taxonomy ($tid) and a published status ($status). I'm also adding a LIMIT clause at the end ($limit).

db_placeholders takes care of adding the correct number and type of placeholders based on the contents of the array $types. In this case it will add four %s because I passed varchar as the second argument and there are four elements in the array.
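Outside of Drupal, you can approximate what that expands to. This is only a rough, hand-rolled stand-in for illustration (the helper name is made up); the real db_placeholders handles other types with their own placeholder formats:

```php
<?php
// A rough stand-in for db_placeholders($types, 'varchar'):
// one quoted %s placeholder per array element, comma-separated.
function placeholder_sketch(array $values) {
  return implode(', ', array_fill(0, count($values), "'%s'"));
}

$types = array('blog', 'embedded_video', 'list', 'node_gallery_gallery');
// Prints: '%s', '%s', '%s', '%s'
print placeholder_sketch($types) . PHP_EOL;
```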

The $args array is built based on the order in which the arguments appear in the query, notice how I add $tid first and then use array_merge to add $types, then I add $status and $limit at the end.

So Drupalish, isn't it?

Apr 21 2010
Apr 21

Add another module has been recently updated with another new time-saving feature; you can now display “Add another” tabs on certain node types.  I’m also taking the time to think of what else can be done to make the usual node creation workflow easier for end users.

I’ll be looking for suggestions on how to improve the UI and workflow of node submissions and the Add another module as focus turns to version 2.x. 2.x will be released for Drupal 6.x and 7.x in the next month, and will move the workflow settings to the content type edit pages, rather than a separate settings page. The 7.x version will also include a “Save and Add another” button, somewhat similar to submit again, and include integration with node clone.



Nov 05 2009
Nov 05

This blog post is a by-product of my preparation work for an upcoming talk titled "Why you should be using a distributed version control system (DVCS) for your project" at SAPO Codebits in Lisbon (December 3-5, 2009). Publishing these thoughts prior to the conference serves two purposes: getting some peer review on my findings and acting as a teaser for the actual talk. So please let me know — did I cover the relevant aspects or did I miss anything? What's your take on DVCS vs. the centralized approach? Why do you prefer one over the other? I'm looking forward to your comments!

Even though there are several distributed alternatives available for some years now (with Bazaar, git and Mercurial being the most prominent representatives here), many large and popular Open Source projects still use centralized systems like Subversion or even CVS to maintain their source code. While Subversion has eased some of the pains of CVS (e.g. better remote access, renaming/moving of files and directories, easy branching), the centralized approach by itself poses some disadvantages compared to distributed systems. So what are these? Let me give you a few examples of the limitations that a centralized system like Subversion has and how these affect the possible workflows and development practices.

I highly recommend also reading Jon Arbash Meinel's Bazaar vs Subversion blog post for a more elaborate description of the limitations.

  • Most operations require interaction with the central repository, which usually is located on a remote server. Browsing the revision history of a file, creating a branch or a tag, comparing differences between two versions — all these activities involve communication via the network. Which means they are not available when you're offline and they could be slow, causing a slight disruption of your workflow. And if the central repository is down because of a network or hardware failure, every developer's work gets interrupted.
  • A developer can only checkpoint his work by committing his changes into the central repository, where it becomes immediately visible for everybody else working on that branch. It's not possible to keep track of your ongoing work by committing it locally first, in small steps, until the task is completed. This also means that any local work that is not supposed to be committed into the central repository can only be maintained as patches outside of version control, which makes it very cumbersome to maintain a larger number of modifications. This also affects external developers who want to join the project and work with the code. While they can easily obtain a checkout of the source tree, they are not able to put their own work into version control until they have been granted write access to the central repository. Until then, they have to maintain their work by submitting patches, which puts an additional burden on the project's maintainers, as they have to apply and merge these patches by hand.
  • Tags and branches of a project are created by copying entire directory structures around inside the repository. There are some recommendations and best practices on how to do that and how these directories should be arranged (e.g. by creating toplevel branches and tags directories), but there are several variants and it's not enforced by the system. This makes it difficult to work with projects that use a non-standard way for maintaining their branches and can be rather confusing (depending on the amount of branches and tags that exist).
  • While creating new branches is quick and atomic in Subversion, it's difficult to resolve conflicts when merging or reconciling changes from other branches. Recent versions of Subversion added support for keeping better track of merges, but this functionality is still not up to par with what the distributed tools provide. Merging between branches used to drop the revision history of the merged code, which made it difficult to keep track of the origins of individual changes. This often meant that developers avoided developing new functionality in separate branches and rather worked on the trunk instead. Working this way makes it much harder to keep the code in trunk in a stable state.

Having described some downsides of the centralized approach, I'd now like to mention some of the most notable aspects and highlight a few advantages of using a distributed version control system for maintaining an Open Source project. These are based on my own personal experiences from working with various distributed systems (I've used Bazaar, BitKeeper, Darcs, git, Mercurial and SVK) and from following many other OSS projects that either made the switch from centralized to distributed or have been using a distributed system from the very beginning. For example, MySQL was already using BitKeeper for almost 2 years when I joined the team in 2002. From there, we made the switch to Bazaar in 2008. mylvmbackup, my small MySQL backup project, is also maintained using Bazaar and hosted on Launchpad.

Let me begin with some simple and (by now) well-known technical aspects and benefits of distributed systems before I elaborate on what social and organizational consequences these have.

In contrast to having a central repository on a single server, each working copy of a distributed system is a full-blown backup of the repository it was cloned from, including the entire revision history. This provides additional security against data loss, and it's very easy to promote another repository to become the new master branch. Developers simply point their local repositories to this new location to pull and push all future changes from there, so this usually causes very little disruption.

Disconnected operations allow performing all tasks locally without having to connect to a remote server. Reviewing the history, looking at diffs between arbitrary revisions, applying tags, committing or reverting changes can all be done on the local repository. These operations take place on the same host and don't require establishing a network connection, which also means they are very fast. Changes can later be propagated using push or pull operations - these can be initiated from both sides at any given time. As Ian Clatworthy described it, a distributed VCS decouples the act of snapshotting from the act of publishing.

Because there is no need to configure or set up a dedicated server or separate repository with any of today's popular DVCSes, there is very little overhead and maintenance required to get started. There is no excuse for not putting your work into revision control, even if your project starts as a one-man show or you never intend to publish your code! Simply run "bzr|git|hg init" in an existing directory structure and you're ready to go!

As there is no technical reason to maintain a central repository, the definition of "the code trunk" changes from a technical requirement into a social convention. Most projects still maintain one repository that is considered to be the master source tree. However, forking the code and creating branches of a project change from being an exception into being the norm. The challenge for the project team is to remain the canonical/relevant central hub of the development activities. The ease of forking also makes it much simpler to take over an abandoned project, while preserving the original history. As an example, take a look at the zfs-fuse project, which got both a new project lead and moved from Mercurial to git without losing the revision history or requiring any involvement by the original project maintainer.

Both branching and merging are "cheap" and encouraged operations. The role of a project maintainer changes from being a pure developer and committer to becoming the "merge-master". Selecting and merging changes from external branches into the main line of development becomes an important task of the project leads. Good merge-tracking support is a prerequisite for a distributed system and makes this a painless job. Also, the burden of merging can be shared among the maintainers and contributors; it does not matter on which side of a repository relationship a merge is performed. Depending on how repositories are related and how changes are propagated between them, some DVCSes like Bazaar or git actually provide several merge algorithms to choose from.
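How cheap this is in practice is easy to see with git (a sketch; branch and file names are invented): a branch is just a new ref, and merge tracking finds the common ancestor automatically.

```shell
#!/bin/sh
# Cheap branching and merging: creating a branch copies no files, and the
# merge knows the common ancestor without any manual bookkeeping.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/dvcs-merge-demo && mkdir -p /tmp/dvcs-merge-demo && cd /tmp/dvcs-merge-demo

git init -q
echo "base" > file.txt
git add file.txt && git commit -qm "base"
trunk=$(git symbolic-ref --short HEAD)   # works whether HEAD is master or main

git checkout -q -b feature               # cheap: just a new ref
echo "feature work" >> file.txt
git commit -qam "feature work"

git checkout -q "$trunk"
git merge -q feature                     # merge tracking does the rest
git log --oneline
```

Here the merge happens to fast-forward; when both sides have diverged, the same command records a merge commit with both parents, and that ancestry is what keeps repeated merges painless.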

Having full commit rights to one's own branch empowers contributors. It encourages experimenting and lowers the barrier to participation. It also creates new ways of collaboration: small teams of developers can form ad-hoc workgroups and share their modifications by pushing/pulling from a shared private branch or amongst their personal branches. Pushing into the main development branch, however, still requires the appropriate privileges.
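Such an ad-hoc workgroup can be sketched with git (names and paths invented; a local bare repository stands in for the shared private branch location):

```shell
#!/bin/sh
# Two developers exchange work through a shared private branch, without
# needing any commit rights on the project's main repository.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/dvcs-team-demo && mkdir -p /tmp/dvcs-team-demo && cd /tmp/dvcs-team-demo

git init -q --bare shared.git            # the shared private location

git init -q alice && cd alice            # alice publishes her work...
echo "alice's change" > work.txt
git add work.txt && git commit -qm "alice: start feature"
git push -q ../shared.git HEAD:refs/heads/feature
cd ..

git clone -q -b feature shared.git bob   # ...bob pulls it, builds on it...
cd bob
echo "bob's change" >> work.txt
git commit -qam "bob: extend feature"
git push -q origin feature               # ...and publishes his additions
```

Both contributors keep full commit rights over the shared branch, while the project's main branch remains untouched until someone with the appropriate privileges merges the result.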

This also helps to improve the stability of the code base. Larger features or other intrusive changes can be developed in parallel to the mainline, kept separate but in sync with the trunk until they have evolved and stabilized sufficiently. With centralized systems, code has to be committed into the trunk before regression tests can be run. With DVCSes, merging of code can be done in stages, using a "gatekeeper" to review/test all incoming pushes in a staging area before merging them with the mainline code base. This gatekeeper could be a human or an automated build/test system that propagates code into the trunk based on certain criteria, e.g. "it still compiles", "all tests pass", "the new code adheres to the coding standards". While central systems only allow star schemas, a distributed system allows workflows in which modifications follow arbitrary directed graphs.
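A minimal automated gatekeeper can be sketched in a few lines of shell around git (all repository names are invented, and the "build still runs" check is a stand-in for whatever criterion a real project would use):

```shell
#!/bin/sh
# Gatekeeper sketch: contributions land in a staging repository and are
# promoted to the trunk only if they pass a criterion.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/dvcs-gate-demo && mkdir -p /tmp/dvcs-gate-demo && cd /tmp/dvcs-gate-demo

git init -q --bare trunk.git
git init -q --bare staging.git

git init -q work && cd work              # a contributor pushes into staging
echo 'echo "build ok"' > build.sh
git add build.sh && git commit -qm "add build script"
git push -q ../staging.git HEAD:refs/heads/master
cd ..

git clone -q -b master staging.git review && cd review
if sh build.sh > /dev/null; then         # the criterion: "it still builds"
    git push -q ../trunk.git master      # promote to the trunk
    echo "promoted to trunk"
else
    echo "rejected: build failed" >&2
fi
```

Run from cron or a post-receive hook, a script like this is the "automated build/test system" acting as gatekeeper: nothing reaches the trunk without passing the checks.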

Patches and contributions suffer less from bit rot. A static patch file posted to a mailing list or attached to a bug report may no longer apply cleanly by the time you look into it. The underlying code base has changed and evolved. Instead of posting a patch, a contributor using a DVCS simply provides a pointer to his public branch of the project, which he hopefully keeps in sync with the main line of development. From there, the contribution can be pulled and incorporated at any time. The history of every modification can be tracked in much more detail, as the author's name appears in the revision history (which is not necessarily the case when another developer applies a patch contributed by someone else).
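Pulling directly from a contributor's branch looks like this with git (a sketch; a local path stands in for the contributor's public URL, and all names are invented):

```shell
#!/bin/sh
# Instead of posting a static patch file, the contributor publishes a branch
# and the maintainer pulls from it whenever convenient.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/dvcs-pull-demo && mkdir -p /tmp/dvcs-pull-demo && cd /tmp/dvcs-pull-demo

git init -q mainline && cd mainline      # the project's main line
echo "core code" > core.txt
git add core.txt && git commit -qm "mainline: core"
cd ..

git clone -q mainline contributor && cd contributor
echo "bug fix" > fix.txt                 # authorship stays in the history,
git add fix.txt && git commit -qm "fix a bug"  # unlike an applied patch file
cd ..

cd mainline
git pull -q ../contributor HEAD          # incorporate the contribution
git log --oneline
```

The pulled commit arrives with its original author and timestamp intact, which is exactly the detailed attribution a mailed patch loses.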

A DVCS allows you to keep track of local changes in the same repository while still being able to merge bug/security fixes from upstream. Example: your web site might be based on the popular Drupal CMS. While the actual development of Drupal still takes place in (gasp) CVS, it is possible to follow the development using Bazaar. This allows you to stay in sync with the ongoing development (e.g. receiving and applying security fixes for an otherwise stable branch) while keeping your local modifications under version control as well.
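The same vendor-branch pattern can be sketched with git (here a local "upstream" repository stands in for the real project mirror, and all file names are invented):

```shell
#!/bin/sh
# Local site modifications live in the same repository as the upstream code;
# upstream fixes are merged in without losing either history.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
rm -rf /tmp/dvcs-vendor-demo && mkdir -p /tmp/dvcs-vendor-demo && cd /tmp/dvcs-vendor-demo

git init -q upstream && cd upstream      # the upstream project
echo "release 1.0" > app.txt
git add app.txt && git commit -qm "upstream: 1.0"
cd ..

git clone -q upstream mysite && cd mysite  # your site: upstream + local tweaks
echo "local theme tweak" > local.txt
git add local.txt && git commit -qm "site: local modification"

cd ../upstream                           # upstream ships a security fix
echo "security fix" > fix.txt
git add fix.txt && git commit -qm "upstream: security fix"

cd ../mysite
git pull -q --no-rebase --no-edit origin # merge the fix, keep local changes
ls
```

After the pull, the site repository contains both the upstream security fix and the local modifications, each with its own recorded history.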

This blog post has probably only scratched the surface of the benefits that distributed version control systems provide. Many of these aspects and their consequences are not yet fully analyzed and understood. In the meantime, more and more projects make the switch, gather experience and establish best practices. If you're still using a centralized system, I strongly encourage you to start exploring the possibilities of distributed version control. You don't actually have to "flip the switch" immediately: most of the existing systems happily interact with a central Subversion server as well, allowing you to benefit from some of the advantages without having to convert your entire infrastructure right away.

Here are some pointers for further reading on that particular subject:

Oct 28 2009

In part 1 I reported the results of a micro-benchmark designed to compare the performance of plain PHP includes with includes via streams from an SQLite database. In this post I extend the test cases with two more: one that performs the same work with no includes at all, to serve as a base case, and one that loads the includes from a MySQL database. The code is the same as in the SQLite case; only the connector differs. You can find the code attached at the end of the previous post.

Overall, bearing in mind random factors such as netbook load (CPU and I/O), the differences between the test cases are insignificant on the test machine, an Acer Aspire One netbook.

Please let me know if you decide to run these benchmarks. I will be especially interested to see what the differences are.

These results are encouraging enough to merit developing a more substantial test. I'm interested in benchmarking Drupal. What scenarios would you suggest?

The benchmark

The benchmarking code is this small Haskell script, which uses the criterion library to gather and process the statistics and to draw the plots below.

import Criterion.Main (defaultMain, bench, bgroup)
import System.Cmd (system)

main = defaultMain [
    bgroup "php includes" [
               bench "none" $ system "./clean1.php"
            ,  bench "standard/clean" $ system "./clean.php.txt"
            ,  bench "standard/mixed" $ system "./non-stream1.php"
            ,  bench "streams/sqlite" $ system "./stream1.php.txt"
            ,  bench "streams/mysql" $ system "./stream_mysql.php"
            ]
    ]

I ran it with 200 samples:

./bench -tpng:552x368 -kpng:552x368  -s 200

The benchmark results

Base case

The base case uses no include files, but simulates the work by looping 20 times through a print statement.

benchmarking php includes/none
collecting 200 samples, 2 iterations each, in estimated 21.72279 s
bootstrapping with 100000 resamples
mean: 70.03678 ms, lb 60.15143 ms, ub 94.78743 ms, ci 0.950
std dev: 113.6834 ms, lb 63.42229 ms, ub 189.0330 ms, ci 0.950
found 6 outliers among 200 samples (3.0%)
  4 (2.0%) high severe
variance introduced by outliers: 16.000%
variance is moderately inflated by outliers

The outliers are probably the result of the combination of an underpowered system and a number of other applications running at the time of the test.

Clean includes

This test case includes 20 php files from the current directory.

benchmarking php includes/standard/clean
collecting 200 samples, 2 iterations each, in estimated 21.10023 s
bootstrapping with 100000 resamples
mean: 60.50757 ms, lb 58.28822 ms, ub 65.57988 ms, ci 0.950
std dev: 22.93855 ms, lb 9.368661 ms, ub 40.26889 ms, ci 0.950
found 28 outliers among 200 samples (14.0%)
  13 (6.5%) high mild
  15 (7.5%) high severe
variance introduced by outliers: 1.000%
variance is unaffected by outliers

The difference with the base case should be insignificant, but is probably affected by random system load. It does hint that the differences displayed previously are not significant.

Mixed includes

This test case does the same work as the previous one, but includes the stub code for the database includes.

benchmarking php includes/standard/mixed
collecting 200 samples, 1 iterations each, in estimated 134.8954 s
bootstrapping with 100000 resamples
mean: 70.91094 ms, lb 61.86253 ms, ub 101.4884 ms, ci 0.950
std dev: 106.7305 ms, lb 31.32543 ms, ub 240.4033 ms, ci 0.950
found 20 outliers among 200 samples (10.0%)
  11 (5.5%) high mild
  9 (4.5%) high severe
variance introduced by outliers: 14.000%
variance is moderately inflated by outliers

Again, since the variance is moderately affected by the outliers we should take this result with a pinch of salt.

Sqlite streams

This test includes 20 scripts from an sqlite3 database.

benchmarking php includes/streams/sqlite
collecting 200 samples, 1 iterations each, in estimated 13.31139 s
bootstrapping with 100000 resamples
mean: 79.69618 ms, lb 73.51617 ms, ub 98.44982 ms, ci 0.950
std dev: 69.63256 ms, lb 16.12973 ms, ub 150.9165 ms, ci 0.950
found 28 outliers among 200 samples (14.0%)
  14 (7.0%) high mild
  14 (7.0%) high severe
variance introduced by outliers: 6.000%
variance is slightly inflated by outliers

MySQL streams

This test includes 20 scripts from a local MySQL database.

benchmarking php includes/streams/mysql
collecting 200 samples, 1 iterations each, in estimated 13.17959 s
bootstrapping with 100000 resamples
mean: 70.95483 ms, lb 69.74397 ms, ub 72.55464 ms, ci 0.950
std dev: 10.04377 ms, lb 8.195368 ms, ub 12.52959 ms, ci 0.950
found 19 outliers among 200 samples (9.5%)
  11 (5.5%) high mild
  8 (4.0%) high severe
variance introduced by outliers: 0.500%
variance is unaffected by outliers

The difference with sqlite3 for this benchmark is insignificant.


It becomes clearer that the performance difference is relatively insignificant, even in micro-benchmarks that are designed to highlight it by being unfair to the streams version. If you add a significant amount of code and actually do something with it, like a Drupal site would, the difference won't be noticeable. Still, this hypothesis needs testing.

