May 12 2014

Most high traffic or complex Drupal sites use Apache Solr as their search engine. It is much faster and more scalable than Drupal's core search module.

In this article, we describe one of many ways to get a working Apache Solr installation for use with Drupal 7.x on Ubuntu Server 12.04 LTS. The technique described should work with Ubuntu 14.04 LTS as well.

In a later article, now published, we describe how to install other versions of Solr, using the Ubuntu/Debian way.

Objectives

For this article, we focus on having an installation of Apache Solr with the following objectives:

  • Use the latest stable version of Apache Solr
  • Fewest software dependencies, i.e. no Tomcat server, no full JDK, and no separate Jetty
  • Least amount of necessary complexity
  • Least amount of software to install and maintain
  • A secure installation

This installation can be done on the same host that runs Drupal, if it has enough memory and CPU, or it can be on the database server. However, it is best if Solr is on a separate server dedicated for search, with enough memory and CPU.

Installing Java

We start by installing the Java Runtime Environment, and choose the headless server variant, i.e. without any GUI components.

sudo aptitude update
sudo aptitude install default-jre-headless
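
To confirm that the runtime is installed and on your path, you can check its version; the exact version string will vary with the Ubuntu release.

java -version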

Downloading Apache Solr

Second, we need to download the latest stable version of Apache Solr from a mirror near you. At the time of writing this article, it is 4.7.2. You can find the closest mirror to you at Apache's mirror list.

cd /tmp
wget http://apache.mirror.rafal.ca/lucene/solr/4.7.2/solr-4.7.2.tgz

Extracting Apache Solr

Next we extract the archive, while still in the /tmp directory.

tar -xzf solr-4.7.2.tgz

Moving to the installation directory

We choose to install Solr in /opt, the conventional location for software that is not installed from Ubuntu's repositories via the apt dependency management system, and that is not tracked for security updates by Ubuntu.

sudo mv /tmp/solr-4.7.2 /opt/solr

Creating a "core"

Apache Solr can serve multiple sites, each served by a "core". We will start with one core, simply called "drupal".

cd /opt/solr/example/solr
sudo mv collection1 drupal


Now edit the file ./drupal/core.properties and change the name= to drupal, like so:

name=drupal

Copying the Drupal schema and Solr configuration

We now have to copy the Drupal Solr configuration into Solr. Assuming your site is installed in /var/www, these commands achieve the task:

cd /opt/solr/example/solr/drupal/conf
sudo cp /var/www/sites/all/modules/contrib/apachesolr/solr-conf/solr-4.x/* .

Then edit the file: /opt/solr/example/solr/drupal/conf/solrconfig.xml, and comment out or delete the following section:

<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>32</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>

Setting Apache Solr Authentication, using Jetty

By default, a Solr installation listens on the public Ethernet interface of a server, and has no protection whatsoever. Attackers can access Solr, and change its settings remotely. To prevent this, we set password authentication using the embedded Jetty that comes with Solr. This syntax is for Apache Solr 4.x. Earlier versions use a different syntax.

The following settings work well for a single core install, i.e. search for a single Drupal installation. If you want multi-core Solr, i.e. for many sites, then you want to fine tune this to add different roles to different cores.

Then edit the file: /opt/solr/example/etc/jetty.xml, and add this section:

<!-- ======= Securing Solr ===== -->
<Call name="addBean">
  <Arg>
    <New class="org.eclipse.jetty.security.HashLoginService">
      <Set name="name">Solr</Set>
      <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
      <Set name="refreshInterval">0</Set>
    </New>
  </Arg>
</Call>

Then edit the file: /opt/solr/example/etc/webdefault.xml, and add this section:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>search-role</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>

Finally, create a new file named /opt/solr/example/etc/realm.properties, and add the following section to it:

user_name: password, search-role


Note that "search-role" must match what you put in webdefault.xml above.

Instead of "user_name", use the user name that will be used for logging in to Solr. Also, replace "password" with a real strong hard to guess password.

Finally, make sure that the file containing the password is not readable by anyone other than its owner and group.

chmod 640 /opt/solr/example/etc/realm.properties

Changing File Ownership

We then create a user for solr.

sudo useradd -d /opt/solr -M -s /dev/null -U solr

And finally change ownership of the directory to solr

sudo chown -R solr:solr /opt/solr

Automatically starting Solr

Now you need Solr to start automatically when the server is rebooted. To do this, download the attached file, and copy it to /etc/init.d

sudo cp solr-init.d.sh.txt /etc/init.d/solr
sudo chmod 755 /etc/init.d/solr

And now tell Linux to start it automatically.

sudo update-rc.d solr start 95 2 3 4 5 .

For now, start Solr manually.

sudo /etc/init.d/solr start

Now Solr is up and running.

Verify that it is running by accessing the following URL:

http://x.x.x.x:8983/solr/


Replace x.x.x.x by the IP address of the server that is running Solr.
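
Since we configured HTTP basic authentication above, it is also worth confirming from the command line that Solr rejects anonymous requests and accepts your credentials. The user name and password below are the placeholders from realm.properties, so substitute your own; the second URL is the standard Solr 4.x CoreAdmin status endpoint.

curl -I http://x.x.x.x:8983/solr/

curl -u user_name:password "http://x.x.x.x:8983/solr/admin/cores?action=STATUS"

The first command should come back with a 401 Unauthorized, and the second with a 200 OK and the status of the drupal core.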

You can also view the logs at:

tail -f /opt/solr/example/logs/solr.log

Configuring Drupal's Apache Solr module

After you have successfully installed, configured and started Solr, you should configure your Drupal site to interact with the Solr server. First, go to this URL: admin/config/search/apachesolr/settings/solr/edit, and enter the information for your Solr server. The URL should be of the following form:

http://user:password@example.com:8983/solr/drupal

Now you can proceed to reindex your site, by sending all the content to Solr.

Removing Solr

If you ever want to cleanly remove the Apache Solr instance that you installed using the above instructions, use the sequence of commands below:

sudo /etc/init.d/solr stop

sudo update-rc.d solr disable

sudo update-rc.d solr remove

sudo rm /etc/init.d/solr

sudo userdel solr

sudo rm -rf /opt/solr

sudo aptitude purge default-jre-headless

Additional Resources

Attachment Size solr-init.d.sh_.txt 1.23 KB
May 03 2014

On Friday May 2nd, 2014, Khalid of 2bits.com, Inc. presented on Drupal Performance.

The presentation covered important topics such as:

  • Drupal misconception: Drupal is slow/resource hog/bloated
  • Drupal misconception: Only Anonymous users benefit from caching in Drupal
  • Drupal misconception: Subsecond response time in Drupal is impossible for logged in users
  • Drupal misconception: Cloud hosting is more cost effective than dedicated servers

The presentation slides are attached for those who may be interested ...

Attachment Size drupalcamp-toronto-2014-drupal-performance-tips-and-tricks.pdf 371.99 KB
Mar 11 2014

We previously wrote in detail about how botnets hammering a web site can cause outages.

Here is another case that emerged in the past month or so.

Again, it is a distributed attempt from many IP addresses all over the world, most probably from PCs infected with malware.

Their main goal seems to be adding content to a Drupal web site and, when that attempt is denied because of site permissions, trying to register a new user.

The pattern is like the following excerpt from the web server's access log.

Note the POST, as well as the node/add in the referer. Also note the hard coded 80 port number:

173.0.59.46 - - [10/Mar/2014:00:00:04 -0400] "POST /user/register HTTP/1.1" 200 12759 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
173.0.59.46 - - [10/Mar/2014:00:00:06 -0400] "POST /user/register HTTP/1.1" 200 12776 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:10 -0400] "GET /user/register HTTP/1.1" 200 12628 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:16 -0400] "GET /user/register HTTP/1.1" 200 12642 "http://example.com/user/login?destination=node/add" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
202.75.16.18 - - [10/Mar/2014:00:00:17 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1"
5.255.90.89 - - [10/Mar/2014:00:00:18 -0400] "GET /user/register HTTP/1.1" 200 12627 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:24 -0400] "GET /user/register HTTP/1.1" 200 12644 "http://example.com/user/login?destination=node/add" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
...
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"

In the above case, the web site has a CAPTCHA on the user registration page, which causes a session to be created, and hence a full Drupal bootstrap (i.e. no page caching). When this is done by lots of bots simultaneously, it takes its toll on the server's resources.

Botnet Statistics

We gleaned these statistics from analyzing the access log for the web server for a week, prior to putting in the fix below.

Out of 2.3 million requests, 3.9% were to /user/register. 5.6% had http://example.com:80/ in the referer (with the real site instead of example). 2.4% had "destination=node/add" in the referer.

For the same period, but limiting the analysis to accesses to /user/register only, 54.6% have the "/user/login?destination=node/add" in the referer. Over 91% pose as coming from a computer running Mac OS/X Lion 10.7.5 (released October 2012). 45% claim they are on Firefox browser, 33% pretend they are on Chrome, and 19.7% pose as Safari.
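
For those who want to reproduce this kind of analysis on their own site, the percentages above can be approximated with simple one-liners against the web server's access log. The log path below is the Ubuntu default and is only an example; adjust it and the domain to match your own server.

# Total number of requests in the period covered by the log
wc -l < /var/log/apache2/access.log

# Requests to /user/register (GET or POST)
grep -cE "\"(GET|POST) /user/register" /var/log/apache2/access.log

# Requests carrying the telltale referers
grep -c "destination=node/add" /var/log/apache2/access.log
grep -c "http://example.com:80/" /var/log/apache2/access.log

Dividing each count by the total gives the percentages.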

Workaround

As usual with botnets, blocking individual IP addresses is futile, since there are so many of them. CloudFlare, which is front ending the site, did not detect nor block these attempts.

In order to solve this problem, we put a fix in settings.php that aborts the Drupal bootstrap when this bot is detected. Don't forget to replace example.com with the domain/subdomain you see in your own access log.

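// This is for the variant with the node/add destination in the referer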
if ($_SERVER['HTTP_REFERER'] == 'http://example.com/user/login?destination=node/add') {
  if ($_SERVER['REQUEST_URI'] == '/user/register') {
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}

// This is for the POST variant, with either port 80 in 
// the referer, or an empty referer
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
  if ($_SERVER['REQUEST_URI'] == '/user/register') {
    switch($_SERVER['HTTP_REFERER']) {
      case 'http://example.com:80/':
      case '':
        header("HTTP/1.0 418 I'm a teapot");
        exit();
    }
  }
}
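
To confirm that the workaround is active, you can replay the bot's request pattern from the command line and check that the site now answers with the 418 status instead of bootstrapping Drupal. As above, replace example.com with your own domain; curl's -e option sets the referer, and -w prints the HTTP status code.

curl -s -o /dev/null -w "%{http_code}\n" -e "http://example.com/user/login?destination=node/add" http://example.com/user/register

curl -s -o /dev/null -w "%{http_code}\n" -X POST -e "http://example.com:80/" http://example.com/user/register

Both commands should print 418, while a normal request to /user/register still returns 200.
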
Jan 08 2014

A client recently asked us for help with a very specific issue. The node edit page was hanging up, but only in Internet Explorer 10, and not in Firefox or Chrome. The client had WYSIWYG editor enabled.

This automatically pointed to a front end issue, not a server issue.

So, we investigated more, and found that the underlying issue is between Internet Explorer and jQuery with a large number of items to be parsed.

Internet Explorer was not able to parse the high number of token items listed (around 220). This causes the browser to hang when rendering the WYSIWYG page, with the following error message:

A script on this page is causing Internet Explorer to run slowly. If it continues to run, your computer might become unresponsive.

With the option to stop the script.

The real problem is the critical issue described in #1334456, which is not yet committed to the token module's repository.

Fortunately there is an easy workaround, the steps are:

  • Install the Token Tweaks module.
  • Go to /admin/config/system/tokens.
  • Change the Maximum Depth limit from the default of 4 to 1
  • Save the changes.

Now the edit form for the node should work normally, and the browser, whichever it is, will not hang anymore.

Note: Thanks to Dave Reid for this workaround.

Oct 07 2013

Using a Reverse Proxy and/or a Content Delivery Network (CDN) has become common practice for Drupal and other Content Management Systems.

One inconvenient aspect of this is that your web server no longer sees the correct client IP address, and neither does your application. The IP address they see is that of the machine the reverse proxy is running on.

In Drupal, there is code in core that tries to work around this by looking up the IP address in the X-Forwarded-For HTTP header (HTTP_X_FORWARDED_FOR in PHP), or in a custom header that you can configure.

For example, this would be in the settings.php of a server that runs Varnish on the same box.

$conf['reverse_proxy'] = TRUE;
$conf['reverse_proxy_addresses'] = array('127.0.0.1');

There is also this setting for Drupal 7.x in case your CDN puts the IP address in some other custom header:

// CloudFlare CDN
$conf['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

That only fixes the application, but what about the web server?

But, even if you solve this at the application level (e.g. Drupal, or WordPress), there is still the issue that your web server is not logging the correct IP address. For example, you can't analyze the logs to know which countries your users are coming from, or identify DDoS attacks.

Apache RPAF module

There is an easy solution to this though: the Reverse Proxy Add Forward (RPAF) Apache module.

What this module does is extract the correct IP address and use it in Apache's logs, as well as hand the correct client IP address to PHP in the variable $_SERVER['REMOTE_ADDR'].

To install RPAF on Ubuntu 12.04 or later, use the command:

aptitude install libapache2-mod-rpaf

If you run the reverse proxy (e.g. Varnish) on same server as your web server and application, and do not use a CDN, then there is no need to do anything more.

However, if you run the reverse proxy on another server, then you need to change the RPAFproxy_ips line to include the IP addresses of these servers. For example, these would be the addresses of your Varnish servers that front end Drupal, which are in turn front ended by the CDN.

You do this by editing the file /etc/apache2/mods-enabled/rpaf.conf.

For example:

RPAFproxy_ips 10.0.0.3 10.0.0.4 10.0.0.5

CDN Client IP Header

If you are using a CDN, then you need to find out what HTTP header the CDN uses to put the client IP address, and modify RPAF's configuration accordingly.

For example, for CloudFlare, the header is CF-Connecting-IP

So, you need to edit the above file, and add the following line:

RPAFheader CF-Connecting-IP

Drupal Reverse Proxy settings no longer needed

And finally, you don't need any of the above Reverse Proxy configuration in settings.php.

// $conf['reverse_proxy'] = TRUE;
// $conf['reverse_proxy_addresses'] = array('127.0.0.1');
// $conf['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

Now, you have correct client IP addresses in Apache's logs, and inside Drupal as well.
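
A quick way to verify that RPAF is doing its job is to send a request from the proxy's own address with a spoofed forwarded header, and then look at what lands in the access log. This sketch assumes the default Ubuntu log location, that 127.0.0.1 is listed in RPAFproxy_ips, and uses a documentation IP address as the fake client.

curl -s -o /dev/null -H "X-Forwarded-For: 203.0.113.7" http://localhost/

tail -n 1 /var/log/apache2/access.log

If RPAF is working, the last log line shows 203.0.113.7 rather than 127.0.0.1. If you configured RPAFheader CF-Connecting-IP, send that header instead.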

What If RPAF Does Not Work?

If you have RPAF front ended directly by a CDN, without Varnish, then RPAF may not work, for an as yet unknown reason.

To overcome this, you have several other options.

Apache mod_remoteip

There is a small Apache module called mod_remoteip. This basically does the same thing as RPAF, but with simpler configuration.

Use the download link, and save the file as apache-2.2-mod_remoteip.c.

apxs2 -i -a -c apache-2.2-mod_remoteip.c

This should create the module's .so file in Apache's modules directory. It should also add the LoadModule directive in mods-available/remoteip.load, which should look like so:

LoadModule remoteip_module modules/mod_remoteip.so

Now add the RemoteIPHeader directive in a new file called mods-available/remoteip.conf

RemoteIPHeader X-Forwarded-For

If you are using CloudFlare CDN then you use:

RemoteIPHeader CF-Connecting-IP

Now, enable the module:

a2enmod remoteip

Then restart Apache:

service apache2 restart

If this does not work, then you can still do it using the next set of tricks:

Apache Access Log and Drupal Reverse Proxy Settings

We can force Apache to log the correct client IP address to the access log by adding this to the virtual host entry for your site (e.g. /etc/apache2/sites-enabled/example.com):

LogFormat "%{CF-Connecting-IP}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" proxied

This takes the CF-Connecting-IP header from CloudFlare, and uses that instead of the IP address, which is of the proxy, not the originating client.

Then, under the "VirtualHost" stanza, you add this to use the custom proxied format you created above:

CustomLog ${APACHE_LOG_DIR}/access-example.com.log proxied

Then you need to enable the Drupal reverse proxy setting in settings.php:

$conf['reverse_proxy'] = TRUE;
$conf['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

You don't need to add the reverse_proxy_addresses variable, because for CloudFlare there are too many of them.

May 06 2013

We have mentioned before that both Pressflow 6.x and Drupal 7.x (but not core Drupal 6.x) disable page caching when a session is created for an anonymous user.

An extreme case of this happened recently, because of a perfect storm.

Symptoms

The client sends a newsletter to site users, whether they have accounts on the site or just entered their email address to get the newsletter.

After a recent code change, when a newsletter was sent, we suddenly saw a very high load average and very high CPU usage, and because we plot the number of logged in and anonymous users too, we found around 800 anonymous users on the site, in addition to the peak of 400 logged in users!

Since this is Pressflow, anonymous users are mostly served by Varnish, and they are not supposed to have sessions.

Investigation

So, we started to investigate those anonymous sessions in the database, in the sessions table.

Indeed, there are lots of anonymous sessions.

SELECT COUNT(*) 
FROM sessions 
WHERE uid = 0;
+----------+
| count(*) |
+----------+
|     6664 |
+----------+

And upon closer investigation, most of those sessions had a message in them saying "Comment field is required".

SELECT COUNT(*) 
FROM sessions 
WHERE uid = 0 
AND session LIKE '%Comment field is required%';
+----------+
| count(*) |
+----------+
|     5176 |
+----------+

And just to compare the day the newsletter was sent to other days, we confirmed that it indeed had many times more sessions than any other day.

In fact, more than 5X the highest day prior, and up to 55X more than typical days.

SELECT DATE(FROM_UNIXTIME(timestamp)) AS date, COUNT(*) 
FROM sessions 
WHERE uid = 0 
GROUP BY date;

+------------+----------+
| date       | count(*) |
+------------+----------+
| .......... |       .. |
| 2013-04-19 |       55 |
| 2013-04-20 |       81 |
| 2013-04-21 |       66 |
| 2013-04-22 |      115 |
| 2013-04-23 |       99 |
| 2013-04-24 |      848 |
| 2013-04-25 |       72 |
| 2013-04-26 |     4524 |
| .......... |       .. |
+------------+----------+

Graphs show the magnitude of the problem

Look at the graphs, to the very right of each one, after Friday noon.

You can see how the load shot up after the newsletter was sent:

The number of anonymous sessions shot up from only a handful to around 800!

The number of logged in users had a spike to 400, up from the 300 or above.

The number of SQL queries also shot up.

And so did the MySQL threads too.

And the CPU usage was very high, with the server trying to serve around 1200 users with no caching for them.

Root Cause Analysis

So, it turns out that the recent code change was done to encourage more people to sign up for an account on the site. The change alters the comment form and adds extra fields, including the email address, to prod the visitor to register for an account. Another form above the node also captures the email address.

If people clicked on the button to add their email, Pressflow complained about the missing comment field. And since any message, be it for a logged in user or an anonymous one, is stored in a session, all users who tried to register for an account were treated like logged in users, in that they bypassed Pressflow's page cache. This effectively tripled the number of users executing PHP and MySQL queries rather than being served from Varnish (from 400 to 1,200).

Hence the high load and high CPU usage.

Solution

The fix was to revoke the post comment permission for anonymous users, and therefore, remove the comment form from the bottom of every node.

After that, the newsletter was sent without increasing the load on the server at all.

Although this problem was on Pressflow 6.x, it applies to Drupal 7.x as well, since it also bypasses page caching for anonymous users who have a session.

Apr 22 2013

One of the suboptimal techniques that developers often use, is a query that retrieves the entire content of a table, without any conditions or filters.

For example:

SELECT * FROM table_name ORDER BY column_name;

This is acceptable if there are not too many rows in the table, and there is only one call per page view to that function.

However, things start to get out of control when developers do not take into account the frequency of these calls.

Here is an example to illustrate the problem:

A client had high load average (around 5 or 6) on their server which had around 400 logged in users at peak hours. The server was somewhat fragile with any little thing, such as a traffic influx, or a misbehaved crawler, causing the load to go over 12.

This was due to using an older version of the Keyword Links module.

This old version had the following code, which causes certain keywords to be replaced when a node is being displayed:

function keyword_link_nodeapi(&$node, $op, $teaser, $page) {
  if ($op == 'view' && ...) {
    $node->content['body']['#value'] = keyword_link_replace(...);
  }
}

And this caused keyword replacement for each comment as well.

function keyword_link_comment(&$a1, $op) {
  if ($op == 'view') {
    $a1->comment = keyword_link_replace($a1->comment);
  }
}

The function that replaced the content with keywords was as follows:

 
function keyword_link_replace($content) {
  $result = db_query("SELECT * FROM {keyword_link} ORDER BY wid ASC");
  while ($keyword = db_fetch_object($result)) {
    ...
    $content = preg_replace($regex, $url, $content, $limit);
  }
  return $content;
}

Which executes the query every time, and iterates through the result set, replacing words.

Now, let us see how many rows are there in the table.

mysql> SELECT COUNT(*) FROM keyword_link;
+----------+
| count(*) |
+----------+
|     2897 |
+----------+
1 row in set (0.00 sec)

Wow! That is a relatively large number.

And Eureka! That is it! The query was re-executed every time the replace function was called!
This means that for a listing of 50 nodes, there would be 50 queries!

And even worse, for a node with tens or hundreds of comments, there would be tens or hundreds of queries as well!

Solution

The solution here was to upgrade to the latest release of the module, which has eliminated the replacement of keywords for comments.

But a better solution, preserving the functionality for comments, is a two fold combined solution:

Use memcache as the cache layer

By using memcache, we avoid going to the database for any caching. It is always a good idea in general to have that, except for simple or low traffic sites.

However, on its own, this is not enough.

Static caching for cache_get() result

By statically caching the results of the query, or the cache_get(), those operations are executed once per page view, and not 51 times for a node displaying comments. This is feasible if the size of the dataset is not too huge. For example, for this site, the size was around 1.3 MB for the three fields that are used from that table, and fits in memory without issues for each PHP process.

This is the outline for the code:

function keyword_link_replace($content) {
  static $static_data;

  if (!isset($static_data)) {
    if (($cache = cache_get('keyword_link_data')) &&
      !empty($cache->data)) {
      $static_data = $cache->data;

      foreach($cache->data as $keyword) {
        replace_keyword($keyword, $content);
      }
    }
    else {
      $result = db_query("SELECT * FROM {keyword_link} ORDER BY wid ASC");

      $data = array();

      while ($keyword = db_fetch_object($result)) {
        $data[] = $keyword;

        replace_keyword($keyword, $content);
      }

      $static_data = $data;

      cache_set('keyword_link_data', $data, 'cache');
    }
  }
  else {
    foreach($static_data as $keyword) {
      replace_keyword($keyword, $content);
    }
  }

  return $content;
}

You can download the full Drupal 6.x version with the fixes from here.

What a difference a query makes

The change was done at 22:00 in the daily graphs, and 6 days before the monthly graph was taken. You can see the difference in that the load average is lower, ranging between 1.8 and 2.4 for most of the day, with occasional spikes above 3. This is far better than the load of 5 or 6 before the fix. Also, the amount of data retrieved from MySQL is halved.

As you will notice, no change was seen in the number of SQL queries. This is probably because of MySQL's query cache. Since all the queries were identical for the same page, MySQL served the result from the query cache, and did not have to re-execute the query tens or hundreds of times per page. But even though the query cache saved us from re-executing the query, there is still overhead in getting that data from MySQL's cache to the application, and that consumed CPU cycles.
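
If you want to confirm how much work the query cache is absorbing, MySQL exposes counters for it. A quick look from the shell, assuming a default socket login:

mysql -e "SHOW GLOBAL STATUS LIKE 'Qcache%'"
mysql -e "SHOW VARIABLES LIKE 'query_cache%'"

A high and growing Qcache_hits value suggests the repeated identical queries are indeed being served from the query cache.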

Faster Node displays

And because we are processing less data, and doing less regular expression replacement, display of nodes that have lots of comments has improved. For a node that had hundreds of comments, with 50 comments shown per page, the total page load time was 8,068 milliseconds.

The breakdown was as follows:

keyword_link_replace() 51 calls totalling 2,429 ms
preg_replace() 147,992 calls totalling 1,087 ms
mysql_fetch_object() 150,455 calls totalling 537 ms
db_fetch_object() 150,455 calls totalling 415 ms
mysql_query() 1,479 calls totalling 393 ms
unserialize() 149,656 calls totalling 339 ms

A total of 5,254 milliseconds processing keywords in comments only.

After eliminating the calls to hook_comment() in the module, the total load time for the node was 3,122 milliseconds.

Conclusion

So, always look at the size of your dataset, as well as the frequency of resource intensive or slow operations. They could be bottlenecks for your application.

Apr 10 2013

The bulk of Drupal hosting for clients that we deal with is on virtual servers, whether they are marketed as "cloud" or not. Many eventually have to move to dedicated servers because of increased traffic, or because of continually added features that increase complexity and bloat.

But, there are often common issues that we see repeatedly that have solutions which can prolong the life of your current site's infrastructure.

We assume that your staff, or your hosting provider, have full access to the virtual servers, as well as the physical servers they run on.

Disks cannot be virtualized

Even for dedicated servers, the server's disk(s) are often the bottleneck for the overall system. They are the slowest part. This is definitely true for mechanical hard disks with rotating platters, and even Solid State Disks (SSDs) are often slower than the CPU or memory.

For the above reasons, disks cannot be fully virtualized. Yes, you do get a storage allocation that is yours to use and no one else can use. But you cannot guarantee a portion of the I/O throughput, which is always a precious resource on servers.

So, other virtual servers that are on the same physical server as you will contend for disk I/O if your site (or theirs) is a busy one or not optimally configured.

In a virtual server environment, you cannot tell how many virtual servers are on the same physical server, nor if they are busy or not. You only deal with the effects (see below).

For a Drupal site, the following are some of the most common causes for high disk I/O activity:

  • MySQL, with either a considerable amount of slow queries that do file sorts and temporary tables; or lots of INSERT/UPDATE/DELETE
  • Lots of logging activity, such as a warning or a notice in a module that keeps reporting exceptions many times per request, each incurring a disk write
  • Boost cache expiry, e.g. when a comment is posted

Xen based virtualization vs. Virtuozzo or OpenVZ

The market uses virtualization technologies much like airlines overbook flights: based on the assumption that some passengers will not show up.

Similarly, not all virtual hosting customers will use all the resources allocated to them, so there is often plenty of unused capacity.

However, not all virtualization technologies are equal when it comes to resource allocation.

Virtuozzo and its free variant, OpenVZ, use the term "burst memory" to allocate unused memory from other instances, or even swap space when applications demand it on one instance. However, this can bring a server to its knees if swap usage causes thrashing.

Moreover, some Virtuozzo/OpenVZ hosts use vzfs, a virtualized file system, which is slow for Drupal when used for certain things, such as having all of web root on it, logs, and database files.

Xen does not suffer from any of the above. It guarantees that memory and CPU allocated to one virtual instance stays dedicated for that instance.

However, since physical disk I/O cannot be virtualized, it remains the only bottleneck with Xen.

Underpowered Instances

One issue that Amazon AWS EC2 users face is that the reasonably priced instances are often underpowered for most Drupal sites. These are the Small and Medium instances.

These instances can be adequate for sites with a low number of nodes/comments per day, and with most traffic being anonymous. Such sites lend themselves to working well with proper Varnish caching enabled, set to long hours before expiring.

Other sites that rely on a large number of simultaneous logged in users, with lots of enabled modules, and with short cache expiry times do not work well with these underpowered instances. Such sites require the Extra Large instances, and often the High CPU ones too.

Of course, this all adds to the total costs of hosting.

Expensive As You Grow

Needless to say, if your site keeps growing then there will be added hosting costs to cope with this growth.

With the cloud providers, these costs often grow faster than with dedicated servers, as you add more instances, and so on.

Misconfigured Self-Virtualization

Some companies choose to self manage physical servers colocated at a datacenter, and virtualize them themselves.

This is often a good option, but it can also be a pitfall: sometimes the servers are badly misconfigured. We saw one case where a physical server was segmented into 12 VMWare virtual servers for no good reason. Moreover, all of them were accessing a single RAID array. On top of that, Boost was used on a busy, popular forum. When a comment was posted, Boost was expiring pages, and that tied up the RAID array, keeping it from doing anything useful for the site's other visitors.

Variability in Performance

With cloud and virtual servers, you often don't notice issues, but then suddenly variability will creep in.

An analogy ...

This happens because you have bad housemates who flush the toilet when you are in the shower. Except that you do not know who those housemates are, and can't ask them directly. The only symptom is this sudden cold water over your body. Your only recourse is to ask the landlord if someone flushed the toilet!

Here is a case in point: a Drupal site at a VPS with a popular cloud provider. It worked fine for several years. Then the host upgraded to another, newer version, and asked all customers to move their sites.

It was fine most of the time, but then extremely slow at other times. No pattern could be predicted.

For example while getting a page from the cache for anonymous visitors usually takes a few tens of milliseconds at most, on some occasions it takes much more than that, in one case, 13,879 milliseconds, with the total page load time 17,423 milliseconds.

Here is a sample of devel's output:

Executed 55 queries in 12.51 milliseconds. Page execution time was 118.61 ms.

Executed 55 queries in 7.56 milliseconds. Page execution time was 93.48 ms.

Most of the time is spent retrieving cached items.

ms where query
0.61 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'menu:1:en'
0.42 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_87_[redacted]'
0.36 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_54_[redacted]'
0.19 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'filter:3:0b81537031336685af6f2b0e3a0624b0'
0.18 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_88_[redacted]'
0.18 block_list SELECT * FROM blocks WHERE theme = '[redacted]' AND status = 1 ORDER BY region, weight, module

Then suddenly, same site, same server, and you get:

Executed 55 queries in 2237.67 milliseconds. Page execution time was 2323.59 ms.

This was a Virtuozzo host, and it was a sign of disk contention. Since this is a virtual server, we could not tell if this was something inside the virtual host, or some other tenant on the same physical server flushing the toilet ...

The solution is in the following point.

Move your VPS to another physical server

When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server.

Doing so most likely will solve the issue, since you effectively have a different set of housemates.


Apr 01 2013

Ubuntu Server 12.04 LTS finally provides a stable long term support server distro that has a recent version of Varnish in its repositories.

Trouble is, the repository-provided package of Varnish has some issues. Specifically, the command line tools, such as varnishhist, varnishstat, etc., do not report anything. Therefore one cannot know the hit/miss rates, hits per second, or other useful information. Moreover, monitoring Varnish using Munin for such statistics does not work either.

There are two ways you can overcome this, both are described below.

Use the Varnish provided packages from their repository

The Varnish project provides an Ubuntu repository that contains Ubuntu packages. This is the recommended way, because you get updates for security if and when they come out. As well, you will get startup scripts installed and configured automatically, instead of having to create them from scratch, as in the second option.

First, install curl, if it is not already installed on your server:

# aptitude install curl 

Then, add the repository key to your server:

# curl http://repo.varnish-cache.org/debian/GPG-key.txt |
    sudo apt-key add -

And then add the repository itself.

# echo "deb http://repo.varnish-cache.org/ubuntu/ lucid varnish-3.0" | 
    tee /etc/apt/sources.list.d/varnish.list

Now, update the list of packages from the new repository:

# aptitude update

And then install Varnish

# aptitude install varnish 

Configuring Varnish listening port and memory

Then, you need to configure Varnish to front end your web server, so that it runs on port 80 (regular HTTP traffic) and 443 (SSL secure traffic, if your site uses it).

Also, increase or decrease the amount of memory allocated to Varnish, depending on how much your server has to spare.

We also change the name of the .vcl file, so that when upgrading, Ubuntu will ask about only one changed file (/etc/default/varnish), rather than two.

So, go ahead and edit the /etc/default/varnish file, and add this to the end, if you have a small server (e.g. 1GB RAM).

DAEMON_OPTS="-a :80,:443 \
             -T localhost:6082 \
             -f /etc/varnish/main.vcl \
             -S /etc/varnish/secret \
             -s malloc,128m"

If you have a larger server, e.g. 8GB, you would allocate 2G of RAM for varnish.

DAEMON_OPTS="-a :80,:443 \
             -T localhost:6082 \
             -f /etc/varnish/main.vcl \
             -S /etc/varnish/secret \
             -s malloc,2048m"

Then restart Varnish:

service varnish restart
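
At this point, you can quickly verify that Varnish is the one answering on port 80, and watch its cache activity. The exact headers depend on your VCL, but a Via header mentioning varnish, and an Age header that grows on repeated requests, are good signs.

curl -I http://localhost/

varnishstat

This is also a good test that the command line tools work, which was the whole point of avoiding the stock Ubuntu package.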

In an upcoming article, we will discuss the details of configuring Varnish's VCL.

Compiling from source

The second option is to download the source code, and compile it yourself.

This means that you are responsible for upgrading if there is a security fix.

You first need to install the C compiler, and some other libraries like curses development, PCRE, as well as the make utility.

You do this via the command:

# aptitude install libncurses5-dev libpcre3-dev pkg-config make 

Which will pull the C compiler if it is not already installed.

Then, download the source:

# wget http://repo.varnish-cache.org/source/varnish-3.0.3.tar.gz

Extract the source from the archive

# tar xzf varnish-3.0.3.tar.gz

Change the directory to what you just extracted

# cd varnish-3.0.3

Run the configure tool. Make sure there are no errors.

# ./configure

Build the binaries

# make

And install what you just built.

# make install

Varnish will be installed in /usr/local.

You will need to create start scripts for Varnish.
You should use a different name from what would have been installed by the repository packages, so that it does not clash with those file names, e.g. /etc/default/varnish-local. This file would hold the DAEMON_OPTS mentioned above. You also need to create an /etc/init.d/varnish-local script for startup. Then use the following command to make Varnish run at run level 2.

update-rc.d varnish-local start 80 2 .

Monitoring Varnish with Munin

We assume that you have Munin installed and configured, and that it is already monitoring other things on your server.

We need to install a Perl library that is able to pull the statistics info from Varnish:

aptitude install libnet-telnet-perl

Then, we need to get the monitoring scripts:

cd /usr/share/munin/plugins/

git clone git://github.com/basiszwo/munin-varnish.git

chmod a+x ./munin-varnish/varnish_*

Then create symbolic links for the monitoring scripts.

cd /etc/munin/plugins

ln -s /usr/share/munin/plugins/munin-varnish/varnish_allocated
ln -s /usr/share/munin/plugins/munin-varnish/varnish_cachehitratio
ln -s /usr/share/munin/plugins/munin-varnish/varnish_hitrate
ln -s /usr/share/munin/plugins/munin-varnish/varnish_total_objects

Then edit the file /etc/munin/plugin-conf.d/munin-node, and add the following to the end.

[varnish*]
user root

And restart Munin for the changes to take effect.

service munin-node restart
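
Before waiting for the graphs to fill in, you can check that the new plugins actually return values, using munin-node's munin-run utility:

munin-run varnish_hitrate
munin-run varnish_cachehitratio

If these print name/value pairs without errors, Munin will start graphing them on its next polling run.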

Configuring Varnish for Drupal

As mentioned above, in an upcoming article, we will discuss the details of configuring varnish for Drupal.

Mar 27 2013

Today, Khalid gave a presentation on Drupal Performance and Scalability for members of the London (Ontario) Drupal Users Group.

The slides from the presentation are attached below.

Attachment Size 498.3 KB
Mar 19 2013

For sites that have lots of slow queries, disk access is often the bottleneck. For these slow queries, MySQL writes temporary tables to disk, populates them with intermediate results, then queries them again for the final result.

We all know that the disk is the slowest part in a computer, because it is limited by being mechanical, rather than electronic. One way of mitigating this is to tell MySQL to use memory rather than disk for temporary tables.

This is often done by creating either a RAM Disk, or the easier to use tmpfs. Both are a portion of the server's RAM memory made to emulate a disk with slightly different details: RAM disk has a file system on it that can be ext3 or anything, while tmpfs is its own file system type.

Since memory access is much faster than a disk, this improves performance, and decreases load on the server by not causing pile up bottlenecks on disks.

We describe here methods to achieve this goal.

Method 1: Using an existing tmpfs directory

Rather than creating a new ram disk or tmpfs mount, we first search for one that is already on your server.

# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
tmpfs           1.6G  260K  1.6G   1% /run
...

This tells us that the /run filesystem is of type tmpfs (a temporary file system), and has 1.6 GB allocated for it.

# mount
...
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
...

On Ubuntu 12.04 LTS, the directory /run/mysqld already exists and is allocated to a tmpfs with sufficient space for temporary files.

Save yourself some grief and do not try to create your custom directory under /run (e.g. mysqlslow), because it will not survive reboots, and MySQL will not start after a reboot.

So, all we need to do is tell MySQL to use this directory.

To do this, create a file called /etc/mysql/conf.d/local.cnf. By using this file, and not editing /etc/mysql/my.cnf, we avoid having Ubuntu updates overwrite our changes.

Add this to the file:

[mysqld]
tmpdir = /run/mysqld

Then restart MySQL

service mysql restart

Then make sure that the new value is now in effect:

# mysql
mysql> SHOW VARIABLES LIKE 'tmpdir';
+---------------+-------------+
| Variable_name | Value       |
+---------------+-------------+
| tmpdir        | /run/mysqld |
+---------------+-------------+

Method 2: Creating a new tmpfs directory

If you are not running Ubuntu 12.04 LTS, then you may not have a ready made RAM disk that you can use, and you have to create one.

Here are the steps to create the tmpfs directory:

Create the tmp directory

# mkdir -p /var/mysqltmp

Set permissions

# chown mysql:mysql /var/mysqltmp

Determine mysql user id

# id mysql

Edit /etc/fstab

And add the following line, replacing the 105 and 114 below with your specific mysql group id and user id, respectively:

tmpfs /var/mysqltmp tmpfs rw,gid=105,uid=114,size=256M,nr_inodes=10k,mode=0700 0 0

Mount the new tmpfs partition

# mount -a

Change your MySQL configuration

# vi /etc/mysql/conf.d/local.cnf 

Change, or add the following line:

tmpdir = /var/mysqltmp

Restart MySQL

/etc/init.d/mysql restart

Or, for Ubuntu 12.04:

service mysql restart

How much of a difference does it make?

How much of a difference can you expect from moving MySQL's temporary files from disk to RAM? Significant, if your server has lots of slow queries.

Here are the graphs from a site that was suffering considerably because of a large number of logged in users (averaging 250 at peak hours, and exceeding 400 at times), and some other factors.

Using a RAM disk made a noticeable difference.

CPU usage. Note how much iowait (magenta) before and after the change:

And how many slow queries per second before and after the change:

The server's load is less too:

The input/output operations per second on sda (where /tmp is, which was the destination for slow queries' temporary tables before the change):

Mar 12 2013

Over the past few years, we were called in to assist clients with poor performance of their site. Many of these were using Pressflow, because it is "faster" and "more scalable" than Drupal 6.x.

However, some of these clients hurt their site's performance by using Pressflow, rather than plain Drupal, often because they misconfigured or misused it in some way or another.

Setting cache to "external" without having a caching reverse proxy

We saw a couple of cases where clients would set the cache to "External" in admin/settings/performance, but they are not running a reverse proxy cache tier, such as Varnish or Squid.

What happens here is that Pressflow will not cache pages for anonymous users, and just issue the appropriate cache HTTP headers, assuming that a caching reverse proxy, e.g. Varnish, will cache them.

Performance of the site will suffer, since every anonymous request, including hits from search engine crawlers, will then do a full Drupal bootstrap with no page caching at all.

The solution is simple: either configure a reverse proxy, or set caching to "normal".

Setting Page Cache Maximum Age too low

In admin/settings/performance, there is a configuration parameter called "Page cache maximum age" (in Pressflow, which is called "Expiration of cached pages" in Drupal 7.x). This value should not be left set to "none", because that means items will not be left in the cache for sufficient time for them to be served for subsequent users. Setting it too low (e.g. 1 minute) has the same effect too.

Do set this parameter to the highest time possible if you have an external cache like Varnish or Squid.
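
You can check what the site is actually sending, and whether the reverse proxy is serving from its cache, by looking at the response headers of an anonymous request. Replace example.com with your own site; the exact headers present will depend on your Varnish or Squid setup.

curl -I http://example.com/ | grep -iE "cache-control|expires|age|x-varnish"

The max-age value in Cache-Control reflects the page cache maximum age setting, and a non-zero Age header on a repeated request indicates the page came from the external cache.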

Enabling modules that create anonymous sessions

Both Pressflow 6.x and Drupal 7.x disable page caching for anonymous users if a session is present.

This means that if you have a module that writes to the session (which also sets a session cookie), page caching will be disabled for that user.

This means that code like this will disable page caching for anonymous users:

  $_SESSION['foo'] = 'bar';

The Pressflow Wiki started an effort to list such modules here: Modules that break Pressflow 6.x caching and how to fix them and here: Code that sets cookie or session, but with so many modules being written, it is virtually impossible to have a complete list.

Also, novice Drupal developers will not know this, and write modules that use cookies, and therefore prevent page caching for anonymous users.

We have seen such cases from such developers, where a site that was previously working perfectly is rendered fragile and unstable by one line of code!

Note that this fault applies to Pressflow 6.x, and to Drupal 7.x as well.

If you are using the former, then you can solve the problem temporarily by switching to Drupal core 6.x instead of Pressflow 6.x. Drupal core 6.x does not mind cookies for anonymous users.

Using Varnish with hook_boot() or hook_exit() modules

When using an external cache, like Varnish, all anonymous requests do not hit Drupal at all. They are served from Varnish.

So if you have modules that implement hook_boot() or hook_exit(), the code in them will not be triggered for those cached pages. If you rely on it for some functionality, it will only run the first time a page is accessed (i.e. on a cache miss).

For example, the core statistics module increments the view count for a node in hook_exit(). If you enable this module for that functionality, these figures will be far lower than the real numbers, and you are better off disabling the module rather than having inaccurate numbers.

Mar 06 2013

The Boost module is often a great help in speeding up web sites of small to medium size and/or hosted on shared hosts.

It works by writing the entire cached page to a disk file, and serving it entirely from the web server, bypassing PHP and MySQL entirely.

This works well in most cases, but we have observed a few cases where boost itself becomes a bottleneck.

One example was when 2bits.com were called to investigate and solve a problem for a Fortune 500 company's Drupal web site.

The site was configured to run on 12 web servers, each being a virtual instance on VMWare, but all of them sharing a single RAID-5 pool for disk storage.

The main problem was when someone posts a comment: the site took up to 20 seconds to respond, and all the web instances were effectively hung.

We investigated and found that what happens is that Boost's expiry logic kicks in, and tries to intelligently delete the cached HTML for the node, the front page, etc. All this while the site is busy serving pages from Boost's cache on the same disk, as well as other static files.

This disk contention from deleting files caused the bottleneck observed.
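
When you suspect this kind of contention, it helps to watch disk utilization live while a comment is posted. On Debian/Ubuntu the iostat tool comes in the sysstat package; the two-second interval below is just an example.

aptitude install sysstat

iostat -xd 2

A device sitting near 100% utilization while Boost is expiring its cache files is a strong sign that the disk, not PHP or MySQL, is the bottleneck.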

By disabling boost, and using memcache instead, we were able to bring down the time from 20 seconds to just 8 seconds.

Further improvement could be achieved by using Varnish as the front tier for caching, reducing contention.

Feb 25 2013

In the Drupal community, we always recommend using the Drupal API, and best practices for development, management and deployment. This is for many reasons, including modularity, security and maintainability.

But it is also for performance that you need to stick to these guidelines, refined for many years by so many in the community.

By serving many clients over many years and specifically doing Drupal Performance Assessments, we have seen many cases where these guidelines are not followed, causing site slowdowns and outages.

Here are some examples of how not to do things.

Logic in the theme layer

We often find developers who are proficient in PHP, but new to Drupal misuse its API in many ways.

In extreme cases, they don't know they should write modules to house the application logic and doing data access, and leave only presentation to be done in the theme layer.

We saw a large site where all the application logic was in the theme layer, often in .tpl.php files. The logic even ended with an exit() statement!

This caused Drupal's page caching mechanism to be bypassed, making all page accesses from crawlers and anonymous users very heavy on the servers, and complicated the infrastructure, which was over-engineered to compensate for such a development mistake.

Using PHP in content (nodes, blocks and views)

Another common approach that most developers start using as soon as they discover it, is placing PHP code inside nodes, blocks or views.

Although this is a quick and dirty approach, the initial time savings cause lots of grief down the road through the life cycle of the site. We wrote an article specifically about that, which you will find a link to below.

Heavy queries in the theme layer, when rendering views

In some cases, the logic for rendering individual nodes within a view is complex, and involves code in the view*.tpl.php file that has SQL queries, or calls to heavy functions, such as node_load() and user_load().

We wrote an article on this which you can find the link to below.

Conclusion

Following Drupal's best practices and community guidelines is always beneficial. Performance is just one of the benefits that you gain by following them.

Further reading

Feb 19 2013

When doing performance assessments for large and complex sites to find out why they are not fast or scalable, we often run into cases where modules intentionally disable the Drupal page cache.

Depending on how often it happens and for which pages, disabling the page cache can negatively impact the site's performance, be that in scalability, or speed of serving pages.

How to inspect code for page cache disabling

If you want to inspect a module to see if it disables the page cache, search its code for something like the following:

// Recommended way of disabling the cache in Drupal 7.x
drupal_page_is_cacheable(FALSE);

Or:

$GLOBALS['conf']['cache'] = 0;

Or:

$GLOBALS['conf']['cache'] = CACHE_DISABLED;

Or:

$conf['cache'] = FALSE;
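
Rather than reading each module by hand, you can search your whole modules directory for these patterns at once. A rough sketch, assuming your contributed modules live under sites/all/modules:

cd sites/all/modules
grep -rnF "drupal_page_is_cacheable(FALSE)" .
grep -rnF "['cache']" .

The second pattern is noisier, but it catches both the $conf and the $GLOBALS['conf'] variants shown above.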

Modules that disable the page cache

We have found the following modules that disable the page cache in some cases:

Bibliography Module

In biblio_init(), the module disables the page cache if someone is visiting a certain URL, such as "biblio/*" or "publications/*", depending on how the module is configured.

if ($user->uid === 0) { 
  // Prevent caching of biblio pages for anonymous users
  // so session variables work and thus filering works
  $base = variable_get('biblio_base', 'biblio');
  if (drupal_match_path($_GET['q'], "$base\n$base/*"))
    $conf['cache'] = FALSE;
}

Flag Module

This code in flag/includes/flag_handler_relationships.inc

if (array_search(DRUPAL_ANONYMOUS_RID, $flag->roles['flag']) !== FALSE) {
  // Disable page caching for anonymous users.
  drupal_page_is_cacheable(FALSE);

Or in Drupal 6.x:

if (array_search(DRUPAL_ANONYMOUS_RID, $flag->roles['flag']) !== FALSE) {
  // Disable page caching for anonymous users.
  $GLOBALS['conf']['cache'] = 0;

Invite Module

case 'user_register':
  // In order to prevent caching of the preset 
  // e-mail address, we have to disable caching 
  // for user/register.
  $GLOBALS['conf']['cache'] = CACHE_DISABLED;

CAPTCHA Module

The CAPTCHA module disables the cache wherever a CAPTCHA form is displayed, be that in a comment or on the login form.

This is done via the hook_element_info() which sets a callback in the function captcha_element_process().

If you find other modules that are commonly used, please post a comment below about it.

Feb 13 2013

While doing a Drupal Performance Assessment for a client's Drupal 6.x site recently, we found an odd problem.

Among other things, page generation times were high, and this was because the function skinr.module::skinr_preprocess() was being called 190 times for each page load.

Each of these calls was calling theme_get_registry(), which is supposed to statically cache its results. However, these were still consuming too much time, even with static caching. Perhaps this was due to the large amount of memory the data for the theme registry consumes.

Each invocation took up to 11 milliseconds, and those added up significantly.

We measured this by adding the following debugging code in includes/theme.inc::theme_get_registry()

function theme_get_registry($registry = NULL) {
  timer_start(__FUNCTION__);
  static $theme_registry = NULL;
  if (isset($registry)) {
    $theme_registry = $registry;
  }

  $time = timer_stop(__FUNCTION__);
  error_log(__FUNCTION__ . ':' . $time['time']);
  return $theme_registry;
}

We then checked how much time the skinr_preprocess() function takes, by adding a static flag to prevent it from being executed more than once. This of course would impact the site's functionality, but we wanted to measure the impact of all these calls:

function skinr_preprocess(&$vars, $hook) {
  static $already_called = FALSE;
  if ($already_called) {
    return;
  }

  ...
  
  $already_called = TRUE;
}

We found that this made a significant difference in performance, when measured via a code profiler, such as xhprof:

theme_get_registry() 194 calls total of 327,560 usecs 15.8% of total page load time
skinr_preprocess() 190 calls total of 264,246 usecs 12.8% of total page load time

By eliminating these, we take off more than half a second of page execution time!

Instead of a total 2,068 ms before, it is down to 1,220 ms with the change.

Since this does impact the site's functionality, we modified it to a less performant, but still functional, variant like so (for Skinr 6.x-1.x):

function skinr_preprocess(&$vars, $hook) {
  static $already_called = FALSE;
  static $skinr_data;
  static $info;
  static $current_theme;
  static $theme_registry;

  if (!$already_called) {
    $already_called = TRUE;

    // Let's make sure this has been run. There have been
    // problems where other modules implement a theme 
    // function in their hook_init(), so we make doubly
    // sure that the includes are included.
    skinr_init();
  
    $skinr_data = skinr_fetch_data();
    $info = skinr_process_info_files();
    $current_theme = skinr_current_theme();
    $theme_registry = theme_get_registry();
  }

  ...
}

Before:

Total page generation time 2,055,029 usecs

theme_get_registry 194 calls totalling 328,355 usecs and 16.0% of total page generation time
skinr_preprocess 190 calls totalling 266,195 usecs and 13.0% of total page generation time

After:

Total page generation time 1,402,446 usecs

Over 650 milliseconds saved!

We also noticed a corresponding decrease in CPU utilization, which means the servers can scale better.

We have created patches that fix the above. You can grab them for Skinr 6.x-1.x, 6.x-2.x and 7.x-2.x from issue #1916534.
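
For reference, the profiler numbers above come from the standard xhprof extension workflow. Below is a minimal sketch of capturing such a run; the xhprof_lib include paths and the "drupal" run namespace are assumptions that will differ per install.

// Start profiling, collecting CPU and memory data as well.
xhprof_enable(XHPROF_FLAGS_CPU + XHPROF_FLAGS_MEMORY);

// ... run the code to be profiled, e.g. a full Drupal bootstrap and page build ...

// Stop profiling and save the run so it can be viewed in the xhprof UI.
$xhprof_data = xhprof_disable();
include_once '/usr/share/php/xhprof_lib/utils/xhprof_lib.php';   // path is an assumption
include_once '/usr/share/php/xhprof_lib/utils/xhprof_runs.php';  // path is an assumption
$xhprof_runs = new XHProfRuns_Default();
$run_id = $xhprof_runs->save_run($xhprof_data, 'drupal');
print "xhprof run id: $run_id\n";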

Dec 12 2012
Dec 12

We encounter this problem a lot: the extremely popular and oft-used Admin Menu module causes performance problems.

Here is an example from a site we recently did a Drupal performance assessment for.

Executed 3,169 queries in 584.11 milliseconds.
Page execution time was 4,330.86 ms.

As you can see, the number of queries per request is horrendous, and the site is a resource hog if left in that state.

There were several reasons that were causing excessive number of queries leading to excessive page load time, and general slowness and resource usage.

The key here is that the context module was doing a cache_set() and that was triggering Admin Menu to rebuild its menus.

We diagnosed the problem, and were able to get over it by disabling the following modules:

  • ND Context
  • Context Layout
  • Context UI
  • Context
  • Admin Menu

After disabling the above modules, we got much better response times and far fewer queries, as follows:

Executed 245 queries in 59.41 milliseconds.
Page execution time was 866.24 ms.

Orders of magnitude better ...

We are told, but have not verified, that the 3.x branch is supposed to fix some of the performance issues of Admin Menu.

Dec 03 2012
Dec 03

We were recently troubleshooting a client site running Drupal 7.x, and the main complaint was high memory usage on all the site's pages.

We were able to diagnose and solve two main causes, ranging from the common to the unusual.

This is a Drupal 7 Commerce site with 173 modules, and 3 themes enabled. Apache Solr is used for search, and there is custom code to talk over the network to a non-Drupal backend server.

The site runs on a Debian Squeeze Xen VPS.

For most of the site's pages, the client was seeing high memory usage, as follows:

Problem: high memory usage

When every page load has extremely excessive memory usage, this could be a bottleneck for the site scaling well, since the server has to have enough memory to cope with many pages at the same time using lots of memory.

The first access to a page, where APC has not yet cached anything, would look like this in devel:

Memory used at: devel_boot()=6.92 MB, devel_shutdown()=236.03 MB, PHP peak=243.25 MB.

Subsequent access would show less memory usage, since APC caches the Drupal PHP files, like so:

Memory used at: devel_boot()=6.54 MB, devel_shutdown()=174.37 MB, PHP peak=175.5 MB.

Some pages even reported up to 192 MB of peak memory!

That is still excessive. For a site with that many modules, we expected that memory usage would be high, but not to that extent.

Solutions to high memory usage

Increasing APC shared memory size

First, the allocated shared memory for APC was not enough.

The apc.shm_size parameter for APC is set to the default of 32MB.

The code base with that many modules needed at least double that or more.

So, increasing this to 96 MB solved that part.

To do so on Debian or Ubuntu, change the following line in the file /etc/php5/apache2/conf.d/apc.ini

apc.shm_size = 96
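
To check whether the allocated segment is actually big enough for your code base, you can report APC's shared memory usage from a small script. The sketch below assumes the APC extension is enabled, and should be run through the web server (the command line usually has a separate, empty cache).

<?php
// Report APC shared memory totals (requires the APC extension).
$sma   = apc_sma_info();
$total = $sma['num_seg'] * $sma['seg_size'];
$free  = $sma['avail_mem'];
printf("APC shared memory: %.1f MB total, %.1f MB free (%.1f%% used)\n",
  $total / 1048576, $free / 1048576, 100 * ($total - $free) / $total);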

Replacing php5-memcached with php-memcache

The other big cause of excessive memory usage was quite unusual. It was the use of php5-memcached (notice the "d") to connect PHP with the memcached daemon, rather than the more commonly used php5-memcache (without a "d").

For some unknown reason, the PHP memcached extension (from the Debian package php5-memcached) uses way more memory than the php5-memcache extension.

In order to remedy this, do the following:

$ sudo aptitude purge php5-memcached
$ sudo aptitude install php5-memcache

What a difference a "d" makes!
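
To confirm which client extension PHP actually loads after the switch, a quick check like this sketch will do; the extension names are "memcache" and "memcached" respectively.

<?php
// TRUE/FALSE for each of the two competing client extensions.
var_dump(extension_loaded('memcache'));   // php5-memcache  (no "d")
var_dump(extension_loaded('memcached'));  // php5-memcached (with the "d")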

Results

The results after doing both of the above things were dramatic. Instead of 175 MB per page, it is now a more typical (for a complex site) 60 MB!

Memory used at: devel_boot()=2.15 MB, devel_shutdown()=58.4 MB, PHP peak=59.5 MB.

Note that these figures are not absolute, and will vary from distro to distro and server to server, depending on what modules you have enabled in PHP and Apache, and many other factors. What matters is the comparative figures, not absolute figures.

For example, the same site on an Ubuntu Server LTS 12.04, which we used for our lab servers:

Memory used at: devel_boot()=2.11 MB, devel_shutdown()=72.27 MB, PHP peak=75.75 MB.

It will be different on CentOS.

Nov 19 2012
Nov 19

The site is ScienceScape.org, a repository of scientific research going back to the 19th century, down to the latest biotechnology and cancer research.

Update: You can watch a video of the presentation on Vimeo.

Oct 29 2012
Oct 29

But this time, it was different. Modules were not to blame.

While inspecting a site that had several performance problems for a client, we noticed that memory usage was very high. From the "top" command, the RES (resident set) field was 159 MB, far more than it should be.

We narrowed down the problem to a view that is in a block that is visible on most pages of the site.

But the puzzling part is that the view was configured to return only 5 rows. It did not make sense for it to use that much memory.

However, when we traced the query, it was like so:

SELECT node.nid, ....
FROM node
INNER JOIN ...
ORDER BY ... DESC

No LIMIT clause was in the query!

When executing the query manually, we found that it returned 35,254 rows, with 5 columns each!

Using the script at the end of this article, we were able to measure memory usage at different steps. We inserted a views embed in the script and measured memory usage:

Before boot               0.63 MB
Boot time               445.1 ms
After Boot               51.74 MB
Boot Peak                51.80 MB
Module count            136
After query              66.37 MB
Query Peak               67.37 MB
After fetch              78.96 MB
Fetch Peak              148.69 MB

So, indeed, that view was the cause of the inflated memory usage! With memory jumping from 67 MB to 148 MB.

In this case, it turns out that the module "views_php" was definitely the culprit. Once it was disabled, the query did not have that huge memory foot print any more.

Here are the results after disabling views_php:

Before boot               0.63MB
Boot time               427.1 ms
After Boot               51.68MB
Boot Peak                51.74MB
Module count            135
After query              66.31MB
Query Peak               67.31MB
After fetch              74.71MB
Fetch Peak               74.89MB

A more reasonable 75MB.

We did not dig further, but it could be that because a field of type "Global: PHP" was used, views wanted to return the entire data set and then apply the PHP to it, rather than add a LIMIT to the query before executing it.

So, watch out for those blocks that are shown on many web pages.

Baseline Memory Usage

As a general comparative reference, here are some baseline figures. These are worst case scenarios, and assume APC is off, or that the measurement is run from the command line, where APC is disabled or non-persistent. The figures would be lower from Apache when APC is enabled.

These figures will vary from site to site, and they depend on many factors. For example, what modules are enabled in Apache, what modules are enabled in PHP, ...etc.

Drupal 6 with 73 modules

Before boot:     0.63 MB
After boot:     22.52 MB
Peak memory:    22.52 MB

Drupal 7 site, with 105 modules

Before boot:     0.63 MB
After boot:     57.03 MB
Peak memory:    58.39 MB

Drupal 7 site, with 134 modules

Before boot:     0.63 MB
After boot:     58.79 MB
Peak memory:    60.28 MB

Drupal 6 site, with 381 modules

Before boot:     0.63 MB
After boot:     66.02 MB

Drupal 7 site, pristine default install, 29 modules

Now compare all the above to a pristine Drupal 7 install, which has 29 core modules installed.

Before boot     0.63 MB
Boot time     227.40 ms
After Boot     20.03 MB
Boot Peak      20.07 MB
Module count   29

Effect of APC on boot memory footprint

To see how much APC, and other opcode caches, improve these figures, compare the following:

First access after Apache restarted, for a Drupal 6 site:

Before boot     0.63 MB
Boot time     802.5 ms
After Boot     62.85 MB
Boot Peak      63.11 MB
Module count   210

Subsequent accesses, with APC caching the code:

Before boot     0.61 MB
Boot time     163.80 ms
After Boot     17.24 MB
Boot Peak      18.41 MB
Module count   210

Also, for a default Drupal 7 install, with 29 core modules. Compare to the above figures for the same site without APC.

Before boot     0.61 MB
Boot time      60.4 ms
After Boot      3.36 MB
Boot Peak       3.41 MB
Module count   29

A marked improvement! Not only in bootup time, but also in a reduced memory footprint.

So always install APC, and configure it correctly on your site.

Memory Measurement Script

Here is a script to measure your memory usage. You can run it from the command line as:

$ cd /your/document_root
$ php mem.php

Add whatever part you think is causing memory usage to sky rocket in place of the commented out section, and you can see how much is being used.

You can also add HTML line breaks to the print statement, and run it from a browser to see the effect of APC code caching as well.

<?php

define('DRUPAL_ROOT', getcwd());

function measure() {
  list($usec, $sec) = explode(" ", microtime());
  $secs = (float)$usec + (float)$sec;
  return $secs * 1000;
}

function fmt_mem($number) {
  return number_format($number/1024/1024, 2) . "MB";
}

$results = array();
$results['Before boot'] =  fmt_mem(memory_get_usage());

$start = measure();
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

$results['Boot time'] = number_format((measure() - $start), 1) . " ms";
  
$results['After Boot'] = fmt_mem(memory_get_usage());
$results['Boot Peak '] = fmt_mem(memory_get_peak_usage());
  
$count = count(module_list());
$results['Module count'] = $count;

// Testing a view using the standard method
/*
$view = views_embed_view('your_view', 'block');
*/

// Testing a view in two steps
/*
$view = views_get_view('your_view');

$results['After query'] = fmt_mem(memory_get_usage());
$results['Query Peak '] = fmt_mem(memory_get_peak_usage());

$output = $view->preview('block');
*/

$results['After fetch'] = fmt_mem(memory_get_usage());
$results['Fetch Peak '] = fmt_mem(memory_get_peak_usage());

foreach($results as $k => $v) {
  print $k . "\t\t" . $v . "\n";
}
Oct 22 2012
Oct 22

Do you have a problem with some Drupal admin pages with a large number of input fields not saving? Does the page just return with no message confirming that the changes have been saved?

Well, this happened recently on a site that we were troubleshooting for a client.

The symptoms were: trying to save a page that has lots of input fields a) comes back with no message about saving changes, and b) the changed values were not saved.

The site had 210 enabled modules, 14 user roles defined, and 84 content types, with 76 content fields!

For example, take the permissions page at admin/users/permissions. If the site has lots of content types, each with lots of fields, then modules with many permutations of permissions for content types and fields will each have to define reams and reams of permissions each.

For this site, it was the following modules, combined with the number of content types and fields, that caused the permissions to grow like that.

  • node
  • actions_permissions
  • publishcontent
  • content_permissions

Let us verify that by saving the permissions page as HTML, and then doing some analysis:

$ grep -c 'input type="checkbox"' permissions.html 
20748

Look at that: 20,748 checkboxes!

Let us see how many permissions we have:

$ grep -c 'class="permission"' permissions.html        
1482

Yup! That is 1,482 permissions!

If you multiply 1482 X 14 roles = 20,748 total checkboxes!

The root cause for this was two fold, one on the PHP side and the other on Apache's side.

Configuring PHP to accept more input variables

The default value for the number of input variables in PHP is 1000. While this is sufficient for normal sites, it is not so for sites that overuse (misuse/abuse?) Drupal features.

You need to increase the number of input variables that PHP accepts.

To verify that this is your problem, look in your web server's error log for something similar to this error message:

mod_fcgid: stderr: PHP Warning: Unknown: Input variables exceeded 1000. To increase the limit change max_input_vars in php.ini. in Unknown on line 0

Just add the following to your php.ini file:

max_input_vars = 1500

If you have the Suhosin enhanced security extension for PHP, then you need to add these as well:

suhosin.post.max_vars = 1500
suhosin.request.max_vars = 1500

Then, restart your web server.

You should be able to save the page now, and get a confirmation message, and see that your changes have "stuck".
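
To double check that the web server's PHP picked up the new value (and not just the command line), a tiny script like the following sketch, dropped in the document root and loaded from a browser, will show the effective limits.

<?php
// Show the limits as seen by the web server's PHP, not the CLI.
print 'max_input_vars: ' . ini_get('max_input_vars') . '<br/>';
print 'post_max_size: '  . ini_get('post_max_size') . '<br/>';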

But wait a minute: how come you have over 20,000 checkboxes, yet you made it work with only 1,500?

The answer is that Drupal uses input value arrays for most fields, so there is not a one-to-one relationship between the number of checkboxes and the number of input variables.

Configuring FastCGI for large input

If you are using FastCGI, for example mod_fastcgi or fcgid, then pages would still not save even if you implement the above changes. The reason is that with that many input fields, you overflow the default maximum size of requests between Apache (or nginx) and PHP over the FastCGI protocol.

Look in your server's error log for an error message like this one:

mod_fcgid: HTTP request length 131998 (so far) exceeds MaxRequestLen (131072)

Normally, either you will see the errors in the web server's error log, or you will see them right there on the page. But we have had cases where low cost web hosts don't log errors at all anywhere.

The default is 128 Kilobytes (128 X 1024 = 131,072 bytes), and was not enough for this huge number of fields.

To confirm that you are running FastCGI, go to /admin/reports/status/php. If "Server API" is set to "CGI/FastCGI", then continue with the next step.

The fix is easy, and would go under either FastCGI or fcgid, as the case may be with your setup:

For example, if you are using fcgid, you would add the following under the IfModule mod_fcgid.c section:

  FcgidMaxRequestLen  524288

If you are using the older FastCGI, then you need to add that under the IfModule mod_fastcgi.c section.

Once the above was changed, the page displayed the reassuring message "The changes have been saved", and, combined with the max_input_vars change above, the values were saved correctly.

Oct 15 2012
Oct 15

We recently did a Drupal Performance Assessment project for a very large site, with a massive database. The database takes over 174GB, and houses a total of over 707 million rows.

Exporting and importing a database of that size is challenging in many ways.

We had to use different tools and techniques than the standard mysqldump utility, which was causing the import to exceed 14 days for this database. We were able to cut that down to less than 4 days.

While we plan to talk about this site in detail at Drupal Camp Toronto, and publish an article on it, this article is about a very useful byproduct.

Maatkit mk-parallel-dump

We first considered the use of Maatkit parallel dump. However, this is deprecated and is no longer supported in Percona Toolkit, the successor to Maatkit.

Mydumper and Myloader

We then found the right tool: Mydumper.

Mydumper is a very useful tool for faster backups for many sites with smaller database sizes. The tool is written by current and past MySQL/Oracle staff.

The trick for this tool is that it relies on doing dumps and imports in parallel using multiple threads, rather than using a single thread, which is slow and inefficient.

Compiling and Installing Mydumper from Source

Although Mydumper is available as a package in the repository for Ubuntu Server 12.04 LTS Precise Pangolin, it does not work, and causes a segmentation fault error.

For this reason, we have to build it from source. However, a patch is also needed in order for it to compile correctly.

This is described in detail in the following steps:

Install the required packages for compiling from source

sudo aptitude install cmake g++ patch libglib2.0-dev \
  libmysqlclient-dev zlib1g-dev libpcre3-dev

Download Mydumper

You need to download Mydumper 0.5.1:

wget http://goo.gl/dVhJD

Extract the archive

Then extract the archive in a directory:

mv dVhJD mydumper-0.5.1.tar.gz 
tar xzf mydumper-0.5.1.tar.gz 

Change to the directory where you extracted the source:

cd mydumper-0.5.1/

Get and apply the patch

Then get the patch from Launchpad issue #1002291.

wget http://goo.gl/eUMx6

Now patch the source with what you just downloaded, as follows:

patch -p1 < eUMx6

Compile the source

Then you compile the source. Do not forget the "." after cmake!

cmake .
make

Install

Finally, you need to install the binaries to /usr/local/bin:

sudo make install

Using Mydumper

As we said earlier, Mydumper works by having multiple threads dumping multiple tables, or parts of tables, concurrently.

The program has many options, but the important ones are:

--database

Specifies the name of the database to dump

--threads

The number of threads to use. Generally set this to the number of CPUs in the server, but it can be more in some cases.

--rows

The number of rows per "chunk", before creating another chunk for the same table. We used 500,000 and got good results, but it depends on how many rows your tables have.

You can use this script for backup.

#!/bin/sh

DIR_BACKUP=/backup-directory
DB_HOST=localhost
DB_NAME=database_name
DB_USER=user
DB_PASS=password

DOW=`date "+%a"`
DB_DUMP=$DIR_BACKUP/dbdump.$DB_NAME.$DOW

mydumper \
  --database=$DB_NAME \
  --host=$DB_HOST \
  --user=$DB_USER \
  --password=$DB_PASS \
  --outputdir=$DB_DUMP \
  --rows=500000 \
  --compress \
  --build-empty-files \
  --threads=2 \
  --compress-protocol

Save that file to /usr/local/bin/dbdump.sh

Make it executable using:

chmod +x /usr/local/bin/dbdump.sh

Create a crontab job to run it daily, say at 4:10 am local server time.

10 04 * * * /usr/local/bin/dbdump.sh

Description of Mydumper's output data

Mydumper does not write its output to a single file, but rather to a set of files in a directory. The --outputdir option specifies the name of the directory to use.

The output is two parts:

Schema

For each table in the database, a file containing the CREATE TABLE statement will be created. It will be named:

dbname.tablename-schema.sql.gz

You could manipulate this schema to gain some advantages: for example, uncompress the file and edit the table creation statement to remove all indexes other than the primary index. This would make the loading of huge tables much faster than with indexes.

Remember to do a CREATE INDEX for each one after that.

Data

For each table with number of rows above the --rows parameter, you will have a file called:

dbname.tablename.0000n.sql.gz

Where "n" starts with 0 up to the number of.

Import/restore using Myloader

You don't need to know the internals of how Mydumper stores its data in order to do a restore/import. The Myloader tool does that for you.

If you are importing on a machine that has more CPUs than the machine you dumped on, you can use a higher number for the --threads parameter, for example:

myloader \
  --database=$DB_NAME \
  --directory=$DB_DUMP \
  --queries-per-transaction=50000 \
  --threads=6 \
  --compress-protocol \
  --verbose=3

So, what improvement can we see? Here are the details.

Before: Using mysqldump:

For a site that has 5.8 million rows in one table, and 1.39 million in another, there was a noticeable improvement from using Mydumper over mysqldump.

11:00:01 - Start dump of database to .../dbdump.live.Sat.sql.gz
11:08:33 - Successfully dumped database live

So, it takes 8.5 minutes for the dump to complete.

After: Using Mydumper:

After using Mydumper with --rows=500000, the backup was much faster:

11:00:01 - Start dump of database to .../dbdump.live.Mon
11:00:43 - Finished dumping the database live

Only 42 seconds!

Other approaches for faster backups

For sites that do not have large tables that warrant the use of Mydumper, there are other approaches that can help.

For example, you can exclude the data from certain tables with transient content, such as watchdog, sessions, and all the cache tables.

This partial code snippet uses some shell script wizardry to exclude such tables with transient data, and only include their table creation statements, but not their content:

  #!/bin/sh

  # Tables to backup structure, but not data
  DB_NO_DATA="/^(watchdog|sessions|cache(_.+)?)$/"

  ...

  TABLES=`mysql -u... -p... -B -N -e 'SHOW TABLES;' $DB_NAME`
  for TABLE in $TABLES
  do
    RESULT=`echo $TABLE | awk "$DB_NO_DATA"`
    if [ "$RESULT" != "" ]; then
      # Structure only
      OPTS=" --no-data "
    else
      # Default, structure AND data
      OPTS=""
    fi
    mysqldump $OPTS -u... -p... $DB_NAME $TABLE >> $DB_DUMP
    if [ $? != 0 ]; then
      echo "ERROR: Failed in dumping database $DB_NAME"
    fi
  done

  # Compress the backup
  bzip2 $DB_DUMP
  mv $DB_DUMP.bz2 $DB_DUMP
  echo "Successful dump for database $DB_NAME"

A similar approach for excluding contents of tables with transient content is outlined in this article, if you are using Capistrano.

A different way that relies, like Mydumper, on doing parallel dump and restore, is to use Lullabot's MySQL Parallel, which uses a shell script wrapper around GNU Parallel, and Parallel bzip2. This is of course much better than plain mysqldump, but will not parallelize portions of tables. So a site with a huge table will do other tables quickly, but will be slow for the huge table since all of it will be done in a single thread.

Other use cases for Mydumper/Myloader

More information on how we used MyDumper/MyLoader to cut down the export/import of a massive Drupal database can be found in our presentation: Huge! Drupal site with 381 modules and 174GB MySQL database and 200 million row tables, at Drupal Camp Toronto in November 2012.

Further Information

You can find more information about Mydumper in the following articles:

Update: Added link to presentation that contains MyDumper/MyLoader info.

Jul 23 2012
Jul 23

In the past few days, we have seen another Denial of Service attack on a client's site.

The symptoms were a complete outage of the server, with very high CPU usage, and high load average (over 900 in some cases!).

Upon investigating, we found that this is caused by the following hits:

75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:06 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:06 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"

So, a script/bot was used to post data to the home page of the site, and that caused an internal server error (HTTP 500) to be returned.

All the IP addresses belonged to the University of Illinois at Urbana-Champaign (UIUC), or to Comcast customers also in Illinois.

For the first and second incident, we blocked the IP address, or entire subnets. However we soon realized that this is a futile effort, since other IP addresses will be used.

We then devised a plan to prevent the POST request from reaching PHP altogether. This was done by adding the following to Drupal's .htaccess. Basically, it returns an access denied right from Apache if the conditions were met. The conditions are: empty referer, empty user agent, POST request to the home page.

# Modification for dealing with botnet DDoS via high CPU utilization
# by overwhelming PHP with POST data
#
# Referer is empty
RewriteCond %{HTTP_REFERER} ^$
# User agent is empty
RewriteCond %{HTTP_USER_AGENT} ^$
# The request is for the home page
RewriteCond %{REQUEST_URI} ^/$
# It is a POST request
RewriteCond %{REQUEST_METHOD} POST
# Forbid the request
RewriteRule ^(.*)$ - [F,L]

After implementing the above fix, the hits were successfully deflected, with no ill effect on the site.

67.177.109.10 - - [23/Jul/2012:16:31:23 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:23:58 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:31:21 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:23:14 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:23:29 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:26:58 -0400] "POST / HTTP/1.1" 403 202 "-" "-"

Obviously, this is not the only protection for this type of attack. Other ways include installing Suhosin, and fiddling with Drupal's .htaccess as well, as described in PSA-2012-001.

Apr 22 2012
Apr 22

We had a site for a client that was stable for close to two years, then suddenly started to experience switches from the master to the geographically separate slave server as frequently as twice a week.

The site is an entertainment news site, and its articles get to Google News on occasions.

The symptoms were increased load on the server, and a sudden influx of traffic causing over 800 simultaneous connections, all in the ESTABLISHED state.

Normally, a well tuned Drupal site can withstand this influx, with server optimization and proper caching. But for this previously stable site, we found that a combination of factors, some internal to the site and others external, combined to cause the site to switch.

The internal factor was the way the site was set up using purl and other surrounding code. Links had a top level section added to the URL, which then redirected to the real URL. This caused around 30% of accesses to result in a 302 redirect. Since redirects are not cached, they incurred more overhead than regularly served pages.

Investigating the root cause

We started checking if there is a pattern, and went back to analyse the server logs as far back as a year.

We used the ever helpful Go Access tool to do most of the investigative work.

A week in April 2011 had 28% redirects, but we found an anomaly in the browser share over the months. For that same April week, the browser breakdown is 34% MSIE, 21% Safari and 21% Firefox.

A week in Sep 2011, redirects are 30%, browsers are 26% Safari, 25% MSIE and 20% Firefox. These make sense as Safari gains more market share and Microsoft loses market share.

But when checking a week in Feb 2012, redirects are 32%. Look at the browsers, though: 46% Firefox, 16% Safari, 14% Others and 12% MSIE.

It does not make sense for Firefox to jump by that much and gain market share from thin air.

A partial week in March 2012 shows that redirects are 32%, and again, the browsers are 52% Firefox, 14% Others, 13% Safari and 10% MSIE.

That MSIE dropped is something that one can understand. But the jump in Firefox from Sep to Feb/March is unjustified, and tells us that perhaps there are crawlers, scrapers, leechers or something else masquerading as Firefox and hitting our content.

Digging deeper, we find that the top 2 Firefox versions are:

27,092 Firefox/10.0.2
180,420 Firefox/3.0.10

The first one is understandable, a current version of Firefox. The second one is a very old version from 2009, and has 6.6X the traffic of the current version!

The signature for the user agent is all like so, with a 2009 build:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)

We went back and looked at a week in September (all hours of the day), with that browser signature, and lo and behold:

Unique visitors that suck lots of bandwidth:

  88      10.49%  24/Sep/2011  207.76 MB
  113     13.47%  23/Sep/2011  994.44 MB
  109     12.99%  22/Sep/2011    1.44 GB
  133     15.85%  21/Sep/2011    1.70 GB
  134     15.97%  20/Sep/2011    1.68 GB

There were only 335 different IP addresses!

But look at the same user agent in March for a week:

   94479  38.36%  15/Mar/2012   16.38 GB
  102037  41.43%  14/Mar/2012   17.13 GB
   38795  15.75%  13/Mar/2012   12.48 GB
   11003   4.47%  12/Mar/2012   10.90 GB

See the number of unique visitors compared to September?
And now there are 206,225 different IP addresses!

For a few days in March, Monday to Thursday, here are the figures for this user agent.

Total requests to pages (excluding static file): 1,122,229
Total requests that have an empty referer: 1,120,843
That is, 99.88% are from those botnets!

Verifying the hypothesis

Looking at the web server logs through awstats, we found that a year ago, in Feb 2011, the market share for Firefox overall was 24.7%, with 16,559,999 hits. And at that time, Firefox 3.0.10 had only 44,436 hits.

That is 0.002 % of the total.

In Sep 2011 it had 0.2% with 241,869 hits.

Then in Feb 2012, that old version from 2009 had a 2.2% share of hits, with 4,409,396 hits.

So, from 0.002% to 2.2% of total, for an obsolete version of Firefox. This means growth by a factor of 1,100 X in one year.

Does not make sense.

Botnet hammering the site

So, what does this tell us?

Looking at a sample of the IP addresses, we found that they all belong to Cable or DSL companies, mainly in the USA.

This tells us that there is a massive botnet infecting lots of PCs.

They were piloting the botnet in September and went full speed after that, and they are hitting the server hard.

The programs of the botnet seem to have a bug in them that prevents them from coordinating with each other, and they all try to grab new content at the same time. This poor coding, combined with the non-caching of 302 redirects, causes the sudden influx of traffic that brings the server to its knees.

Just to make sure, we quickly checked two other sites that we manage for the same symptoms. One entertainment site is showing similar signs; the other, a financial site, is not. Both have good caching because they serve no redirects (97% to 98% of requests return code 200), and that is why the entertainment site can withstand the onslaught.

Solution: block the botnet's user agent

Since the botnet is coming from hundreds of thousands IP addresses, it is not possible to block based on the IP address alone.

Therefore, the solution was to block requests coming with that browser signature from 2009 only, and only when there is no referer.

This solution, that goes into settings.php, prevents Drupal from fully booting when a bad browser signature is encountered and the referer is empty.

We intentionally sent the humorous, but still legitimate, 418 HTTP return code so we can filter by that when analysing logs.

$botnet = 'Gecko/2009042316 Firefox/3.0.10';
if ($_SERVER['HTTP_REFERER'] == '') {
  if (FALSE !== strpos($_SERVER['HTTP_USER_AGENT'], $botnet)) {
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}

The above should work in most cases.

However, a better solution is to keep the changes at the Apache level and never bother with executing any PHP code if the conditions are met.

# Fix for botnet crawlers, by 2bits.com, Inc.
#
# Referer is empty
RewriteCond  %{HTTP_REFERER}    ^$
# User agent is bogus old browser
RewriteCond  %{HTTP_USER_AGENT} "Gecko/2009042316 Firefox/3.0.10"
# Forbid the request
RewriteRule  ^(.*)$ - [F,L]

The drawback is that we are using a 403 (access denied) instead of the 418 (I am a teapot), which can skew the statistics a bit in the web server logs.

Further reading

After investigating and solving this problem, I discussed the issue with a friend who manages several high traffic sites that are non-Drupal, and at the time, he did not see the same symptoms. However, a few weeks later he started seeing the same symptoms, and sent me the first two articles. Months later, I saw the third:

Mar 01 2012
Mar 01

A client contacted us to assist them in finding a solution for slow page times for their site.

All the pages of the site were slow, taking 2.9 to 3.3 seconds.

Upon investigation, we found that one view was responsible for most of that time.

However, the query execution itself was fast, around 11 ms.

But, the views rendering time was obscenely high: 2,603.48 ms!

So, when editing the view, you would see this at the bottom:

Query build time        2.07 ms
Query execute time     11.32 ms
View render time    2,603.48 ms

Since this view was on each page, in a block on the side bar, it was causing all the pages of the site to be slow.

The underlying reason was really bad coding in the views-view--viewname.tpl.php, which is too long to explain. But the gist of it is that the view returned several thousand rows of taxonomy terms, and was supposed to render them in a tree. However, the actual view template just looped through the dataset, did not do much with it, and displayed static HTML in the end!

The solution for this was quite simple: enable Views caching.

Global Caching

If most of your visitors to the site are anonymous (i.e. not logged in to Drupal), then it is simpler and better to use global caching.

To do this, go to the view's Defaults, then Basic settings, then Caching. Change to Time Based, then select at least 1 hour for each of Query results and Rendered output.
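
If you keep your views exported in code, the same time based caching can be set in the export. The sketch below assumes a Views 3 style export; the keys cache both the query results and the rendered output for one hour.

// Inside the exported view, on the default display handler:
$handler->display->display_options['cache']['type'] = 'time';
$handler->display->display_options['cache']['results_lifespan'] = 3600; // query results: 1 hour
$handler->display->display_options['cache']['output_lifespan'] = 3600;  // rendered output: 1 hour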

Block Caching

If your site has a significant portion of its visitors logged in, then you can be more granular in caching, per role, per user, ...etc.

To do this, go to the view's Block menu on the left, then Block settings, then Caching.

You can select various options, such as per role, or per user, depending on the specific block.

Remember that you have to do this for each block display, if you have multiple blocks for the same view.

Now, save the view, and you will see a positive impact on performance of your pages.

Feb 28 2012
Feb 28

If your Drupal site suffers occasional slow downs or outages, check if crawlers are hitting your site too hard.

We've seen several clients complain, and upon investigation we found that the culprit is Google's own crawler.

The tell tale sign is that you will see lots of queries executing with the LIMIT clause having high numbers. Depending on your site's specifics, these queries would be slow queries too.

This means that there are crawlers accessing very old content (hundreds of pages back).

Here is an example from a recent client:

SELECT node.nid AS nid, ...
LIMIT 4213, 26

SELECT node.nid AS nid, ...
LIMIT 7489, 26

SELECT node.nid AS nid, ...
LIMIT 8893, 26

As you can see, Google's crawler is going back 340+ pages for the last query.

Going to your web server's log would show something like this:

1.2.3.4 - - [26/Feb/2012:07:26:59 -0800] "GET /blah-blah?page=621 HTTP/1.1" 200 10017 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Note the page= part, and the Google Bot as the user agent.

The solution is often to go into Google Webmaster and reduce the crawl rate for the site, so they are not hitting too many pages at the same time. Start with a 20% reduction. You may need to go up to 40% in severe cases.

Either way, you need to experiment with a value that would fit your site's specific case.

Nov 27 2011
Nov 27

Recently a client complained that their site has been suffering from slow downs that were not there before.

Upon investigating the history of resource utilization, we found that memory usage has increased, and at the same time, the load average has increased as well.

To cut a long story short, we found out that initially the site had neither the dblog nor the syslog module enabled; they were enabled later, and the issues started when syslog was enabled.

Normally, this would not be an issue. But for this site, there were lots of PHP and module errors being written to the log, which tied up the disk.

These errors looked like so:

purl: Infinite redirect prevented.
Warning: Invalid argument supplied for foreach() in some_module_category_context() 
(line 539 of .../modules/custom/some_module/some_module.module).

We pointed out these errors to the client, so that they fix their code.

For the meantime, and to test/prove that this is the root cause, we disabled the syslog module.

Upon doing so, memory utilization went down from 13 GB to 4 GB, and the load average dropped from 4-5 down to a mere 1-2.

The graphs show the decrease in CPU, load average and specially disk I/O and memory usage.

CPU graph. Note the height of the green peaks over the week.

Load average graph.

Disk I/O is much less now.

And memory is also less. The daily cycle is caused by the log rotation of the Drupal syslog. Now the usage is much less, and therefore more free memory for file system cache and buffers.

So, if you have a high load average and the site is slow, check if you are writing a lot of data to the disk either via dblog or syslog. That may be your issue.

Nov 14 2011
Nov 14

Together with Alan Dixon of Black Fly Solutions, Khalid Baheyeldin of 2bits.com, Inc. gave a presentation on Web Site Performance, Optimization and Scalability at Drupal Camp 2011.

The slides from the presentation are attached below.

Nov 04 2011
Nov 04

By design, the Drupal CAPTCHA module disables page caching for pages it is enabled on.

So if you enable CAPTCHA for user login and/or registration forms, those pages will not be cached. This is often acceptable.

However, if you enable CAPTCHA for comments, and have the comment form visible at the bottom of each node, then a big portion of your site's pages will not be cached in the page cache at all.

This means that every node will be treated as if the user were logged in, even though it is anonymous visitors who are hitting it. This incurs a lot of PHP execution and SQL queries to the database.

So, even though you are using memcache for your page cache, the site would probably suffer.

Nov 04 2011
Nov 04

We recently conducted a performance assessment for a client, and the main problem was something really simple, but also very detrimental to the site.

The site used the Service Links module, version 6.x-2.x.

The following are the performance figures for the site, in seconds, before any tuning.

 3.67 http://example.com/product/215
 2.64 http://example.com/product/572
68.32 http://example.com/list1
65.11 http://example.com/list2

Having a page load in 3.6 seconds is bad, but having another take 68 seconds is, well, obscene.

In this case, the Service Links module was used to display a block for each node that generates a link for each of Facebook, Twitter, LinkedIn, and Digg.

But the key was this configuration:

Use short links: Always
How generate short links: Use TinyURL.com Service

This means that for each node, your web site has to make four accesses over the network to TinyURL.com to generate short URLs.

For each node, this took over 1 second to generate.

On pages that are lists of nodes of 50 or so, this takes a lot of time, hence the over 60 seconds generation time.

After changing the configuration for Service Links to use a short URL of node/xxx, these were the performance figures:

2.04 http://example.com/product/215
2.04 http://example.com/product/572
4.93 http://example.com/list1
6.51 http://example.com/list2

A 10 X improvement over the original.

Nov 04 2011
Nov 04

Quick Tabs is a widely used Drupal module. Site builders like it because it improves usability in some cases by reducing clutter.

Incidentally, the way this module works has caused us to run across performance issues with certain uses. See our previous articles: Quick Tabs can sure use more caching, and a case study involving Quick Tabs.

This is the third case of a site having problems caused by a combination of Quick Tabs and confused crawlers, including Google's own crawler.

The Symptoms

The main symptom was a slow down of the server, spikes in slow queries, outages, ...etc.

When investigating the issue, we found that this particular site had close to 66,000 nodes.

But Google Webmaster listed over 12 million URLs having "quicktabs" as a parameter!

Googlebot confused

When investigating the web server log, we found the following:

GET /user/20847?quicktabs_8=0 HTTP/1.1" ... YandexBot/3.0
GET /user/13938?quicktabs_8=2 HTTP/1.1" ... Googlebot/2.1
GET /node/feed?quicktabs_2=0&quicktabs_3=0&quicktabs_7=4 HTTP/1.1" ... Googlebot/2.1
GET /forums/example-url?quicktabs_8=3&quicktabs_2=2 HTTP/1.1" ... YandexBot/3.0

Even some nonsense URLs like:

/frontpage/playlist.xml?page=74&271=&quicktabs_2=3

No wonder Google indexed orders of magnitude more URLs than the site has pages.

Additionally, you can also use a tool like the extremely useful Go Access to analyze the logs and see the top accesses by IP address.

Avoiding The Problem

To solve this we did several things, on several fronts.

First, in settings.php, we added the following:

$error_404_header = 'HTTP/1.0 404 Not Found';
$error_404_text  = '<html><head><title>404 Not Found';
$error_404_text .= '</title></head><body><h1>Not Found</h1>';
$error_404_text .= '<p>The requested URL was not found on this server.';
$error_404_text .= '</p></body></html>';

$bot_ua = array(
  'bot',
  'spider',
  'crawler',
  'slurp',
);

if (FALSE !== strpos($_SERVER['QUERY_STRING'], 'quicktabs')) {
  foreach ($bot_ua as $bot) {
    if (FALSE !== stripos($_SERVER['HTTP_USER_AGENT'], $bot)) {
      header($error_404_header);
      print $error_404_text;
      exit();
    }
  }
}

Bing and robots.txt

We later found that Bing is hitting the site too much as well.

To solve this, we added the following to the robots.txt file:

User-agent: *
Crawl-delay: 10
# No access for quicktabs in the URL
Disallow: /*?quicktabs_*
Disallow: /*&quicktabs_*

Google Webmaster

We also went into Google Webmaster, and changed crawling to ignore the quicktabs parameter.

To do so go under "Site Configuration" then "URL Parameters", and edit the "quicktabs" parameter and set it to "No: Does not affect page content".

Refer to this help page for more information.

The Results

Not only were the crawlers hitting the site too quickly. They were also putting these pages in the page cache, overflowing the memory allocated for memcache, and thus hurting the performance for regular users, and overloading the server in two ways, not only one.

Here is a comparison of before and after for a one week period.

Before the change for crawlers, we had 66,280 slow queries per week for 292 unique queries, eating up a total of 350,766 seconds.

After the change for crawlers, we have only 10,660 slow queries per week for 162 unique queries, eating up only 47,446 seconds.

This is a 7X decrease of the CPU time used by slow queries.

Oct 15 2010
Oct 15

As promised, the slides are attached, as a PDF for everyone's reference.

Aug 15 2010
Aug 15

A client site was facing intermittent outages, especially at peak hours.

We investigated the issue over a few days, and narrowed down the cause to certain slow queries, described in more details below.

They had tagadelic on every page, displaying a tag cloud, and from that, crawlers hit every term at paths like taxonomy/term/1581.

Slow queries causing server outage

As more of these queries got executed simultaneously, things get worse and worse, because the temporary tables generated by this query are on disk, and the disk gets busier, and more queries keep coming in the meantime, making the disk even busier, and slower in processing the queries.

This is not a good situation, since not only does it lead to slowdowns, but also to outages as more PHP processes are tied up by slow queries ...

Relevant site statistics

First, the site's relevant statistics are as follows:

40,751 rows in the node table
82,529 rows in the term_node table
79,832 rows in the node_access table

The slow query in question was responsible for a combined total of 73.3% of the overall slow query time for all queries. Out of this, 39.8% was from the query itself, and 33.5% was from the COUNT query associated with it.

EXPLAIN output for the query

The slow query that was bogging down the site looked like this, from the taxonomy module, in the function taxonomy_select_nodes():

EXPLAIN
SELECT DISTINCT(n.nid), n.sticky, n.title, n.created 
FROM node n  
INNER JOIN term_node tn0  ON n.vid  = tn0.vid  
INNER JOIN node_access na ON na.nid = n.nid 
WHERE (
  na.grant_view >= 1 AND 
  (
    (na.gid = 0 AND na.realm = 'all') OR 
    (na.gid = 1 AND na.realm = 'job_view') OR 
    (na.gid = 0 AND na.realm = 'resume_owner') OR 
    (na.gid = 0 AND na.realm = 'og_public')
  )
) 
AND 
( 
  n.status = 1  AND 
  tn0.tid IN (1581) 
)
ORDER BY n.sticky DESC, n.created DESC 
LIMIT 0, 15\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: n
         type: range
possible_keys: PRIMARY,vid,node_status_type
          key: node_status_type
      key_len: 4
          ref: NULL
         rows: 40751
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: na
         type: ref
possible_keys: PRIMARY,grant_view_index
          key: PRIMARY
      key_len: 4
          ref: live.n.nid
         rows: 1
        Extra: Using where; Distinct
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: tn0
         type: ref
possible_keys: vid
          key: vid
      key_len: 4
          ref: live.n.vid
         rows: 1
        Extra: Using where; Distinct

Both the node and term_node tables have many rows, and adding to that is the node_access table. An extra join is done on that table, which also contains many rows.

So, for the sake of diagnosis, we rewrote the query without the node_access table, like so:

EXPLAIN
SELECT DISTINCT(n.nid), n.sticky, n.title, n.created 
FROM node n  
INNER JOIN term_node tn0 ON n.vid = tn0.vid 
WHERE n.status = 1  AND 
tn0.tid IN (1581) 
ORDER BY n.sticky DESC, n.created DESC 
LIMIT 0, 15\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: n
         type: range
possible_keys: vid,node_status_type
          key: node_status_type
      key_len: 4
          ref: NULL
         rows: 40751
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tn0
         type: ref
possible_keys: vid
          key: vid
      key_len: 4
          ref: live.n.vid
         rows: 1
        Extra: Using where; Distinct

We eliminate a join, but everything else remains the same, including the temporary table and filesort.

Timing the two scenarios: Using MySQL profiling

We then proceeded to time the two queries in a controlled manner, using two methods: first, MySQL profiling, which is available only in MySQL 5.0.37 or later, and only in the Community Edition.

To do this, you execute the following command on MySQL's command line:

SET PROFILING = 1;

This enables profiling for your session.

You then execute the two above queries, one after the other.

After that, you enter the following command:

SHOW PROFILES;

The output will show the time for each query, and the SQL for it (which we abbreviated for brevity):

+----------+------------+
|        1 | 1.40426500 | SELECT SQL_NO_CACHE DISTINCT(n.nid), ...
|        2 | 0.30603500 | SELECT SQL_NO_CACHE DISTINCT(n.nid), ...
+----------+------------+

Clearly the times are different.

You can also glean more valuable info on resource usage by querying the information_schema database's profiling table, as follows. The SQL filters out all states that take less than 1 millisecond, again for brevity.

For the original query, with node_access joined, we get:

SELECT state, duration, cpu_user+cpu_system AS cpu, 
block_ops_in+block_ops_out AS blocks_ops 
FROM information_schema.profiling 
WHERE query_id = 1 AND
duration > 0.000999;
+----------------------+----------+------+------------+
| state                | duration | cpu  | blocks_ops |
+----------------------+----------+------+------------+
| Copying to tmp table | 1.403615 | 1.68 |        128 | 
+----------------------+----------+------+------------+

For the second query without the node_access join, we get:

SELECT state, duration, cpu_user+cpu_system AS cpu, 
block_ops_in+block_ops_out AS blocks_ops 
FROM information_schema.profiling 
WHERE query_id = 2 AND
duration > 0.000999;
+----------------------+----------+------+------------+
| state                | duration | cpu  | blocks_ops |
+----------------------+----------+------+------------+
| Copying to tmp table | 0.305544 | 0.35 |          8 | 
+----------------------+----------+------+------------+

As you can see, the CPU usage and the block I/O operations are far less when we eliminate the node_access join.

Timing the two scenarios: custom PHP script

If you don't have a MySQL version that supports profiling, you can time the two queries using a simple custom script.

The server is a medium one with normal specs (8 cores, 8GB of RAM, and a separate disk for MySQL).

The PHP script boots Drupal then executes both variations of the query, measuring the time for each.

Note that we added SQL_NO_CACHE to the query to force it to bypass the query cache of MySQL:

<?php
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// We have an array with the query type, and the query SQL
$q = array(
  'before' => "
  SELECT SQL_NO_CACHE DISTINCT(n.nid), n.sticky, n.title, n.created
  FROM node n
  INNER JOIN term_node tn0 ON n.vid = tn0.vid
  INNER JOIN node_access na ON na.nid = n.nid
  WHERE
    (
      na.grant_view >= 1 AND
      (
        (na.gid = 0 AND na.realm = 'all') OR
        (na.gid = 1 AND na.realm = 'job_view') OR
        (na.gid = 0 AND na.realm = 'resume_owner') OR
        (na.gid = 0 AND na.realm = 'og_public')
      )
    )
    AND
    (
      n.status = 1 AND
      tn0.tid IN (1581)
    )
  ORDER BY n.sticky DESC, n.created DESC
  LIMIT 0, 15",

  'after' => "
    SELECT SQL_NO_CACHE DISTINCT(n.nid), n.sticky, n.title, n.created
    FROM node n
    INNER JOIN term_node tn0 ON n.vid = tn0.vid
    WHERE n.status = 1 AND
      tn0.tid IN (1581)
    ORDER BY n.sticky DESC, n.created DESC
    LIMIT 0, 15",
);


$result = array();

// Execute each SQL, measuring it in the process
foreach($q as $type => $query) {
  timer_start($type);
  $result[$type] = db_query($query);
  print "$type:\t" . timer_read($type) . "\n";
}
?>

We save that to a file called query_time.php, and run it like so:

cd /var/www
php /somewhere/query_time.php

You can also run it directly from a browser, if you do not have shell access, or if you don't have PHP CLI installed.

For that, you need to replace this line:

  print "$type:\t" . timer_read($type) . "\n";

With

  print "$type: " . timer_read($type) . "
";

For HTML output in the browser, you also need to have the script in the same directory that has Drupal core's index.php.

Analyzing the results

Whatever way you run the above script, the output will look like this:

before: 1597.86
after:   299.9

As you can see, there is a significant difference when having node_access vs. eliminating it from the query. The difference above is more than 5X.

Running it several times, and getting slightly different results each time, we concluded that at least a 4X improvement is possible with this modification.

This means that eliminating the node_access part from the query was going to improve the site by eliminating slowdowns and outages.

Quick Solution: Hack core!

The quick solution was to patch the taxonomy.module in the function taxonomy_select_nodes() to comment out the two db_rewrite_sql() calls.

However, since the site has some private data, we had to protect it in some ways. It turned out that for this site, the private nodes all belonged to one term, 63, and therefore the following patch will bypass the node_access join only when the path is /taxonomy/term/zzz, and return nothing if the visitor is going to term 63 specifically. For other paths, normal access control will apply and those authorized to see the private content will continue to be able to do so:

-    $sql = db_rewrite_sql($sql);
-    $sql_count = db_rewrite_sql($sql_count);
+    if (in_array(63, $tids)) {
+      // Do not display private nodes...
+      return NULL;
+    }
+    // Skip the node_access stuff reducing execution time for queries
+    // $sql = db_rewrite_sql($sql);
+    // $sql_count = db_rewrite_sql($sql_count);

This improved the server performance significantly, as you can see in the graphs below:

MySQL Slow queries reduced significantly:

The load average is back to a normal number, with no spikes:

CPU usage is also normal, with no spikes:

The drawback of the above solution is that it is a hack to core, which has to be maintained in version control for future core upgrades.

Long Term Solution: Separate site

The long term solution, though, is to separate the private content into its own site, with its own subdomain (e.g. private.example.com) and a separate database. The Bakery module can then serve as a single sign-on solution. Bakery does not synchronize roles across sites though, so that has to be managed manually until the feature request in #547524 is implemented.

Further reading

Incidentally, this is not the first time that we had to bypass node access to resolve performance bottlenecks caused by slow queries. Read How Drupal's node_access table can negatively impact site performance for a similar case.

Aug 09 2010
Aug 09

One great feature that Drupal has is the ability to make modules run certain tasks, often heavy ones, in the background at preset intervals. This can be achieved by a module implementing hook_cron.

Core uses this feature to index new content for the search module, ping module to notify remote sites of new content, fetch new release information from drupal.org, poll other sites for RSS feeds, and more.

Various contributed modules use this for various purposes, such as mailing out newsletters, cleaning up logs, synchronizing content with other servers/sites, and much more ...
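
As a minimal illustration, a module taps into this by implementing hook_cron(). The hypothetical Drupal 6 style sketch below runs a housekeeping query on every cron invocation; the module and table names are made up.

/**
 * Implementation of hook_cron() for a hypothetical module "mymodule".
 * Drupal calls this on every cron run, so any heavy work placed here
 * runs at whatever interval cron itself is triggered.
 */
function mymodule_cron() {
  // Example housekeeping: purge log rows older than 30 days.
  db_query("DELETE FROM {mymodule_log} WHERE created < %d", time() - 30 * 24 * 60 * 60);
}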

Core Cron: All or None

This powerful core feature has some limitations though, such as:

  • All hook cron implementations for all modules are run at the same time, in sequence alphabetically or according to module weight.
  • When cron for one module is stuck, all modules following it will not be executed, and cron will not run again until 1 hour has passed.
  • There is no way to know which module is the one that caused the entire cron to get stuck. Moreover, there is no instrumentation information to know which cron hook takes the most time.

2bits.com has proposed core patches to overcome the lack of instrumentation, by logging the information to the watchdog. The patches are useful only to those who apply them. It is unlikely that they will get into core any time soon.

For a practical example, you can use Job Queue with our own queue mail module to improve end user response time, and avoid timeouts due to sending a lot of emails. This scheme defers sending to when cron is run, and not when a user submits a node or a comment.

This works well, but for core, all cron hooks run at the same time. If you set cron to run every hour, then email sending could be delayed by an hour or even more if job queue cannot send them all in one run. If you make cron run more frequently, e.g. every 15 minutes, then all the heavy hooks such as search indexing and log cleanup will also run every 15 minutes consuming lots of resources.

Enter Elysia Cron ...

With Elysia cron, you can now have the best of both worlds: you can set cron for job_queue to run every minute, and defer other heavy stuff to once a day during off hours, or once an hour. The email is delivered quickly, within minutes, and we don't incur the penalty of long running cron hooks.

Features and Benefits of Elysia Cron

The features that Elysia cron offers are many, the important ones, with a focus on performance, are:

  • You can run different hook_cron implementations for different modules at a different frequency.
  • You are aware what the resource and performance impact of each hook_cron implementation is. This includes the time it took to run it last, the average, and maximum time. This information is very valuable in distributing different hooks across the day, and their frequencies.
  • Set a configurable maximum for cron invocations. Drupal core has a hard coded value of 240 seconds. You can adjust this up or down as per your needs.
  • Handles "stuck" crons better than core. In core, if cron is stuck, it takes one hour for it to automatically recover. In Elysia cron, the other hook invocations continue to run normally.
  • You can set the weight for each module, independent from the weight for the module in the system table. Using this, you can have a different order of execution for modules.
  • You can group modules in "contexts", assigning different run schedules for different contexts, or disable contexts globally.
  • The ability to assign a cron key, or a white list of allowed hosts that can execute cron.
  • Selectively disable cron for one or more modules, but not others, or all cron.
  • Selectively run cron for only one module.
  • Defining a cronapi that developers can use (see the sketch after this list).
  • It requires no patching of core or contributed modules.
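
As a rough sketch of that cronapi, a module can declare its own cron jobs and default schedules along these lines; the module name example, the callback, and the schedule are hypothetical, and the exact array keys may differ between Elysia Cron versions, so check its documentation:

function example_cronapi($op, $job = NULL) {
  return array(
    'example_send_newsletter' => array(
      'description' => 'Send queued newsletter emails',
      'rule' => '*/5 * * * *', // standard crontab syntax: every five minutes
      'callback' => 'example_send_newsletter_worker',
    ),
  );
}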

Examples of Elysia Cron in action

Here is captcha's cron, which has been configured to run only once a day in the early hours of the morning:

Similarly, dblog's cron runs only once a day. There is no need to trigger it every hour or twice an hour.

Search is shown here to be the heaviest of all cron hooks. Still, we run it twice every hour so that the search index stays fresh.

Statistics cleanup is kind of heavy too, so we run it only once a day.

Finally, xmlsitemap is a useful module, yet it is also heavy on a site with lots of nodes. Therefore we run it only once a day.

The above settings are not cast in stone for these modules. They will vary from one site to another depending on the server configuration, available resources, and data set sizes. Even for the same site, it is recommended to monitor regularly and adjust them on an ongoing basis.

Alternatives to Elysia Cron

Elysia Cron is not alone, though. There are other modules with overlapping functionality, such as Super Cron, Cron Plus, and even a Cron API module. Super Cron seems promising, but Elysia does everything we need so far, so evaluating it is low on the list of priorities.

Here is an attempt to compare the various cron modules, but so far it is sparse on information.

A more powerful, but also more complex and heavyweight, solution is to use a tool like the Hudson continuous integration server. Since it runs on Java, it adds dependencies to the usual LAMP-only server and is more demanding on resources. You can read a full article on it here.

Aug 07 2010
Aug 07

With the "Cloud" being in vogue currently, we see a lot of clients asking for cloud solutions, mostly Amazon AWS. Sadly, this is often done without a proper evaluation of whether the cost is reasonable or the technology is suitable for their specific needs.

Amazon AWS provides some unique and compelling features. Among those are: instant provisioning of virtual servers, billing for used resources only, ability to provision more instances on demand, a wide variety of instance types, and much more.

We certainly like Amazon AWS for development and testing work, and for specific use cases such as seasonal sites.

For most high traffic sites though, Amazon AWS can be overly expensive, and not performant enough.

Before you decide to use Amazon, spend some time studying the various Amazon instance types, and the pricing that will be incurred. You may be surprised!

Here is a case study of a client that was on Amazon until recently, and we moved them to a more custom setup, with great results.

The client is a specialized news site, and gets linked to often from other high traffic sites such as Yahoo's front page, and the Drudge Report.

The site was originally hosted at another big name hosting company, but unfortunately they went down several times due to data center power issues.

After moving to Amazon AWS, with the setup below, the site was a bit sluggish, and when the traffic spikes described above happened, the setup could not cope with the increased traffic load ...

Amazon AWS Setup

The setup relied on Amazon's Elastic Load Balancer (ELB) front ending the site.

Behind the load balancer, there were a total of 4 instances, varying in type.

First, there were two web servers, each one of them m1.large.

Another m1.small instance acted as the NFS server for both web servers.

Finally another m1.large instance housed the MySQL database.

To summarize:

2 x web servers (each m1.large)
1 x MySQL database server (m1.large)
1 x NFS server (m1.small)

The cost was high compared to the features: EC2 computing cost alone for the instances was around $920 per month.

Additionally, there were 331 million I/O requests costing $36 per month. ELB was an additional $21 per month.

Storage and bandwidth brought the total to $990 per month.

Setup Drawbacks

The drawbacks of such a setup are many:

First, there is complexity: there are many components here, and each required patching with security updates, monitoring of disk space and performance, and other administration tasks.

Second, it was not fault tolerant. The only redundant part was the web tier, with two servers present. If the database server or the NFS server crashed, the entire setup would stop serving pages.

Third, the cost was too high: a single well-configured server could do the job at almost half the cost.

Fourth, Amazon's ELB load balancer forces the "www." prefix for the site, which is not a big deal for most sites, but some want to be known without that prefix.

Fifth, the performance was not up to par. The site was sluggish most of the time.

Finally, the setup was not able to handle traffic spikes adequately.

The Solution

After doing a full Drupal site performance assessment, 2bits.com recommended and implemented a new setup consisting of a single medium sized dedicated server for $536 per month.

The server has quad-core Xeon E5620 processors at 2.4GHz, 8GB of RAM, and 4 disks (each pair forming a mirror).

We then did a full server installation, configuring, tuning, and optimizing the entire LAMP stack from scratch on our recommended Ubuntu 8.04 LTS Server Edition. Instead of using Varnish, we used only memcached.

The setup is mostly like what we did for another much higher traffic site. You can read the details at: 2.8 million page views per day: 70 million per month: one server!

After consulting with the client, we recommended going a step further: using the remaining budget to implement a near fault-tolerant setup. The second server is in another data center, and its monthly cost is $464.

The Results

The result is that we now have a satisfied client, happy with a new setup that is free of the headaches they used to face when traffic spikes happened.

Here are the graphs showing a large spike in July 2010.

Two traffic spikes in Google Analytics. The traffic shot up from the normal 81,000 to 83,000 page views per day, to 244,000. The spike on July 12th was 179,000 page views.

Apache accesses per second, by day

Apache volume per second, by day

CPU utilization per day, no noticeable spike

Memcache utilization, showing that it took most of the load

Lessons Learned

There are lots of good lessons to be learned from this experience:

A performance assessment is valuable before deciding what hosting to use. It gives you a baseline and reveals any bottlenecks that your site may have.

In most cases, we advocate simplicity over complexity. Start simple and then go add complexity when and where needed.

Try to make the most of vertical scaling, before you go horizontal.

Amazon AWS is great for development and specific use cases. It may not be your most cost effective option for high traffic sites though.

Memcache, used properly, will get you far on your journey to scalability and performance.

Further reading

There are lots of links on the web about Amazon AWS and hosting LAMP on it.

Here are a select few recent Drupal-specific presentations and podcasts:

Jun 30 2010
Jun 30

One of the most used functions in Drupal's database abstraction layer is db_query, which allows passing an SQL string and corresponding arguments to send a query to the database. I'll give you a quick overview of how db_query works before showing you how to drupalize a query such as:

SELECT field1, field2 FROM table_name WHERE field1 IN (value1,value2,...)

Drupal database abstraction layer and db_query

Like many other content management systems and web development frameworks, Drupal implements a database abstraction layer that you, the developer, can use to write code that works with different database servers. For this to work, you need to follow certain rules, and that starts with the way you pass queries and arguments to the db_query function.

Arguments in a db_query call need to use sprintf-style placeholders, such as %d for integers and %s for strings. This allows Drupal to prevent SQL injection attacks and perform other security checks.

Let's review a simple example:

$sql = "SELECT v.vid FROM {vocabulary} v WHERE v.name = '%s'";
$vid = db_result(db_query($sql, $vocabulary_name));

This code gets a vocabulary's id by searching for its name, which is passed as a string. Notice the curly braces around the table name: they need to be there for Drupal to provide table prefixing. This is important, so get used to always writing queries this way.

Many Drupal beginners may opt for what they consider a simpler approach:

$sql = "SELECT v.vid FROM vocabulary v WHERE v.name = '" . $vocabulary_name . "'";
$vid = db_result(db_query($sql));

Even if it looks simpler, you're now bypassing Drupal's security checks, and your query breaks up into a series of concatenated strings, which hurts readability and gets more confusing with more complex queries.

Passing arguments to db_query as an array

Arguments can be passed one by one or contained in an array. Let's slightly modify our example query:

$sql = "SELECT v.vid FROM {vocabulary} v WHERE v.name = '%s' AND v.vid = '%d'";

Now we're being more specific, looking up the vocabulary by both v.name and v.vid (pay attention to the placeholders). We could get our result by passing each argument to db_query like this:

$vid = db_result(db_query($sql, $vocabulary_name, $vid));

or we could build an array with both arguments like this:

$args = array($vocabulary_name, $vid);
$vid = db_result(db_query($sql, $args));

I prefer the array approach for queries where I have to pass more than a few arguments and that's often the case when we use the SQL IN operator.

The SQL IN operator and Drupal

Every good Drupal developer has to be highly skilled in writing SQL, and that means there will be times when you need the SQL IN operator, which compares a field to a list of values. Let's say you want to get all nodes of type page or blog; you're looking for a query like this:

$sql = "SELECT n.nid, n.title FROM {node} n WHERE n.type IN ('page', 'blog') AND n.status = 1";

If you've tried this before, you may have experienced quote-escaping hell and opted for the non-Drupal way of concatenating arguments into the query; at least I did, until I read about db_placeholders (http://api.drupal.org/api/function/db_placeholders/6).

This is how I build my Drupal queries with the SQL IN operator now:

$types = array('blog', 'embedded_video', 'list', 'node_gallery_gallery');
$args = array();
$args[] = $tid;
$args = array_merge($args, $types);
$args[] = $status;
$args[] = $limit;
$sql = "SELECT n.nid, n.title, n.type, c.comment_count FROM {node} n INNER JOIN {term_node} tn
ON n.nid = tn.nid INNER JOIN {term_data} td ON tn.tid = td.tid LEFT JOIN
{node_comment_statistics} c ON n.nid = c.nid WHERE td.tid = %d AND
n.type IN (" . db_placeholders($types, 'varchar') . ")
AND n.status = %d ORDER BY n.created DESC LIMIT %d";
$result = db_query($sql, $args);

This is a bigger query with more arguments and I'm not only looking for nodes of certain types (listed in the $types array) but also a specific term in the taxonomy ($tid) and a published status ($status). I'm also adding a LIMIT clause at the end ($limit).

db_placeholders takes care of adding the correct number and type of placeholders based on the contents of the $types array. In this case it will add four string placeholders, because I passed varchar as the second argument and there are four elements in the array.
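
To make that concrete, here is a rough illustration of the placeholder string it builds; in Drupal 6 the varchar placeholder includes the surrounding quotes, which is why the IN (...) list in the query above is not quoted by hand (double-check the exact output on your Drupal version):

$types = array('blog', 'embedded_video', 'list', 'node_gallery_gallery');
// db_placeholders($types, 'varchar') returns something like:
// '%s', '%s', '%s', '%s'
print db_placeholders($types, 'varchar');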

The $args array is built in the order in which the arguments appear in the query: notice how I add $tid first, then use array_merge to append $types, and finally add $status and $limit at the end.

So Drupalish, isn't it?

Feb 11 2009
Feb 11

After a few Drupal 6 projects where I had to create themes from scratch, including my recently released Woodpig theme for Ventanazul, I've learned a lot and decided to gather some tips that I'm sure will help you, my fellow Drupalist, when turning your next design into a functional Drupal-managed site. Sounds good? Let's dive into the powerful Drupal 6 theme API.

Get the documentation right

There are a lot of changes in the Drupal 6 theming API and even being quite familiar with Drupal 5 theming I had to invest a considerable amount of time to get all the new ideas; hence, the theme guide for Drupal 6 should be your first stop.

But quickly reading the theme guide for Drupal 6 once may not be enough; topics like preprocess functions and registering theme hooks still confused me after my first pass at the documentation, so I had to re-read many parts. My advice is to read the guide at least twice: the first time, just try to get the general idea; the second time, start playing with a test install of Drupal 6 to see how the new theme API works.

You always need to register theme functions and templates

Drupal 6 calls these theme hooks. A theme hook is something that can be themed, usually output from your modules that may need alternative HTML or CSS. In Drupal 5 you could write a module and add as many theme functions as you wanted; these functions could then be overridden by any theme or converted into a tpl.php file using the _phptemplate_callback function.

Drupal 6 needs to know which theme hooks are available in your theme beforehand; this makes it more efficient and avoids discovering theme hooks at run time. To register theme hooks, you implement the new hook_theme, where you tell Drupal which theme functions or templates your module provides and which variables they can use.

Here's some code from the custom module I use in Ventanazul:

function ventanazul_theme() {
  return array(
    'ventanazul_flickr' => array(
      'template' => 'ventanazul-flickr',
      'arguments' => array('flickr_user' => NULL, 'attributes' => array()),
    ),
    'ventanazul_google_search' => array(
      'arguments' => array('publisher_id' => NULL, 'attributes' => array()),
    ),
  );
}

The first hook, ventanazul_flickr, will use a file called ventanazul-flickr.tpl.php as a template that will have a couple of variables available: $flickr_user and $attributes. This is what I use for showing some pictures from my Flickr account.

The second hook, ventanazul_google_search, does not provide a template, which means it uses a theme function with the following signature: theme_ventanazul_google_search($publisher_id, $attributes). Notice how, in this case, the variables are passed directly to the theme function.
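
As a minimal sketch of what that theme function could look like (the markup is purely illustrative, and drupal_attributes() and check_plain() are used just to keep the output flexible and safe):

function theme_ventanazul_google_search($publisher_id, $attributes = array()) {
  // Wrap an illustrative search block in a div carrying the passed attributes.
  $output  = '<div' . drupal_attributes($attributes) . '>';
  $output .= '<p>Google Custom Search for publisher ' . check_plain($publisher_id) . '</p>';
  $output .= '</div>';
  return $output;
}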

Don't forget to clear the theme registry

The theme registry is where Drupal 6 stores what it knows about the available theme hooks. Every time you modify theme hooks you need to clear it, or your changes won't be recognized. You can clear the theme registry in any of these ways:

  • Visit Administer > Site configuration > Performance and click on Clear cached data. This is my preferred way and I always keep a tab open in Firefox just for this page.
  • Use the devel block from the devel module and click the Empty cache link.
  • Call the function drupal_rebuild_theme_registry.

Define regions for blocks in .info files

In Drupal 5 you added your theme's regions in template.php; now you just need to add a few lines to your theme's .info file. This code is from a theme called Winds of Change that I've just finished coding for a client:

name = Winds of Change
description = Site of Winds of Change Group
version = VERSION
core = 6.x
engine = phptemplate
stylesheets[all][] = style.css
stylesheets[print][] = print.css
regions[left] = left sidebar
regions[right] = right sidebar
regions[content] = content
regions[header] = header
regions[footer] = footer
regions[social] = social

Notice the regions entries near the end; this is plain text, but the syntax is similar to PHP arrays. These lines make those regions available to your theme and use the names in brackets as variable names; regions[left] will create a $left variable for your page templates.
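
For example, page.tpl.php could then print one of those regions, such as the social region defined above, with something like this (the wrapper markup is just an illustration):

<?php if ($social): ?>
  <div id="social-region">
    <?php print $social; ?>
  </div>
<?php endif; ?>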

Changes in your theme's .info files require clearing the theme registry as explained above. Actually, I'm a little paranoid, and every time some of my theme changes are not reflected on the site I clear the theme registry.

Cache and theming for authenticated and anonymous users

This is a minor but very important detail. Drupal caches its output for anonymous users and may optimize your CSS and JavaScript files. Review your settings in Administer > Site configuration > Performance to disable all caching and optimization while you are developing a new theme, and clear the theme registry to see your changes as an anonymous user.

Do your best with the available theme hooks

There are many times when you are tempted to override a certain theme function or template in Drupal 6. This is what I tried to do when theming navigation links for Ventanazul, but after thinking a little more I realized I could get the same results with some smart CSS and a few additional variables. For example, for the bottom navigation, I just used:

print theme('links', array_merge($primary_links, $secondary_links), array('id' => 'alternate-menu'));

Woodpig footer

theme_links is provided by Drupal core. Notice how I merge primary and secondary links and apply an id alternate-menu to be styled from CSS inside an #alternate-menu rule.

Before overriding a theme function or template, check the source and see if you can pass some variables to affect its behavior.

Use templates from modules

Drupal 6 encourages module developers to provide templates; that means you should check each module's directory to see what .tpl.php files are available and copy the ones you need to modify into your theme's directory. For the Woodpig theme I copied user-profile.tpl.php from modules/user to make a few changes to user account pages. Remember to clear the theme registry after adding a tpl.php file to your theme.

You can create theme hooks for modules you did not write

When I wanted to theme the comment form, the first thing I did was look at comment.module. There I found a call like this:

drupal_get_form('comment_form', $edit, $title)

OK, so there was a comment_form form that I could theme according to the Drupal Form API. I started looking for theme_comment_form, but there was none.

How do you override a theme hook that does not exist? Just as with your own modules, you need to know the code involved and, this is important, you can include a hook_theme implementation in your theme's template.php file. This is what I did for Woodpig:

function woodpig_theme() {
  return array(
    'comment_form' => array(
      'arguments' => array('form' => array()),
    ),
  );
}

The function name starts with the theme's name, and the theme hook takes the form as its argument because that's what drupal_get_form passes to it. Then I can write my own woodpig_comment_form($form) function.
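
A minimal sketch of such an override could look like this; the adjustment to the comment body textarea is purely illustrative, and the key point is rendering the form at the end:

function woodpig_comment_form($form) {
  // An illustrative tweak: a shorter textarea for the comment body.
  $form['comment']['#rows'] = 5;
  // Render the (possibly modified) form as Drupal normally would.
  return drupal_render($form);
}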

Behold the power of preprocess functions

Preprocess functions allow you to modify the variables available to your templates. You can also add new variables. This is pretty important and I use it a lot for things like customizing breadcrumbs, passing the theme's path to my templates and separating the comment form from the content of a node. I suggest you play a lot with preprocess functions and understand their power. I'll have a few more articles about them soon.
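
As a quick sketch, a page preprocess function in template.php might look like this (woodpig stands in for your theme name, and the variables added are only examples):

function woodpig_preprocess_page(&$variables) {
  // Expose the theme's path so templates can reference images and assets.
  $variables['theme_path'] = base_path() . path_to_theme();

  // Example tweak: trim the breadcrumb to its first three items.
  $breadcrumb = drupal_get_breadcrumb();
  $variables['breadcrumb'] = theme('breadcrumb', array_slice($breadcrumb, 0, 3));
}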

Far from complete

There are still many lessons to learn from the Drupal 6 theme API and I find more and more in every new project. Even if this wasn't intended to be a complete guide I'm sure these ideas will help you in your next theming adventure.

Let me know what you think in the comments and if you have more questions or suggestions for coding with Drupal pay a visit to the Drupal forums in Ventanazul. Happy coding!

Feb 09 2009
Feb 09

That's a freaking lot of time, really, but as anybody who does web development for a living knows, it's sometimes difficult to find time for your own projects when you are working on client gigs. Besides, I wanted to theme my site according to a very specific set of rules that kept changing over time; a common problem, I know.

First, I decided to move Ventanazul from a simple blog format to a more niche community site, a site for people working in professional web development. That meant I had to enable account registration, forget about vBulletin, and rethink the quality of new content and the profile of users to approve (users have to be manually approved and all comments are moderated).

Defining the goals and information architecture of the new Ventanazul took a few months while I gathered ideas from a lot of sources, like sites I enjoyed and projects I was working on. That led me, after many hours with Gimp, to the final mockups of the new design. In the meantime Drupal 6 was out and I had to invest time on learning a few new tricks.

It was perfect timing, as I started working on projects for a couple of clients that required moving to Drupal 6 and using my upgraded theming skills. For Ventanazul's redesign I used many of Drupal's nice new features; I found preprocess functions very helpful for separating comments and their form from node content, which usually come lumped together in the $content variable of the page.tpl.php template.

The theme can be considered a 0.9 version, and I know there may be a few small bugs around that I'll fix along the road, but I wanted to release and start getting feedback as soon as possible; well, two years is not really soon, but you know what I mean. I have a small set of additional features cooking for a future release, and I may come up with some other ideas. As should be the norm in 2009, the markup was built with modern, standards-respecting web browsers in mind, so I didn't waste time on Internet Explorer 6 bugs or horrible hacks.

Ventanazul - Chancho de madera ("wooden pig")

And why Woodpig?

The pig dude in the picture was a gift from Yesenia, my wife, during the days I was designing the new theme in Gimp. It's a small pig made of wood with flexible arms and legs that sits on my desk every day. I put him in different positions every morning and use him as a model for shooting with the SLR. I had a few shots of him lying around, so I used one for the home mockup, and after a while I realized the new site would be as cool-looking and flexible as my porcine friend, so I decided the new Ventanazul theme would be called Woodpig.

That's it. Enjoy the site, comment, and share your web development knowledge and wit with the warm community by registering today.

Oh, and it's absolutely forbidden to make jokes about the pig, he's got a very bad temper.

Sep 25 2008
Sep 25

There's a lot to read and absorb on the new Drupal 6 but the good news is that if you've spent a while hacking version 5 you will need just a few days to get up to date.

One of my favorite changes is in theming. Now every module can provide its own .tpl.php files. To override them, you just need to copy them to your theme directory and start hacking around.

Important: there's a new theme registry that has to be updated when you add or remove templates; to do so, just click Clear cached data on /admin/settings/performance.

And even though using templates is the recommended way to create or modify themes in Drupal 6, you can still use good old theme_ functions.

Details can be found, as usual, on the handbook, theming Drupal 6, and there's an excellent summary presented as a flow chart (PDF, 60 Kb).

Drupal 6 theme flow

Sep 16 2008
Sep 16

Mark Boulton is working on the redesign of Drupal.org, which is great news. I've been following Mark's work for many years and I really enjoy his writing; his series on grids and typography are fantastic pieces.

But what's even better is that Mark and his team have decided to open their design process, a very brave and risky move, and they've called it design by community.

I don't know if this will work or not but in any case I'm sure we will all learn a lot from it.
