Feb 26 2018

A recent performance assessment consulting project for a client showed that the main page that logged in users browse on their site was slow. Tuning the server for memory and disk throughput helped somewhat, but did not fully eliminate the issue.

Looking at the page, it was generated by a view, and the total page load time was around 2.75 seconds.

The main query was not efficient, with lots of left joins, and lots of filtering criteria:

SELECT node.nid AS nid,
... AS ...
... AS ...
'node' AS field_data_field_aaa_node_entity_type,
'node' AS field_data_field_bbb_node_entity_type,
'node' AS field_data_field_ccc_node_entity_type,
... AS ...
FROM node
INNER JOIN ... ON node.uid = ...
LEFT JOIN ... ON ... = ...  AND ... = ...
LEFT JOIN ... ON ... = ... AND (... = '12'
OR ... = '11'
OR ... = '15'
OR ... = '24')
WHERE (( (node.status = '1')
AND (node.type IN ('something'))
AND (... <> '0')
AND ((... <> '1') )
AND ((... = '4'))
AND (... IS NULL ) ))
ORDER  BY  ... DESC
LIMIT  51   OFFSET 0

That caused the first pass to sift through over 24,000 rows, while using both file sort and temporary tables. Both operations are disk intensive.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ...
   partitions: NULL
         type: range
possible_keys: PRIMARY,...
          key: rid
      key_len: 8
          ref: NULL
         rows: 24039
     filtered: 100.00
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: ...
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,status
          key: PRIMARY
      key_len: 4
          ref: test43....
         rows: 1
     filtered: 50.00
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
   partitions: NULL
         type: ref
possible_keys: uid,status,type,node_status_type
          key: uid
      key_len: 4
          ref: test43....
         rows: 5
     filtered: 12.18
        Extra: Using index condition; Using where
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: ...
   partitions: NULL
         type: ref
possible_keys: PRIMARY,...
          key: PRIMARY
      key_len: 4
          ref: test43....
         rows: 2
     filtered: 54.50
        Extra: Using where; Not exists; Using index

But here is the puzzle: this query took only 250 to 450 milliseconds.

Where did the rest of the 2,750 milliseconds go?

To find out, we used xhprof, the profiler for PHP.

In the screenshot below, you can see that the total page processing time (Total Inc. Wall Time, top right) is 2,732 milliseconds.

Out of that, 85% is in database queries (252 total queries, totaling 2,326 milliseconds, Excl.Wall Time).

What are these queries?

They are queries to other tables in the database to retrieve fields for each row.

For example, if you have a product view, with certain criteria, the result still has to get the product name, its price, its image, ...etc.

All these queries add up, especially when you are loading 50 rows. The time needed to retrieve each field and render it for output is multiplied by the number of rows retrieved.

So, how do you mitigate that overhead? There are several ways:

  • Reduce the number of rows returned by the view. For example, instead of 50, make it 25. That would halve the number of queries (and processing) needed to produce the page.
  • If the query is the same for all logged in users, then enable views caching (under Advanced when you edit the view), and enable both Query Result and Rendered Output caching. Use time based caching, for as long as is practical for your site (e.g. if you add products or change prices only once a day, then you can cache the results for 20 hours or more).
  • Use a fast caching layer, such as the memcache module, instead of the default database caching, which will be slow for a site with many logged in users (a minimal settings.php sketch follows this list).
  • Use the Views Litepager module to eliminate the COUNT queries that the regular pager performs.
  • Consider alternative approaches to views, such as Apache Solr faceted search, which has much better performance than MySQL based solutions, because it builds an efficient index.
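
For the memcache option above, a minimal settings.php sketch looks like the following; the module path and the key prefix are illustrative and need to match your own setup:

$conf['cache_backends'][]    = './sites/all/modules/contrib/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
$conf['memcache_servers']    = array('127.0.0.1:11211' => 'default');
$conf['memcache_key_prefix'] = 'site1';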

By implementing all the above for the client in question, except the last one, we were able to bring the view page from 2,730 milliseconds, down to 700-800 milliseconds of response time.

Scalability was much better as well: the same server could now handle more logged in users.

Feb 20 2018

On many occasions, we see web site performance suffering due to misconfiguration or oversight of system resources. Here is an example where RAM and disk I/O severely impacted web site performance, and how we fixed the problems.

A recent project for a client who had bad site performance uncovered issues within the application itself, i.e. how the Drupal site was put together. However, overcoming those issues was not enough to achieve the required scalability with several hundred logged in users on the site at the same time.

First, regarding memory, the site was configured with too many PHP-FPM processes, which left no room in memory for the filesystem buffers and cache that help a lot with disk I/O load.

Here is a partial display from when we were monitoring the server before we fixed it:

As you can see, the buffers + cache + free memory all amount to less than 1 GB of total RAM, while the used RAM is over 7GB.

used   buffers  cache  free
7112M  8892k    746M   119M
7087M  9204k    738M   151M
7081M  9256k    770M   125M
7076M  4436k    768M   136M
7087M  4556k    760M   133M

We calculated how much RAM is really needed by watching the main components on the server.

In this case the calculation was:

Memcache + MySQL + (Apache2 X number of instances) + (PHP-FPM X number of instances)

We then adjusted the number of PHP-FPM processes down to a reasonable figure, so that total application RAM would be no more than 70% of the server's total.
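
As an illustration only (the numbers below are rounded assumptions, not measurements from this client), the arithmetic for an 8 GB server could look like this:

Memcache                               256 MB
MySQL (buffers + connections)        2,000 MB
Apache2  (100 threads)                 400 MB
PHP-FPM  (40 processes x 60 MB)      2,400 MB
---------------------------------------------
Application total                  ~ 5,000 MB  (about 62% of 8 GB)

The remaining ~3 GB is then available for the filesystem buffers and cache.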

The result is as follows. As you can see, used memory is now 1.8GB instead of 7GB. Free memory will slowly be used by cache and buffers making I/O operations much faster.

used   buffers  cache  free
1858M  50.9M    1793M  4283M
1880M  51.2M    1795M  4258M
1840M  52.1M    1815M  4278M
1813M  52.4M    1815M  4304M

Another issue with the server, partially caused by the above lack of cache and buffers, but also by forgotten settings, was a severe bottleneck in disk I/O performance. The disk was so tied up that everything had to wait. I/O wait was 30%, as seen in top and htop. This is very high; it should usually be no more than 1 or 2%.

We also observed excessive disk reads and writes, as follows:

disk read  disk write  i/o read  i/o write
5199k      1269k       196       59.9
1731k      1045k       80        50.7
7013k      1106k       286       55.2
23M        1168k       607       58.4
9121k      1369k       358       59.7

Upon investigating, we found that the rules_debug_log setting was on. The site had 130 enabled rules and the syslog module was enabled. We found a file under /var/log/ with over a GB per day and growing. This writing of rules debugging for every page load tied up the disk when a few hundred users were on the site.

After disabling the rules debug log setting, I/O wait went down to 1.3%! A significant improvement.
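
For reference, this can be done from the command line with Drush; the variable name below is the one the Rules module uses for its debug log, but verify it on your own site before relying on it:

drush vset rules_debug_log 0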

Here are the disk I/O figures after the fix:

disk read  disk write  i/o read  i/o write
192k       429k        10.1      27.7
292k       334k        16.0      26.3
2336k      429k        83.6      30.7
85k        742k        4.53      30.8
Now, the site has response times of 1 second or less, instead of 3-4 seconds.

Feb 12 2018

For all of the sites we consult on, and manage, we use the excellent memcache module, which replaces core's database caching. Database caching works for low traffic, simple sites, but cannot scale for heavy traffic or complex sites.

Recently we were asked to consult on the slow performance of a site with an all authenticated audience. The site is indeed complex, with over 235 enabled modules, 130 enabled rules, and 110 views.

The site was moved from a dedicated server to an Amazon AWS cluster, with the site on one EC2 instance, the database on an RDS instance, and memcache on a third instance. This move was in the hope that Amazon's AWS would improve performance.

However, to their dismay, performance after the move went from bad (~ 5 to 7 seconds to load a page) to worse (~ 15 seconds).

We recommended to the client that we perform a Drupal performance assessment on the site. We got a copy of the site, recreated it in our lab, and proceeded with the investigation.

After some investigation, we found that having memcached on the same server that runs PHP made a significant performance improvement. This was logical: on a site with all users logged in, there are plenty of calls to cache_get() and cache_set(), and each of these calls has to do a round trip over the network to the other server and back, even if it returns nothing. The same goes for database queries.

Instead of 29.0, 15.8, 15.9, and 15.5 seconds for different pages on the live site, the page loads in our lab on a single medium server were: 3.6, 5.5, 1.4, 1.5 and 0.6 seconds.

However, this victory was short lived. Once we put load on the web site, other bottlenecks were encountered.

We started with 200 concurrent logged in users on our lab server, and kept investigating and tweaking, running a performance test after each tweak to assess its impact.

The initial figures were: an average response time of 13.93 seconds, and only 6,200 page load attempts for 200 users (with 436 time outs).

So, what did we find? We found that the culprit was memcache! Yes, the very thing that helps a site be scalable was severely impeding scalability!

Why is this so? Because of the way it was configured for locking and stampede protection.

The settings.php for the site had these two lines:

$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';
$conf['memcache_stampede_protection'] = TRUE;

Look at the memcache.inc, lines 180 to 201, in the valid() function:

if (!$cache) {
  if (variable_get('memcache_stampede_protection', FALSE) ... ) {
    static $lock_count = 0;
    $lock_count++;
    if ($lock_count <= variable_get('memcache_stampede_wait_limit', 3)) {
      lock_wait(..., variable_get('memcache_stampede_wait_time', 5));
      $cache = ...;
    }
  }
}

The above is for version 7.x of the module, and the same logic is in the Drupal 8.x branch as well.

If memcache_stampede_protection is set to TRUE, then there will be up to three attempts, with a 5 second delay each. The total then can be as high as 15 seconds when the site is busy, which is exactly what we were seeing. Most of the PHP processes will be waiting for the lock, and no free PHP processes will be available to serve requests from other site visitors.

One possible solution is to lower the number of attempts to 2 (memcache_stampede_wait_limit = 2), and the wait time for each attempt to 1 second (memcache_stampede_wait_time = 1), but that is still 2 seconds of wait!
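
These overrides go in settings.php, next to the two lines quoted earlier:

$conf['memcache_stampede_wait_limit'] = 2;
$conf['memcache_stampede_wait_time']  = 1;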

We did exactly that, and re-ran our tests.

The figures were much better: for 200 concurrent logged in users, the average response time was 2.89 seconds, and a total of 10,042 page loads, with 100% success (i.e. no time outs).

But a response time of ~ 3 seconds is still slow, and there is still the possibility of a pile up condition when all PHP processes are waiting.

So, we decided that the best course of action is not to use memcache's locking at all, nor its stampede protection, and hence deleted the two lines from settings.php:

//$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';
//$conf['memcache_stampede_protection'] = TRUE;

The results were much better: for 200 concurrent logged in users, the average response time was 1.09 seconds, and a total of 11,196 pages with 100% success rate (no timeouts).

At this point, the server's CPU utilization was 45-55%, meaning that it can handle more users.

But wait! We forgot something: the last test was run with the xhprof profiler left enabled by mistake from profiling the web site. That uses up lots of CPU time, and causes heavy writes to the disk as well.

So we disabled xhprof and ran another test. The results were fantastic: for 200 concurrent logged in users, the average response time was just 0.20 seconds, and a total of 11,892 pages with 100% success rate (no timeouts).

Eureka!

Note that for the above tests, we disabled all the rules, disabled a couple of modules that have slow queries, and commented out the history table update query in core's node.module:node_tag_new(). So, these figures are idealized somewhat.

Also, this is a server that is not particularly new (made in 2013), and uses regular spinning disks (not SSDs).

For now, the main bottleneck has been uncovered, and overcome. The site is now only limited by other factors, such as available CPU, speed of its disks, complexity of modules, rules and views ...etc.

So, check your settings.php to see if you have memcache_stampede_protection enabled, and disable it if it is.
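
A quick way to check from the shell, assuming a standard Drupal docroot layout:

grep -rn "memcache_stampede_protection" sites/*/settings*.php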

May 23 2017

Acquia has announced the end of life for Mollom, the comment spam filtering service.

Mollom was created by Dries Buytaert and Benjamin Schrauwen, and launched to a few beta testers (including myself) in 2007. Mollom was acquired by Acquia in 2012.

The service worked generally well, with the occasional spam comment getting through. The stated reason for stopping the service is that spammers have gotten more sophisticated, and that perhaps means that Mollom needs to try harder to keep up with the ever changing tactics. Much like computer viruses and malware, spam (email or comments) is an arms race scenario.

The recommended alternative by Acquia is a combination of reCAPTCHA and Honeypot.

But there is a problem with this combination: reCAPTCHA, like all modules that depend on the CAPTCHA module, disables the page cache for any page containing a form that has CAPTCHA enabled.

This is due to this piece of code in captcha.module:

// Prevent caching of the page with CAPTCHA elements.
// This needs to be done even if the CAPTCHA will be ommitted later:
// other untrusted users should not get a cached page when
// the current untrusted user can skip the current CAPTCHA.
drupal_page_is_cacheable(FALSE);

Another alternative that we have been using, and that does not disable the page cache, is the Antibot module.

To install the antibot module, you can use your git repository, or the following drush commands:

drush dis mollom
drush dl antibot
drush en antibot

Visit the configuration page for antibot if you want to add more forms that use the module, or disable it for other forms. The default settings work for comments, user registrations, and user logins.

Because of the above mentioned arms race situation, expect spammers to come up with circumvention techniques at some point in the future, and there will be a need to use other measures, be they in antibot, or other alternatives.

Feb 07 2017

Secure Sockets Layer (SSL) is the protocol that allows web sites to serve traffic over HTTPS. This provides end to end encryption between the two end points (the browser and the web server). The benefit of using HTTPS is that traffic between the two end points cannot be deciphered by anyone snooping on the connection. This reduces the odds of exposing sensitive information such as passwords, or of getting the web site hacked by malicious parties. Google has also indicated that sites serving content exclusively in HTTPS will get a small bump in Page Rank.

Historically, SSL certificate issuers have served a secondary purpose: identity verification. This is when the issuing authority vouches that a host or a domain is indeed owned by the entity that requests the SSL certificate for it. This is traditionally done by submitting paperwork including government issued documentation, incorporation certificates, ...etc.

Historically, SSL certificates were costly. However, with the introduction of the Let's Encrypt initiative, functional SSL certificates are now free, and anyone who wants to use them can do so, minus the identity verification part, at least for now.

Implementing HTTPS with Drupal can be straightforward with low traffic web sites. The SSL certificate is installed in the web server, and that is about it. With larger web sites that handle a lot of traffic, a caching layer is almost always present. This caching layer is often Varnish. Varnish does not handle SSL traffic, and just passes all HTTPS traffic straight to Drupal, which means a lot of CPU and I/O load.

This article will explain how to avoid this drawback, and how to have it all: caching in Varnish, plus serving all the site using HTTPS.

The idea is quite simple in principle: terminate SSL before Varnish, which will never know that the content is encrypted upstream. Then pass the traffic from the encryptor/decryptor to Varnish on port 81. From there, Varnish will pass it to Apache on port 8080.
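
Schematically, the request flow is (the SSL terminator is either Pound or Nginx, as described below):

Browser --HTTPS (443)--> Pound/Nginx --HTTP (81)--> Varnish --HTTP (8080)--> Apache/Drupal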

We assume you are deploying all this on Ubuntu 16.04 LTS, which uses Varnish 4.0, although the same can be applied to Ubuntu 14.04 LTS with Varnish 3.0.

Note that we use either one of two possible SSL termination daemons: Pound and Nginx. Each is better in certain cases, but for the large part, they are interchangeable.

One secondary purpose for this article is documenting how to create SSL bundles for intermediate certificate authorities, and to generate a combined certificate / private key. We document this because of the sparse online information on this very topic.

Install Pound

aptitude install pound

Preparing the SSL certificates for Pound

Pound does not allow the private key to be in a separate file or directory from the certificate itself. It has to be included with the main certificate, and with intermediate certificate authorities (if there are any).

We create a directory for the certificates:

mkdir /etc/pound/certs

cd /etc/pound/certs

We then create a bundle for the intermediate certificate authority. For example, if we are using NameCheap for domain registration, they use COMODO for certificates, and we need to do the following. The order is important.

cat COMODORSADomainValidationSecureServerCA.crt \
  COMODORSAAddTrustCA.crt \
  AddTrustExternalCARoot.crt >> bundle.crt

Then, as we said earlier, we need to create a host certificate that includes the private key.

cat example_com.key example_com.crt > host.pem

And we make sure the host certificate (which contains the private key as well) and the bundle, are readable only to root.

chmod 600 bundle.crt host.pem

Configure Pound

We then edit /etc/pound/pound.cfg

# We have to increase this from the default 128, since it is not enough
# for medium sized sites, where lots of connections are coming in
Threads 3000

# Listener for unencrypted HTTP traffic
ListenHTTP
  Address 0.0.0.0
  Port    80
 
  # If you have other hosts add them here
  Service
    HeadRequire "Host: admin.example.com"
    Backend
      Address 127.0.0.1
      Port 81
    End
  End
 
  # Redirect http to https
  Service
    HeadRequire "Host: example.com"
    Redirect "https://example.com"
  End
 
  # Redirect from www to domain, also https
  Service
    HeadRequire "Host: www.example.com"
    Redirect "https://example.com"
  End
End

# Listener for encrypted HTTP traffic
ListenHTTPS
  Address 0.0.0.0
  Port    443
  # Add headers that Varnish will pass to Drupal, and Drupal will use to switch to HTTPS
  HeadRemove      "X-Forwarded-Proto"
  AddHeader       "X-Forwarded-Proto: https"
 
  # The SSL certificate, and the bundle containing intermediate certificates
  Cert      "/etc/pound/certs/host.pem"
  CAList    "/etc/pound/certs/bundle.crt"
 
  # Send all requests to Varnish
  Service
    HeadRequire "Host: example.com"
    Backend
      Address 127.0.0.1
      Port 81
    End
  End
 
  # Redirect www to the domain
  Service
    HeadRequire "Host: www.example.com.*"
    Redirect "https://example.com"
  End
End

Depending on the amount of concurrent traffic that your site gets, you may need to increase the number of open files for Pound. You may also want to increase the backend timeout and the browser timeout values.

To do this, edit the file /etc/default/pound, and add the following lines:

# Timeout value, for browsers
Client  45

# Timeout value, for backend
Timeout 40

# Increase the number of open files, so pound does not log errors like:
# "HTTP Acces: Too many open files"
ulimit -n 20000

You also need to create a run directory for pound and change the owner for it, since the pound package does not do that automatically.

mkdir /var/run/pound
chown www-data.www-data /var/run/pound

Do not forget to change the 'startup' line from 0 to 1, otherwise pound will not start.
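
That is, in the same /etc/default/pound file:

startup=1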

Configure SSL Termination for Drupal using Nginx

You may want to use Nginx instead of the simpler Pound in certain cases.

For example, if you want to process your site's traffic using analysis tools, for example Awstats, you need to capture those logs. Although Pound can output logs in Apache combined format, it also outputs errors to the same log, at least on Ubuntu 16.04, and that makes these logs unusable by analysis tools.

Also, Pound has a basic mechanism to handle redirects from the plain HTTP URLs to the corresponding SSL HTTPS URLs. But, you cannot do more complex rewrites, or more configurable and flexible options.

First install Nginx:

aptitude install nginx

Create a new virtual host under /etc/nginx/sites-available/example.com, with this in it:

# Redirect http www to https no-www
server {
  server_name www.example.com;
  access_log off;
  return 301 https://example.com$request_uri;
}

# Redirect http no-www to https no-www
server {
  listen      80 default_server;
  listen [::]:80 default_server;
  server_name example.com;
  access_log off;
  return 301 https://$host$request_uri;
}

# Redirect http www to https no-www
server {
  listen      443 ssl;
  server_name www.example.com;
  access_log off;
  return 301 https://example.com$request_uri;
}

server {
  listen      443 ssl default_server;
  listen [::]:443 ssl default_server ipv6only=on;

  server_name example.com;

  # We capture the log, so we can feed it to analysis tools, e.g. Awstats
  # This will be more comprehensive than what Apache captures, since Varnish
  # will end up removing a lot of the traffic from Apache
  #
  # Replace this line with: 'access_log off' if logging ties up the disk
  access_log /var/log/nginx/access-example.log;

  ssl on;

  # Must contain the bundle if it is a chained certificate. Order is important.
  # cat example.com.crt bundle.crt > example.com.chained.crt 
  ssl_certificate      /etc/ssl/certs/example.com.chained.crt;
  ssl_certificate_key  /etc/ssl/private/example.com.key;

  # Test certificate
  #ssl_certificate     /etc/ssl/certs/ssl-cert-snakeoil.pem;
  #ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;

  # Restrict to secure protocols, depending on whether you have visitors
  # from older browsers
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

  # Restrict ciphers to known secure ones
  ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256;

  ssl_prefer_server_ciphers on;
  ssl_ecdh_curve secp384r1;
  ssl_stapling on;
  ssl_stapling_verify on;

  add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload";
  add_header X-Frame-Options SAMEORIGIN;
  add_header X-Content-Type-Options nosniff;

  location / {
    proxy_pass                         http://127.0.0.1:81;
    proxy_read_timeout                 90;
    proxy_connect_timeout              90;
    proxy_redirect                     off;

    proxy_set_header Host              $host;
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-Port  443;
   
    proxy_buffers                      8 24k;
    proxy_buffer_size                  2k;
  }
}

Then link this to an entry in the sites-enabled directory

cd /etc/nginx/sites-enabled

ln -s /etc/nginx/sites-available/example.com

Then we add some performance tuning parameters, by editing the main configuration file /etc/nginx/nginx.conf. These will make sure that we can handle higher traffic than the default configuration allows:

At the top of the file, modify these two parameters, or add them if they are not present:

 
worker_processes       auto;
worker_rlimit_nofile   20000;

Then, under the 'events' section, add or modify to look like the following:

events {
  use epoll;
  worker_connections 19000;
  multi_accept       on;
}

And under the 'http' section, make sure the following parameters are added or modified to the following values:

http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 80;
keepalive_requests 10000;
client_max_body_size 50m;
}

We now have either Pound or Nginx in place, handling port 443 with SSL certificates, and forwarding the plain text traffic to Varnish.

Change Varnish configuration to use an alternative port

First, we need to make Varnish work on port 81.

On 16.04 LTS, we edit the file: /lib/systemd/system/varnish.service. If you are using Ubuntu 14.04 LTS, then the changes should go into /etc/default/varnish instead.

Change the 'ExecStart' line to set the following:

  • The port that Varnish will listen on (-a :81)
  • The Varnish VCL configuration file name (-f /etc/varnish/main.vcl)
  • The size of the cache (-s malloc,1536m)

You can also change the type of Varnish cache storage, e.g. to be on disk if it is too big to fit in memory (-s file,/var/cache/varnish/varnish_file.bin,200GB,8K). Make sure to create the directory and assign it the correct owner and permissions.

We use a different configuration file name so as to not overwrite the default one, and to make updates easier (no questions asked during package updates to resolve differences).
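
As a sketch, assuming the stock unit file shipped with Ubuntu 16.04 (your default line may differ slightly in its options), the modified 'ExecStart' line would look something like:

ExecStart=/usr/sbin/varnishd -a :81 -T localhost:6082 -f /etc/varnish/main.vcl -S /etc/varnish/secret -s malloc,1536m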

In order to inform systemd that we changed a daemon startup unit, we need to issue the following command:

systemctl daemon-reload

Add Varnish configuration for SSL

We add the following section to the Varnish VCL configuration file. It includes the X-Forwarded-Proto header, which Pound (or Nginx) sets to https, in the cache hash, so HTTP and HTTPS content is cached separately, while the header still reaches Drupal, which will then enforce HTTPS for that request.

# Routine used to determine the cache key if storing/retrieving a cached page.
sub vcl_hash {

  # This section is for Pound
  hash_data(req.url);

  if (req.http.host) {
    hash_data(req.http.host);
  }
  else {
    hash_data(server.ip);
  }

  # Use special internal SSL hash for https content
  # X-Forwarded-Proto is set to https by Pound
  if (req.http.X-Forwarded-Proto ~ "https") {
    hash_data(req.http.X-Forwarded-Proto);
  }
}

Another change you have to make in Varnish's vcl is this line:

set req.http.Cookie = regsuball(req.http.Cookie, ";(SESS[a-z0-9]+|NO_CACHE)=", "; \1=");

And replace it with this line:

set req.http.Cookie = regsuball(req.http.Cookie, ";(S?SESS[a-z0-9]+|NO_CACHE)=", "; \1=");

This is done to ensure that Varnish will pass through the secure session cookies to the web server.

Change Apache's Configuration

If you had SSL enabled in Apache, you have to disable it, so that only Pound (or Nginx) is listening on port 443. If you do not do this, Pound or Nginx will refuse to start with the error: Address already in use.

First disable the Apache SSL module.

a2dismod ssl

We also need to make Apache listen on port 8080, which is where Varnish will forward traffic to. On Ubuntu, this goes in /etc/apache2/ports.conf:

 
Listen 8080

And finally, your VirtualHost directives should listen on port 8080, as follows. It is also best to restrict listening to the localhost interface, so that outside connections cannot be made directly to the plain text virtual hosts.

<VirtualHost 127.0.0.1:8080>
...
</VirtualHost>

The rest of Apache's configuration is detailed in an earlier article on Apache MPM Worker threaded server, with PHP-FPM.

Configure Drupal for Varnish and SSL Termination

We are not done yet. In order for Drupal to know that it should only use SSL for this page request, and not allow connections from plain HTTP, we have to add the following to settings.php:

// Force HTTPS, since we are using SSL exclusively
if (isset($_SERVER['HTTP_X_FORWARDED_PROTO'])) {
  if ($_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https') {
    $_SERVER['HTTPS'] = 'on';
  }
}

If you have not already done so, you also have to enable page cache, and set the external cache age for cached pages. This is just a starting point, assuming Drupal 7.x, and you need to modify these accordingly depending on your specific setup.

// Enable page caching
$conf['cache'] = 1;
// Enable block cache
$conf['block_cache'] = 1;
// Make sure that Memcache does not cache pages
$conf['cache_lifetime'] = 0;
// Enable external page caching via HTTP headers (e.g. in Varnish)
// Adjust the value for the maximum time to allow pages to stay in Varnish
$conf['page_cache_maximum_age'] = 86400;
// Page caching without bootstraping the database, nor invoking hooks
$conf['page_cache_without_database'] = TRUE;
// Nor do we invoke hooks for cached pages
$conf['page_cache_invoke_hooks'] = FALSE;

// Memcache layer
$conf['cache_backends'][]    = './sites/all/modules/contrib/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
$conf['memcache_servers']    = array('127.0.0.1:11211' => 'default');
$conf['memcache_key_prefix'] = 'live';

And that is it for the configuration part.

You now need to clear all caches:

drush cc all

Then restart all the daemons:

service pound restart
service nginx restart # If you use nginx instead of pound
service varnish restart
service apache2 restart

Check that all daemons have indeed restarted, and that there are no errors in the logs. Then test for proper SSL recognition in the browser, and for correct redirects.

For The Extreme Minimalist: Eliminating Various Layers

The above solution stack works trouble free, and has been tested with several sites. However, there is room for eliminating different layers. For example, instead of having Apache as the backend web server, it can be replaced with Nginx itself, listening on both port 443 (SSL) and 8080 (backend), with Varnish in between. In fact, it is possible to remove Varnish altogether, and use Nginx FastCGI Cache instead. So Nginx listens on port 443, decrypts the connection, and passes the request to its own cache, which decides what is served from cache versus what gets passed through to Nginx itself on port 8080, which hands it over to PHP and Drupal.

Don't let the words 'spaghetti' and 'incest' take over your mind! Eventually, all the oddities will be ironed out, and this will be a viable solution. There are certain things that are much better known in Apache for now in regards to Drupal, like URL rewriting for clean URLs. There are also other things that are handled in .htaccess for Apache that need to gain wider usage within the community before an Nginx only solution becomes the norm for web server plus cache plus SSL.

Apache MPM Worker multithreaded with PHP-FPM is a very low overhead, high performance solution, and we will continue to use it until the Nginx-only approach matures and gains wider use and support within the Drupal community.

Nov 21 2016

The other day, we were helping a long time client with setting up a new development server configured with Ubuntu Server LTS 16.04, which comes with PHP 7.x. Benchmarks of PHP 7.x show that it is faster than any PHP 5.x version by a measurable margin, hence the client's attempt to move to the newer version of Ubuntu and PHP.

But benchmarking the new server against the existing one, which runs Ubuntu Server LTS 14.04, showed that the new server was extremely slow compared to the existing one.

We set about investigating where the slowdown was coming from, running a quick version of our Drupal Performance Assessment, which is one of our many Drupal services.

We started by checking out the logs, which on this site are sent to syslog, so as not to bog down the database.

We saw the following messages:

Nov 19 16:36:31 localhost drupal: http://dev5.example.com|1478727391|Apache Solr|1.1.1.1|http://dev5.example.com/some-page||0||HTTP Status: 0; Message: Request failed: Connection timed out. TCP Connect Timeout.; Response: ; Request: GET /solr/wso/select?start=0&rows=8&fq=bundle%3A%28blog%20OR%20faq%20OR%20...&wt=json&json.nl=map HTTP/1.0#015#012User-Agent: Drupal (+http://drupal.org/)#015#012Connection: close#015#012Host: solr2.example.com:8983#015#012#015#012; Caller: module_invoke() (line 926 of /.../www/includes/module.inc)

Nov 19 16:36:31 localhost drupal: http://dev5.example.com|1478727391|Apache Solr|1.1.1.1|http://dev5.example.com/some-page||0||HTTP 0; Request failed: Connection timed out. TCP Connect Timeout.

As you can see, the server is timing out trying to connect to the Apache Solr service. The existing server does not have this message.

But is this just an error, or something that impacts performance?

There are a variety of tools that can help profile PHP's page load, and see where time is being spent. The most popular tool is xhprof, but setting it up is time consuming, and we needed simpler tools.

So, we settled on a well tried and tested Linux tool, strace. This tool allows one to see what system calls a process is issuing, and measure the time each takes.

Over the years we have used strace, with tried and tested options to give us what we need quickly, without having to install extensions in PHP, and the like.

The command line we use discovers the process IDs of PHP on its own, and then traces all of them:

Caution: NEVER USE THIS ON A LIVE SERVER! You will slow it down considerably, both the CPU and disk access!

strace -f -tt -s 1024 -o /tmp/trace -p `pidof 'php-fpm: pool www' |
sed -e 's/ /,/g'`

Let us explain the options for a bit:

  • -f means follow children. This means that if a process forks and creates a new process, the new process will be traced as well.
  • -tt outputs a time stamp accurate to microsecond time. This is extremely useful for knowing where time is spent.
  • -s sets the maximum string size to display, so that for system calls such as read, longer output is shown.
  • -o gives the file name to output the trace to.
  • -p is a list of process IDs to trace. In this case, we use some shell magic to get the process IDs of the PHP daemon (since we are running PHP-FPM), and then replace the spaces with commas. That way, we don't have to find out the process IDs manually then enter them on the command line. A great time saver!

We run this trace and issue a single request, so the output is not huge. We terminate the strace process with the usual Ctrl-C.

We inspect the output and we find this:

9359 17:02:37.104412 connect(6, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

From the port number, we know this is a connection to the memcached daemon, which is running on the same server. No issues there.

Then we see this:

9359 17:02:37.176523 connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/mysqld/mysqld.sock"}, 29) = 0

That is the connection to MySQL's socket. No issues either.

Now, we see this (IP address obfuscated):

9359 17:02:38.178758 connect(9, {sa_family=AF_INET, sin_port=htons(8983), sin_addr=inet_addr("1.1.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

From the port number, we know that this is Apache Solr.

Then we find the following, repeating more than a hundred times:

9359 17:02:38.179042 select(10, [9], [9], [], {1, 25000}) = 0 (Timeout)
9359 17:02:39.205272 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:39.210606 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:39.235936 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:39.241262 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:39.266552 nanosleep({0, 5000000}, NULL) = 0

Until we see this:

9359 17:02:43.134691 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:43.160097 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:43.165415 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:43.190683 nanosleep({0, 5000000}, NULL) = 0

Did you catch that, or did you miss it? Look closely at the time stamps!

The PHP process spent a full 5 seconds trying to contact the Apache Solr server, but timing out!

No wonder page loading is so slow then.

In this case, there was a block of 'related content' populated from an Apache Solr query. This block was not cached, and for every logged in user, the PHP process waits for the Apache Solr server round trip time, plus the query execution time. It is even worse when the server times out because of network issues, as happened here.

Jun 13 2016

Recently, we were reviewing the performance of a large site that has a significant portion of its traffic from logged in users. The site was suffering from a high load average during peak times.

We enabled slow query logging on the site for an entire week, using the following in my.cnf:

log_slow_queries               = 1
slow_query_log                 = 1
slow_query_log_file            = /var/log/mysql/slow-query.log
log-queries-not-using-indexes  = 1
long_query_time                = 0.100

Note that the parameter long_query_time can be a fraction of a second only on more recent versions of MySQL.

You should not set this value too low, otherwise the server's disk could be tied up in logging the queries. Nor should it be too high so as to miss most slow queries.

We then analyzed the logged queries after a week.
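
There are several tools for this; for example, pt-query-digest from the Percona Toolkit can summarize the whole log into a report (mysqldumpslow, which ships with MySQL, is another option):

pt-query-digest /var/log/mysql/slow-query.log > /tmp/slow-query-report.txt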

We found that the slow queries, on aggregate, examined a total of 150.18 trillion rows, and returned 838.93 million rows.

Out of the total types of queries analyzed, the top two had a disproportionate share of the total.

So these two queries combined were 63.7% of the total slow queries! That is very high, and if we were able to improve these two queries, it would be a huge win for performance and server resources.

Voting API Slow Query

The first query had to do with Voting API and Userpoints.

It was:

SELECT votingapi_vote.*
FROM votingapi_vote
WHERE  value_type = 'points'
AND tag = 'userpoints_karma'
AND uid = '75979'
AND value = '-1'
AND timestamp > '1464077478'

It hogged 45.3% of the total slow queries, and was called 367,531 times per week. It scanned over 213,000 rows every time it ran!

The query took an aggregate execution time of 90,766 seconds, with an average of 247 milliseconds per execution.

The solution was simple: create an index on the uid column:

CREATE INDEX votingapi_vote_uid ON votingapi_vote (uid);

After that was done, the query used the index and scanned only one row, and returned instantly.

Private Messaging Slow Query

The second query had to do with Privatemsg. It is:

SELECT COUNT(pmi.recipient) AS sent_count
FROM pm_message pm
INNER JOIN pm_index pmi ON pm.mid = pmi.mid
WHERE  pm.author = '394106'
AND pm.timestamp > '1463976037'
AND pm.author <> pmi.recipient

This query accounted for 18.4% of the total slow queries, and was called 32,318 times per week. It scanned over 1,350,000 rows on each execution!

The query took an aggregate execution time of 36,842 seconds, with an average of 1.14 seconds (yes, seconds!) per execution.

Again, the solution was simple: create an index on the author column.

CREATE INDEX pm_message_author ON pm_message (author);

Just like the first query, after creating the index, the query used the index and scanned only 10 rows instead of over a million! It returned instantly.

Results After Tuning

As with any analysis, comparison of the before and after data is crucial.

After letting the tuned top two offending queries run for another week, the results were extremely pleasing:

                      Before     After
Total rows examined   150.18 T   34.93 T
Total rows returned   838.93 M   500.65 M

A marked improvement!

Conclusion

With performance, the 80/20 rule applies. There are often low hanging fruit that can easily be tuned.

Do not try to tune based on something you read somewhere that may not apply to your site (including this and other articles on our site!).

Rather, you should do proper analysis, and reach a diagnosis based on facts and measurements, as to the cause(s) of the slowness. After that, tuning them will provide good results.

Mar 22 2016

For years, we have been using and recommending memcached for Drupal sites as its caching layer, and we wrote several articles on it, for example: configuring Drupal with multiple bins in memcached.

Memcached has the advantage of replacing core caching (which uses the database) with memory caching. It still allows modules that have hook_boot() and hook_exit() to work, unlike external cache layers such as Varnish.

However, memcached has its limitations: it is by definition transient, so rebooting wipes out the cache, and a busy server can suffer until the cache is warm again. It is also entirely memory resident, so to cache more items you need more RAM, which is not suitable for small servers.

For Drupal 7, there is a solution that avoids the first limitation: Redis. It provides persistence, but it does not address the second limitation, since it is also memory resident.

The following is a detailed guide to get Redis installed and configured for your server. It assumes that you are on Ubuntu Server 14.04, or the equivalent Debian release.

Installing Redis

First, download the Drupal redis module, which should go to sites/all/modules/contrib. You can do that in many ways, here is how you would use Drush for that:

drush @live dl redis

You do not need to enable any Redis modules in Drupal.

Then, install the Redis Server itself. On Debian/Ubuntu you can do the following. On CentOS/RedHat, you should use yum.

aptitude install redis-server

Then, install PHP's Redis integration. Once you do that, you do not need to compile from source, or anything like that, as mentioned in the Redis module's README.txt file.

aptitude install php5-redis

Restart PHP, so it loads the Redis integration layer.
This assumes you are using PHP FPM:

service php5-fpm restart

If you are using PHP as an Apache module, then you need to restart it as follows:

service apache2 restart

Configuring Redis

Then in your settings.php file, you should replace the section for memcache which would be as follows:

$conf['cache_backends'][] = './sites/all/modules/contrib/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
$conf['memcache_servers'] = array('127.0.0.1:11211' => 'default');
$conf['memcache_key_prefix'] = 'site1';

And replace it with the following configuration lines:

// Redis settings
$conf['redis_client_interface'] = 'PhpRedis';
$conf['redis_client_host'] = '127.0.0.1';
$conf['lock_inc'] = 'sites/all/modules/contrib/redis/redis.lock.inc';
$conf['path_inc'] = 'sites/all/modules/contrib/redis/redis.path.inc';
$conf['cache_backends'][] = 'sites/all/modules/contrib/redis/redis.autoload.inc';
$conf['cache_default_class'] = 'Redis_Cache';
// For multisite, you must use a unique prefix for each site
$conf['cache_prefix'] = 'site1';

Cleaning Up

Once you do that, caching will start using redis. Memcached is not needed, so you should stop the daemon:

service memcached stop

And you should purge memcached as well:

aptitude purge memcached

And that is all there is to it.

Changing Redis Configuration

You can then review the /etc/redis/redis.conf file to see if you should tweak parameters more, such as changing maxmemory to limit it to a certain amount, as follows:

maxmemory 256mb

More below on this specific value.

Checking That Redis Is Working

To check that Redis is working, you can inspect whether keys are being cached. For this, you can use the redis-cli tool. This tool can be used interactively: you get a prompt, type commands in it, and results are returned. Or you can pass the specific command as an argument to redis-cli.

For example, this command filters on a specific cache bin, the cache_bootstrap one:

$ redis-cli
127.0.0.1:6379> keys *cache_boot*

Or you can type it as:

$ redis-cli keys "*cache_boot*"

In either case, if Drupal is caching correctly, you should see output like this:

1) "site1:cache_bootstrap:lookup_cache"
2) "site2:cache_bootstrap:system_list"
3) "site3:cache_bootstrap:system_list"
4) "site3:cache_bootstrap:hook_info"
5) "site2:cache_bootstrap:variables"
...

As you can see, the key structure is simple, it is composed of the following components, separated by a colon:

  • Cache Prefix
    This is the site name in a multi site environment.
  • Cache Bin
    This is the cache table name when using the default database caching in Drupal.
  • Cache Key
    This is the unique name for the cached item. For cached pages, the URL is used, with the protocol (http or https) and the host/domain name.

You can also filter by site, using the cache_prefix:

$ redis-cli keys "*site1:cache_page*"

The output will be something like this:

1) "site1:cache_page:http://example.com/node/1"
2) "site1:cache_page:http://example.com/contact_us"
...

You can also check how many items are cached in the database:

$ redis-cli dbsize

The output will be the number of items:

(integer) 20344

Flushing The Cache

If you need to clear the cache, you can do:

$ redis-cli flushall

Checking Time To Live (TTL) For A Key

You can also check how long a specific item stays in cache, in seconds remaining:

$ redis-cli ttl site1:cache_page:http://example.com/

The output will be the number of seconds:

(integer) 586

Getting Redis Info

You can get a lot of statistics and other information about how Redis is doing, by using the info command:

$ redis-cli info

You can check the full documentation for the info command.

But one of the important values to keep an eye on is used_memory_peak_human, which tells you the maximum memory that was used given your site's specifics, such as the number of items cached, the rate of caching, the size of each item, ...etc.

used_memory_peak_human:256.25

You can use that value to tune the maxmemory parameter, as above.

You can decrease the Minimum Cache Lifetime under /admin/config/development/performance so that the cached items fit in the available memory, or the other way around: you can allocate more memory to fit more items.

Monitoring Redis Operations In Real Time

And finally, here is a command that would show you all the operations that are being done on Redis in real time. Do not try this on a high traffic site!

$ redis-cli monitor

Performance Results

Redis performance as a page cache for Drupal is quite good, with a Time To First Byte (TTFB) of ~95 to 105 milliseconds.

Notes on Fault Resilience

One of the big selling points of Redis versus Memcached, is that the former provides cache persistence across reboots.

However, as the documentation states, the default engine for Redis, RDB, can lose data on power loss. That may not be a deal breaker on its own for a caching application. However, we found from experience that loss of some cache records is not the only problem. The real problem was when a disk failure occurred and was then repaired by the hosting provider, but the site was still offline, because Redis experienced data corruption, and Drupal refused to boot normally.

The other option is using AOF, which should survive power failures, but it has some disadvantages as well, including more disk space usage, and being slower than RDB.

Alternatives To Redis and Memcached

We did fairly extensive research for Redis and Memcached alternatives with the following criteria:

  • Compatible With Redis or Memcached Protocol
    We wanted to use the same PHP extension and Drupal Redis (or Memcached) modules, and not have to write and test yet another caching module.
  • Non-Memory Resident Storage
    We wanted to reduce the memory footprint of Redis/Memcached, because they both store the entire key/value combinations in memory, while still getting acceptable performance.

The following products all claim to meet the above criteria, but none of them worked for us. They were tested on Ubuntu LTS 14.04 64-bit:

MongoDB

See the article on using MongoDB as a caching layer for Drupal, below, for more details.

MemcacheDB

MemcacheDB is a Memcached compatible server which uses the excellent Berkeley DB database for storage.

This MemcacheDB presentation explains what it does in detail.

It has an Ubuntu package right in the repository, so there is no need to compile from source, or manually configure it. It works flawlessly. The -N option enables the DB_TXN_NOSYNC option, which means writes to the database are asynchronous, providing a huge performance improvement.

Configuration in Drupal's settings.php is very easy: it is exactly like Memcached, with only the port number changing, from 11211 to 21201.

Alas, all is not rosy: it is not really a cache layer, since it does not expire keys/values based on time, like Memcached and Redis do.

Redis NDS

Redis-NDS is a fork of Redis 2.6, patched for NDS (Naive Disk Store).

It does compile and run, but when the line: 'nds yes' is added to the configuration file, it is rejected as an invalid value. Looking briefly in the source, we also tried 'nds_enabled yes', but that was rejected as well. So we could not get it to run in NDS mode.

ARDB

ARDB is another NoSQL database that aims to be Redis protocol compatible.

We compiled this with three different storage engines. Facebook's RocksDB did not compile to begin with. Google's LevelDB compiled cleanly, and so did WiredTiger. But when trying to connect Drupal to it, Drupal hung and never came back, with both engines.

SSDB

SSDB is also another NoSQL database that tries to be Redis protocol compatible.

It compiled cleanly, but had the same symptom as ARDB: Drupal hangs and never receives back a reply from SSDB.

There are a couple of sandbox projects, here and here, that aim for native integration, but no code has been committed so far in two years.

If you were able to get any of the above, or another Redis/Memcached compatible caching engine working, please post a comment below.


Mar 22 2016

MongoDB is a NoSQL database that has Drupal integration for various scenarios.

One of these scenarios is using MongoDB as the caching layer for Drupal.

This article describes what is needed to get MongoDB working as a caching layer for your Drupal site. We assume that you have an Ubuntu Server LTS 14.04 or similar Debian derived distro.

Download The Drupal Module

First, download the MongoDB Drupal module. You do not need to enable any MongoDB modules.

drush @live dl mongodb

Install MongoDB Server, Tools and PHP Integration

Then install MongoDB, and PHP's MongoDB integration. Note that 'mongodb' is a virtual package that installs the mongodb-server package as well as other client tools and utilities:

aptitude install php5-mongo mongodb

Restart PHP

Restart PHP, so that MongoDB integration takes effect:

service php5-fpm restart

Configure Drupal With MongoDB

Now, edit your settings.php file, to add the following:

$conf['mongodb_connections']['default']['host'] = 'mongodb://127.0.0.1';
$conf['mongodb_connections']['default']['db'] = 'site1';
$conf['cache_backends'][] = 'sites/all/modules/contrib/mongodb/mongodb_cache/mongodb_cache.inc';
$conf['cache_default_class'] = 'DrupalMongoDBCache';

Note that if you have a multisite setup, using a different 'db' for each site will prevent cache collisions, as shown below.
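
For example, in the second site's settings.php (the 'site2' name is just illustrative):

$conf['mongodb_connections']['default']['db'] = 'site2';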

Monitoring MongoDB

You can monitor MongoDB using the following commands.

mongotop -v
mongostat 15

Tuning MongoDB

Turn off MongoDB's journaling, since we are using MongoDB for transient cache data that can be recreated from Drupal.

Edit the file /etc/mongodb.conf and set the journal option to false:
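
# In /etc/mongodb.conf
journal=false

Then restart MongoDB so the change takes effect (the service name here assumes the Ubuntu mongodb package):

service mongodb restart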

Performance Results

Quick testing on a live site showed that MongoDB performance is acceptable, but not spectacular. That is especially true when compared to other memory resident caching, such as Memcached or Redis.

For example, on the same site and server, with Redis, Time To First Byte (TTFB) is ~ 95 to 105 milliseconds. With MongoDB it is ~ 200, but also goes up to ~350 milliseconds.

Still, MongoDB can be a solution in memory constrained environments, such as smallish VPS's.

Aug 25 2015

There are rare occasions when you want to re-index all your site's content in Solr. Such occasions include:

  • Major Drupal version upgrade (e.g. from Drupal 6.x to Drupal 7.x).
  • Changing your Solr schema to include more search criteria.
  • Upgrading your Solr server to a new major version.
  • Moving your Solr server from an old server to a new one.

The usual way of doing this re-indexing is to make cron run more frequently. However, if you do that, there is a risk of cron being blocked because of other long running cron tasks. Moreover, you are usually limited to a few hundred items per cron run, and then you have to wait until the next iteration of cron running.

The indispensable Swiss army knife of Drupal, Drush, has commands for Solr. Therefore, for someone like me who does almost everything from the command line, using Drush was the natural fit for this task.

To do this, in one terminal, I enter this command:

while true; do drush @live --verbose solr-index; sleep 5; done

This command runs the indexing in a loop, and is not dependent on cron. As soon as one batch of 100 items (or whatever limit you have in your settings) is done, another batch is sent.

In another terminal, you would monitor the progress as follows:

while true; do drush @live solr-get-last-indexed; sleep 30; done

Once you see the number of items in the second terminal stop increasing, check the first terminal for any errors. Usually, this means that indexing is complete. However, if there are errors, they may be due to bad content in nodes, which needs to be fixed (e.g. bad fields), or unpublished, as the case may be.

Doing this reindexing on a Drupal 7.x site sending content to a Solr 4.x server took from 11 pm to 1 pm the next day (14 hours), for 211,900 nodes. There was an overnight network disconnect for the terminals, and the process was restarted in the morning, so the actual time is less than that.

In all cases, this is much faster than indexing a few hundred items every 5 minutes via cron. That would have taken several days to complete.

Jul 03 2015

Last week, at the amazing Drupal North regional conference, I gave a talk on Backdrop: an alternative fork of Drupal. The slides from the talk are attached below, in PDF format.


May 26 2015

Last week, Nathan Vexler of the University of Waterloo, and Khalid Baheyeldin of 2bits.com presented at the Waterloo Region Drupal Users Group on Backdrop.

Backdrop is a fork of Drupal, based mostly on Drupal 7.x, and mostly compatible with its API. It also has some features from Drupal 8.x. It aims to provide an alternative that reduces the cost of ownership by minimizing the learning curve for developers.

Note: the slides from this presentation were superseded by my talk at Drupal North 2015.


Aug 05 2014

In a previous article from over 5 years ago, we advocated the use of Apache MPM Worker Threaded Server with fcgid over Apache's mod_php.

That was for several reasons, including faster handling of static files by the Apache threaded server, and lower memory utilization since PHP is not embedded in every Apache process.

However, there were some drawbacks, mainly that the APC opcode cache is not shared, and each process has to have its own copy.

But we don't have to settle for the above trade off anymore, because there has been a better alternative for some time now: Apache MPM Worker threaded server + PHP-FPM. This configuration works with PHP 5.3 and 5.5, as available on Ubuntu 12.04 and 14.04 respectively.

PHP FPM uses the Fast CGI interface to the web server, and therefore can be used with other web servers, like nginx.

But we stick with Apache, which has very good performance with MPM Worker, since the threaded server is much more efficient and lightweight than the prefork server.

In the sections below, we describe how to set up and tune Apache2 and PHP-FPM optimally for both Ubuntu 12.04 and 14.04, since some sites will stay on the former for a while because certain modules are not yet compatible with the latter. The former has Apache 2.2, while the latter has Apache 2.4. Because of this and other changes, the locations of the files differ, and so do some parameters.

We assume a VPS of modest size, 1 or 2 GB of RAM. The values should be adjusted for larger servers of 8GB or more.

Apache MPM Worker Configuration

First, install the required packages:

aptitude install apache2-mpm-worker apache2-threaded-dev apache2-utils libapache2-mod-fastcgi php5-fpm

We then configure Apache MPM Worker. Note that in Ubuntu 12.04, the configuration goes in /etc/apache2/conf.d/mpm-worker.conf. As for Ubuntu 14.04, the file should be /etc/apache2/conf-enabled/mpm-worker.conf.

<IfModule mpm_worker_module>
  ServerLimit           300
  StartServers            3
  MinSpareThreads         3
  MaxSpareThreads        10
  ThreadsPerChild        10
  MaxClients            300
  MaxRequestsPerChild  1000
</IfModule>
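
As a rough sanity check on these numbers: with the settings above, Apache will serve at most MaxClients = 300 concurrent requests, using at most MaxClients / ThreadsPerChild = 300 / 10 = 30 worker processes, which is well within ServerLimit. If you change ThreadsPerChild, keep MaxClients at or below ServerLimit × ThreadsPerChild, and scale it to what your server's memory can handle.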

Then, we enable the required module for FastCGI to work:

sudo a2enmod fastcgi

Ubuntu 12.04 Configuration

On Ubuntu 12.04, it is generally better to stay with APC rather than Zend OpCache. You can try running with Zend OpCache, but if you get segfaults, stick with APC.

First, we tell Apache to send requests for files ending with .php to the PHP FPM server. This goes in the file: /etc/apache2/conf.d/php-fpm.conf, as follows:

<IfModule mod_fastcgi.c> 
  Alias /usr/sbin/php-fpm.fcgi /usr/sbin/php-fpm 
  AddHandler php-fastcgi .php 
  Action php-fastcgi /usr/sbin/php-fpm.fcgi 
  FastCGIExternalServer /usr/sbin/php-fpm -host 127.0.0.1:9000 -pass-header Authorization -idle-timeout 600
  <Directory /usr/sbin> 
    Options ExecCGI FollowSymLinks 
    SetHandler fastcgi-script 
    Order allow,deny 
    Allow from all 
  </Directory> 
</IfModule>

Then, in the file /etc/php5/fpm/pool.d/www.conf, you should have the following values, so that you do not exceed the available memory for the server:

[www]
user = www-data
group = www-data
chdir = /
listen = 127.0.0.1:9000
pm = dynamic
pm.max_children      = 10
pm.start_servers     = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 6
pm.max_requests      = 2000
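
A reasonable way to pick pm.max_children is to measure how much memory an average PHP-FPM worker actually uses on your site, then divide the memory you can spare for PHP by that number. Here is a minimal sketch of our own, using standard ps and awk (adjust the process name if yours differs):

ps -o rss= -C php5-fpm | awk '{ sum += $1; n++ } END { if (n) printf "%d workers, average %.0f MB each\n", n, sum / n / 1024 }'

Multiply the average worker size by pm.max_children and make sure the result stays comfortably below the memory left over after Apache, MySQL and the operating system.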

And finally, we increase the memory for PHP and APC from the default, in the file /etc/php5/fpm/conf.d/local.ini.

For fairly simple sites with a small number of modules and themes (e.g. 100 or less), shm_size of 64MB is sufficient. For sites with a large number of modules (~200 or so), you will need to increase that to 128MB or more. Likewise, the value for memory_limit needs to increase for complex sites as well, perhaps up to 256MB in some cases.

To know exactly how much memory APC is using, you need to copy the apc.php file from /usr/share/doc/php temporarily to web root and access it from a browser.

; You may need to increase the values below, depending on your site's complexity
memory_limit = 96M
apc.shm_size = 64M

Ubuntu 14.04 LTS

For Ubuntu 14.04, you should not use APC, but rather the built-in Zend OpCache.

First, we tell Apache to send requests for files ending with .php to the PHP FPM server. This goes in the file /etc/apache2/conf-enabled/php-fpm.conf

<IfModule mod_fastcgi.c>
  Alias /php-fcgi /usr/lib/cgi-bin/php5
  AddHandler php .php
  Action php /php-fcgi
  FastCgiExternalServer /usr/lib/cgi-bin/php5 -host 127.0.0.1:9000 -pass-header Authorization -idle-timeout 600
  <Directory /usr/lib/cgi-bin>
    AllowOverride All
    Options +ExecCGI +FollowSymLinks
    Require all granted
  </Directory>
</IfModule>

Then we configure the maximum number of PHP processes that would provide good performance, but not exceed the server's memory. This goes in the file /etc/php5/fpm/pool.d/www.conf

[www]
user = www-data
group = www-data
chdir = /
listen = 127.0.0.1:9000
pm = dynamic
pm.max_children      = 10
pm.start_servers     = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 6
pm.max_requests      = 2000

And finally we increase the memory for PHP and Zend OpCache from the default, in the file: /etc/php5/fpm/conf.d/local.ini.

For fairly simple sites with a small number of modules and themes (e.g. 100 or less), memory_consumption of 64MB is sufficient. For sites with a large number of modules (~200 or so), you will need to increase that to 128MB or more. Likewise, the value for memory_limit needs to increase for complex sites as well, perhaps up to 256MB in some cases.

To know exactly how much memory Zend OpCache is using, get the opcache.php script from Rasmus Lerdorf's GitHub, copy it temporarily to the web root, and access it from a browser.
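
For example, assuming the script is still published in Rasmus Lerdorf's opcache-status repository on GitHub (verify the URL before relying on it), something along these lines places it temporarily in the web root:

cd /var/www
wget https://raw.githubusercontent.com/rlerdorf/opcache-status/master/opcache.php

Remember to remove the file again once you are done checking memory usage.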

; You may need to increase the values below, depending on your site's complexity
memory_limit = 96M
opcache.memory_consumption = 64M

Enabling Required Modules

The final step before restarting the servers is to enable the following Apache modules. This ensures that Apache, PHP-FPM and Drupal work properly together:

sudo a2enmod fastcgi actions rewrite deflate expires

Restarting

Now, restart Apache and PHP, so the above configurations take effect:

sudo service php5-fpm restart
sudo service apache2 restart

Now you have a very performant yet lightweight setup.

Jul 22 2014
Jul 22

We were recently approached by a non-profit site that runs on Drupal.

Major Complaints

Their major complaint was that the "content on the site does not show up". The other main complaint was that the site is very slow.

Diagnosis First ...

In order to troubleshoot the disappearing content, we created a copy of the site in our lab, and proceeded to test it, to see if we could replicate the issues.

The site is on the heavy side, with 210 modules and themes enabled. It uses the Domain Access module to serve multiple sites from the same Drupal instance, each with its own sub-domain. It also has a full e-commerce store using Ubercart. And of course, other widely used modules are on the site, including Panels, Views and Organic Groups.

We found that indeed, on many pages of the site, the content did not show up. That is, anything in the "Content" region of the theme did not show up. Also, some other blocks were not showing either.

Investigating further revealed that the main issue was a totally inefficient way the site was developed: many pages were created as blocks (instead of nodes), all of them assigned to the "Content" region, with each block's visibility settings restricted to a different URL.

Because of the way Drupal processes blocks, it has to go through all the enabled blocks and build the content for each one, and only then display the ones whose visibility settings make them visible.

This was confirmed by the high memory usage: even a PHP memory_limit of 192 MB was not enough, and only raising it to 256 MB worked. Finally, the site was functional!

As for the speed issue, we tuned the entire LAMP stack to have the optimal performance possible with Drupal. This is something we do as a service for clients: Drupal Server Provisioning Service: Installation, Configuration and Tuning.

Old Hosting Setup

The hosting that was used by the site previously was on a shared host using cPanel. cPanel imposes specific versions of the LAMP stack, and therefore is hard to tune well for complex applications like sites created with Drupal.

Moreover, the shared host was charging a substantial amount per month for filtering SPAM emails.

Inexpensive Hosting With Optimal Tuning

Because the site is for a non-profit entity, they did qualify for free Google Apps, with Gmail handling the email for the site, and filtering spam at no cost.

We then recommended a very inexpensive, $20 a month, VPS at Linode, installed the LAMP stack from scratch (Linux, Apache, PHP, MySQL), configured it all for Drupal, and tuned it optimally.

We also included some performance and resource monitoring tools (including Munin), web statistics (AWStats), and the latest Drush version.

Other Configuration

Due to the memory constraints (2 GB), we decided to configure the web site without Varnish, as we occasionally do. The fact that the page cache had to remain disabled, as the client required, contributed to that decision. We did configure memcached though, to speed up the other, non-optional, cached parts of Drupal.

Simple and Fast

We also did not use any complicated or expensive components or services, such as a Content Delivery Network (CDN), since we already achieved very good performance with minimal complexity and cost.

Performance Results

Here are the results of the old server, and the new server after we configured and tuned it optimally for Drupal. The times are in milliseconds.

            Old     New
/           2,706   572
/about      2,063   430
[redacted]  2,235   455
[redacted]  2,437   457

As you can see, this is significantly faster than before, and the client is very happy with it.

If you, or any of your clients need to have an inexpensive and fast hosting setup, as per our Drupal Server Provisioning Service: Installation, Configuration and Tuning, please contact us today.

May 27 2014
May 27

Everyone needs to have a backup plan for their live site. Not only can your server's disk get corrupted, but you can also erroneously overwrite your site with bad code or bad data, or your site can get hacked. Detecting the latter situations takes time, sometimes hours or days.

For this reason, you should have multiple backup copies at multiple time points.

The most convenient scheme is a 7-day sliding backup: that is, you keep one backup snapshot for each day of the week, with today's backup overwriting the one from a week ago.

The scheme we describe in this post uses the extremely flexible, popular and useful Drush, the command line Swiss army knife for Drupal.

It has the following advantages:

  • No need to know what user, password and host MySQL is running on.
  • Single self contained all-in-one backup file for database, Drupal and static files.
  • 7 Day sliding backups.

Note: The use of Drush for backups, or any other scheme that uses mysqldump, is suitable for sites that do not have a large database. If your site's database is large, you may want to explore other backup schemes, such as using MyDumper for fast parallel MySQL backups.

Install Drush

You can install Drush using Composer, as described on Drush's GitHub page, or you can install it via PHP PEAR.

Installing from PEAR is as follows:
First, install the PHP PEAR package from Ubuntu's repositories.

aptitude install php-pear

Then install dependencies:

pear upgrade --force pear
pear install Console_GetoptPlus
pear install Console_Table

Finally install Drush itself.

pear channel-discover pear.drush.org
pear install drush/drush

Now that you have Drush installed, we proceed to use it for backups.

Create A Backup Directory

We now create a directory to hold the backups, and change its permissions so that no one can read it except users with root access.

sudo mkdir /home/backup
sudo chmod 700 /home/backup

Create a Drush alias file

Then create an aliases file so that Drush knows where your site is installed. Change example.com below to the actual directory where the site is installed, and to its real URL.

<?php
# This file should be in ~/.drush/aliases.drushrc.php
$aliases['live'] = array(
  'uri'  => 'http://example.com',
  'root' => '/home/example.com/www',
);
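
Before relying on the alias, it is worth confirming that Drush can reach the site through it:

drush @live status

This should print the site's Drupal version, database status, and file paths.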

Creating the drush backup script

Now create a file in /usr/local/bin/daily-backup.sh

#!/bin/sh

# Daily backup of the database, using Drush
#

# Backup directory
DIR_BACKUP=/home/backup

log_msg() {
  # Log to syslog
  logger -t `basename $0` "$*"

  # Echo to stdout
  LOG_TS=`date +'%H:%M:%S'`
  echo "$LOG_TS - $*"
}

DAY_OF_WEEK=`date '+%a'`
BACKUP_FILE=$DIR_BACKUP/backup.$DAY_OF_WEEK.tgz

log_msg "Backing up files and database to $BACKUP_FILE ..."

drush @live archive-dump \
  --destination=$BACKUP_FILE \
  --preserve-symlinks \
  --overwrite

RC=$?

if [ "$RC" = 0 ]; then
  log_msg "Backup completed successfully ..."
else
  log_msg "Backup exited with return code: $RC"
fi

Make the script executable:

chmod 755 /usr/local/bin/daily-backup.sh

Scheduling the daily backup

Now add the script to cron so that it runs daily, for example at 2:30 am:

30 2 * * * /usr/local/bin/daily-backup.sh

After a few days, you should see backups in the directory, kept for 7 days:

ls /home/backup/

backup.Fri.tgz
backup.Thu.tgz

It is also a good idea to copy the backup files to another server, ideally not in the same datacenter as the live site. Do this at least weekly, by extending the script with scp or rsync, for example as shown below.
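
For example, a weekly cron entry along these lines would push the backups to an offsite host over SSH (the user, host and destination path here are placeholders to adapt):

30 4 * * 0 rsync -az /home/backup/ backupuser@offsite.example.com:/backup/example.com/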

May 19 2014
May 19

Most high traffic or complex Drupal sites use Apache Solr as the search engine. It is much faster and more scalable than Drupal's search module.

In a previous article on Drupal with Apache Solr 4.x, we described one way to install the latest stable Apache Solr 4.x. That article detailed a lot of manual steps involving downloading, extracting, setting permissions, creating a startup script, ...etc.

In this article, we describe another way of getting a working Apache Solr installation for use with Drupal 7.x, on Ubuntu Server 12.04 LTS as well as Ubuntu Server 14.04 LTS. This article uses an older version of Solr: 1.4 on the former, and 3.6 on the latter. But the methodology uses the tried and tested apt dependency management system for Ubuntu and other Debian based systems.

Objectives

For this article, we focus on having an installation of Apache Solr with the following objectives:

  • Use the version of Apache Solr in Ubuntu's repositories, therefore no searching for software, or dependencies, start/stop scripts, ...etc.
  • Least amount of software dependencies, e.g. no need for 'heavy' components, such as a full Tomcat server
  • Least amount of necessary complexity
  • Least amount of software to install and maintain
  • A secure installation

This installation can be done on the same host that runs Drupal, if it has enough memory and CPU, or it can be on a separate server dedicated for search, which is the preferred approach.

Installing Java

We start by installing the solr-jetty package, along with whatever the default Java Development Kit is for our version of the distro. The Debian dependency system guarantees a Java version that works with Solr.

sudo aptitude update
sudo aptitude install solr-jetty default-jdk

Configuring Jetty

Now, edit the file /etc/default/jetty, and change the following lines as follows:

NO_START=0
JETTY_HOST=0.0.0.0
JETTY_PORT=8983

For Ubuntu 12.04, also add the following line:

JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

For Ubuntu 14.04, also add the following line:

JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Note that the JAVA_HOME line depends on your CPU architecture and on the JDK version that gets installed by default from the repositories. So if the above does not work, check which directories exist under /usr/lib/jvm, and choose a suitable version.

You can find out which JVM version and architecture is installed on your server by using the following command:

ls -l /usr/lib/jvm

Copying the Drupal schema and Solr configuration

We now have to copy the Drupal Solr configuration into Solr. Assuming your site is installed in /var/www, the following commands achieve that.

For Ubuntu 12.04, which installs Solr 1.4, use:

sudo cp /var/www/sites/all/modules/contrib/apachesolr/solr-conf/solr-1.4/* /etc/solr/conf

For Ubuntu 14.04, which installs Solr 3.6, use:

sudo cp /var/www/sites/all/modules/contrib/apachesolr/solr-conf/solr-3.x/* /etc/solr/conf

Then edit the file /etc/solr/conf/solrconfig.xml, and uncomment the following line:

<dataDir>${solr.data.dir:./solr/data}</dataDir>

Setting Apache Solr Authentication, using Jetty

By default, a Solr installation using Jetty does not start at all, unless you configure /etc/default/jetty as above. The values above tell Jetty to listen on the public Ethernet interface of the server, but with no protection whatsoever. Attackers can access Solr and change its settings remotely. To prevent this, we set up password authentication for Jetty, which in turn protects Solr.

Note: The following syntax is for Apache Solr 1.4 and 3.6, which depend on Jetty 6. Solr 4.x depends on Jetty 8, and uses a different syntax, described in our article linked above.

The following settings work well for a single core install, i.e. search for a single Drupal installation. If you want multi-core Solr, i.e. for many sites, then you want to fine tune this to add different roles to different cores.

First, edit the file: /etc/jetty/jetty.xml, and add this section before the last line:

<!-- ======= Securing Solr ===== -->
<Set name="UserRealms">
  <Array type="org.mortbay.jetty.security.UserRealm">
    <Item>
      <New class="org.mortbay.jetty.security.HashUserRealm">
        <Set name="name">Solr</Set>
        <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
      </New>
    </Item>
  </Array>
</Set>

Then edit the file: /etc/jetty/webdefault.xml, and add this section, also before the last line:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>search-role</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>

Finally, edit the file: /etc/jetty/realm.properties, and comment out all the existing lines, then add this line:

user: password, search-role

Note that "search-role" must match what you put in webdefault.xml above.

You can change "user" and "password" to suitable values.

Finally, make sure that the file containing passwords is not readable to anyone but the owner.


chmod 640 /etc/jetty/realm.properties

Starting Solr

To start Solr, you need to start Jetty, which runs Solr for you automatically.

service jetty start

Now Solr is up and running.

Verify that it is running by accessing the following URL, where you can check the progress of indexing once you start it (see below):

http://x.x.x.x:8983/solr/admin/stats.jsp

Replace x.x.x.x by the IP address of the server that is running Solr, or its fully qualified domain name.
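
Since we enabled Jetty authentication above, it is also worth confirming that it is enforced. A quick check with curl, using the hypothetical credentials from realm.properties: the first request should return 401, and the second 200.

curl -s -o /dev/null -w "%{http_code}\n" http://x.x.x.x:8983/solr/admin/ping
curl -s -o /dev/null -w "%{http_code}\n" http://user:[email protected]:8983/solr/admin/ping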

Viewing the logs

You can view the logs at:

tail -f /var/log/jetty/*.log

Configuring Drupal's Apache Solr module

After you have successfully installed, configured and started Solr, you should configure your Drupal site to interact with the Solr server. First, go to admin/config/search/apachesolr/settings/solr/edit, and enter the information for your Solr server. Use a URL of the following form:

http://user:[email protected]:8983/solr/

Now you can proceed to reindex your site, by sending all the content to Solr.

Removing Solr

If you ever want to cleanly remove the Apache Solr installation done using the above instructions, use the following sequence of commands:

sudo aptitude purge default-jdk solr-jetty

sudo rm -rf /var/lib/solr/ /etc/jetty /etc/solr

May 12 2014
May 12

Most high traffic or complex Drupal sites use Apache Solr as the search engine. It is much faster and more scalable than Drupal's search module.

In this article, we describe one way, of many, to get a working Apache Solr installation for use with Drupal 7.x on Ubuntu Server 12.04 LTS. The technique described should work with Ubuntu 14.04 LTS as well.

In a later article, we describe how to install other versions of Solr, the Ubuntu/Debian way.

Objectives

For this article, we focus on having an installation of Apache Solr with the following objectives:

  • Use the latest stable version of Apache Solr
  • Least amount of software dependencies, i.e. no installation of Tomcat server, and no full JDK, and no separate Jetty
  • Least amount of necessary complexity
  • Least amount of software to install and maintain
  • A secure installation

This installation can be done on the same host that runs Drupal, if it has enough memory and CPU, or it can be on the database server. However, it is best if Solr is on a separate server dedicated for search, with enough memory and CPU.

Installing Java

We start by installing the Java Runtime Environment, and choose the headless server variant, i.e. without any GUI components.

sudo aptitude update
sudo aptitude install default-jre-headless

Downloading Apache Solr

Second, we need to download the latest stable version of Apache Solr from a mirror near you. At the time of writing this article, it is 4.7.2. You can find the closest mirror to you at Apache's mirror list.

cd /tmp
wget http://apache.mirror.rafal.ca/lucene/solr/4.7.2/solr-4.7.2.tgz

Extracting Apache Solr

Next we extract the archive, while still in the /tmp directory.

tar -xzf solr-4.7.2.tgz

Moving to the installation directory

We choose to install Solr in /opt, because that directory is meant for software that is not installed from Ubuntu's repositories via the apt dependency management system, and is not tracked for security updates by Ubuntu.

sudo mv /tmp/solr-4.7.2 /opt/solr

Creating a "core"

Apache Solr can serve multiple sites, each served by a "core". We will start with one core, simply called "drupal".

cd /opt/solr/example/solr
sudo mv collection1 drupal


Now edit the file ./drupal/core.properties and change the name= to drupal, like so:

name=drupal

Copying the Drupal schema and Solr configuration

We now have to copy the Drupal Solr configuration into Solr. Assuming your site is installed in /var/www, the following commands achieve that:

cd /opt/solr/example/solr/drupal/conf
sudo cp /var/www/sites/all/modules/contrib/apachesolr/solr-conf/solr-4.x/* .

Then edit the file /opt/solr/example/solr/drupal/conf/solrconfig.xml, and comment out or delete the following section:

<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>32</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>

Setting Apache Solr Authentication, using Jetty

By default, a Solr installation listens on the public Ethernet interface of a server, and has no protection whatsoever. Attackers can access Solr, and change its settings remotely. To prevent this, we set password authentication using the embedded Jetty that comes with Solr. This syntax is for Apache Solr 4.x. Earlier versions use a different syntax.

The following settings work well for a single core install, i.e. search for a single Drupal installation. If you want multi-core Solr, i.e. for many sites, then you want to fine tune this to add different roles to different cores.

Then edit the file: /opt/solr/example/etc/jetty.xml, and add this section:

<!-- ======= Securing Solr ===== -->
<Call name="addBean">
  <Arg>
    <New class="org.eclipse.jetty.security.HashLoginService">
      <Set name="name">Solr</Set>
      <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
      <Set name="refreshInterval">0</Set>
    </New>
  </Arg>
</Call>

Then edit the file: /opt/solr/example/etc/webdefault.xml, and add this section:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>search-role</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>

Finally, create a new file named /opt/solr/example/etc/realm.properties, and add the following section to it:

user_name: password, search-role


Note that "search-role" must match what you put in webdefault.xml above.

Instead of "user_name", use the user name that will be used for logging in to Solr. Also, replace "password" with a real strong hard to guess password.

Finally, make sure that the file containing passwords is not readable to anyone but the owner.

chmod 640 /opt/solr/example/etc/realm.properties

Changing File Ownership

We then create a user for solr.

sudo useradd -d /opt/solr -M -s /dev/null -U solr

And finally change ownership of the directory to solr

sudo chown -R solr:solr /opt/solr

Automatically starting Solr

Now you need Solr to start automatically when the server is rebooted. To do this, download the attached file, and copy it to /etc/init.d

sudo cp solr-init.d.sh.txt /etc/init.d/solr
sudo chmod 755 /etc/init.d/solr

And now tell Linux to start it automatically.

sudo update-rc.d solr start 95 2 3 4 5 .

For now, start Solr manually.

sudo /etc/init.d/solr start

Now Solr is up and running.

Verify that it is running by accessing the following URL:

http://x.x.x.x:8983/solr/


Replace x.x.x.x by the IP address of the server that is running Solr.

You can also view the logs at:

tail -f /opt/solr/example/logs/solr.log

Configuring Drupal's Apache Solr module

After you have successfully installed, configured and started Solr, you should configure your Drupal site to interact with the Solr server. First, go to admin/config/search/apachesolr/settings/solr/edit, and enter the information for your Solr server. Use a URL of the following form:

http://user:[email protected]:8983/solr/drupal

Now you can proceed to reindex your site, by sending all the content to Solr.

Removing Solr

If you ever want to cleanly remove the Apache Solr installation done using the above instructions, use the following sequence of commands:

sudo /etc/init.d/solr stop

sudo update-rc.d solr disable

sudo update-rc.d solr remove

sudo rm /etc/init.d/solr

sudo userdel solr

sudo rm -rf /opt/solr

sudo aptitude purge default-jre-headless

Attachment: solr-init.d.sh_.txt (1.23 KB)
May 03 2014
May 03

On Friday May 2nd, 2014, Khalid of 2bits.com, Inc. presented on Drupal Performance.

The presentation covered important topics such as:

  • Drupal misconception: Drupal is slow/resource hog/bloated
  • Drupal misconception: Only Anonymous users benefit from caching in Drupal
  • Drupal misconception: Subsecond response time in Drupal is impossible for logged in users
  • Drupal misconception: Cloud hosting is more cost effective than dedicated servers

The presentation slides are attached for those who may be interested ...

Attachment: drupalcamp-toronto-2014-drupal-performance-tips-and-tricks.pdf (371.99 KB)
Apr 29 2014
Apr 29

Khalid of 2bits.com Inc. will be presenting Drupal Performance Tips and Tricks at DrupalCamp Toronto 2014 this coming Friday May 2nd at Humber College, Lakeshore Campus.

See you all there ...


Mar 11 2014
Mar 11

We previously wrote in detail about how botnets hammering a web site can cause outages.

Here is another case that emerged in the past month or so.

Again, it is a distributed attempt from many IP addresses all over the world, most probably from PCs infected with malware.

Their main goal seems to be adding content to a Drupal web site, and, when that attempt is denied because of site permissions, trying to register a new user.

The pattern is like the following excerpt from the web server's access log.

Note the POST, as well as the node/add in the referer. Also note the hard coded 80 port number:

173.0.59.46 - - [10/Mar/2014:00:00:04 -0400] "POST /user/register HTTP/1.1" 200 12759 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
173.0.59.46 - - [10/Mar/2014:00:00:06 -0400] "POST /user/register HTTP/1.1" 200 12776 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:10 -0400] "GET /user/register HTTP/1.1" 200 12628 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:16 -0400] "GET /user/register HTTP/1.1" 200 12642 "http://example.com/user/login?destination=node/add" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
202.75.16.18 - - [10/Mar/2014:00:00:17 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1"
5.255.90.89 - - [10/Mar/2014:00:00:18 -0400] "GET /user/register HTTP/1.1" 200 12627 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:24 -0400] "GET /user/register HTTP/1.1" 200 12644 "http://example.com/user/login?destination=node/add" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
...
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"

In the above case, the web site has a CAPTCHA on the user registration page, which causes a session to be created, and hence a full Drupal bootstrap (i.e. no page caching). When this is done by lots of bots simultaneously, it takes its toll on the server's resources.

Botnet Statistics

We gleaned these statistics from analyzing the access log for the web server for a week, prior to putting in the fix below.

Out of 2.3 million requests, 3.9% were to /user/register. 5.6% had http://example.com:80/ in the referer (with the real site instead of example). 2.4% had "destination=node/add" in the referer.

For the same period, but limiting the analysis to accesses to /user/register only, 54.6% have the "/user/login?destination=node/add" in the referer. Over 91% pose as coming from a computer running Mac OS/X Lion 10.7.5 (released October 2012). 45% claim they are on Firefox browser, 33% pretend they are on Chrome, and 19.7% pose as Safari.
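
If you want to compute similar numbers for your own site, here is a rough sketch using awk against a combined-format Apache access log (the log path is an example; adjust it to yours):

awk '{ total++ } $7 == "/user/register" { reg++ } END { printf "%.1f%% of %d requests hit /user/register\n", 100 * reg / total, total }' /var/log/apache2/access.log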

Workaround

As usual with botnets, blocking individual IP addresses is futile, since there are so many of them. CloudFlare, which is front ending the site, did not detect nor block these attempts.

To solve this problem, we put in a fix that aborts the Drupal bootstrap when this bot is detected, by adding the following to settings.php. Don't forget to replace example.com with the domain/subdomain you see in your own access log.

// Avoid PHP notices when the bots send no referer header at all
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

if ($referer == 'http://example.com/user/login?destination=node/add') {
  if ($_SERVER['REQUEST_URI'] == '/user/register') {
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}

// This is for the POST variant, with either port 80 in
// the referer, or an empty referer
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
  if ($_SERVER['REQUEST_URI'] == '/user/register') {
    switch ($referer) {
      case 'http://example.com:80/':
      case '':
        header("HTTP/1.0 418 I'm a teapot");
        exit();
    }
  }
}
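
To confirm the fix works, you can replay the bots' request pattern with curl (substituting your own domain); the response code should be 418 instead of 200:

curl -s -o /dev/null -w "%{http_code}\n" \
  -e "http://example.com/user/login?destination=node/add" \
  http://example.com/user/register
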
Jan 08 2014
Jan 08

A client recently asked us for help with a very specific issue. The node edit page was hanging up, but only in Internet Explorer 10, and not in Firefox or Chrome. The client had WYSIWYG editor enabled.

This automatically pointed to a front end issue, not a server issue.

So, we investigated further, and found that the underlying issue is between Internet Explorer and jQuery when there is a large number of items to parse.

Internet Explorer was not able to parse the large number of token items listed (around 220). This caused the browser to hang when rendering the WYSIWYG page, with the following error message:

A script on this page is causing Internet Explorer to run slowly. If it continues to run, your computer might become unresponsive.

With the option to stop the script.

The real problem is the critical issue described in #1334456, the fix for which is not yet committed to the token module's repository.

Fortunately there is an easy workaround, the steps are:

  • Install the Token Tweaks module.
  • Go to /admin/config/system/tokens.
  • Change the Maximum Depth limit from the default of 4 to 1
  • Save the changes.

Now the edit form for the node should work normally, and the browser, whichever it is, will not hang anymore.

Note: Thanks to Dave Reid for this workaround.

Oct 07 2013
Oct 07

Using a Reverse Proxy and/or a Content Delivery Network (CDN) has become common practice for Drupal and other Content Management Systems.

One inconvenient aspect of this is that your web server no longer gets the correct IP address, and neither does your application. The IP address is that of the machine that the reverse proxy is running on.

In Drupal, there is code in core that tries to work around this, by looking up the IP address in the HTTP header HTTP_X_FORWARDED_FOR, or a custom header that you can set.

For example, this would be in the settings.php of a server that runs Varnish on the same box.

$conf['reverse_proxy'] = TRUE;
$conf['reverse_proxy_addresses'] = array('127.0.0.1');

There is also this setting for Drupal 7.x in case your CDN puts the IP address in some other custom header:

// CloudFlare CDN
$conf['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

That solves it for the application, but what about the web server?

But, even if you solve this at the application level (e.g. Drupal, or WordPress), there is still the issue that your web server is not logging the correct IP address. For example, you can't analyze the logs to know which countries your users are coming from, or identify DDoS attacks.

Apache RPAF module

There is an easy solution to this though: the Reverse Proxy Add Forward (RPAF) module.

What this Apache module does is extract the correct client IP address and use it in Apache's logs, as well as hand it over to PHP in $_SERVER['REMOTE_ADDR'].

To install RPAF on Ubuntu 12.04 or later, use the command:

aptitude install libapache2-mod-rpaf

If you run the reverse proxy (e.g. Varnish) on the same server as your web server and application, and do not use a CDN, then there is no need to do anything more.

However, if you run the reverse proxy on another server, then you need to change the RPAFproxy_ips line to include the IP addresses of those servers. For example, these would be the addresses of your Varnish servers that are front ending Drupal, which are in turn front ended by the CDN.

You do this by editing the file /etc/apache2/mods-enabled/rpaf.conf.

For example:

RPAFproxy_ips 10.0.0.3 10.0.0.4 10.0.0.5

CDN Client IP Header

If you are using a CDN, then you need to find out what HTTP header the CDN uses to put the client IP address, and modify RPAF's configuration accordingly.

For example, for CloudFlare, the header is CF-Connecting-IP

So, you need to edit the above file, and add the following line:

RPAFheader CF-Connecting-IP

Drupal Reverse Proxy settings no longer needed

And finally, you don't need any of the above Reverse Proxy configuration in settings.php.

// $conf['reverse_proxy'] = TRUE;
// $conf['reverse_proxy_addresses'] = array('127.0.0.1');
// $conf['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

Now, you have correct client IP addresses in Apache's logs, and inside Drupal as well.
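
A quick way to verify this is to request a page from a machine other than the server, then check the last entry in the access log on the web server; it should show that machine's public IP address rather than the proxy's (the log path and domain below are examples):

curl -s -o /dev/null http://example.com/

tail -n 1 /var/log/apache2/access-example.com.log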

What If RPAF Does Not Work?

If Apache with RPAF is front ended directly by a CDN, without Varnish, then RPAF may not work, for reasons as yet unknown.

To overcome this, you have several other options.

Apache mod_remoteip

There is a small Apache module called mod_remoteip. This basically does the same thing as RPAF, but with simpler configuration.

Use the download link and save the file as apache-2.2-mod_remoteip.c, then compile and install it:

apxs2 -i -a -c apache-2.2-mod_remoteip.c

This should create the module's .so file in Apache's modules directory. It should also add the LoadModule directive in mods-available/remoteip.load, which should look like so:

LoadModule remoteip_module modules/mod_remoteip.so

Now add the RemoteIPHeader directive in a new file called mods-available/remoteip.conf

RemoteIPHeader X-Forwarded-For

If you are using CloudFlare CDN then you use:

RemoteIPHeader CF-Connecting-IP

Now, enable the module:

a2enmod remoteip

Then restart Apache:

service apache2 restart

If this does not work, then you can still do it using the next set of tricks:

Apache Access Log and Drupal Reverse Proxy Settings

We can force Apache to log the correct client IP address to the access log by adding this to the virtual host entry for your site (e.g. /etc/apache2/sites-enabled/example.com):

LogFormat "%{CF-Connecting-IP}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" proxied

This takes the CF-Connecting-IP header from CloudFlare, and logs that instead of the connecting IP address, which is the proxy's, not the originating client's.

Then, under the "VirtualHost" stanza, you add this to use the custom proxied format you created above:

CustomLog ${APACHE_LOG_DIR}/access-example.com.log proxied

Then you need to enable the Drupal reverse proxy setting in settings.php:

$conf['reverse_proxy'] = TRUE;
$conf['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

You don't need to add the reverse_proxy_addresses variable, because for CloudFlare there are too many of them.

May 06 2013
May 06

We have mentioned before that both Pressflow 6.x and Drupal 7.x (but not core Drupal 6.x), disable page caching when a session is created for an anonymous user.

An extreme case of this happened recently, because of a perfect storm.

Symptoms

The client sends a newsletter to site users, whether they have accounts on the site or just entered their email address to receive the newsletter.

After a recent code change, when a newsletter was sent, we suddenly saw a very high load average and very high CPU usage. And because we plot the number of logged in and anonymous users too, we found around 800 anonymous users on the site, in addition to the peak of 400 logged in users!

Since this is Pressflow, anonymous users are mostly served by Varnish, and they are not supposed to have sessions.

Investigation

So, we started to investigate those anonymous sessions in the database, in the sessions table.

Indeed, there are lots of anonymous sessions.

SELECT COUNT(*) 
FROM sessions 
WHERE uid = 0;
+----------+
| count(*) |
+----------+
|     6664 |
+----------+

And upon closer investigation, most of those sessions had a message in them saying "Comment field is required".

SELECT COUNT(*) 
FROM sessions 
WHERE uid = 0 
AND session LIKE '%Comment field is required%';
+----------+
| count(*) |
+----------+
|     5176 |
+----------+

And just to compare the day the newsletter was sent to other days, we confirmed that indeed, that day had many multiples of any other day in terms of sessions.

In fact, more than 5X the highest day prior, and up to 55X higher than more typical days.

SELECT DATE(FROM_UNIXTIME(timestamp)) AS date, COUNT(*) 
FROM sessions 
WHERE uid = 0 
GROUP BY date;

+------------+----------+
| date       | count(*) |
+------------+----------+
| .......... |       .. |
| 2013-04-19 |       55 |
| 2013-04-20 |       81 |
| 2013-04-21 |       66 |
| 2013-04-22 |      115 |
| 2013-04-23 |       99 |
| 2013-04-24 |      848 |
| 2013-04-25 |       72 |
| 2013-04-26 |     4524 |
| .......... |       .. |
+------------+----------+

Graphs show the magnitude of the problem

Look at the graphs, to the very right of each one, after Friday noon.

You can see how the load shot up after the newsletter was sent:

The number of anonymous sessions shot up from only a handful to around 800!

The number of logged in users spiked to 400, up from around 300.

The number of SQL queries also shot up.

And so did the MySQL threads too.

And the CPU usage was very high, with the server trying to serve around 1200 users with no caching for them.

Root Cause Analysis

It turns out that the recent code change was done to encourage more people to sign up for an account on the site. The change alters the comment form and adds extra fields, including an email address, to prod the visitor to register for an account. Another form above the node also captures the email address.

If people clicked the button to add their email, Pressflow complained about the missing comment field. And since any such message, whether for a logged in user or an anonymous one, is stored in a session, all visitors who tried to register for an account bypassed the page cache, just like logged in users do. This effectively tripled the number of uncached users (from 400 to 1,200), all of whom had to execute PHP and MySQL queries instead of being served from Varnish.

Hence the high load and high CPU usage.

Solution

The fix was to revoke the post comment permission for anonymous users, and therefore, remove the comment form from the bottom of every node.

After that, the newsletter was sent without increasing the load on the server at all.

Although this problem was on Pressflow 6.x, it applies to Drupal 7.x as well, since it too disables page caching when anonymous users have sessions.

Apr 22 2013
Apr 22

One of the suboptimal techniques that developers often use is a query that retrieves the entire content of a table, without any conditions or filters.

For example:

SELECT * FROM table_name ORDER BY column_name;

This is acceptable if there are not too many rows in the table, and there is only one call per page view to that function.

However, things start to get out of control when developers do not take into account the frequency of these calls.

Here is an example to illustrate the problem:

A client had a high load average (around 5 or 6) on a server that had around 400 logged in users at peak hours. The server was somewhat fragile, with any little thing, such as a traffic influx or a misbehaved crawler, causing the load to go over 12.

This was due to using an older version of the Keyword Links module.

This old version had the following code, which caused certain keywords to be replaced when a node was displayed:

function keyword_link_nodeapi(&$node, $op, $teaser, $page) {
  if ($op == 'view' && ...
    $node->content['body']['#value'] = keyword_link_replace(...);
  }
}

And the following caused keyword replacement for each comment as well:

function keyword_link_comment(&$a1, $op) {
  if ($op == 'view') {
    $a1->comment = keyword_link_replace($a1->comment);
    $node->content['body']['#value'] = keyword_link_replace(...);
  }
}

The function that replaced the content with keywords was as follows:

 
function keyword_link_replace($content) {
  $result = db_query("SELECT * FROM {keyword_link} ORDER BY wid ASC");
  while ($keyword = db_fetch_object($result)) {
    ...
    $content = preg_replace($regex, $url, $content, $limit);
  }
  return $content;
}

This executes the query every time the function is called, and iterates through the result set, replacing words.

Now, let us see how many rows there are in the table.

mysql> SELECT COUNT(*) FROM keyword_link;
+----------+
| count(*) |
+----------+
|     2897 |
+----------+
1 row in set (0.00 sec)

Wow! That is a relatively large number.

And Eureka! That is it! The query was re-executed every time the replace function was called. This means that in a listing of 50 nodes, there would be 50 queries!

And even worse, for a node with tens or hundreds of comments, there would be tens or hundreds of queries as well!

Solution

The solution here was to upgrade to the latest release of the module, which has eliminated the replacement of keywords for comments.

But a better solution, which preserves the functionality for comments, combines two approaches:

Use memcache as the cache layer

By using memcache, we avoid going to the database for any caching. It is always a good idea in general to have that, except for simple or low traffic sites.

However, on its own, this is not enough.

Static caching for cache_get() result

By statically caching the result of the query, or of cache_get(), those operations are executed once per page view, and not 51 times for a node displaying 50 comments. This is feasible if the size of the dataset is not too large. For example, for this site, the size was around 1.3 MB for the three fields used from that table, which fits in memory without issues for each PHP process.

This is the outline for the code:

function keyword_link_replace($content) {
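  // Note: replace_keyword() below is a helper (not shown in this outline) that
  // is assumed to modify $content in place, e.g. by taking it by reference.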
  static $static_data;

  if (!isset($static_data)) {
    if (($cache = cache_get('keyword_link_data')) &&
      !empty($cache->data)) {
      $static_data = $cache->data;

      foreach($cache->data as $keyword) {
        replace_keyword($keyword, $content);
      }
    }
    else {
      $result = db_query("SELECT * FROM {keyword_link} ORDER BY wid ASC");

      $data = array();

      while ($keyword = db_fetch_object($result)) {
        $data[] = $keyword;

        replace_keyword($keyword, $content);
      }

      $static_data = $data;

      cache_set('keyword_link_data', $data, 'cache');
    }
  }
  else {
    foreach($static_data as $keyword) {
      replace_keyword($keyword, $content);
    }
  }

  return $content;
}

You can download the full Drupal 6.x version with the fixes from here.

What a difference a query makes

The change was done at 22:00 in the daily graphs, and 6 days before the monthly graph was taken. You can see the difference: the load average is lower, ranging between 1.8 and 2.4 for most of the day, with occasional spikes above 3. This is far better than the load of 5 or 6 before the fix. The amount of data retrieved from MySQL is also halved.

As you will notice, no change was seen in the number of SQL queries. This is probably because of MySQL's query cache: since all the repeated queries on a page were identical, MySQL served the result from the query cache, and did not have to re-execute the query tens or hundreds of times per page. But even though the query cache saved us from re-executing the query, there is still overhead in getting that data from MySQL's cache to the application, and that consumed CPU cycles.
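
If you want to see how much the query cache is absorbing on your own server, a quick check from the MySQL client is:

mysql> SHOW STATUS LIKE 'Qcache%';

Watching Qcache_hits grow as the same page is reloaded confirms that repeated identical queries are being answered from the cache.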

Faster Node displays

And because we are processing less data and doing fewer regular expression replacements, display of nodes that have lots of comments has improved. For a node that had hundreds of comments, with 50 comments shown per page, the total page load time was 8,068 milliseconds.

The breakdown was as follows:

keyword_link_replace() 51 calls totalling 2,429 ms
preg_replace() 147,992 calls totalling 1,087 ms
mysql_fetch_object() 150,455 calls totalling 537 ms
db_fetch_object() 150,455 calls totalling 415 ms
mysql_query() 1,479 calls totalling 393 ms
unserialize() 149,656 calls totalling 339 ms

A total of 5,254 milliseconds processing keywords in comments only.

After eliminating the calls to hook_comment() in the module, the total load time for the node was 3,122 milliseconds.

Conclusion

So, always look at the size of your dataset, as well as the frequency of resource intensive or slow operations. They could be bottlenecks for your application.

Apr 10 2013
Apr 10

The bulk of Drupal hosting for the clients we deal with is on virtual servers, whether they are marketed as "cloud" or not. Many eventually have to move to dedicated servers, because of increased traffic, or because of continually adding features that increase complexity and bloat.

But, there are often common issues that we see repeatedly that have solutions which can prolong the life of your current site's infrastructure.

We assume that your staff, or your hosting provider, have full access to the virtual servers, as well as to the physical servers they run on.

Disks cannot be virtualized

Even for dedicated servers, the server's disk(s) are often the bottleneck for the overall system. They are the slowest part. This is definitely true for mechanical hard disks with rotating platters, and even Solid State Disks (SSDs) are often slower than the CPU or memory.

For the above reasons, disks cannot be fully virtualized. Yes, you do get a storage allocation that is yours to use and no one else can use. But you cannot guarantee a portion of the I/O throughput, which is always a precious resource on servers.

So, other virtual servers that are on the same physical server as you will contend for disk I/O if your site (or theirs) is a busy one or not optimally configured.

In a virtual server environment, you cannot tell how many virtual servers are on the same physical server, nor if they are busy or not. You only deal with the effects (see below).

For a Drupal site, the following are some of the most common causes for high disk I/O activity:

  • MySQL, with either a considerable amount of slow queries that do file sorts and temporary tables; or lots of INSERT/UPDATE/DELETE
  • Lots of logging activity, such as a warning or a notice that a module keeps reporting many times per request, each of which results in a disk write
  • Boost cache expiry, e.g. when a comment is posted

Xen based virtualization vs. Virtuozzo or OpenVZ

The market uses virtualization technologies much like airlines overbook flights: on the assumption that some customers will not use everything they were allocated.

Similarly, not all virtual hosting customers will use all the resources allocated to them, so there is often plenty of unused capacity.

However, not all virtualization technologies are equal when it comes to resource allocation.

Virtuozzo and its free variant, OpenVZ, use the term "burst memory" to allocate unused memory from other instances, or even swap space when applications demand it on one instance. However, this can bring a server to its knees if swap usage causes thrashing.

Moreover, some Virtuozzo/OpenVZ hosts use vzfs, a virtualized file system, which is slow for Drupal when certain things are placed on it, such as the entire web root, the logs, and the database files.

Xen does not suffer from any of the above. It guarantees that memory and CPU allocated to one virtual instance stays dedicated for that instance.

However, since physical disk I/O cannot be virtualized, it remains the only bottleneck with Xen.

Underpowered Instances

One issue that Amazon AWS EC2 users face is that the reasonably priced instances are often underpowered for most Drupal sites. These are the Small and Medium instances.

Sites with a low number of nodes/comments per day, and with mostly anonymous traffic, lend themselves to working well on such instances, with proper Varnish caching enabled and set to expire after long hours.

Other sites that rely on a large number of simultaneous logged in users, with lots of enabled modules, and with short cache expiry times do not work well with these underpowered instances. Such sites require the Extra Large instances, and often the High CPU ones too.

Of course, this all adds to the total costs of hosting.

Expensive As You Grow

Needless to say, if your site keeps growing then there will be added hosting costs to cope with this growth.

With the cloud providers, these costs often grow faster than with dedicated servers, as you add more instances, and so on.

Misconfigured Self-Virtualization

Some companies choose to self-manage physical servers colocated at a datacenter, and virtualize them themselves.

This is often a good option, but it can also be a pitfall: sometimes the servers are badly misconfigured. We saw one case where the physical server was segmented into 12 VMware virtual servers for no good reason. Moreover, all of them were accessing a single RAID array. On top of that, Boost was used on a busy, popular forum. When a comment was posted, Boost was expiring cached pages, and that tied up the RAID array, preventing it from doing anything useful for other visitors of the site.

Variability in Performance

With cloud and virtual servers, you often don't notice issues, but then suddenly variability will creep in.

An analogy ...

This happens because you have bad housemates who flush the toilet when you are in the shower. Except that you do not know who those housemates are, and can't ask them directly. The only symptom is this sudden cold water over your body. Your only recourse is to ask the landlord if someone flushed the toilet!

Here is a case in point: a Drupal site on a VPS with a popular cloud provider. It worked fine for several years. Then the host upgraded to a newer platform, and asked all customers to move their sites.

It was fine most of the time, but then extremely slow at other times. No pattern could be predicted.

For example, while getting a page from the cache for anonymous visitors usually takes a few tens of milliseconds at most, on some occasions it took much more than that: in one case, 13,879 milliseconds, with a total page load time of 17,423 milliseconds.

Here is a sample of devel's output:

Executed 55 queries in 12.51 milliseconds. Page execution time was 118.61 ms.

Executed 55 queries in 7.56 milliseconds. Page execution time was 93.48 ms.

Most of the time is spent retrieving cached items.

ms where query
0.61 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'menu:1:en'
0.42 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_87_[redacted]'
0.36 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_54_[redacted]'
0.19 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'filter:3:0b81537031336685af6f2b0e3a0624b0'
0.18 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_88_[redacted]'
0.18 block_list SELECT * FROM blocks WHERE theme = '[redacted]' AND status = 1 ORDER BY region, weight, module

Then suddenly, same site, same server, and you get:

Executed 55 queries in 2237.67 milliseconds. Page execution time was 2323.59 ms.

This was a Virtuozzo host, and it was a sign of disk contention. Since this is a virtual server, we could not tell if it was something inside the virtual host, or some other tenant on the same physical server flushing the toilet ...

The solution is in the following point.

Move your VPS to another physical server

When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server.

Doing so most likely will solve the issue, since you effectively have a different set of housemates.


Apr 01 2013
Apr 01

Ubuntu Server 12.04 LTS finally provides a stable long term support server distro that has a recent version of Varnish in its repositories.

Trouble is, the repository-provided package of Varnish has some issues. Specifically, the command line tools, such as varnishhist, varnishstat, etc., do not report anything. Therefore one cannot know the hit/miss rates, hits per second, or other useful information. Moreover, monitoring Varnish using Munin for such statistics does not work either.

There are two ways you can overcome this, both are described below.

Use the Varnish provided packages from their repository

The Varnish project provides an Ubuntu repository that contains Ubuntu packages. This is the recommended way, because you get updates for security if and when they come out. As well, you will get startup scripts installed and configured automatically, instead of having to create them from scratch, as in the second option.

First, install curl, if it is not already installed on your server:

# aptitude install curl 

Then, add the repository key to your server:

# curl http://repo.varnish-cache.org/debian/GPG-key.txt |
    sudo apt-key add -

And then add the repository itself.

# echo "deb http://repo.varnish-cache.org/ubuntu/ lucid varnish-3.0" | 
    tee /etc/apt/sources.list.d/varnish.list

Now, update the list of packages from the new repository:

# aptitude update

And then install Varnish

# aptitude install varnish 

Configuring Varnish listening port and memory

Then, you need to configure Varnish to front end your web server, so that it runs on port 80 (regular HTTP traffic) and 443 (SSL secure traffic, if your site uses it).

Also, increase or decrease the amount of memory allocated to Varnish, depending on how much memory your server has to spare.

We also change the name of the .vcl file, so that when upgrading, Ubuntu will ask about only one changed file (/etc/default/varnish), rather than two.

So, go ahead and edit the /etc/default/varnish file, and add this to the end, if you have a small server (e.g. 1 GB of RAM):

DAEMON_OPTS="-a :80,:443 \
             -T localhost:6082 \
             -f /etc/varnish/main.vcl \
             -S /etc/varnish/secret \
             -s malloc,128m"

If you have a larger server, e.g. 8 GB, you would allocate 2 GB of RAM for Varnish:

DAEMON_OPTS="-a :80,:443 \
             -T localhost:6082 \
             -f /etc/varnish/main.vcl \
             -S /etc/varnish/secret \
             -s malloc,2048m"
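Note that the DAEMON_OPTS above point Varnish at /etc/varnish/main.vcl rather than the packaged default.vcl, so create that file before restarting. Here is a minimal sketch, assuming your web server has been moved to port 8080 on the same machine (adjust the address and port to your setup):

# cat > /etc/varnish/main.vcl <<'EOF'
# Send all requests to a single backend web server.
# 127.0.0.1:8080 is an assumption; change it to where
# Apache or nginx actually listens.
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}
EOF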

Then restart Varnish:

service varnish restart
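To confirm that the command line tools now report statistics (the original problem with the stock Ubuntu package), check a couple of counters once the site has received some traffic:

# varnishstat -1 | egrep 'cache_hit|cache_miss'

Non-zero, growing numbers mean Varnish is counting requests as expected.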

In an upcoming article, we will discuss the details of configuring Varnish's VCL.

Compiling from source

The second option is to download the source code, and compile it yourself.

This means that you are responsible for upgrading if there is a security fix.

You first need to install the C compiler, the ncurses and PCRE development libraries, pkg-config, and the make utility.

You do this via the command:

# aptitude install libncurses5-dev libpcre3-dev pkg-config make 

This will pull in the C compiler if it is not already installed.

Then, download the source:

# wget http://repo.varnish-cache.org/source/varnish-3.0.3.tar.gz

Extract the source from the archive

# tar xzf varnish-3.0.3.tar.gz

Change the directory to what you just extracted

# cd varnish-3.0.3

Run the configure tool. Make sure there are no errors.

# ./configure

Build the binaries

# make

And install it:

# make install

Varnish will be installed in /usr/local.

You will need to create start scripts for Varnish.
Use a name different from the one the repository packages would have installed, so that the file names do not clash, e.g. /etc/default/varnish-local. This file would hold the DAEMON_OPTS mentioned above. You also need to create an /etc/init.d/varnish-local script for startup. I then use the following command to make Varnish run at run level 2.

update-rc.d varnish-local start 80 2 .

Monitoring Varnish with Munin

We assume that you have Munin installed and configured, and that it is already monitoring other things on your server.

We need to install a Perl library that can pull statistics from Varnish:

aptitude install libnet-telnet-perl

Then, we need to get the monitoring scripts:

cd /usr/share/munin/plugins/

git clone git://github.com/basiszwo/munin-varnish.git

chmod a+x ./munin-varnish/varnish_*

Then create symbolic links for the monitoring scripts.

cd /etc/munin/plugins

ln -s /usr/share/munin/plugins/munin-varnish/varnish_allocated
ln -s /usr/share/munin/plugins/munin-varnish/varnish_cachehitratio
ln -s /usr/share/munin/plugins/munin-varnish/varnish_hitrate
ln -s /usr/share/munin/plugins/munin-varnish/varnish_total_objects

Then edit the file /etc/munin/plugin-conf.d/munin-node, and add the following to the end.

[varnish*]
user root

And restart Munin for the changes to take effect.

service munin-node restart
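To check that the plugins can actually pull data from Varnish, you can run one of them by hand with munin-run (part of munin-node), using the plugin names symlinked above:

# munin-run varnish_cachehitratio

It should print a value rather than an error.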

Configuring Varnish for Drupal

As mentioned above, in an upcoming article, we will discuss the details of configuring varnish for Drupal.

Mar 27 2013
Mar 27

Today, Khalid gave a presentation on Drupal Performance and Scalability for members of the London (Ontario) Drupal Users Group.

The slides from the presentation are attached below.

Mar 19 2013
Mar 19

For sites that have lots of slow queries, disk access is often the bottleneck. For these slow queries, MySQL writes temporary tables to disk, populates them with intermediate results, then queries them again for the final result.

We all know that the disk is the slowest part in a computer, because it is limited by being mechanical, rather than electronic. One way of mitigating this is to tell MySQL to use memory rather than disk for temporary tables.

This is often done by creating either a RAM disk, or the easier to use tmpfs. Both are a portion of the server's RAM made to emulate a disk, with slightly different details: a RAM disk holds a regular file system (ext3 or anything else), while tmpfs is its own file system type.

Since memory access is much faster than disk access, this improves performance, and decreases load on the server by avoiding pile-up bottlenecks on the disks.

We describe here methods to achieve this goal.

Method 1: Using an existing tmpfs directory

Rather than creating a new RAM disk or tmpfs mount, we first check for one that already exists on your server.

# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
tmpfs           1.6G  260K  1.6G   1% /run
...

This tells us that the /run filesystem is a temporary file system (tmpfs), and has 1.6 GB allocated for it.

# mount
...
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
...

On Ubuntu 12.04 LTS, the directory /run/mysqld already exists and is allocated to a tmpfs with sufficient space for temporary files.

Save yourself some grief and do not try to create your custom directory under /run (e.g. mysqlslow), because it will not survive reboots, and MySQL will not start after a reboot.

So, all we need to do is tell MySQL to use this directory.

To do this, create a file called /etc/mysql/conf.d/local.cnf. By using this file, and not editing /etc/mysql/my.cnf, we avoid having Ubuntu updates overwrite our changes.

Add this to the file:

[mysqld]
tmpdir = /run/mysqld

Then restart MySQL

service mysql restart

Then make sure that the new value is now in effect:

# mysql
mysql> SHOW VARIABLES LIKE 'tmpdir';
+---------------+-------------+
| Variable_name | Value       |
+---------------+-------------+
| tmpdir        | /run/mysqld |
+---------------+-------------+
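To gauge how often temporary tables are being created, and how many of them go to "disk" (which, after this change, means the tmpfs), you can watch these standard MySQL status counters before and after the change:

# mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%tables'"

A high Created_tmp_disk_tables count relative to Created_tmp_tables indicates many queries spilling their temporary tables out of memory.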

Method 2: Creating a new tmpfs directory

If you are not running Ubuntu 12.04 LTS, then you may not have a ready made RAM disk that you can use, and you have to create one.

Here are the steps to create the tmpfs directory:

Create the tmp directory

# mkdir -p /var/mysqltmp

Set permissions

# chown mysql:mysql /var/mysqltmp

Determine mysql user id

# id mysql

Edit /etc/fstab

And add the following line, replacing the 105 and 114 below with your specific MySQL group id and user id, respectively:

tmpfs /var/mysqltmp tmpfs rw,gid=105,uid=114,size=256M,nr_inodes=10k,mode=0700 0 0

Mount the new tmpfs partition

# mount -a
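To confirm that the new tmpfs is mounted with the intended size:

# df -h /var/mysqltmp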

Change your MySQL configuration

# vi /etc/mysql/conf.d/local.cnf 

Change, or add the following line:

tmpdir = /var/mysqltmp

Restart MySQL

/etc/init.d/mysql restart

Or, for Ubuntu 12.04:

service mysql restart

How much of a difference does it make?

How much of a difference can you expect from moving MySQL's temporary files from disk to a RAM? Significant, if your server has lots of slow queries.

Here are the graphs from a site that was suffering considerably because of a large number of logged in users (averaging 250 at peak hours, and exceeding 400 at times), and some other factors.

Using a RAM disk made a noticeable difference.

CPU usage. Note how much iowait (magenta) there was before the change, compared to after:

And the number of slow queries per second, before and after the change:

The server's load is lower too:

And the I/O operations per second on sda (where /tmp is located, which was the destination for slow query temporary tables before the change):

Mar 12 2013
Mar 12

Over the past few years, we were called in to assist clients with poor performance of their sites. Many of these were using Pressflow, because it is "faster" and "more scalable" than Drupal 6.x.

However, some of these clients hurt their site's performance by using Pressflow rather than plain Drupal, often because they misconfigured or misused it in one way or another.

Setting cache to "external" without having a caching reverse proxy

We saw a couple of cases where clients set the cache to "External" in admin/settings/performance, but were not running a reverse proxy cache tier, such as Varnish or Squid.

What happens here is that Pressflow will not cache pages for anonymous users, and just issue the appropriate cache HTTP headers, assuming that a caching reverse proxy, e.g. Varnish, will cache them.

The site's performance will suffer, since every request from search engine crawlers and other anonymous visitors has to be generated by Drupal, with no cache anywhere to serve it from.

The solution is simple: either configure a reverse proxy, or set caching to "normal".
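A quick way to check whether a caching reverse proxy is actually answering for your site is to look at the response headers for an anonymous request (example.com stands in for your own domain); responses passing through Varnish or Squid typically carry Via, X-Varnish, X-Cache or Age headers:

$ curl -s -I http://example.com/ | egrep -i 'via|x-varnish|x-cache|age'

If the cache setting is "External" but none of these show up, there is no reverse proxy in front of the site.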

Setting Page Cache Maximum Age too low

In admin/settings/performance, there is a configuration parameter called "Page cache maximum age" in Pressflow (called "Expiration of cached pages" in Drupal 7.x). This value should not be left set to "none", because that means items will not stay in the cache long enough to be served to subsequent users. Setting it too low (e.g. 1 minute) has the same effect.

Do set this parameter to the highest time possible if you have an external cache like Varnish or Squid.

Enabling modules that create anonymous sessions

Both Pressflow 6.x and Drupal 7.x disable page caching for anonymous users if a session is present.

This means that if you have a module that stores data in the session for anonymous users, caching will be disabled, because storing session data causes a session cookie to be set.

This means that code like this will disable page caching for anonymous users:

  $_SESSION['foo'] = 'bar';

The Pressflow Wiki started an effort to list such modules ("Modules that break Pressflow 6.x caching and how to fix them" and "Code that sets cookie or session"), but with so many modules being written, it is virtually impossible to have a complete list.

Also, novice Drupal developers will not know this, and write modules that use cookies, and therefore prevent page caching for anonymous users.

We have seen such cases from such developers, where a site that was previously working perfectly is rendered fragile and unstable by a single line of code!

Note that this issue applies to Pressflow 6.x, and to Drupal 7.x as well.

If you are using the former, then you can solve the problem temporarily by switching to Drupal core 6.x instead of Pressflow 6.x. Drupal core 6.x does not mind cookies for anonymous users.
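A simple way to audit your own code for this problem is to search it for session writes. This is only a sketch; the paths are assumptions, so point it at wherever your custom modules and themes live:

$ grep -rnF '$_SESSION[' sites/all/modules/custom sites/all/themes

Any hit that can run for anonymous users is a caching problem waiting to happen.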

Using Varnish with hook_boot() or hook_exit() modules

When using an external cache, like Varnish, all anonymous requests do not hit Drupal at all. They are served from Varnish.

So if you have modules that implement hook_boot() or hook_exit(), the code in them will not be triggered for anonymous page views served by Varnish. If you rely on it for some functionality, it will run only when the page is first requested and cached.

For example, the core statistics module increments the view count for the node in its hook_exit(). If you enable this module for this functionality, the figures will be far lower than the real numbers, and you are better off disabling the module rather than having inaccurate numbers.
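To find out which of your modules implement these hooks, a quick search of the code base helps (a sketch; adjust the path to your installation):

$ grep -rnE 'function [a-z0-9_]+_(boot|exit)\(' sites/all/modules

Review each match and decide whether its logic still makes sense when most anonymous traffic never reaches Drupal.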

Mar 06 2013
Mar 06

The Boost module is often a great help in speeding up small to medium sized web sites, and/or sites hosted on shared hosts.

It works by writing the entire cached page to a disk file, and serving it entirely from the web server, bypassing PHP and MySQL entirely.

This works well in most cases, but we have observed a few cases where boost itself becomes a bottleneck.

One example was when 2bits.com was called in to investigate and solve a problem for a Fortune 500 company's Drupal web site.

The site was configured to run on 12 web servers, each being a virtual instance on VMWare, but all of them sharing a single RAID-5 pool for disk storage.

The main problem was that when someone posted a comment, the site took up to 20 seconds to respond, and all the web instances were effectively hung.

We investigated and found that Boost's expiry logic kicks in and tries to intelligently delete the cached HTML for the node, the front page, ...etc. All this happens while the site is busy serving pages from Boost's cache on the same disk, as well as other static files.

This disk contention from deleting files caused the bottleneck observed.

By disabling boost, and using memcache instead, we were able to bring down the time from 20 seconds to just 8 seconds.

Further improvement could be achieved by using Varnish as the front tier for caching, reducing contention.

Feb 25 2013
Feb 25

In the Drupal community, we always recommend using the Drupal API, and best practices for development, management and deployment. This is for many reasons, including modularity, security and maintainability.

But it is also for performance that you need to stick to these guidelines, refined for many years by so many in the community.

By serving many clients over many years and specifically doing Drupal Performance Assessments, we have seen many cases where these guidelines are not followed, causing site slowdowns and outages.

Here are some examples of how not to do things.

Logic in the theme layer

We often find that developers who are proficient in PHP, but new to Drupal, misuse its API in many ways.

In extreme cases, they do not know that they should write modules to house the application logic and data access, leaving only presentation to the theme layer.

We saw a large site where all the application logic was in the theme layer, often in .tpl.php files. The logic even ended with an exit() statement!

This caused Drupal's page caching mechanism to be bypassed, making all page accesses from crawlers and anonymous users very heavy on the servers, and complicated the infrastructure, which was over-engineered to compensate for such a development mistake.

Using PHP in content (nodes, blocks and views)

Another common approach that most developers start using as soon as they discover it, is placing PHP code inside nodes, blocks or views.

Although this is a quick and dirty approach, the initial time savings cause lots of grief down the road through the life cycle of the site. We wrote an article specifically about that, which you will find a link to below.

Heavy queries in the theme layer, when rendering views

In some cases, the logic for rendering individual nodes within a view is complex, and involves code in the view*.tpl.php file that has SQL queries, or calls to heavy functions, such as node_load() and user_load().

We wrote an article on this which you can find the link to below.
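A quick, if crude, way to spot this pattern is to search the theme templates for database queries and heavy API calls. This is a sketch; the paths and the list of functions are only examples:

$ grep -rl --include='*.tpl.php' -E 'db_query\(|node_load\(|user_load\(' sites/all/themes sites/all/modules

Any template that shows up deserves a closer look.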

Conclusion

Following Drupal's best practices and community guidelines is always beneficial. Performance is just one of the benefits that you gain by following them.

Further reading

Feb 19 2013
Feb 19

When doing performance assessments for large and complex sites, to find out why they are not fast or scalable, we often run into cases where modules intentionally disable the Drupal page cache.

Depending on how often it happens and for which pages, disabling the page cache can negatively impact the site's performance, be that in scalability, or speed of serving pages.

How to inspect code for page cache disabling

If you want to inspect a module to see if it disables the page cache, search its code for something like the following:

// Recommended way of disabling the cache in Drupal 7.x
drupal_page_is_cacheable(FALSE);

Or:

$GLOBALS['conf']['cache'] = 0;

Or:

$GLOBALS['conf']['cache'] = CACHE_DISABLED;

Or:

$conf['cache'] = FALSE;
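You can search an entire contributed modules directory for these patterns in one pass. A sketch (adjust the path, and expect some false positives from code that merely reads the cache setting):

$ grep -rnE "drupal_page_is_cacheable\(FALSE\)|CACHE_DISABLED|\['cache'\]" sites/all/modules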

Modules that disable the page cache

We have found the following modules that disable the page cache in some cases:

Bibliography Module

In biblio_init(), the module disables the page cache if someone is visiting a certain URL, such as "biblio/*" or "publications/*", depending on how the module is configured.

if ($user->uid === 0) { 
  // Prevent caching of biblio pages for anonymous users
  // so session variables work and thus filering works
  $base = variable_get('biblio_base', 'biblio');
  if (drupal_match_path($_GET['q'], "$base\n$base/*"))
    $conf['cache'] = FALSE;
}

Flag Module

This code in flag/includes/flag_handler_relationships.inc

if (array_search(DRUPAL_ANONYMOUS_RID, $flag->roles['flag']) !== FALSE) {
  // Disable page caching for anonymous users.
  drupal_page_is_cacheable(FALSE);

Or in Drupal 6.x:

if (array_search(DRUPAL_ANONYMOUS_RID, $flag->roles['flag']) !== FALSE) {
  // Disable page caching for anonymous users.
  $GLOBALS['conf']['cache'] = 0;

Invite Module

case 'user_register':
  // In order to prevent caching of the preset 
  // e-mail address, we have to disable caching 
  // for user/register.
  $GLOBALS['conf']['cache'] = CACHE_DISABLED;

CAPTCHA Module

The CAPTCHA module disables the cache wherever a CAPTCHA form is displayed, be that in a comment or on the login form.

This is done via hook_element_info(), which sets a process callback, the function captcha_element_process().

If you find other modules that are commonly used, please post a comment below about it.

Feb 13 2013
Feb 13

While doing a Drupal Performance Assessment for a client's Drupal 6.x site recently, we found an odd problem.

Among other things, page generation time was high, and this was due to the function skinr.module::skinr_preprocess() being called 190 times for each page load.

Each of these calls was calling theme_get_registry(), which is supposed to statically cache its results. However, these were still consuming too much time, even with static caching. Perhaps this was due to the large amount of memory the data for the theme registry consumes.

Each invocation took up to 11 milliseconds, and those added up significantly.

We measured this by adding the following debugging code to includes/theme.inc::theme_get_registry():

function theme_get_registry($registry = NULL) {
  timer_start(__FUNCTION__);
  static $theme_registry = NULL;
  if (isset($registry)) {
    $theme_registry = $registry;
  }

  $time = timer_stop(__FUNCTION__);
  error_log(__FUNCTION__ . ':' . $time['time']);
  return $theme_registry;
}
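With that in place, the per-call times can be summed straight from the log. This is only a sketch: it assumes the messages end up in Apache's error log at the path below, and that the millisecond value is the last colon-delimited field on each line, as in the error_log() call above:

$ grep 'theme_get_registry:' /var/log/apache2/error.log | \
    awk -F: '{ s += $NF; n++ } END { printf "%d calls, %.1f ms\n", n, s }'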

We then checked how much time the skinr_preprocess() function takes, by adding a static flag to prevent it from being executed more than once. This of course would impact the site's functionality, but we wanted to measure the impact of all these calls:

function skinr_preprocess(&$vars, $hook) {
  static $already_called = FALSE;
  if ($already_called) {
    return;
  }

  ...
  
  $already_called = TRUE;
}

We found that this made a significant difference in performance, when measured via a code profiler, such as xhprof:

theme_get_registry(): 194 calls, totalling 327,560 usecs, 15.8% of total page load time
skinr_preprocess(): 190 calls, totalling 264,246 usecs, 12.8% of total page load time

By eliminating these, we take off more than half a second of page execution time!

Instead of a total 2,068 ms before, it is down to 1,220 ms with the change.

Since this does impact the site's functionality, we modified it to a less performant, but still functional, variant, like so (for Skinr 6.x-1.x):

function skinr_preprocess(&$vars, $hook) {
  static $already_called = FALSE;
  static $skinr_data;
  static $info;
  static $current_theme;
  static $theme_registry;

  if (!$already_called) {
    $already_called = TRUE;

    // Let's make sure this has been run. There have been
    // problems where other modules implement a theme 
    // function in their hook_init(), so we make doubly
    // sure that the includes are included.
    skinr_init();
  
    $skinr_data = skinr_fetch_data();
    $info = skinr_process_info_files();
    $current_theme = skinr_current_theme();
    $theme_registry = theme_get_registry();
  }

  ...
}

Before:

Total page generation time 2,055,029 usecs

theme_get_registry 194 calls totalling 328,355 usecs and 16.0% of total page generation time
skinr_preprocess 190 calls totalling 266,195 usecs and 13.0% of total page generation time

After:

Total page generation time 1,402,446 usecs

Over 650 milliseconds saved!

We also noticed a corresponding decrease in CPU utilization, which means the servers can scale better.

We have created patches that fix the above. You can grab them for Skinr 6.x-1.x, 6.x-2.x and 7.x-2.x from issue #1916534.

Dec 12 2012
Dec 12

We encounter this problem a lot: the extremely popular and oft-used Admin Menu module causes performance problems.

Here is an example from a site we recently did a Drupal performance assessment for.

Executed 3,169 queries in 584.11 milliseconds.
Page execution time was 4,330.86 ms.

As you can see, the number of queries per request is horrendous, and the site is a resource hog if left in that state.

There were several reasons causing this excessive number of queries, leading to excessive page load times, general slowness, and high resource usage.

The key here is that the context module was doing a cache_set() and that was triggering Admin Menu to rebuild its menus.

We diagnosed the problem, and were able to get over it by disabling the following modules:

  • ND Context
  • Context Layout
  • Context UI
  • Context
  • Admin Menu
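If you manage the site with Drush, disabling them might look like the following. The module machine names are assumptions based on the project names above, so verify them against your own modules list first:

$ drush -y dis nd_context context_layout context_ui context admin_menu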

After disabling the above modules, we got a much better response time and far fewer queries, as follows:

Executed 245 queries in 59.41 milliseconds.
Page execution time was 866.24 ms.

Orders of magnitude better ...

We are told, but have not verified, that the 3.x branch is supposed to fix some of the performance issues of Admin Menu.

Dec 03 2012
Dec 03

We were recently troubleshooting a client site running Drupal 7.x, and the main complaint was high memory usage on all the site's pages.

We were able to diagnose and solve two main causes, ranging from the common to the unusual.

This is a Drupal 7 Commerce site with 173 modules, and 3 themes enabled. Apache Solr is used for search, and there is custom code to talk over the network to a non-Drupal backend server.

The site runs on a Debian Squeeze Xen VPS.

For most of the site's pages, the client was seeing high memory usage, as follows:

Problem: high memory usage

When every page load has extremely excessive memory usage, this can prevent the site from scaling well, since the server has to have enough memory to cope with many pages being generated at the same time, each using lots of memory.

The first access to a page, where APC has not yet cached anything, would look like this in devel:

Memory used at: devel_boot()=6.92 MB, devel_shutdown()=236.03 MB, PHP peak=243.25 MB.

Subsequent access would show less memory usage, since APC caches the Drupal PHP files, like so:

Memory used at: devel_boot()=6.54 MB, devel_shutdown()=174.37 MB, PHP peak=175.5 MB.

Some pages even reported up to 192 MB of peak memory!

That is still excessive. For a site with that many modules, we expected that memory usage would be high, but not to that extent.

Solutions to high memory usage

Increasing APC shared memory size

First, the allocated shared memory for APC was not enough.

The apc.shm_size parameter for APC is set to the default of 32MB.

The code base with that many modules needed at least double that or more.

So, increasing this to 96 MB solved that part.

To do so on Debian or Ubuntu, change the following line in the file /etc/php5/apache2/conf.d/apc.ini

apc.shm_size = 96
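Since that file is read by the Apache PHP module, the new size only takes effect after restarting Apache:

# service apache2 restart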

Replacing php5-memcached with php-memcache

The other big cause of excessive memory usage was quite unusual. It was the use of php5-memcached (notice the "d") to connect PHP with the memcached daemon, rather than the more commonly used php5-memcache (without a "d").

For some unknown reason, the PHP memcached extension (from the Debian package php5-memcached) uses way more memory than the php5-memcache extension.

In order to remedy this, do the following:

$ sudo aptitude purge php5-memcached
$ sudo aptitude install php5-memcache
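After restarting Apache, you can verify which extension is now loaded (on Debian/Ubuntu the conf.d ini files are shared, so the CLI listing usually matches the Apache module):

$ sudo service apache2 restart
$ php -m | grep -i memcache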

What a difference a "d" makes!

Results

The results after doing both of the above things were dramatic. Instead of 175 MB per page, it is now a more typical figure for a complex site: 60 MB!

Memory used at: devel_boot()=2.15 MB, devel_shutdown()=58.4 MB, PHP peak=59.5 MB.

Note that these figures are not absolute, and will vary from distro to distro and server to server, depending on what modules you have enabled in PHP and Apache, and many other factors. What matters is the comparative figures, not absolute figures.

For example, the same site on an Ubuntu Server LTS 12.04, which we used for our lab servers:

Memory used at: devel_boot()=2.11 MB, devel_shutdown()=72.27 MB, PHP peak=75.75 MB.

It will be different on CentOS.

Nov 19 2012
Nov 19

The site is ScienceScape.org, a repository of scientific research going back to the 19th century, and up to the latest biotechnology and cancer research.

Update: You can watch a video of the presentation on Vimeo.

Oct 29 2012
Oct 29

But this time, it was different. Modules were not to blame.

While inspecting a site that had several performance problems for a client, we noticed that memory usage was very high. From the "top" command, the RES (resident set) field was 159 MB, far more than what it should be.

We narrowed down the problem to a view that is in a block that is visible on most pages of the site.

But the puzzling part is that the view was configured to return only 5 rows. It did not make sense for it to use that much memory.

However, when we traced the query, it was like so:

SELECT node.nid, ....
FROM node
INNER JOIN ...
ORDER BY ... DESC

No LIMIT clause was in the query!

When executing the query manually, we found that it returned 35,254 rows, with 5 columns each!

Using the script at the end of this article, we were able to measure memory usage at different steps. We inserted a views embed in the script and measured memory usage:

Before boot               0.63 MB
Boot time               445.1 ms
After Boot               51.74 MB
Boot Peak                51.80 MB
Module count            136
After query              66.37 MB
Query Peak               67.37 MB
After fetch              78.96 MB
Fetch Peak              148.69 MB

So, indeed, that view was the cause of the inflated memory usage! With memory jumping from 67 MB to 148 MB.

In this case, it turns out that the module "views_php" was definitely the culprit. Once it was disabled, the query no longer had that huge memory footprint.

Here are the results after disabling views_php:

Before boot               0.63MB
Boot time               427.1 ms
After Boot               51.68MB
Boot Peak                51.74MB
Module count            135
After query              66.31MB
Query Peak               67.31MB
After fetch              74.71MB
Fetch Peak               74.89MB

A more reasonable 75MB.

We did not dig further, but it could be that because a field of type "Global: PHP" was used, views wanted to return the entire data set and then apply the PHP to it, rather than add a LIMIT to the query before executing it.

So, watch out for those blocks that are shown on many web pages.

Baseline Memory Usage

As a general comparative reference, here are some baseline figures. These are worst case scenarios, and assume APC is off, or that this measurement is running from the command line, where APC is disabled or non-persistent. The figures would be lower from Apache when APC is enabled.

These figures will vary from site to site, and they depend on many factors. For example, what modules are enabled in Apache, what modules are enabled in PHP, ...etc.

Drupal 6 with 73 modules

Before boot:     0.63 MB
After boot:     22.52 MB
Peak memory:    22.52 MB

Drupal 7 site, with 105 modules

Before boot:     0.63 MB
After boot:     57.03 MB
Peak memory:    58.39 MB

Drupal 7 site, with 134 modules

Before boot:     0.63 MB
After boot:     58.79 MB
Peak memory:    60.28 MB

Drupal 6 site, with 381 modules

Before boot:     0.63 MB
After boot:     66.02 MB

Drupal 7 site, pristine default install, 29 modules

Now compare all the above to a pristine Drupal 7 install, which has 29 core modules installed.

Before boot     0.63 MB
Boot time     227.40 ms
After Boot     20.03 MB
Boot Peak      20.07 MB
Module count   29

Effect of APC on boot memory footprint

To see how much APC, and other opcode caches, improves these figures, compare the following:

First access after Apache restarted, for a Drupal 6 site:

Before boot     0.63 MB
Boot time     802.5 ms
After Boot     62.85 MB
Boot Peak      63.11 MB
Module count   210

Subsequent accesses, with APC caching the code:

Before boot     0.61 MB
Boot time     163.80 ms
After Boot     17.24 MB
Boot Peak      18.41 MB
Module count   210

Also, for a default Drupal 7 install, with 29 core modules. Compare to the above figures for the same site without APC.

Before boot     0.61 MB
Boot time      60.4 ms
After Boot      3.36 MB
Boot Peak       3.41 MB
Module count   29

A marked improvement! Not only in bootup time, but also a reduced memory footprint.

So always install APC, and configure it correctly on your site.
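On Debian or Ubuntu, installing it is typically a one-liner followed by an Apache restart (the package name here is the one used at the time of writing; it may differ on other distributions):

$ sudo aptitude install php-apc
$ sudo service apache2 restart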

Memory Measurement Script

Here is a script to measure your memory usage. You can run it from the command line as:

$ cd /your/document_root
$ php mem.php

Add whatever part you think is causing memory usage to skyrocket in place of the commented out section, and you can see how much is being used.

You can also add HTML line breaks to the print statement, and run it from a browser to see the effect of APC code caching as well.

<?php

define('DRUPAL_ROOT', getcwd());

function measure() {
  list($usec, $sec) = explode(" ", microtime());
  $secs = (float)$usec + (float)$sec;
  return $secs * 1000;
}

function fmt_mem($number) {
  return number_format($number/1024/1024, 2) . "MB";
}

$results = array();
$results['Before boot'] =  fmt_mem(memory_get_usage());

$start = measure();
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

$results['Boot time'] = number_format((measure() - $start), 1) . " ms";
  
$results['After Boot'] = fmt_mem(memory_get_usage());
$results['Boot Peak '] = fmt_mem(memory_get_peak_usage());
  
$count = count(module_list());
$results['Module count'] = $count;

// Testing a view using the standard method
/*
$view = views_embed_view('your_view', 'block');
*/

// Testing a view in two steps
/*
$view = views_get_view('your_view');

$results['After query'] = fmt_mem(memory_get_usage());
$results['Query Peak '] = fmt_mem(memory_get_peak_usage());

$output = $view->preview('block');
*/

$results['After fetch'] = fmt_mem(memory_get_usage());
$results['Fetch Peak '] = fmt_mem(memory_get_peak_usage());

foreach($results as $k => $v) {
  print $k . "\t\t" . $v . "\n";
}
Oct 22 2012
Oct 22

Do you have a problem with some Drupal admin pages that have a large number of input fields not saving? Does the page just return with no message confirming that the changes have been saved?

Well, this happened recently on a site that we were troubleshooting for a client.

The symptoms were: trying to save a page that has lots of input fields a) comes back with no message about the changes being saved, and b) the changed values were not saved.

The site had 210 enabled modules, 14 user roles defined, and 84 content types, with 76 content fields!

For example, take the permissions page at admin/users/permissions. If the site has lots of content types, each with lots of fields, then modules that define permissions per content type and per field have to define reams and reams of permissions.

For this site, it was the following modules, combined with the number of content types and fields, that caused the permissions to grow like that.

  • node
  • actions_permissions
  • publishcontent
  • content_permissions

Let us verify that by saving the permissions page as HTML, and then doing some analysis:

$ grep -c 'input type="checkbox"' permissions.html 
20748

Look at that: 20,748 checkboxes!

Let us see how many permissions we have:

$ grep -c 'class="permission"' permissions.html        
1482

Yup! That is 1,482 permissions!

If you multiply 1482 X 14 roles = 20,748 total checkboxes!

The root cause for this was twofold: one part on the PHP side, and the other on Apache's side.

Configuring PHP to accept more input variables

The default value for the number of input variables (max_input_vars) in PHP is 1000. While this is sufficient for normal sites, it is not so for sites that overuse (misuse/abuse?) Drupal features.

You need to increase the number of input variables in PHP:

To verify that this is your problem, look in your web server's error log for something similar to this error message:

mod_fcgid: stderr: PHP Warning: Unknown: Input variables exceeded 1000. To increase the limit change max_input_vars in php.ini. in Unknown on line 0

Just add the following to your php.ini file:

max_input_vars = 1500

If you have the Suhosin enhanced security extension for PHP, then you need to add these as well:

suhosin.post.max_vars = 1500
suhosin.request.max_vars = 1500

Then, restart your web server.

You should be able to save the page now, and get a confirmation message, and see that your changes have "stuck".

But wait a minute: how come there are over 20,000 checkboxes, yet it works with a limit of only 1,500?

The answer is that Drupal uses input value arrays for most fields, so it is not a one-to-one relationship between the number of checkboxes and the number of input variables.

Configuring FastCGI for large input

If you are using FastCGI, for example mod_fastcgi or fcgid, then pages would not save even if you implement the above changes. The reason is that with that many input fields, you overflow the default maximum size of requests between Apache (or nginx) and PHP over the FastCGI protocol.

Look in your server's error log for an error message like this one:

mod_fcgid: HTTP request length 131998 (so far) exceeds MaxRequestLen (131072)

Normally, either you will see the errors in the web server's error log, or you will see them right there on the page. But we have had cases where low cost web hosts don't log errors at all anywhere.

The default is 128 Kilobytes (128 X 1024 = 131,072 bytes), and was not enough for this huge number of fields.

To confirm that you are running FastCGI, go to /admin/reports/status/php. If "Server API" is set to "CGI/FastCGI", then continue with the next step.

The fix is easy, and would go under either FastCGI or fcgid, as the case may be with your setup:

For example if you are using fcgid, you would add that under the IfModule mod_fcgid.c section:

  FcgidMaxRequestLen  524288
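On Debian or Ubuntu with mod_fcgid, for example, the directive could be appended to the module's configuration file, followed by an Apache restart. The file path is an assumption; adjust it to your setup:

# cat >> /etc/apache2/mods-available/fcgid.conf <<'EOF'
<IfModule mod_fcgid.c>
  FcgidMaxRequestLen 524288
</IfModule>
EOF
# service apache2 restart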

If you are using the older FastCGI, then you need to add that under the IfModule mod_fastcgi.c section.

Once the above was changed, the page displayed the reassuring message "The changes have been saved", and, combined with the max_input_vars change above, the values were saved correctly.
