Feb 26 2018

A recent performance assessment consulting project showed that the main page browsed by the client's logged in users was slow. Tuning the server for memory and disk throughput helped somewhat, but did not fully eliminate the issue.

The page in question was a view, and its total load time was around 2.75 seconds.

The main query was not efficient, with lots of left joins, and lots of filtering criteria:

SELECT node.nid AS nid,
... AS ...
... AS ...
'node' AS field_data_field_aaa_node_entity_type,
'node' AS field_data_field_bbb_node_entity_type,
'node' AS field_data_field_ccc_node_entity_type,
... AS ...
FROM node
INNER JOIN ... ON node.uid = ...
LEFT JOIN ... ON ... = ...  AND ... = ...
LEFT JOIN ... ON ... = ... AND (... = '12'
OR ... = '11'
OR ... = '15'
OR ... = '24')
WHERE (( (node.status = '1')
AND (node.type IN ('something'))
AND (... <> '0')
AND ((... <> '1') )
AND ((... = '4'))
AND (... IS NULL ) ))
ORDER  BY  ... DESC
LIMIT  51   OFFSET 0

That caused the first pass to sift through over 24,000 rows, using both a filesort and a temporary table; both operations are disk intensive.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ...
   partitions: NULL
         type: range
possible_keys: PRIMARY,...
          key: rid
      key_len: 8
          ref: NULL
         rows: 24039
     filtered: 100.00
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: ...
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,status
          key: PRIMARY
      key_len: 4
          ref: test43....
         rows: 1
     filtered: 50.00
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
   partitions: NULL
         type: ref
possible_keys: uid,status,type,node_status_type
          key: uid
      key_len: 4
          ref: test43....
         rows: 5
     filtered: 12.18
        Extra: Using index condition; Using where
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: ...
   partitions: NULL
         type: ref
possible_keys: PRIMARY,...
          key: PRIMARY
      key_len: 4
          ref: test43....
         rows: 2
     filtered: 54.50
        Extra: Using where; Not exists; Using index

But here is the puzzle: this query took 250 to 450 milliseconds at most.

Where did the rest of the 2,750 milliseconds go?

To find out, we use xhprof, the profiler for PHP.
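
For reference, here is a minimal way to capture a profile by hand with the xhprof extension's standard API (how the profiler is wired into the site, e.g. via a module or an auto_prepend_file, varies and is not shown here):

// Start profiling CPU and memory at the beginning of the request.
xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);

// ... Drupal bootstraps and builds the page here ...

// Stop profiling and collect the per-function timing data.
$xhprof_data = xhprof_disable();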

In the screenshot below, you can see that the total page processing time (Total Inc. Wall Time, top right) is 2,732 milliseconds.

Out of that, 85% is spent in database queries (252 queries in total, totaling 2,326 milliseconds of Excl. Wall Time).

What are these queries?

They are queries to other tables in the database to retrieve fields for each row.

For example, if you have a product view, with certain criteria, the result still has to get the product name, its price, its image, ...etc.

All these queries add up, especially when you are loading 50 rows. The time needed to retrieve each field and render it for output is multiplied by the number of rows retrieved.

So, how do you mitigate that overhead? There are several ways:

  • Reduce the number of rows returned by the view. For example, instead of 50, make it 25. That would halve the number of queries (and processing) needed to produce the page.
  • If the query is the same for all logged in users, enable Views caching (under Advanced when you edit the view), turning on both Query Result and Rendered Output caching. Use time-based caching for as long as is practical for your site (e.g. if you add products or change prices only once a day, you can cache the results for 20 hours or more); see the sketch after this list.
  • Use a fast caching layer, such as the memcache module, instead of the default database caching, which is slow for a site with many logged in users.
  • Use the Views Lite Pager module to eliminate COUNT queries.
  • Consider alternate approaches to views, such as Apache Solr faceted search, which has much better performance than MySQL based solutions, because it builds an efficient index.
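
For the Views caching point above, this is roughly what time-based caching looks like in an exported Drupal 7 view (a hedged sketch; the keys come from the Views time-based cache plugin, the 72000-second lifespans correspond to the 20 hours mentioned above, and your display options will differ):

// Hypothetical excerpt of an exported view: cache query results and
// rendered output for 20 hours each.
$handler->display->display_options['cache']['type'] = 'time';
$handler->display->display_options['cache']['results_lifespan'] = 72000;
$handler->display->display_options['cache']['output_lifespan'] = 72000;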

By implementing all of the above for the client in question, except the last item, we were able to bring the view page down from 2,730 milliseconds to 700-800 milliseconds of response time.

Scalability was much better as well: the same server could now handle more logged in users.

Feb 20 2018

On many occasions, we see web site performance suffering due to misconfiguration or oversight of system resources. Here is an example where RAM and disk I/O severely impacted web site performance, and how we fixed them.

A recent project for a client who had bad site performance uncovered issues within the application itself, i.e. how the Drupal site was put together. However, overcoming those issues was not enough to achieve the required scalability with several hundred logged in users on the site at the same time.

First, regarding memory: the site was configured with too many PHP-FPM processes, which left no room in memory for the filesystem buffers and cache; these help a lot with disk I/O load.

Here is a partial display from when we were monitoring the server before we fixed it:

   used  buffers  cache  free
  7112M    8892k   746M  119M
  7087M    9204k   738M  151M
  7081M    9256k   770M  125M
  7076M    4436k   768M  136M
  7087M    4556k   760M  133M

As you can see, the buffers + cache + free memory all amount to less than 1 GB of the total RAM, while the used RAM is over 7 GB.

We calculated how much RAM is really needed by watching the main components on the server. In this case the calculation was:

Memcache + MySQL + (Apache2 X number of instances) + (PHP-FPM X number of instances)

We then adjusted the number of PHP-FPM processes down to a reasonable figure, so that the total application RAM is no more than 70% of the server's total RAM.
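
To illustrate with assumed numbers (a hypothetical 8 GB server and typical per-process sizes, not the client's actual figures):

# Target: application RAM no more than 70% of 8192 MB, i.e. about 5,700 MB
#   Memcached                            512 MB
#   MySQL                              2,048 MB
#   Apache2:  10 instances x  20 MB  =   200 MB
#   PHP-FPM:  40 instances x  70 MB  = 2,800 MB
#   Total                             ~5,560 MB  (leaves ~2.6 GB for buffers and cache)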

The result is as follows:

   used  buffers  cache  free
  1858M    50.9M  1793M  4283M
  1880M    51.2M  1795M  4258M
  1840M    52.1M  1815M  4278M
  1813M    52.4M  1815M  4304M

As you can see, used memory is now 1.8 GB instead of 7 GB. Free memory will slowly be used by cache and buffers, making I/O operations much faster.

Another issue with the server, partially caused by the above lack of cache and buffers, but also by forgotten settings, was a severe bottleneck in disk I/O performance. The disk was so tied up that everything had to wait: I/O wait was 30%, as seen in top and htop. This is very, very high; it should usually be no more than 1 or 2% at most.

We also observed excessive disk reads and writes, as follows:

  disk read  disk write  i/o read  i/o write
      5199k       1269k       196       59.9
      1731k       1045k        80       50.7
      7013k       1106k       286       55.2
        23M       1168k       607       58.4
      9121k       1369k       358       59.7

Upon investigating, we found that the rules_debug_log setting was on. The site had 130 enabled rules, and the syslog module was enabled, so there was a file under /var/log/ growing by over a GB per day. Writing rules debug output on every page load tied up the disk when a few hundred users were on the site.
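
Turning this off can be done from the Rules settings page, or with Drush (a sketch; rules_debug_log is the variable name that turned up in our investigation):

drush vset rules_debug_log 0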

After disabling the rules debug log setting, I/O wait went down to 1.3%. A significant improvement!

Here are the disk I/O figures after the fix:

  disk read  disk write  i/o read  i/o write
       192k        429k      10.1       27.7
       292k        334k      16.0       26.3
      2336k        429k      83.6       30.7
        85k        742k      4.53       30.8

Now the site has response times of 1 second or less, instead of 3-4 seconds.
Feb 12 2018

For all of the sites we consult on and manage, we use the excellent memcache module, which replaces core's database caching. Database caching works for simple, low traffic sites, but cannot scale for heavy traffic or complex sites.

Recently we were asked to consult on the slow performance of a site with an all authenticated audience. The site is indeed complex, with over 235 enabled modules, 130 enabled rules, and 110 views.

The site was moved from a dedicated server to an Amazon AWS cluster, with the site on one EC2 instance, the database on an RDS instance, and memcache on a third instance. This move was made in the hope that Amazon's AWS would improve performance.

However, to their dismay, performance after the move went from bad (~ 5 to 7 seconds to load a page) to worse (~ 15 seconds).

We recommended to the client that we perform a Drupal performance assessment on the site. We got a copy of the site, recreated it in our lab, and proceeded with the investigation.

After some investigation, we found that having memcached on the same server that runs PHP made a significant performance improvement. This was logical: on a site with all users logged in, there are plenty of calls to cache_get() and cache_set(), and each of these calls has to make a round trip over the network to the other server and back, even if it returns nothing. The same goes for database queries.

Instead of 29.0, 15.8, 15.9, and 15.5 seconds for different pages on the live site, the page loads in our lab on a single medium server were: 3.6, 5.5, 1.4, 1.5 and 0.6 seconds.

However, this victory was short lived. Once we put load on the web site, other bottlenecks were encountered.

We started with 200 concurrent logged in users on our lab server, and kept investigating and tweaking, running a performance test after each tweak to assess its impact.

The initial figures were an average response time of 13.93 seconds, and only 6,200 page load attempts for the 200 users (with 436 timeouts).

So, what did we find? We found that the culprit was memcache! Yes, the very thing that helps a site scale was severely impeding scalability!

Why is this so? Because of the way it was configured for locking and stampede protection.

The settings.php for the site had these two lines:

$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';
$conf['memcache_stampede_protection'] = TRUE;

Look at the memcache.inc, lines 180 to 201, in the valid() function:

if (!$cache) {
  if (variable_get('memcache_stampede_protection', FALSE) ... ) {
    static $lock_count = 0;
    $lock_count++;
    if ($lock_count <= variable_get('memcache_stampede_wait_limit', 3)) {
      lock_wait(..., variable_get('memcache_stampede_wait_time', 5));
      $cache = ...;
    }
  }
}

The above is for version 7.x of the module, and the same logic is in the Drupal 8.x branch as well.

If memcache_stampede_protection is set to TRUE, then there will be up to three attempts, with a 5 second delay each. The total then can be as high as 15 seconds when the site is busy, which is exactly what we were seeing. Most of the PHP processes will be waiting for the lock, and no free PHP processes will be available to serve requests from other site visitors.

One possible solution is to lower the number of attempts to 2 (memcache_stampede_wait_limit = 2), and the wait time for each attempt to 1 second (memcache_stampede_wait_time = 1), but that is still 2 seconds of wait!
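
In settings.php, that tweak uses the same variable names that appear in the module code above:

// Reduce (but not eliminate) the stampede wait: 2 attempts of 1 second each.
$conf['memcache_stampede_wait_limit'] = 2;
$conf['memcache_stampede_wait_time'] = 1;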

We did exactly that, and re-ran our tests.

The figures were much better: for 200 concurrent logged in users, the average response time was 2.89 seconds, and a total of 10,042 page loads, with 100% success (i.e. no timeouts).

But a response time of ~ 3 seconds is still slow, and there is still the possibility of a pile up condition when all PHP processes are waiting.

So, we decided that the best course of action was not to use memcache's locking at all, nor its stampede protection, and hence commented out the two lines in settings.php:

//$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';
//$conf['memcache_stampede_protection'] = TRUE;

The results were much better: for 200 concurrent logged in users, the average response time was 1.09 seconds, and a total of 11,196 pages with 100% success rate (no timeouts).

At this point, the server's CPU utilization was 45-55%, meaning that it can handle more users.

But wait! We forgot something: the last test was run with the xhprof profiler left enabled by mistake from profiling the web site. That uses up a lot of CPU time, and causes heavy writes to the disk as well.

So we disabled xhprof and ran another test, and the results were fantastic: for 200 concurrent logged in users, the average response time was just 0.20 seconds, and a total of 11,892 page loads with a 100% success rate (no timeouts).

Eureka!

Note that for the above tests, we disabled all the rules, disabled a couple of modules that have slow queries, and commented out the history table update query in core's node.module:node_tag_new(). So, these figures are idealized somewhat.

Also, this is a server that is not particularly new (made in 2013), and uses regular spinning disks (not SSDs).

For now, the main bottleneck has been uncovered, and overcome. The site is now only limited by other factors, such as available CPU, speed of its disks, complexity of modules, rules and views ...etc.

So, check your settings.php to see if you have memcache_stampede_protection enabled, and disable it if it is.
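
A quick way to check is a plain grep against your settings file (adjust the path to your own settings.php):

grep -n 'memcache_stampede_protection' sites/default/settings.php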

May 23 2017

Acquia has announced the end of life for Mollom, the comment spam filtering service.

Mollom was created by Dries Buytaert and Benjamin Schrauwen, and launched to a few beta testers (including myself) in 2007. Mollom was acquired by Acquia in 2012.

The service worked generally well, with the occasional spam comment getting through. The stated reason for stopping the service is that spammers have gotten more sophisticated, and that perhaps means that Mollom needs to try harder to keep up with the ever changing tactics. Much like computer viruses and malware, spam (email or comments) is an arms race scenario.

The recommended alternative by Acquia is a combination of reCAPTCHA and Honeypot.

But there is a problem with this combination: reCAPTCHA, like all modules that depend on the CAPTCHA module, disables the page cache for any form that has CAPTCHA enabled.

This is due to this piece of code in captcha.module:

// Prevent caching of the page with CAPTCHA elements.
// This needs to be done even if the CAPTCHA will be omitted later:
// other untrusted users should not get a cached page when
// the current untrusted user can skip the current CAPTCHA.
drupal_page_is_cacheable(FALSE);

Another alternative that we have been using, and that does not disable the page cache, is the Antibot module.

To install the antibot module, you can use your git repository, or the following drush commands:

drush dis mollom
drush dl antibot
drush en antibot

Visit the configuration page for Antibot if you want to add more forms that use the module, or disable it for other forms. The default settings work for comments, user registrations, and user logins.

Because of the above mentioned arms race situation, expect spammers to come up with circumvention techniques at some point in the future, and there will be a need to use other measures, be they in antibot, or other alternatives.

Feb 07 2017

Secure Sockets Layer (SSL) is the protocol that allows web sites to serve traffic over HTTPS. This provides end-to-end encryption between the two endpoints (the browser and the web server). The benefit of using HTTPS is that traffic between the two endpoints cannot be deciphered by anyone snooping on the connection. This reduces the odds of exposing sensitive information such as passwords, or of getting the web site hacked by malicious parties. Google has also indicated that sites serving content exclusively over HTTPS get a small bump in page rank.

Historically, SSL certificate issuers have served a secondary purpose: identity verification. This is when the issuing authority vouches that a host or a domain is indeed owned by the entity that requests the SSL certificate for it. This is traditionally done by submitting paper work including government issued documentation, incorporation certificates, ...etc.

Historically, SSL certificates were costly. However, with the introduction of the Let's Encrypt initiative, functional SSL certificates are now free, and anyone who wants to use them can do so, minus the identity verification part, at least for now.

Implementing HTTPS with Drupal can be straightforward with low traffic web sites. The SSL certificate is installed in the web server, and that is about it. With larger web sites that handle a lot of traffic, a caching layer is almost always present. This caching layer is often Varnish. Varnish does not handle SSL traffic, and just passes all HTTPS traffic straight to Drupal, which means a lot of CPU and I/O load.

This article will explain how to avoid this drawback, and how to have it all: caching in Varnish, plus serving all the site using HTTPS.

The idea is quite simple in principle: terminate SSL before Varnish, which will never know that the content is encrypted upstream. Then pass the traffic from the encryptor/decryptor to Varnish on port 81. From there, Varnish will pass it to Apache on port 8080.

We assume you are deploying all this on Ubuntu 16.04 LTS, which uses Varnish 4.0, although the same can be applied to Ubuntu 14.04 LTS with Varnish 3.0.

Note that we use either one of two possible SSL termination daemons: Pound and Nginx. Each is better in certain cases, but for the most part, they are interchangeable.

A secondary purpose of this article is to document how to create SSL bundles for intermediate certificate authorities, and how to generate a combined certificate/private key file. We document this because online information on this topic is sparse.

Install Pound

aptitude install pound

Preparing the SSL certificates for Pound

Pound does not allow the private key to be in a separate file or directory from the certificate itself. It has to be included with the main certificate, and with intermediate certificate authorities (if there are any).

We create a directory for the certificates:

mkdir /etc/pound/certs

cd /etc/pound/certs

We then create a bundle for the intermediate certificate authority. For example, if we are using NameCheap for domain registration, they use COMODO for certificates, and we need to do the following. The order is important.

cat COMODORSADomainValidationSecureServerCA.crt \
  COMODORSAAddTrustCA.crt \
  AddTrustExternalCARoot.crt >> bundle.crt

Then, as we said earlier, we need to create a host certificate that includes the private key.

cat example_com.key example_com.crt > host.pem

And we make sure the host certificate (which contains the private key as well) and the bundle, are readable only to root.

chmod 600 bundle.crt host.pem

Configure Pound

We then edit /etc/pound/pound.cfg

# We have to increase this from the default 128, since it is not enough
# for medium sized sites, where lots of connections are coming in
Threads 3000

# Listener for unencrypted HTTP traffic
ListenHTTP
  Address 0.0.0.0
  Port    80
 
  # If you have other hosts add them here
  Service
    HeadRequire "Host: admin.example.com"
    Backend
      Address 127.0.0.1
      Port 81
    End
  End
 
  # Redirect http to https
  Service
    HeadRequire "Host: example.com"
    Redirect "https://example.com"
  End
 
  # Redirect from www to domain, also https
  Service
    HeadRequire "Host: www.example.com"
    Redirect "https://example.com"
  End
End

# Listener for encrypted HTTP traffic
ListenHTTPS
  Address 0.0.0.0
  Port    443
  # Add headers that Varnish will pass to Drupal, and Drupal will use to switch to HTTPS
  HeadRemove      "X-Forwarded-Proto"
  AddHeader       "X-Forwarded-Proto: https"
 
  # The SSL certificate, and the bundle containing intermediate certificates
  Cert      "/etc/pound/certs/host.pem"
  CAList    "/etc/pound/certs/bundle.crt"
 
  # Send all requests to Varnish
  Service
    HeadRequire "Host: example.com"
    Backend
      Address 127.0.0.1
      Port 81
    End
  End
 
  # Redirect www to the domain
  Service
    HeadRequire "Host: www.example.com.*"
    Redirect "https://example.com"
  End
End

Depending on the amount of concurrent traffic that your site gets, you may need to increase the number of open files for Pound. You also want to increase the backend timeout and the browser timeout.

To do this, edit the file /etc/default/pound, and add the following lines:

# Timeout value, for browsers
Client  45

# Timeout value, for backend
Timeout 40

# Increase the number of open files, so pound does not log errors like:
# "HTTP Acces: Too many open files"
ulimit -n 20000

You also need to create a run directory for pound and change the owner for it, since the pound package does not do that automatically.

mkdir /var/run/pound
chown www-data.www-data /var/run/pound

Do not forget to change the 'startup' line from 0 to 1, otherwise pound will not start.
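
That line lives in the same /etc/default/pound file edited above; after the change it should read:

# /etc/default/pound
startup=1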

Configure SSL Termination for Drupal using Nginx

You may want to use Nginx instead of the simpler Pound in certain cases.

For example, if you want to process your site's traffic with analysis tools such as Awstats, you need to capture those logs. Although Pound can output logs in Apache combined format, it also writes errors to the same log, at least on Ubuntu 16.04, and that makes these logs unusable by analysis tools.

Also, Pound has only a basic mechanism for redirecting plain HTTP URLs to the corresponding HTTPS URLs. You cannot do more complex rewrites, or use more configurable and flexible options.

First install Nginx:

aptitude install nginx

Create a new virtual host under /etc/nginx/sites-available/example.com, with this in it:

# Redirect http www to https no-www
server {
  server_name www.example.com;
  access_log off;
  return 301 https://example.com$request_uri;
}

# Redirect http no-www to https no-www
server {
  listen      80 default_server;
  listen [::]:80 default_server;
  server_name example.com;
  access_log off;
  return 301 https://$host$request_uri;
}

# Redirect http www to https no-www
server {
  listen      443 ssl;
  server_name www.example.com;
  access_log off;
  return 301 https://example.com$request_uri;
}

server {
  listen      443 ssl default_server;
  listen [::]:443 ssl default_server ipv6only=on;

  server_name example.com;

  # We capture the log, so we can feed it to analysis tools, e.g. Awstats
  # This will be more comprehensive than what Apache captures, since Varnish
  # will end up removing a lot of the traffic from Apache
  #
  # Replace this line with: 'access_log off' if logging ties up the disk
  access_log /var/log/nginx/access-example.log;

  ssl on;

  # Must contain the bundle if it is a chained certificate. Order is important.
  # cat example.com.crt bundle.crt > example.com.chained.crt 
  ssl_certificate      /etc/ssl/certs/example.com.chained.crt;
  ssl_certificate_key  /etc/ssl/private/example.com.key;

  # Test certificate
  #ssl_certificate     /etc/ssl/certs/ssl-cert-snakeoil.pem;
  #ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;

  # Restrict to secure protocols, depending on whether you have visitors
  # from older browsers
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

  # Restrict ciphers to known secure ones
  ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256;

  ssl_prefer_server_ciphers on;
  ssl_ecdh_curve secp384r1;
  ssl_stapling on;
  ssl_stapling_verify on;

  add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload";
  add_header X-Frame-Options SAMEORIGIN;
  add_header X-Content-Type-Options nosniff;

  location / {
    proxy_pass                         http://127.0.0.1:81;
    proxy_read_timeout                 90;
    proxy_connect_timeout              90;
    proxy_redirect                     off;

    proxy_set_header Host              $host;
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-Port  443;
   
    proxy_buffers                      8 24k;
    proxy_buffer_size                  2k;
  }
}

Then link this to an entry in the sites-enabled directory

cd /etc/nginx/sites-enabled

ln -s /etc/nginx/sites-available/example.com

Then we add some performance tuning parameters by editing the file /etc/nginx/nginx.conf. These make sure that we can handle higher traffic than the default configuration allows:

At the top of the file, modify these two parameters, or add them if they are not present:

 
worker_processes       auto;
worker_rlimit_nofile   20000;

Then, under the 'events' section, add or modify to look like the following:

events {
  use epoll;
  worker_connections 19000;
  multi_accept       on;
}

And under the 'http' section, make sure the following parameters are added or modified to these values:

http {
  sendfile             on;
  tcp_nopush           on;
  tcp_nodelay          on;
  keepalive_timeout    80;
  keepalive_requests   10000;
  client_max_body_size 50m;
}
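
Before restarting Nginx, it is worth verifying the configuration syntax with its built-in check:

nginx -t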

We now have either Pound or Nginx in place, handling port 443 with SSL certificates, and forwarding the plain text traffic to Varnish.

Change Varnish configuration to use an alternative port

First, we need to make Varnish work on port 81.

On 16.04 LTS, we edit the file: /lib/systemd/system/varnish.service. If you are using Ubuntu 14.04 LTS, then the changes should go into /etc/default/varnish instead.

Change the 'ExecStart' line to set the following (see the sketch after this list):

  • Port that Varnish will listen on (-a :81)
  • Varnish VCL configuration file name (/etc/varnish/main.vcl)
  • Size of the cache (-s malloc,1536m)
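
Putting it together, the modified line would look something like this (a sketch; flags other than the three above follow the common packaged defaults and may differ on your system):

ExecStart=/usr/sbin/varnishd -F -a :81 -T localhost:6082 \
  -f /etc/varnish/main.vcl -S /etc/varnish/secret -s malloc,1536m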

You can also change the type of Varnish cache storage, e.g. to be on disk if it is too big to fit in memory (-s file,/var/cache/varnish/varnish_file.bin,200GB,8K). Make sure to create the directory and assign it the correct owner and permissions.

We use a different configuration file name so as to not overwrite the default one, and to make updates easier (no questions asked during updates to resolve differences).

In order to inform systemd that we changed a daemon startup unit, we need to issue the following command:

systemctl daemon-reload

Add Varnish configuration for SSL

We add the following section to the Varnish VCL configuration file. It includes the X-Forwarded-Proto header (set by Pound or Nginx) in the cache hash, so HTTPS pages are cached separately, and Drupal will enforce HTTPS for those requests.

# Routine used to determine the cache key if storing/retrieving a cached page.
sub vcl_hash {

  # This section is for Pound
  hash_data(req.url);

  if (req.http.host) {
    hash_data(req.http.host);
  }
  else {
    hash_data(server.ip);
  }

  # Use special internal SSL hash for https content
  # X-Forwarded-Proto is set to https by Pound
  if (req.http.X-Forwarded-Proto ~ "https") {
    hash_data(req.http.X-Forwarded-Proto);
  }
}

Another change you have to make in Varnish's VCL is to find this line:

set req.http.Cookie = regsuball(req.http.Cookie, ";(SESS[a-z0-9]+|NO_CACHE)=", "; \1=");

And replace it with this line:

set req.http.Cookie = regsuball(req.http.Cookie, ";(S?SESS[a-z0-9]+|NO_CACHE)=", "; \1=");

This is done to ensure that Varnish will pass through the secure session cookies to the web server.

Change Apache's Configuration

If you had SSL enabled in Apache, you have to disable it so that only Pound (or Nginx) is listening on port 443. If you do not do this, Pound or Nginx will refuse to start with an 'Address already in use' error.

First disable the Apache SSL module.

a2dismod ssl

We also need to make Apache listen on port 8080, which Varnish will use to forward traffic to. On Ubuntu, this goes in /etc/apache2/ports.conf:

 
Listen 8080

And finally, your VirtualHost directives should listen on port 8080, as follows. It is also best to restrict listening to the localhost interface, so outside connections cannot be made to the plain text virtual hosts.


<VirtualHost 127.0.0.1:8080>
...
</VirtualHost>

The rest of Apache's configuration is detailed in an earlier article on Apache MPM Worker threaded server, with PHP-FPM.

Configure Drupal for Varnish and SSL Termination

We are not done yet. In order for Drupal to know that it should only use SSL for this page request, and not allow connections from plain HTTP, we have to add the following to settings.php:

// Force HTTPS, since we are using SSL exclusively
if (isset($_SERVER['HTTP_X_FORWARDED_PROTO'])) {
  if ($_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https') {
    $_SERVER['HTTPS'] = 'on';
  }
}

If you have not already done so, you also have to enable page cache, and set the external cache age for cached pages. This is just a starting point, assuming Drupal 7.x, and you need to modify these accordingly depending on your specific setup.

// Enable page caching
$conf['cache'] = 1;
// Enable block cache
$conf['block_cache'] = 1;
// Make sure that Memcache does not cache pages
$conf['cache_lifetime'] = 0;
// Enable external page caching via HTTP headers (e.g. in Varnish)
// Adjust the value for the maximum time to allow pages to stay in Varnish
$conf['page_cache_maximum_age'] = 86400;
// Page caching without bootstrapping the database, nor invoking hooks
$conf['page_cache_without_database'] = TRUE;
// Nor do we invoke hooks for cached pages
$conf['page_cache_invoke_hooks'] = FALSE;

// Memcache layer
$conf['cache_backends'][]    = './sites/all/modules/contrib/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
$conf['memcache_servers']    = array('127.0.0.1:11211' => 'default');
$conf['memcache_key_prefix'] = 'live';

And that is it for the configuration part.

You now need to clear all caches:

drush cc all

Then restart all the daemons:

service pound restart
service nginx restart # If you use nginx instead of pound
service varnish restart
service apache2 restart

Check that all daemons have indeed restarted, and that there are no errors in the logs. Then test for proper SSL recognition in the browser, and for correct redirects.

For The Extreme Minimalist: Eliminating Various Layers

The above solution stack works trouble free, and has been tested with several sites. However, there is room for eliminating different layers. For example, instead of having Apache as the backend web server, it can be replaced with Nginx itself, listening on both port 443 (SSL) and port 8080 (backend), with Varnish in between. In fact, it is possible to remove Varnish altogether and use Nginx's FastCGI cache instead. So Nginx listens on port 443, decrypts the connection, and passes the request to its own cache, which decides what is served from cache versus what gets passed through to Nginx itself on port 8080, which hands it over to PHP and Drupal.

Don't let the words 'spaghetti' and 'incest' take over your mind! Eventually, all the oddities will be ironed out, and this will be a viable solution. There are certain things that are much better known in Apache, as far as Drupal is concerned, such as URL rewriting for clean URLs. There are also things handled in Apache's .htaccess whose Nginx equivalents need to gain wider usage within the community before an Nginx-only solution becomes the norm for web server plus cache plus SSL.

Apache MPM Worker (multithreaded) with PHP-FPM is a very low overhead, high performance solution, and we will continue to use it until the Nginx-only approach matures and gains wider use and support within the Drupal community.

Nov 21 2016

The other day, we were helping a long time client with setting up a new development server configured with Ubuntu Server LTS 16.04, which comes with PHP 7.x. Benchmarks of PHP 7.x show that it is faster than any PHP 5.x version by a measurable margin, hence the client's attempt to move to the newer version of Ubuntu and PHP.

But benchmarking the new server against the existing one, which runs Ubuntu Server LTS 14.04, showed that the new server was extremely slow compared to the existing one.

We set out to investigate where the slowdown was coming from, running a quick version of our Drupal Performance Assessment, which is one of our many Drupal services.

We started by checking the logs, which on this site go to syslog, so as not to bog down the database.

We saw the following messages:

Nov 19 16:36:31 localhost drupal: http://dev5.example.com|1478727391|Apache Solr|1.1.1.1|http://dev5.example.com/some-page||0||HTTP Status: 0; Message: Request failed: Connection timed out. TCP Connect Timeout.; Response: ; Request: GET /solr/wso/select?start=0&rows=8&fq=bundle%3A%28blog%20OR%20faq%20OR%20...&wt=json&json.nl=map HTTP/1.0#015#012User-Agent: Drupal (+http://drupal.org/)#015#012Connection: close#015#012Host: solr2.example.com:8983#015#012#015#012; Caller: module_invoke() (line 926 of /.../www/includes/module.inc)

Nov 19 16:36:31 localhost drupal: http://dev5.example.com|1478727391|Apache Solr|1.1.1.1|http://dev5.example.com/some-page||0||HTTP 0; Request failed: Connection timed out. TCP Connect Timeout.

As you can see, the server is timing out trying to connect to the Apache Solr service. The existing server does not have this message.

But is this just an error, or something that impacts performance?

There are a variety of tools that can help profile PHP's page load and see where time is being spent. The most popular tool is xhprof, but setting it up is time consuming, and we needed something simpler.

So, we settled on a tried and tested Linux tool, strace. This tool allows one to see what system calls a process is issuing, and to measure the time each takes.

Over the years we have used strace, with tried and tested options to give us what we need quickly, without having to install extensions in PHP, and the like.

The command line we use discovers the process IDs of PHP on its own, and then traces all of them:

Caution: NEVER USE THIS ON A LIVE SERVER! You will slow it down considerably, both the CPU and disk access!

strace -f -tt -s 1024 -o /tmp/trace -p `pidof 'php-fpm: pool www' |
sed -e 's/ /,/g'`

Let us explain the options for a bit:

  • -f means follow children. This means that if a process forks and creates a new process, the new process will be traced as well.
  • -tt outputs time stamps with microsecond accuracy. This is extremely useful for knowing where time is spent.
  • -s sets the maximum length of strings that are printed, so for system calls such as read, more of the data is displayed.
  • -o gives the file name to output the trace to.
  • -p is a list of process IDs to trace. In this case, we use some shell magic to get the process IDs of the PHP daemon (since we are running PHP-FPM), and then replace the spaces with commas. That way, we don't have to find out the process IDs manually then enter them on the command line. A great time saver!

We run this trace and issue a single request, so the output is not huge. We terminate the strace process with the usual Ctrl-C.
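
A couple of quick commands we typically run against the output before reading it line by line (plain grep, nothing strace specific):

# Count how many select() timeouts occurred during the request
grep -c 'Timeout' /tmp/trace

# List the outgoing connections, with their timestamps, to see which backends were contacted
grep 'connect(' /tmp/trace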

We inspect the output and we find this:

9359 17:02:37.104412 connect(6, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

From the port number, we know this is a connection to the memcached daemon, which is running on the same server. No issues there.

Then we see this:

9359 17:02:37.176523 connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/mysqld/mysqld.sock"}, 29) = 0

That is the connection to MySQL's socket. No issues either.

Now, we see this (IP address obfuscated):

9359 17:02:38.178758 connect(9, {sa_family=AF_INET, sin_port=htons(8983), sin_addr=inet_addr("1.1.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

From the port number, we know that this is Apache Solr.

Then we find the following, repeating more than a hundred times:

9359 17:02:38.179042 select(10, [9], [9], [], {1, 25000}) = 0 (Timeout)
9359 17:02:39.205272 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:39.210606 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:39.235936 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:39.241262 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:39.266552 nanosleep({0, 5000000}, NULL) = 0

Until we see this:

9359 17:02:43.134691 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:43.160097 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:43.165415 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:43.190683 nanosleep({0, 5000000}, NULL) = 0

Did you catch that, or did you miss it? Look closely at the time stamps!

The PHP process spent a full 5 seconds trying to contact the Apache Solr server, but timing out!

No wonder page loading is so slow then.

In this case, there was a block of 'related content' populated from an Apache Solr query. This block was not cached, so for every logged in user, the PHP process waits for the Apache Solr server round trip time, plus query execution time. Worse still, if the server is unreachable because of network issues, it waits for the full connection timeout.

Jun 13 2016

Recently, we were reviewing the performance of a large site that has a significant portion of its traffic from logged in users. The site was suffering from a high load average during peak times.

We enabled slow query logging on the site for an entire week, using the following in my.cnf:

log_slow_queries               = 1
slow_query_log                 = 1
slow_query_log_file            = /var/log/mysql/slow-query.log
log-queries-not-using-indexes  = 1
long_query_time                = 0.100

Note that the parameter long_query_time can be a fraction of a second only on more recent versions of MySQL.

You should not set this value too low, otherwise the server's disk could be tied up logging the queries. Nor should it be too high, or you will miss most of the slow queries.

We then analyzed the logged queries after a week.
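
There are several tools for aggregating a slow query log; pt-query-digest from the Percona Toolkit is one commonly used option (shown here as a sketch, not necessarily the exact tool we used):

pt-query-digest /var/log/mysql/slow-query.log > slow-query-report.txt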

We found that the slow queries, on aggregate, examined a total of 150.18 trillion rows, and returned 838.93 million rows.

Out of the total types of queries analyzed, the top two had a disproportionate share of the total.

So these two queries combined were 63.7% of the total slow queries! That is very high, and if we were able to improve these two queries, it would be a huge win for performance and server resources.

Voting API Slow Query

The first query had to do with Voting API and Userpoints.

It was:

SELECT votingapi_vote.*
FROM votingapi_vote
WHERE  value_type = 'points'
AND tag = 'userpoints_karma'
AND uid = '75979'
AND value = '-1'
AND timestamp > '1464077478'

It hogged 45.3% of the total slow queries, and was called 367,531 times per week. It scanned over 213,000 rows every time it ran!

The query took an aggregate execution time of 90,766 seconds, with an average of 247 milliseconds per execution.

The solution was simple: create an index on the uid column:

CREATE INDEX votingapi_vote_uid ON votingapi_vote (uid);

After that was done, the query used the index and scanned only one row, and returned instantly.

Private Messaging Slow Query

The second query had to do with Privatemsg. It is:

SELECT COUNT(pmi.recipient) AS sent_count
FROM pm_message pm
INNER JOIN pm_index pmi ON pm.mid = pmi.mid
WHERE  pm.author = '394106'
AND pm.timestamp > '1463976037'
AND pm.author <> pmi.recipient

This query accounted for 18.4% of the total slow queries, and was called 32,318 times per week. It scanned over 1,350,000 rows on each execution!

The query took an aggregate execution time of 36,842 seconds, with an average of 1.14 seconds (yes, seconds!) per execution.

Again, the solution was simple: create an index on the author column.

CREATE INDEX pm_message_author ON pm_message (author);

Just like the first query, after creating the index, the query used the index and scanned only 10 rows instead of over a million. It returned instantly.

Results After Tuning

As with any analysis, comparison of the before and after data is crucial.

After letting the tuned top two offending queries run for another week, the results were extremely pleasing:

                      Before      After
Total rows examined   150.18 T    34.93 T
Total rows returned   838.93 M    500.65 M

A marked improvement!

Conclusion

With performance, the 80/20 rule applies. There are often low hanging fruit that can easily be tuned.

Do not tune based on something you read somewhere that may not apply to your site (including this and other articles on our site!).

Rather, you should do proper analysis, and reach a diagnosis based on facts and measurements, as to the cause(s) of the slowness. After that, tuning them will provide good results.

Mar 22 2016

For years, we have been using and recommending memcached for Drupal sites as its caching layer, and we wrote several articles on it, for example: configuring Drupal with multiple bins in memcached.

Memcached has the advantage of replacing core caching (which uses the database) with memory caching. It still allows modules that have hook_boot() and hook_exit() to work, unlike external cache layers such as Varnish.

However, memcached has its limitations: it is by definition transient, so rebooting wipes out the cache, and a high traffic server can suffer while the cache warms up again. It is also entirely memory resident, so to cache more items you need more RAM, which is not suitable for small servers.

For Drupal 7, there is a solution that avoids the first limitation: Redis. It provides persistence, but it does not address the second, since it is still memory resident.

The following is a detailed guide to get Redis installed and configured for your server. It assumes that you are on Ubuntu Server 14.04, or the equivalent Debian release.

Installing Redis

First, download the Drupal redis module, which should go to sites/all/modules/contrib. You can do that in many ways, here is how you would use Drush for that:

drush @live dl redis

You do not need to enable any Redis modules in Drupal.

Then, install the Redis Server itself. On Debian/Ubuntu you can do the following. On CentOS/RedHat, you should use yum.

aptitude install redis-server

Then, install PHP's Redis integration. Once you do that, you do not need to compile from source, or anything like that, as mentioned in the Redis module's README.txt file.

aptitude install php5-redis

Restart PHP, so it loads the Redis integration layer.
This assumes you are using PHP FPM:

service php5-fpm restart

If you are using PHP as an Apache module, then you need to restart it as follows:

service apache2 restart

Configuring Redis

Then in your settings.php file, you should replace the section for memcache which would be as follows:

$conf['cache_backends'][] = './sites/all/modules/contrib/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
$conf['memcache_servers'] = array('127.0.0.1:11211' => 'default');
$conf['memcache_key_prefix'] = 'site1';

And replace it with the following configuration lines:

// Redis settings
$conf['redis_client_interface'] = 'PhpRedis';
$conf['redis_client_host'] = '127.0.0.1';
$conf['lock_inc'] = 'sites/all/modules/contrib/redis/redis.lock.inc';
$conf['path_inc'] = 'sites/all/modules/contrib/redis/redis.path.inc';
$conf['cache_backends'][] = 'sites/all/modules/contrib/redis/redis.autoload.inc';
$conf['cache_default_class'] = 'Redis_Cache';
// For multisite, you must use a unique prefix for each site
$conf['cache_prefix'] = 'site1';

Cleaning Up

Once you do that, caching will start using redis. Memcached is not needed, so you should stop the daemon:

service memcached stop

And you should purge memcached as well:

aptitude purge memcached

And that is all there is to it.

Changing Redis Configuration

You can then review the /etc/redis/redis.conf file to see if you should tweak parameters more, such as changing maxmemory to limit it to a certain amount, as follows:

maxmemory 256mb

More below on this specific value.

Checking That Redis Is Working

To check that Redis is working, you can inspect that keys are being cached. For this, you can use the redis-cli tool. This tool can be used interactively, as in, you get a prompt and type commands in it, and results are returned. Or you can use the specific command as an argument to redis-cli.

For example, this command filters on a specific cache bin, the cache_bootstrap one:

$ redis-cli
127.0.0.1:6379> keys *cache_boot*

Or you can type it as:

$ redis-cli keys "*cache_boot*"

In either case, if Drupal is caching correctly, you should see output like this:

1) "site1:cache_bootstrap:lookup_cache"
2) "site2:cache_bootstrap:system_list"
3) "site3:cache_bootstrap:system_list"
4) "site3:cache_bootstrap:hook_info"
5) "site2:cache_bootstrap:variables"
...

As you can see, the key structure is simple; it is composed of the following components, separated by a colon:

  • Cache Prefix
    This is the site name in a multi site environment.
  • Cache Bin
    This is the cache table name when using the default database caching in Drupal.
  • Cache Key
    This is the unique name for the cached item. For cached pages, the URL is used, with the protocol (http or https) and the host/domain name.

You can also filter by site, using the cache_prefix:

$ redis-cli keys "*site1:cache_page*"

The output will be something like this:

1) "site1:cache_page:http://example.com/node/1"
2) "site1:cache_page:http://example.com/contact_us"
...

You can also check how many items are cached in the database:

$ redis-cli dbsize

The output will be the number of items:

(integer) 20344

Flushing The Cache

If you need to clear the cache, you can do:

$ redis-cli flushall

Checking Time To Live (TTL) For A Key

You can also check how long does a specific item stay in cache, in seconds remaining:

$ redis-cli ttl site1:cache_page:http://example.com/

The output will be the number of seconds:

(integer) 586

Getting Redis Info

You can get a lot of statistics and other information about how Redis is doing, by using the info command:

$ redis-cli info

You can check the full documentation for the info command.

But one of the important values to keep an eye on is used_memory_peak_human, which tells you the maximum memory that was used, given your site's specifics, such as the number of items cached, the rate of caching, the size of each item, ...etc.

used_memory_peak_human:256.25

You can use that value to tune the maxmemory parameter, as above.

You can decrease the Minimum Cache Lifetime under /admin/config/development/performance so that the cached items fit in the available memory, or go the other way around: allocate more memory to fit more items.

Monitoring Redis Operations In Real Time

And finally, here is a command that would show you all the operations that are being done on Redis in real time. Do not try this on a high traffic site!

$ redis-cli monitor

Performance Results

Redis performance as a page cache for Drupal is quite good, with a Time To First Byte (TTFB) of ~95 to 105 milliseconds.

Notes on Fault Resilience

One of the big selling points of Redis versus Memcached, is that the former provides cache persistence across reboots.

However, as the documentation states, the default persistence engine for Redis, RDB, can lose data on power loss. That may not be a deal breaker on its own for a caching application. However, we found from experience that the loss of some cache records is not the only problem. The real problem came when a disk failure occurred and was repaired by the hosting provider, but the site was still offline, because Redis had experienced data corruption and Drupal refused to boot normally.

The other option is using AOF, which should survive power failures, but it has some disadvantages as well, including more disk space usage, and being slower than RDB.

Alternatives To Redis and Memcached

We did fairly extensive research for Redis and Memcached alternatives with the following criteria:

  • Compatible With Redis or Memcached Protocol
    We wanted to use the same PHP extension and Drupal Redis (or Memcached) modules, and not have to write and test yet another caching module.
  • Non-Memory Resident Storage
    We wanted to reduce the memory footprint of Redis/Memcached, because they both store all of the key/value combinations in memory, while still getting acceptable performance.

The following products all claim to meet the above criteria, but none of them worked for us. They were tested on Ubuntu LTS 14.04 64-bit:

MongoDB

See the separate article on using MongoDB as the Drupal cache layer, below, for more details.

MemcacheDB

MemcacheDB is a Memcached-compatible server which uses the excellent Berkeley DB database for storage.

This MemcacheDB presentation explains what it does in detail.

It has an Ubuntu package right in the repository, so there is no need to compile from source or manually configure it. It works flawlessly. The -N option enables the DB_TXN_NOSYNC option, which makes writes to the database asynchronous, providing a huge performance improvement.

Configuration in Drupal's settings.php is very easy: it is exactly like Memcached, with only the port number changing, from 11211 to 21201.

Alas, all is not rosy: it is not really a cache layer, since it does not expire keys/values based on time, as Memcached and Redis do.

Redis NDS

Redis-NDS is a fork of Redis 2.6, patched for NDS (Naive Disk Store).

It does compile and run, but when the line: 'nds yes' is added to the configuration file, it is rejected as an invalid value. Looking briefly in the source, we also tried 'nds_enabled yes', but that was rejected as well. So we could not get it to run in NDS mode.

ARDB

ARDB is another NoSQL database that aims to be Redis protocol compatible.

We compiled this with three different storage engines: Facebook's RocksDB did not compile to begin with. Google's LevelDB compiled cleanly, and so did WiredTiger. But when trying to connect Drupal to it, Drupal hung and never came back, with both engines.

SSDB

SSDB is also another NoSQL database that tries to be Redis protocol compatible.

It compiled cleanly, but had the same symptom as ARDB: Drupal hangs and never receives back a reply from SSDB.

There are a couple of sandbox projects that aim for native integration, but no code has been committed to them in two years.

If you were able to get any of the above, or another Redis/Memcached compatible caching engine working, please post a comment below.


Mar 22 2016

MongoDB is a NoSQL database that has Drupal integration for various scenarios.

One of these scenarios is using MongoDB as the caching layer for Drupal.

This article describes what is needed to get MongoDB working as a caching layer for your Drupal site. We assume that you have an Ubuntu Server LTS 14.04 or similar Debian derived distro.

Download The Drupal Module

First, download the MongoDB Drupal module. You do not need to enable any MongoDB modules.

drush @live dl mongodb

Install MongoDB Server, Tools and PHP Integration

Then install MongoDB, and PHP's MongoDB integration. Note that 'mongodb' is a virtual package that installs the mongodb-server package as well as other client tools and utilities:

aptitude install php5-mongo mongodb

Restart PHP

Restart PHP, so that MongoDB integration takes effect:

service php5-fpm restart

Configure Drupal With MongoDB

Now, edit your settings.php file, to add the following:

$conf['mongodb_connections']['default']['host'] = 'mongodb://127.0.0.1';
$conf['mongodb_connections']['default']['db'] = 'site1';
$conf['cache_backends'][] = 'sites/all/modules/contrib/mongodb/mongodb_cache/mongodb_cache.inc';
$conf['cache_default_class'] = 'DrupalMongoDBCache';

Note that if you have a multisite setup, using a different 'db' for each site will prevent cache collisions.

Monitoring MongoDB

You can monitor MongoDB using the following commands.

mongotop -v
mongostat 15

Tuning MongoDB

Turn off MongoDB's journaling, since we are using MongoDB for transient cache data that can be recreated from Drupal.

Edit the file /etc/mongodb.conf and change journal= to false.
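
After the edit, the relevant line should read:

# /etc/mongodb.conf
journal=false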

Performance Results

Quick testing on a live site showed that MongoDB performance is acceptable, but not spectacular. That is especially true when compared to memory resident caching, such as Memcached or Redis.

For example, on the same site and server, with Redis, the Time To First Byte (TTFB) is ~95 to 105 milliseconds. With MongoDB it is ~200 milliseconds, and at times goes up to ~350 milliseconds.

Still, MongoDB can be a solution in memory constrained environments, such as smallish VPS's.

Aug 25 2015

There are rare occasions when you want to re-index all your site's content in Solr. Such occasions include:

  • Major Drupal version upgrade (e.g. from Drupal 6.x to Drupal 7.x).
  • Changing your Solr schema to include more search criteria.
  • Upgrading your Solr server to a new major version.
  • Moving your Solr server from an old server to a new one.

The usual way of doing this re-indexing is to make cron run more frequently. However, if you do that, there is a risk of cron being blocked by other long running cron tasks. Moreover, you are usually limited to a few hundred items per cron run, and then you have to wait for the next cron run.

The indispensable Swiss Army knife of Drupal, Drush, has hooks for Solr. Therefore, for someone like me who does almost everything from the command line, using Drush was the natural fit for this task.

To do this, in one terminal, I enter this command:

while true
do
  drush @live --verbose solr-index
  sleep 5
done

This command runs the indexing in a loop, and is not dependent on cron. As soon as one batch of 100 items (or whatever limit you have in your settings) is done, another batch is sent.

In another terminal, you would monitor the progress as follows:

while true
do
  drush @live solr-get-last-indexed
  sleep 30
done

Once you see the number of items in the second terminal stop increasing, check the first terminal for any errors. Usually this means that indexing is complete. However, if there are errors, they may be due to bad content in nodes, which needs to be fixed (e.g. bad fields) or unpublished, as the case may be.

Doing this reindexing on a Drupal 7.x site sending content to a Solr 4.x server took from 11 pm to 1 pm the next day (14 hours) for 211,900 nodes. There was an overnight network disconnect for the terminals, and indexing was restarted in the morning, so the actual time is somewhat less.

In all cases, this is much faster than indexing a few hundred items every 5 minutes via cron. That would have taken several days to complete.
