May 03 2014

On Friday May 2nd, 2014, Khalid of 2bits.com, Inc. presented on Drupal Performance.

The presentation covered important topics such as:

  • Drupal misconception: Drupal is slow/resource hog/bloated
  • Drupal misconception: Only Anonymous users benefit from caching in Drupal
  • Drupal misconception: Subsecond response time in Drupal is impossible for logged in users
  • Drupal misconception: Cloud hosting is more cost effective than dedicated servers

The presentation slides are attached for those who may be interested ...

Attachment: drupalcamp-toronto-2014-drupal-performance-tips-and-tricks.pdf (371.99 KB)
Apr 29 2014

Published Tue, 2014/04/29 - 08:46

Khalid of 2bits.com Inc. will be presenting Drupal Performance Tips and Tricks at DrupalCamp Toronto 2014 this coming Friday May 2nd at Humber College, Lakeshore Campus.

See you all there ...

Mar 11 2014

We previously wrote in detail about how botnets hammering a web site can cause outages.

Here is another case that emerged in the past month or so.

Again, it is a distributed attempt from many IP addresses all over the world, most probably from PCs infected with malware.

Their main goal seems to be to add content to a Drupal web site, falling back to registering a new user when that attempt is denied because of site permissions.

The pattern is like the following excerpt from the web server's access log.

Note the POST requests, as well as the node/add destination in the referer. Also note the hard-coded port 80 in some referers:

173.0.59.46 - - [10/Mar/2014:00:00:04 -0400] "POST /user/register HTTP/1.1" 200 12759 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
173.0.59.46 - - [10/Mar/2014:00:00:06 -0400] "POST /user/register HTTP/1.1" 200 12776 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:10 -0400] "GET /user/register HTTP/1.1" 200 12628 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:16 -0400] "GET /user/register HTTP/1.1" 200 12642 "http://example.com/user/login?destination=node/add" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
202.75.16.18 - - [10/Mar/2014:00:00:17 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1"
5.255.90.89 - - [10/Mar/2014:00:00:18 -0400] "GET /user/register HTTP/1.1" 200 12627 "http://example.com/user/register" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
107.161.81.55 - - [10/Mar/2014:00:00:24 -0400] "GET /user/register HTTP/1.1" 200 12644 "http://example.com/user/login?destination=node/add" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
...
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"
128.117.43.92 - - [11/Mar/2014:10:13:30 -0400] "POST /user/register HTTP/1.1" 200 12752 "http://example.com:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2"

In the above case, the web site has a CAPTCHA on the user registration page, which causes a session to be created, and hence a full Drupal bootstrap (i.e. no page caching). When this is done by lots of bots simultaneously, it takes its toll on the server's resources.

Botnet Statistics

We gleaned these statistics from analyzing the access log for the web server for a week, prior to putting in the fix below.

Out of 2.3 million requests, 3.9% were to /user/register. 5.6% had http://example.com:80/ in the referer (with the real site instead of example). 2.4% had "destination=node/add" in the referer.

For the same period, but limiting the analysis to accesses to /user/register only, 54.6% have "/user/login?destination=node/add" in the referer. Over 91% pose as coming from a computer running Mac OS X Lion 10.7.5 (released October 2012). 45% claim to be on the Firefox browser, 33% pretend to be on Chrome, and 19.7% pose as Safari.
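
These numbers can be gleaned with a short script over the raw access log. Below is a minimal sketch of that kind of analysis; the log path and the matched patterns are assumptions to adapt to your own setup.

<?php
// Rough sketch: count how many requests hit /user/register and how many
// carry the suspicious referers discussed above. Assumes a combined-format
// access log at the hypothetical path below.
$log = '/var/log/apache2/access.log';

$total = 0;
$register = 0;
$port80_referer = 0;
$node_add_referer = 0;

$fp = fopen($log, 'r');
while (($line = fgets($fp)) !== FALSE) {
  $total++;
  if (strpos($line, '/user/register') !== FALSE) {
    $register++;
  }
  if (strpos($line, 'http://example.com:80/') !== FALSE) {
    $port80_referer++;
  }
  if (strpos($line, 'destination=node/add') !== FALSE) {
    $node_add_referer++;
  }
}
fclose($fp);

// Avoid division by zero on an empty log.
$total = max($total, 1);
printf("Requests: %d\n/user/register: %.1f%%\nPort 80 referer: %.1f%%\nnode/add destination: %.1f%%\n",
  $total,
  100 * $register / $total,
  100 * $port80_referer / $total,
  100 * $node_add_referer / $total);
?>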

Workaround

As usual with botnets, blocking individual IP addresses is futile, since there are so many of them. CloudFlare, which is front-ending the site, did not detect or block these attempts.

To solve this problem, we put in a fix that aborts the Drupal bootstrap when this bot is detected, by adding the following to settings.php. Don't forget to replace example.com with the domain/subdomain you see in your own access log.

if ($_SERVER['HTTP_REFERER'] == 'http://example.com/user/login?destination=node/add') {
  if ($_SERVER['REQUEST_URI'] == '/user/register') {
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}

// This is for the POST variant, with either port 80 in 
// the referer, or an empty referer
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
  if ($_SERVER['REQUEST_URI'] == '/user/register') {
    switch($_SERVER['HTTP_REFERER']) {
      case 'http://example.com:80/':
      case '':
        header("HTTP/1.0 418 I'm a teapot");
        exit();
    }
  }
}
Jan 08 2014

A client recently asked us for help with a very specific issue. The node edit page was hanging up, but only in Internet Explorer 10, and not in Firefox or Chrome. The client had WYSIWYG editor enabled.

This automatically pointed to a front end issue, not a server issue.

So, we investigated more, and found that the underlying issue is between Internet Explorer and JQuery with a large number of items to be parsed.

Internet Explorer was not able to parse the high number of token items listed (around 220). This caused the browser to hang when rendering the WYSIWYG page, with the following error message:

A script on this page is causing Internet Explorer to run slowly. If it continues to run, your computer might become unresponsive.

The dialog offers the option to stop the script.

The real problem is the critical issue described in #1334456, for which a fix has not yet been committed to the Token module's repository.

Fortunately there is an easy workaround, the steps are:

  • Install the Token Tweaks module.
  • Go to /admin/config/system/tokens.
  • Change the Maximum Depth limit from the default of 4 to 1
  • Save the changes.

Now the edit form for the node should work normally, and the browser, whichever it is, will not hang anymore.

Note: Thanks to Dave Reid for this workaround.

Apr 22 2013

One of the suboptimal techniques that developers often use is a query that retrieves the entire content of a table, without any conditions or filters.

For example:

SELECT * FROM table_name ORDER BY column_name;

This is acceptable if there are not too many rows in the table, and there is only one call per page view to that function.

However, things start to get out of control when developers do not take into account the frequency of these calls.

Here is an example to illustrate the problem:

A client had a high load average (around 5 or 6) on their server, which had around 400 logged-in users at peak hours. The server was somewhat fragile, with any little thing, such as a traffic influx or a misbehaved crawler, causing the load to go over 12.

This was due to using an older version of the Keyword Links module.

This old version had the following code, which caused certain keywords to be replaced when a node is being displayed:

function keyword_link_nodeapi(&$node, $op, $teaser, $page) {
  if ($op == 'view' && ...
    $node->content['body']['#value'] = keyword_link_replace(...);
  }
}

And this caused keyword replacement for each comment as well.

function keyword_link_comment(&$a1, $op) {
  if ($op == 'view') {
    $a1->comment = keyword_link_replace($a1->comment);
  }
}

The function that replaced the content with keywords was as follows:

 
function keyword_link_replace($content) {
  $result = db_query("SELECT * FROM {keyword_link} ORDER BY wid ASC");
  while ($keyword = db_fetch_object($result)) {
    ...
    $content = preg_replace($regex, $url, $content, $limit);
  }
  return $content;
}

Which executes the query every time, and iterates through the result set, replacing words.

Now, let us see how many rows there are in the table.

mysql> SELECT COUNT(*) FROM keyword_link;
+----------+
| count(*) |
+----------+
|     2897 |
+----------+
1 row in set (0.00 sec)

Wow! That is a relatively large number.

And Eureka! That is it! The query was re-executed every time the replace function was called!
This means that in a list of 50 nodes, there would be 50 queries!

And even worse, for a node with tens or hundreds of comments, there would be tens or hundreds of queries as well!

Solution

The solution here was to upgrade to the latest release of the module, which has eliminated the replacement of keywords for comments.

But a better solution, one that preserves the functionality for comments, is a two-fold combined approach:

Use memcache as the cache layer

By using memcache, we avoid going to the database for any caching. It is always a good idea in general to have that, except for simple or low traffic sites.
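
For reference, wiring Drupal 6 to the Memcache module is typically just a few lines in settings.php; the module path and server address below are assumptions for a single local memcached instance.

// Route Drupal's cache API through the Memcache module instead of
// the {cache_*} database tables (adjust the path to your install).
$conf['cache_inc'] = './sites/all/modules/memcache/memcache.inc';

// One local memcached instance serving the default cache bin.
$conf['memcache_servers'] = array(
  '127.0.0.1:11211' => 'default',
);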

However, on its own, this is not enough.

Static caching for cache_get() result

By statically caching the results of the query, or the cache_get(), those operations are executed once per page view, and not 51 times for a node displaying comments. This is feasible if the size of the dataset is not too huge. For example, for this site, the size was around 1.3 MB for the three fields that are used from that table, and fits in memory without issues for each PHP process.

This is the outline for the code:

function keyword_link_replace($content) {
  static $static_data;

  if (!isset($static_data)) {
    if (($cache = cache_get('keyword_link_data')) &&
      !empty($cache->data)) {
      $static_data = $cache->data;

      foreach($cache->data as $keyword) {
        replace_keyword($keyword, $content);
      }
    }
    else {
      $result = db_query("SELECT * FROM {keyword_link} ORDER BY wid ASC");

      $data = array();

      while ($keyword = db_fetch_object($result)) {
        $data[] = $keyword;

        replace_keyword($keyword, $content);
      }

      $static_data = $data;

      cache_set('keyword_link_data', $data, 'cache');
    }
  }
  else {
    foreach($static_data as $keyword) {
      replace_keyword($keyword, $content);
    }
  }

  return $content;
}

You can download the full Drupal 6.x version with the fixes from here.

What a difference a query makes

The change was done at 22:00 in the daily graphs, and 6 days before the monthly graph was taken. You can see the difference: the load average is lower, ranging between 1.8 and 2.4 for most of the day, with occasional spikes above 3. This is far better than the load of 5 or 6 before the fix. Also, the amount of data retrieved from MySQL is halved.

As you will notice, no change was seen in the number of SQL queries. This is probably because of the effect of MySQL's query cache. Since all the queries were the same for the same page, it served the result from the query cache, and did not have to re-execute the query tens or hundreds of times per page. Even though the query cache saved us from re-executing the query, there is still overhead in getting that data from MySQL's cache to the application, and that consumed CPU cycles.

Faster Node displays

And because we are processing less data, and doing less regular expression replacement, node display for nodes that have lots of comments has improved. Before the fix, for a node that had hundreds of comments, with 50 comments shown per page, the total page load time was 8,068 milliseconds.

The breakdown was as follows:

keyword_link_replace() 51 calls totalling 2,429 ms
preg_replace() 147,992 calls totalling 1,087 ms
mysql_fetch_object() 150,455 calls totalling 537 ms
db_fetch_object() 150,455 calls totalling 415 ms
mysql_query() 1,479 calls totalling 393 ms
unserialize() 149,656 calls totalling 339 ms

A total of 5,254 milliseconds processing keywords in comments only.

After eliminating the calls to hook_comment() in the module, the total load time for the node was 3,122 milliseconds.

Conclusion

So, always look at the size of your dataset, as well as the frequency of resource intensive or slow operations. They could be bottlenecks for your application.

Apr 10 2013

The bulk of Drupal hosting for clients that we deal with is on virtual servers, whether they are marketed as "cloud" or not. Many eventually have to move to dedicated servers because of increased traffic, or because of continually added features that increase complexity and bloat.

But, there are often common issues that we see repeatedly that have solutions which can prolong the life of your current site's infrastructure.

We assume that your staff, or your hosting provider, have full access to the virtual servers, as well as the physical servers they run on.

Disks cannot be virtualized

Even for dedicated servers, the server's disk(s) are often the bottleneck for the overall system. They are the slowest part. This is definitely true for mechanical hard disks with rotating platters, and even Solid State Disks (SSDs) are often slower than the CPU or memory.

For the above reasons, disks cannot be fully virtualized. Yes, you do get a storage allocation that is yours to use and no one else can use. But you cannot guarantee a portion of the I/O throughput, which is always a precious resource on servers.

So, other virtual servers that are on the same physical server as you will contend for disk I/O if your site (or theirs) is a busy one or not optimally configured.

In a virtual server environment, you cannot tell how many virtual servers are on the same physical server, nor if they are busy or not. You only deal with the effects (see below).

For a Drupal site, the following are some of the most common causes for high disk I/O activity:

  • MySQL, with either a considerable amount of slow queries that do file sorts and temporary tables; or lots of INSERT/UPDATE/DELETE
  • Lots of logging activity, such as a warning or a notice in a module that keeps getting reported many times per page access, each time causing a disk write
  • Boost cache expiry, e.g. when a comment is posted

Xen based virtualization vs. Virtuozzo or OpenVZ

The hosting market uses virtualization technologies much like airlines overbook flights: on the assumption that some passengers will not show up.

Similarly, not all virtual hosting customers will use all the resources allocated to them, so there is often plenty of unused capacity.

However, not all virtualization technologies are equal when it comes to resource allocation.

Virtuozzo and its free variant, OpenVZ, use the term "burst memory" to allocate unused memory from other instances, or even swap space when applications demand it on one instance. However, this can bring a server to its knees if swap usage causes thrashing.

Moreover, some Virtuozzo/OpenVZ hosts use vzfs, a virtualized file system, which is slow for Drupal when used for certain things, such as having the entire web root, logs, and database files on it.

Xen does not suffer from any of the above. It guarantees that memory and CPU allocated to one virtual instance stays dedicated for that instance.

However, since physical disk I/O cannot be virtualized, it remains the only bottleneck with Xen.

Underpowered Instances

One issue that Amazon AWS EC2 users face is that the reasonably priced instances are often underpowered for most Drupal sites. These are the Small and Medium instances.

Sites with a low number of nodes/comments per day, and with mostly anonymous traffic, lend themselves to working well with proper Varnish caching enabled, set to long hours before expiring.

Other sites that rely on a large number of simultaneous logged in users, with lots of enabled modules, and with short cache expiry times do not work well with these underpowered instances. Such sites require the Extra Large instances, and often the High CPU ones too.

Of course, this all adds to the total costs of hosting.

Expensive As You Grow

Needless to say, if your site keeps growing then there will be added hosting costs to cope with this growth.

With the cloud providers, these costs often grow faster than with dedicated servers, as you add more instances, and so on.

Misconfigured Self-Virtualization

Some companies choose to self-manage physical servers colocated at a datacenter and virtualize them themselves.

This is often a good option, but it can also be a pitfall: sometimes the servers are badly misconfigured. We saw one case where the physical server was segmented into 12 VMWare virtual servers for no good reason. Moreover, all of them were accessing a single RAID array. On top of that, Boost was used on a busy, popular forum. When a comment was posted, Boost was expiring pages, and that tied up the RAID array, keeping it from doing anything useful for other visitors of the site.

Variability in Performance

With cloud and virtual servers, you often don't notice issues, but then suddenly variability will creep in.

An analogy ...

This happens because you have bad housemates who flush the toilet when you are in the shower. Except that you do not know who those housemates are, and can't ask them directly. The only symptom is this sudden cold water over your body. Your only recourse is to ask the landlord if someone flushed the toilet!

Here is a case in point: a Drupal site on a VPS with a popular cloud provider. It worked fine for several years. Then the host upgraded to a newer version of its platform, and asked all customers to move their sites.

It was fine most of the time, but then extremely slow at other times. No pattern could be predicted.

For example, while getting a page from the cache for anonymous visitors usually takes a few tens of milliseconds at most, on some occasions it took much more than that: in one case, 13,879 milliseconds, with a total page load time of 17,423 milliseconds.

Here is a sample of devel's output:

Executed 55 queries in 12.51 milliseconds. Page execution time was 118.61 ms.

Executed 55 queries in 7.56 milliseconds. Page execution time was 93.48 ms.

Most of the time is spent retrieving cached items.

ms where query
0.61 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'menu:1:en'
0.42 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_87_[redacted]'
0.36 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_54_[redacted]'
0.19 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'filter:3:0b81537031336685af6f2b0e3a0624b0'
0.18 cache_get SELECT data, created, headers, expire FROM cache WHERE cid = 'bc_88_[redacted]'
0.18 block_list SELECT * FROM blocks WHERE theme = '[redacted]' AND status = 1 ORDER BY region, weight, module

Then suddenly, same site, same server, and you get:

Executed 55 queries in 2237.67 milliseconds. Page execution time was 2323.59 ms.

This was a Virtuozzo host, and it was a sign of disk contention. Since this is a virtual server, we could not tell if it was something inside the virtual host, or some other tenant on the same physical server flushing the toilet ...

The solution is in the following point.

Move your VPS to another physical server

When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server.

Doing so most likely will solve the issue, since you effectively have a different set of housemates.

Further Reading:

Mar 27 2013

Today, Khalid gave a presentation on Drupal Performance and Scalability for members of the London (Ontario) Drupal Users Group.

The slides from the presentation are attached below.

Attachment size: 498.3 KB
Mar 12 2013

Over the past few years, we have been called in to assist clients with poor performance of their sites. Many of these were using Pressflow, because it is "faster" and "more scalable" than Drupal 6.x.

However, some of these clients hurt their site's performance by using Pressflow, rather than plain Drupal, often because they misconfigured or misused it in some way or another.

Setting cache to "external" without having a caching reverse proxy

We saw a couple of cases where clients would set the cache to "External" in admin/settings/performance, but they are not running a reverse proxy cache tier, such as Varnish or Squid.

What happens here is that Pressflow will not cache pages for anonymous users, and just issue the appropriate cache HTTP headers, assuming that a caching reverse proxy, e.g. Varnish, will cache them.

Performance of the site will suffer, since it will be hit by search engine crawlers.

The solution is simple: either configure a reverse proxy, or set caching to "normal".

Setting Page Cache Maximum Age too low

In admin/settings/performance in Pressflow, there is a configuration parameter called "Page cache maximum age" (called "Expiration of cached pages" in Drupal 7.x). This value should not be left set to "none", because that means items will not stay in the cache long enough to be served to subsequent users. Setting it too low (e.g. 1 minute) has the same effect.

Do set this parameter to the highest time possible if you have an external cache like Varnish or Squid.
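
If you prefer to pin this in code rather than rely on the UI, the same setting can be forced from settings.php; the variable name below is the Drupal 7/Pressflow one, and the one-hour value is only an example.

// Let external caches (Varnish, Squid) keep pages for up to one hour.
// 3600 seconds is an example; set it as high as your content allows.
$conf['page_cache_maximum_age'] = 3600;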

Enabling modules that create anonymous sessions

Both Pressflow 6.x and Drupal 7.x disable page caching for anonymous users if a session is present.

This means that if you have a module that writes to the session, page caching will be disabled, because storing session data causes a session (and its cookie) to be created.

This means that code like this will disable page caching for anonymous users:

  $_SESSION['foo'] = 'bar';
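
A simple defensive pattern in custom code, assuming the data is only needed for authenticated users, is to avoid touching $_SESSION for anonymous visitors at all:

global $user;

// Only store session data for logged-in users, so anonymous page
// caching is not disabled by an unnecessary session.
if ($user->uid) {
  $_SESSION['foo'] = 'bar';
}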

The Pressflow Wiki started an effort to list such modules here: Modules that break Pressflow 6.x caching and how to fix them and here: Code that sets cookie or session, but with so many modules being written, it is virtually impossible to have a complete list.

Also, novice Drupal developers will not know this, and write modules that use cookies, and therefore prevent page caching for anonymous users.

We have seen cases from such developers where a site that was previously working perfectly is rendered fragile and unstable by one line of code!

Note that this fault applies to Pressflow 6.x, and to Drupal 7.x as well.

If you are using the former, then you can solve the problem temporarily by switching to Drupal core 6.x instead of Pressflow 6.x. Drupal core 6.x does not mind cookies for anonymous users.

Using Varnish with hook_boot() or hook_exit() modules

When using an external cache, like Varnish, all anonymous requests do not hit Drupal at all. They are served from Varnish.

So if you have modules that implement hook_boot() or hook_exit(), then the code that is there will not be triggered at all. If you rely on it for some functionality, then it will be hit only the first time the page is accessed.

For example, the core statistics module's hook_exit() increments the view count for the node. If you enable this module with this functionality behind Varnish, these figures will be far lower than the real numbers, and you are better off disabling the module rather than keeping inaccurate numbers.
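
To make the mechanism concrete, here is a simplified sketch of the kind of counting such a hook_exit() implementation does (the module and table names are hypothetical, not the statistics module's exact code). Behind Varnish, cached anonymous hits never reach PHP, so this code simply never runs for them:

function mymodule_exit() {
  // Increment a per-node view counter on node pages; the table name
  // here is hypothetical.
  if (arg(0) == 'node' && is_numeric(arg(1))) {
    db_query("UPDATE {mymodule_node_views} SET views = views + 1 WHERE nid = %d", arg(1));
  }
}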

Mar 06 2013

The Boost module is often a great help with speeding up web sites that are of small to medium size and/or hosted on shared hosts.

It works by writing the entire cached page to a disk file, and serving it entirely from the web server, bypassing PHP and MySQL entirely.

This works well in most cases, but we have observed a few cases where boost itself becomes a bottleneck.

One example was when 2bits.com was called in to investigate and solve a problem for a Fortune 500 company's Drupal web site.

The site was configured to run on 12 web servers, each being a virtual instance on VMWare, but all of them sharing a single RAID-5 pool for disk storage.

The main problem was when someone posts a comment: the site took up to 20 seconds to respond, and all the web instances were effectively hung.

We investigated and found that Boost's expiry logic kicks in and tries to intelligently delete the cached HTML for the node, the front page, etc. All of this happens while the site is busy serving pages from Boost's cache on the same disk, as well as other static files.

This disk contention from deleting files caused the bottleneck observed.

By disabling boost, and using memcache instead, we were able to bring down the time from 20 seconds to just 8 seconds.

Further improvement could be achieved by using Varnish as the front tier for caching, reducing contention.

Feb 25 2013

In the Drupal community, we always recommend using the Drupal API, and best practices for development, management and deployment. This is for many reasons, including modularity, security and maintainability.

But it is also for performance that you need to stick to these guidelines, refined for many years by so many in the community.

By serving many clients over many years and specifically doing Drupal Performance Assessments, we have seen many cases where these guidelines are not followed, causing site slowdowns and outages.

Here are some examples of how not to do things.

Logic in the theme layer

We often find that developers who are proficient in PHP, but new to Drupal, misuse its API in many ways.

In extreme cases, they don't know that they should write modules to house the application logic and data access, and leave only presentation to the theme layer.

We saw a large site where all the application logic was in the theme layer, often in .tpl.php files. The logic even ended with an exit() statement!

This caused Drupal's page caching mechanism to be bypassed, resulting in all page accesses from crawlers and anonymous users being very heavy on the servers, and it complicated the infrastructure, which was over-engineered to compensate for such a development mistake.
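
As a rough illustration of the intended split (module, function, and variable names here are hypothetical), data access belongs in a module, for example in a preprocess hook, leaving the template to only print the prepared value:

// In a custom module: the query lives in PHP code, not in the template.
function mymodule_preprocess_node(&$variables) {
  $node = $variables['node'];
  // Hypothetical query; the point is that SQL stays out of .tpl.php files.
  $variables['related_count'] = db_result(db_query(
    "SELECT COUNT(*) FROM {node} WHERE type = '%s' AND status = 1",
    $node->type));
}

// In node.tpl.php, only presentation remains, e.g.:
// <p><?php print $related_count; ?> related items</p>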

Using PHP in content (nodes, blocks and views)

Another common approach, which most developers start using as soon as they discover it, is placing PHP code inside nodes, blocks, or views.

Although this is a quick and dirty approach, the initial time savings cause lots of grief down the road through the life cycle of the site. We wrote an article specifically about that, which you will find a link to below.

Heavy queries in the theme layer, when rendering views

In some cases, the logic for rendering individual nodes within a view is complex, and involves code in the view*.tpl.php file that has SQL queries, or calls to heavy functions, such as node_load() and user_load().

We wrote an article on this which you can find the link to below.

Conclusion

Following Drupal's best practices and community guidelines is always beneficial. Performance is just one of the benefits that you gain by following them.

Further reading

Jul 23 2012

In the past few days, we have seen another Denial of Service attack on a client's site.

The symptoms were a complete outage of the server, with very high CPU usage, and high load average (over 900 in some cases!).

Upon investigating, we found that this is caused by the following hits:

75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:06 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:06 -0400] "POST / HTTP/1.1" 500 539 "-" "-"
75.145.153.237 - - [22/Jul/2012:19:55:07 -0400] "POST / HTTP/1.1" 500 539 "-" "-"

So, a script/bot was used to post data to the home page of the site, and that caused a server error (HTTP 500) to be returned.

All the IP addresses belonged to the University of Illinois at Urbana-Champaign (UIUC), or to Comcast customers also in Illinois.

For the first and second incidents, we blocked the IP addresses, or entire subnets. However, we soon realized that this was a futile effort, since other IP addresses would be used.

We then devised a plan to prevent the POST requests from reaching PHP altogether. This was done by adding the following to Drupal's .htaccess. Basically, it returns an access denied response right from Apache if the conditions are met: an empty referer, an empty user agent, and a POST request to the home page.

# Modification for dealing with botnet DDoS via high CPU utilization
# by overwhelming PHP with POST data
#
# Referer is empty
RewriteCond %{HTTP_REFERER} ^$
# User agent is empty
RewriteCond %{HTTP_USER_AGENT} ^$
# The request is for the home page
RewriteCond %{REQUEST_URI} ^/$
# It is a POST request
RewriteCond %{REQUEST_METHOD} POST
# Forbid the request
RewriteRule ^(.*)$ - [F,L]

After implementing the above fix, the hits were successfully deflected, with no ill effect on the site.

67.177.109.10 - - [23/Jul/2012:16:31:23 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:23:58 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:31:21 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:23:14 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:23:29 -0400] "POST / HTTP/1.1" 403 202 "-" "-"
67.177.109.10 - - [23/Jul/2012:16:26:58 -0400] "POST / HTTP/1.1" 403 202 "-" "-"

Obviously, this is not the only protection for this type of attack. Other ways include installing Suhosin, and fiddling with Drupal's .htaccess as well, as described in PSA-2012-001.

Apr 22 2012

We had a site for a client that was stable for close to two years, then suddenly started to experience switches from the master to the geographically separate slave server as frequently as twice a week.

The site is an entertainment news site, and its articles get to Google News on occasions.

The symptoms were increased load on the server, and a sudden influx of traffic causing over 800 simultaneous connections, all in the ESTABLISHED state.

Normally, a well tuned Drupal site can withstand this influx, with server optimization and proper caching. But for this previously stable site, we found that a combination of factors, some internal to the site and others external, combined to cause the site to switch.

The internal factor was the way the site was set up using Purl and other custom code. Links were changed to add a top-level section to the URL, which then redirected to the real URL. This caused around 30% of URL accesses to result in a 302 redirect. Since redirects are not cached, they incurred more overhead than regularly served pages.

Investigating the root cause

We started checking if there is a pattern, and went back to analyse the server logs as far back as a year.

We used the ever-helpful GoAccess tool to do most of the investigative work.

A week in April 2011 had 28% redirects, but we found an anomaly in the browser share over the months. For that same April week, the browser breakdown was 34% MSIE, 21% Safari, and 21% Firefox.

For a week in Sep 2011, redirects were 30%, and browsers were 26% Safari, 25% MSIE, and 20% Firefox. These numbers make sense, as Safari was gaining market share and Microsoft was losing it.

But checking a week in Feb 2012, redirects were 32%; now look at the browsers: 46% Firefox, 16% Safari, 14% others, and 12% MSIE.

It does not make sense for Firefox to jump by that much and gain market share from thin air.

A partial week in March 2012 shows that redirects are 32%, and again the browsers are 52% Firefox, 14% others, 13% Safari, and 10% MSIE.

That MSIE dropped is something that one can understand. But the jump in Firefox from Sep to Feb/March is unjustified, and tells us that perhaps there are crawlers, scrapers, leechers, or something else masquerading as Firefox and hitting our content.

Digging deeper, we find that the top 2 Firefox versions are:

27,092 Firefox/10.0.2
180,420 Firefox/3.0.10

The first one is understandable, a current version of Firefox. The second one is a very old version from 2009, and has 6.6X the traffic of the current version!

The user agent signature is consistently like the following, with a 2009 build:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)

We went back and looked at a week in September (all hours of the day), with that browser signature, and lo and behold:

Unique visitors that suck lots of bandwidth:

  88      10.49%  24/Sep/2011  207.76 MB
  113     13.47%  23/Sep/2011  994.44 MB
  109     12.99%  22/Sep/2011    1.44 GB
  133     15.85%  21/Sep/2011    1.70 GB
  134     15.97%  20/Sep/2011    1.68 GB

There were only 335 different IP addresses!

But look at the same user agent in March for a week:

   94479  38.36%  15/Mar/2012   16.38 GB
  102037  41.43%  14/Mar/2012   17.13 GB
   38795  15.75%  13/Mar/2012   12.48 GB
   11003   4.47%  12/Mar/2012   10.90 GB

See the number of unique visitors compared to September?
And now there are 206,225 different IP addresses!

For a few days in March, Monday to Thursday, here are the figures for this user agent.

Total requests to pages (excluding static files): 1,122,229
Total requests that have an empty referer: 1,120,843
That is, 99.88% are from those botnets!

Verifying the hypothesis

Looking at the web server logs through awstats, we found that a year ago, in Feb 2011, the market share for Firefox overall was 24.7%, with 16,559,999 hits. At that time, Firefox 3.0.10 had only 44,436 hits.

That is 0.002 % of the total.

In Sep 2011 it had 0.2% with 241,869 hits.

Then in Feb 2012, that old version from 2009 had a 2.2% share of hits, with 4,409,396 hits.

So, from 0.002% to 2.2% of total, for an obsolete version of Firefox. This means growth by a factor of 1,100 X in one year.

Does not make sense.

Botnet hammering the site

So, what does this tell us?

Looking at a sample of the IP addresses, we found that they all belong to Cable or DSL companies, mainly in the USA.

This tells us that there is a massive botnet of infected PCs.

They were piloting the botnet in September and went full speed after that, hitting the server hard.

The botnet's programs seem to have a bug that prevents them from coordinating with each other, so they all try to grab new content at the same time. This poor coding, combined with the non-caching of 302 redirects, causes the sudden influx of traffic that brings the server to its knees.

Just to make sure, we quickly checked two other sites that we manage for the same symptoms. One, an entertainment site, is showing similar signs; the other, a financial site, is not. Both have good caching because they have no redirects (97% to 98% of requests return code 200), and that is why the entertainment site can withstand the onslaught.

Solution: block the botnet's user agent

Since the botnet is coming from hundreds of thousands of IP addresses, it is not possible to block based on the IP address alone.

Therefore, the solution was to block requests coming with that browser signature from 2009 only, and only when there is no referer.

This solution, which goes into settings.php, prevents Drupal from fully booting when the bad browser signature is encountered and the referer is empty.

We intentionally sent the humorous, but still legitimate, 418 HTTP return code so we can filter by that when analysing logs.

$botnet = 'Gecko/2009042316 Firefox/3.0.10';
if ($_SERVER['HTTP_REFERER'] == '') {
  if (FALSE !== strpos($_SERVER['HTTP_USER_AGENT'], $botnet)) {
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}

The above should work in most cases.

However, a better solution is to keep the changes at the Apache level and never bother with executing any PHP code if the conditions are met.

# Fix for botnet crawlers, by 2bits.com, Inc.
#
# Referer is empty
RewriteCond  %{HTTP_REFERER}    ^$
# User agent is bogus old browser
RewriteCond  %{HTTP_USER_AGENT} "Gecko/2009042316 Firefox/3.0.10"
# Forbid the request
RewriteRule  ^(.*)$ - [F,L]

The drawback is that we are using a 403 (access denied) instead of the 418 (I am a teapot), which can skew the statistics a bit in the web server logs.

Further reading

After investigating and solving this problem, I discussed the issue with a friend who manages several high traffic sites that are non-Drupal, and at the time, he did not see the same symptoms. However, a few weeks later he started seeing the same symptoms, and sent me the first two articles. Months later, I saw the third:

Dec 29 2011

First of all, check the requirements for Barracuda on the project page. You need a freshly installed server that matches one of the distributions listed (all of them are Debian or Ubuntu).

Open a terminal on the server (ssh to it if it's remote), switch to root if needed, and grab the script (open the Readme to make sure the link is correct):

wget -q -U iCab http://files.aegir.cc/versions/BARRACUDA.sh.txt

Then edit the server's IP and hostname in the script (using vi or nano). You can use 127.0.0.1 or 127.0.1.1 for localhost. Check /etc/hosts for the hostname and corresponding IP. If you don't have public DNS set up, add the Aegir frontend as a host:

_MY_OWNIP="127.0.1.1"
_MY_HOSTN="yourhostname"
_MY_FRONT="aegir.yourhostname"

You can also change

_MY_EMAIL="youremailhere"

Try running the script now.

bash BARRACUDA.sh.txt

If you run into the "invalid DNS setup" error, you can try disabling the DNS check in the script. Note that after you run the Barracuda installer for the first time, it creates a configuration file and will use the values from this file for the installation. So if you need to modify them, you can either delete this file and modify the Barracuda script, or edit the config file directly:

nano /root/.barracuda.cnf

Search (Ctrl+W in nano) for _DNS_SETUP_TEST and set it to NO:

_DNS_SETUP_TEST=NO

You might also need to disable _SMT_RELAY_TEST in the same manner if you're installing Barracuda on the local server.

Run the script again and wait for all the components to be installed. You'll get prompted about the MySQL passwords and optional components. Once you're done, you'll get the link to your Aegir frontend. Check the previous screencasts on how to use Aegir.

Optionally, you can try installing Octopus to create instances of Aegir on the same server with prebuilt platforms based on popular Drupal distributions.

Nov 14 2011

Together with Alan Dixon of Black Fly Solutions, Khalid Baheyeldin of 2bits.com, Inc. gave a presentation on Web Site Performance, Optimization and Scalability at Drupal Camp 2011.

The slides from the presentation are attached below.

Attachment size: 384.56 KB
Mar 15 2011

I have my websites hosted on Dreamhost VPSes, and the other day I decided to consolidate some of my sites onto one server. Using the Munin monitoring I've installed (see Use Munin to monitor a Dreamhost MySQL VPS), I could see the CPU utilization on the server was very low - below 15% - indicating that the server had spare CPU capacity. But a different performance issue showed up after consolidating the sites on one server: the memory allocation had to be increased to 3GB, when that server previously ran well at 1.2GB of memory.

I have put together several websites, and this consolidation involved moving 3 sites onto a server that already had 4 sites. All 7 are Drupal 6.

Last week I'd installed Block Cache Alter on another site (not one of those 7, and not run on the server in question), and saw a performance improvement. The observation on that other site was slightly improved CPU utilization and a somewhat decreased memory requirement.

Last night I decided to install Block Cache Alter into all the sites now consolidated on the server. I finished that just before midnight last night, and just now looked at the Munin graphs and saw this dramatic result on MySQL. Additionally the memory allocation could be reduced back down to 1.5GB and CPU utilization is approximately the same as it was before.

[Munin graph: localhost.localdomain-mysql_queries-day.png]

What we see is a sharp decline in "update" and "insert" operations occurring just as I finished installing Block Cache Alter last night. Hence I feel safe in concluding this result came from installing Block Cache Alter.

Installing Block Cache Alter is straightforward and doesn't involve anything strange like adding entries to settings.php (unlike Cache Router). However, to have it take any effect, you must edit all of your enabled blocks and change their cache settings. It appears you don't have to alter cache settings on the non-enabled blocks.

There is a range of choices: no caching, cache globally, cache per user, cache per role. It helps to understand the content of each block to know which setting to choose. Many blocks are static, non-changing content, and would be fine to cache globally. Other blocks change their content depending on the logged-in user, or on the role of the user, or they change on every page view. Obviously, if a specific block needs to change on every page view, it won't make sense to cache it globally, because that would only show the one cached value rather than the constantly changing value.
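
For module developers, these choices map to Drupal 6's block cache constants, which a module can declare directly in hook_block(); the module and block names below are hypothetical:

function mymodule_block($op = 'list', $delta = 0, $edit = array()) {
  if ($op == 'list') {
    return array(
      'static_promo' => array(
        'info' => t('Static promo text'),
        // Same content for everyone: safe to cache globally.
        'cache' => BLOCK_CACHE_GLOBAL,
      ),
      'user_links' => array(
        'info' => t('Per-user links'),
        // Content depends on the logged-in user: one cached copy per user.
        'cache' => BLOCK_CACHE_PER_USER,
      ),
    );
  }
}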

Here's how I configured the blocks:

  • On one site I had several text-only blocks (hence, not changing) and decided to consolidate them all into one block, and set global caching on that consolidated block
  • I found that most blocks could be set for global caching
  • The "Navigation" menu is one that needs per-user caching, because its content changes based on the user ID.
  • Some blocks have static text which contains javascript, and the javascript dynamically runs on page view in an AJAX fashion to display some data. Because the text Drupal stores doesn't change, it's okay to use global caching on those blocks.
  • One of my blocks is a view that randomly selects on every page view 10 taxonomy terms from a large taxonomy. Obviously this one cannot be cached.
  • The "Who's Online" block is another example of a block which cannot be cached.
Oct 15 2010

As promised, the slides are attached, as a PDF for everyone's reference.

Attachment size: 621.53 KB
Aug 15 2010

A client site was facing intermittent outages, especially at peak hours.

We investigated the issue over a few days, and narrowed down the cause to certain slow queries, described in more detail below.

They had Tagadelic on every page, displaying a tag cloud, and from that, crawlers hit every term at paths like taxonomy/term/1581.

Slow queries causing server outage

As more of these queries get executed simultaneously, things get worse and worse: the temporary tables generated by this query are created on disk, so the disk gets busier, while more queries keep coming in the meantime, making the disk even busier and slower at processing the queries.

This is not a good situation, since not only does it lead to slowdowns, but also to outages as more PHP processes are tied up by slow queries ...

Relevant site statistics

First, the site's relevant statistics are as follows:

40,751 rows in the node table
82,529 rows in the term_node table
79,832 rows in the node_access table

The slow query in question was responsible for a combined total of 73.3% of the overall slow query time for all queries. Out of this, 39.8% was from the query itself, and 33.5% was from the COUNT query associated with it.

EXPLAIN output for the query

The slow query that was bogging down the site looked like this, from the taxonomy module, in the function taxonomy_select_nodes():

EXPLAIN
SELECT DISTINCT(n.nid), n.sticky, n.title, n.created 
FROM node n  
INNER JOIN term_node tn0  ON n.vid  = tn0.vid  
INNER JOIN node_access na ON na.nid = n.nid 
WHERE (
  na.grant_view >= 1 AND 
  (
    (na.gid = 0 AND na.realm = 'all') OR 
    (na.gid = 1 AND na.realm = 'job_view') OR 
    (na.gid = 0 AND na.realm = 'resume_owner') OR 
    (na.gid = 0 AND na.realm = 'og_public')
  )
) 
AND 
( 
  n.status = 1  AND 
  tn0.tid IN (1581) 
)
ORDER BY n.sticky DESC, n.created DESC 
LIMIT 0, 15\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: n
         type: range
possible_keys: PRIMARY,vid,node_status_type
          key: node_status_type
      key_len: 4
          ref: NULL
         rows: 40751
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: na
         type: ref
possible_keys: PRIMARY,grant_view_index
          key: PRIMARY
      key_len: 4
          ref: live.n.nid
         rows: 1
        Extra: Using where; Distinct
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: tn0
         type: ref
possible_keys: vid
          key: vid
      key_len: 4
          ref: live.n.vid
         rows: 1
        Extra: Using where; Distinct

Both the node and term_node tables have many rows, and adding to that is the node_access table. An extra join is done on that table, which also contains many rows.

So, for the sake of diagnosis, we rewrote the query without the node_access table, like so:

EXPLAIN
SELECT DISTINCT(n.nid), n.sticky, n.title, n.created 
FROM node n  
INNER JOIN term_node tn0 ON n.vid = tn0.vid 
WHERE n.status = 1  AND 
tn0.tid IN (1581) 
ORDER BY n.sticky DESC, n.created DESC 
LIMIT 0, 15\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: n
         type: range
possible_keys: vid,node_status_type
          key: node_status_type
      key_len: 4
          ref: NULL
         rows: 40751
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tn0
         type: ref
possible_keys: vid
          key: vid
      key_len: 4
          ref: live.n.vid
         rows: 1
        Extra: Using where; Distinct

We eliminate a join, but everything else remains the same, including the temporary table and filesort.

Timing the two scenarios: Using MySQL profiling

We then proceeded to time the two queries in a controlled manner, using two methods: first, MySQL profiling, which is available only in 5.0.37 or later, and only in the Community Edition.

To do this, you execute the following command on MySQL's command line:

SET PROFILING = 1;

This enables profiling for your session.

You then execute the two above queries, one after the other.

After that, you enter the following command:

SHOW PROFILES;

The output will show the time for each query, and the SQL for it (which we abbreviated for brevity):

+----------+------------+
|        1 | 1.40426500 | SELECT SQL_NO_CACHE DISTINCT(n.nid), ...
|        2 | 0.30603500 | SELECT SQL_NO_CACHE DISTINCT(n.nid), ...
+----------+------------+

Clearly the times are different.

You can also glean more valuable info on resource usage by querying the information_schema database's profiling table, as follows. The SQL filters out all states that take less than 1 millisecond, again for brevity.

For the original query, with node_access joined, we get:

SELECT state, duration, cpu_user+cpu_system AS cpu, 
block_ops_in+block_ops_out AS blocks_ops 
FROM information_schema.profiling 
WHERE query_id = 1 AND
duration > 0.000999;
+----------------------+----------+------+------------+
| state                | duration | cpu  | blocks_ops |
+----------------------+----------+------+------------+
| Copying to tmp table | 1.403615 | 1.68 |        128 | 
+----------------------+----------+------+------------+

For the second query without the node_access join, we get:

SELECT state, duration, cpu_user+cpu_system AS cpu, 
block_ops_in+block_ops_out AS blocks_ops 
FROM information_schema.profiling 
WHERE query_id = 2 AND
duration > 0.000999;
+----------------------+----------+------+------------+
| state                | duration | cpu  | blocks_ops |
+----------------------+----------+------+------------+
| Copying to tmp table | 0.305544 | 0.35 |          8 | 
+----------------------+----------+------+------------+

As you can see, the CPU usage and the block I/O operations are far less when we eliminate the node_access join.

Timing the two scenarios: custom PHP script

If you don't have a MySQL version that supports profiling, you can time the two queries using a simple custom script.

The server is a medium one with normal specs (8 cores, 8GB of RAM, and a separate disk for MySQL).

The PHP script boots Drupal then executes both variations of the query, measuring the time for each.

Note that we added SQL_NO_CACHE to the query to force it to bypass the query cache of MySQL:

<?php
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// We have an array with the query type, and the query SQL
$q = array(
  'before' => "
  SELECT SQL_NO_CACHE DISTINCT(n.nid), n.sticky, n.title, n.created
  FROM node n
  INNER JOIN term_node tn0 ON n.vid = tn0.vid
  INNER JOIN node_access na ON na.nid = n.nid
  WHERE
    (
      na.grant_view >= 1 AND
      (
        (na.gid = 0 AND na.realm = 'all') OR
        (na.gid = 1 AND na.realm = 'job_view') OR
        (na.gid = 0 AND na.realm = 'resume_owner') OR
        (na.gid = 0 AND na.realm = 'og_public')
      )
    )
    AND
    (
      n.status = 1 AND
      tn0.tid IN (1581)
    )
  ORDER BY n.sticky DESC, n.created DESC
  LIMIT 0, 15",

  'after' => "
    SELECT SQL_NO_CACHE DISTINCT(n.nid), n.sticky, n.title, n.created
    FROM node n
    INNER JOIN term_node tn0 ON n.vid = tn0.vid
    WHERE n.status = 1 AND
      tn0.tid IN (1581)
    ORDER BY n.sticky DESC, n.created DESC
    LIMIT 0, 15",
);


$result = array();

// Execute each SQL, measuring it in the process
foreach($q as $type => $query) {
  timer_start($type);
  $result[$type] = db_query($query);
  print "$type:\t" . timer_read($type) . "\n";
}
?>

We save that to a file called query_time.php, and run it like so:

cd /var/www
php /somewhere/query_time.php

You can also run it directly from a browser, if you do not have shell access, or if you don't have PHP CLI installed.

For that, you need to replace this line:

  print "$type:\t" . timer_read($type) . "\n";

With:

  print "$type: " . timer_read($type) . "<br/>\n";

For HTML output in the browser, you also need to have the script in the same directory that has Drupal core's index.php.

Analyzing the results

Whatever way you run the above script, the output will look like this:

before: 1597.86
after:   299.9

As you can see, there is a significant difference when having node_access vs. eliminating it from the query. The difference above is more than 5X.

Running it several times, and getting slightly different results each time, we concluded that at least a 4X improvement is possible with this modification.

This means that eliminating the node_access part from the query was going to improve the site by eliminating slowdowns and outages.

Quick Solution: Hack core!

The quick solution was to patch the taxonomy module's taxonomy_select_nodes() function to comment out the two db_rewrite_sql() calls.

However, since the site has some private data, we had to protect it in some way. It turned out that for this site, the private nodes all belonged to one term, 63, and therefore the following patch will bypass the node_access join only when the path is /taxonomy/term/zzz, and return nothing if the visitor is going to term 63 specifically. For other paths, normal access control will apply, and those authorized to see the private content will continue to be able to do so:

-    $sql = db_rewrite_sql($sql);
-    $sql_count = db_rewrite_sql($sql_count);
+    if (in_array(63, $tids)) {
+      // Do not display private nodes...
+      return NULL;
+    }
+    // Skip the node_access stuff reducing execution time for queries
+    // $sql = db_rewrite_sql($sql);
+    // $sql_count = db_rewrite_sql($sql_count);

This improved the server performance significantly, as you can see in the graphs below:

MySQL Slow queries reduced significantly:

The load average is back to a normal number, with no spikes:

CPU usage is also normal, with no spikes:

The drawback of the above solution is that it is a hack to core, which has to be maintained in version control across future core upgrades.

Long Term Solution: Separate site

The long-term solution, though, is to separate the private content into its own site, with its own subdomain (e.g. private.example.com) and a separate database. The Bakery module can serve as a single sign-on solution. Bakery does not synchronize roles across sites, though, so that has to be managed manually until the feature request in #547524 is implemented.

Further reading

Incidentally, this is not the first time that we had to bypass node access to resolve performance bottlenecks caused by slow queries. Read How Drupal's node_access table can negatively impact site performance for a similar case.

Aug 09 2010

One great feature that Drupal has is the ability to make modules run certain tasks, often heavy ones, in the background at preset intervals. This can be achieved by a module implementing hook_cron.

Core uses this feature to index new content for the search module, notify remote sites of new content via the ping module, fetch new release information from drupal.org, poll other sites for RSS feeds, and more.

Various contributed modules use this for various purposes, such as mailing out newsletters, cleaning up logs, synchronizing content with other servers/sites, and much more ...
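
For illustration, a module plugs into this mechanism simply by implementing the hook; the module name and the cleanup task below are hypothetical (Drupal 6 style):

/**
 * Implementation of hook_cron().
 *
 * Runs on every cron invocation; here it just purges rows older than
 * 30 days from a hypothetical log table.
 */
function mymodule_cron() {
  $threshold = time() - 30 * 24 * 60 * 60;
  db_query("DELETE FROM {mymodule_log} WHERE created < %d", $threshold);
}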

Core Cron: All or None

This powerful core feature has some limitations though, such as:

  • All hook cron implementations for all modules are run at the same time, in sequence alphabetically or according to module weight.
  • When cron for one module is stuck, all modules following it will not be executed, and cron will not run again until 1 hour has passed.
  • There is no way to know which module is the one that caused the entire cron to get stuck. Moreover, there is no instrumentation information to know which cron hook takes the most time.

2bits.com has proposed core patches to overcome the lack of instrumentation by logging the information to the watchdog. The patches are useful only to those who apply them; it is unlikely that they will get into core any time soon.
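
The idea behind that instrumentation can be sketched roughly as follows; this is an illustration of timing each module's cron hook and logging it, not the actual patch:

// Invoke each module's cron hook individually, timing it and logging
// the elapsed time to the watchdog (Drupal 6 APIs).
foreach (module_implements('cron') as $module) {
  timer_start('cron_' . $module);
  module_invoke($module, 'cron');
  watchdog('cron', '@module cron ran for @ms ms', array(
    '@module' => $module,
    '@ms' => timer_read('cron_' . $module),
  ));
}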

For a practical example, you can use Job Queue with our own Queue Mail module to improve end-user response time and avoid timeouts due to sending a lot of emails. This scheme defers sending until cron is run, rather than sending when a user submits a node or a comment.

This works well, but for core, all cron hooks run at the same time. If you set cron to run every hour, then email sending could be delayed by an hour or even more if job queue cannot send them all in one run. If you make cron run more frequently, e.g. every 15 minutes, then all the heavy hooks such as search indexing and log cleanup will also run every 15 minutes consuming lots of resources.

Enter Elysia Cron ...

With Elysia cron, you can now have the best of both worlds: you can set cron for job_queue to run every minute, and defer other heavy stuff to once a day during off hours, or once an hour. The email is delivered quickly, within minutes, and we don't incur the penalty of long running cron hooks.

Features and Benefits of Elysia Cron

The features that Elysia cron offers are many, the important ones, with a focus on performance, are:

  • You can run different hook_cron implementations for different modules at a different frequency.
  • You are aware what the resource and performance impact of each hook_cron implementation is. This includes the time it took to run it last, the average, and maximum time. This information is very valuable in distributing different hooks across the day, and their frequencies.
  • Set a configurable maximum for cron invocations. Drupal core has a hard coded value of 240 seconds. You can adjust this up or down as per your needs.
  • Handles "stuck" crons better than core. In core, if cron is stuck, it takes one hour for it to automatically recover. In Elysia cron, the other hook invocations continue to run normally.
  • You can set the weight for each module, independent from the weight for the module in the system table. Using this, you can have a different order of execution for modules.
  • You can group modules in "contexts", assigning different run schedules for different contexts, or disable contexts globally.
  • The ability to assign a cron key, or a white list of allowed hosts that can execute cron.
  • Selectively disable cron for one or more modules, but not others, or all cron.
  • Selectively run cron for only one module.
  • Defining a cronapi that developers can use (see the sketch after this list).
  • It requires no patching of core or contributed modules.
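
To illustrate the cronapi point above, here is a hedged sketch of what a cronapi definition might look like for a hypothetical module; check the exact hook signature and supported keys against the version of Elysia Cron you install:

<?php
/**
 * Implementation of hook_cronapi(), provided by Elysia Cron.
 *
 * Declares a dedicated cron callback with its own default schedule,
 * instead of piling more work into hook_cron().
 */
function example_cronapi($op, $job = NULL) {
  return array(
    'example_purge_old_records' => array(
      'description' => 'Purge old example records',
      // Crontab-like syntax: once a day at 03:00.
      'rule' => '0 3 * * *',
    ),
  );
}

/**
 * The cron callback declared above (hypothetical task).
 */
function example_purge_old_records() {
  db_query("DELETE FROM {example_log} WHERE created < %d", time() - 30 * 24 * 60 * 60);
}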

Examples of Elysia Cron in action

Here is captcha's cron, which has been configured to run only once a day in the early hours of the morning:

The dblog cron also runs once a day; there is no need to trigger it every hour or twice an hour.

Search is shown here to be the heaviest of all cron hooks. We still run it twice an hour, though, so that the search index stays fresh.

Statistics cleanup is kind of heavy too, so we run it only once a day.

Finally, xmlsitemap is a useful module, yet it is also heavy on a site with lots of nodes. Therefore we run it only once a day.

The above schedules are not cast in stone for these modules. They will vary from one site to another depending on the server configuration, available resources, and data set sizes. Moreover, even for the same site, it is recommended to monitor regularly and adjust them on an ongoing basis.

Alternatives to Elysia Cron

Elysia Cron is not alone, though. There are other modules with overlapping functionality, such as Super Cron, Cron Plus, and even a Cron API module. Super Cron seems promising, but Elysia does everything we need so far, so evaluating it is low on the list of priorities.

Here is an attempt to compare the various cron modules, but so far it is sparse on information.

A more powerful, but also more complex and heavyweight, solution is to use a tool like the Hudson continuous integration server. Since it runs on Java, it adds dependencies to the usual LAMP-only server and is more demanding on resources. You can read a full article on it here.

Aug 07 2010
Aug 07

With the "Cloud" being in vogue currently, we see a lot of clients asking for cloud solutions, mostly Amazon AWS. Sadly, this is often done without a proper evaluation of whether the cost is reasonable or the technology is suitable for their specific needs.

Amazon AWS provides some unique and compelling features. Among those are: instant provisioning of virtual servers, billing for used resources only, ability to provision more instances on demand, a wide variety of instance types, and much more.

We certainly like Amazon AWS for development and testing work, and for specific use cases such as seasonal sites.

For most high traffic sites though, Amazon AWS can be overly expensive, and not performant enough.

Before you decide to use Amazon, spend some time studying the various Amazon instance types, and the pricing that will be incurred. You may be surprised!

Here is a case study of a client that was on Amazon until recently, and we moved them to a more custom setup, with great results.

The client is a specialized news site, and gets linked to often from other high traffic sites such as Yahoo's front page, and the Drudge Report.

The site was originally hosted at another big name hosting company, but unfortunately they went down several times due to data center power issues.

After moving to Amazon AWS, with the setup below, the site was a bit sluggish, and when the traffic spikes described above happened, the setup could not cope with the increased traffic load ...

Amazon AWS Setup

The setup relied on Amazon's Elastic Load Balancer (ELB) front ending the site.

Behind the load balancer, there were a total of 4 instances, varying in type.

First, there were two web servers, each of them an m1.large instance.

Another m1.small instance acted as the NFS server for both web servers.

Finally another m1.large instance housed the MySQL database.

To summarize:

2 x web servers (each m1.large)
1 x MySQL database server (m1.large)
1 x NFS server (m1.small)

The cost was high for what it delivered: the EC2 compute cost alone for these instances was around $920 per month.

Additionally, there were 331 million I/O requests costing $36 per month. ELB was an additional $21 per month.

Storage and bandwidth brought the total to $990 per month.

Setup Drawbacks

The drawbacks of such a setup are many:

First, there is complexity: there are many components, and each one requires patching with security updates, monitoring of disk space and performance, and other administration tasks.

Second, it was not fault tolerant. The only redundant part was the web tier, with two servers present. If the database server or the NFS server crashed, the entire setup would stop serving pages.

Third, the cost was too high compared to a single well configured server, which comes in at almost half the cost.

Fourth, Amazon's ELB load balancer forces the "www." prefix on the site, which is not a big deal for most sites, but some want to be known without that prefix.

Fifth, the performance was not up to par. The site was sluggish most of the time.

Finally, the setup was not able to handle traffic spikes adequately.

The Solution

After doing a full Drupal site performance assessment, 2bits.com recommended and implemented a new setup consisting of a single medium sized dedicated server for $536 per month.

The server has quad-core Xeon E5620 processors at 2.4 GHz, 8 GB of RAM, and 4 disks (each pair forming a mirror).

We then did a full server installation, configuring, tuning, and optimizing the entire LAMP stack from scratch on our recommended Ubuntu 8.04 LTS Server Edition. Instead of using Varnish, we used only memcached.
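For reference, pointing Drupal's cache at memcached via the memcache module is mostly a settings.php change. A typical snippet looks roughly like the following; the module path and server address are assumptions that depend on your install:

// In settings.php: route Drupal's cache layer through the memcache module.
$conf['cache_inc'] = './sites/all/modules/memcache/memcache.inc';
// One local memcached instance; adjust host:port and bins as needed.
$conf['memcache_servers'] = array('127.0.0.1:11211' => 'default');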

The setup is mostly like what we did for another much higher traffic site. You can read the details at: 2.8 million page views per day: 70 million per month: one server!

After consulting with the client, we recommended going a step further: using the additional budget to implement a near fault-tolerant setup. The second server is in another data center, and its monthly cost is $464.

The Results

The result is that we now have a satisfied client, happy with the new setup and free of the headaches they used to face when traffic spikes happened.

Here are the graphs showing a large spike in July 2010.

Two traffic spikes in Google Analytics: traffic shot up from the normal 81,000 to 83,000 page views per day to 244,000. The spike on July 12th was 179,000 page views.

Apache accesses per second, by day

Apache volume per second, by day

CPU utilization per day, with no noticeable spike

Memcache utilization, showing that it took most of the load

Lessons Learned

There are lots of good lessons to be learned from this case study:

A performance assessment is valuable before deciding what hosting to use. It gives you a baseline and reveals any bottlenecks your site may have.

In most cases, we advocate simplicity over complexity. Start simple, and then add complexity when and where needed.

Try to make the most of vertical scaling, before you go horizontal.

Amazon AWS is great for development and specific use cases. It may not be your most cost effective option for high traffic sites though.

Memcache, used properly, will get you far on your journey to scalability and performance.

Further reading

There are lots of links on the web about Amazon AWS and hosting LAMP on it.

Here are a select few recent Drupal specific presentations and podcasts:
