Jul 15 2015
Jul 15

Regardless of industry, staff size, and budget, many of today’s organizations have one thing in common: they’re demanding the best content management systems (CMS) to build their websites on. With requirement lists that can range from 10 to 100 features, an already short list of “best CMS options” shrinks even further once “user-friendly”, “rapidly-deployable”, and “cost-effective” are added to the list.

There is one CMS, though, that not only meets the core criteria of ease of use, reasonable pricing, and flexibility, but also offers a long list of other valuable features: Drupal.

With Drupal, both developers and non-developer admins can deploy a long list of robust functionalities right out of the box. This powerful, open source CMS allows for easy content creation and editing, as well as seamless integration with numerous third-party platforms (including social media and e-commerce). Drupal is highly scalable, cloud-friendly, and intuitive. Did we mention it’s competitively priced, too?

In our “Why Drupal?” 3-part series, we’ll highlight some features (many of which you know you need, and others you may not have even considered) that make Drupal a clear front-runner in the CMS market.

For a personalized synopsis of how your organization’s site can be built on or migrated to Drupal with amazing results, grab a free ticket to Drupal GovCon 2015, where you can speak with one of our site migration experts, or contact us through our website.

_______________________________

SEO + Social Networking:

Unlike other content management software, Drupal does not get in the way of SEO or social networking. With a properly built theme and a few add-on modules, a highly optimized site can be created. There are even modules that provide an SEO checklist and monitor the site’s SEO performance. The Metatag module ensures continued support for the latest meta tags used by various social networking sites when content is shared from Drupal.

E-Commerce:

Drupal Commerce is an excellent e-commerce platform that uses Drupal’s native information architecture features. One can easily add desired fields to products and orders without having to write any code. There are numerous add-on modules for reports, order workflows, shipping calculators, payment processors, and other commerce-based tools.

Search:

Drupal’s native search functionality is strong. There is also a Search API module that allows site managers to build custom search widgets with layered search capabilities. Additionally, there are modules that enable integration of third-party search engines, such as Google Search Appliance and Apache Solr.

Third-Party Integration:

Drupal not only allows for the integration of search engines, but also a long list of other tools. The Feeds module allows Drupal to consume structured data (for example, XML and JSON) from various sources. The consumed content can be manipulated and presented just like content created natively in Drupal. Content can also be exposed through a RESTful API using the Services module. The format and structure of the exposed content is highly configurable and requires no programming.
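
As a rough illustration of consuming such an endpoint, the sketch below pulls a node from a remote site's Services REST endpoint using Drupal 7's drupal_http_request(). The remote domain and the "api" endpoint path are assumptions; they depend entirely on how the Services endpoint is configured.

/**
 * Fetch a node from a remote Drupal site's Services REST endpoint.
 * The remote domain and the 'api' path are hypothetical examples.
 */
function example_fetch_remote_node($nid) {
  $url = 'http://remote.example.com/api/node/' . $nid . '.json';
  $response = drupal_http_request($url);
  if ($response->code == 200) {
    // Decode the JSON body into an associative array.
    return drupal_json_decode($response->data);
  }
  return FALSE;
}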

Taxonomy + Tagging:

Taxonomy and tagging are core Drupal features. The ability to create categories (dubbed “vocabularies” by Drupal) and then create unlimited terms within each vocabulary is connected to the platform’s robust information architecture. To make taxonomy even easier, Drupal provides a drag-and-drop interface for organizing terms into a hierarchy, if needed. Content managers can reuse vocabularies for multiple functions, eliminating the need to replicate effort. For example, a single vocabulary could be used for content tagging, for building complex drop-down lists and user groups, or even for building a menu structure.
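
For developers, vocabularies and terms can also be created programmatically. Below is a minimal Drupal 7 sketch using the core taxonomy API; the vocabulary and term names are made up purely for illustration.

/**
 * Create a "Topics" vocabulary with a few example terms.
 */
function example_create_topics_vocabulary() {
  $vocabulary = (object) array(
    'name' => 'Topics',
    'machine_name' => 'topics',
  );
  taxonomy_vocabulary_save($vocabulary);

  // taxonomy_vocabulary_save() populates $vocabulary->vid on success.
  foreach (array('News', 'Events', 'Tutorials') as $name) {
    $term = (object) array(
      'name' => $name,
      'vid' => $vocabulary->vid,
    );
    taxonomy_term_save($term);
  }
}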

Workflows:

There are a few contributed modules that provide workflow functionality in Drupal. They all provide common workflow features along with unique capabilities for various use cases. The most popular options are Maestro and Workbench.

Security:

Drupal has a dedicated security team that reacts very quickly to vulnerabilities found in Drupal core as well as in contributed modules. If a security issue is found in a contributed module, the security team notifies the module maintainer and gives them a deadline to fix it. If the module is not fixed by the deadline, the security team issues an advisory recommending that the module be disabled, and classifies the module as unsupported.

Cloud, Scalability, and Performance:

Drupal’s architecture makes it incredibly “cloud friendly”. It is easy to create a Drupal site that can be set up to auto-scale (i.e., add more servers during peak traffic times and shut them down when not needed). Some modules integrate with cloud storage such as Amazon S3. Further, Drupal is built for caching. By default, Drupal caches content in the database for quick delivery; support for other caching mechanisms (such as Memcache) can be added to make caching lightning fast.
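
As an example of how little configuration the caching layer takes, here is a hedged settings.php sketch for the contributed Memcache module on Drupal 7. The module path is an assumption; adjust it to wherever the module is actually installed.

// Route Drupal's cache bins through the Memcache module (Drupal 7).
$conf['cache_backends'][] = 'sites/all/modules/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
// Keep the form cache in the database, as the module recommends.
$conf['cache_class_cache_form'] = 'DrupalDatabaseCache';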

Multi-Site Deployments:

Drupal is architected to allow for multiple sites to share a single codebase. This feature is built-in and, unlike WordPress, it does not require any cumbersome add-ons. This can be a tremendous benefit for customers who want to have multiple sites that share similar functionality. There are few–if any–limitations to a multi-site configuration. Each site can have its own modules and themes that are completely separate from the customer’s other sites.
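
To give a rough idea of how this works in Drupal 7 and later: each site gets its own directory under sites/ (with its own settings.php, modules, and themes), and an optional sites/sites.php file maps hostnames to those directories. The domains and directory names below are examples only.

// sites/sites.php: map incoming hostnames to site directories that all
// share one Drupal codebase. Example names only.
$sites['www.example.com'] = 'example_com';
$sites['store.example.com'] = 'store_example_com';
$sites['intranet.example.com'] = 'intranet_example_com';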

Want to know other amazing functionalities that Drupal has to offer? Stay tuned for the final installment of our 3-part “Why Drupal?” series!

Mar 27 2013
Mar 27

Today, Khalid gave a presentation on Drupal Performance and Scalability for members of the London (Ontario) Drupal Users Group.

The slides from the presentation are attached below.

Attachment size: 498.3 KB
Apr 17 2012
Apr 17

A lot of very interesting things are happening to make Drupal's caching system a bit smarter. One of my favorite recent (albeit smaller) developments is a patch (http://drupal.org/node/1471200) for the Views module that allows for cached views to have no expiration date. This means that the view will remain in the cache until it is explicitly removed.

Before this patch landed, developers were forced to set an arbitrary time limit for how long Views would store the cached content. So even if your view's content only changed every six months, you had to choose a time limit from a list of those predefined by Views, the maximum of which was 6 days. Every six days, the view content would be flushed and regenerated, regardless of whether its contents had actually changed or not.

The functionality provided by this patch opens the door for some really powerful behavior. Say, for instance, that I have a fairly standard blog view. Since I publish blog posts somewhat infrequently, I would only like to clear this view's cache when a new blog post is created, updated, or deleted.

To set up the view to cache indefinitely, click on the "Caching" settings in your view and select "Time-based" from the pop-up.

Then, in the Caching settings form that follows, set the length of time to "Custom" and enter "0" in the "Seconds" field. You can do the same for the "Rendered output" settings if you'd like to also cache the rendered output of the view.

Once you save your view, you should be all set.

Next, we need to manually invalidate the cached view whenever its content changes. There are a couple of different ways to do this, depending on what sort of content is included in the view (including both of the modules linked to above). In this case, I'll keep it lightweight and act on hooks in a custom module:

/**
 * Implements hook_node_insert().
 */
function MY_MODULE_node_insert($node) {
  if ($node->type == 'blog') {
    // Wildcard-clear every cached entry for the 'blog' view.
    cache_clear_all('blog:', 'cache_views', TRUE);
  }
}
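
The same applies to hook_node_update() and hook_node_delete(). A minimal sketch, assuming the same 'blog' content type and 'blog:' cache prefix used above:

/**
 * Implements hook_node_update().
 */
function MY_MODULE_node_update($node) {
  if ($node->type == 'blog') {
    cache_clear_all('blog:', 'cache_views', TRUE);
  }
}

/**
 * Implements hook_node_delete().
 */
function MY_MODULE_node_delete($node) {
  if ($node->type == 'blog') {
    cache_clear_all('blog:', 'cache_views', TRUE);
  }
}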

And just like that, my view is only regenerated when it needs to be, and should be blazing fast in between.

The patch was committed to the 7.x-3.x branch of Views on March 31, 2012, so for now you will have to apply the patch manually until the next point release.

Happy caching!

Dec 23 2011
Dec 23

About thegateway.org:

The Gateway has been serving teachers continuously since 1996, which makes it one of the oldest publicly accessible U.S. repositories of education resources on the Web. The Gateway contains a variety of educational resource types, from activities and lesson plans to online projects to assessment items.

The older version of the website ran on Plone, and the team hired us to migrate it to Drupal. It was absolutely the right choice to make, given the many additional benefits that come with Drupal.

We redesigned the existing website, giving it a new look on Drupal, and then hosted it on Acquia's managed cloud to boost its performance and scalability. The new look is more compact, organized, and easier to use.

It was a very interesting project for us and our team is proud to be a part of such a great educational organization serving the nation.

Looking forward to a grand success of the new launch!

thegateway.org BEFORE:

 

thegateway.org NOW:

Dec 09 2011
Dec 09

Drupal can power any site, from the lowliest blog to the highest-traffic corporate dot-com. Come learn about the high end of the spectrum with this comparison of techniques for scaling your site to hundreds of thousands or millions of page views an hour. This Do it with Drupal session with Nate Haug will cover software that you need to make Drupal run at its best, as well as software that acts as a front-end cache (a.k.a. reverse-proxy cache) that you can put in front of your site to offload the majority of the processing work. This talk will cover the following software and architectural concepts:

  • Configuring Apache and PHP
  • MySQL Configuration (with Master/Slave setups)
  • Using Memcache to reduce database load and speed up the site
  • Using Varnish to serve up anonymous content lightning fast
  • Hardware overview for high-availability setups
  • Considering nginx (instead of Apache) for high amounts of authenticated traffic
Sep 22 2011
Sep 22

Tomorrow is the last day of Summer but the Drupal training scene is as hot as ever. We’ve scheduled a number of trainings in Los Angeles this Fall that we’re excited to tell you about, and we’re happy to publicly announce our training assistance program.

First, though, we’re sending out discount codes on Twitter and Facebook. Follow @LarksLA on Twitter, like Exaltation of Larks on Facebook or sign up to our training newsletter at http://www.larks.la/training to get a 15% early bird discount* toward all our trainings!

Los Angeles Drupal trainings in October and November, 2011

Here are the trainings we’ve lined up. If you have any questions, visit us at http://www.larks.la/training or contact us at trainings [at] larks [dot] la and we’ll be happy to talk with you. You can also call us at 888-LARKS-LA (888-527-5752) with any questions.

Beginner trainings:

Intermediate training:

Advanced trainings:

All our trainings are $400 a day (1-day trainings are $400, 2-day trainings are $800, etc.). We’re excited about these trainings and hope you are, too. Here are some more details and descriptions.

Training details and descriptions

   Drupal Fundamentals
   October 31, 2011
   http://ex.tl/df7

Drupal Fundamentals is our introductory training that touches on nearly every aspect of the core Drupal framework and covers many must-have modules. By the end of the day, you’ll have created a Drupal site that looks and functions much like any you’ll see on the web today.

This training is for Drupal 7. For more information, visit http://ex.tl/sbd7

   Drupal Scalability and Performance
   October 31, 2011
   http://ex.tl/dsp1

In this advanced Drupal Scalability and Performance training, we’ll show you the best practices for running fast sites for a large volume of users. Starting with a blank Linux virtual server, we’ll work together through the setup, configuration and tuning of Drupal using Varnish, Pressflow, Apache, MySQL, Memcache and Apache Solr.

This training is for both Drupal 6 and Drupal 7. For more information, visit http://ex.tl/dsp1

   Drupal Architecture (Custom Content, Fields and Lists)
   November 1 & 2, 2011
   http://ex.tl/ccfl1

Drupal Architecture (Custom Content, Fields and Lists) is our intermediate training where we explore modules and configurations you can combine to build more customized systems using Drupal. You’ll create many examples of more advanced configurations and content displays using the popular Content Construction Kit (CCK) and Views modules.

This training is for Drupal 6. For more information, visit http://ex.tl/ccfl1

   Developing RESTful Web Services and APIs
   November 3, 4 & 5, 2011
   http://ex.tl/dwsa1

Offered for the first time in Southern California, Developing RESTful Web Services and APIs is an advanced 2-day training (with an optional third day of additional hands-on support) for those developers seeking accelerated understanding of exploiting Services 3.0 to its fullest. This is THE training you need if you’re using Drupal to create a backend for iPad, iPhone or Android applications.

This training covers both Drupal 6 and Drupal 7. For more information, visit
http://ex.tl/dwsa1

Training assistance program

In closing, we’d like to tell you about our training assistance program. For each class, we’re setting aside a limited number of seats for students, unemployed job seekers and people in need.

For more details about the program, contact us at trainings [at] larks [dot] la and we’ll be happy to talk with you. You can also call us at 888-LARKS-LA (888-527-5752) with any questions.

* Our early bird discount is not valid toward the Red Cross First Aid, CPR & AED training and 2-year certification that we’re organizing. It’s already being offered at nearly 33% off, so sign up today. You won’t regret it and you might even save someone’s life. ^

May 20 2011
May 20
http://www.flickr.com/photos/essjay/224318029/

Last night I made a presentation on the “Business of Drupal” to the Sydney Drupal users meetup. The talk covered scalable jobs and wild randomness, basic business models in the software industry, the GPL, eight business models for Drupal in increasing order of scalability, ways developers can deepen their skills, and a round-up of how various organisations in the Drupal community are structuring the way they do business. I have just uploaded the slides to the talk.

For those wanting a little bit more detail without going to the slides, I’ll reproduce some of the content here.

Drupal business models in increasing order of scalability

1. Employment

  • Employment at Drupal shop or company
  • Income limited by salary (skill, experience)
  • Non scalable
  • Very regular

2. Pure services

  • Contractors, Drupal shops, F2F training
  • eg. Cross Functional, Previous Next
  • Income limited by incoming jobs (supply) and staff
  • Non scalable due to staffing requirements
  • Variable regularity, no subscriptions

3. GPL products with services

  • Distribution owners, module authors
  • eg. Phase2, Ubercart
  • Income limited by product popularity and staff
  • Non scalable due to staffing requirements
  • Variable regularity

4. Drupal hosting platform

  • Drupal hosting
  • eg. Acquia Dev Cloud, Managed Cloud, Chapter Three Pantheon, Omega8cc Aegir
  • Overhead of maintaining platform – Aegir
  • Scalable
  • Regular

5. Drupal as a service (DaaS)

  • Drupal running as a SaaS
  • eg. Drupal Gardens, Buzzr, wordpress.com
  • Overhead of maintaining platform
  • Scalable
  • Regular

6. Software as a service (SaaS)

  • Service accessed via bridge module.
  • eg. Mollom, Acquia Solr
  • Overhead of maintaining platform
  • Scalable
  • Regular

7. Products with some non GPL code

  • Themes
  • eg. Top Notch Themes
  • Overhead of developing product
  • Scalable
  • Irregular
  • Problem: Is the main IP in the code or the images?

8. Products with all non GPL code

  • Online training, documentation, books
  • eg. Lullabot drupalize.me
  • Overhead of developing product
  • Scalable (online training)
  • (Ir)regular

Possible areas of specialisation for service providers

  • Data migration: Data is like wine, code like fish
  • Theming: Where are the themers?
  • Custom module development
  • Project scoping
  • Verticals: distros
  • Server admin, deployment (?)
  • Performance (?)

The main takeaway idea from the talk was that working in non-scalable areas such as full time employment is a safe option which will yield good results so long as you have skill and apply yourself. However, exposing yourself a little to some “wild randomness” in the form of scalable ventures (startups, SaaS, distros) could be a worthwhile pursuit if you are successful.

Jan 21 2011
mg
Jan 21

Drupal is widely recognized as a great content management system, but we strongly believe that Drupal offers a lot more than that – a framework, a platform, and a set of technologies – to build and run enterprise applications, specifically on the cloud. This post is an attempt to explore the benefits and potential of Drupal on the cloud.

Elasticity

One of the last things customers should have to worry about with their websites is performance degradation due to a sudden spike in traffic. For years, customers had to size their servers to meet peak demand. They overpaid, and still failed to deliver on their promise at peak load. The cloud solves this elasticity problem really well, and if you are using Drupal, you automatically get the elasticity benefits, since Drupal’s modularized architecture - user management, web services, caching, etc. - is designed to scale up and scale down on the cloud under elastic load.

PaaS

If Heroku’s $212 million acquisition by Salesforce.com is any indication, the future of PaaS is bright. Drupal, at its core, is a platform. Companies such as Acquia, through Drupal Gardens, are doing a great job delivering the power of Drupal by making it incredibly easy for people to create, run, and maintain their websites. This is not a full-blown PaaS, but I don’t see why they cannot make it one. We also expect to see a lot more players jumping into this category. PaaS players such as phpfog and djangy have started gaining popularity amongst web developers.

Time-to-market and time-to-value

Drupal has helped customers move from concept to design to a fully functional, content-rich, interactive website in a relatively short period of time using built-in features and thousands of modules. The cloud further accelerates this process. Amazon and Rackspace have pre-defined high-performance Drupal images that customers can use to get started. Another option is to leverage a PaaS, as described above. The cloud not only accelerates time-to-market and time-to-value, but it also provides economic benefits during scale-up and scale-down situations.

Management

The cloud management tools experienced significant growth in the last two years, and this category is expected to grow even more as customers opt to simplify and unify their hybrid landscapes. With Drupal, customers can not only leverage the cloud management tools but also augment their application-specific management capabilities with Drupal modules such as Quant for tracking usage, Admin for managing administrative tasks, and Google Analytics for integration with Google Analytics. There is still a disconnect between the cloud-native management tools and the Drupal-specific management tools, but we expect them to converge and provide a unified set of tools to manage the entire Drupal landscape on the cloud.

Open source all the way

Not only is Drupal completely open source, but it also has direct integration with major open source components such as Memcached and Apache Solr, as well as native support for jQuery. This provides additional scale and performance benefits to Drupal on the cloud, and the entire stack is backed by vibrant open source communities.

Security

It took a couple of years for customers to overcome their initial adoption concerns around cloud security; at least they are now asking the right questions. Anything that runs on the cloud is expected to be scrutinized for its security as well. We believe that developers should not have to explicitly code for security; their applications should be secured by the framework that they use. Drupal not only leverages the underlying cloud security, but it also offers additional security features to prevent attacks such as cross-site scripting, session hijacking, and SQL injection. See the complete OWASP list of the top 10 security risks.

Search and Semantic Web

One of the core functions that any content website needs is search, and developers shouldn’t have to reinvent the wheel. Integration with Solr is a great way to implement search functionality without putting in monumental effort. Drupal also has built-in support for RDF and SPARQL for developers who are interested in the Semantic Web.

NoSQL

The cloud is a natural platform for NoSQL, and there has been immense ongoing innovation in the NoSQL category. For modern applications and websites, using NoSQL on the cloud is a must-have requirement in many cases. The cloud is a great platform for NoSQL, and so is Drupal. Drupal has modules for MongoDB and Cassandra, and modules for other NoSQL stores are currently being developed.
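
As a rough illustration, pointing the contributed MongoDB module at a MongoDB server takes only a few lines of settings.php. This is a hedged sketch; the host and database name are assumptions, and the module's documentation covers the full set of options.

// Minimal connection settings for the contributed MongoDB module (Drupal 7).
// The host and database name are examples only.
$conf['mongodb_connections'] = array(
  'default' => array(
    'host' => 'mongodb://localhost',
    'db' => 'drupal',
  ),
);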

Drupal started out as an inexpensive content management system, but it has crossed the chasm. Not only are developers extending Drupal by adding more modules and designing different distributions, but, more importantly, enterprise ISVs have also actively started exploring Drupal to make their offerings more attractive by creating extensions and leveraging the multi-site feature to set up multi-tenant infrastructure for their SaaS solutions. We expect that the cloud, as a runtime platform, will help Drupal, ISVs, and customers deliver compelling content management systems and applications on the cloud.

Aug 26 2008
Aug 26

a while ago i posted some performance benchmarks for drupal running on a variety of servers in amazon's elastic compute cloud.

amazon have just released ebs, the final piece of technology that makes their ec2 platform really viable for running lamp stacks such as drupal.

ebs, the "elastic block store", provides sophisticated storage for your database instance, with features including:

  • high io throughput
  • data replication
  • large storage capacity
  • hot backups using snapshots
  • instance type portability e.g. quickly swapping your database hardware for a bigger machine.

amazon also have a great mysql on ebs tutorial on their developer connection.

let me know if you've given this a go. it looks like a great platform.

Apr 15 2008
Apr 15
recently i posted some encouraging performance benchmarks for drupal running on a variety of servers in amazon's elastic compute cloud. while the performance was encouraging, the suitability of this environment for running lamp stacks was not. ec2 had some fundamental issues including a lack of static ip addresses and no viable persistent storage mechanism.

amazon are quickly rectifying these problems, and recently announced elastic ip addresses: a "static" ip address that you own and can dynamically point at any of your instances.

today amazon indicated that persistent storage will soon be available. they claim that this storage will:

  • behave like raw, unformatted hard drives or block devices
  • be significantly more durable than the local disks within an amazon ec2 instance
  • support snapshots backed up to S3
  • support volumes ranging in size from 1GB to 1TB
  • allow the attachment of multiple volumes to a single instance
  • allow high throughput, low latency access from amazon ec2
  • support applications including relational databases, distributed file systems and hadoop processing clusters using amazon ec2

if this works as advertised, it will make ec2 a wonderful platform for your lamp application. amazon promise public availability of this service later this year.

Jan 28 2008
Jan 28
amazon's elastic compute cloud, "ec2", provides a flexible and scalable hosting option for applications. while ec2 is not inherently suited for running application stacks with relational databases such as lamp, it does provide many advantages over traditional hosting solutions.

in this article we get a sense of lamp performance on ec2 by running a series of benchmarks on the drupal cms system. these benchmarks establish read throughput numbers for logged-in and logged-out users, for each of amazon's hardware classes.

we also look at op-code caching, and gauge its performance benefit in cpu-bound lamp deployments.

the elastic compute cloud

amazon uses xen based virtualization technology to implement ec2. the cloud makes provisioning a machine as easy as executing a simple script command. when you are through with the machine, you simply terminate it and pay only for the hours that you've used.

ec2 provides three types of virtual hardware that you can instantiate. these are summarized in the table below.

machine type           hourly cost   memory   cpu units   platform
small instance         $0.10         1.7 GB   1           32-bit
large instance         $0.40         7.5 GB   4           64-bit
extra large instance   $0.80         15 GB    8           64-bit

note: one compute unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

target deployments

to keep things relatively simple, the target deployment for our load test is basic; the full lamp stack runs on a single server. this is step zero in the five deployment steps that i outlined in an open-source infrastructure for high-traffic drupal sites.

our benchmark

our benchmark consists of a base drupal install, with 5,000 users and 50,000 nodes of content-type "page". nodes are an even distribution of 3 sizes, 1K, 3K and 22K. the total database size is 500Mb.

during the test, 10 threads read nodes continually over a 5 minute period. 5 threads operate logged-in. the other 5 threads operate anonymously (logged-out). each thread reads nodes randomly from the pool of 50,000 available.

this test is a "maximum" throughput test. it creates enough load to utilize all of the critical server resource (cpu in this case). the throughput and response times are measured at that load. tests to measure performance under varying load conditions would also be very interesting, but are outside the scope of this article.

the tests are designed to benchmark the lamp stack, rather than weighting it towards apache. consequently they do not load external resources. that is, external images, css, javascript files etc. are not loaded, only the initial text/html page. this effectively simulates drupal running with an external content server or cdn.

the benchmark runs in apache jmeter. jmeter runs on a dedicated small-instance on ec2.

benchmarking is done with op-code caching on and off. since our tests are cpu bound, op-code caching makes a significant difference to php's cpu consumption.

our testing environment

the tests use a debian etch xen instance, running on ec2. this instance is installed with:
  • MySQL: 5.0.32
  • PHP: 5.2.0-8
  • Apache: 2.2.3
  • APC: 3.0.16
  • Debian Etch
  • Linux kernel: 2.6.16-xenU

the tests use a default drupal installation. drupal's caching mode is set to "normal". no performance tuning was done on apache, mysql or php.

the results

all the tests ran without error. each of the tests resulted in the server running at close to 100% cpu capacity. the tests typically reached steady state within 30s. throughputs gained via jmeter were sanity checked for accuracy against the http and mysql logs. the raw results of the tests are shown in the table below.

instance   apc?   logged-in throughput   logged-in response   logged-out throughput   logged-out response
small      off    194                    1.50                 664                     0.45
large      off    639                    0.46                 2,703                   0.11
xlarge     off    1,360                  0.20                 3,741                   0.08
small      on     905                    0.30                 3,838                   0.07
large      on     3,106                  0.10                 8,033                   0.04
xlarge     on     4,653                  0.06                 12,548                  0.02

note: response times are in seconds, throughputs are in pages per minute

the results - throughput

the throughput of the system was significantly higher for the larger instance types. throughput for the logged-in threads was consistently 3x lower than the logged-out threads. this is almost certainly due to the drupal cache (set to "normal").

throughput was also increased by about 4x with the use of the apc op-code cache.


the results - response times

the average response times were good in all the tests. the slowest tests yielded average times of 1.5s. again, response times were significantly better on the better hardware and reduced further by the use of apc.


conclusions

drupal systems perform very well on amazon ec2, even with a simple single machine deployment. the larger hardware types perform significantly better, producing up to 12,500 pages per minute. this could be increased significantly by clustering as outlined here.

the apc op-code cache increases performance by a factor of roughly 4x.

these results are directly applicable to other cpu-bound lamp application stacks. more consideration should be given to applications bound by other resources, such as database queries. for example, in a database-bound system, drupal's built-in cache would improve performance more significantly, creating a bigger divergence in logged-out vs logged-in throughput and response times.

although performance is good on ec2, i'm not recommending that you rush out and deploy your lamp application there. there are significant challenges in doing so and ec2 is still in beta at the time of writing (Jan 08). it's not for the faint-of-heart. i'll follow up in a later blog with more details on recommended configurations.

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.

resources

Jan 19 2008
Jan 19
i recently posted an introductory article on using jmeter to load test your drupal application. if you've read this article and are curious about how to build a more sophisticated test that mimics realistic load on your site, read on.

the previous article showed you how to set up jmeter and create a basic test. to produce a more realistic test you should simulate "real world" use of your site. this typically involves simulating logged-in and logged-out users browsing and creating content. jmeter has some great functionality to help you do this.

as usual, all code and configurations have been tested on debian etch but should be useful for other *nix flavors with subtle modifications. also, although i'm discussing drupal testing, the method below really applies to any web application. if you aren't already familiar with jmeter, i'd strongly recommend that you read my first post before this one.

an overview

the http protocol exchanges for realistic tests are quite complex, and painful to replicate manually. jmeter kindly includes http-proxy functionality that allows you to "record" browser-based actions, which can be used to form the basis of your test. after recording, you can manually edit these actions to sculpt your test precisely.

our test - browsers and creators

as an example, let's create a test with two test groups: creators and browsers. creators are users that arrive at the site, stay logged out, browse a few pages, create a page and then leave. browsers are less motivated individuals. they arrive at the site, log in, browse some content and then leave.

setting up the test - simulating creators

to create our test, fire up jmeter and do the following.

create a thread group. call it "creators". add a "http request defaults" object to the thread group. check the "retrieve all embedded resources from html files" box.

add a cookie manager to the thread group of type "compatibility". add an "http proxy server" to the workbench, as follows:


modify the "content-type filter" to "text/html". your jmeter-proxy should now look like:


navigate in your browser to the start of your test e.g. your home page. clear your cookies (using the clear private data setting). open up the "connection settings option" in firefox preferences and specify a manual proxy configuration of localhost, port 8080. this should look like:


note: you can also do this using internet explorer. in ie7 go to the "connections" tab of the internet options dialog. click the "lan settings" button, and setup your proxy.

start the jmeter-proxy. record your test by performing actions in your browser: (a) browse to two pages and (b) create a page. you should see your test "writing itself". that should feel good.

now stop the jmeter-proxy. your test should look similar to:


setting up the test - simulating browsers

create another thread group above the first. call it browsers. again, add a "http request defaults" object to the thread group. check the "retrieve all embedded resources from html files" box.

add a cookie manager to the thread group of type "compatibility". start the jmeter-proxy again. record your test by performing actions: (a) login and then (b) browse three pages. your test should look like:


stop the jmeter-proxy. undo the firefox proxy.

setting up the test - cleaning up

you can now clean up the test as you see fit. i'd recommend:
  • change the number of threads and iterations on both thread-groups to simulate the load that you care about.
  • modify the login to happen only once on a thread. see the diagram below.


and optionally:

  • rename items to be more meaningful.
  • insert sensible timers between requests.
  • insert assertions to verify results.
  • add listeners to each thread group. i recommend a "graph results" and a "view results tree" listener.

your final test should look like the one below. note that i didn't clutter the example with assertion and timers:


running your test

you should now be ready to run your test. as usual, click through to the detailed results in the tree to verify that your test is doing something sensible. ideally you should do this automatically with assertions. your results should look like:


notes

the test examples that i chose intentionally avoided logged-in users creating content. you'll probably want these users to create content, but you'll likely get tripped up by drupal's form token validation, designed to block spammers and increase security. modifying the test to work around this is beyond the scope of this article, and probably not the best way to solve the problem. if someone knows of a nice clean way to disable this in drupal temporarily, perhaps they could comment on this article.

resources

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.
Jan 14 2008
Jan 14
there are many things that you can do to improve your drupal application's scalability, some of which we discussed in the recent scaling drupal - an open-source infrastructure for high-traffic drupal sites article.

when making scalability modifications to your system, it's important to quantify their effect, since some changes may have no effect or even decrease your scalability. the value of advertised scalability techniques often depends greatly on your particular application and network infrastructure, sometimes creating additional complexity with little benefit.

apache jmeter is a great tool to simulate load on your system and measure performance under that load. in this article, i demonstrate how to set up a testing environment, create a simple test and evaluate the results.

as usual, all code and configurations have been tested on debian etch but should be useful for other *nix flavors with subtle modifications. also, although i'm discussing drupal testing, the method below really applies to any web application.

the testing environment

you should install and run the jmeter code on a server that has good resources and high-bandwidth, low-latency network access to your application server or load balancer. the maximum load that you can simulate is clearly constrained by these parameters, and so is the accuracy of your timing results. therefore, for very large deployments you may need to run multiple non-gui jmeter instances on several test machines, but for most of us a simple one-test-machine configuration will suffice; i recently simulated over 12K pageviews/minute from a modest single-core server that wasn't close to capacity.

jmeter has a great graphical interface that allows you to define, run and analyze your tests visually. a convenient way to run this is to ssh to the jmeter test machine using x forwarding, from a machine running an x server. this should be as simple as issuing the command:

$ ssh -X testmachine.example.com

note, you'll need a minimal x install on your server for this. you can get one with:

$ sudo apt-get install xserver-xorg-core xorg

and then running the jmeter gui from that ssh session. jmeter should now appear on your local display, but run on the test machine itself. if you are having problems with this, skip to troubleshooting at the end of this article. this setup is good for testing a remote deployment. you can also run the gui on windows.

x forwarding can become unbearably slow once your test is running, if the test saturates your test server's network connection. if so, you might consider defining the test using the gui and running it on the command line. read more about remote testing on the apache site, and on command line jmeter later in this article.

setting up the test server - download and install java

jmeter is a 100% java implementation, so you'll need a functional java runtime install.

if you don't have java 1.4 or later, then you should start by installing it. to do so, make sure you've got a line in /etc/apt/sources.list like this:

deb http://ftp.debian.org/debian/ etch main contrib non-free

if you don't then add it, and do a apt-get update. once you've done this, do:

$ sudo apt-get install sun-java5-jre

installation on vista is as easy as downloading and installing the latest zip from http://jakarta.apache.org/site/downloads/downloads_jmeter.cgi, unzipping it and running jmeter.bat. please don't infer that i'm condoning or suggesting the use of windows vista ;)

setting up the test server - download and install jmeter

next, download the latest stable version of jmeter from the jmeter download page, for example:

$ wget http://apache.mirrors.tds.net/jakarta/jmeter/binaries/jakarta-jmeter-2.3.1.tgz

and then install it:

$ tar xvfz jakarta-jmeter-2.3.1.tgz

you should now be able to run it by:

$ cd ./jakarta-jmeter-2.3.1/bin
$ ./jmeter

if you are having problems running jmeter, see the troubleshooting section at the end of this article.

setting up a basic test

jmeter is a very full featured testing application. we'll scratch the surface of its functionality and set up a fairly simplistic load test. you may want to do something a bit more sophisticated, but this will at least get you started.

to create the basic test, run jmeter as described above. the first step is to create a "thread group" object. you'll use this object to define the simulated number of users (threads) and the duration of the test. right mouse click the test plan node and select:
add -> thread group

specify the load that you'll exert on your system, for example, pick 10 users (threads) and a loop count (how many times each thread will execute your test). you can optionally modify the ramp-up period e.g. a 10s ramp-up in this example would create one new user every second.

now add a sampler by right mouse clicking the new thread group and choosing:
add -> sampler -> http request. make sure to check the box "retrieve all embedded resources from html files", to properly simulate a full page load.

now add a listener to view the detailed results of your requests. the "results tree" is a good choice. add this to your thread group by selecting: add -> listener -> view results tree. note that after you run your test, you can select a particular request in the left panel and then select the "response data" tab on the right, to verify that you are getting a sensible response from your server, as shown below.

finally let's add another listener to graph our result data. choose:
add -> listener -> graph results. this produces a graph similar to the graph on the right.

if you want to create a more sophisticated test, you'll probably want to create realistic use scenarios, including multiple requests spaced out using timers, data creation by logged in users etc. you'll probably want to verify results with assertions. all of this is relatively easy, and you can read more on apache's site about creating a test plan. you can get information on login examples and cookie support here. you can also read the follow up to this blog: load test your drupal application scalability with apache jmeter: part two

running your test

controlling your test is now a simple matter of choosing the menu items: run -> start, run -> stop, run -> clear all etc. it's very intuitive. while your test is running, you can select the results graph, and watch the throughput and performance statistics change as your test progresses.

if you'd like to run your test in non-gui mode, you can run jmeter on the command line as follows:

$ jmeter --nongui --testfile basicTest.jmx --logfile /tmp/results.jtl

this would run a test defined in the file basicTest.jmx, and output the results of the test in a file called /tmp/results.jtl. once the test is complete, you could, for example, copy the results file locally and run jmeter to visually inspect and analyse the results, with:

$ jmeter --testfile basicTest.jmx

or just run jmeter as normal and then open your test.

you may then use the listener of choice (e.g. "graph results") to open your results file and display the results.

interpreting your drupal results

most production sites run with drupal's built-in caching turned on. you can look at your performance setting in the administration page at: http://www.example.com/admin/settings/performance. this caching makes a tremendous difference to throughput, but when users are logged in, they bypass this cache.

therefore, to get a realistic idea of your site performance, it's a good idea to calibrate your system with caching on and caching off, and linearly interpolate the results to get a true idea of your maximum throughput. for example, if your throughput is 1,000 views per minute with caching and 100 without, and at any given point in time 50% of your users are logged in, you could estimate your throughput at (1,000 + 100) / 2 = 550 views per minute.
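
the same interpolation can be written as a tiny helper; this is just the weighted average described above, with hypothetical function and variable names:

/**
 * estimate overall throughput (pages/minute) by weighting the cached
 * (anonymous) and uncached (logged-in) results by the share of
 * logged-in users. purely illustrative.
 */
function estimate_throughput($cached, $uncached, $logged_in_fraction) {
  return (1 - $logged_in_fraction) * $cached + $logged_in_fraction * $uncached;
}

// 1,000 views/min cached, 100 uncached, 50% logged in => 550 views/min.
echo estimate_throughput(1000, 100, 0.5);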

alternatively, you could build a more sophisticated test that simulates close-to-realistic site access including logged-in sessions. clearly, the more work you put into your load tests, the more accurate your results will be. see the followup article for details on building a more sophisticated test.

an example test - would a static file server or cdn help your application?

jmeter allows you to easily estimate the effect of configuration changes, sometimes without actually making the changes. recently i read robert douglass' interesting article on using lighttpd as a static file server for drupal, and i was curious how much of a difference that would make.

simply un-checking the "retrieve all embedded resources from html files" on the http request allowed me to simulate all the static resources coming from another (infinitely fast) server.

for my (image intensive) application the results were significant, about a 3x increase in throughput. clearly the real number depends on many factors, including the static resources (images, flash etc.) used by your application and the ratio of first-time to repeat users of your site (repeat users have your content cached). it seems fair to say that this technique would significantly improve throughput for most sites, and presumably page performance would be significantly improved too, especially if the static resources were cdn hosted.

troubleshooting your jmeter install

if you are having problems with your jmeter install, then:

make sure that the java version you are running is compatible i.e. 1.4 or later, by:

$ java -version
java version "1.5.0_10"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_10-b03)

make sure that you have all the dependencies installed. if you get the error "cannot load awt toolkit: gnu.java.awt.peer.gtk.gtktoolkit", you might have to install the gcj runtime library. do this as follows:

$ sudo apt-get install libgcj7-awt

if jmeter hangs or stalls, you probably don't have the right java version installed or on your path.

if you're still having problems, take a look in the application log file jmeter.log for clues. this gets created in the directory that you run jmeter in.

if you are having problems getting x forwarding to work, make sure that it is enabled in your sshd config file e.g. /etc/ssh/sshd_config. you should have a line like:

X11Forwarding yes

if you change this, don't forget to restart the ssh daemon.

resources

further reading

if you'd like to build a more sophisticated test, take a look at my next blog: load test your drupal application scalability with apache jmeter: part two.

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.

thanks to curtis (serverjockey) hilger for introducing me to jmeter.

Nov 15 2007
Nov 15

if you've setup a clustered drupal deployment (see scaling drupal step three - using heartbeat to implement a redundant load balancer), a good next-step, is to scale your database tier.

in this article i discuss scaling the database tier up and out. i compare database optimization and different database clustering techniques. i go on to explore the idea of database segmentation as a possibility for moderate drupal scaling. as usual, my examples are for apache2, mysql5 and drupal5 on debian etch. see the scalability overview for related articles.

deployment overview

this table summarizes the characteristics of this deployment choice:

scalability: good
redundancy: fair
ease of setup: poor

servers

in this example, i use:

web server            drupal-lb1.mydomain.com            192.168.1.24
data server           drupal-data-server1.mydomain.com   192.168.1.26
data server           drupal-data-server2.mydomain.com   192.168.1.27
data server           drupal-data-server3.mydomain.com   192.168.1.28
mysql load balancer   mysql-balance-1.mydomain.com       192.168.1.94

first steps first - optimizing your database and application

the first step to scaling your database tier should include identifying problem queries (those taking most of the resources), and optimizing them. optimizing may mean reducing the volume of the queries by modifying your application, or increasing their performance using standard database optimization techniques such as building appropriate indexes. the devel module is a great way to find problem queries and functions.

another important consideration is the optimization of the database itself, by enabling and optimizing the query cache, tuning database parameters such as the maximum number of connections etc. using appropriate hardware for your database is also a huge factor in database performance, especially the disk io system. a large raid 1+0 array for example, may do wonders for your throughput, especially combined with a generous amount of system memory available for disk caching. for more on mysql optimization, take a look at the great o'reilly book by jeremy zawodny and derek balling on high performance mysql.

when it's time to scale out rather than up

you can only (and should only) go so far scaling up. at some point you need to scale out. ideally, you want a database clustering solution that allows you to do exactly that. that is, add nodes to your database tier, completely transparently to your application, giving you linear scalability gains with each additional node. mysql cluster promises exactly this. it doesn't offer full transparency however, due to limitations introduced by the ndb storage engine required by mysql cluster. having said that, the technology looks extremely promising and i'm interested if anyone has got a drupal application running successfully on this platform. you can read more on mysql clustering on the mysql cluster website or in the mysql clustering book by alex davies and harrison fisk.

less glamorous alternatives to mysql cluster

without the magic of mysql cluster, we've still got some, admittedly less glamorous, alternatives. one is to use a traditional mysql database cluster, where all writes go to a single master and reads are distributed across several read-only-nodes. the master updates the read-only-nodes using replication.

an alternative is to segment read and write requests by role, thereby partitioning the data into segments, each one resident on a dedicated database.

these two approaches are illustrated below:

there are some significant pitfalls to both approaches:

  • the traditional clustering approach, introduces a replication lag i.e. it takes a non-trivial amount of time, especially under load, for writes to make it back to the read-only-nodes. this may not be problematic for very specific applications, but is problematic in the general case
  • the traditional clustering approach scales only reads, not writes, since each write has to be made to each node.
  • in traditional clustering the total effective size of your memory cache is the size of a single node (since the same data is cached on each node), whereas with segmentation it's the sum of the nodes.
  • in traditional clustering each node has the same hardware optimization pattern, whereas with segmentation, it can be customized according to the role it's playing.
  • the segmentation approach reduces the redundancy of the system, since theoretically a failure of any of the nodes takes your "database" off line. in practice, you may have segments that are non essential e.g. logging. you can, of course, cluster your segments, but this introduces the replication lag issue.
  • the segmentation approach relies on a thorough understanding of the application, and the relative projected load on each segment to do properly.
  • the segmentation approach is fundamentally very limited, since there are a limited number of segments for a typical application.

more thoughts on database segmentation

from one perspective, the use of memcache is a database segmentation technique, i.e. it takes part of the load on the database (from caching) and segments it into a specialized and optionally distributed caching "database". there is a detailed step-by-step guide on lullabot for doing this on debian etch, as well as a drupal module.

you can continue this approach in other areas of your database, dedicating several databases to different roles. for example, if one of the functions of your database is to serve as a log, why not segment all log activity onto a single database? clearly, it's important that your segments are distinct, i.e. that applications don't need joins or transactions between segments. you may have auxiliary applications that do need complex joins between segments, e.g. reporting. this can be easily solved by warehousing the data back into a single database that serves that auxiliary application specifically.

while i'm not suggesting that the next step in your scaling exercise should necessarily be segmentation (that clearly depends on your application and preferences), we're going to explore the idea anyway. it's my blog after all :)

what segmentation technologies to use?

there are several open source tools that you can use to build a segmentation infrastructure. sqlrelay is a popular database-agnostic proxying tool that can be used for this purpose. mysql proxy is, as the name suggests, a mysql specific proxying tool.

in this article i focus on mysql proxy. sqlrelay (partly due to its more general-purpose nature) is somewhat difficult to configure, and inherently less flexible than mysql proxy. mysql proxy on the other hand is quick to set up and use. it has a simple, elegant and flexible architecture that allows for a full range of proxying applications, from trivial to uber-complex.

more on mysql proxy

jan kneschke's brainchild, mysql proxy is a lightweight daemon that sits between your client application (apache/mod_php/drupal in our case) and the database. the proxy allows you to perform just about any transformation on the traffic, including segmentation. the proxy allows you to hook into 3 actions: connect, query and result. you can do whatever you want in these steps, manipulating data and performing actions using lua scripts. lua is a fully featured scripting language designed for high performance, clearly a key consideration in this application. don't worry too much about it being yet another scripting language. it's easy to pick up. it's powerful and intuitive.

even if you don't intend to segment your databases, you might consider a proxy configuration for other reasons including logging, filtering, redundancy, timing and analysis and query modification. for example, using mysql proxy to implement a hot standby database (replicated) would be trivial.

the mysql site states clearly (as of 09Nov2007); "MySQL Proxy is currently an Alpha release and should not be used within production environments". Feeling lucky?

a word of warning

the techniques described below, including the overall method and the use of mysql proxy, are intended to stimulate discussion. they are not intended to represent a valid production configuration. i've explored this technique purely in an experimental manner. in my example below i segment cache queries to a specific database. i don't mean to imply that this is a better alternative to memcache. it isn't. anyway, i'd love to hear your thoughts on the general approach.

don't panic, you don't really need this many servers

before you get yourself into a panic over the number of boxes i've drawn in the diagram, please bear in mind that this is a canonical network. in reality you could use the same physical hardware for both loadbalancers, or, even better, you could use xen to create this canonical layout and, over time, deploy virtual servers on physical hardware as load necessitated.

down to business - set up and test a basic mysql proxy

o.k., enough of the chatter. let's get down to business and setup a mysql proxy server. first, download and install the latest version of mysql proxy from http://dev.mysql.com/downloads/mysql-proxy/index.html.

tar xvfz mysql-proxy-0.6.0-linux-debian3.1-x86.tar.gz

make sure that your mysql load balancer can access the database on your data server i.e. on your data server, run mysql and enter:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER,
CREATE TEMPORARY TABLES, LOCK TABLES
ON drupaldb.*
TO drupal@'192.168.1.94' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;

check that your load balancer can access the database on your data server i.e. on your load balancer do:

# mysql -e "select * from users limit 1" --host=192.168.1.26 --user=drupal --password=password drupaldb

now do a quick test of the proxy, run the proxy server, pointing to your drupal database server:

./mysql-proxy --proxy-backend-addresses=192.168.1.26 &

and test the proxy:

echo "select * from users" |  mysql --host=127.0.0.1 --port=4040 --user=drupal --password=password drupaldb

now change your drupal install to point at the load balancer, rather than your data server directly i.e. edit your settings.php on your webserver(s) and point your drupal install to the mysql load balancer, rather than at your database server:

$db_url = 'mysql://drupal:password@192.168.1.94:4040/drupaldb';

asking mysql proxy to segment your database traffic

the best way to segment a drupal database depends on many factors, including the modules you use and the custom extensions that you have. it's beyond the scope of this exercise to discuss segmentation specifics but, as an example, i've segmented the database into 3 segments: a cache server, a log server and a general server (everything else).

to get started segmenting, create two additional database instances (drupal-data-server2, drupal-data-server3), each with a copy of the data from drupal-data-server1. make sure that you GRANT the mysql load balancer permission to access each database as described above.

you'll now want to start up your proxy server, pointing at these instances. below, i give an example of a bash script that does this. it starts up the cluster and executes several sql statements, each one bound for a different member of the cluster, to ensure that the whole cluster has started properly. note that you'd also want to build something similar as a health check, to ensure that the members keep functioning properly, stopping the cluster (proxy) as soon as a problem is detected.

here's the source for runProxy.sh:

#!/bin/bash
BASE_DIR=/home/john
BIN_DIR=${BASE_DIR}/mysql-proxy/sbin

# kill the server if it's running
pkill -f mysql-proxy

# make sure any old proxy instance is dead before firing up the new one
sleep 1

# run the proxy server in the background
${BIN_DIR}/mysql-proxy \
--proxy-backend-addresses=192.168.1.26:3306 \
--proxy-backend-addresses=192.168.1.27:3306 \
--proxy-backend-addresses=192.168.1.28:3306 \
--proxy-lua-script=${BASE_DIR}/databaseSegment.lua &

# give the server a chance to start
sleep 1

# prime the pumps!
# execute some sql statements to make sure that the proxy is running properly
# i.e. that it can establish a connection to the range of servers in question
# and bail if anything fails
for sqlStatement in \
   "select cid FROM cache limit 1" \
   "select nid FROM history limit 1" \
   "select name FROM variable limit 1"
do
   echo "testing query: ${sqlStatement}"
   echo ${sqlStatement} |  mysql --host=127.0.0.1 --port=4040 \
       --user=drupal --password=password drupaldb || { echo "${sqlStatement}: failed (is that server up?)"; exit 1; }
done

you'll notice that this script references databaseSegment.lua, a script that uses a little regex magic to map queries to servers. again, the actual queries being mapped serve as examples to illustrate the point, but you'll get the idea. jan has a nice r/w splitting example that can be easily modified to create databaseSegment.lua.

most of the complexity in jan's code is around load balancing (least connections) and connection pooling within the proxy itself. jan points out (and i agree) that this functionality should be made available in a generic load-balancing lua module. i really like the idea of having this in lua scripts to allow others to easily extend it, for example by adding a round robin alternative. keep an eye on his blog for developments. anyway, for now, let's modify his example and add some defines and a method to do the mapping:

local CACHE_SERVER = 1
local LOG_SERVER = 2
local GENERAL_SERVER = 3

-- select a server to use based on the query text, this will return one of
-- CACHE_SERVER, LOG_SERVER or GENERAL_SERVER
function choose_server(query_text)
   local cache_server_strings = { "FROM cache", "UPDATE cache",
                                  "INTO cache", "LOCK TABLES cache"}
   local log_server_strings =   { "FROM history", "UPDATE history",
                                  "INTO history" , "LOCK TABLES history",
                                  "FROM watchdog", "UPDATE watchdog",
                                  "INTO watchdog", "LOCK TABLES watchdog" }

   local server_table = { [CACHE_SERVER] = cache_server_strings,
                          [LOG_SERVER] = log_server_strings }

   -- default to the general server
   local server_to_use = GENERAL_SERVER

   -- find a server registered for this query_text in the server_table
   for i=1, #server_table do
      for j=1, #server_table[i] do
         if string.find(query_text, server_table[i][j])
         then
            server_to_use = i
            break
         end
      end
   end

   return server_to_use
end

and then call this in read_query:

-- pick a server to use
proxy.connection.backend_ndx = choose_server(query_text)
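for orientation, a stripped-down, untested read_query that does nothing but the routing might look like the sketch below; jan's example wraps this in connection pooling and load balancing, which i've omitted:

-- an untested sketch: route COM_QUERY packets using choose_server()
function read_query(packet)
   -- leave non-query packets (ping, quit etc.) on the default path
   if string.byte(packet) ~= proxy.COM_QUERY then
      return
   end

   local query_text = string.sub(packet, 2)

   -- pick a server to use
   proxy.connection.backend_ndx = choose_server(query_text)
end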

test your application

now test your application. a good way to see the queries hitting your database servers is to (temporarily) enable full logging on each of them and watch the log. edit /etc/mysql/my.cnf and set:

# Be aware that this log type is a performance killer.
log             = /var/log/mysql/mysql.log

and then:

# tail -f /var/log/mysql/mysql.log

further work

to develop this idea further:
  • someone with better drupal knowledge than me could define a good segmentation structure for a typical drupal application, with the query fragments associated with each segment.
  • additionally, the scripts could handle exceptional situations better e.g. a regular health check for the proxy.
  • clearly we've introduced another single-point-of-failure in the database load balancer. the earlier discussion of heartbeat applies here.
  • it would be wonderful to bypass all this nonsense and get drupal running on a mysql cluster. i'd love to hear if you've tried it and how it went.

tech blog

if you found this article useful, and you are interested in other articles on linux, drupal, scaling, performance and LAMP applications, consider subscribing to my technical blog.
Nov 11 2007
Nov 11

i got some good feedback on my dedicated data server step towards scaling. kris buytaert in his everything is a freaking dns problem blog points out that nfs creates an unnecessary choke point. he may very well have a point.

having said that, i have run the suggested configuration in a multi-web-server, high-traffic production setting for 6 months without a glitch, and feedback on his blog gives examples of other large sites doing the same thing. for even larger configurations, or if you just prefer, you might consider another method of synchronizing files between your web servers.

kris suggests rsync as a solution, and although luc stroobant points out the delete problem, i still think it's a good, simple solution. see the diagram above.

the delete problem is that you can't simply use the --delete flag on rsync, since in an x->y synchronization, a delete on node x looks just like an addition to node y.

i speculate that you can partly mitigate this issue with some careful scripting, using a source-of-truth file server to which you first pull only additions from the source nodes, and then do another run over the nodes with the delete flag (to remove any newly deleted files from your source-of-truth). unfortunately you can't do the delete run on a live site (due to timing problems if additions happen after your first pass and before your --delete pass), but you can do this as a regularly scheduled maintenance task when your directories are not in flux.

i include a bash script below to illustrate the point. i haven't tested this script, or the theory in general. so if you plan to use it, be careful.

you could call this script from cron on your data server. you could do this, say, every 5 minutes for a smallish deployment. even though this causes a 5 minute delay in file propagation, the use of sticky sessions ensures that users will see files that they create immediately, even if there is a slight delay for others. additionally, you could schedule it with the -d flag during system downtime.
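for example, a crontab entry along these lines would run the synchronization every 5 minutes (the path to the script is just an assumption; use wherever you installed it):

# hypothetical crontab entry on the data server: sync files every 5 minutes
*/5 * * * * /usr/local/bin/synchronizeFiles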

the viability of this approach depends on many factors including how quickly an uploaded file must be available for everyone and how many files you have to synchronize. this clearly depends on your application.

synchronizeFiles -- a bash script to keep your drupal web server's files directory synchronized

#!/bin/bash

# synchronizeFiles -- a bash script to keep your drupal web server's files directory
#                     synchronized - http://www.johnandcailin.com

# bail if anything fails
set -e

# don't synchronize deletes by default
syncDeletes=false

sourceServers="192.168.1.24 192.168.1.25"
sourceDir="/var/www/drupal/files"
sourceUser="www-data"
targetDir="/var/drupalFiles"

# function to print a usage message and bail
usageAndBail()
{
   echo "Usage syncronizeFiles [OPTION]"
   echo "     -d       synchronize deletes too (ONLY use when directory contents are static)"
   exit 1;
}

# process command line args
while getopts hd o
do     case "$o" in
        d)     syncDeletes=true;;
        h)     usageAndBail;;
        [?])   usageAndBail;;
       esac
done

# do an initial addition-only synchronization run from sourceServers to targetServer
for sourceServer in ${sourceServers}
do
   echo "bi directionally syncing files between ${sourceServer} and local"

   # pull any new files to the target
   rsync -a ${sourceUser}@${sourceServer}:${sourceDir}/ ${targetDir}

   # push any new files back to the source
   rsync -a ${targetDir}/ ${sourceUser}@${sourceServer}:${sourceDir}
done

# synchronize deletes (only use if directory contents are static)
if test ${syncDeletes} = "true"
then
   for sourceServer in ${sourceServers}
   do
      echo "DELETE syncing files from ${sourceServer} to ${targetDir}"

      # pull any new files to the target, deleting from the source of truth if necessary
      rsync -a --delete ${sourceUser}@${sourceServer}:${sourceDir}/ ${targetDir}
   done
fi

Oct 29 2007
Oct 29
the authors of drupal have paid considerable attention to performance and scalability. consequently, even a default install running on modest hardware can easily handle the demands of a small website. my four-year-old pc in my garage, running a full lamp install, will happily serve up 50,000 page views in a day, providing solid end-user performance without breaking a sweat.

when the time comes for scalability: moving out of the garage

if you are lucky, eventually the time comes when you need to service more users than your system can handle. your initial steps should clearly focus on getting the most out of the built-in drupal optimization functionality: considering drupal performance modules, optimizing your php (including op-code caching) and working on database performance. John VanDyk and Matt Westgate have an excellent chapter on this subject in their new book, "pro drupal development".

once these steps are exhausted, inevitably you'll start looking at your hardware and network deployment.

a well designed deployment will not only increase your scalability, but will also enhance your redundancy by removing single points of failure. implemented properly, an unmodified drupal install can run on this new deployment, blissfully unaware of the clustering, routing and caching going on behind the scenes.

incremental steps towards scalability

in this article, i outline a step-by-step process for incrementally scaling your deployment, from a simple single-node drupal install running all components of the system, all the way to a load balanced, multi node system with database level optimization and clustering.

since you almost certainly don't want to jump straight from your single node system to the mother of all redundant clustered systems in one step, i've broken this down into 5 incremental steps, each one building on the last. each step along the way is a perfectly viable deployment.

tasty recipes

i give full step-by-step recipes for each deployment that, with a decent working knowledge of linux, should allow you to get a working system up and running. my examples are for apache2, mysql5 and drupal5 on debian etch, but may still be useful for other versions / flavors.

note that these aren't battle-hardened production configurations, but rather illustrative minimal configurations that you can take and iterate to serve your specific needs.

the 5 deployment configurations

the table below outlines the properties of each of the suggested configurations:

                                    step 0   step 1   step 2   step 3   step 4   step 5
separate web and db                 no       yes      yes      yes      yes      yes
clustered web tier                  no       no       yes      yes      yes      yes
redundant load balancer             no       no       no       yes      yes      yes
db optimization and segmentation    no       no       no       no       yes      yes
clustered db                        no       no       no       no       no       yes
scalability                         poor-    poor     fair     fair     good     great
redundancy                          poor-    poor-    fair     good     fair     great
setup ease                          great    good     good     fair     poor     poor-

in step 0, i outline how to install drupal, mysql and apache to get a basic drupal install up and running on a single node. i also go over some of the basic configuration steps that you'll probably want to follow, including cron scheduling, enabling clean urls, setting up a virtual host etc.
in step 1, i go over a good first step to scaling drupal; creating a dedicated data server. by "dedicated data server" i mean a server that hosts both the database and a fileshare for node attachments etc. this splits the database server load from the web server, and lays the groundwork for a clustered web server deployment.
in step 2, i go over how to cluster your web servers. drupal generates a considerable load on the web server and can quickly become resource constrained there. having multiple web servers also increases the redundancy of your deployment.
in step 3, i discuss clustering your load balancer. one way to do this is to use heartbeat to provide instant failover to a redundant load balancer should your primary fail. while the method suggested below doesn't increase load balancer scalability, which shouldn't be an issue for a reasonably sized deployment, it does increase your redundancy.

in this article i discuss scaling the database tier up and out. i compare database optimization and different database clustering techniques. i go on to explore the idea of database segmentation as a possibility for moderate drupal scaling.


the holy grail of drupal database scaling might very well be a drupal deployment on mysql cluster. if you've tried this, plan to try this or have opinions on the feasibility of an ndb "port" of drupal, i'd love to hear it.

Oct 21 2007
Oct 21

if you've set up your drupal deployment with a separate database and web (drupal) server (see scaling drupal step one - a dedicated data server), a good next step is to cluster your web servers. drupal generates a considerable load on the web server and can quickly become resource constrained there. having multiple web servers also increases the redundancy of your deployment. as usual, my examples are for apache2, mysql5 and drupal5 on debian etch. see the scalability overview for related articles.

one way to do this is to use a dedicated web server running apache2 and mod_proxy / mod_proxy_balancer to load balance your drupal servers.

deployment overview

this table summarizes the characteristics of this deployment choice:

scalability: fair
redundancy: fair
ease of setup: fair

servers

in this example, i use:

web server      drupal-lb1.mydomain.com            192.168.1.24
web server      drupal-lb2.mydomain.com            192.168.1.25
data server     drupal-data-server1.mydomain.com   192.168.1.26
load balancer   apache-balance-1.mydomain.com      192.168.1.34

network diagram


load balancer setup: install and enable apache and proxy_balancer

create a dedicated server for load balancing. install apache2 (apt-get install apache2) and then enable mod_proxy_balancer and mod_proxy_http (along with their dependencies):

# a2enmod proxy_balancer
# a2enmod proxy_http

enable mod_proxy in mods-available/proxy.conf. note that i'm leaving ProxyRequests off since we're only using the ProxyPass and ProxyPassReverse directives. this keeps the server secure from spammers trying to use your proxy to send email.

<IfModule mod_proxy.c>
        # set ProxyRequests off since we're only using the ProxyPass and ProxyPassReverse
        # directives. this keeps the server secure from
        # spammers trying to use your proxy to send email.

        ProxyRequests Off

        <Proxy *>
                AddDefaultCharset off
                Order deny,allow
                Allow from all
                #Allow from .example.com
        </Proxy>

        # Enable/disable the handling of HTTP/1.1 "Via:" headers.
        # ("Full" adds the server version; "Block" removes all outgoing Via: headers)
        # Set to one of: Off | On | Full | Block

        ProxyVia On
</IfModule>

configure mod_proxy and mod_proxy_balancer

mod_proxy and mod_proxy_balancer serve as a very functional load balancer. however, mod_proxy_balancer makes slightly unfortunate assumptions about the format of the cookie that you'll use for sticky session handling. one way to work around this is to create your own session cookie (very easy with apache). the examples below describe how to do this.

first create a virtual host or use the default (/etc/apache2/sites-available/default) and add this configuration to it:

<Location /balancer-manager>
SetHandler balancer-manager

Order Deny,Allow
Deny from all
Allow from 192.168
</Location>

<Proxy balancer://mycluster>
  # cluster member 1
  BalancerMember http://drupal-lb1.mydomain.com:80 route=lb1

  # cluster member 2
  BalancerMember http://drupal-lb2.mydomain.com:80 route=lb2
</Proxy>

ProxyPass /balancer-manager !
ProxyPass / balancer://mycluster/ lbmethod=byrequests stickysession=BALANCEID
ProxyPassReverse / http://drupal-lb1.mydomain.com/
ProxyPassReverse / http://drupal-lb2.mydomain.com/

note:
  • i'm allowing access to the balancer manager (the web UI) from any IP matching 192.168.*.*
  • i'm load balancing between 2 servers (drupal-lb1.mydomain.com, drupal-lb2.mydomain.com) on port 80
  • i'm defining two routes for these servers called lb1 and lb2
  • i'm excluding (!) the balancer-manager directory from the ProxyPass to allow access to the manager ui on the load balancing server
  • i'm expecting a cookie called BALANCEID to be available to manage sticky sessions
  • this is a simplistic load balancing configuration. apache has many options to control timeouts, server loading, failover etc., too much to cover here, but you can read more in the apache documentation

configure the web (drupal) servers to write a session cookie

on each of the web (drupal) servers, add this code to your vhost configuration:

RewriteEngine On
RewriteRule .* - [CO=BALANCEID:balancer.lb1:.mydomain.com]

making sure to specify the correct route e.g. lb1 on drupal-lb1.mydomain.com etc.
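for example, the corresponding rule on drupal-lb2.mydomain.com would be:

RewriteRule .* - [CO=BALANCEID:balancer.lb2:.mydomain.com]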

you also probably want to set up your cookie domain properly in drupal, i.e. modify drupal/sites/default/settings.php as follows:

# $cookie_domain = 'example.com';
$cookie_domain = 'mydomain.com';

important urls

useful urls for testing are the site as served through the load balancer (http://apache-balance-1.mydomain.com/) and the balancer manager (http://apache-balance-1.mydomain.com/balancer-manager).

the balancer manager

the mod_proxy_balancer ui enables point-and-click update of balancer members.

the balancer manager allows you to dynamically change the balance factor of a particular member, change its route or put it into offline mode.

debugging

to debug your configuration it's useful to turn up apache's debugging level on your apache load balancer by adding this to your vhost configuration:

LogLevel debug

this will produce some very useful debugging output (/var/log/apache2/error.log) from the proxying and balancing code.

firefox's cookie viewer (tools->options->privacy->show cookies) is also useful to view and manipulate your cookies.

if you plan to experiment with bringing servers up and down to test them being added to and removed from the cluster, you should consider setting the "connection pool worker retry timeout" to a value lower than the default 60s. you could set it to e.g. 10s by changing your configuration to the one below. a 10s timeout allows for quicker test cycles.

BalancerMember http://drupal-lb1.mydomain.com:80 route=lb1 retry=10
BalancerMember http://drupal-lb2.mydomain.com:80 route=lb2 retry=10

next steps

one single-point-of-failure in this deployment is the apache load balancer. consider clustering your load balancer with scaling drupal step three - using heartbeat to implement a redundant load balancer

Oct 13 2007
Oct 13

if you've already installed drupal on a single node (see easy-peasy-lemon-squeezy drupal installation on linux), a good first step to scaling a drupal install is to create a dedicated data server. by dedicated data server i mean a server that hosts both the database and a fileshare for node attachments etc. this splits the database server load from the web server, and lays the groundwork for a clustered web server deployment. here's how you can do it. as usual, my examples are for apache2, mysql5 and drupal5 on debian etch. see the scalability overview for related articles.

deployment overview

this table summarizes the characteristics of this deployment choice:

scalability: poor
redundancy: poor
ease of setup: good

servers

in this example, i use:

web server    drupal-lb1.mydomain.com            192.168.1.24
data server   drupal-data-server1.mydomain.com   192.168.1.26

update

the recipe below uses nfs; you might want to consider using rsync as an alternative. see the discussion in step one B -- john, 11 nov 2007

if you plan to run with a single web server for a while, you can skip the nfs / rsync malarkey until step 2 -- john, 11 nov 2007

data server: setup mysql and prepare it for remote access

install mysql. i use mysql5. you'll need to enable this for remote access. edit /etc/mysql/my.cnf and change the bind address to your local server address e.g.

# bind-address          = 127.0.0.1
bind-address            = 192.168.1.26

now allow access to your database from your web (drupal) servers. run mysql and do:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER,
CREATE TEMPORARY TABLES, LOCK TABLES
ON drupaldb.*
TO drupal@'drupal-lb1.mydomain.com' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;

restart mysql:

# /etc/init.d/mysql restart

data server: setup a shared nfs partition

moving drupal's file data (node attachments etc) to an nfs server does two things. it allows you to manage all your important data on a single server, simplifying backups etc. it also paves the way for web server clustering, where clearly it doesn't make sense to write these files onto random web servers in the cluster.

install the nfs server:

apt-get install nfs-kernel-server

make a directory to share:

# mkdir -p /var/drupalSharedFiles/files
# chown www-data.www-data /var/drupalSharedFiles/files

share this directory using nfs by adding this line to /etc/exports:

/var/drupalSharedFiles 192.168.1.1/24(rw,sync)

now share it:

# exportfs -a

web (drupal) server: install drupal and point it to your data server

install drupal on your web server (see the drupal installation recipe for specifics). make sure that it can connect to your database server. you can verify database connectivity using:

# mysql -e "select * from users limit 1" --user=drupal --host=drupal-data-server1.mydomain.com --password=password drupaldb

if this doesn't work, go back to the instructions above and make sure you did your binding and granting properly.

now edit your sites/default/settings.php to point it to your new data server e.g.

$db_url = 'mysql://drupal:password@drupal-data-server1.mydomain.com/drupaldb';

hit your drupal site and make sure that it sees the new database.

next, mount the area for shared files, add this to the /etc/fstab file:

192.168.1.26:/var/drupalSharedFiles /var/drupalSharedFiles nfs rw,hard,intr,async,users,noatime 0 0

make sure that you've got the portmapper installed on the client; you'll need this to do a performant nfs mount. if you don't have it:

# apt-get install portmap

and do a:

# mount -a

next, change the local files directory to point to your remote one, cd to your drupal directory (probably /var/www/drupal) and:

# cp -a files/.htaccess /var/drupalSharedFiles/files
# rm files/.htaccess ; rmdir files
# ln -s /var/drupalSharedFiles/files/ .

it's a good idea to verify that drupal is happy with this new arrangement by visiting the status report, e.g. by hitting http://drupal-lb1.mydomain.com/drupal/?q=admin/logs/status and making sure that it sees your nfs area as writable. you can also just upload an attachment and see what happens.

you should be all set.

next steps

got all that working? want more scalability and redundancy? consider clustering your drupal servers with step two - sticky load balancing with apache mod proxy

