Mar 07 2019

In our last meetup here on Long Island, we reviewed Preston So's recent book "Decoupled Drupal in Practice". We had the opportunity to record the meetup, and figured it can't hurt to post it here!

We also demonstrate how to set up a super simple REST server and connect it to a React component.

[embedded content]

Mar 20 2018

MidCamp Group

Wow, it's been a while for us here on the Sego blog. Man, we have been busy and having a bunch of fun building value for our partners and clients... we just really need to write about it more.

We just got back from Chicago. We were honored to be invited to deliver the module development training at the camp. We had a blast doing it and want to publicly give a special thank you to all the organizers and folks that came out for the training and the camp overall. Chicago is an awesome town and we hope to be back to connect with the community out there.

All the sessions are available here, thanks to the awesome @kevinjthull

Oct 10 2016

Software is an ever-changing interweaving of collections of ideas expressed in code to solve various problems. In today's day and age, the problems that software is solving are expanding at an ever-increasing rate. The phrase “software is eating the world” is more relevant every day.

Although software is being deployed in every area of our daily lives, the process by which teams develop software is extremely inconsistent. Most of this may be a function of the maturing processes in a relatively new field. We are starting to see patterns emerge in the area of “devops” and better managed development workflows. At the core of a lot of these topics is how we test our software.

As developers we are extremely interested in building, learning, and automating. A smaller (but growing) collection of us gets excited about testing. Personally, I have struggled for years in various environments to pin down a good testing strategy. That is still a learning curve I am on… and most likely will never be off of. It is what inspired me to write this post.

 

So what is software testing?

If you ask an end user, and many developers, what they think software testing is, they would most likely say “Well, click (or tap) around the application and see what breaks.” This isn't necessarily incorrect; it is simply one level of a deeper topic. Therefore, in order to define what software testing is, we first need to define what levels of software testing exist. So let's break this down from the bottom level (the single lines of code) to the top level (the system as a whole):

  1. Unit testing

  2. Integration testing (sometimes referred to as functional tests)

  3. System testing

  4. Acceptance testing

Each of these levels builds upon the previous to provide for a consistent and comprehensive testing environment for a given project.

How much testing do we need?

Does your organization or project require a comprehensive suite of tests in each level? This depends on many factors. Purists will most likely say YES but the realist in me wants to say most likely not.

Organizations and the software developers you choose to work with should have honest conversations around the requirements of a feature, what should be tested, and how it will be tested. From these conversations we can map out a testing plan that will serve as a communication tool for both project stakeholders and developers. Ideally the testing plan would integrate into an overall CI/CD system to provide an organic view of the project's state. For smaller projects a simple Google doc will suffice.

 

So how do we actually test?

Armed with functional requirements and a solid understanding of what is important to test within the scope of these requirements, we can start to write our tests. For web applications, we have a slew of tools to choose from. For unit testing in the PHP world our go-to framework should be PHPUnit. For functional tests we have a few options in the Drupal CMS. These tests are developed using a framework that allows us to mock up a full application in an isolated environment in which we can run tests.
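
To make that concrete, here is a minimal sketch of a PHPUnit unit test. The PriceCalculator class and its test are invented for illustration; in Drupal 8, unit tests typically extend the core UnitTestCase base class:

  use Drupal\Tests\UnitTestCase;

  // Hypothetical class under test: the smallest "unit" we want to verify.
  class PriceCalculator {
    public function total(array $prices) {
      return array_sum($prices);
    }
  }

  // A unit test exercises one class in isolation: no database, no browser.
  class PriceCalculatorTest extends UnitTestCase {
    public function testTotalSumsAllPrices() {
      $calculator = new PriceCalculator();
      $this->assertSame(6.0, $calculator->total(array(1.0, 2.0, 3.0)));
    }
  }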

This allows us to run the entire suite of tests before we merge in a feature branch. This ensures that the new feature works and that it is not breaking any other existing functionality. The last piece is critical. I’ve worked on some very large software projects that did not test for regressions. Each change we made caused a ripple of frustrations through our user base. No bueno! With a well developed test suite and clear communication we should be able to mitigate these risks!

In coming posts I would like to explore some of the base classes we have available to test our solutions in the context of Drupal. This will hopefully give you a more concrete understanding of how we can take a test plan and translate it into executable tests.

What level of testing do you do with your clients / projects? I'd love to discuss in the comments! 

Jun 29 2016

With the new configuration management system in Drupal 8 core we now have a powerful system to manage site configuration between our different environments. This system does not declare a strict workflow for how to use configuration. In this post I’d like to explore some workflows we have been experimenting with.

First let's define what site configuration is.

There are two types of configuration in a Drupal site. First, simple configuration which stores basic key/value pairs such as integers, strings, and boolean values. Second, configuration entities, which we explored in this post. Basically configuration entities are more complex structures that define the architecture of our site. This would be things such as content types, views, and image styles.

Both types of configuration are exportable.
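
As a quick illustration, here is how simple configuration can be read and changed from code in Drupal 8 (using the core system.site settings; the new site name is just an example):

  // Read a simple configuration value: the site name.
  $site_name = \Drupal::config('system.site')->get('name');

  // Write to the active configuration (stored in the database by default).
  \Drupal::configFactory()
    ->getEditable('system.site')
    ->set('name', 'My renamed site')
    ->save();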

Where does this configuration live?

Site configuration lives in two places depending on its stage in life (no pun intended). There are two stages in a piece of configuration's life:

  1. Staging

  2. Active

Active configuration lives in the database by default. It is the configuration that is currently being used by the site. Staged configuration lives on the file system by default.

When you make changes to the site within the web interface, for example changing the site name, you are changing the active configuration. This active configuration can then be exported and staged on another instance of the same site. The last piece is key:

Configuration is only to be moved between instances of the same site.

For example, I change the site name on our DEV environment and want to move this change to our TEST environment for testing.

Ok, let's talk workflows

There are a few ways we can manage site configuration. Site configuration, like our code, should follow a “flow up” model. That is, configuration will be committed to our git repo and moved up into higher order environments:

LOCAL → DEV → TEST → PROD

In this workflow configuration changes will be made as far down the chain as possible. In this case on a developer's local environment. These configuration changes should be exported and committed to their own branch. This branch can then be pushed for review.

Once pushed these configuration changes will be “staged” on the DEV environment. A site administrator will need to bring these staged changes into the “active” configuration. There are two ways we do that today:

  1. Through the site's UI under ‘admin/config/development/configuration’

  2. Using the ‘drush cim’ command to import staged configuration into active.

From what we are seeing, this seems to be the de facto workflow at this point.

Further workflow Ideas

I am thinking some interesting workflows could emerge here. One idea is to have a branch (or tag) that triggers configuration imports using drush and fires off some automated tests. If everything passes, then merge into another designated branch for movement to a live environment.

Another idea is to use some of the emerging contrib modules to manage different “configuration states”. I believe this was discussed in the config_tools project's issue queue. Using this idea we can tag known “good” configurations and move between different “configuration states” of a given site. I am thinking we could even have the configuration hosted in a separate repo from our code, if that makes any sense.

Bottom Line  

The new configuration management system offers a powerful tool to move changes on our sites between different instances. It does not however define an exact workflow to follow. I look forward to talking to folks on how they leverage this system to coordinate the efforts of high performing teams!

If you or your team is using this in different ways we would love to hear about it!

Apr 11 2016

This past weekend we were honored to co-host the Drupal Global Training Day at DoSomething.org in NYC. This training was focused on Drupal 8 module development. We have been training on the ins and outs of Drupal 8 module development for over a year now, but this time we changed the format considerably. I think for the better!

Using the Role Notices module, developed by Ted Bowman, we put together an exercise that walks you through building the functionality it exposes step by step. We also built a list of resources chock full of links pertaining to various tools and docs for getting your chops up with D8 development. 

All this work is open source and available at this link. I am really hoping that this content can serve as a valuable resource for folks looking to learn the proper flow of developing a Drupal 8 module.

There are so many exciting concepts and programming patterns to explore in D8, and we hope you continue to join us on this journey.

Mega thanks to everyone that helped make this happen! 

Mar 28 2016

When developing applications for Drupal the need to interact with what we call Entities arises quickly in our development cycle. The Entity system in Drupal has evolved greatly over the last couple of versions. Drupal 8 now contains a full set of APIs to define, manipulate, and manage entities of various types.

One aspect of Drupal 8 that left me a bit lost was the two different base Entity types, that is: Configuration Entities and Content Entities.

In this post, I’d like to talk a bit about the key differences between these two types of entities and discuss some of the potential use cases for each. I want to keep this as succinct as possible while still being able to clearly explain the two. Let me know how I did in the comments!

What is a Configuration Entity?

Generally speaking a configuration entity is something that a site administrator or site builder would be creating. There are many types of configuration entities that Drupal 8 implements in core. Some examples include Views, Image Styles, Vocabularies, Display settings, etc.

The idea of a configuration entity enables us to integrate the powerful new configuration system in D8 with the full-featured Entity API. Combine this with the fact that we can export all of these configuration entities to YAML files and we have a very powerful solution for our development operations.

Configuration Entity information is stored as YAML (as of writing, active configuration is stored in the database but is exportable as YAML).

What is a Content Entity?

Content Entities are pieces of content that are generally created by the content creators of our solutions. These entities serve as the underpinnings of the structured content within our application. Content entities come in various different types in core Drupal 8. This includes nodes, users, block types, comments, etc.

Content entities allow us to create entities that are tightly integrated with the Field API and the same Entity API configuration entities derive from. This allows content entities to be fieldable by site builders in order to define the structure of our content while still exposing a consistent set of APIs for developers to work with.

Content Entity information is stored in the database.
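
To make the distinction tangible, here is a small sketch of loading one of each from code (the entity IDs are illustrative):

  // A configuration entity: the "large" image style, exportable as YAML.
  $style = \Drupal\image\Entity\ImageStyle::load('large');

  // A content entity: a single node, with its field values stored in the database.
  $node = \Drupal\node\Entity\Node::load(1);
  $title = $node->getTitle();
  $body_text = $node->get('body')->value;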

To Summarize

With the above being said we can summarize the core differences between these base Entities into two core areas:

  1. Where the information is stored.
    1. For Configuration Entities data is stored as YAML.
    2. For Content Entities data is stored in tables / fields in our database.
  2. Who is creating the information
    1. For Configuration Entities site builder / site administrators.
    2. For Content Entities content creators / end users.

My motivation to write this was to explore and better understand the differences between these two core types of Entities, BUT I think it may be just as important (if not more so) to understand the similarities. Both are now under the scope of the same unified Entity API. That is important. They provide integration points from the Entity API to two other important APIs, the Configuration API and the Field API respectively.

As we continue to work with these systems I look forward to learning more about how to leverage these systems to develop solutions for our clients!

Mar 04 2016

In the world of web development we developers are presented with a palette of options when tasked with developing a solution. We have a wide array of server-side programming languages, their respective frameworks, various DBMS systems, and a growing choice of client-side presentation frameworks. Choosing the best software stack for a given requirement can be overwhelming.

Over the past few years we have been interested in the ability of the Drupal CMS to be used to build web applications and function more as a framework than as a predefined CMS platform. We spent a long time working with the inner workings of Drupal 6, Drupal 7, and now Drupal 8. We have successfully implemented various systems using this framework that have real data models, users, and requirements. Over a few forthcoming articles I would like to present some of the modules, tools, and techniques we have discovered along our journey.

One of the more interesting discoveries was the ability to set up true relational data models using the field system. The skill set required to build these apps is distributed in a slightly different way than in the more “traditional” popular frameworks. It is almost reminiscent of when you developed small (non-shared) DB apps with Access. Fields and their specs are defined using the web interface, and as such field creation can be performed by less skilled but trained staff. At the same time these fields can be consumed by developers to build specific application functions. I believe this is a powerful proposition for development shops and can prove useful for servicing customers of these apps.

In future posts I would like to delve into the specifics of these methods and share with the community how powerful a framework the Drupal system can be. Drupal 8 represents a very positive shift in how we approach our solutions. These are truly exciting times!

Dec 09 2015

We are excited to announce the release of the Exploring Drupal 8 podcast series!

We partnered with the fine folks at TalkingDrupal.com to put together a six-episode series to get you up to speed with Drupal 8 fast. Our goal was to provide some insight into the major changes we are all seeing with Drupal 8 in a quick, easily consumable format and hopefully inspire listeners to explore further.

We have released the series in a Netflix-style binge format. So get together, snuggle up in a warm blanket, and Drupal 8 and chill.

http://www.talkingdrupal.com/exploring-drupal-8

In addition we will be hosting two webinars after the holidays to answer questions in a more interactive format. Register via the link above. We hope to see you there!

Nov 04 2015

With Drupal 8 in RC2 we believe it is time to get your engines humming by learning all the ins and outs of the new version of our favorite content management system! Towards this goal we have put together a few programs to get you and your team up to speed fast. We have been hitting the road over the past few months getting the word out and providing Drupal training wherever we can.

Our journey started up north at Drupal North. We had the opportunity to give a few talks around Drupal 8 site building and Drupal 8 module development. Turnout was excellent and the camp was, as expected, filled with awesome folks. It was an excellent time... Toronto rocks!

This led up to NYCCamp, which was held at the United Nations. We had the opportunity to give the half-day Drupal 8 module development training. Again, we had a great turnout and were able to get everyone comfortable coding quickly! We leveraged our Stack Starter platform to distribute full development environments and we were super excited that it worked smoothly! We received a bunch of great feedback and were happy to be able to implement much of it very quickly.

Our most recent stop on our Drupal 8 training tour was up in Providence, RI at NEDCamp. The folks in Rhode Island can put together a camp! We had the opportunity to give a talk on Drupal 8 site building and provide a full-day training on module development. We took the things we learned at previous camps and built on them for NEDCamp. We think the outcome was a positive one!

We are looking forward to continuing to give our training at various camps. Through our training efforts and the development of TryDrupal8.com we are proud to say we have spun up thousands of Drupal 8 instances over the betas, and we believe the future is only going to get brighter!

Stay tuned for some more exciting training news in the near future! If you or your team is looking for a quick and cost-effective way to get up to speed, give us a shout!

Oct 02 2015

For those of you who know us, you know we have been excited about Drupal 8 for a very long time. I can recall touting, at one of our user group meetups over two years ago now, how D8 was going to be a major step forward for us in the Drupal community. With ZERO critical issues left and Dries announcing at DrupalCon Barcelona that we can expect an RC (release candidate) within the next few days... the time is near!

If you have not started to get your company up to speed with the developments in D8, now is the perfect time! Whether you are a site builder, developer, or stakeholder, you cannot ignore the exciting developments. This release will truly usher in a new age in Drupal, enabling us to take advantage of and integrate with a whole world of open source technologies that have traditionally been challenging to leverage within the context of Drupal. At Sego, although we have a sweet spot for Drupal, we strongly believe in choosing the best tool for the job. Drupal 8 will make it much easier for us to develop clean, best-of-breed solutions for our clients leveraging a wide spectrum of open source technologies. That gets us buzzing!!

We are up in Rhode Island next week for NEDCamp. We will be doing the D8 module development training. If you are in the area, it's a great value to get up to speed with some of the development changes we can expect.

Oct 25 2013

The problem

Search is a hard problem, it really is. Let me show this by using the example of a food chain that wants to use Drupal as the homepage for the whole chain. Of course, like many other organisations, they are not only running Drupal but also a subset of other web frameworks and open and closed source systems.

Now, they wanted Drupal to become the front page of all that content. But would you want to migrate all of this content into Drupal just so it becomes navigable, so the Drupal site could redirect visitors to the right site? Since we do not want our Drupal site, which will become the front page/portal for all this content, to reach out directly to all these other systems (that would be impossible to scale and maintain), we are thinking of using Apache Solr as our search index to serve all this different content to our system.

Different sites that expose their content to a Drupal Frontend

A possible solution

One of the possible solutions is to convert all non-Drupal sites to Drupal sites and use the Apache Solr Multisitesearch module. Indexing all your content from all Drupal sites into the same Apache Solr index allows you to search across all of them with a single query to the Apache Solr index. Since we know we are using Drupal and the module version is the same, we know that we are capable of searching all this content in a similar way. We call this the Drupal way, since all knowledge of how the mapping between the Drupal fields and the Solr index works is inside the Drupal module and is a "Drupalism".

Two snippets from a multisite setup

What if we want to add content from non-Drupal sites?

That's a very good question. As an organisation that maintains multiple websites with different technologies, you are not always able to switch all of them to the same platform or keep everything updated in a similar fashion.

Luckily the Apache Solr Search module already did the groundwork to make this possible. The way the module works, it is completely independent from the Drupal nodes on the website it is running on. It will show you whatever is in the Apache Solr index as long as it follows some very basic structure:

  • id
  • site
  • content

The Apache Solr Search module does not need much more than this information to show data from other sources. But now we encounter another problem. We want to see more contextual information than just a title and a snippet. We want to use all possible data, and we can describe that as a mapping problem, because the module translates field names to Solr field names and this can differ wildly from site to site. Even if you only use Drupal sites!
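
To make the mapping problem concrete, here is a rough sketch of how a Drupal field typically ends up as a dynamically named Solr field during indexing. The hook name comes from the Apache Solr Search Integration module's Drupal 7 API; the field names are invented:

  function mymodule_apachesolr_index_document_build(ApacheSolrDocument $document, $entity, $entity_type, $env_id) {
    // Each Drupal field value is copied into a dynamically named Solr field;
    // the prefix encodes the Solr type (e.g. dm_* for dates, ss_* for single
    // strings, ts_* for tokenized text). This per-module, per-site naming is
    // exactly where the divergence between sites creeps in.
    $document->addField('ss_field_location', $entity->field_location[LANGUAGE_NONE][0]['value']);
    $document->addField('ts_field_description', $entity->field_description[LANGUAGE_NONE][0]['value']);
  }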

Note: not all these fields are mapped the same way they would be mapped for real; I invented a quick mapping, but the concept should be clear. The Solr fields that are used in the index are shown in bold.

An Example

Site 1

  • Content type : Event
  • Field : field_startdate -> dm_field_startdate
  • field : field_enddate -> dm_field_enddate
  • field : field_location -> ss_field_location
  • field : field_description -> ts_field_description

Site 2

  • Content type : Event
  • Field : field_start -> dm_field_start
  • field : field_end -> dm_field_end
  • field : field_location_event -> ss_field_location_event
  • field : field_body -> ts_field_body

Both sites run the same Drupal version and have the same module version of Apache Solr Search installed, yet they are not able to share this information across systems, as we lack a way of understanding what each field is. We are essentially missing the semantic data that describes each field in a generic way.

To make it even more complex, I'll add a third site with Events that is non-Drupal; say it's WordPress and we prefix everything with WP because of some decision in the system. I do not claim that WP does any of this, it's for the sake of the argument ;-)

Site 3

  • Content Structure : Event
  • Field : wp_startdate -> dm_wp_startdate
  • field : wp_enddate -> dm_wp_enddate
  • field : wp_location_event -> ss_wp_location_event
  • field : wp_description -> ts_wp_description

Even if we have a way to index this content in Solr via WordPress, we do not have a shared agreement on how to name these Solr fields, so we miss out on tons of features, such as sharing facets to filter all of our content in a consistent way, and we are not able to show consistent rich snippets.

Rich snippets are a way to show metadata in search results. Google has more info about them

Another possible solution using Semantic Data

Describe all of our structured content with semantic data such as schema.org and RDFa so that we can translate this into common field names and, in effect, into common Solr field names! By using the CURIE standard and the rdfx module (which requires the ARC2 library) to translate CURIE URIs to full URIs, we now have an easy, common way of figuring out how our fields should be named.

RDF mappings for Site 1, Site 2, and Site 3

Note: We include the schema.org url as we can have different sources for RDF metadata and these can mean different things.

By defining our structure in a standard way we now have uniformity in our Solr index field mappings and can do really cool things with this.

Two snippets from a multisite setup with Rich Snippets and RDF metadata

What can you do already?

All of this is already possible if you install the rich_snippets module and use branch 1815744; I committed the test code to index and register facets based on these RDF mappings. An example site can be seen here: http://rdfsolr13test.devcloud.acquia-sites.com

Future possibilities

By having a crawler that implements this exact same mapping schema, we can show external content from sites we don't maintain in our Drupal site. By imitating a content type in Drupal and defining the mappings for those fields, we are able to "identify" similar content and show it in the search results. This opens a lot of possibilities for bridging the gap between Drupal and non-Drupal sites without requiring a massive migration. It also places Drupal in a position that is suited for rapid development, as content can quickly be made available through the means of a shared Solr index.

There is still some work to do to make it easier, but I think we are already halfway there by defining what we want and by defining a standard for how we map RDF properties to Solr field names. I've been thinking of making a very easy library for this so it can be a common good and escape the Drupal island.

Thoughts? Ideas? Excited? Please leave a comment :-)

Mar 14 2013

We had another Drupal Search and Solr office hours. This time mkalkbrenner came to us with a list of issues that he really wanted to see resolved, as he had already applied them in production.

His use case was a (pay attention!) Drupal site with the following features

  • Multilanguage (node translation, entity translation. Of course not the both in 1 site. He had two sites, each one with a different translation method)
  • Cross Site Search (1 or more Drupal sites, could be Drupal 6 or 7, that have the same solr index)
  • Multi Solr indexes per Site, also Cross Site Search enabled
  • Support for Dev/Staging/Production stages by means of dragging and dropping the database (so everything needs to be exportable)

The issues he had to overcome with the Apache Solr Search Integration module in combination with the Apache Solr Multilingual module are listed here. You can see it was quite a lot.

I'm happy to say that after an hour of work we solved some of these issues, but we went even further and spent another 4 hours together to fix all of the above issues. Now Markus does not need to patch the module anymore to achieve what he wants. I highly recommend taking a look at the multilingual Solr module, as it has loads of really cool features that make a multilingual search awesome. Thanks to Acquia for sponsoring my time to allow me to give back to the community!

I'd like to invite you to come to the next Drupal Search office hours. The topic is general search, so not just these modules. It will take place in two weeks, on 27 March 2013.

Nov 29 2012

Drupal Search has a great ecosystem of modules to integrate with technologies such as Solr. However, it needs more vision and direction to grow and be a great platform that other developers feel comfortable with and can make the right decisions in. We are also convinced that if we all come together and talk, make some decisions, and actually get to work on a regular basis, we can come up with a solution for Drupal that kicks a**! An example of this greatness is the earlier co-operation between the maintainers of Apache Solr and Search API to create a common solrconfig and schema, taking away the pain of having separate schemas for the modules and the frustration that comes with it, as Solr can be quite daunting to set up.

So, that is why we decided to move forward with creating office hours for this aspect of Drupal.

Who is “we”, you might ask?

  • Peter Wolanin has been a contributor to the search space in Drupal since the early days. He is the lead maintainer of the Apache Solr Search module.
  • Chris Pliakas is the lead maintainer of the Facet API module. The Facet API module is integrated tightly with the Search API and the Apachesolr modules. Chris is super fascinated by the Search world and would love to see Drupal thrive in this space.
  • Thomas Seidl is the lead maintainer of the Search API and the Search API Solr module. He started a whole new concept of Search in Drupal.
  • Nick Veenhof is co-maintainer of the Apache Solr and Facet API modules and has plenty of other search contributions. He is also the owner of this blog ;-).

When?

We hold the Drupal Search/Solr contribution mentoring ("office hours") every other week in #drupal-apachesolr on freenode:

Calendar feeds: XML Feed, iCal Feed, HTML (Calendar ID: [email protected])

You want to join?

You are free to join the channel at any time! If you want to get a notification in advance, sign up to join at http://groups.drupal.org/node/270638

Why mentoring?

There are a lot of people that want to help but don't know how; these are the people we want to encourage to get involved. Also, all the maintainers listed above feel we should talk more often than once a year at a DrupalCon to discuss the direction we are taking with these modules.

I hope to see you next Wednesday! We will be capturing the talks there and I will put them on my blog as an archive.

Nov 07 2012

We all love test and staging environments, but it becomes a problem when you have Solr integrated in your project and you have a test Solr, a staging Solr, and a production Solr core. Re-indexing a site is not a big problem, but what if you want to go live and switch indexes immediately?

Drupal.org has wrestled with this problem. So I wanted to show you how they do it.

Create three Solr cores:

  • mysite_core1 (will be used for tests)
  • mysite_core2 (will be used for staging initially)
  • mysite_core3 (will be used for production initially)

Make sure your staging site indexes as if it was the production site. There are a number of tricks you can apply.

Site hash

Make sure both site hashes are exactly the same, otherwise it won't correctly find the indexed content


  variable_set('apachesolr_site_hash', 'mysite_custom_hash');

Base URL

When Solr indexes content, it will use absolute urls. To trick solr you could set the base url to reflect production.


  # $base_url = 'http://www.example.com'; // NO trailing slash!

Alter the documents that are being sent

You could also skip the base_url approach and hardcode the fields you want to change. Using hook_apachesolr_index_documents_alter() you could rewrite any of the staging URLs to reflect production URLs. csevb10 created something similar but used a different hook. Below is more or less the same code, using the documents_alter hook.


  function hook_apachesolr_index_documents_alter($documents, $entity, $entity_type, $env_id) {
    $start_url = variable_get('apachesolr_url_switch_start_url', 'http://staging');
    $end_url = variable_get('apachesolr_url_switch_end_url', 'http://production');
    // Document fields whose values contain URLs that need to be rewritten.
    $elems = array(
      'site',
      'url',
      'content',
      'teaser',
    );
    foreach ($documents as $id => $document) {
      foreach ($elems as $elem) {
        $documents[$id]->{$elem} = str_replace($start_url, $end_url, $document->{$elem});
      }
    }
  }

Create an update function and let it set the correct index position


  function my_module_update_7001() {
    // I am assuming your environment is called "core".
    // Mark all content to be reindexed so our table is in sync with the transferred nodes.
    apachesolr_index_node_solr_reindex('core');
    // Fetch your current apachesolr_index_last from staging.
    $staging_index_last = array();
    variable_set('apachesolr_index_last', $staging_index_last);
    // Fetch your current apachesolr_index_updated from staging.
    $staging_updated = array();
    variable_set('apachesolr_index_updated', $staging_updated);
  }

Switching the core

Once you move to production, make sure you switch the cores, so mysite_core2 becomes production and mysite_core3 becomes staging. You could also copy the complete data dir from Solr to your third core, but that might not be automatic enough for some of you.

Enjoy

Oct 24 2012

Recently I've been involved with drupal.org, upgrading the site to the latest version of Apache Solr Search Integration (from 6.x-1.x to 6.x-3.x, and in the near future to 7.x-1.x). This upgrade path is necessary as we still want to have a unified search capability across Drupal 6 and Drupal 7 sites, for example groups.drupal.org and drupal.org.

If you want to know more about multisite search capabilities with Solr and Drupal, I suggest you read http://nickveenhof.be/blog/lets-talk-apache-solr-multisite as it explains a whole lot about this subject.

One issue that we encountered during the migration is that all content needed to be reindexed, which takes a really long time because drupal.org has so much content. The switch needed to happen as quickly as possible, and the out-of-the-box indexer prevented us from doing this. There are multiple solutions to the dev/staging/production scenario for Solr, and I promise I will tackle those in another blog post. Update: I just did: http://nickveenhof.be/blog/content-staging-apache-solr

This blog post aims to make the indexing speed way quicker by utilizing all the horsepower you have in your server.

Problem

Take a look at the normal indexing scheme : Apache Solr Search Integration's normal indexing procedure

This poses a number of problems that many of you have encountered. The indexing process is slow not because of Solr, but because Drupal has to process each node one at a time in preparation for indexing. And a node_load/view/render takes time. And what about Solr? Solr does not even sweat handling the received content ;-)


  function apachesolr_index_node_solr_document(ApacheSolrDocument $document, $node, $entity_type, $env_id) {
    // ...
    // Heavy part starts here.
    $build = node_view($node, 'search_index');
    $text = drupal_render($build);
    // Heavy part stops here.
    // ...
  }

You could avoid this by commenting this code out and not having the full body text, which is useful if you only want facets. It could also be optimized by completely disabling caching, since we do not want these nodes to be cached during a massive indexing loop. You can read the cache disable blog post to figure out how I've done that.

Architecting a solution

And this is the parallel indexing scheme : Apache Solr Search Integration's parallel indexing procedure

I went looking for a solution that could provide an API to achieve this architecture. After learning a lot about the PHP fork method, it seemed way too complex for what I needed. Httprl on the other hand looked like a good solution: its API allowed me to execute code on another bootstrapped Drupal and, by making blocking requests in parallel, execute the same function multiple times with different arguments.

What does httprl do? Using stream_select() it will send HTTP requests out in parallel. These requests can be made in a blocking or non-blocking way. Blocking will wait for the HTTP response; non-blocking will close the connection without waiting for the response. The API for httprl is similar to the Drupal 7 version of drupal_http_request().
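
As a rough sketch of that pattern (the callback path and chunk sizes are invented, and this assumes httprl's queue-then-send API: httprl_request() to queue, httprl_send_request() to fire), queueing several blocking requests that run in parallel looks roughly like this:

  // Queue one request per chunk of nodes; each one hits a callback on our own
  // site that bootstraps Drupal and indexes that chunk.
  foreach (array(0, 1000, 2000, 3000) as $offset) {
    httprl_request('http://example.com/apachesolr-parallel/index/' . $offset, array(
      // Blocking: wait for the response, but all queued requests run in parallel.
      'blocking' => TRUE,
    ));
  }
  // Send all queued requests at once and collect the responses.
  $responses = httprl_send_request();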

As a result, I created the Apache Solr Parallel Indexing module, which you can download, enable, and configure to achieve parallel indexing.

Try it out for yourselves

  • Enable the module
  • Go to admin/settings/apachesolr/settings
  • Click Advanced Settings
    • Set the number of nodes you want to index (I'd set it to 1000)
    • Set the number of CPUs you have (I've set it to 8; I have 4 physical, but they can handle 2 indexing processes each)
    • Make sure you have the hostname set if your IP does not directly translate to your domain
  • If your IP does not resolve to your Drupal site, go to admin/settings/httprl and set it to -1. This will almost always be the case for testing
  • Index

As you can see, the drush command stays the same. The module will take over the batch logic, so regardless of the UI or drush, it will use multiple drupal bootstraps to process the items.

Results

Hardware : Macbook Pro mid 2011, i5, 4GB RAM, 256GB SSD

I've seen a 4x-6x improvement in time depending on the amount of complexity you have in node_load/entity_load.

Without Parallel Indexing

  • Calculation : 40k nodes in 20 minutes
  • Nodes per second : 33

With Parallel indexing

  • Calculation : 117k nodes in 19 minutes
  • Nodes per second : 112

Another example I got from swentel is this :

Without Parallel Indexing

  • Calculation : 644k nodes in 270 minutes
  • Nodes per second : 39

With Parallel indexing

  • Calculation : 664k nodes in 90 minutes
  • Nodes per second : 119

Extrapolating this to a million nodes, it will take on average 3 hours to finish the whole lot. Comparing this with a blog post that talked about speeding up indexing and clocked in at 24 hours per million, this is a massive improvement.

In the real-life case of swentel, this had an impact of a factor of 3, meaning the indexing went 300% faster compared to not using the module. I'd say it's worth a look at least.

Take a look at the screenshots below to see how I measured and monitored all of this; it was fun!

Future & Improvements

It is still an exercise in balance: the bigger your Drupal site, the longer it takes. As the drupal.org team has this in production, they encountered some problems with overloading their systems. If you overload the system, it is possible that the request crosses its timeout limit and dies. This means that you won't get feedback on how many items have been processed. As a general rule, do not set the number of CPUs too high; perhaps start with half of what you really have and experiment. Be conservative with these settings and monitor your system.

I also would like to experiment with the Drupal queue, but this is a D7-only API, and as this module had to work with Drupal 6 and 7, I decided to opt for this simpler approach. There is a great blog post about Search API and queues, but it involves some coding.

Oct 24 2012

Sometimes you wish that all of Drupal's caches could be disabled for the sake of testing or some other purpose.

Just call the enable/disable functions at will in your custom code. Don't forget to re-enable caching for production if you intend to use this in your development.
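
For example, wrapped around a chunk of code you are debugging (using the helper functions defined below):

  // Throw away cached data while we exercise the code under test.
  my_module_cache_disable();
  // ... run the code you want to observe without cache interference ...
  my_module_cache_enable();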

Drupal 7


  function my_module_cache_disable() {
    $cache_backends = variable_get('cache_backends', array());
    $cache_backends[] = 'includes/cache-install.inc';
    variable_set('cache_backends', $cache_backends);
    // Default to throwing away cache data.
    variable_set('cache_default_class', 'DrupalFakeCache');
    // Rely on the DB cache for form caching - otherwise forms fail.
    variable_set('cache_class_cache_form', 'DrupalDatabaseCache');
  }

  function my_module_cache_enable() {
    // Go back to the default cache class.
    variable_del('cache_default_class');
    // Let form caching fall back to the default as well.
    variable_del('cache_class_cache_form');
  }

Drupal 6


  function my_module_cache_disable() {
    variable_set('cache_inc', './includes/cache-install.inc');
  }

  function my_module_cache_enable() {
    variable_del('cache_inc');
  }

Jul 09 2012

In this third and fast-paced post of the blog series, I'll be talking about Apache Solr clean URLs.

Some history

When we roll back a year or two, I was at a company working on a search-critical Solr + Drupal website, and Drupal was praised for its ability to have dynamic clean URLs. We needed search pages with clean URLs, but the facets needed to be clean as well. I remember that one of the members of the team had put in a lot of effort to make this work. And believe me, he really almost died trying! It required a lot of code and basically a lot of hook_menu_alter implementations to take over the existing functionality.

So, let's fast forward two years. In Drupal 7 (and also Drupal 6, because the whole module has been backported) there is the ability to use search pages with dynamic URLs, and these search pages can have facets if you use the Facet API module. Facet API is basically a module that takes care of the whole display and URL logic when you provide it with some facet data. Thanks to Acquia, I was able to spend a lot of time understanding the whole structure and getting up to speed with all the search-related modules that were in development. So thank you for that opportunity!

Goal

What we want is to convert

http://nickveenhof.be/search/site/screencast?f[0]=im_taxonomy_vocabulary_1%3A101

to

http://nickveenhof.be/search/site/screencast/im_taxonomy_vocabulary_1/101

Roadbumps

There were a few hurdles to jump over before clean URL faceting could become a reality; we needed a better understanding of how the search menu implementations work. Before we can understand those, we should analyze the basics. Let's scope down to the core search module. I've modified the code snippet a bit for better readability.


  $items["search/node/%menu_tail"] = array(
    'title' => $search_info['title'],
    'load arguments' => array('%map', '%index'),
    'page callback' => 'search_view',
    'page arguments' => array($module, 2),
    // ...
  );

You can see that a menu_tail is used and this "Loads path as one string from the argument we are currently at".

This may sound a bit incomprehensible but in essence it means that everything after the 'search/node' path is seen as the argument/query. Let's take a look at some examples.

  • fill in "test and test2" in the search box
  • value is url encoded using drupal_encode_path
  • url becomes search/node/test%20and%20test2"

Another scenario

  • fill in "test/test/test/test"
  • value is url encoded using drupal_encode_path
  • url becomes search/node/test/test/test/test

Wait, what? Why isn't the slash encoded? Look at the drupal_encode_path function: for aesthetic reasons, slashes are not escaped. So we are faced with a problem. We don't know what our search query exactly is when we start adding custom paths (such as facet paths) after or before the search query.


  function drupal_encode_path($path) {
    // For aesthetic reasons slashes are not escaped.
    return str_replace('%2F', '/', rawurlencode($path));
  }

Facet Api Clean URL's

Let's jump back to Apache Solr Search Integration; this module utilises Facet API for all its faceting needs. If we want to utilise clean URLs for search facets, we need to fix this %menu_tail problem.

So, I sat down during the Drupal Dev Days a couple of weeks ago and spoke to dasjo. He is the original author of Facet API Pretty Paths, which was already working in combination with Search API; that was because Search API did not follow the core search paths, nor even depend on core Search.

I've twisted and spun my head around this difficult problem and, after trying too many hacky regular expressions, a tip from sun helped us find a solution for fetching the search query from the URL, avoiding drupal_encode_path and allowing the facets to work on search pages. Because of the difficulties discussed above, the main Apache Solr module had to be patched and some adjustments had to be made to the Facet API Pretty Paths module, but we now have a working combo.

The module is generic enough not to contain any Search API or Apache Solr Search Integration specific logic; it works with any Facet API implementation. A live demo can be found at absolventen.at (Search API), or you can try it out yourselves. For the sake of a demo, I enabled it on my blog (Apache Solr) so you can test it out here as well. I warn you though, it is an alpha! URLs might still look messy.

Clean URLs for Apache Solr and Facet API

Future work

The Facet API Pretty Paths module is still in alpha stage and needs your help to decipher the Facet API URLs into human-readable snippets. The trickiest part is, for example, the date range, where a number of possibilities are valid (want to see which ones?). For example, we need to clean the following one dynamically.

http://nickveenhof.be/search/site/drupalcon?f[0]=ds_created%3A%5B2008-01-01T00%3A00%3A00Z%20TO%202009-01-01T00%3A00%3A00Z%5D

But the result is not very pretty yet...

http://nickveenhof.be/search/site/drupalcon/ds_created/%5B2008-01-01T00%3A00%3A00Z%20TO%202009-01-01T00%3A00%3A00Z%5D

Also, the Facet API Slider is not fully compatible with Facet API Pretty Paths, so help is requested here. And it might be nice to use tokens to replace URLs and whatnot. Not enough time, argh! :-)

We do need your help. There are some beta blockers pending and even if you just report an issue, it would help. Please join the issue queue!

Jul 08 2012

In the second part of this Let's talk Apache Solr series, I'll be handling the revamp of Apache Solr Attachments. Apache Solr Attachments is a module that allows you to index documents. File formats that can be indexed include HTML, XML, Microsoft Office documents, OpenDocument, PDF, RTF, .zip and other compression formats, text formats, audio formats, video formats and more. For a complete list of supported document formats, see the Apache Tika documentation.

This module is great when you have a content-rich website where search should extend further than regular site-content search. The module has existed since the early days of Drupal 6, when there was not even a discussion about entities. We were talking about nodes as if they were the only important thing that would ever exist in Drupal. Anything could be done with nodes, you name it! And I'll tell you... I've seen some ugly things that people have done with those poor nodes...

As we progressed into the Drupal 7 era, it became clear that entities are now to be treated with respect, and therefore code had to be refactored.

Architecture

In order to understand the major improvements, take a look at the following diagram.

Apache Solr Attachments before the relaunch

So, in earlier versions, it inspected every node to see if it had an attachment. However, this had one big flaw: suppose you have a million nodes and only 10 nodes have files attached to them; it still had to inspect all million nodes. Naturally this results in a slower system where a bunch of CPU cycles are not really being used for useful computing. Because of the Drupal 6 architecture it was the only way to get a reliable graph of the attached files.

Now, let's take a look at the newer version.

Apache Solr Attachments after the relaunch

It makes use of EntityFieldQuery to fetch the files that are attached to entities (most probably nodes, see limitations). Let's look at the following snippet, which fetches all files from filefields in entities that have filefields defined. It is a little bit stripped down, so I hope it is still readable. You will notice that we spawn an object called ApachesolrAttachmentsEntityFieldQuery. This is due to the fact that a regular EntityFieldQuery can't return anything other than entity_id, entity_bundle, and entity_type. We needed the file id from the filefield for those entities without creating complex code, so we embraced the OO concepts and extended EntityFieldQuery.


  // Get all the fields in our system.
  $fields = field_info_field_by_ids();
  foreach ($fields as $field_id => $field_info) {
    // If the field is typed as file, continue.
    if ($field_info['type'] == 'file') {
      // Find the entities and bundles this field has been attached to.
      foreach ($field_info['bundles'] as $entity_type => $bundles) {
        $entity_info = entity_get_info($entity_type);
        $query = new ApachesolrAttachmentsEntityFieldQuery();
        $results_query = $query
          ->entityCondition('entity_type', $entity_type)
          ->fieldCondition($field_info['field_name'])
          // Fetch all file ids related to the entities.
          ->addExtraField($field_info['field_name'], 'fid', 'fid')
          ->execute();
      }
    }
  }
  return $results;

If you want to see how it works, or you want to use this in your own project, you can take a look at the sandbox pcambra and I created: Entity Field Query Extra Fields. The addExtraField() function was added in our extended class and basically allows you to ask for any value of a row that is available in the data storage for your field.

What we achieved here is a stable and reliable way to get all attached files without inspecting each and every node separately. Definitely a huge #win!

Media

Another interesting improvement is that Apache Solr Attachments now has Media/File Entity support. Media is a drop-in replacement for the Drupal core upload field with a unified user interface where editors and administrators can upload, manage, and reuse files and multimedia assets. Any files uploaded before Media was enabled will automatically take advantage of many of the features it comes with.

One of the challenges here was that Media/File Entity added multiple bundles to the file entity type and that the fields are more dynamic. I'm sure not all use cases have been tested, but so far you can already select any of the media entity bundles and let them index. As far as I, and many others in the queue, have tested, this works quite well.

Indexing Media bundles

Different indexing methods

Just to point out, because this is not new: Apache Solr Attachments can extract info using a separate Tika jar file hosted on your server, or from an embedded Tika app in Apache Solr. This is an advantage because, when set up right, you can offload this processing to Solr. Preferably this is a redundant solution so it can handle the heavy lifting for you. Acquia Search allows you to seamlessly integrate this module in your website because it offers the ability to extract attachments for no additional fees. Take care when you set this up yourself so that your search queries are not slowed down during the extraction of document content.

Limitations

Please note, files that are not attached to another entity won't be indexed by default. It is certainly possible, but nobody has asked for it yet. Also, even though the code supports files that are attached to anything other than nodes, it has not been tested yet. Do you know a use case for this and are you able to test it? Please go ahead and report back in the issue queue of Apache Solr Attachments.

If this interests you, we welcome you to help us make Apache Solr Attachments better! Questions and/or remarks are very welcome.

Jul 03 2012

Have you ever worked with the Apache Solr Search Integration project? I certainly hope so! At Acquia, we have invested a lot of time to make this module stable for Drupal 7. We did not make any sacrifices in regards to speed and/or optimizations. For one, I spent a lot of time during my thesis making the Apache Solr Search Integration module awesome, and I also tried to understand (and eventually understood!) why Peter Wolanin was so reluctant to rely on some of the upcoming and popular modules.

You will see that this module is still a stand-alone module, and we have good reasons for that. Let's assume that you have a huge website with tons of nodes; the last thing you want is for your contrib module to depend on many other modules, leaving you dependent on the bugfixing speed of those modules. Also, we needed to have an easy backport path to Drupal 6, which is one of the reasons why the Apache Solr Search Integration module does not depend on Entity API and Ctools. For the critics: it does have Ctools integration whenever Ctools is enabled.

Multisite Search

The Drupal 6.x-3.x branch of Apachesolr has the exact same API and schema as the 7.x-1.x branch. This has the additional benefit that it allows you to create the following schema.

So, if you pay close attention, you can see that many sites talk to the same index. They can use the same index and search each other's content. If you want to enhance this experience, it is recommended you install the apachesolr_multisitesearch module, which allows you to see which sites have their content indexed in the shared Solr index. It also allows you to isolate a site and only let that site search its own content.

Coding for multisite

The following code snippet should explain how the linkage happens between two Drupal sites. This snippet is by no means something you should copy-paste for use in your production site. It is explanatory :-)


  function custom_module_apachesolr_query_alter(DrupalSolrQueryInterface $query) {
    // Add our hash to the filter; this limits the search set to the current site only.
    $query->addFilter('hash', apachesolr_site_hash());
    // Only search within the story content type. Since a multisite environment cannot
    // share ids we need to filter on the machine name. You can link content types
    // across sites if they share the same machine name.
    $query->addFilter('bundle_name', 'story');
  }

Jun 24 2012

My first review

It took me some days to actually get started on this and start writing. As some of you already know, Packt has given me the Drupal 7 Multi-sites Configuration book to review. You might wonder, won't this review be biased because I got the book for free? Honestly, I did my best to write down my real thoughts and the details that you, as the reader, would find interesting. At least, I gave it my best shot! Please don't shoot me for being too critical or too soft!

The book itself

http://www.packtpub.com/drupal-7-multi-sites-configuration/book
There is a picture of the book in the attachment below.

Concepts

The book starts off with some concepts of a multi-site environment, such as the advantages of running Drupal on the same codebase, having a staging/testing environment before deploying to a live site, easier server administration, and even partly avoiding some of your shared hosting restrictions. This chapter is actually really well written and well defined. It makes clear that you need to have a use case before diving into the matter. It also makes reading the book a little more pleasant, because you have these ideas in your mind.

All basic install stuff

Next up was setting up a server. I did not really expect this part to be in the book, and especially not biased so heavily towards Vagrant... I think I missed the whole concept of why Vagrant is so important in a multi-site setup. Also, the lengthy guides on how to install Drupal (apache, mysql, hosts files, ...) were a little obsolete if you ask me. I'd have enjoyed it more if some more time was spent on how Drupal actually recognizes different domain names. Luckily my appetite was partly satisfied in Chapter 3.

The book continues with how to create site folders, special cases with subdomains and, even a new one for me, domains with subdirectories! Matt also explains how to use the sites.php file, which I actually never use, because of bad habits of mine I guess. This explanation was very educational and eye-opening. A picture of the structure of the folder would have been nice as a way to make the dense text a little lighter, but this is probably me being too harsh! It continues to explain the files directory and, importantly, how to configure that directory in a multi-site environment. Unfortunately also some more (obsolete?) Vagrant information.

Secrets!

So, after Chapter 3 you should be very accustomed to the way you can install multiple Drupal databases on the same codebase. The only minus here is that you are already three-quarters of the way through the book! Luckily the last part of the book unveils some secrets about multi-site installations.

For example, one of these secrets is a shared configuration file! This is not a standard Drupal pattern. With some simple tweaks, Matt explains in detail how this can work for you. If one of you is hosted in the Acquia Cloud, I'll unveil a little secret here: we apply the same trick ;-). The book continues with sharing modules/themes/subthemes, or explicitly not sharing one! You might want to share your main theme, but not the subthemes; a logical pattern, but not so many of you will implement it. I recommend reading this section, because it also has memory implications if you run many multi-sites on one Drupal codebase.

Updating

Chapter 4 is all about updating your site; not very exciting material, but it is a necessity. Good for people who are just starting out in this field. He pays enough attention to the details, so that is a plus. I do feel there is a lot of clutter in the book that might not apply to the masses. And again, the assumption that you use Vagrant distracted me from the core knowledge you need.

Advanced topics

Chapter 5: Advanced Multi-Sites! All about favicons and robots. I hope Matt meant robots.txt ;-). Some other nice concepts for shared authentication float to the top, such as OpenID, LDAP and directory services, the Services module, Bakery, etc. A very interesting chapter if you need to know more about single sign-on, shared content and/or shared structures.

Solr, you knew I had to make a separate header for that!

The last part covers multi-site searching, and this is imho not complete enough. It does mention Apache Solr as a possible solution, but it does not mention apachesolr_multisitesearch as the glue between your site searches. If you use this module, you can easily search multiple sites through one interface. I'll even say more: you can index a D6 and a D7 site in the same index and search them through a common interface (works with apachesolr-7.x-1.0 and apachesolr-6.x-3.x). A sketch of how a shared index stays separable per site follows below.
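
The underlying mechanism, as a minimal sketch of my own (assuming the Drupal 7 apachesolr module, where every indexed document carries a site hash and apachesolr_site_hash() returns the current site's hash): each site can narrow the shared index down to its own documents, or drop the filter to search across all sites.

<?php
/**
 * Implements hook_apachesolr_query_alter().
 *
 * Restrict results to documents indexed by this site. Remove this filter
 * (or filter on another site's hash) to search across all sites sharing
 * the same Solr index.
 */
function mymodule_apachesolr_query_alter($query) {
  $query->addFilter('hash', apachesolr_site_hash());
}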

TL;DR

This book is a great addition to your knowledge base if you plan to set up a multi-site environment and want to do it right. It mentions Vagrant a little too much, and it does not talk about Aegir (which is, imho, the prime example of a multi-site spawner). I've enjoyed reading it and picked up tricks here and there. I also missed some performance metrics, though maybe I overlooked them. If you want this book for free, you can still enter the giveaway I am running in my other blog post.

Certainly a good read if you are new to the topic!

Jun 21 2012
Jun 21

Interrupt

This is not a regular blog post of mine, no Solr internals this time :-)
Packt has contacted me to let me give away books to three participants. The details of what you have to do can be found below. I find it a good opportunity to offer this to all of you; there is no reward or money attached for me, so that puts us all in the same position.

I did get a Packt book to review (the Drupal multi-site book) and I'll post my initial thoughts on it sometime tomorrow.

Win Free Copies of New Drupal Mini books by Packt

The Drupal mini series, which they launched this month, contains three books of 100 to 140 pages each; they are practical guides to achieving specific tasks and applications with particular modules.

Read more about the Packt’s Drupal mini June campaign here: http://www.packtpub.com/news/new-drupal-mini-books

The books you can choose from

  • Drush User’s Guide
    A practical guide full of examples and step-by-step instructions to start using Drush, Drupal’s command line interface.
  • Drupal 7 Multi Sites Configuration
    Configure and install several sites on one instance of Drupal.
  • Drupal 7 Multilingual Sites
    Apply the numerous multilingual modules to your Drupal site and configure it for any number of languages or currencies.

How to Enter this contest?

  • Add a comment to this post letting me know which mini book interests you the most and why
  • Prizes: 3 Lucky winners get a free copy of any one of Packt’s Drupal mini books of their choice. Print copies can only be shipped to winners in the US and UK, and if you’re from any other country, you’d receive the e-book version.
  • Deadline: The contest will close on 30th June, 2012.
  • Winners will be contacted by email, so be sure to use your real email address when you comment!

Why wait? Comment away!

The winners are
Ellen Boeke
Vincent Youmans
Jos (Pieter VL)

You will be contacted by Packt as soon as possible! The winners were chosen using a random number generator: http://stattrek.com/statistics/random-number-generator.aspx

Jun 13 2012
Jun 13

You don't *just* install a new version!

I'm sure you've had this situation before: "A new version arrives, they promise you heaven, and when you take the dive you are actually in hell. Everything is broken and you don't really understand why." A very common case of diving into the deep end. To prevent this, I was asked during my internship at Acquia to verify whether the new Solr 3.5 would perform at least as well as Solr 1.4 on the exact same searches. Before upgrading, a lot of testing should happen so that nobody is surprised by sudden problems.

During this process I learned a bunch about Solr server administration, master/slave replication and load testing. Hopefully I've saved you some time in your exploration of the solrconfig and its mergePolicies! And moreover I'd like to thank Acquia and especially Peter Wolanin for his guidance!

What we do know is that the index format of Solr 1.4 can be read by Solr 3.x. This is crucial information to have when updating existing indexes. Be warned: there is a very important distinction to be made when updating masters and slaves in a replication setup. When upgrading, you should always upgrade your slaves first! If you upgrade your master first and a 3.5 index is replicated to a 1.4 slave, you are asking for trouble.

As soon as a first commit/write action is made, Solr will execute an index upgrade process. A fresh index or a re-index is recommended, but the upgraded index will certainly still work.

This blog post was published some time ago, but I am re-publishing it since we have successfully finished the migration of Acquia Search to Solr 3.5, so hopefully it will be of interest to some of you.

Drupal is an application with very deep integration with Apache Solr, and it updates Solr during cron runs (every 30 minutes, for example). This implies that indexing speed does not need to be very high, but search speed does. Apache Solr has a concept of segments (your index is spread over multiple segments), and when a search is executed it needs to gather all these segments and search them. Logically, more segments = slower results.

Solr 3.5 came with a new default merge policy (TieredMergePolicy), and that required some testing to see whether we could trust it.

Information regarding these policies can be found here : http://java.dzone.com/news/merge-policy-internals-solr

And read up on the following docs :
LogByteSizeMergePolicy
LogDocMergePolicy
TieredMergePolicy

Steps taken to execute these tests

  • Load existing index files in to a new core.
  • Extract Documents from this index
  • Use the extracted documents to insert them in a clean and new core with different configuration
  • Replay the access log of that subscription for the searches: use 3000 queries per access log, discard everything except the select queries, and repeat this process three times to make sure we have a balanced result set
  • If you have more questions about these tests, please leave a comment and I'll be happy to provide you with an answer!

Conclusions

If you want to migrate from Solr 1.4 to Solr 3.5 with a low risk of changes, you should keep using the LogByteSizeMergePolicy with a mergeFactor of 4 (the default in the Drupal configs).
However, the TieredMergePolicy is interesting when understood correctly. I'd love some more comments on that topic from people who know more about it.

The big result of this test is that Solr 3.5 versus 1.4 is a big, big performance win. Also good to know is that the merge policy should be set explicitly when using luceneMatchVersion; see the sketch below.
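
For reference, a minimal sketch of what these settings look like in a Solr 3.x solrconfig.xml (my own illustration, not the exact config we shipped; double-check element placement against your own file):

<!-- Pin the Lucene behaviour explicitly near the top of solrconfig.xml. -->
<luceneMatchVersion>LUCENE_35</luceneMatchVersion>

<indexDefaults>
  <!-- The conservative choice when coming from Solr 1.4 (Drupal default): -->
  <mergeFactor>4</mergeFactor>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>

  <!-- Alternatively, the new Solr 3.5 default, with its two main knobs:
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
  -->
</indexDefaults>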

Cautiously, I dare say that the differences between RHEL5 and Ubuntu 10.04 are immense. I have to do some extra testing to be sure this result is actually true.

Charts and extra Legend information

  • S14 stands for Solr 1.4
  • S35 stands for Solr 3.5
  • LB stands for Load Balancer (C1.medium)
  • SL stands for Slave; this means the attack happened from the LB to the SL (these tests were run 3 times to contain variable delays)
  • MA stands for Master; this means the attack happened from the LB to the MA (these tests were run 3 times to contain variable delays)
  • MergeFactor for LogByteSizeMerge and LogDocMerge is set to 4
  • Default means the default merge policy; for Solr 1.4 this is LogByteSizeMergePolicy, and for Solr 3.5 this depends on the LuceneMatchVersion
  • L35 means that Lucene has been set to Lucene 3.5 instead of the default
  • When Lucene 3.5 is set for Solr 3.5 and no merge policy was set, this defaults to TieredMergePolicy
  • When Settings is defined, it applies to specific TieredMergePolicy settings
    • maxMergeAtOnce says how many segments can be merged at a time for "normal" (not optimize) merging
    • segmentsPerTier controls how many segments you can tolerate in the index (bigger number means more segments)
  • Distro/kernel version for most of them: CentOS 5.2 32-bit, 2.6.18-200906190310 / 2.6.18-xenU-ec2-v1.0
  • U stands for Ubuntu: Ubuntu 10.04.4 LTS / 2.6.32-341-ec2

Specifications

Specifications of the Master

Large Instance (M1.large)
7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)

Specifications of the Slave

High-CPU Medium Instance (C1.medium)
1.7 GB of memory
5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)

(Charts omitted: results shown with extreme values taken away for better visibility.)

Apr 29 2012
Apr 29

This is part 3 in a series of a few that helps people explore the Apache Solr module.

This tutorial explains visually how to install the Acquia Connector and how to connect your Drupal site to the Acquia Search service.
It continues where screencast 2 stopped.

Step 1) Install the Acquia Connector module
Step 2) Create your free dev cloud subscription
Step 3) Connect everything
Step 4) Index, search, and be happy that you now have a highly available and highly scalable search solution without maintenance worries!

This is partly a preparation for a workshop I am preparing for Drupalcamp Porto.

Apr 29 2012
Apr 29

This is part 2 in a series of a few that helps people explore the Apache Solr module.

This tutorial explains visually how to install Facet API and how to get started with basic faceted search.
It continues where screencast 1 stopped.

Step 1) Install the module
Step 2) Enable some facets and look at their options
Step 3) Put them in their respective regions
Step 4) Be happy!

This is partly a preparation for a workshop I am preparing for Drupalcamp Porto.

Apr 25 2012
Apr 25

This is part 1 in a series of a few that helps people explore the Apache Solr module.

This tutorial explains visually how to install the module and how to get started with the Solr server. My plan is to create a bunch of these tutorials to help people out. I hope I can cover all of the UI and all API functions at least once in practice.

This is partly a preparation for a workshop I am preparing for Drupalcamp Porto.

Links used in this tutorial
http://drupal.org/project/apachesolr
http://www.nickveenhof.be/blog/simple-guide-install-apache-solr-3x-drupal-7
