Feeds

Jul 05 2020
Jul 05

Migrate in core is among my favorite parts of Drupal 8 and 9. The framework is super flexible, and it makes migrating content from any source you can dream up pretty straightforward. Today I want to show a trick that I use when I receive a CSV (or Excel) file from clients, where they want all of its contents migrated to Drupal. One very simple example would be a list of categories.

Typically the file will come with one term on each line. However, migrate wants us to set an ID for all of the terms, which currently none of the rows have. One solution is to add an ID to every row manually with some sort of spreadsheet software, and then point our migration to the new column for its IDs. But since that involves both the words "manual" and "spreadsheet software", it immediately makes me want to find another solution. Is there a way we can set the row ID programmatically, based on the row number instead? Why, yes, there is!

So, here is a trick I use to set the ID from the line number:

The migration configuration looks something like this:

id: my_module_categories_csv
label: My module categories
migration_group: my_module
source:
  # We will use a custom source plugin, so we can set the 
  # ID from there.
  plugin: my_module_categories_csv
  track_changes: TRUE
  header_row_count: 1
  keys:
    - id
  delimiter: ';'
  # ... And the rest of the file 

As stated in the yaml file, we will use a custom source plugin for this. Let's say we have a custom module called "my_module". Inside that module folder, we create a file called CategoriesCsv.php at the path src/Plugin/migrate/source/CategoriesCsv.php. And in that file we put something like this:

<?php

namespace Drupal\my_module\Plugin\migrate\source;

use Drupal\migrate\Plugin\MigrationInterface;
use Drupal\migrate\Row;
use Drupal\migrate_source_csv\Plugin\migrate\source\CSV;

/**
 * Source plugin for Categories in csv.
 *
 * @MigrateSource(
 *   id = "my_module_categories_csv"
 * )
 */
class CategoriesCsv extends CSV {

  /**
   * {@inheritdoc}
   */
  public function prepareRow(Row $row) {
    // The delta here is the row number in the file.
    $delta = $this->file->key();
    $row->setSourceProperty('id', $delta);
    return parent::prepareRow($row);
  }

}
   

In the code above we set the source property id to the delta (that is, the row number) of the row, which means you can have a source like this:

Name
Category1
Category2
Category3

Instead of this:

id;Name
1;Category1
2;Category2
3;Category3
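
For completeness, here is a minimal sketch of what the rest of the migration could look like, assuming the terms go into a vocabulary with the machine name "categories" (that name, and mapping the Name column to the term name, are just assumptions for the example):

process:
  # The Name column from the CSV becomes the term name.
  name: Name
destination:
  plugin: entity:taxonomy_term
  default_bundle: categories

With migrate_tools installed, you can then run it like any other migration, for example with drush migrate:import my_module_categories_csv.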

The best part of this is that when your client changes their mind, you can just swap in the updated file instead of editing it before updating it. And by editing, I mean "manually" and with "spreadsheet software". Yuck.

To finish this post, here is an animated gif called "spreadsheet software yuck"

Jun 11 2020
Jun 11

On a project, I just discovered I had to both change how some blocks were configured and remove access to some other blocks (with the Layout Builder Restrictions module). This is a somewhat disruptive change. Even though the editors are not supposed to need or want to use any of the blocks I was going to make disruptive changes to, I had to make sure I was not sabotaging their work somehow. So I wanted to find out what blocks and layouts were actually used on the different pages built with Layout Builder.

So I wrote a simple query to find them all, and print them to my console. I then realized this might be handy to have for later. Or heck, maybe it is even handy for other people. Let's just throw it on the blog. It looks like this:

<?php

use Drupal\node\Entity\Node;

$rows = \Drupal::database()
  ->select('node__layout_builder__layout', 'ns')
  ->fields('ns', ['entity_id'])
  ->groupBy('ns.entity_id')
  ->execute();
$types = [];
$layouts = [];
foreach ($rows as $row) {
  /** @var \Drupal\node\Entity\Node $node */
  $node = Node::load($row->entity_id);
  if (!$node) {
    continue;
  }
  if (!$node->hasField('layout_builder__layout') || $node->get('layout_builder__layout')->isEmpty()) {
    continue;
  }
  $layout = $node->get('layout_builder__layout')->getValue();
  foreach ($layout as $item) {
    /** @var \Drupal\layout_builder\Section $section */
    $section = $item['section'];
    $layouts[$section->getLayoutId()] = TRUE;
    $section_array = $section->toArray();
    foreach ($section_array["components"] as $delta => $component) {
      $id = $component["configuration"]["id"];
      if (in_array($id, $types)) {
        continue;
      }
      $types[] = $id;
    }
  }
}
print_r($types);
print_r(array_keys($layouts));

This will now give me the plugin IDs of all of the blocks used on a Layout Builder page, along with the IDs of all the layouts used on those pages.

Now I can do disruptive changes without worrying about actually disrupting the work for editors.

Disclaimer: This does not take into account any unsaved work any editors have in their layout. If you have a very active site with very active editors, you might want to add that into the mix. But that is left as an exercise for the reader.

Hope that is useful to someone else, but if not, here is a useful GIF about it being summer soon (for me anyway).

May 23 2020
May 23

My last blog post was about my journey towards "getting rid" of Disqus as a comment provider, and moving over to hosting my comments on Github. The blog post was rather light on technical details, so here is the full technical walkthrough with code examples and all.

Since I already explained most of the reasoning and architectural choices in that post, I will just go ahead with a step by step of the technical parts I think are interesting. If that means I end up skipping over something important, please let me know in the (Github) comments!

Step 0: Have a Gatsby.js site that pulls its content from a Drupal site

This tutorial does not cover that, since many people have written such tutorials. As mentioned in my article "Geography in web page performance", I really like this tutorial from Lullabot, and Gatsby.js even has an official page about using Drupal with Gatsby.js.

Step 1: Create a repo to use for comments

The first step was to create an actual repository to act as the place for the comments to live. It should be public, since we want anyone to be able to comment. I opted for the repo https://github.com/eiriksm/eiriksm.dev-comments. To create a repository on GitHub, you need an account, and then just visit https://github.com/new.

Step 2: Associate blog posts with issues

The second step is to have a way to indicate which issue is connected to which blog post. Luckily I am still using Drupal, so I can just go ahead and create a field called "Issue id". Then, for each post I write, I create an issue on said repo, and take a note of the issue id. For example, this blog post has a corresponding issue at https://github.com/eiriksm/eiriksm.dev-comments/issues/6, so I just go ahead and write "6" in my issue id field.

Step 3: Fetch all the issues and their comments to Gatsby.js

This is an important part. To be able to render the comments at build time, the most robust approach is to actually "download" the comments and have them available in graphql. To do that you can try to use the source plugin for GitHub, but I opted for getting the relevant issues and comments and creating the source myself. Why, you may ask? Well, the main reason is that this way I can more easily control how the issue comments are stored, and by extension also store the Drupal ID of the node along with the comments.

To achieve this, I went ahead and used the same JSON:API source as Gatsby uses under the hood, and then fetched the issue comments for each of the nodes that have an issue ID reference. This goes into the gatsby-node.js file, to make the data we fetch available to the static rendering. Something along these lines:

// Assuming node-fetch for the fetch calls below (it also provides fetch.Headers).
const fetch = require("node-fetch")

exports.sourceNodes = async({ actions, createNodeId, createContentDigest }) => {
  const { createNode } = actions
  try {
    // My build steps knows the basic auth for my API endpoint.
    const login = process.env.BASIC_AUTH_USERNAME
    const password = process.env.BASIC_AUTH_PASSWORD
    // API_URL holds the base URL of my Drupal site. Like so:
    // API_URL=https://example.com/
    let data = await fetch(process.env.API_URL + 'jsonapi/node/article', {
      headers: new fetch.Headers({
        "Authorization": `Basic ${new Buffer(`${login}:${password}`).toString('base64')}`
      })
    })
    let json = await data.json()
    // Create a job to download each of the corresponding issues and their comments.
    let jobs = json.data.map(async (drupalNode) => {
      // Here you can see the name of my field. Change this to whatever your field name is.
      if (!drupalNode.attributes.field_issue_comment_id) {
        return
      }
      let issueId = drupalNode.attributes.field_issue_comment_id
      // Initialize the data we want to store.
      let myData = {
        drupalId: drupalNode.id,
        issueId,
        comments: []
      }
      // As you can see, this is hardcoded to the repo in question. You might want to change this.
      let url = `https://api.github.com/repos/eiriksm/eiriksm.dev-comments/issues/${issueId}/comments`
      // We need a Github token to make sure we do not hit any rate limiting for anonymous API
      // usage.
      const githubToken = process.env.GITHUB_TOKEN
      let githubData = await fetch(url, {
        headers: new fetch.Headers({
          // Hardcoded to my own username. This probably will not work for you.
          "Authorization": `Basic ${new Buffer(`eiriksm:${githubToken}`).toString('base64')}`
        })
      })
      let githubJson = await githubData.json()
      myData.comments = githubJson
      // This last part is more or less taken from the API docs for Gatsby.js:
      // https://www.gatsbyjs.org/docs/node-apis/#sourceNodes
      let nodeMeta = {
        id: createNodeId(`github-comments-${myData.drupalId}`),
        parent: null,
        mediaType: "application/json",
        children: [],
        internal: {
          type: `github__comment`,
          content: JSON.stringify(myData)
        }
      }
      let node = Object.assign({}, myData, nodeMeta)
      node.internal.contentDigest = createContentDigest(node)
      createNode(node)
    })
    // Run all of the jobs async.
    await Promise.all(jobs)
  }
  catch (err) {
    // Make sure we crash if something goes wrong.
    throw err
  }
}

After we have set this up, we can build the site and start exploring the data in graphql, for example by running npm run develop and visiting the graphql endpoint in the browser, in my case http://localhost:8000/___graphql:

$ npm run develop

> [email protected] develop /home/eirik/github/eiriksm.dev
> gatsby develop
# ...
# Bunch of output
# ...
You can now view eiriksm.dev in the browser.

  http://localhost:8000/

View GraphiQL, an in-browser IDE, to explore your site's data and schema

  http://localhost:8000/___graphql

This way we can now find Github comments with graphql queries, and filter by their Drupal ID. See example animation below, where I find the comments belonging to my blog post "Reflections on my migration journey from Disqus comments".

Graphql browsing

I ended up with the following graphql query, which I can then use in the blog post component:

allGithubComment(filter: { drupalId: { eq: $drupal_id } }) {
  nodes {
    drupalId
    comments {
      body
      id
      created_at
      user {
        login
      }
    }
  }
}

This gives me all Github comments that are associated with the Drupal ID of the current page. Convenient.
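
For context, a fragment like this typically lives inside the page query of the blog post template, with $drupal_id being passed in through the page context. A minimal sketch of such a wrapper, reusing the field names from this post (your setup may differ):

import { graphql } from "gatsby"

export const query = graphql`
  query($drupal_id: String!) {
    nodeArticle(drupal_id: { eq: $drupal_id }) {
      title
      field_issue_comment_id
    }
    allGithubComment(filter: { drupalId: { eq: $drupal_id } }) {
      nodes {
        drupalId
        comments {
          body
          id
          created_at
          user {
            login
          }
        }
      }
    }
  }
`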

After this it's only a matter of using this inside of the blog post component. Since this site mixes and matches comment types a bit (I have a mix of Disqus exported comments and Github comments), it looks something like this:

let data = this.props.data
if (
  data.allGithubComment &&
  data.allGithubComment.nodes &&
  data.allGithubComment.nodes[0] &&
  data.allGithubComment.nodes[0].comments
) {
  // Do something with the comments.
}

Step 4: Make it possible to fetch the comments client side

This is the part that makes it feel real-time. Since the above code that stores comments in graphql only runs when I deploy this blog, comments can quickly get outdated. And from a UX perspective, you would want to see your own comment instantly after posting it. This means that even if we could rebuild and deploy every time someone posted a comment, it would not be instant enough to make sure a person actually sees their own comment. Enter client side fetching.

In the blog post component, I also make sure to do another check against Github when you are viewing the page in a browser. The build step has already fetched the comments once, but I want the user to see the latest updates to them. So in the componentDidMount function I have something like this:

componentDidMount() {
  if (
    typeof window !== `undefined` &&
    // Again, only doing this if the article references a github comment id. And again, this
    // references the field name, and the field name for me is field_issue_comment_id.
    this.props.data.nodeArticle.field_issue_comment_id
  ) {
    // Fetch the latest comments using Github's open API. This means the person using it will
    // spend their own API rate limit, and I do not have to expose any API keys client side.
    window
      .fetch(
        "https://api.github.com/repos/eiriksm/eiriksm.dev-comments/issues/" +
          this.props.data.nodeArticle.field_issue_comment_id +
          "/comments"
      )
      .then(async response => {
        let json = await response.json()
        // Now we have the comments, do some more work mixing them with the stored ones, 
        // and render.
      })
  }
}
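
The "mixing" mentioned in the comment above mostly means not showing the same comment twice: the comments stored at build time are already rendered, so we only add whatever is newer. A rough sketch of that callback body, assuming the build time comments are kept in component state under this.state.comments (that property name is made up for the example):

.then(async response => {
  let json = await response.json()
  // Keep only the comments we did not already have at build time,
  // using the Github comment id as the key.
  let knownIds = this.state.comments.map(comment => comment.id)
  let newComments = json.filter(comment => knownIds.indexOf(comment.id) === -1)
  this.setState({
    comments: this.state.comments.concat(newComments),
  })
})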

Step 5 (bonus): Substitute with Gitlab

Ressa asked in a (Github) comment on my last blogpost about doing this comment setup with Gitlab:

Do you know if something like this is possible with Gitlab?

Fair question. Gitlab is an alternative to Github, but differs in that the software it uses is itself open source, and you can self-host your own version of Gitlab. So let's see if we can substitute the moving parts with Gitlab somehow. Well, turns out we can!

The first parts of the project are the same: create a field, have an account on Gitlab, create a repo for the comments. Then, to source the comment nodes, all I had to change was this:

--- a/gatsby-node.js
+++ b/gatsby-node.js
@@ -125,15 +125,25 @@ exports.sourceNodes = async({ actions, createNodeId, createContentDigest }) => {
           issueId,
           comments: []
         }
-        let url = `https://api.github.com/repos/eiriksm/eiriksm.dev-comments/issues/${issueId}/comments`
-        const githubToken = process.env.GITHUB_TOKEN
+        let url = `https://gitlab.com/api/v4/projects/eiriksm%2Feiriksm.dev-comments/issues/${issueId}/notes`
+        const githubToken = process.env.GITLAB_TOKEN
         let githubData = await fetch(url, {
           headers: new fetch.Headers({
-            "Authorization": `Basic ${new Buffer(`eiriksm:${githubToken}`).toString('base64')}`
+            "PRIVATE-TOKEN": githubToken
           })
         })
         let githubJson = await githubData.json()
+        if (!githubJson[0]) {
+          return
+        }
         myData.comments = githubJson
+        // Normalize it slightly.
+        myData.comments = myData.comments.map(comment => {
+          comment.user = {
+            login: comment.author.username
+          }
+          return comment
+        })
         let nodeMeta = {
           id: createNodeId(`github-comments-${myData.drupalId}`),
           parent: null,

As you might see, I took a shortcut and re-used some variables, meaning the graphql is still querying against GithubComment, which is a bad name for a bunch of Gitlab comments. But I think cleaning those things up would be an exercise for the reader. The last part, client side updates, was not possible to do in a similar way: the Gitlab API requires authenticating that request, much like the above code for sourcing the comments. But while the sourcing happens at build time, client side updates happen in the browser, which would mean that there would be a secret token in your JavaScript for everyone to see. So that is not a good option. It is indeed theoretically possible though, and could for example be achieved with a small proxy server or serverless function.

The even cooler part is that the exact same code also works for a self-hosted Gitlab instance, so you can indeed run open source software and own your own data with this approach. That being said, not having to run a server was one of my requirements, so self-hosting would defeat the purpose for me. At that point one might as well use Drupal as headless comment storage.

I would still like to finish with it all working in Gitlab too though. In an animated gif, as I always end my posts. Even though you are not allowed to test it yourself (since, you know, the token is exposed in the JavaScript), I think it serves as an illustration. And feel free to (Github) comment if you have any questions or feel I left something out.

Reflections on my migration journey from Disqus comments

Mar 15 2020
Mar 15

Querying the migrate database from custom code

Mar 03 2020
Mar 03
Feb 22 2020
Feb 22

Earlier we wrote about stress testing, featuring Blazemeter, where you could learn how to crash your site without worrying about the infrastructure. So why did I even bother to write this post about the do-it-yourself approach? We have a complex frontend app, where it would be nearly impossible to simulate all the network activity faithfully over a long period of time. We wanted to use a browser-based testing framework, namely WebdriverI/O with some custom Node.js packages, on Blazemeter, and it proved quicker to manage the infrastructure ourselves and have full control of the environment. What happened in the end? Using a public cloud provider (in our case, Linode), we programmatically launched the needed number of machines temporarily, provisioned them to have the proper stack, and executed the WebdriverI/O test. With Ansible, Linode CLI and WebdriverIO, the whole process is repeatable and scalable. Let’s see how!

Infrastructure phase

Any decent cloud provider has an interface to provision and manage cloud machines from code. Given this, if you need an arbitrary number of computers to launch the test, you can have them for 1-2 hours (100 endpoints for the price of a coffee, how does that sound?).

There are many options to dynamically and programmatically create virtual machines for the sake of stress testing. Ansible offers dynamic inventory, however the cloud provider of our choice wasn’t included in the latest stable version of Ansible (2.7) at the time of this post. Also, the solution below makes the infrastructure phase independent: any kind of provisioning (pure shell scripts, for instance) is possible with minimal adaptation.

Let’s follow the steps in the installation guide for the Linode CLI. The key is to have the configuration file at ~/.linode-cli with the credentials and the machine defaults. Afterwards you can create a machine with a one-liner:

linode-cli linodes create --image "linode/ubuntu18.04" --region eu-central --authorized_keys "$(cat ~/.ssh/id_rsa.pub)"  --root_pass "$(date +%s | sha256sum | base64 | head -c 32 ; echo)" --group "stress-test"

Given the specified public key, password-less login will be possible. However, this is far from enough before the provisioning: booting takes time, and the SSH server is not available immediately. Our situation is also special in that after the stress test we would like to drop the instances immediately, together with the test execution, to minimize costs.

Waiting for the machines to boot is a slightly longer snippet; the CSV-like output is robustly parsable:

## Wait for boot, to be able to SSH in.
while linode-cli linodes list --group=stress-test --text --delimiter ";" --format 'status' --no-headers | grep -v running
do
  sleep 2
done

However, the SSH connection is likely not yet possible, so let’s wait for the port to be open:

for IP in $(linode-cli linodes list --group=stress-test --text --delimiter ";" --format 'ipv4' --no-headers);
do
  while ! nc -z $IP 22 < /dev/null > /dev/null 2>&1; do
    sleep 1
  done
done

You may realize that this is overlapping with the machine booting wait. The only benefit is that separating the two allows more sophisticated error handling and reporting.

Afterwards, deleting all machines in our group is trivial:

for ID in $(linode-cli linodes list --group=stress-test --text --delimiter ";" --format 'id' --no-headers);
do
  linode-cli linodes delete "$ID"
done

So after packing everything into one script, with an Ansible invocation in the middle, we end up with stress-test.sh:

#!/bin/bash

LINODE_GROUP="stress-test"
NUMBER_OF_VISITORS="$1"

NUM_RE='^[0-9]+$'
if ! [[ $NUMBER_OF_VISITORS =~ $NUM_RE ]] ; then
  echo "error: Not a number: $NUMBER_OF_VISITORS" >&2; exit 1
fi

if (( $NUMBER_OF_VISITORS > 100 )); then
  echo "warning: Are you sure that you want to create $NUMBER_OF_VISITORS linodes?" >&2; exit 1
fi

echo "Reset the inventory file."
cat /dev/null > hosts

echo "Create the needed linodes, populate the inventory file."
for i in $(seq $NUMBER_OF_VISITORS);
do
  linode-cli linodes create --image "linode/ubuntu18.04" --region eu-central --authorized_keys "$(cat ~/.ssh/id_rsa.pub)" --root_pass "$(date +%s | sha256sum | base64 | head -c 32 ; echo)" --group "$LINODE_GROUP" --text --delimiter ";"
done

## Wait for boot.
while linode-cli linodes list --group="$LINODE_GROUP" --text --delimiter ";" --format 'status' --no-headers | grep -v running
do
  sleep 2
done

## Wait for the SSH port.
for IP in $(linode-cli linodes list --group="$LINODE_GROUP" --text --delimiter ";" --format 'ipv4' --no-headers);
do
  while ! nc -z $IP 22 < /dev/null > /dev/null 2>&1; do
    sleep 1
  done
  ### Collect the IP for the Ansible hosts file.
  echo "$IP" >> hosts
done
echo "The SSH servers became available"

echo "Execute the playbook"
ansible-playbook -e 'ansible_python_interpreter=/usr/bin/python3' -T 300 -i hosts main.yml

echo "Cleanup the created linodes."
for ID in $(linode-cli linodes list --group="$LINODE_GROUP" --text --delimiter ";" --format 'id' --no-headers);
do
  linode-cli linodes delete "$ID"
done

Provisioning phase

As written earlier, Ansible is just one option, though a popular one, to provision machines. For such a test, even a bunch of shell commands would be sufficient to set up the stack. However, after you have tasted working with infrastructure in a declarative way, it becomes the first choice.

If this is your first experience with Ansible, check out the official documentation. In a nutshell, we just declare in YAML how the machine(s) should look, and what packages it should have.

In my opinion, a simple playbook like the one below is readable and understandable as-is, without any prior knowledge. So our main.yml is the following:

- name: WDIO-based stress test
  hosts: all
  remote_user: root

  tasks:
    - name: Update and upgrade apt packages
      become: true
      apt:
        upgrade: yes
        update_cache: yes
        cache_valid_time: 86400

    - name: WDIO and Chrome dependencies
      package:
        name: "{{ item }}"
        state: present
      with_items:
         - unzip
         - nodejs
         - npm
         - libxss1
         - libappindicator1
         - libindicator7
         - openjdk-8-jre

    - name: Download Chrome
      get_url:
        url: "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb"
        dest: "/tmp/chrome.deb"

    - name: Install Chrome
      shell: "apt install -y /tmp/chrome.deb"

    - name: Get Chromedriver
      get_url:
        url: "https://chromedriver.storage.googleapis.com/73.0.3683.20/chromedriver_linux64.zip"
        dest: "/tmp/chromedriver.zip"

    - name: Extract Chromedriver
      unarchive:
        remote_src: yes
        src: "/tmp/chromedriver.zip"
        dest: "/tmp"

    - name: Start Chromedriver
      shell: "nohup /tmp/chromedriver &"

    - name: Sync the source code of the WDIO test
      copy:
        src: "wdio"
        dest: "/root/"

    - name: Install WDIO
      shell: "cd /root/wdio && npm install"

    - name: Start date
      debug:
        var: ansible_date_time.iso8601

    - name: Execute
      shell: 'cd /root/wdio && ./node_modules/.bin/wdio wdio.conf.js --spec specs/stream.js'

    - name: End date
      debug:
        var: ansible_date_time.iso8601

We install the dependencies for Chrome, Chrome itself, WDIO, and then we can execute the test. For this simple case, that’s enough. As I referred to earlier:

ansible-playbook -e 'ansible_python_interpreter=/usr/bin/python3' -T 300 -i hosts main.yml

What’s the benefit over the shell scripting? For this particular use-case, mostly that Ansible makes sure that everything can happen in parallel and we have sufficient error-handling and reporting.

Test phase

We love tests. Our starter kit has WebdriverIO tests (among many other types of tests), so we picked it to stress test the full stack. If you are familiar with JavaScript or Node.js, the test code will be easy to grasp:

const assert = require('assert');

describe('podcasts', () => {
    it('should be streamable', () => {
        browser.url('/');
        $('.contact .btn').click();

        browser.url('/team');
        const menu = $('.header.menu .fa-bars');
        menu.waitForDisplayed();
        menu.click();
        $('a=Jobs').click();
        menu.waitForDisplayed();
        menu.click();
        $('a=Podcast').click();
        $('#mep_0 .mejs__controls').waitForDisplayed();
        $('#mep_0 .mejs__play button').click();
        $('span=00:05').waitForDisplayed();
    });
});


This is our spec file, which is the essence, alongside the configuration.

Could we do it with a bunch of requests in JMeter or Gatling? Almost. The icing on the cake is where we stress test the streaming of the podcast. We simulate a user who listens to the podcast for 10 seconds. For any frontend-heavy app, realistic stress testing requires a real browser, and WDIO provides us exactly this.

The WebdriverIO test execution - headless mode deactivated

Test execution phase

After making the shell script executable (chmod 750 stress-test.sh), we are able to execute the test either:

  • with one visitor from one virtual machine: ./stress-test.sh 1
  • with 100 visitors from 100 virtual machines for each: ./stress-test.sh 100

with the same simplicity. However, for very large scale tests, you should think about some bottlenecks, such as the capacity of the datacenter on the testing side. It might make sense to randomly pick a datacenter for each testing machine.
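
A minimal sketch of that idea, assuming a hardcoded list of candidate regions is acceptable (the region names besides eu-central are just examples):

# Pick a random region per machine instead of always using eu-central.
REGIONS=(eu-central us-east ap-south)
REGION=${REGIONS[$RANDOM % ${#REGIONS[@]}]}
linode-cli linodes create --image "linode/ubuntu18.04" --region "$REGION" \
  --authorized_keys "$(cat ~/.ssh/id_rsa.pub)" \
  --root_pass "$(date +%s | sha256sum | base64 | head -c 32 ; echo)" \
  --group "$LINODE_GROUP" --text --delimiter ";"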

The test execution consists of two main parts: bootstrapping the environment and executing the test itself. If bootstrapping the environment takes too high a percentage of the total time, one strategy is to prepare a Docker image, and instead of creating the environment again and again, just use the image. In that case, it’s a great idea to check for a container-specific hosting solution instead of standalone virtual machines.

Would you like to try it out now? Just do a git clone https://github.com/Gizra/diy-stress-test.git!

Result analysis

For such a distributed DIY test, analyzing the results could be challenging. For instance, how would you measure requests/second for a specific browser-based test, like WebdriverI/O?

For our case, the analysis happens on the other side. Almost all hosting solutions we encounter support New Relic, which could help a lot in such an analysis. Our test was DIY, but the result handling was outsourced. The icing on the cake is that it helps to track down the bottlenecks too, so a similar solution for your hosting platform can be applied as well.

However, what if you’d like to somehow gather the results together after such a distributed test execution?

Without going into detail, you may study the fetch module of Ansible, so you can gather a result log from all the test servers and have it locally in a central place.
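
A minimal sketch of such a task for the playbook above, assuming the WDIO run is changed to write its log to /root/wdio/wdio.log (that path is made up for the example):

    - name: Fetch the test log from every machine
      fetch:
        src: "/root/wdio/wdio.log"
        dest: "results/{{ inventory_hostname }}-wdio.log"
        flat: yes

With flat: yes, each log ends up in a local results/ directory, named after the host it came from.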

Conclusion

It was a great experience: after we faced some difficulty with a hosted stress test platform, in the end we were able to recreate a solution from scratch without much more development time. If your application also needs special, unusual tools for stress testing, you might consider this approach. All the chosen components, such as Linode, WebdriverIO or Ansible, are easily replaceable with your favorite solution. Geographically distributed stress testing, fully realistic website visitors with heavy frontend logic, low-cost stress testing: it seems now you’re covered!

Feb 22 2020
Feb 22

Here is a quick tip if you want to create a step definition that has an argument with multiple lines. A multiline string argument if you like.

I wanted to test that an email was sent, with a specific subject, to a specific person, and containing a specific body text.

My idea was to create a step definition that looked something like this:

Then an email has been sent to "test@example.com" with the subject "Subject example" and the body "one of the lines in the body

plus this is the other line of the body, after an additional line break"

So basically my full file is now this:

@api @test-feature
Feature: Test this feature
  Scenario: I can use this definition
    Then an email has been sent to "test@example.com" with the subject "Subject example" and the body "one of the lines in the body

    plus this is the other line of the body, after an additional line break"


My step definition looks like this:

  /**
   * @Then an email has been sent to :email with the subject :subject and the body :body
   */
  public function anEmailHasBeenSentToWithTheSubjectAndTheBody($email, $subject, $body) 
  {
      throw new PendingException();
  }

Let’s try to run that.

$ ./vendor/bin/behat --tags=test-feature

In Parser.php line 393:
                                                                                                                                                    
  Expected Step, but got text: "    plus this is the other line of the body, after an additional line break"" in file: tests/features/test.feature

Doing it that way simply does not work. You see, by default a line break in the Gherkin DSL has an actual meaning, so you cannot put a line break in your argument and expect it to just pass along everything up until the closing quote. What we actually want is to use a PyString. But how do we use them, and how do we define a step to receive them? Let's start by converting our scenario to use the PyString multiline syntax:

@api @test-feature
Feature: Test this feature
  Scenario: I can use this definition
    Then an email has been sent to "test@example.com" with the subject "Subject example" and the body
    """
    one of the lines in the body

    plus this is the other line of the body, after an additional line break
    """

Now let’s try to run it:

$ ./vendor/bin/behat --tags=test-feature                                                                                        
@api @test-feature
Feature: Test this feature

  Scenario: I can use this definition                                                                 # tests/features/test.feature:3
    Then an email has been sent to "test@example.com" with the subject "Subject example" and the body
      """
      one of the lines in the body
      
      plus this is the other line of the body, after an additional line break
      """

1 scenario (1 undefined)
1 step (1 undefined)
0m0.45s (32.44Mb)

 >> default suite has undefined steps. Please choose the context to generate snippets:

A bit closer. Our output actually tells us that we have a missing step definition, and suggests how to define it. That’s better. Let’s try the suggestion from the output, now defining our step like this:

  /**
   * @Then an email has been sent to :email with the subject :subject and the body
   */
  public function anEmailHasBeenSentToWithTheSubjectAndTheBody2($email, $subject, PyStringNode $string)
  {
      throw new PendingException();
  }

The difference here is that we do not add the variable name for the body in the annotation, and we specify that we want a PyStringNode type parameter last (so remember to import Behat\Gherkin\Node\PyStringNode in your context class). This way behat will know (tm).

After running the behat command again, we can finally use the step definition. Let's have a look at how we can use the PyString class.

  /**
   * @Then an email has been sent to :email with the subject :subject and the body
   */
  public function anEmailHasBeenSentToWithTheSubjectAndTheBody2($email, $subject, PyStringNode $string)
  {
      // This is just an example.
      $mails = $this->getEmailsSomehow();
      // This is now the important part, you get the raw string from the PyStringNode class.
      $body_string = $string->getRaw();
      foreach ($mails as $item) {
          // Still just an example, but you probably get the point?
          if ($item['to'] == $email && $item['subject'] == $subject && strpos($item['body'], $body_string) !== FALSE) {
              return;
          }
      }
      throw new \Exception('The mail was not found');
  }

And that about wraps it up. Writing tests is fun, right? As a bonus, here is an animated gif called "Testing".

Geography in web page performance

Feb 08 2020
Feb 08

Migrating Panel nodes into Layout Builder nodes. Is that even possible?

Feb 28 2019
Feb 28

Drupalcamp Oslo is coming up, and it is going to be awesome!

Oct 29 2018
Oct 29

Creating services with optional dependencies in Drupal 8

Jun 07 2018
Jun 07

How to find the route name in Drupal 8?

Apr 19 2018
Apr 19

Updating to Drupal 8.5 with composer

Mar 02 2018
Mar 02

Automatic updates using violinist.io

Nov 27 2017
Nov 27

Drupal Camp Oslo is coming this November

Aug 09 2017
Aug 09

The annual meeting of Drupal enthusiasts in Norway (and elsewhere) will take place on the 11th of November, with community sprints happening November 12th.

Every year, the camp attracts Drupal professionals and hobbyists from Norway, but also from the surrounding countries. If you want to meet Drupal enthusiasts from our region, this is a great chance to do so.

We also want to invite people who want to speak to submit their session proposals to our website. Whether you are a seasoned conference speaker or you want to give your first session, you are very welcome to submit your talk to Drupal Camp Oslo.

If you prefer to just attend, you are just as welcome! Tickets are now available for purchase, and at the moment they are extra early-bird cheap! We also have different price tiers if you attend as a hobbyist or student, and we hope for a diverse audience of attendees, both in the sessions and in the sprints on Sunday!

If you have any questions regarding the event, feel free to reach out to me in the comments, by mail or on Twitter.

See you in Oslo in November!

Drupal and IoT. Code examples, part 2: User temperatures!

May 25 2015
May 25

Drupal and IoT. Code examples, part 1

May 03 2015
May 03

Drupal and the Internet of Things

Apr 13 2015
Apr 13

Twice the fun: Twig on the server, twig on the client.

Jan 22 2015
Jan 22

Headless Drupal with head fallback

Nov 23 2014
Nov 23

Now running Drupal 8, in the most hipster way imagined.

Jul 27 2014
Jul 27

My first site on Drupal 8. Or securing a Drupal site like a paranoid.

Dec 12 2013
Dec 12

One-liner for fetching the live-database on your dev server (or localhost)

Oct 14 2013
Oct 14

So when I first realised that I was neglecting this blog, I found some comfort in the fact that at least it hadn't been a year, right? OK, so now it has almost been a year. Does that mean I have stopped finding solutions to my Drupal problems (as the "About me" block states)? Well, no. The biggest problem is remembering to blog about them, or finding the time. But finding the time is not a Drupal problem, and I most definitely have not found the solution to that problem. Anyway, I digress. Let's end the streak with a quick tip that I use all the time: syncing live databases without having drush aliases.

If you use drush aliases, you could just do drush sql-sync. But even when I do, I still prefer this method, as I find it extremely easy to debug.

First I make sure I am in my Drupal root:

$ cd /path/to/drupal

Then I usually make sure I can bootstrap Drupal from drush:

$ drush st

If that output says my database is connected, then let's test the remote connection:

$ ssh user@example.com "drush --root=/path/in/remote/server st"

If that output also says we are connected, it means we are good to go, with my favourite command ever:

$ ssh user@example.com "drush --root=/path/in/remote/server sql-dump" | drush sql-cli

If you have a drush-alias for the site, you can also go:

drush @live-site-alias sql-dump | drush sql-cli

OK. So what does this do?

The first part says we want to SSH in to our remote live server. Simple enough. The next part, in double quotes, tells SSH to execute that command on the remote server, and the command tells the remote server to dump the entire database to standard output. Then we pipe that output to the command "drush sql-cli" on our local machine, which basically means we dump a MySQL database straight into a MySQL command line connected to our local database.
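
If the database is big and the connection is slow, the same pipe can also be compressed in transit with plain gzip on the remote side (the user, host and path are placeholders, just like above):

$ ssh user@example.com "drush --root=/path/in/remote/server sql-dump | gzip" | gunzip | drush sql-cli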

Troubleshooting:

If you get this error:

bash: drush: command not found

I guess you could get this error if you are for example on shared hosting, and use a local installation of drush in your home folder (or something). Simply replace the word drush with the full path to your drush command. For example:

$ ssh user@example.com "/path/to/drush --root=/path/in/remote/server sql-dump" | drush sql-cli

If you get this error:

A Drupal installation directory could not be found

You probably typed in the wrong remote directory. Try again!

I'm going to end this with a semi-related animated gif. Please share your semi-related workflow commands (or animated gifs) in the comments.

Drupal and HTTP requests revisited

Oct 27 2012
Oct 27

Quick tip for dev sites with advanced integration

Oct 27 2012
Oct 27

Drupal in 1 HTTP request: Results!

Jul 03 2012
Jul 03

How to create webforms programatically

Jun 10 2012
Jun 10

Virtual host, hosts file, downloading and enabling modules and installing Drupal in one line.

May 08 2012
May 08

Woah, that's a long title.

Today I wanted to share the little shell script I use for setting up local development sites in a one-liner in the terminal. Be advised that I don't usually write shell scripts, so if you are a pro and I have made some obvious mistakes, please feel free to suggest improvements in the comment section. Also, while I have been writing this post, I noticed that Klausi also has a kick ass tutorial and shell script on his blog. Be sure to read that as well, since that article is much more thorough than mine. But since my script is for the even lazier, I decided to post it anyway.

The idea I had was pretty simple. I constantly have the need to set up a fresh, working copy of Drupal in various versions with various modules, e.g. for bug testing, prototyping, or just testing a patch on a fresh install. While drush make and installation profiles are both awesome tools for a lot of things, I wanted to install Drupal without making make files and writing .info files, and at the same time generate the virtual host and edit my hosts file. And why not also download and enable the modules I need? For a while I used just

$ drush si

(site-install) on a specified directory for flushing and starting over, but partly since I have little experience in writing shell scripts (I said that, right?), I thought: what the hey, let's give that a go. It's fun to learn new stuff. On my computer the script is called vhost, but since that is not a descriptive name, and the script is all about being lazy, let's call it lazy-d for the rest of the article (lazy-d for lazy Drupal. Also, it is kind of catchy).

So the basic usage is

$ ./lazy-d [domainname] [drupalversion] "[modules to download and enable]"

For example:

$ ./lazy-d example.dev 7 "views ctools admin_menu coffee"

This will download Drupal 7 to example.dev in your www directory, create a database and user called example.dev, install Drupal with the site name example.dev, edit your hosts file so you can navigate to example.dev in your web browser, and download and enable the views, ctools, admin_menu and coffee modules.

Running the script like this:

$ ./lazy-d example.dev

will do the same, but with no modules (so Drupal 7 is default, as you might understand).

You can also use it to download Drupal 8, but the drush site install does not work at the moment (it used to when I wrote the script a couple of weeks back though). Drupal 6 is fully supported.

The script has been tested on Ubuntu 12.04 and Mac OS X 10.7. On Mac you may have to do some tweaking (depending on whether you have MAMP or XAMPP or whatever) to the mysql variable at the top, and probably do something else with the a2ensite and apache2ctl restart lines as well (again, depending on your setup).

The script obviously requires Apache, PHP, MySQL and drush (and git if downloading Drupal 8).

OK, to the code:

#!/bin/bash

# Configuration
VHOST_CONF=/etc/apache2/sites-available/
ROOT_UID=0
WWW_ROOT=/var/www
MYSQL=`which mysql`
DRUSH=`which drush`
# Path to your drush aliases file; an alias for each new site will be appended to it.
DRUSHRC=/home/MYUSERNAME/.drush/aliases.drushrc.php
TMP_DIR=/var/www/tempis
# Change this to your own admin address.
SERVER_ADMIN=webmaster@example.com
LOG_DIR=/var/log

# Check if we are running as root
if [ "$UID" -ne "$ROOT_UID" ]; then
  echo "You must be root to run this script."
  exit
fi

# Check that a domain name is provided
if [ -n "$1" ]; then
  DOMAIN=$1
else
  echo "You must provide a name for this site"
  echo -n "Run this script like $0 mycoolsite.test."
  exit
fi

# Downloading the correct Drupal version.
cd $WWW_ROOT
if [ -n "$2" ]; then
  VERSION=$2
else
  VERSION=7
fi

case $VERSION in
  7) $DRUSH dl --destination=$TMP_DIR
  ;;
  8) git clone --recursive --branch 8.x http://git.drupal.org/project/drupal.git $1
  ;;
  6) $DRUSH dl drupal-6 --destination=$TMP_DIR
  ;;
esac

# Move the folder to the correct place. Drupal 8 was cloned directly to the correct place.
if [ "$VERSION" -ne "8" ]; then
  cd $TMP_DIR
  mv drupal* $WWW_ROOT/$DOMAIN
  rm -rf $TMP_DIR
fi

# Create the database and user.
Q0="DROP DATABASE IF EXISTS \`$1\`;"
Q1="CREATE DATABASE IF NOT EXISTS \`$1\`;"
Q2="GRANT ALL ON \`$1\`.* TO '$1'@'localhost' IDENTIFIED BY '$1';"
Q3="FLUSH PRIVILEGES;"
SQL="${Q0}${Q1}${Q2}${Q3}"

# If you want to keep your password in the file, uncomment the following line and edit the password.
# $MYSQL -uroot --password=Mypassword -e "$SQL"

# If you have no password, uncomment the following line to skip the password prompt.
# $MYSQL -uroot -e "$SQL"

# Else, this prompts for password. If you have uncommented any of the above,
# comment this out, please.
echo "Need the MYSQL root password"
$MYSQL -uroot -p -e "$SQL"

echo "<VirtualHost *:80>" >> $VHOST_CONF/$DOMAIN
echo "    ServerAdmin $SERVER_ADMIN" >> $VHOST_CONF/$DOMAIN
echo "    DocumentRoot $WWW_ROOT/$DOMAIN" >> $VHOST_CONF/$DOMAIN
echo "    ServerName $DOMAIN" >> $VHOST_CONF/$DOMAIN
echo "    ErrorLog $LOG_DIR/$DOMAIN-error_log" >> $VHOST_CONF/$DOMAIN
echo "    CustomLog $LOG_DIR/$DOMAIN-access_log common" >> $VHOST_CONF/$DOMAIN
echo "" >> $VHOST_CONF/$DOMAIN
echo "    <Directory \"$WWW_ROOT/$DOMAIN\">" >> $VHOST_CONF/$DOMAIN
echo "      AllowOverride All" >> $VHOST_CONF/$DOMAIN
echo "    </Directory>" >> $VHOST_CONF/$DOMAIN
echo "</VirtualHost>" >> $VHOST_CONF/$DOMAIN

echo "127.0.0.1 $DOMAIN" >> /etc/hosts

# This will enable the virtual host in apache. If you are on mac, just include
# the vhost directory in the apache config, and comment out the following line.
a2ensite $DOMAIN

# If you are on mac, check if this will really restart your MAMP/XAMPP apache,
# or just comment this out and restart with the pretty GUI manually.
apache2ctl restart

cd $WWW_ROOT/$DOMAIN
drush si --db-url=mysql://$1:$1@localhost/$1 --site-name=$1 --account-pass=admin
if [ -n "$3" ]; then
  drush dl $3
  drush en $3
fi

# Create Drush Aliases
if [ -f $DRUSHRC ] && [ -n "$DRUSHRC" ]; then
  echo "" >> $DRUSHRC
  echo "<?php" >> $DRUSHRC
  echo "\$aliases['$1'] = array (" >> $DRUSHRC
  echo " 'root' => '$WWW_ROOT/$DOMAIN'," >>$DRUSHRC
  echo " 'uri' => 'http://$1'," >>$DRUSHRC
  echo ");" >>$DRUSHRC
  echo "?>" >> $DRUSHRC
else
  echo "No drush alias written."
fi

echo "Done. Navigate to http://$DOMAIN to see your new site. User/pass: admin."
exit 0

All done, indeed. Pretty straightforward code. Known issues: it will drop an existing database if you specify one that exists. But why would you do that? Don't you know what you are doing?

To finish this off, here is an animated gif called "alien computer":

Drupal and HTTP requests - how low can you go?

Feb 25 2012
Feb 25

Creating nodes with images using phonegap and services

Feb 04 2012
Feb 04
Jan 21 2012
Jan 21

You're probably thinking, “But I don’t want to think about Drupal 9 yet!”

Well, adulting sometimes means doing things we don’t want to do, like thinking about CMS version upgrades. 

We can make this easy.

For now, here’s what you need to know about Drupal 9.

1. Drupal 9 is targeted for a June 2020 release.  

Eighteen months following the release of Drupal 9, in November of 2021, Drupal 7 and 8 will both hit EOL status. This means that you will have a little more than a year to move your site from Drupal 7 or 8, following the June 2020 release date of Drupal 9. 

The normal schedule would dictate that Drupal 7 hit EOL shortly after the Drupal 9 launch. However, because so many sites are still on D7, and at this point, many may just skip D8 completely, the decision has been made to extend D7 support for an additional year. As a result, D7 and D8 will both lose support at the same time.

The November 2021 date is significant because that is also when Symfony 3, which is a major dependency for Drupal 8, will lose support.

What this means for you as a D7 or D8 site owner is that you should have a plan to be on Drupal 9 by the summer of 2021.

2. The transition from Drupal 8 will not be painful.

Unlike Drupal 8, which for practical purposes was a new CMS sharing the Drupal name, Drupal 9 is being developed on Drupal 8. That means if you are up to date with Drupal 8.9, the move to Drupal 9 should be relatively easy. It probably won’t be Drupal 8.4 to 8.5 easy, but it should be nothing close to the level of effort of moving from Drupal 7 to Drupal 8. In a lot of ways, Drupal 9 is just the next six-month update that comes after Drupal 8.9, with the added complication of being the version where deprecated code and APIs are moved out of the codebase.

My recommendation if you are still on Drupal 7: migrate to Drupal 8 when it's convenient. As stated above, Drupal 9 is really just Drupal 8.10.1, so you kind of are migrating to Drupal 8, even if it’s called Drupal 9 at that point. You won’t save any effort by waiting until Drupal 9 is out, you’ll just be on the outdated D7 codebase longer.

Concerning modules: as long as modules aren’t depending on any APIs that are being deprecated in D9, contributed modules should continue to work with both D8 and D9.

The good news is that if you are on an updated version of D8, the transition to D9 should be smooth. If you are on D7, you are essentially doing the same thing whether you migrate to Drupal 8 or Drupal 9.

We’re here to help! If you want to talk in more depth about what Drupal 9 has in store, and start to make a plan for the transition, contact us for information.

Dec 22 2011
Dec 22

This is an addition to the Bibliography & Citation module (BibCite), which enables users to automatically prepopulate bibliographic reference data by entering a DOI (Digital Object Identifier).

Let us remind you what a bibliography and a citation are. A bibliography is a list of all of the sources you have used (whether referenced or not) in the process of researching your work. A citation is a borrowing of a text fragment, formula, provision, illustration, table, or other element from an author; it can also be a non-literal, translated or paraphrased reproduction of a fragment of text, or an analysis of the content of other publications, in the text of the work.

The Bibliography & Citation - Crossref module’s purpose is to save time and make the process of adding information easy and convenient for admins. With this module, the website’s admins no longer have to look up and enter by themselves:

  • the authors' names
  • the titles of the works
  • the names and locations of the companies that published your copies of the sources
  • the dates your copies were published
  • the page numbers of your sources in special fields.

Everything will be filled out automatically. The only thing admins need to know and enter is the DOI (Digital Object Identifier) of the book, conference proceedings, article, working paper, or technical report.

The DOI is a modern standard for identifying information published on the Internet, used by all major international scientific organizations and publishing houses.

A DOI is a unique string of words and numbers consisting of two parts, a prefix and a suffix. For example, in 12.3456/789, the prefix 12.3456 is the publisher id (12 is an identifier tag, 3456 points to a specific publisher), and the suffix 789 is an object identifier pointing to a specific object.

You can find the full information about the BibCite module created by ADCI Solutions in this case study.

You can download the Bibliography & Citation - Crossref module and the Bibliography & Citation module from the drupal.org website.

Dec 01 2011
Dec 01

Turmoil in the Drupal community?

Considering the fact that there are around 800,000 websites currently operating on Drupal 7, upgrading to the latest release will be a huge resource drain. Not only will it be an expensive feat to achieve, but also a time-demanding one. Quite frankly, there is not enough time to upgrade all of the Drupal 7 websites, and also not enough Drupal developers to take on the workload. So, what do you think: is it feasible for so many websites to upgrade to Drupal 9 in such a short period of time?

Drupal 9 will be released in the summer of 2020

Drupal 8 was released on November 19, 2015, which makes it over 3 years old already. Its successor, Drupal 9, is making its way towards a release and is scheduled for June 3, 2020. But what does this mean for Drupal 7 and 8?

For starters, Drupal 8 and 7 will stop receiving support in 2021. This is mainly because Symfony 3, one of the biggest dependencies of Drupal 8, will stop receiving support. Drupal 7 will no longer be backed by the official community or by the Drupal Association on Drupal.org. What this means is that the automated testing services for Drupal 7 will be shut down. On top of that, the Drupal Security Team will stop providing security patches. If you are not able to upgrade to Drupal 9, there will still be some organisations providing Drupal 7 Vendor Extended Support, which will be a paid service. However, despite this, there will be approximately a year's worth of time to plan for and upgrade to the latest installment of Drupal.

Overview of the consequences for Drupal 7

What this means for your Drupal 7 sites is, as of November 2021:

  • Drupal 7 will no longer be supported by the community at large. The community at large will no longer create new projects, fix bugs in existing projects, write documentation, etc. around Drupal 7.
  • There will be no more core commits to Drupal 7.
  • The Drupal Security Team will no longer provide support or Security Advisories for Drupal 7 core or contributed modules, themes, or other projects. Reports about Drupal 7 vulnerabilities might become public creating 0 day exploits.
  • All Drupal 7 releases on all project pages will be flagged as not supported. Maintainers can change that flag if they desire to.
  • On Drupal 7 sites with the update status module, Drupal Core will show up as unsupported.
  • After November 2021, using Drupal 7 may be flagged as insecure in 3rd party scans as it no longer gets support.
  • Best practice is to not use unsupported software, it would not be advisable to continue to build new Drupal 7 sites.
  • Now is the time to start planning your migration to Drupal 8.

Source: https://www.drupal.org/psa-2019-02-25

Drupal promises a smooth upgrade to Drupal 9

The good news is that the change from Drupal 8 to Drupal 9 will not be as abrupt as the change from Drupal 7 to 8 was. This is because Drupal 9 will be based on Drupal 8; in fact, the first release of Drupal 9 will be similar to the last release of Drupal 8. In short, some new features will be added, the deprecated code will be removed and the dependencies will be updated, but the Drupal experience will not be reinvented. In order to have a really smooth upgrade, the only thing necessary is to keep your Drupal 8 site updated at all times. This will ensure that your upgrade is as fluid as possible, without many inconveniences.

What is the best course of action to follow when upgrading to Drupal 9

Well, you have a couple of options at your disposal:

  1. You either wait for Drupal 9 to be launched and then make the change from 7 directly to 9.
  2. You first make the change from Drupal 7 to Drupal 8, which is going to be an abrupt change anyway, and then you prepare for the change to Drupal 9.
  3. Your final option would be to find a new CMS altogether, which would be the most resource-hungry option of all.

So, considering the choices you have at hand, the best of the bunch would be to start preparing for the upgrade to Drupal 8 and then, when the time comes, to Drupal 9. By doing this, you will have enough time to plan your upgrade roadmap ahead, without having to compromise on the quality of the upgrade by rushing it. On top of that, this is an opportunity to review how good your website is at attracting leads and converting those leads to sales. Moreover, you can check whether your website is still in line with your enterprise vision and mission statement; if not, here is an opportunity to make your site reflect those aspects of your business.

Even though change might look scary to some of you, this is also an opportunity to evaluate and improve your online digital presence. So make the fullest use of this chance to create and provide a better online environment for your potential and current customers.

Nov 23 2011
Nov 23

Internal search ensures that once people make it to your site, they find precisely what they need. If your internal search isn’t delivering the goods, visitors tend to search elsewhere. In this article, I’ll walk you through the steps we took to refine an on-site search experience to keep our client’s customers happy.

In my upcoming talk at DrupalCon, my colleague Matthias Michaux and I will show you, step by step, how to improve search results with machine learning in Drupal.

DrupalCon: [Machine Learning] Creating more relevant search results with "Learn to Rank". Wed, 28 Oct, 15:00 to 15:40, in the Auditorium.

When Internal Search isn’t working

The irony of having so much great content is that it makes it harder for customers to find exactly what they need. A great on-site search experience can mean the difference between keeping a customer and losing one. When it’s broken, an internal search might cough up an old dusty document and disregard something more recent; or it might give a seeker something broadly popular with most visitors, but not the niche topic they need.

A client came to us with a broken search experience that wasn’t displaying the most relevant content for some of their most important keywords. They’d done their homework and provided us a list of over 150 keywords and the corresponding search results.

Their site was built with Drupal 7 and relied on Search API and Search API Solr for content indexing and content display. We started out with two important troubleshooting steps to make sure everything was plugged in correctly: the Apache Solr configuration, and the boosting settings.

First step: Follow Apache Solr Best-Practices for Drupal Search

Since the website was in Dutch, we started by applying our usual best practices for word matching. No instance of Apache Solr search will work properly if the word matching and language understanding haven’t been appropriately configured. For example, if you’re dealing with Apache Solr search in multiple languages, you have to make sure you configure Apache Solr to handle multilingual content correctly.

For this particular website, many of the Apache Solr best practices hadn’t been followed.

  1. Store rendered output in a single field. Always make sure to store the rendered output of your content and additional data (such as meta tags) in one single field in the index. This makes it a lot easier to search across all the relevant data, it simplifies optimizations further down the road, and it keeps your queries smaller and faster.
  2. Filter HTML code. Also make sure to filter out any remaining HTML code, as it does not belong in the index.
  3. Eliminate field label cruft in Drupal. Last but not least, make sure to index this data using as little markup as possible, and get rid of the field labels. You can do this by assigning a specific view mode to each content type (Drupal 7 View Modes, Drupal 8 View Modes); see the sketch after this list.
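As an illustration, here is a minimal sketch of how a dedicated view mode for indexing could be registered in Drupal 7. The module name mymodule and the view mode name solr_index are placeholders; the code would live in mymodule.module:

/**
 * Implements hook_entity_info_alter().
 *
 * Registers a lightweight view mode that can be themed without
 * field labels or extra markup, and then used by the Search API index.
 */
function mymodule_entity_info_alter(&$entity_info) {
  $entity_info['node']['view modes']['solr_index'] = array(
    'label' => t('Solr index'),
    'custom settings' => TRUE,
  );
}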

If you’re having trouble with these basics, just get in touch with us so we can give you a hand.

Next step: Improve Search Relevance with Boosting in Apache Solr

Once you’ve improved how content is indexed in Solr, you can make sure visitors are finding the best content. Apache Solr works with relevance scores to determine where a result should be positioned in relation to all of the other results. In search, boosting a page increases the chance it will show up in a search. 

With Solr’s boosting functionality you can adjust how much certain factors contribute to the relevance score, which changes how the overall results are generated. The client website that I mentioned earlier, for example, had custom code that boosted content based on its content type at indexing time.

Let’s dig into this a bit deeper.

http://localhost:8983/solr/drupal7/select?
q=ereloonsupplement& // Our term that we are searching for
defType=edismax& // The eDisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field while supporting the full Lucene syntax.
qf=tm_search_api_aggregation_1^1.0& // Search on an aggregated Search Api field containing all content of a node.
qf=tm_node$title^4.0& // Search on a node title
qf=tm_taxonomy_term$name^4.0& // Search on a term title
fl=id,score& // Return the id & score field
fq=index_id:drupal7_index_terms& // Exclude documents not containing this value
fq=hash:e3lda9&  // Exclude documents not containing this value
rows=10& // Return top 10 items
wt=json& // Return it in JSON
debugQuery=true // Show me debugging information

When looking at the debugging information, we can see the following:

"e3lda9-drupal7_index_terms-node/10485": "\n13.695355 = max of:\n  13.695355 = weight(tm_search_api_aggregation_1:ereloonsupplement in 469) [SchemaSimilarity], result of:\n    13.695355 = score(doc=469,freq=6.0 = termFreq=6.0\n), product of:\n      6.926904 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n        10.0 = docFreq\n        10702.0 = docCount\n      1.9771249 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n        6.0 = termFreq=6.0\n        1.2 = parameter k1\n        0.75 = parameter b\n        248.69698 = avgFieldLength\n        104.0 = fieldLength\n",


This information shows that item scores are calculated based on the boosting of the fields that our system has had to search through. We can further refine it by adding other boost queries such as:

bq={!func}recip(rord(ds_changed),1,1000,1000)


This kind of boosting has been around in the Drupal codebase for a very long time; it even dates back to Drupal 6. It makes it possible to boost documents based on the date they were last updated, so that more recent documents end up with higher scores.
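In Drupal 7 with Search API Solr, a query-time boost like this can be added from a small custom module. The sketch below assumes the hook_search_api_solr_query_alter hook provided by the Search API Solr module and a placeholder module name mymodule:

/**
 * Implements hook_search_api_solr_query_alter().
 *
 * Adds a boost query so that recently changed content scores higher.
 */
function mymodule_search_api_solr_query_alter(array &$call_args, SearchApiQueryInterface $query) {
  // bq accepts multiple values; append rather than overwrite.
  $call_args['params']['bq'][] = '{!func}recip(rord(ds_changed),1,1000,1000)';
}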
 

Did this fix the search experience? Hmm…

We took those steps to make sure Apache Solr was working as we’d expect. Sounds great, right? Not quite for the customer, though!

One of the tricks with boosting is that tweaking the boosting settings for one term will impact the results for another. In this case, the client had been going back and forth with small boosting changes over time, without knowing their effect. Every change developers make to search boosting settings requires a full check of all the other search terms. It’s a constant battle, and a frustrating one. We discovered that even resolving these issues with boosting didn’t fix the problem: the top results for key searches were still not the documents they wanted customers to find. In the screenshot, you can see that for this particular query, the most relevant result according to the customer is only ranked as number 7. We’ve had earlier instances where the desired result wouldn’t even be in the top 50!

In comes Machine Learning and Learning To Rank

Learning to rank, or machine-learned ranking (MLR), is a method for improving ranking models by training them on what is relevant and what is not.

We quote from the Apache Solr website: “In information retrieval systems, Learning to Rank is used to re-rank the top N retrieved documents using trained machine learning models. The hope is that such sophisticated models can make more nuanced ranking decisions than standard ranking functions like TF-IDF or BM25.” from Apache Solr Docs: Learning to Rank

So to improve the search, we used an in-house application that allows trained human raters from the client’s team to indicate whether a search result is ‘Not Relevant’, ‘Somewhat Relevant’, ‘Relevant’ or ‘Highly Relevant’. Human judgment for relevance testing is one way to make sure the best results make their way to the top.

The application sends direct queries to Solr and allows the client to select certain documents that are relevant for the query. Dropsolid adjusted this so that it can properly work with Drupal Search API Solr indexes for Drupal 7 and 8.

We used the application from the screenshot to record the preferred search results for each search term. In the background, it translates this into a JSON document that lists the document IDs per keyword along with their customer relevance score from 0 to 3, with 3 being the most relevant.
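The exact structure of that JSON is internal to the tool, but conceptually it maps each keyword to the IDs of the rated documents and their scores. A purely illustrative sketch, with document IDs following the format from the Solr debug output above:

{
  "ereloonsupplement": {
    "e3lda9-drupal7_index_terms-node/10485": 3,
    "e3lda9-drupal7_index_terms-node/9911": 1
  },
  "another keyword": {
    "e3lda9-drupal7_index_terms-node/10640": 2
  }
}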

This is a very important step in any optimization process, as it defines a baseline for the tests that we are going to perform.
 

Search results

Footnote: this screenshot shows a separate application, written in Python, based on a talk by Sambhav Kothari from Bloomberg Tech at FOSDEM.

Ranker performance

Using the baseline that we’ve set, we can now calculate how our original boosting setup impacts the search results. As shown in the screenshot above, we can calculate the F-Score, recall and precision of the end result when using our standard scoring model.

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
The question that precision answers is the following: of all the results that surfaced to the top, how many were actually labeled as relevant? High precision relates to a low false-positive rate. The higher, the better, with a maximum of 1.
Src: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/

Recall is the ratio of correctly predicted positive observations to all observations in the actual class.

The question recall answers is: of all the documents that were labeled as relevant, how many actually came back in the results? The higher, the better, with a maximum of 1.
Src: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/

If we just look at the top-5 documents, the combined F-Score is 0.35, with a precision of 0.28 and a recall of 0.61. This is quite bad, as only about 60% of our relevant documents appear in the top 5 across all search queries. The precision tells us that, of the top-5 documents, only about 30% were labeled as relevant.
 

Training our ranking model

Before any of this can work, we have to let Solr know which features exist, so that the model can learn how important each feature is based on the feedback. An example of such a feature could be the freshness of a node (based on the changed timestamp in Drupal), or the score of the query against a specific field or against data such as meta tags. For reference, I’ve listed both below:

{  
   "name":"freshnessNodeChanged",
   "class":"org.apache.solr.ltr.feature.SolrFeature",
   "params":{  
      "q":"{!func}recip( ms(NOW,ds_node$changed), 3.16e-11, 1, 1)"
   },
   "store":"_DEFAULT_"
},
{  
   "name":"metatagScore",
   "class":"org.apache.solr.ltr.feature.SolrFeature",
   "params":{  
      "q":"{!edismax qf=tm_metatag_description qf=tm_metatag_keywords qf=tm_metatag_title}${query}"
   },
   "store":"_DEFAULT_"
}
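Once the feature definitions are in place, they have to be uploaded to Solr’s feature store. Assuming the Learning to Rank plugin is enabled in solrconfig.xml, the definitions above are saved as features.json and the core is called drupal7 as in the earlier query, the upload could look roughly like this:

curl -XPUT 'http://localhost:8983/solr/drupal7/schema/feature-store' \
  --data-binary "@features.json" \
  -H 'Content-type:application/json'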

Using the RankLib library (https://sourceforge.net/p/lemur/wiki/RankLib/), we can train our model and import it into Apache Solr. There are a couple of different model types that you can train, for example a linear model or LambdaMART, and you can further refine the training with parameters such as the number of trees and the metric to optimize for.
You can find more details at https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html
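To give an idea of what that step can look like, here is a rough sketch. The file names are placeholders, the judgment data first has to be converted into RankLib’s LETOR text format, and the trained model has to be expressed in Solr’s JSON model format before it can be uploaded to the model store:

# Train a LambdaMART model (ranker type 6), optimizing for NDCG@10.
java -jar RankLib.jar -train judgments.txt -ranker 6 -metric2t NDCG@10 -save lambdamart-model.txt

# After converting the model to Solr's JSON representation, upload it to the model store.
curl -XPUT 'http://localhost:8983/solr/drupal7/schema/model-store' \
  --data-binary "@model.json" \
  -H 'Content-type:application/json'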
 

Applying our ranking model

Using the rq parameter, we can apply our newly trained model and re-rank the first 100 results according to it. The re-rank query looks roughly like this, where the model name is whatever name the model was stored under:

rq={!ltr model=myModel_2019-02-11-12:24 reRankDocs=100}

If we look at the actual result, it shows us that the search results that we’ve marked as relevant are suddenly surfacing to the top. Our model assessed each property that we defined and it learned from the feedback! Hurray!
 

Search results 2


We can also compare all the different models. If we just look at the top-5 documents, the combined F-Score of our best-performing model is 0.47 (vs 0.35), with a precision of 0.36 (vs 0.28) and a recall of 0.89 (vs 0.61). This is a lot better, as almost 90% of our relevant documents now appear in the top 5 across all search queries. The precision tells us that, of the top-5 documents, 36% were labeled as relevant. This is a bit skewed, though, as for some queries we only have one highlighted result.

So, to compare fully, I did the same calculation for the first result only. With our original model, only 46% of our desired results show up as the first result. With our best-performing model, we improve this score to 79%!

Obviously, we still have some work to do to turn that 79% into 100%, but I would like to stress that this result was achieved without changing a single thing about the client’s content. The remaining few cases are keywords missing from the meta tags and Dutch words that somehow are not processed correctly in Solr.
 

Solr


Of course, we wouldn’t be Dropsolid if we hadn’t integrated this back into Drupal! Are you looking to give back to the community and need a hand optimizing your search? Give us the opportunity to contribute this back for you - you won’t regret it.
 

Drupal Solr

In brief

We started out troubleshooting this perplexing search problem by making sure we had the essential configurations in the right place. When that didn’t fix things, we turned to machine learning in Drupal search. 

We compiled a learning dataset, trained our model and uploaded the result to Apache Solr. Next, we used this model during our queries to re-rank the top 100 results based on the trained model. It is still important to have a good data model, which means getting all the basics of Search covered first.

Need help with optimizing your Drupal 7 or Drupal 8 site search? Contact us.

If you’re coming to DrupalCon, join me and my colleague Matthias Michaux for an in-depth look at machine learning and Drupal search.

DrupalCon: [Machine Learning] Creating more relevant search results with "Learn to Rank" Wed, 30 Oct, 16:15 to 16:55, in the Auditorium

