Feb 01 2018

As a digital agency we need to have a good content management solution for our clients. Even in situations where we are developing more custom apps than content web applications, we still need a good, modular CMS solution. As Symfony developers, we wanted to find powerful CMS solutions built on Symfony. We wanted to use our Symfony knowledge for building custom things on our chosen CMS solution. In this article, I will show you what we learned and how you can build things using Symfony inside Drupal.

This article was originally published in the February 2018 issue of php[architect] magazine. To read the complete article please subscribe or purchase the complete issue.

Mar 23 2017

Preface

We recently had the opportunity to work on a Symfony app for one of our Higher Ed clients, for whom we had recently built a Drupal distribution. Drupal 8's move to Symfony has enabled us to expand our service offering. We have found more opportunities to build apps directly with Symfony when a CMS is not needed. This post is not about Drupal, but we are cross-posting it to Drupal Planet to demonstrate the value of getting off the island. Enjoy!

Writing custom authentication schemes in Symfony used to be on the complicated side. But with the introduction of the Guard authentication component, it has gotten a lot easier.

One of our recent projects required us to interface with Shibboleth to authenticate users into the application. The application was written in Symfony 2 and was using this bundle to authenticate with Shibboleth sessions. However, since we were rewriting everything in Symfony 3, which the bundle is not compatible with, we had to look for a different solution. Fortunately for us, the built-in Guard authentication component turned out to be sufficient: it allowed us to drop a bundle dependency and required us to write only one class. Really neat!

How Shibboleth authentication works

One way Shibboleth provisions a request with an authenticated entity is by setting a "remote user" environment variable that the web-server and/or residing applications can peruse.

There is obviously more to Shibboleth than that; it has to do a bunch of stuff to carry out the actual authentication process. We defer all the heavy lifting to the mod_shib Apache2 module, and rely on the availability of the REMOTE_USER environment variable to identify the user.

That is pretty much all we really need to know; now we can start writing our custom Shibboleth authentication guard:



namespace AppBundle\Security\Http;

use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\HttpFoundation\RedirectResponse;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Generator\UrlGeneratorInterface;
use Symfony\Component\Security\Core\Authentication\Token\TokenInterface;
use Symfony\Component\Security\Core\Exception\AuthenticationException;
use Symfony\Component\Security\Core\User\UserInterface;
use Symfony\Component\Security\Core\User\UserProviderInterface;
use Symfony\Component\Security\Guard\AbstractGuardAuthenticator;
use Symfony\Component\Security\Http\Logout\LogoutSuccessHandlerInterface;

class ShibbolethAuthenticator extends AbstractGuardAuthenticator implements LogoutSuccessHandlerInterface
{
    
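    /** @var string Base URL of the Shibboleth iDP, used to build the logout return URL. */
    private $idpUrl;

    /** @var string Name of the server variable that carries the remote user. */
    private $remoteUserVar;

    /** @var UrlGeneratorInterface Generates the shib_login and shib_logout URLs. */
    private $urlGenerator;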

    public function __construct(UrlGeneratorInterface $urlGenerator, $idpUrl, $remoteUserVar = null)
    {
        $this->idpUrl = $idpUrl;
        $this->remoteUserVar = $remoteUserVar ?: 'HTTP_EPPN';
        $this->urlGenerator = $urlGenerator;
    }

    protected function getRedirectUrl()
    {
        // UrlGeneratorInterface exposes generate(); there is no generateUrl() method.
        return $this->urlGenerator->generate('shib_login');
    }

    
    public function start(Request $request, AuthenticationException $authException = null)
    {
        $redirectTo = $this->getRedirectUrl();
        if (in_array('application/json', $request->getAcceptableContentTypes())) {
            return new JsonResponse(array(
                'status' => 'error',
                'message' => 'You are not authenticated.',
                'redirect' => $redirectTo,
            ), Response::HTTP_FORBIDDEN);
        } else {
            return new RedirectResponse($redirectTo);
        }
    }

    
    public function getCredentials(Request $request)
    {
        if (!$request->server->has($this->remoteUserVar)) {
            return;
        }

        $id = $request->server->get($this->remoteUserVar);

        if ($id) {
            return array('eppn' => $id);
        } else {
            return null;
        }
    }

    
    public function getUser($credentials, UserProviderInterface $userProvider)
    {
        return $userProvider->loadUserByUsername($credentials['eppn']);
    }

    
    public function checkCredentials($credentials, UserInterface $user)
    {
        return true;
    }

    
    public function onAuthenticationFailure(Request $request, AuthenticationException $exception)
    {
        $redirectTo = $this->getRedirectUrl();
        if (in_array('application/json', $request->getAcceptableContentTypes())) {
            return new JsonResponse(array(
                'status' => 'error',
                'message' => 'Authentication failed.',
                'redirect' => $redirectTo,
            ), Response::HTTP_FORBIDDEN);
        } else {
            return new RedirectResponse($redirectTo);
        }
    }

    
    public function onAuthenticationSuccess(Request $request, TokenInterface $token, $providerKey)
    {
        return null;
    }

    
    public function supportsRememberMe()
    {
        return false;
    }

    
    public function onLogoutSuccess(Request $request)
    {
        $redirectTo = $this->urlGenerator->generate('shib_logout', array(
            'return'  => $this->idpUrl . '/profile/Logout'
        ));
        return new RedirectResponse($redirectTo);
    }
}

Let's break it down:

  1. class ShibbolethAuthenticator extends AbstractGuardAuthenticator ... - We'll extend the built-in abstract to take care of the non-Shibboleth specific plumbing required.
  2. __construct(...) - As you would guess, we are passing in all the things we need for the authentication guard to work; we are getting the Shibboleth iDP URL, the remote user variable to check, and the URL generator service which we need later.
  3. getRedirectUrl() - This is just a convenience method which returns the Shibboleth login URL.
  4. start(...) - This is where everything begins; this method is responsible for producing a response that will help the Security component drive the user to authenticate. Here, we either 1.) redirect the user to the Shibboleth login page; or 2.) if the client is expecting application/json content back, produce a JSON response that tells consumers that the request is forbidden. In the latter case, the payload conveniently informs consumers where to go to start authenticating via the redirect property. Our front-end application knows how to handle this.
  5. getCredentials(...) - This method is responsible for extracting authentication credentials from the HTTP request, e.g. a username and password, a JWT token in the Authorization header, etc. Here, we are interested in the remote user environment variable that mod_shib might have set for us. It is important that we check that the environment variable is actually not empty, because mod_shib will still set it but leave it empty for unauthenticated sessions.
  6. getUser(...) - Here we get the credentials that getCredentials(...) returned and construct a user object from it. The user provider will also be passed into this method; whatever it is that is configured for the firewall.
  7. checkCredentials(...) - Following the getUser(...) call, the security component will call this method to actually verify whether or not the authentication attempt is valid. For example, in form logins, this is where you would typically check the supplied password against the encrypted credentials in the data store. However, we only need to return true unconditionally, since we are trusting Shibboleth to filter out invalid credentials and only let valid sessions through to the application. In short, we are already expecting a pre-authenticated request.
  8. onAuthenticationFailure(...) - This method is called whenever our authenticator reports invalid credentials. This shouldn't really happen in the context of a pre-authenticated request as we 100% entrust the process to Shibboleth, but we'll fill this in with something reasonable anyway. Here we are simply replicating what start(...) does.
  9. onAuthenticationSuccess(...) - This method gets called when the credentials check out, which is all the time. We really don't have to do anything but let the request go through. Theoretically, this would be where we could bootstrap the token with certain roles depending on other Shibboleth headers present in the Request object, but we really don't need to do that in our application.
  10. supportsRememberMe(...) - We don't care about supporting "remember me" functionality, so no, thank you!
  11. onLogoutSuccess(...) - This is technically not part of the Guard authentication component, but of the logout handling. You can see that our ShibbolethAuthenticator class also implements LogoutSuccessHandlerInterface, which allows us to register it as a listener to the logout process. This method is responsible for clearing out Shibboleth authentication data after Symfony has cleared the user token from the system. To do this we just need to redirect the user to the proper Shibboleth logout URL, and seed the return parameter with the nice logout page in the Shibboleth iDP instance.
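
The emptiness check described for getCredentials(...) can be sketched in isolation. This is a minimal illustration rather than the project's code: the function name extractShibbolethCredentials is hypothetical, a plain array stands in for the request's server bag, and the trim() call is an extra assumption that the class above does not make.

```php
<?php

/**
 * Returns the credentials array for a pre-authenticated request, or null
 * when no user has been identified.
 *
 * $server stands in for the request's server variables; $remoteUserVar is
 * the configured variable name (HTTP_EPPN is the default in the class above).
 */
function extractShibbolethCredentials(array $server, $remoteUserVar = 'HTTP_EPPN')
{
    // The variable may be missing entirely...
    if (!array_key_exists($remoteUserVar, $server)) {
        return null;
    }

    // ...or present but empty, which is what an unauthenticated session looks like.
    // Trimming whitespace is an assumption added for this sketch.
    $id = trim((string) $server[$remoteUserVar]);
    if ($id === '') {
        return null;
    }

    return array('eppn' => $id);
}

// Hypothetical values, for illustration only.
$missing = extractShibbolethCredentials(array());                                  // null
$empty   = extractShibbolethCredentials(array('HTTP_EPPN' => ''));                 // null
$present = extractShibbolethCredentials(array('HTTP_EPPN' => 'jdoe@example.edu')); // array('eppn' => 'jdoe@example.edu')
```

Returning null in the first two cases is what tells the Guard component to skip authentication for that request, so both "not set" and "set but empty" end up being treated the same way.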

Configuring the router: shib_login and shib_logout routes

We'll update app/config/routing.yml:



shib_login:
  path: /Shibboleth.sso/Login

shib_logout:
  path: /Shibboleth.sso/Logout

You may be asking yourself why we even bother creating named routes for these when we could just as easily hard-code the values in our guard authenticator.

Great question! The answer is that we want to be able to point these at an internal login form for local development, where authenticating with Shibboleth has little value and may not even be possible. This allows us to override the shib_login path to /login within routing_dev.yml so that the application will redirect us to the proper login URL in our dev environment.

We really can't point shib_logout to /logout, though, as it will result in an infinite redirection loop. What we do is override it in routing_dev.yml to go to a very simple controller-action that replicates Shibboleth's logout URL external behavior:



...

  public function mockShibbolethLogoutAction(Request $request)
  {
      $return = $request->get('return');

      if (!$return) {
          return new Response("`return` query parameter is required.", Response::HTTP_BAD_REQUEST);
      }

      return $this->redirect($return);
  }
}

Configuring the firewall

This is the last piece of the puzzle; putting all these things together.







services:
  app.shibboleth_authenticator:
    class: AppBundle\Security\Http\ShibbolethAuthenticator
    arguments:
      - '@router'
      - '%shibboleth_idp_url%'
      - '%shibboleth_remote_user_var%'

---






imports:
  - { resource: config.yml }
  - { resource: security.yml }

---

# app/config/config_dev.yml
imports:
  - { resource: config.yml }
  - { resource: security_dev.yml }

---






# app/config/security.yml
security:
  firewalls:
    main:
      stateless: true
      guard:
        authenticators:
          - app.shibboleth_authenticator

      logout:
        path: /logout
        success_handler: app.shibboleth_authenticator

---





# app/config/security_dev.yml
security:
  firewalls:
    main:
      stateless: false
      form_login:
        login_path: shib_login
        check_path: shib_login
        target_path_parameter: return

The star here is actually just what's in the security.yml file, specifically the guard section; that's how simple it is to support custom authentication via the Guard authentication component! It's just a matter of pointing it to the service and it will hook it up for us.

The logout configuration tells the application to allocate the /logout path to initiate the logout process which will eventually call our service to clean up after ourselves.

You'll also notice that we have a security_dev.yml file here that config_dev.yml imports. This isn't how the Symfony 3 framework ships, but it allows us to override the firewall configuration specifically for dev environments. Here, we add the form_login authentication scheme to support logging in via an in-memory user provider (not shown). The authentication guard will redirect us to the in-app login form instead of the Shibboleth iDP during development.

Also note the stateless configuration difference between prod and dev: We want to keep the firewall in production environments stateless; this just means that our guard authenticator will get consulted in all requests. This ensures that users will actually be logged out from the application whenever they are logged out of the Shibboleth iDP i.e. when they quit the web browser, etc. However we need to configure the firewall to be stateful during development, otherwise the form_login authentication will not work as expected.

Conclusion

I hope I was able to illustrate how versatile the Guard authentication component in Symfony is. What used to require multiple classes to be written and wired together now requires only a single class, and it's trivial to configure. The Symfony community has really done a great job at improving the Developer Experience (DX).

Identifying pre-authenticated requests via environment variables isn't unique to mod_shib; other authentication modules such as mod_auth_kerb, mod_auth_gssapi, and mod_auth_cas do the same. The scheme is so well adopted that Symfony has shipped with a remote_user authentication listener since 2.6, which makes it very easy to integrate with them. Check it out if your needs are simpler, i.e. you need no custom authentication-starter/redirect logic.

May 01 2015

In the previous blog post we shared how we implemented the first part of our problem in Drupal and how we decided that splitting our project into discrete parts was a good idea. I'll pick up where we left off and discuss why and how we used Symfony to build our web service instead of implementing it in Drupal.

RESTful services with Symfony

Symfony, being a web application framework, did not provide built-in user-facing features that we could use immediately, but it gave us development tools and a framework that expedited the implementation of our web service's functionality. Even though much of the work was cut out for us, the framework took care of the most common problems and the usual plumbing that goes with building web applications. This enabled us to tackle the web service problem with a more focused approach.

Other than my familiarity and proficiency with the framework, the reason we chose Symfony over other web application frameworks is that there is already a well-established ecosystem of Symfony bundles (akin to modules in Drupal) centered around building RESTful web services. FOSRestBundle provided us with a little framework for defining and implementing our RESTful endpoints, and does all the content-type negotiation and other REST-related plumbing for us. JMSSerializerBundle took care of the complexities of representing our objects as JSON for our clients to consume. We also wrote our own little bundle with which we use Swagger UI to provide beautiful documentation for our API. Any change to our code base that affects the API automatically updates the documentation, thanks to NelmioApiDocBundle, to which we contributed support for generating Swagger-compliant API specifications.

We managed to encapsulate all the complexities behind our search engine within our API: not only do we index content sent over by Drupal, but we also index thousands of records that we pull from a partner on a daily basis. On top of that, the API also appends search results from another search API provided by another partner should we run out of data to provide. Our consumers don't know this and neither should Drupal -- we let it worry about content management and sending us the data, that's it.

In fact, Drupal never talks to Elasticsearch directly. It only talks to our API, authenticating itself whenever it needs to write or delete anything. This also means we can deploy the API on another server without Drupal breaking because it can no longer talk to a firewalled search index. This way we keep everything discrete and secure.

In the end, we have a REST API with three endpoints:

  1. a secured endpoint, used by Drupal, which receives content that is then validated and indexed,
  2. a secured endpoint, also used by Drupal, which deletes content from the index, and finally
  3. a public endpoint for searching for content that matches the specifications provided via GET parameters, which will be used by Drupal and by other consumers.

Symfony and Drupal 8

Symfony is not just a web application framework, but is also a collection of stand-alone libraries that can be used by themselves. In fact, the next major iteration of Drupal will use Symfony components to modernize its implementations of routing & URL dispatch, request and response handling, templating, data persistence and other internals like organizing the Drupal API. This change will definitely enhance the experience of developing Drupal extensions as well as bring new paradigms to Drupal development, especially with the introduction of services and dependency-injection.

Gluing it together with AngularJS

Given that we have a functioning web service, we then used AngularJS to implement the rich search tools on our Drupal 7 site.

AngularJS is a front-end web framework which we use to create rich web applications straight in the browser with no hard dependencies on specific back-ends. This actually helped us prototype our search tools and the search functionality faster outside of Drupal 7. We made sure that everything we wrote in AngularJS is as self-contained as possible, so that we can just drop it into Drupal 7 and have it running with almost zero extra work. It was just a matter of putting our custom AngularJS directives and/or mini-applications into a panel, which in turn we put into Drupal pages. We have done this AngularJS-as-Drupal-panels approach before in other projects and it has been really effective and fun to do.

To complete the integration, it was just a matter of hooking into Drupal's internal mechanisms to pass authored content along to our indexing API when it is approved, or to delete it when it is unpublished.

Headless Drupal comes to Drupal 8!

The popularity of front-end web frameworks has increased the demand for data to be available via APIs, as templating and other display-oriented tasks have rightfully moved into the domain of client-side languages and out of back-end systems. It's exciting to see that Drupal has taken the initiative and made the content it manages available through APIs out of the box. This means it will be easier to build single-page applications on top of content managed in Drupal. It is something that we at ActiveLAMP are actually itching to try.

Also, now that Google has added support for crawling JavaScript-driven sites for SEO, I think single-page applications will soon rise from being just "experimental" and become a real choice for content-driven websites.

Using Composer dependencies in Drupal modules

We used the Guzzle HTTP client library in one of our modules to communicate with the API in the background. We pulled the library into our Drupal installation by defining it as a project dependency via the Composer Manager module. It was as simple as putting a bare-minimum composer.json file in the root directory of one of our modules:

{
  "require" : {
    "guzzlehttp/guzzle" : "4.*"
  }
}

...and running these Drush commands during build:

$ drush composer-rebuild
$ drush composer-manager install

The first command collects all defined dependency information in all composer.json files found in modules, and the second command finally downloads them into Drupal's library directory.

Composer is awesome. Learn more about it here.

How a design decision saved us from a potentially costly mistake

One hard lesson we learned is that it's not ideal to use Elasticsearch as the primary and sole data persistence layer in our API.

During the early stages of developing our web service, we treated Elasticsearch as the sole data store by removing Doctrine2 from our Symfony application and doing away with MySQL completely from our REST API stack. However, we still employed the Repository Pattern and wrote classes to store and retrieve from Elasticsearch using the elasticsearch-php library. These classes also hide away the details of how objects are transformed into their JSON representation, and vice versa. We used the jms-serializer library for the data transformations; it's an excellent package that takes care of the complexities behind data serialization from PHP objects to JSON or XML. (We use the same library for delivering objects through our search API, which could be a topic for a future blog post.)

This setup worked just fine, until we had to explicitly define date-time fields in our documents. Since we used UNIX timestamps for our date-time fields in the beginning, Elasticsearch mistakenly inferred them to be float fields. The explicit schema conflicted with the inferred schema and we were forced to flush out all existing documents before the update could be applied. This prompted us to use a real data store which we treat as the Single Version of Truth, and to relegate Elasticsearch to being just an index, lest we lose real data in the future, which would be a disaster.
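
One way to avoid this kind of type-inference surprise is to declare the field as a date up front and to send dates as ISO 8601 strings rather than raw timestamps. The sketch below is only an illustration under stated assumptions: the published_at field and the toIndexDate() helper are hypothetical, and the article does not show the project's real schema.

```php
<?php

// Hypothetical explicit mapping: declaring the field as a date means
// Elasticsearch does not have to guess its type from the first document.
$mapping = array(
    'properties' => array(
        'published_at' => array(
            'type' => 'date',
        ),
    ),
);

/**
 * Converts a UNIX timestamp into an ISO 8601 string in UTC,
 * e.g. 0 becomes "1970-01-01T00:00:00Z".
 */
function toIndexDate($timestamp)
{
    return gmdate('Y-m-d\TH:i:s\Z', (int) $timestamp);
}

// Hypothetical document as it might be sent to the index.
$document = array(
    'title'        => 'Example',
    'published_at' => toIndexDate(1430438400), // "2015-05-01T00:00:00Z"
);
```

Sending a string that can only be read as a date removes the ambiguity that a bare number carries.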

Making this change was easy and almost painless, though, thanks to the level of abstraction that the Repository Pattern provides. We just implemented new repository classes with the help of Doctrine which talk to MySQL, and dropped them in wherever we had used their Elasticsearch counterparts. We then hooked into Doctrine's event system to get our data automatically indexed as it is written to and removed from the database:



use ActiveLAMP\AppBundle\Entity\Indexable;
use ActiveLAMP\AppBundle\Model\ElasticsearchRepository;
use Doctrine\Common\EventSubscriber;
use Doctrine\ORM\Event\LifecycleEventArgs;
use Doctrine\ORM\Events;
use Elasticsearch\Common\Exceptions\Missing404Exception;

class IndexEntities implements EventSubscriber
{

    protected $elastic;

    public function __construct(ElasticsearchRepository $repository)
    {
        $this->elastic = $repository;
    }

    public function getSubscribedEvents()
    {
        return array(
            Events::postPersist,
            Events::postUpdate,
            Events::preRemove,
        );
    }

    public function postPersist(LifecycleEventArgs $args)
    {
        $entity = $args->getEntity();

        if (!$entity instanceof Indexable) {
            return;
        }

        $this->elastic->save($entity);
    }

    public function postUpdate(LifecycleEventArgs $args)
    {
        $entity = $args->getEntity();

        if (!$entity instanceof Indexable) {
            return;
        }

        $this->elastic->save($entity);
    }

    public function preRemove(LifecycleEventArgs $args)
    {
        $entity = $args->getEntity();

        if (!$entity instanceof Indexable) {
            return;
        }

        try {
            $this->elastic->delete($entity);
        } catch (Missing404Exception $e) {
            // The document is already absent from the index, so there is nothing to remove.
        }
    }
}

Thanks, Repository Pattern!
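
To make the point about abstraction concrete, here is a minimal sketch of the Repository Pattern. All names in it (ContentRepository, ArrayRepository, ContentService) are hypothetical and are not the project's classes; an in-memory array stands in for MySQL or Elasticsearch so the example runs on its own.

```php
<?php

// Hypothetical names throughout; these are not the project's classes.
interface ContentRepository
{
    public function save($id, array $document);
    public function find($id);
    public function delete($id);
}

// Stand-in for a concrete store. In the article the real implementations
// talk to MySQL (through Doctrine) or to Elasticsearch.
class ArrayRepository implements ContentRepository
{
    private $documents = array();

    public function save($id, array $document)
    {
        $this->documents[$id] = $document;
    }

    public function find($id)
    {
        return isset($this->documents[$id]) ? $this->documents[$id] : null;
    }

    public function delete($id)
    {
        unset($this->documents[$id]);
    }
}

// Calling code depends only on the interface, so the backing store can be
// swapped without touching this class.
class ContentService
{
    private $repository;

    public function __construct(ContentRepository $repository)
    {
        $this->repository = $repository;
    }

    public function publish($id, $title)
    {
        $this->repository->save($id, array('title' => $title));

        return $this->repository->find($id);
    }
}

$service  = new ContentService(new ArrayRepository());
$document = $service->publish(42, 'Hello');
```

Because ContentService only knows about the interface, replacing ArrayRepository with another implementation is a one-line change at the point where the service is constructed.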

Overall, I really enjoyed building out the app using the tools we decided to use and I personally like how we put the many parts together. We observed some tenets behind service-oriented architectures by splitting the project into multiple discrete problems and solving each with different technologies. We handed Drupal the problems that Drupal knows best, and used more suitable solutions for the rest.

Another benefit we reaped is that developers within ActiveLAMP can focus on their own domain of expertise: our Drupal guys take care of Drupal work that non-Drupal guys like me aren't the best fit for, while I can knock out Symfony work which is right up my alley. I think we at ActiveLAMP have seen the value of solving big problems through divide-and-conquer and of being diversified in the technologies we use.

Feb 27 2015

There are thousands of situations in which you do not want to reinvent the wheel. It is a well-known principle in software engineering, but one that is not always well known or well applied in the Drupal world.

Let’s say, for example, that you have a URL that you want to convert from relative to absolute. It is a typical scenario when you are working with web (but not just web) crawlers. Well, you could start building your own library to achieve the functionality you are looking for, packaging it all as a Drupal module. It is an interesting challenge indeed but, unless it is for training or learning purposes, why waste your time when someone else has already done it, instead of focusing on the real problem? Especially if your app's main purpose is not that secondary problem (the URL converter).

What’s more, if you reuse libraries and open source code, you’ll probably find yourself needing a small improvement in that nice library you are using. By contributing your changes back you close the open source circle, which is the reason open source is here to stay and conquer the world (diabolical laugh here).

That’s another of the main reasons why lots of projects are moving to the Composer/Symfony pairing: they stop working as isolated projects and start working as global projects that can share code and knowledge with many other projects. It’s a pattern followed by Drupal, to name but one, and also by projects like phpBB, eZ Publish, Laravel, Magento, Piwik, …

Composer and friends

Coming back to our crawler and the de-relativizer library that we are going to need, this is the point where we get to know Composer. Composer is a great tool for using third-party libraries and, of course, for contributing back those of your own. In our web crawler example, net_url2 does the job just beautifully.

Nice, but at this point you must be wondering… What does this have to do with Drupal, if anything at all? Well, in fact, as everyone knows, Drupal 8 is being (re)built following this same principle (DRY, or don’t repeat yourself) with a strong presence of the great Symfony 2 components in core. Advantages? Lots of them, as we were pointing out, but that’s a topic for another discussion.

The point here is that you don’t need to wait for Drupal 8. What’s more, you can start applying some of these principles in your Drupal 7 libraries, making your future transition to Drupal 8 even easier.

Let’s rock and roll

So, using a php library or a Symfony component in Drupal 7 is quite simple. Just:

  1. Install Composer Manager
  2. Create a composer.json file in your custom module folder
  3. Place the content (which, by the way, you’ll find quite familiar if you’ve already worked with Symfony / Composer files):
    {
      "require": {
        "pear/net_url2": "2.0.x-dev"
      }
    }
    
  4. Enable the custom module

And that’s basically it. At this point we simply need to tell Drupal to generate the main composer.json. That is a Composer file built from the composer.json found in each of the modules that include one.

Let’s generate that file:

drush composer-rebuild

At this point we have the main composer file, normally in a vendor folder (it will depend on the Composer Manager settings).

Now, let’s make some Composer magic:

drush composer update

At this point, inside the vendor folder we should have a classmap containing, amongst others, our newly included library.

Hopefully all has gone well and, just like magic, the Net_URL2 class is there to be used in our modules. Something like:

$base = new Net_URL2($absoluteURL);

Just remember to import the class in the file where you use it. Something like:

use Net_URL2;

In the next post we’ll be doing some more exciting stuff. We will create some code that will live in a php library, completely decoupled but at the same time fully integrated with Drupal. All using Composer magic to allow the integration.

Why? Again, many reasons like:

  1. Being ready for Drupal 8 (just lift libraries from D7 or D6 to D8),
  2. Decoupling things so that the code we write is ready to use outside Drupal too,
  3. Opening the door for other worlds to collaborate with our Drupal world,
  4. Why not use Dependency Injection in Drupal (as already happens in D8)? What about using the Symfony service container? Or something lighter, like Pimple?
  5. Choose among many other reasons…
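
To give a feel for the dependency injection idea mentioned in the list, here is a deliberately tiny closure-based container. It is written in the spirit of Pimple but is not Pimple's API, and every class name in it (TinyContainer, UrlResolver, Crawler) is hypothetical.

```php
<?php

// A minimal service container: services are defined as closures and built
// lazily, once, the first time they are requested.
class TinyContainer
{
    private $factories = array();
    private $instances = array();

    public function set($id, $factory)
    {
        $this->factories[$id] = $factory;
        unset($this->instances[$id]);
    }

    public function get($id)
    {
        if (!isset($this->factories[$id])) {
            throw new InvalidArgumentException(sprintf('Unknown service "%s".', $id));
        }

        if (!array_key_exists($id, $this->instances)) {
            // The container passes itself in so a factory can fetch its dependencies.
            $this->instances[$id] = call_user_func($this->factories[$id], $this);
        }

        return $this->instances[$id];
    }
}

// Hypothetical services for the crawler example.
class UrlResolver
{
    private $base;

    public function __construct($base)
    {
        $this->base = $base;
    }

    public function getBase()
    {
        return $this->base;
    }
}

class Crawler
{
    private $resolver;

    // The dependency is injected instead of being created inside the class.
    public function __construct(UrlResolver $resolver)
    {
        $this->resolver = $resolver;
    }

    public function getResolver()
    {
        return $this->resolver;
    }
}

$container = new TinyContainer();
$container->set('url_resolver', function () {
    return new UrlResolver('http://example.com/');
});
$container->set('crawler', function ($c) {
    return new Crawler($c->get('url_resolver'));
});

$crawler = $container->get('crawler');
```

The crawler never decides which resolver it gets; that wiring lives in one place, which is what makes the pieces easy to reuse outside Drupal.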

See you in my next article about Drupal, Composer and friends. In the meantime, be good :-).

Updated: Clarified that we are talking about PHP Libraries and / or Symfony components instead of bundles. Thanks to @drrotmos and @Ross for your comments.

Jan 22 2015

It isn't just about Drupal here at ActiveLAMP -- when the right project comes along that diverges from the usual demands of content management, we get to use other cool technologies to satisfy more exotic requirements. Last year we had a project that presented us with the opportunity to broaden our arsenal beyond the Drupal toolbox. Basically, we had to build a website which handles a growing amount of vetted content coming in from the site's community and two external sources. The whole catalog is available through a rich search tool and also through a RESTful web service which our client's other partners can use to search for content to display on their respective websites.

Drupal 7 -- more than just a CMS

We love Drupal and we recognize its power in managing content of varying types and complexity. We at ActiveLAMP have solved a lot of problems with it in the past, and have seen how potent it can be. We were able to map out many of the project's requirements to Drupal functionality and we grew confident that it is the right tool for the project.

We implemented the majority of the site's content-management, user-management, and access-control functionality with Drupal, from content creation and revision to display and printing. We relied heavily on built-in functionality to tie things together. Did I mention that the site's content base and theme components are bilingual? Yeah, the wide array of i18n modules took care of that.

One huge reason we love Drupal is its thriving community, which strives to make it better and more powerful every day. We leveraged open-source modules that the community has produced over the years to satisfy project requirements that Drupal does not meet out of the box.

For starters, we based our project on the Panopoly distribution of Drupal, which bundles a wide selection of modules that gave us great flexibility in structuring our pages and saved us precious time in site-building and theming. We leveraged a lot of modules to solve more specialized problems. For example, we used the Workbench suite of modules to implement the review-publish-reject workflow that was essential to maintaining the site's integrity and quality. We also used the ZURB Foundation starter theme as the foundation for our site pages.

What vanilla Drupal and the community modules cannot provide us we wrote ourselves, thanks to Drupal's uber-powerful "plug-and-play" architecture which easily allowed us to write custom modules to tell Drupal exactly what we need it to do. The amount of work that can be accomplished by the architecture's hook system is phenomenal, and it elevates Drupal from being just a content management system to a content management framework. Whatever your problem, there most probably is a Drupal module for it.

Flexible indexing and searching with Elasticsearch

A large aspect of our project is that the content we handle must be searchable through a tool available on the site. The search criteria demand not only full-text search, but also filtering by date range and categorization ("taxonomies" in Drupal) and, most importantly, geo-location queries and sorting by distance (e.g., within n miles of a given location). It was readily apparent that SQL LIKE expressions or full-text search queries with the MyISAM engine for MySQL just wouldn't cut it.

We needed a full-fledged full-text search engine that also supports geo-spatial operations. And surprise! -- there is a Drupal module for that (a confession: not really a surprise). The Apache Solr Search modules readily provide the ability to index all our content straight from Drupal into Apache Solr, an open-source search platform built on top of the famous Apache Lucene engine.

Despite the comfort that the module provided, I evaluated other options which eventually led us to Elasticsearch, which we ended up using over Solr.

Elasticsearch advertises itself as:

“a powerful open source search and analytics engine that makes data easy to explore”

...and we really found this to be true. Since it is basically a wrapper around Lucene that exposes its features through a RESTful API, it is readily available to any app, no matter which language the app is written in. Given the wide proliferation and usage of REST APIs in web development, it puts a familiar face on a not-so-common technology. As long as you speak HTTP, the lingua franca of the Web, you are in business.

Writing/indexing documents into Elasticsearch is straightforward: represent your content as a JSON object and POST it to the appropriate endpoint. If you wish to retrieve it on its own, simply issue a GET request with the unique ID that Elasticsearch assigned and returned during indexing. Updating it is just a PUT request away. It's all RESTful and nice.
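As a rough illustration of the three operations just described, here is a minimal Python sketch that only builds the HTTP requests without sending them. The host, index, and type names are borrowed from the search example in this post; the helper function names and the sample document are assumptions made up for illustration, and the URL layout assumes the index/type/ID scheme used by the Elasticsearch version this post is based on.

```python
import json
from urllib.request import Request

# Host, index, and type are taken from the search example in this post.
BASE_URL = "http://localhost:9200"
INDEX, DOC_TYPE = "volsearch", "toolkit_opportunity"


def _json_request(url, method, document=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = None if document is None else json.dumps(document).encode("utf-8")
    headers = {} if document is None else {"Content-Type": "application/json"}
    return Request(url, data=data, headers=headers, method=method)


def index_request(document):
    """POST a new document; Elasticsearch assigns an ID and returns it."""
    return _json_request(f"{BASE_URL}/{INDEX}/{DOC_TYPE}", "POST", document)


def get_request(doc_id):
    """GET a single document by the ID handed back at indexing time."""
    return _json_request(f"{BASE_URL}/{INDEX}/{DOC_TYPE}/{doc_id}", "GET")


def update_request(doc_id, document):
    """PUT a full replacement of the document stored under doc_id."""
    return _json_request(f"{BASE_URL}/{INDEX}/{DOC_TYPE}/{doc_id}", "PUT", document)
```

Passing any of these objects to `urllib.request.urlopen` would perform the actual call against a running Elasticsearch node.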

Searches are done through API calls, too. Here is an example of a query that contains a Lucene-like text search (grouping conditions with parentheses, ANDs, and ORs), a negation filter, and a basic geo-location filter, with results sorted by distance from a given location:

POST /volsearch/toolkit_opportunity/_search HTTP/1.1
Host: localhost:9200

{
  "from":0,
  "size":10,
  "query":{
    "filtered":{
      "filter":{
        "bool":{
          "must":[
            {
              "geo_distance":{
                "distance":"100mi",
                "location.coordinates":{
                  "lat":34.493311,
                  "lon":-117.30288
                }
              }
            }
          ],
          "must_not":[
            {
              "term":{
                "partner":"Mentor Up"
              }
            }
          ]
        }
      },
      "query":{
        "query_string":{
          "fields":[
            "title",
            "body"
          ],
          "query":"hunger AND (financial OR finance)",
          "use_dis_max":true
        }
      }
    }
  },
  "sort":[
    {
      "_geo_distance":{
        "location.coordinates":{
          "lat":34.493311,
          "lon":-117.30288
        },
        "order":"asc",
        "unit":"mi",
        "distance_type":"plane"
      }
    }
  ]
}

Queries are written in Elasticsearch's own DSL (domain-specific language), which takes the form of JSON objects. Because a query is represented as a tree of search specifications built from dictionaries (or “associative arrays” in PHP parlance), it is a lot easier to understand, traverse, and manipulate as needed, without the third-party query builders that Lucene's string-based query syntax tends to require. It is this syntactic sugar that helped convince us to use Elasticsearch.
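To show what "traverse and manipulate" can look like in practice, here is a small Python sketch that adds one more exclusion to a search body shaped like the example above. The function name `exclude_partner` and the partner value "Example Partner" are hypothetical; only the nesting of `filtered`, `bool`, and `must_not` comes from the query in this post.

```python
import copy


def exclude_partner(search_body, partner):
    """Return a copy of the search body with one more must_not term filter."""
    updated = copy.deepcopy(search_body)  # leave the caller's query untouched
    bool_filter = (
        updated["query"]["filtered"]
        .setdefault("filter", {})
        .setdefault("bool", {})
    )
    bool_filter.setdefault("must_not", []).append({"term": {"partner": partner}})
    return updated


# A trimmed-down version of the query shown earlier.
base = {
    "query": {
        "filtered": {
            "filter": {"bool": {"must_not": [{"term": {"partner": "Mentor Up"}}]}},
            "query": {"query_string": {"fields": ["title", "body"], "query": "hunger"}},
        }
    }
}

# "Example Partner" is a made-up value for illustration.
narrowed = exclude_partner(base, "Example Partner")
```

Since the query is plain data, the same approach works for swapping the sort order, changing the distance, or stripping a filter before the body is sent.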

What makes Elasticsearch flexible is that it is, to some degree, schema-less. That made it quick for us to get started and get things done. We just hand it documents with no pre-defined schema and it does its job of guessing the field types, inferring them from the data we provide. We can introduce new text fields and filter against them on the go. If you decide to start using richer queries like geo-spatial and date-range queries, then you should explicitly declare fields as having richer types like dates, date-ranges, and geo-points to tell Elasticsearch how to index the data accordingly.
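An explicit declaration of that kind might look roughly like the following Python sketch, which builds the field definitions as a dictionary and serializes them to JSON. The `location.coordinates` field comes from the query example in this post; the `start_date` field is a hypothetical name, and the exact endpoint and envelope for submitting a mapping differ between Elasticsearch versions, so treat this as the shape of the idea rather than a ready-made request.

```python
import json

# Field definitions only. "start_date" is a hypothetical field name;
# "location.coordinates" matches the field used in the search example.
properties = {
    "title": {"type": "string"},        # would also be inferred automatically
    "start_date": {"type": "date"},     # enables date-range filtering
    "location": {
        "properties": {
            "coordinates": {"type": "geo_point"}  # enables distance filters and sorting
        }
    },
}

# JSON fragment that would sit inside the mapping submitted for the document type.
mapping_json = json.dumps({"properties": properties}, indent=2)
```

The geo-point declaration generally needs to be in place before the first document is indexed, because an object holding `lat` and `lon` numbers would otherwise be inferred as two ordinary numeric fields.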

To be clear, Apache Solr also exposes Lucene through a web service. However, we think Elasticsearch's API design is more modern and much easier to use. Elasticsearch also provides a suite of features that make it easier to scale. Visualizing the data is also really nifty with the use of Kibana.

The Search API

Because of the lack of built-in access control in Elasticsearch, we cannot just expose it to third parties who wish to consume our data. Anyone who can reach the Elasticsearch server will also be able to write to it and delete content from it. We needed a layer that shields our search index from the public. Not only that, it also has to enforce our own simplified query DSL that the API consumers will use.
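The post does not spell out what the simplified query DSL looks like, so the following Python sketch is purely hypothetical: it assumes a flat set of parameters (`q`, `lat`, `lon`, `within_miles`, `size`) and shows how a gatekeeping layer could reject anything else, cap the page size, and translate the rest into the Elasticsearch query format used earlier in this post.

```python
# Hypothetical whitelist and limits; none of these names come from the real API.
ALLOWED_KEYS = {"q", "lat", "lon", "within_miles", "size"}
MAX_SIZE = 50


def translate(params):
    """Turn a simplified, whitelisted query into an Elasticsearch search body."""
    unknown = set(params) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    if "q" in params:
        text_query = {"query_string": {"fields": ["title", "body"], "query": params["q"]}}
    else:
        text_query = {"match_all": {}}

    body = {
        "from": 0,
        # Clamp the page size so consumers cannot request unbounded result sets.
        "size": max(1, min(int(params.get("size", 10)), MAX_SIZE)),
        "query": {"filtered": {"query": text_query}},
    }

    if "within_miles" in params:
        if "lat" not in params or "lon" not in params:
            raise ValueError("within_miles requires both lat and lon")
        point = {"lat": float(params["lat"]), "lon": float(params["lon"])}
        distance = f"{float(params['within_miles']):g}mi"
        body["query"]["filtered"]["filter"] = {
            "geo_distance": {"distance": distance, "location.coordinates": point}
        }
        body["sort"] = [
            {"_geo_distance": {"location.coordinates": point, "order": "asc", "unit": "mi"}}
        ]
    return body
```

Because consumers never send raw Elasticsearch JSON, the layer decides which indices, fields, and operations are reachable, and write or delete endpoints are simply never proxied.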

This is another area where we looked beyond Drupal. Building web services isn't exactly within Drupal's purview, although it can be accomplished with the help of third-party modules. However, our major concern was the operational cost of involving Drupal in the web service solution at all: we felt that the overhead of Drupal's bootstrap process is just too much for responding to API requests. It would be akin to swatting a fruit fly with a sledgehammer. We decided to implement all search functionality and the search API itself in a separate application written with Symfony.

More details on how we introduced Symfony into the equation and how we integrated the two will be the subject of my next blog post. For now, we would just like to say that we are happy with our decision to split the project's scope into smaller, discrete sub-problems, because it allowed us to target each one of them with a more focused solution and to expand our horizons.
