Riding the Semweb: the Toneelstof case

Riding the semweb

A few weeks back, we blogged about the Semantic Web and how it will gain more importance in day-to-day life. We saw how the lack of easy-to-use tools to leverage its power is keeping it from becoming mainstream, and how Drupal fits into the story. In an effort to bring the semweb to Drupal, Krimson takes part in a Flemish government-sponsored research project called Archipel.

Archipel is a consortium of private and public stakeholders: sociocultural entities, academic institutions and private enterprises. It's a research project that runs over two years and is sponsored by IWT (Government agency for Innovation by Science and Technology). Its main goal is to create a common platform which facilitates the exchange of data in an open and transparent fashion between large repositories that contain digitized audiovisual heritage. The project relies on concepts and technologies taken from the semantic web: Linked Open Data, RDF, SPARQL, OAI-PMH harvesting...

Krimson has been engaged as a technical partner. Our role is to realize a series of project sites that interact with a common open data layer. These sites are real use cases with hard functional requirements issued by other partners that act as 'clients' within the Archipel project. This approach allows us to test Drupal modules that support semantic technologies, discover gaps and give feedback to their maintainers.

As the project nears the end of its first year and parts of the platform are slowly becoming functional, Krimson has met the core goals of its first project case.

The Toneelstof case

Our first client, VTi (Vlaams Theater Instituut / Institute for the professional performing arts in Flanders), runs a successful project, called Toneelstof, that documents the history of the performing arts in the Low Countries. Over the past years, the main deliverables of Toneelstof were sets of DVDs containing archived interviews with important players (directors, actors, producers, writers,...) and other historical documents (video clips from plays,...). VTi plays a double role. As a provider, it opens up its holdings through the Archipel platform. As a consumer, the Toneelstof site reuses data stored in the shared open layer.

Our other partners, Inuits and IBBT (Interdisciplinary institute for BroadBand Technology), created an environment allowing easy ingest of objects by VTi. Objects are harvested via the OAI-PMH protocol and stored in a central triple store. The objects consist of an archival copy and dissemination copies in different web-accessible formats. Metadata is mapped to Dublin Core. The triple store features a SPARQL endpoint through which data is made available to the outside world. Krimson is to build a website for the Toneelstof case that can connect with the SPARQL endpoint, launch a SPARQL query, retrieve video clips and their accompanying metadata and present them to the end user in a usable and accessible way.
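To make that last step a bit more concrete, here is a minimal sketch of composing a SPARQL query for Dublin Core metadata and encoding it as a GET request, with the query passed in the `query` parameter as the SPARQL protocol describes. It is written in Python for brevity rather than as Drupal code. The endpoint URL and the choice of `dc:title` and `dc:identifier` are assumptions made for the example, not the actual Archipel configuration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real Archipel endpoint is not given in this article.
ENDPOINT = "http://example.org/sparql"

# Dublin Core elements namespace.
DC = "http://purl.org/dc/elements/1.1/"


def build_query(limit=10):
    """Compose a SELECT query for objects that have a title and an identifier."""
    return (
        f"PREFIX dc: <{DC}>\n"
        "SELECT ?object ?title ?identifier WHERE {\n"
        "  ?object dc:title ?title .\n"
        "  ?object dc:identifier ?identifier .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, build_query(limit=5))
```

Sending the request and parsing the response is left out on purpose, since the result format depends on what the endpoint is asked to return.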

The website itself isn't ready for release yet, but we've made a screencast of the current state of things:

Support for Semantic Web technology is a fast-evolving domain within the Drupal ecosystem. There are no production-ready modules available off the shelf, so our options were limited. We could have built our own custom solution, but that would entail several drawbacks.

  • Developing from scratch, without community support, takes a lot of time.
  • Building our own components means less chance of reusing them in other projects...
  • ... components that might not be suitable for contribution back to the community.
  • Since there are already several modules under way, we might end up reinventing the wheel.

So, we decided to base the Toneelstof project on existing modules that offer SPARQL support. This approach would cover the first miles without having to invest extra effort. If we were to need a new feature or encounter bugs, we could dive into the code and contribute our own solutions as patches to the different module projects. We would also enjoy the benefits of community feedback as we published our patches for testing.

SPARQL Views

It quickly became apparent that SPARQL Views would become our weapon of choice. This module allows you to compose and issue queries to remote SPARQL endpoints through the Views module. Building on top of the Views API gives development a serious boost since it does all the heavy lifting: handling of filters and arguments, rendering of an entire view, rows and fields, dynamic composition and execution of a query. The goal of the SPARQL Views module is to integrate the ARC2 library functionality in Views and adapt the interface so that it can also handle SPARQL query composition.

Lin Clark, maintainer of SPARQL Views, has created several screencasts on how this module works:

Did we benefit from this approach? Yes. Rather than building everything from scratch, we were able to spend time improving the SPARQL Views module. We ended up fixing several bugs and adding two features to the module: support for Views argument handling and Views pagers.

We had to pass multiple keywords to the View as an array of arguments, so we added a primitive argument handler to the module which does just that. SPARQL allows a variety of functions in FILTER expressions, but at the moment our SPARQL Views arguments only understand the regex() function. This is enough to suit our purposes, but other useful functions still need to be implemented.
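The idea behind such an argument handler can be sketched as follows, in Python rather than as the actual PHP handler. Each keyword becomes one regex() FILTER on a variable. The function names, the `title` variable and the choice to require every keyword to match (multiple FILTERs in one group pattern must all hold) are assumptions for illustration, not the module's real behavior.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def keyword_filters(variable, keywords):
    """Return one case-insensitive regex() FILTER per keyword.

    Keywords are passed through as regular expression patterns, so regex
    metacharacters in a keyword keep their special meaning.
    """
    return [
        f'FILTER regex(str(?{variable}), "{escape_literal(word)}", "i")'
        for word in keywords
    ]


def build_query(keywords):
    """Compose a query that keeps only objects whose title matches every keyword."""
    filters = "\n".join("  " + f for f in keyword_filters("title", keywords))
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?object ?title WHERE {\n"
        "  ?object dc:title ?title .\n"
        f"{filters}\n"
        "}"
    )


query = build_query(["toneel", "interview"])
```

Supporting other FILTER functions would mean generating different expressions in `keyword_filters`, which is the part that is still missing in our handler.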

The module didn't support paging, so we had to write a specific pager class for SPARQL Views. For any kind of paging to work, two queries are needed: a COUNT query to establish the total number of objects in a result set and a ranged query to retrieve the actual objects for a given page number. The SPARQL specification is still under development at the W3C and lacks certain features which are available in other query languages. A well-defined COUNT modifier is not yet part of the standard. The ARC2 library provides its own SPARQL+ extensions which include COUNT support, but if the endpoint does not support SPARQL+, a COUNT query will return an error. Instead, our pager class retrieves the entire result set and determines the total number of objects in PHP. This solution works for small sets but doesn't scale well when queries return larger sets of data.
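The two paging strategies can be contrasted in a short sketch. This is Python, not the PHP pager class itself, and the function names are made up for the example. The ranged query relies on the standard LIMIT and OFFSET modifiers; the fallback mirrors what our pager does: fetch everything, then count and slice on the client.

```python
def ranged_query(base_query, page, items_per_page):
    """Append LIMIT/OFFSET so that only one page of results is requested.

    Pages are zero-based. Without an ORDER BY in base_query the endpoint
    gives no guarantee that rows keep the same order between requests.
    """
    offset = page * items_per_page
    return f"{base_query}\nLIMIT {items_per_page} OFFSET {offset}"


def page_in_memory(all_rows, page, items_per_page):
    """Fallback: count the full result set and slice out one page client-side."""
    total = len(all_rows)
    start = page * items_per_page
    return total, all_rows[start:start + items_per_page]


def total_pages(total, items_per_page):
    """Number of pages needed to show `total` items (ceiling division)."""
    return (total + items_per_page - 1) // items_per_page


# Example: 25 rows, 10 per page, third page (index 2).
rows = list(range(25))
total, page_rows = page_in_memory(rows, 2, 10)
```

The cost of `page_in_memory` grows with the size of the whole result set, because every row crosses the wire on every page request. That is the scaling problem described above.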

While SPARQL Views harnesses the power of the Views framework, there are several drawbacks. Views requires you to register your data sources before you can query them. It doesn't automatically read out the entire database structure, which means you have to explicitly define tables and their relationships in code using hook_views_data(). Handlers for fields, filters and arguments are statically associated with those fields. This allows Views to apply the correct handlers at runtime.

In SPARQL, variables are dynamically bound. Without a static definition of the available fields, Views does not know which handlers to instantiate. SPARQL Views tries to solve this issue by running the query from within hook_views_data() and building a mapping based on the structure of the result set. Views' architecture is not built to alter the data definition in the context of hook_views_data() when a query is run, though. This resulted in a series of nasty hacks on the part of SPARQL Views to register those fields and handlers nonetheless. Another tradeoff is that Views caching has to be disabled to make this work.
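The mapping step can be illustrated with a small sketch: given a result set, derive one field definition per bound variable and attach the same generic handler to each. This is Python and uses the layout of the W3C SPARQL JSON results format (`head.vars` and `results.bindings`) as an assumption; ARC2 and SPARQL Views use their own PHP structures, and the names `field_definitions` and `generic_field` are hypothetical.

```python
# A tiny result set in the shape of the SPARQL JSON results format.
SAMPLE_RESULT = {
    "head": {"vars": ["object", "title"]},
    "results": {"bindings": [
        {
            "object": {"type": "uri", "value": "http://example.org/object/1"},
            "title": {"type": "literal", "value": "Example interview"},
        },
    ]},
}


def field_definitions(result, handler="generic_field"):
    """Build one field definition per variable named in the result head."""
    return {
        var: {"title": var, "handler": handler}
        for var in result["head"]["vars"]
    }


def rows(result):
    """Flatten bindings to plain {variable: value} rows; unbound variables become None."""
    variables = result["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in result["results"]["bindings"]
    ]


fields = field_definitions(SAMPLE_RESULT)
```

Note that the field list only exists once a query has run, which is exactly why it clashes with a data definition that Views expects to be static.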

SPARQL Views comes with its own generic field handler which is applied to all the attributes in a result set. Part of the flexibility of the Views framework is its ability to instantiate specific handlers which can be assigned to fields depending on their datatype. For instance, imagecache formatters are only available for fields which are associated with the imagefield handler. SPARQL Views is not yet ready to automatically determine the type of a field and assign a specific handler. A possible solution might be tracking down the predicate of the matching triple pattern and looking at the associated schema against which the query was run. For now, the lack of typed fields restricts developers to the theming layer, overriding theme_views_view_field(), and project-specific code to get the job done. Modules like Display Suite do make it easier to theme the overall view.
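The predicate-based idea could look roughly like this. It is a sketch of a possible approach in Python, not something SPARQL Views implements: triple patterns are given as simple tuples, and the handler names and the predicate-to-handler table are invented for the example. A real solution would consult the schema instead of a hard-coded table.

```python
GENERIC_HANDLER = "generic_field"

# Hypothetical lookup table standing in for schema information.
HANDLERS_BY_PREDICATE = {
    "dc:date": "date_field",
    "dc:identifier": "link_field",
}


def handlers_for(patterns):
    """Pick a handler for each variable found in the object position of a pattern.

    `patterns` is a list of (subject, predicate, object) strings. Variables
    start with '?'. Predicates without an entry fall back to the generic handler.
    """
    handlers = {}
    for _subject, predicate, obj in patterns:
        if obj.startswith("?"):
            handlers[obj[1:]] = HANDLERS_BY_PREDICATE.get(predicate, GENERIC_HANDLER)
    return handlers


patterns = [
    ("?object", "dc:title", "?title"),
    ("?object", "dc:date", "?created"),
    ("?object", "dc:identifier", "?url"),
]
assigned = handlers_for(patterns)
```

Variables bound through a variable predicate, or bound in more than one pattern, would need extra rules that this sketch does not attempt.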

Conclusion

The best way to conclude our first year on the Archipel project is to provide an answer to a few questions.

Is it easy to query and reuse open data in our own Drupal projects?

Publishing RDF-formatted data has become much easier over the past year, but querying data is still non-trivial. The SPARQL specification is still evolving: major features like aggregate queries have yet to be standardized. Most tools are still experimental. SPARQL Views is arguably the most flexible tool available, although it's still under heavy development and comes with its quirks. A good understanding of RDF and SPARQL is still required if you want to ride the semweb.

What about performance?

The common triple store only contains a few dozen objects and our SPARQL queries are fairly simple. Since we don't retrieve large sets of data, there is currently no notable performance hit. With the problems we raised in this article in mind, and the amount of data in the triple store about to increase, scaling up is our next challenge.

So, when will SPARQL Views be really ready?

This is the classic chicken and egg dilemma. Without testing and contributions, it will take longer for tools like SPARQL Views to evolve. Then again, as long as they are still experimental, developers tend to stay away from them. If the Toneelstof case proves one thing, it's this: starting to use these tools and returning feedback drives development.

How can I help?

If you're up to it, start by downloading the module and reading the instructions on the project homepage. The latest version of the module is available on GitHub.
