Drupal Media Files Now Available Via Solr Search

Achieve Internet Releases the Drupal 7 ApacheSolr Media Module

This module allows website administrators to index files of any type so they can be included in site-wide search results. This is very useful for enterprise websites that need to manage a large number of files, such as videos, PDFs, Excel, Word, and PowerPoint documents, and images. The ApacheSolr Media module can index any field within the file entities, including title, description, and taxonomy fields.

Why the ApacheSolr Media Module

Over the years, Achieve Internet has built our fair share of publishing and media websites. A few years back, only large entertainment companies like NBCU or publishers like Fastcompany.com required complex media management. A lot has changed over the last five years, and today it seems everyone is a publisher of one kind or another. An even greater challenge is that enterprise organizations have gone global, and the need to distribute files and media in multiple languages is at an all-time high. A great example of this issue is how organizations manage their printed material, such as installation instructions, troubleshooting guides, catalogs, data sheets, brochures, and marketing material. It’s one thing for a website administrator to find those files; it’s a completely different challenge to make them available in public-facing search results. Achieve Internet’s new ApacheSolr Media module gives your website visitors access to all these files through a simple site search.

Example of This Module in Action on Hunterindustries.com

(The files shown below are PDFs and assorted zip files; however, the module can also be used for video, documents, and other file types.)

Screen Shot - ApacheSolr Media - Screen Shot 2.png

Screen Shot - ApacheSolr Media - Screen Shot 3.png

Engineering Challenges

One of the challenges in building the ApacheSolr Media module was reviewing all published nodes, detecting all referenced files, and creating a separate Solr document for every referenced file. Some nodes contain a large number of referenced files, so page timeouts during indexing are a real issue. Achieve solved this by counting each file attached to a node toward the node processing limit per cron run. For example, if the ApacheSolr Media module is configured to index up to 10 nodes per cron run, and each node has five files, then the module will index two nodes and 10 files per cron run.
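The counting rule can be sketched in a few lines. This is an illustrative Python sketch, not the module's actual code (the module itself is a Drupal module); the function name is hypothetical, and two behaviors are assumptions the post does not state: a node with no files uses one slot of the limit, and at least one node is always processed so that a node with more files than the limit cannot stall the queue.

```python
from typing import List, Tuple


def select_nodes_for_cron_run(file_counts: List[int], limit: int) -> Tuple[int, int]:
    """Return (nodes_indexed, files_indexed) for a single cron run.

    file_counts: number of referenced files per queued node, in queue order.
    limit: the configured per-run limit; attached files count toward it.
    """
    nodes_indexed = 0
    files_indexed = 0
    budget_used = 0
    for count in file_counts:
        # Assumption: a node without files still consumes one slot.
        cost = max(1, count)
        # Assumption: always take the first node so the queue makes progress.
        if nodes_indexed > 0 and budget_used + cost > limit:
            break
        budget_used += cost
        nodes_indexed += 1
        files_indexed += count
    return nodes_indexed, files_indexed


# The example from the text: a limit of 10 and five files per node.
nodes, files = select_nodes_for_cron_run([5, 5, 5, 5], limit=10)
print(nodes, files)
```

With the numbers from the example above, the run stops after two nodes because their 10 files use up the limit.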

The ApacheSolr Media module is the fourth module released by Achieve Internet in the fall of 2011.  The other modules are:

Media Updates

Fresh content is one key to great web experiences. This module simplifies the process of updating media content by allowing replacement of existing files. This release includes the capability to quickly and easily replace a media file currently in use at various locations on your site.   Media Updates Module Blog

Views Media Browser

One of the biggest problems in managing media is being able to find assets. This module enables Views filtering, allowing you to refine results by any type of field in your media files. Filtering by taxonomy terms and searching text fields are just two powerful examples. Having the capability to selectively screen information is extremely valuable. Views Media Browser Module Blog

Media Translation

This new module allows you to easily manage media files and taxonomy terms in multiple languages. Additional capabilities include an automated “detect and replace” function that aligns files with the language mode displayed. Media Translation Blog

Combining these modules with the original Drupal Media module can produce a powerful and rich media management experience. For example, by adding the Media Translation module to the ApacheSolr Media module, we can create “translation sets” of files, which group together all translated versions of the same file. By integrating the two modules, we can create a node, attach files to it, and then, when the node is translated, the correct language version of each referenced file automatically displays on the correct language site, including in the search results.
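To make the idea of a translation set concrete, here is a minimal Python sketch. It is not the Media Translation module's data model: the `MediaFile` class, the `translation_set_id` attribute, and both function names are hypothetical, and falling back to the originally referenced file when no translation exists is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class MediaFile:
    file_id: int
    translation_set_id: int  # hypothetical: shared by all translations of one file
    language: str


def build_translation_sets(files: Iterable[MediaFile]) -> Dict[int, Dict[str, MediaFile]]:
    """Group files so each set maps a language code to that language's version."""
    sets: Dict[int, Dict[str, MediaFile]] = defaultdict(dict)
    for media_file in files:
        sets[media_file.translation_set_id][media_file.language] = media_file
    return dict(sets)


def resolve_for_language(
    referenced: MediaFile,
    sets: Dict[int, Dict[str, MediaFile]],
    language: str,
) -> MediaFile:
    """Pick the version of a referenced file that matches the site language.

    Assumption: if the set has no version in that language, the originally
    referenced file is returned unchanged.
    """
    return sets.get(referenced.translation_set_id, {}).get(language, referenced)


catalog_en = MediaFile(file_id=1, translation_set_id=100, language="en")
catalog_es = MediaFile(file_id=2, translation_set_id=100, language="es")
sets = build_translation_sets([catalog_en, catalog_es])
print(resolve_for_language(catalog_en, sets, "es").file_id)
```

A node that references the English catalog would resolve to the Spanish version on the Spanish site, and the same lookup could be applied when choosing which file appears in each language's search results.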

Assumptions

Like every good module, this one needed clear parameters to accomplish our goal of helping publishing and media-related websites manage their files. The most important assumption, and the one that dictates the module’s behavior, is that the only files indexed for search are files referenced by a node. For example, for one of our recent clients, the only files included in the search results are those attached to a published product node, support document, or other node. This makes it much easier to display only relevant and current content without having to frequently delete outdated file content from the site.

This module also assumes that you are using version 1 of the Media module, and that your nodes use media selector fields to reference the files.

Items to Consider

There are a few issues that need to be considered before installing the ApacheSolr Media module:

Files must be attached to at least one content type that is indexed by Solr, and the files to be indexed must be attached in a media selector field. There can be multiple media selector fields per node.

This module does not index the content of the files – for example, the PDF file itself, or the Excel file, etc. It indexes only the fields in the file entities.

All file types to be indexed for the search must use the same title field.
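Taken together, the considerations above describe which files become search documents and what those documents contain. The following Python sketch illustrates that selection under stated assumptions: the dictionary shapes, the `build_file_documents` name, and the document keys (`id`, `title`, `fields`) are hypothetical and do not reflect the module's real Solr schema.

```python
from typing import Any, Dict, List


def build_file_documents(
    nodes: List[Dict[str, Any]],
    files: Dict[int, Dict[str, Any]],
    title_field: str,
    media_fields: List[str],
) -> List[Dict[str, Any]]:
    """Build one search document per file referenced by a published node.

    nodes: each node has a 'published' flag and media selector fields that
           hold lists of file ids.
    files: file id -> field values stored on the file entity.
    Only entity fields are copied; the file's own contents are never read.
    """
    documents: Dict[int, Dict[str, Any]] = {}
    for node in nodes:
        if not node.get("published"):
            continue  # files on unpublished nodes stay out of the index
        for field_name in media_fields:
            for file_id in node.get(field_name, []):
                if file_id in documents or file_id not in files:
                    continue  # one document per file, however often it is referenced
                entity = files[file_id]
                documents[file_id] = {
                    "id": f"file/{file_id}",
                    # A file type without the shared title field gets an empty title.
                    "title": entity.get(title_field, ""),
                    "fields": {k: v for k, v in entity.items() if k != title_field},
                }
    return list(documents.values())


nodes = [
    {"published": True, "field_media": [10, 11]},
    {"published": False, "field_media": [12]},
]
files = {
    10: {"field_file_title": "Catalog", "field_description": "Product catalog"},
    11: {"field_description": "Data sheet without a title field"},
    12: {"field_file_title": "Draft brochure"},
}
for doc in build_file_documents(nodes, files, "field_file_title", ["field_media"]):
    print(doc["id"], repr(doc["title"]))
```

In this sketch, file 12 is skipped because its node is unpublished, and file 11 ends up with an empty title, which is why a single shared title field matters.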

Installation and Setup

Getting the ApacheSolr Media module set up and ready to use is simple – just install the module.

To set up the remainder of the items to support this module:

Configure the File Types

1.     Go to Configuration and select File types (admin/config/media/file-types).

2.     Click on manage fields for the file type you want to configure.

3.     Create a generic File Title field of type Text or add an existing File Title field.

       You must have a single Title field that is used by all file types; otherwise your files will not have a title in the search results.

4.     Add any additional fields needed.

5.     Repeat for all file types you want to index for the search.
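Before rebuilding the index, it can help to confirm that every file type really does share the title field. The check below is a small Python sketch of that idea; the function name, the field names, and the mapping of file types to fields are all hypothetical stand-ins for whatever your site defines on the manage fields pages.

```python
from typing import Dict, List, Set


def file_types_missing_title(
    file_type_fields: Dict[str, Set[str]], title_field: str
) -> List[str]:
    """Return the file types that do not have the shared title field attached."""
    return sorted(
        type_name
        for type_name, fields in file_type_fields.items()
        if title_field not in fields
    )


# Hypothetical file types and fields, as they might appear after step 5.
configured = {
    "document": {"field_file_title", "field_description"},
    "image": {"field_file_title"},
    "video": {"field_description"},
}
print(file_types_missing_title(configured, "field_file_title"))
```

Any file type reported by this check would show up in search results without a title until the shared field is added to it.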

Configure the Content Types

1.     Go to Structure and select Content types (admin/structure/types).

2.     Click on manage fields.

3.     Add one or more Multimedia asset fields to the content type. These are the fields that will reference the file entities to be indexed.

Note: You must have at least one content type that is indexed by Solr that contains a Multimedia asset field.

Achieve recommends reusing the same field across multiple content types.

Configure Solr Integration

1.     Go to Configuration and select Apache Solr Search (admin/config/search/apachesolr).

2.     Click on the Media tab.

3.     Select the field to use as the media file title.

4.     Select the media fields attached to nodes to include in the Solr index.

Rebuild the Solr Index

1.     Go to Configuration and select Apache Solr Search (admin/config/search/apachesolr).

2.     Click on the Search Index tab.

3.     Select either Queue content for reindexing or Delete the index and click Begin.

Future Plans

We are delighted to be co-maintainers of this module with shenzhuxi.

Given the time, Achieve and shenzhuxi would like to see a 2.x version of the ApacheSolr Media module use the upcoming File Entity module (under development in conjunction with the Media module version 2). This would make the ApacheSolr Media module much more flexible because it could work with any file management system.

The group would also like to expand the functionality of the ApacheSolr Media module to index the file content itself (e.g., the actual PDF) instead of just the file entity fields. That, however, is an entirely different challenge and may need to be done separately from this module.

This is only v.1, and like every good Drupal module, the real future and power of this module will come from the community. We would love your feedback, input, and contributions to the ApacheSolr Media module. The power of Drupal comes from our collaboration!

For more information on Achieve Internet, please visit our Drupal.org Market page: http://drupal.org/node/1123842
