Author

Upgrade Your Drupal Skills

We trained 1,000+ Drupal Developers over the last decade.

See Advanced Courses NAH, I know Enough
Jan 07 2021
Jan 07

In our recent project, we had a requirement from one of our clients where we need to validate data in CSV files based on custom requirements. This validated CSV would need to be imported into Drupal 8 into various content types.  

In this article, we will look at the requirement, the library, the architecture of the custom module, the different components of the module with some code samples and finally adding some ideas on how this module can be made more reusable and even contributed.

Introduction

Our client is a well known international NGO with offices worldwide, each with different types of data management systems and frameworks. They wanted a centralized system to manage the data from each of these offices. Having concluded that Drupal 8 was the ideal solution to implement that centralized system, the challenge was to set up a migration pipeline to bring in data from all of the offices and their varying frameworks. Consequently, the files generated by these systems needed to be validated for specific constraints before being imported into our Drupal system.

Challenges and Goals

Following are the goals that the system should meet:  

  1. The CSV files were in a custom format and there were multiple files with different structures and needed to be handled accordingly. Each column needed to have another validator. 
  2. The files needed to be validated for errors before they could be imported and the errors needed to be logged with line numbers and relevant error messages. 
  3. The validation had to be triggered automatically when the files were downloaded from a central location. 
  4. Notification emails had to be sent on successful and failed validation to the IT admins. 
  5. After successfully validating the files, the validator needed to trigger the next step of the process, which is importing the files.

The main challenges

  1. The validation had to cross-reference the incoming data with existing data and also with data in different files (referential integrity checks). 
  2. We also had to check the uniqueness of certain columns in the CSV files. Doing this in a database is pretty easy and straightforward, but this had to be done before inserting it into the database.

Step 1: Choosing a CSV reader library

The first step was to figure out a PHP based CSV reader library. League CSV was found to be the best option due to the below reasons:

  1. It was managed by composer and was already being used by the Migrate module in Drupal core and hence no additional code needed to be added for the library to work.
  2. The library covered many common scenarios like iterating through rows of the CSV, getting the field values and headers, and streaming large CSV files.
  3. And finally, it was implemented in an object-oriented way.

Step 2: Architectural requirements

Below are the requirements we had concerning the architecture of the code:

  1. The code needs to work as an independent service to call it at different locations of code and thereby invoke validation wherever required.
  2. The validations need to be as generic as possible so that the same validation rule can be reused for different fields in the same CSV or in others.
  3. We need to have an extensible way to specify the validation to be done for each field. For example, whether a specific field can be allowed to be blank.

Step 3: Designing the components of the validator

To satisfy the above architectural requirements, we designed the validator module into the following sub-components:

The main service class

Below are the main responsibilities of this class:

  1. Load the CSV library and loop through each of the files in a particular folder.
  2. Use the methods supplied by the CSV league to read the file into our variables. For example, each row of the file will be stored in an array with an index containing each column data.
  3. During processing, the filename is taken in and checked to see if the validator method in the Validator class matching the filename exists.  
  4. If the method exists, then validation is done for the file and errors are logged into the error log table.
  5. If there are no errors, the class triggers the next event, which is migration using a predefined custom event via the Event API of Drupal. 
  6. This also passes the status of the import to the calling class so that emails can be triggered to the site admins.

The Validators class

Here, we basically assign constraints for each file type in a method. The input to the validator class would be a complete row.  

The Constraints class

This class contains the individual constraints that check if a particular type column meets the required criteria. These constraints are methods that take in the column value as a parameter and return an error message if it does not meet the criteria for that column type. This class will be invoked from the validators class for each column in every row.

The Error log

As its name suggests, the validator needed to capture the errors and log them somewhere. We defined a custom table using the database hooks provided by Drupal. A custom view was defined in code to read the data from this table. The errors captured by the constraint class were logged into the database using this logger.

Eventsubscriber and mail notification

We needed the validation to be auto-triggered when the files were downloaded. To achieve this, we tapped into Drupal’s EventSubscriber and Response APIs. 

Referential Integrity checks

Most of the columns did not have any relation with existing data and could be validated on the fly. However, some of the data had to be validated if it has corresponding references either in the database or in another CSV file. We did this as follows.

  1. For those values which act as a parent, dump them into a temporary table, which will be cleared after validation is completed.
  2. When we arrive at another CSV with a column that references values dumped above, then we query the above table to check if the value is present. If yes, return TRUE.
  3. If the value is not present in the temporary table, then we search the Drupal database as the value might have been imported as part of the previous import. If not, then we throw a referential error for that row in the CSV.

The code snippets are available here.

We used the migrated data as a source for a headless backend using REST. For more details on the specifications, refer to our blog on how to validate API response using OpenAPI3.

Future scope and ideas to extend this as a module by itself

We have written the module with an architecture where the validators can be reused but require some coding effort. Below are changes that can be done to make this module a contribution.

  1. Add configurations to have a list of files that need to be validated.
  2. Each file will have an option to add the fields that need to be validated and the type of data (similar to what you have when creating content type).
  3. Based on the above list of files and field types, we can validate any number of CSVs with any number of columns. 
  4. We would need to modify the above classes to fetch the columns' data type and call respecting constraints for each CSV.

As a result of doing the above changes, anyone will be able to use this module to validate CSV files with their own columns.

Hope this blog helped you with this module and how it can be made more reusable and even contributed. Share your experience in the comments below! 

About Drupal Sun

Drupal Sun is an Evolving Web project. It allows you to:

  • Do full-text search on all the articles in Drupal Planet (thanks to Apache Solr)
  • Facet based on tags, author, or feed
  • Flip through articles quickly (with j/k or arrow keys) to find what you're interested in
  • View the entire article text inline, or in the context of the site where it was created

See the blog post at Evolving Web

Evolving Web