Jul 25, 2023 · Pierce Lamb · 10 min read
Prompt: “An AI that has automated the world”

I’ve been building ChatGPT-powered features since its release and have been absolutely amazed by what it can do. Even though I’ve been interacting with GPT every day for a while now, its outputs often still feel like magic. For months, creating inputs (prompts) for GPT was a fun new way to program this tool, but handling its output was a mess. In my prompting, I experimented with various ways to steer its output to give it structure and used basic string functions to slice and dice the results, but it never worked 100% of the time. I evaluated a number of tools that tried to fix this problem, but none of them quite fit our use case. All of this changed dramatically with the advent of Function Calling.

For those that have been living in a cave, OpenAI fine-tuned GPT-3.5-turbo and GPT-4 so that these models can accept a list of functions, defined using JSON Schema, as input along with a prompt. The models can then select a function to call as output. That is, GPT will return an output that answers one’s prompt, but in the form of a function call with the function name and inputs for its parameters. A program accepting that output can then choose to call that function with those inputs and return the result to GPT for further processing. This seemingly simple functionality was the basis for an entire ecosystem of ChatGPT “agents” that existed prior to OpenAI rolling out function calling; now a user can build a custom agent using just the OpenAI libraries.

I suspect, however, that, like me, a large majority of people using function calling are using it to get structured output out of GPT. For many, there are stomach-turning security issues with allowing an LLM to execute code in your program, but for receiving structured output, function calling is extremely useful. In this blog post I will demonstrate how to use function calling to receive structured output, and how one small improvement — adding an explanation parameter — appears to improve the quality of the outputs and makes the prompts that generate them significantly easier to debug.

If you’re trying to solve a problem with ChatGPT, at some point in time you’ve googled the phrase “prompt engineering.” And, if you got lucky, you came across Lilian Weng’s post on the topic that saved you from having to read a bunch of verbose academic papers. Lilian does a fantastic job at explaining the technical side of prompt engineering and summarizing the results of academic papers; her work can immediately improve your prompting. However, I want to diverge for a second to explain a way of thinking about prompting LLMs that has recently helped me.

Imagine you’ve hired a high school or college graduate who excelled in all language or liberal arts classes. They possess an elite ability to apply language reasoning skills to any task that involves language. However, for your specific task, they lack any prior context on how to perform it. Thus, for them to succeed at your task, you can either teach them all the context you’ve learned about it, or you can give them a set of criteria to follow, so that all they need to do is apply the criteria using their language reasoning skills.

Since you’ve hired this person with no background or context on what you do, one thing you might do to verify the results they produce is ask them to provide a chain of reasoning or rationale for how they arrived at their answer. For the new employee, being forced to write down a chain of reasoning will likely help them arrive at an accurate answer. For you, their manager, it will help you understand whether they’ve arrived at an accurate answer and whether there are any gaps or issues in the criteria you provided. Furthermore, for any third party you might show their work to, an explanation of how the answer was found reduces confusion about the answer itself.

Of course, in our example the employee is ChatGPT and this style of requesting a chain of reasoning is now a common technique for improving results. Chain-of-thought (CoT) prompting was introduced by Wei et al. 2022 as well as Kojima et al. 2022, Zhou et al. 2022 and later made very popular in the ReAct paper by Yao et al. 2023. In its most basic form, simply asking ChatGPT to think step-by-step before providing an answer fits this bill, though these papers introduce a much more nuanced perspective.

Prior to function calling, I was using chain-of-thought prompting not only to obtain more accurate answers, but also to debug edge cases in my prompts. When you’re providing ChatGPT criteria to follow to help create an answer, oftentimes there will be an edge case you have not considered and ChatGPT’s explanation will show the gap in your criteria. This explanation would be provided in the text blob that came back from ChatGPT. If you just stored the raw prompt and raw answer and examined them when the returned answer didn’t make much sense, the explanation would oftentimes elucidate the gap in your criteria. However, once you start using function calling, you likely will not be getting a text blob back.

Here are a few examples from my work with ChatGPT:

  • Return the top-k most similar questions to a given question,
  • Create a synthesized answer between two answers to the same question,
  • Find a specific column type in tabular data,
  • Answer a new question based on similar previous questions’ answers,
  • Answer a question using related documentation to find the answer.

In all of these examples, once you move to function calling, ChatGPT is going to start filling a parameter with its answer to these requests. However, the explanation of how it arrived at that answer, which was regularly a feature of the old text blob, is lost. As a result, I started adding an explanation parameter to every single function call and modifying the prompt so ChatGPT knew to dump an explanation into this parameter. Not only did this subjectively seem to improve the answers overall (in the same way CoT is supposed to), it also provided me with a way to debug my prompts when answers went haywire. Let’s look at a more concrete example to better understand what I mean.

We will look at an example of answering a new question based on similar questions’ answers. Note that “similarity” is measured via an approximate nearest neighbors search over previous questions that have been embedded as vectors. Let’s take a look at the function call we are sending to GPT:

{
  "name": "get_answer_and_explanation",
  "description": "Returns the answer to the new question and an explanation of how the answer was created",
  "parameters": {
    "type": "object",
    "properties": {
      "answer": {
        "type": "string",
        "description": "The answer to the new question"
      },
      "explanation": {
        "type": "string",
        "description": "An explanation of how the answer was created"
      }
    },
    "required": ["answer", "explanation"]
  }
}

As you can see, this is pretty straightforward. We just add a second parameter to the function call called “explanation” and give it a succinct description. GPT will create an answer to our new question, place it in the answer parameter, and then explain how it arrived at that answer in the explanation parameter.
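
For completeness, here is a minimal sketch of the whole round trip, assuming the 2023-era openai Python library; the model name, the prompt variables and get_answer_and_explanation_function (holding the JSON definition above) are illustrative, not the exact code behind this post:

import json
import openai

# A sketch: force GPT to call the function above, then parse the structured arguments
system_prompt = "..."  # the system prompt built in the next section
new_question = "Does your company have cyber insurance?"

response = openai.ChatCompletion.create(
    model="gpt-4-0613",  # illustrative
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": new_question},
    ],
    functions=[get_answer_and_explanation_function],
    function_call={"name": "get_answer_and_explanation"},
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    arguments = json.loads(message["function_call"]["arguments"])
    answer = arguments["answer"]
    explanation = arguments["explanation"]

Next, let’s see how we connect this function call back to the prompt.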

Recall that the goal is to steer ChatGPT and provide a list of criteria it can follow to utilize its incredible reasoning abilities without providing it a mass of in-context examples (my main reason for avoiding the latter is token rate limits). Here is the beginning of the system prompt:

You are a helpful assistant that answers questions about a company based on
similar questions the company has responded to in the past. Below are 5
questions the company has responded to in the past. After those 5 questions,
there will be a new question for you to answer.

This provides the most basic steering on the task ChatGPT is to perform. Next, also in the system prompt, we provide the list of criteria that will help it understand how to apply its reasoning abilities and adjudicate edge cases.

Please think step-by-step when answering the new question and use the criteria
listed below as guidelines for answering the new question:
a.) Use the old 5 questions and their responses as context to answer the
new question.
b.) The old 5 questions are sorted by similarity to your new question.
However, the most similar questions may not have the best responses
to answer your new question. Analyze the old questions and their
answers and then pick one of the following strategies for answering
the new question based on your analysis:
b1.) Two or more old questions are similar to the new question and have
good responses. Synthesize these responses into an answer for the
new question.
b2.) Two or more old questions are similar to the new question,
but their responses disagree or contradict each other.
Pick the response that answers the new question the best.
b3.) One old question has the best response for answering the new question.
Discard the other old questions and use that response to answer
the new question.
b4.) The old questions are sufficiently dissimilar to the new question,
answer the new question with “cannot be answered”
c.) Additional criteria on how to evaluate a response:
c1.) A good response is usually sentence(s) of English. Note it can contain
grammatical and spelling mistakes.
c2.) A good response could be considered a response to its question.
c3.) The voice of a good response should be written as if it is a
human at a company responding to the question. Sometimes it’s as
simple as “Yes” or “No” and other times it includes additional
information.

Now, I didn’t come up with these 10 criteria in a vacuum. This list was built up over successive requests to ChatGPT and analysis of its responses. And, as I’ve already argued at length, the explanation parameter was the single best method for understanding what ChatGPT was doing and thereby enhancing this list of criteria. Finally, also in the system prompt, we provide formatting instructions, which is what helps ChatGPT correctly call the function:

Your answer should be the answer to the new question and an 
explanation of how you arrived at that answer that includes
your step-by-step reasoning using the criteria above.
Use this format:

Answer:
Explanation:

You can see here that the end of the system prompt is what connects the whole prompt back up to the function call. Now GPT can easily see how its response to the prompt can match the function’s input parameters.

Given what we know about the function call and prompt, what do some of these explanations look like? Here is one of the simpler examples, where GPT-4’s explanation provides much more detail on how it created the answer:

"answer": "Yes, our company does have cyber insurance.",
"explanation": "The new question is 'does your company have cyber insurance?'
The most similar old questions are 'does your company have cyber insurance?'
and 'how much coverage do you have in cyber insurance?' The responses to
these old questions confirm that the company does have cyber insurance.
Therefore, the strategy b3 is used: One old question has the best response
for answering the new question. The response to the old question 'does your
company have cyber insurance?' is used to answer the new question."

I find it so cool when GPT-4 highlights which strategy it’s chosen to create an answer:

“explanation”: “The new question is asking about the amount of coverage
the company has in cyber insurance. Looking at the old questions, question
2 and question 3 are the most similar to the new question and both provide
a clear response about the amount of coverage the company has in cyber
insurance. The company maintains a Cyber Liability with a combined limit
of $5,000,000 each occurrence and $10,000,000 annual aggregate. Therefore,
I used strategy b1, synthesizing these responses into an answer for the new
question.”

Here is another where it synthesizes across many questions:

“explanation”: “The answer is synthesized from multiple previous questions
and their responses. Questions 2, 3, 6, 8, 9, 11, 14, 17, 18, 19, 20, and 21
all provide information about the various tools and infrastructure pieces
that the company monitors. These responses collectively provide a
comprehensive answer to the new question.”

How about an older one where the explanation helped me to improve the list of criteria?

“answer”: “Cannot be answered.”,
“explanation”: “I apologize, but I cannot provide an answer based on the
given responses. The two answers provided are contradictory, and it is
unclear which one accurately reflects the types of hygiene maintenance
required for the company. Therefore, I cannot answer the question with
certainty.”

Without this explanation, I would have had no idea why GPT-4 was choosing “Cannot be answered” as the answer and likely would not have been able to debug what was happening. This situation is typical with function calling and is why I think the `explanation` parameter is so helpful. It is easy to see how this would guide one to engineer the prompt differently. I hope this blog has shown how this small change to one’s function calls can improve quality-of-life with respect to ChatGPT outputs.

Finally, I should also address the paper “The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning” by Ye & Durrett 2022. Here, the authors provided few-shot prompts to GPT-3.5 that included explanations and found that when GPT-3.5 hallucinated an answer it would often hallucinate an explanation that entailed that answer (it was less likely to produce an explanation that did not entail the answer). Any regular user of ChatGPT should know that it is possible for GPT to hallucinate both its answer and its explanation, so it’s always good to check an entire prompt and make sure this is not happening. That said, it is reasonable to presume that GPT-4 has gotten better at this sort of thing. Further, my instinct from using GPT models is that few-shot examples often cause GPT to try to fit its responses to those examples even at the cost of hallucinating. That is, GPT would rather produce an explanation that closely models the few-shot explanations but is a hallucination than one that does not model them but is factually accurate. I do not have any research to back up this instinct, so take it with a grain of salt.

At any rate, one key difference between the paper and what I’ve described above is that there are no few-shot examples on the explanation parameter; we only attempt to constrain it via the criteria list. My subjective take from repeated use of this setup is that GPT-4’s explanations appear to always entail the answer and almost always cite which criteria it has chosen to create the answer, which raises confidence about factual accuracy. But any GPT user should always be checking answers and explanations regularly for hallucinations.

Creating a ML Pipeline Part 3: Training and Inference

Apr 19, 2023 · Pierce Lamb · 15 min read

This is the third post in a three part series on creating a reusable ML pipeline that is initiated with a single config file and five user-defined functions. The pipeline is finetuning-based for the purposes of classification, runs on distributed GPUs on AWS Sagemaker and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.

This post originally appeared on VISO Trust’s Blog

This post will cover the training and testing (inference) steps. These are the core steps in a ML pipeline where a model is hyper-parameter tuned and the test set is used to measure performance. If you have landed on this post first, check out the first post in the series detailing the pipeline setup and the second post detailing the data steps.

Training and Tuning

The reason I have combined Training and Tuning into one section is that Tuning is just a set of training jobs where performance is incrementally improved by changing hyperparameters. As such, under the covers, the two types of jobs call the same code. As we have done previously, let’s first take a look at perform_training() and perform_tuning() to see how the code interacts with Sagemaker.

Zooming into perform_training(), we encounter the first bit of backend code that handles a use case we have not yet discussed: comparing two models. If you recall in part one, one of the motivations for creating this pipeline was to rapidly test multiple Document Understanding models and compare performance between them. As such, the pipeline is built to handle, in a single experiment, multiple models being passed in the settings.ini file the experimenter defines. In fact, the MODEL_NAMES parameter from this file can accept one or many model names, the latter implying that the experimenter wants to run a comparison job. A comparison job has no impact on Data Reconciliation or Data Preparation; we want these steps to be isomorphic to a single model job as the idea is that n models get trained and tested on the exact same snapshot of training data. With that preamble, perform_training() looks like this:
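
A rough sketch of the shape of perform_training(), with helper and config names assumed rather than taken from the actual source:

from sagemaker.estimator import Estimator

# A sketch of perform_training(); the details are assumptions, not the exact source
def perform_training(config):
    for model_name in config.model_names:  # one model, or several for a comparison job
        estimator_kwargs = get_estimator_kwargs(config, model_name)
        if config.use_distributed:
            # extra parameters for distributed GPU training (inspired by the
            # sagemaker-sdk's _distribution_configuration), plus turning off
            # Sagemaker Debugger -- see the USE_SMDEBUG digression below
            estimator_kwargs.update(
                debugger_hook_config=False,
                environment={"USE_SMDEBUG": "0"},
            )
        estimator = Estimator(**estimator_kwargs)
        # kicks off a Sagemaker Training job on the encoded data from Data Preparation
        estimator.fit(config.training_data_uri)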

The loop here is iterating over either a list with n model names or a list with a single model name. For each model name, an Estimator() is constructed and .fit() is called which kicks off a training job on Sagemaker. get_estimator_kwargs() will look familiar to anyone who has trained on Sagemaker already:
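
A sketch of what get_estimator_kwargs() might return; the setting names are illustrative, with config.docker_image_path being the important one discussed next:

def get_estimator_kwargs(config, model_name):
    # A sketch: most values come straight from the experiment's settings.ini
    return dict(
        image_uri=config.docker_image_path,  # the experimenter's ECR training image
        role=config.sagemaker_role,
        instance_type=config.instance_type,  # e.g. a multi-GPU instance for distributed jobs
        instance_count=config.instance_count,
        base_job_name=f"{config.s3_parent_dir}-{model_name}",
        output_path=f"s3://{config.bucket}/{config.s3_parent_dir}/{config.run_num}/{model_name}",
        hyperparameters=config.hyperparameters,
    )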

Settings are extracted from the config we discussed in the first post in the series, the most important of which is config.docker_image_path. As a refresher, this is the ECR URL of the training image the experimenter created during setup; it is shared between Sagemaker Processor/Training/Tuning jobs and contains all needed dependencies. Next, perform_training checks a boolean from the settings.ini file, USE_DISTRIBUTED, which defines whether or not the experimenter expects distributed GPU training to occur. If so, it sets some extra Estimator parameters, which are largely inspired by the _distribution_configuration function from the sagemaker-sdk.

I will digress for a moment here to talk about one such parameter, namely, an environment variable called USE_SMDEBUG. SMDEBUG refers to a debugging tool called Sagemaker Debugger. For reasons I cannot explain, and which have not been answered by AWSlabs, this tool is on by default, and with it enabled distributed training would not work for some models, producing mysterious exception traces. It only became obvious to me when carefully examining the traces and seeing that it was some code in smdebug that was ultimately throwing. Furthermore, there are a variety of ways to turn off smdebug, for instance passing 'debugger_hook_config': False as done above, or environment={'USE_SMDEBUG': 0}. However, these methods only work on Training jobs. Again, for reasons I cannot explain, the only way to turn off SMDEBUG on Tuning jobs is to set the env var inside the docker container being used: ENV USE_SMDEBUG="0"; the other methods explained above somehow never make it to a Tuning job's constituent Training jobs. An unfortunate side effect is that this makes it difficult for an experimenter to configure this environment variable. At any rate, hopefully AWSlabs fixes this or at least makes smdebug exceptions more user-friendly.

The call to .fit() makes the actual call to the AWS API. The config.training_data_uri parameter specifies the S3 URI of the encoded training data from the Data Preparation step; the training instance will download this data to local disk before it executes where it can be easily accessed by multiple GPU processes. How does the job know what code to execute? That is specified in the base docker container which is extended by the experimenter:

These environment variables are used by the sagemaker-training library to kick off the training script. At this point we would dive into train.py, but since it is also used by a Tuning job, let’s take a look at how we kick off a Tuning job first. The beginning of a Tuning job is nearly identical to a Training job:

But now, instead of calling .fit(), we need to set up a few more parameters that a Tuning job requires. A Tuning job requires a set of constant hyperparameters and a set of tunable hyperparameters. As such, here is an example of what an experimenter might write in the settings.ini file to represent this:
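
As a hypothetical illustration (the actual key names and entry separators in our settings.ini may differ), the two kinds of values might look like:

CONSTANT_HYPERPARAMETERS = epochs=10 weight_decay=0.01
TUNABLE_HYPERPARAMETERS = learning_rate=0.00001->0.0001 train_batch_size=8,16,32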

Here the constants will not change between tuning jobs, but the tunable parameters will start as guesses and those guesses will improve as jobs complete. The -> and , are syntax I’ve chosen; in this context -> stands for an interval while , stands for categorical options. Having seen this, the next piece of the Tuning job setup should make sense:
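
A sketch of how those strings could be turned into Sagemaker parameter ranges; the parsing details are assumptions, while the sagemaker.tuner classes are the real ones:

from sagemaker.tuner import CategoricalParameter, ContinuousParameter

# A sketch: "->" becomes a continuous interval, "," becomes categorical options
tunable_hyperparameters = {}
for name, spec in config.tunable_hyperparameters.items():
    if "->" in spec:
        low, high = spec.split("->")
        tunable_hyperparameters[name] = ContinuousParameter(float(low), float(high))
    else:
        tunable_hyperparameters[name] = CategoricalParameter(spec.split(","))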

Now we have our dict of tunable parameters we can pass to the HyperparameterTuner object:
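
A sketch of that construction, with config-derived values standing in for the real ones:

from sagemaker.tuner import HyperparameterTuner

objective_metric_name = config.objective_metric_name  # e.g. "validation:f1", illustrative
tuner = HyperparameterTuner(
    estimator=estimator,  # the Estimator built above
    objective_metric_name=objective_metric_name,
    metric_definitions=[{
        "Name": objective_metric_name,
        "Regex": config.objective_metric_regex,  # how to pull the value out of the job logs
    }],
    hyperparameter_ranges=tunable_hyperparameters,
    objective_type="Maximize",
    max_jobs=config.max_jobs,
    max_parallel_jobs=config.max_parallel_jobs,
)
tuner.fit(config.training_data_uri)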

This should look somewhat similar to what we just did for Training, with a few extra parameters. So far, the HyperparameterTuner object takes the constructed Estimator() object that will be re-used for each constituent Training job and the tunable hyperparameters we just discussed. A Tuning job needs to measure a metric in order to decide if one set of hyperparameters is better than another. objective_metric_name is the name of that metric. This value is also used in the metric_definitions parameter which explicitly defines how the HyperparameterTuner job can extract the objective metric value from the logs for comparison. To make this more concrete, this is how these values are defined in an example settings.ini file:

Finally, the max_jobs parameter defines how many total Training jobs will constitute the Tuning job and max_parallel_jobs defines how many can run in parallel at a given time. Like the Estimator in the Training job, we call fit() to actually kick off the Tuning job and pass it the training_data_uri like we did previously. With this in place, we can now look at train.py and see what executes when a Training or Tuning job is executed.

The goal of train.py is to fine tune a loaded model using a set of distributed GPUs, compute a number of metrics, determine which is the best model, extract that model’s state_dict, convert that model to torchscript, and save these files along with a number of graphs to S3. Huggingface’s Accelerate, Evaluate and Transformers libraries are all used to greatly simplify this process. Before continuing, I have to give a brief shoutout to the Accelerate devs who were extremely responsive while I was building this pipeline.

Note that in a distributed setting, every GPU process is going to execute this same train.py file. While much of the coordination can be handed off to Accelerate, it is helpful to keep this in mind while working inside the file. Diving a level deeper, train.py is going to:

  • Read hyperparameters and determine if the running job is a tuning job, training job or comparison job
  • Determine if gradient accumulation will be utilized
  • Construct the `Accelerator()` object which handles distribution
  • Initialize wandb trackers
  • Load split training data and create `Dataloader()`s for training and validation
  • Set up an optimizer with learning rate scheduling
  • Execute a training and validation loop, computing metrics and storing metric histories and determining what the best model was
  • Plot curves for metrics
  • Extract the curves, statistics and best model from the loops
  • Write all of this data to S3

We start by reading the passed hyperparameters and setting a few values that can be used throughout the training process:
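
A sketch of that setup; the JSON path is the standard Sagemaker location for passed hyperparameters, while the key names and job-type logic are assumptions based on the description below:

import json

# Sagemaker writes the hyperparameters passed to a job to this path inside the container
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)

# _tuning_objective_metric is only injected by Sagemaker on Tuning jobs
is_tuning_job = "_tuning_objective_metric" in hyperparameters
is_comparison = len(config.model_names) > 1  # assumed: pipeline config available in the container
run_num = hyperparameters["run_num"]  # ties results back to this experiment's run
job_type_str = "tuning" if is_tuning_job else ("comparison" if is_comparison else "training")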

_tuning_objective_metric is a hyperparameter set by Sagemaker that allows us to easily differentiate between Training and Tuning jobs. As we’ve mentioned before, the run_num is an important setting that allows us to organize our results and version our models in production so they easily connect back to training runs. Finally, job_type_str allows us to further organize our runs as training/tuning and comparison jobs.

Next we determine if gradient accumulation is needed. Briefly, gradient accumulation allows us to set batch sizes that are larger than what the GPUs we’re running on can store in memory:

Control now moves to setting up the Accelerator() object which is the tool for managing distributed processing:
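
A sketch of that setup, assuming wandb as the tracker; the project and group names are illustrative:

from accelerate import Accelerator

accelerator = Accelerator(
    log_with="wandb",
    gradient_accumulation_steps=gradient_accumulation_steps,
)
if accelerator.is_main_process:
    accelerator.init_trackers(
        project_name="document-classification",  # illustrative
        config=hyperparameters,
        init_kwargs={"wandb": {"group": config.s3_parent_dir, "job_type": job_type_str}},
    )
accelerator.print(f"Starting {job_type_str} run {run_num}")  # printed by one process only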

Here we encounter a core concept in Accelerate, is_main_process. This boolean provides a simple way to execute code on only one of the distributed processes. This is helpful if we want to run code as if we’re on a single process; for instance, if we want to store a history of metrics as the training loop executes. We use this boolean to set up wandb so we can easily log metrics to it. Additionally, accelerator.print() is equivalent to wrapping print(...) in if accelerator.is_main_process; it ensures a given statement is only printed once.

Recall that we passed config.training_data_uri to the .fit() call for both Training and Tuning jobs. This downloads all of the training data to the Sagemaker instance’s local disk. Thus, we can use the Datasets load_from_disk() function to load this data. Note that in the following code, SAGEMAKER_LOCAL_TRAINING_DIR is just the path to the directory that data is downloaded to.
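
A sketch of that loading code; the split names, batch-size variable and collator are assumptions:

import evaluate
from datasets import load_from_disk
from torch.utils.data import DataLoader
from transformers import default_data_collator

dataset = load_from_disk(SAGEMAKER_LOCAL_TRAINING_DIR)
f1_metric = evaluate.load("f1")
acc_metric = evaluate.load("accuracy")
per_device_batch_size = int(hyperparameters["per_device_batch_size"])  # illustrative
train_dataloader = DataLoader(
    dataset["train"], shuffle=True, batch_size=per_device_batch_size,
    collate_fn=default_data_collator,
)
eval_dataloader = DataLoader(
    dataset["validation"], batch_size=per_device_batch_size,
    collate_fn=default_data_collator,
)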

Each process loads the dataset, id2label file, metrics and creates dataloaders. Note the use of Huggingface’s evaluate library to load metrics; these can be used in tandem with Accelerate to make metric tracking simple during distributed training. We will see shortly how Accelerator provides one simple function to handle distributed training.

In this code block, we first call the user-defined function load_model to receive the loaded model defined however the experimenter would like. Thus far, this function has typically looked like a call to a Transformers from_pretrained() function, though this is not enforced.

A common learning rate optimizer is created and used to create a learning rate scheduler. Finally, we encounter another core concept in Accelerator, namely, wait_for_everyone(). This function guarantees that all processes have made it to this point before proceeding to the next line of code. It must be called before the prepare() function which prepares all of the values we’ve created thus far for training (in our case, distributed training). wait_for_everyone() is used regularly in Accelerator code; for example, it is nice to have when ensuring that all GPUs have completed the training loop. After the prepare() step, the code enters a function to perform the training and validation loop. Next, we will look at how Accelerator works inside that loop.

At the start of the loop, we initialize a number of values to track throughout training. Here we use is_main_process again to create a single version of metric histories which we will use to plot graphs. In this example, we are only tracking training loss, validation accuracy and f1, but any number of metrics could be tracked here. Next, we enter the loop, set the model in train() mode and enter the train() function:
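
A sketch of the training function's inner step; get_model_specific_batch() is the pipeline helper mentioned below, and the optimizer/scheduler names come from the earlier prepare() discussion:

def train(model, train_dataloader, optimizer, lr_scheduler, accelerator, model_name):
    # A sketch of one epoch of training under Accelerate
    model.train()
    total_loss = 0.0
    for batch in train_dataloader:
        if config.is_comparison:
            batch = get_model_specific_batch(batch, model_name)  # pipeline helper
        with accelerator.accumulate(model):  # handles gradient accumulation
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.detach().float()
            accelerator.backward(loss)  # instead of loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
    return total_loss / len(train_dataloader)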

As execution enters a batch, it first needs to check if we’re running a comparison job. If so, it needs to extract the appropriate parameters for the current model’s forward() function. If you recall, for comparison jobs, in the Data Preparation step we combined all inputs in the same pyarrow format, but prepended with the model_name (e.g. longformer_input_ids). get_model_specific_batch() just returns those parameters of the batch that match the current model_name.

Next, we encounter with accelerator.accumulate(model), a context manager that recently came out in Accelerate and manages gradient accumulation. This simple wrapper reduces gradient accumulation to a single line. Underneath that manager, back propagation should look familiar to readers who have written ML code before; the one big difference is calling accelerator.backward(loss) instead of loss.backward().

Upon completing a training batch, execution sets the model in .eval() mode and moves into the validation loop:
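
A sketch of that loop; the metric objects are the evaluate ones loaded earlier, and the helper names follow the surrounding discussion:

import torch

def validate(model, eval_dataloader, accelerator, model_name):
    # A sketch of distributed evaluation using gather_for_metrics()
    model.eval()
    for batch in eval_dataloader:
        if config.is_comparison:
            batch = get_model_specific_batch(batch, model_name)
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        predictions, references = accelerator.gather_for_metrics(
            (predictions, batch["labels"])
        )
        f1_metric.add_batch(predictions=predictions, references=references)
        acc_metric.add_batch(predictions=predictions, references=references)
    f1 = f1_metric.compute(average="weighted")["f1"]
    accuracy = acc_metric.compute()["accuracy"]
    return f1, accuracy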

Here we encounter another key accelerate function, gather_for_metrics(). This recently added function makes it much easier to gather predictions in a distributed setting so they can be used to calculate metrics. We pass the returned values to the f1_metric and acc_metric objects we created earlier using the Evaluate library. The validation loop then computes the scores and returns them.

After sending the batch through training and validation, we perform tracking on the values we initialized at the beginning:

Since the main process holds the references to our history-tracking data structures, we use is_main_process to append the new values. accelerator.log links up with the init_trackers call we made earlier: .log sends these values to the tracker we initialized, and in our case wandb will create graphs from them. Finally, we use the F1 score to determine the best model over time.

After the training and validation loop is done, we execute:

We start by ensuring that all processes have completed the training/validation loop and then call unwrap_model to extract the model from its distributed containers. Since the main process contains our metric histories, we use it to plot curves for each metric and calculate model statistics; we then return out the best model, curves and statistics.

Now that the training/validation loops are complete and we’ve determined a best model, we need to convert that best model to torchscript and save all the returned files to S3.

Here we call end_training since we are using wandb and use is_main_process since we no longer need distribution. accelerator.save() is the correct way to save the model to disk, but we need to convert it to torchscript to mirror production as closely as possible. Briefly, Torchscript is a way of converting a python-based model into a serializable, production-friendly format that need not have a python dependency. As such, when testing inference on an unseen test set, it is best to test on the model that would be in production. One way to convert a model is to call torch.jit.trace passing it the model and a sample instance which is how we’ve implemented the conversion:
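
A sketch of that conversion, assuming the encoded dataset yields torch tensors; ordered_input_keys() and get_model_specific_batch() are the pipeline's functions discussed below:

import torch

best_model = best_model.to("cpu").eval()
sample = dataset["train"][0]            # one encoded training instance
input_keys = ordered_input_keys()       # forward() parameter names, in order
if config.is_comparison:
    input_keys = input_keys[model_name]
    sample = get_model_specific_batch(sample, model_name)

# forward() expects a batch dimension, so give every tensor a batch size of 1
example_inputs = tuple(sample[key].unsqueeze(0) for key in input_keys)

traced_model = torch.jit.trace(
    best_model,
    example_inputs=example_inputs,
    strict=False,       # keep dict outputs rather than forcing tuples
    check_trace=False,  # re-checking the trace raised spurious errors for us
)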

First, we take the best model and put it in CPU and evaluation mode. We then grab a sample instance out of the training data. Next, we encounter another user-defined function ordered_input_keys(). If you recall, this function returns the parameter names for a model’s forward() function in the correct order. It probably didn’t make sense earlier why this function was needed, but now it should: the example_inputs parameter of torch.jit.trace takes a tuple of input values which must match the exact parameter ordering of the forward() function.

Now, if we’re running a comparison job, then ordered_input_keys() is going to return a dictionary of OrderedDicts keyed by each model’s name. Thus, we test for this scenario and use the same get_model_specific_batch() function we used during training to extract a sample instance for the current model being converted.

Next, we iterate the ordered input keys and call .unsqueeze(0) on each parameter of the sample instance. The reason for this is that the forward() function expects a batch size as the first dimension of the input data; .unsqueeze(0) adds a dimension of 1 to the tensors representing each parameter’s data.

Now we are ready to run the trace, passing the model, the example inputs and setting two parameters to false. The strict parameter controls whether or not you want the tracer to record mutable containers. By turning this off, you can allow, for example, your outputs = model(**batch) to remain a dict instead of a tuple. But you must be sure that the mutable containers used in your model aren’t actually mutated. check_trace checks that the same inputs run through the traced code produce the same outputs; in our case, leaving this True was producing odd errors, likely because of some internal non-deterministic operations, so we set it to False. Again, the ultimate test of the performance of the model is the inference step which we will be discussing next.

Finally, we save the traced model to local disk so it can be uploaded to s3. The final step of the train.py file is to upload all of these generated files to S3. In the case of a tuning job, we only retain the generated files from the run with the best objective metric score:

And with that, we have completed discussing the training/tuning step of the ML Pipeline. Next, we will look at the inference step where we load the torchscript model, perform inference on the unseen test set and collect statistics.

Inference

In the Training/Tuning step, we converted our best model into torchscript, which means it can easily run in a CPU or multi-CPU environment. This enables us to hijack a Sagemaker Processor instance to perform our inference job. Like the previous sections, we will first look at how an inference job is initiated. Because we can use a Processor instance, it is identical to our Data Preparation step except that it points at our /test/ data and our inference.py file.

Refer to the Data Preparation section of the second post to learn more about Processor/ScriptProcessor jobs. Note the differences of input_source_dir pointing at /test/ and `code` pointing at inference.py. Since these are so similar, we will move on to looking at the inference.py file.

We’ve discussed repeatedly the importance of run_num and how it is used to help identify the current experiment not only while training, but also the current model in production (so a production model can be linked to a training experiment). The inference.py will use the experiment parent directory to find the test data and the run_num to find the correct trained model.

The inference.py starts by downloading the id2label file so we can translate between model predictions and human-readable predictions:

Recall from previous sections that the ML pipeline is capable of running comparison jobs (n models trained and tested on the same dataset). Inference is the step where comparison really shines, allowing you to compare performance on identical data. In the next code block, we will load n models to prepare for inference. Recall that if a single model was trained, it is passed as a list with a single value:

This loop iterates the model names, downloads/loads the torchscript converted model and initializes statistics tracking for each. Let’s take a look at each inner function:
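
A sketch of the model-loading helper; the S3 layout and download details are assumptions based on the pipeline's conventions, while torch.jit.load is the real call:

import os

import boto3
import torch

s3_client = boto3.client("s3")

def load_torchscript_model(model_name, config):
    # A sketch: download the traced .pt produced during training, then load it for CPU inference
    key = f"{config.s3_parent_dir}/results/{config.run_num}/{model_name}/traced_model.pt"  # illustrative layout
    local_dir = "/opt/ml/processing/models"
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, f"{model_name}.pt")
    s3_client.download_file(config.bucket, key, local_path)
    model = torch.jit.load(local_path, map_location="cpu")
    model.eval()
    return model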

This function constructs the S3 path where the .pt file resides and downloads it. It then calls torch.jit.load and sets the model to eval mode, ready for inference. init_model_stats initializes the values we will track per model, for each label, which gives us the facts we can use to build statistics:

And init_metrics() simply loads the metrics we used earlier in the training step:

Next, we get the test data from the Data Preparation step:

With the models and data loaded, we are now ready to run inference:

The inference code will use config.is_comparison repeatedly to execute code specific to comparison jobs. It starts by initializing statistics specifically for comparisons, which we will skip for now. Next, it enters the main loop, which iterates through each instance of unseen test data. The ground truth label is extracted and execution enters the inner loop over the model names (in the case of one model this is just a list with a single entry). If is_comparison is true, the data specific to the current model is extracted using the same function used in Training (get_model_specific_batch). The instance is then prepared for the forward() function using the same technique we used in convert_to_torchscript: each value gets .unsqueeze(0) called on it in order to add a batch size of 1 as the first dimension of the tensor.

We then grab the currently loaded model and pass the instance to it. We extract the most confident prediction from the returned logits by calling argmax(-1). Now let’s look at the remainder of the loop (note this begins inside the inner loop):

We take the prediction produced by the model and pass it and the ground truth to our accuracy and f1 metrics. We then increment the counters we initialized at the beginning:

If inference.py is running a comparison job, we then add counts to the structure we initialized earlier; we will skip over these calls and jump to process_statistics which occurs after the inference code has finished looping:

This function looks intimidating, but all it is doing is calculating the F1 score and Accuracy per label, sorting the results by F1 score descending, calculating the overall F1 and Accuracy and uploading the results to S3 under the correct parent dir and run_num.

If you’ve followed the ML Pipeline blogs up to this point, it is worth revisiting the folder structure, laid out in the first blog, that is built on S3 as the entire pipeline executes:

This folder structure recurs for every machine learning experiment, containing everything one would need to quickly understand the experiment or reproduce it and link an experiment to what is in production.

Prima facie, it seems like a simple part of the overall pipeline, but I believe it is one of the most important: imbuing each experiment with desirable properties like navigability, readability, reproducibility, versioning and more.

If you’ve been following these blogs up to this point then you’ve been on quite a journey. I hope they provide some guidance in setting up your own ML Pipeline. As we continue to modify ours, we will post on blog-worthy topics, so stay tuned. You can check out the first two posts in the series here: Part One: Setup, Part Two: Data Steps.

Creating a ML Pipeline Part 2: The Data Steps

Apr 19, 2023 · Pierce Lamb · 12 min read

This is the second post in a three part series on creating a reusable ML pipeline that is initiated with a single config file and five user-defined functions. The pipeline is finetuning-based for the purposes of classification, runs on distributed GPUs on AWS Sagemaker and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.

This post originally appeared on VISO Trust’s Blog

This post will cover the two data steps, data reconciliation and data preparation. These are common steps in a ML process where data is collected, cleaned and encoded the way a model will expect. If you have landed on this post first, check out the first post in the series detailing the pipeline setup. You can also jump to the third post in the series detailing training and testing.

Data Reconciliation

Of all the pipeline steps, the Data Reconciliation step is the one most likely to be customized to your specific use case. It represents the taking-off point for collecting, cleaning, filtering, etc., the training data that will compose your experiment and getting it onto S3. In our case, the raw training data exists in flat files already on S3 while the labels required for supervised training exist in a production database. This is, in fact, why I called it ‘Data Reconciliation’: the production database labels are being reconciled with the flat files on S3.

As it is unlikely the reader has the exact same setup, I will try to highlight some of the re-usable parts of Data Reconciliation without getting too deep into our specific flavor of it. Recall that a major architectural decision in the pipeline is a separate set of training data for every experiment; the goal of this step, then, is to collect the raw data, clean it and copy it to the bucket and folder on S3 where this experiment’s storage will reside (e.g. EXP-3333-longformer/data/reconciled_artifacts).

I’ll create a distinction here between ‘artifacts’ and ‘files’ to better understand what follows. For every ‘artifact’ uploaded into our system, tens of ‘files’ are created that represent data and analysis about the given ‘artifact.’ As such, our raw data is composed of these sets of files per uniquely identified artifact.

The first step in Data Reconciliation is to collect all of the raw data. In our case, this means authenticating to a read replica of the production database, and running a query that contains artifact identifiers related to their ground truth classification labels. We then collect all of the S3 file paths on the production instance of S3 keyed by the same artifact GUID identifier.

Data Reconciliation knows which S3 file paths to collect via a settings.ini value passed by the experimenter called FILES_FROM_PROD. For example, imagine each artifact has a file called raw_text.json; the experimenter would pass FILES_FROM_PROD=raw_text.json and Data Reconciliation would find the S3 path to every raw_text.json file in the production S3 bucket.

Using the artifact identifiers (GUIDs), we then filter the production database results such that both datasets contain the exact same artifact identifiers and drop duplicates using the file hash. At this point the labels and S3 paths to the flat files are now reconciled; the actual files and the label just need to be copied to the correct experiment directory.

Before that copying begins, note that we now have unique insight into the training data for this experiment. Using the filtered database results, we can discover exactly the labels that will be trained on, and the instance count per label:
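
A sketch of what that snippet likely resembled; the dataframe column name and the boto3 client are assumptions:

import json

import boto3

s3_client = boto3.client("s3")
label_counts = df["ground_truth_label"].value_counts().to_dict()  # column name is illustrative
s3_client.put_object(
    Bucket=config.experiment_bucket,
    Key=f"{config.s3_parent_dir}/data/unique_labels_and_counts.json",
    Body=json.dumps(label_counts, indent=2),
)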

Where df is a pandas dataframe of the filtered database results. Now every experiment has a unique_labels_and_counts.json in its /data folder the experimenter can interrogate to see which labels and their counts are associated with this training data set.

At this point, we encounter our first user-defined function. process_func is an optional function that will run after Data Reconciliation has copied files for every artifact identifier; it gives the experimenter the opportunity to execute some arbitrary code for each artifact identifier. As an example, when we go to train we need access to the ground truth labels extracted from the production database. process_func gives us the ability to create an additional file per artifact, say, ground_truth_label.json, that contains this label. Furthermore, if one’s model requires additional files to train on, e.g. an image of a given page, that additional file can be created here, per artifact. Because it’s optional, the user may choose not to define it; thus:
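
A sketch of handling that case; the signature is inferred from the copy_to_s3 discussion further down and the helper names are assumptions:

def default_process_func(artifact_row, existing_files, experiment_path, overwrite):
    # nothing extra to create for this artifact
    return None

process_func = user_defined_process_func or default_process_func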

Now that we have our reconciled data and our process_func, we have to copy data from the production S3 bucket into our experiment S3 directory. This can easily occur in parallel, so we utilize multiprocessing to kick it off as a parallel process:

This function gets the df we discussed earlier, the experiment bucket, the dict of artifact identifier (GUID) to list of desired file paths (raw_training_data_paths), the parent experiment dir (s3_artifact_path), the number of parallel processes (either a config value or multiprocessing.cpu_count()) the process_func and a boolean that determines whether or not to overwrite.

First, it uses the same function that created raw_training_data_paths except pointed at the experiment bucket and with EXP-3333-longformer/data/reconciled_artifacts/ as a filter. This gives us a dict of what training data already exists for the experiment in case Data Reconciliation failed and had been restarted; we don’t copy the same data again. Next, it splits the reconciled data per process and for each split, creates a process and calls the add_to_research_experiment function. Let’s take a look at that function:

The parameters to this function should be fairly straightforward given our discussion of copy_s3_data_in_parallel. The function iterates the data frame chunk directly checking for three different copying scenarios. I am aware that iterating a data frame directly is generally frowned upon in favor of a vectorized approach. In our case, these chunks are fairly small so it is not something we worry about. For each artifact, this function checks to see if, first, overwriting (reload) was set to true, if the current artifact already exists in the experiment and whether or not the proposed artifact has additional files to add to it and finally if it does not exist. In each case it calls an additional function that will copy the correct set of files. Next, let’s take a look at copy_to_s3:

This function is straightforward, and nicely shows what gets passed to process_func if the user has defined it. It gets the row from the df representing the current artifact, the existing files for the artifact _after_ copying, the experiment path and the overwriting boolean. This gives the experimenter a lot of flexibility in what they can do per artifact.

The final step of Data Reconciliation is a validation step where we use the config value FILES_ON_RESEARCH to validate that each artifact has the files it needs for training. The reason we can’t just use the earlier FILES_FROM_PROD value is that new files may have been created in process_func. So FILES_ON_RESEARCH may look like raw_text.json, page_01.png, for example. This validation step is meant to provide some assurance that when we move onto Data Preparation, each artifact will have every file it needs and we don’t need to write code to handle missing files. So after all of our parallel processing completes, validate_data_was_created runs, which we will view in partial stub form:

This function takes the full df, the list of desired files defined by FILES_FROM_PROD, the list of desired files that should be in the experiment FILES_ON_RESEARCH, the experiment directory (EXP-3333-longformer/data/reconciled_artifacts/) and the user defined process_func. It collects all the existing file paths for the given experiment and iterates them, popping file names off FILES_ON_RESEARCH to check if they exist for each artifact. If files are missing, it then discovers if they are FILES_FROM_PROD files and retrieves them from the prod S3 bucket or if they are process_func files which it re-runs to generate them. Once this step is complete, we can have high confidence that all of our raw training data files exist for each artifact. As such, we can move on to Data Preparation.

Data Preparation

The data preparation step is meant to take the raw training files for the experiment and encode them so they are prepared to be input into a model’s forward() function. For this task, we will utilize the HuggingFace Datasets library and specifically its powerful map() function. This is also the first task that will utilize Sagemaker, specifically Sagemaker Processor jobs.

Let’s start by taking a look at how the Processor job is constructed and called. First, we utilize the Sagemaker Python SDK’s ScriptProcessor class. This allows us to run an arbitrary script on a Processor instance. Creating the ScriptProcessor object will look like:

As you can see, this construction is basically defined by config values. Arguably the most important is config.docker_image_path. This carefully constructed docker image which we spoke about in the first post in this series is re-used among all Sagemaker jobs (Processor/Training/Tuning). We spoke in the first post about how an experimenter extends a base image that contains all common dependencies like cuda enabled pytorch, transformers, datasets, accelerate, numpy, etc and adds any of their model-specific dependencies. That base image also contains lines that allow it to run on these different Sagemaker instances, we’ll discuss one now and more during our discussion of training:

Sagemaker Training/Tuning jobs always look in the /opt/ml/code directory for custom dependencies while Processor jobs look in /opt/ml/processing. These lines copy all of our ML pipeline code into these directories to ensure that all custom dependencies are available in either type of job. Now if we jump back over to where we constructed the ScriptProcessor object, this is how we kick off the job:

One feature of Processor jobs that is easy to miss is that before the script is executed, Sagemaker copies everything from the S3 URI provided in the source param onto local disk at the destination path. Building your script around this fact will give you huge performance benefits, which we will discuss more later on. Another important point that may not be immediately obvious is that the command param combined with the code param is basically like defining an ENTRYPOINT for the Processor job. While it’s not exactly accurate, you can imagine these params creating this command in the container:

ENTRYPOINT ["python3", "/opt/ml/code/src/preprocessing/data_preparation.py"]

So the code above is constructing the S3 URI to the reconciled artifacts we created in the Data Reconciliation step and passing it in the source param, and the Processor job copies all of this data to local disk before it kicks off. SAGEMAKER_LOCAL_DATA_DIR defines where that data will be copied and is specified in data_preparation.py so the path can be used there as well. Processor jobs can output data, which is why I’ve defined outputs, but for now the data_preparation.py script is not utilizing this feature. Now that we’ve discussed how it is kicked off, we can take a look at encoding data in data_preparation.py.

The first step at the beginning of encoding is to define the S3 directory where data will be saved and get the label file we produced during Data Reconciliation. We read a config value to get the encoded data dir, namely, ENCODED_DATA_DIR. The value will typically be full_dataset, but it gives the experimenter the ability to produce smaller test datasets if desired (e.g. partial_dataset). So the full path will look like:

encoded_data_dir = f"{config.s3_parent_dir}/data/prepared_data/{config.encoded_data_dir}"

Or EXP-3333-longformer/data/prepared_data/full_dataset

Next, we get the unique_labels_and_counts.json file we uploaded during Data Reconciliation as our ground truth for supervised learning. We give the experimenter the ability to modify the ground truth here through some basic knobs: IGNORED_LABELS and NUM_LABELS_THRESHOLD; I could imagine a number of other options here. These knobs are self explanatory:

After modifying the labels the way the experimenter wants, execution moves onto the get_artifact_paths function. This function gets the paths on local disk that raw training data was copied to and returns them in a format that the Huggingface Datasets library will expect:

get_artifact_paths is called using the same path we passed to Processor.run() to define where data should be copied, along with the results of the MODEL_INPUT_FILES config param. Following our example, this value would simply be [raw_text.json]. A Huggingface Dataset (datasets.arrow_dataset.Dataset) is eventually going to expect data formatted so that each row constitutes an instance of training data and each column holds the path to a needed input file. In our case it would look like:

This would be easy to represent in pandas, but since we’d prefer to not depend on pandas and will utilize Dataset.from_dict(), get_artifact_paths represents this structure using the file names as keys and lists to contain the paths.
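
Concretely, the returned structure looks something like this (paths are illustrative):

# A sketch of what get_artifact_paths returns for MODEL_INPUT_FILES=[raw_text.json];
# each list index corresponds to one artifact, i.e. one training instance
artifact_paths = {
    "raw_text.json": [
        "/opt/ml/processing/input/data/<artifact-guid-1>/raw_text.json",
        "/opt/ml/processing/input/data/<artifact-guid-2>/raw_text.json",
    ],
}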

Execution then enters the directory defined in SAGEMAKER_LOCAL_DATA_DIR and extracts the list of subdirs which, in our case, are guids for each artifact. It iterates these subdirs collecting the filenames for all files that are children of each subdir. It then uses the passed MODEL_INPUT_FILES to validate that each needed file is there and adds it to the artifact_paths dict. We now have a dict that is ready for Datasets processing.

Control now moves to a get_encoded_data() function that will kick off Huggingface’s Dataset.map(), which is a very powerful abstraction for encoding datasets. get_encoded_data is intended to set up the map() function for parallel processing of raw training data encoding and is the main part of the Data Preparation step:

This function sets up the mapper, executes it, splits the returned encoded data and saves the split, encoded data to S3. The function takes the get_artifact_paths data we just generated (as data), a list of the labels only from unique_labels_and_counts.json, a few directory paths and the number of parallel processes to spin up. It starts by generating two label dicts in handle_labels, label2id.json and id2label.json which will be used downstream to convert between the integer values predicted by the model and actual string labels.

Next, one of our user-defined functions, get_dataset_features, is called. As you may have noticed from the hints in the Datasets class paths, Datasets uses PyArrow as the backend for writing and reading data. PyArrow needs a schema to enforce when it writes and reads; get_dataset_features allows the experimenter to write that schema. This function returns a Datasets Features object which packages up this schema for the backend. Following our Longformer example, this function might look like:
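
A hedged sketch of what that function might return for the Longformer example; the exact feature types depend on the model and the encoding choices:

from datasets import ClassLabel, Features, Sequence, Value

def get_dataset_features(unique_labels):
    # input_ids / attention_mask feed Longformer's forward(); labels is the target class
    return Features({
        "input_ids": Sequence(Value("int64")),
        "attention_mask": Sequence(Value("int64")),
        "labels": ClassLabel(names=unique_labels),
    })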

The keys here represent the parameters the Longformer forward() function will expect when performing the forward pass. Now that we have these features, we can call Dataset.from_dict() on our get_artifact_paths data and we are fully ready for the mapper. The mapper has a variety of options, but the core concept is applying a function to every instance of training data that encodes and returns it. Let’s take a closer look at the call in Data Preparation:
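
A sketch of that call; the fn_kwargs contents and num_proc value are illustrative:

encoded_dataset = dataset.map(
    preprocess_data,
    fn_kwargs={"label2id": label2id, "model_input_files": config.model_input_files},
    batched=True,
    features=features,
    remove_columns=dataset.column_names,
    num_proc=num_processes,
)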

Here we pass the function we want to execute per instance, preprocess_data; fn_kwargs allows us to specify additional parameters we want to pass to that function; batched means that preprocess_data will receive batches of data instead of single instances, which allows us to perform additional filtering; features are the features we retrieved from get_dataset_features; remove_columns drops the original column names so they aren’t encoded; and num_proc is the number of processes to split the work between.

With this in place, we can take a look at def preprocess_data which is executed by each process in parallel:

The function first validates that each column of data has the exact same length and returns that length so it can be iterated over. It then iterates the batch, constructing a single instance and passing it to another user-defined function, encode_data. encode_data gives the experimenter the ability to define exactly how a single training instance is encoded with the option of returning None if additional filtering is desired. For instance, say we were using a Huggingface Transformers Tokenizer to encode; a single_instance here represents the file paths to the data we need, so we would get that data, say, in a variable called text_content and call something like this:
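
A sketch of that call inside encode_data (the truncation, padding and length settings are illustrative):

encoded = TOKENIZER(
    text_content,
    truncation=True,
    padding="max_length",
    max_length=4096,  # Longformer's maximum sequence length
)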

Where TOKENIZER is defined as a constant outside the function so it’s not re-constructed each time the function is called. If we continue following preprocess_data, we can see that it simply skips single_instances where encode_data returns None. Finally, the encoded input is returned to the mapper in the correct Features format.

I’m going to skip looking at get_train_valid_test_split(), but suffice it to say that it uses Datasets internal function dataset.train_test_split() to split data using percentages and writes a metadata file that shows the counts of the split and associated labels to the experimenter.

And with that, Data Preparation is complete. Recall from the beginning that this will run as a ScriptProcessor job on a Sagemaker Processor instance. These instances tend to have lots of vCPU’s and can really take advantage of the parallel processing we’re doing in the mapper. The encoded data will end up on S3 ready to be downloaded by a Training or Tuning job which is discussed in the third post in this series. You can jump to the first and third post via these links: Part One: Setup, Part Three: Training and Inference.

Creating a ML Pipeline Part 1: Setup

Apr 19, 2023 · Pierce Lamb · 7 min read

Or rather, creating a reusable ML Pipeline initiated by a single config file and five user-defined functions that performs classification, is finetuning-based, is distributed-first, runs on AWS Sagemaker, uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.

This post originally appeared on VISO Trust’s Blog

This is the introductory post in a three part series. To jump to the other posts, check out Creating a ML Pipeline Part 2: The Data Steps or Creating a ML Pipeline Part 3: Training and Inference

Introduction

On the Data & Machine Learning team at VISO Trust, one of our core goals is to provide Document Intelligence to our auditor team. Every document that passes through the system is subject to collection, parsing, reformatting, analysis, reporting and more. Part of that intelligence is automatically determining what type of document has been uploaded into the system. Knowing what type of document has entered the system allows us to perform specialized analysis on that document.

The task of labeling or classifying a thing is a traditional use of machine learning, however, classifying an entire document — which, for us, can be up to 300+ pages — is on the bleeding edge of machine learning research. At the time of this writing, researchers are racing to use the advances in Deep Learning and specifically in Transformers to classify documents. In fact, at the outset of this task, I performed some research on the space with keywords like “Document Classification/Intelligence/Representation” and came across nearly 30 different papers that use Deep Learning and were published between 2020 and 2022. For those familiar with the space, these included names like LayoutLM/v2/v3, TiLT/LiLT, SelfDoc, StructuralLM, Longformer/Reformer/Performer/Linformer, UDOP and many more.

This result convinced me that trying a multitude of these models would be a better use of our time than trying to decide which was the best among them. As such, I decided to pick one and use the experience of fine-tuning it as a proof-of-concept to build a reusable ML pipeline the rest of my team could use. The goal was to reduce the time to perform an experiment from weeks to a day or two. This would allow us to experiment with many of the models quickly to decide which are the best for our use case.

The result of this work was an interface where an experimenter writes a single config file and five user defined functions that kick off data reconciliation, data preparation, training or tuning and inference testing automatically.

When I set out on that proof-of-concept (pre-ML Pipeline), it took over a month to collect and clean the data, prepare the model, perform inference and get everything working on Sagemaker using distribution. Since building the ML Pipeline, we’ve used it repeatedly to quickly experiment with new models, retrain existing models on new data, and compare the performance of multiple models. The time required to perform a new experiment is about half a day to a day on average. This has enabled us to iterate incredibly fast, getting models in production in our Document Intelligence platform quickly.

What follows is a description of the above Pipeline; I hope that it will save you from some of the multi-day pitfalls I encountered building it.

ML Experiment Setup

An important architectural decision we made at the beginning was to keep experiments isolated and easily reproducible. Every time an experiment is performed, it has its own set of raw data, encoded data, docker files, model files, inference test results, etc. This makes it easy to trace a given experiment across repos/S3/metrics tools and to see where it came from once it is in production. However, one trade-off worth noting is that training data is copied separately for every experiment; for some orgs this may simply be infeasible and a more centralized solution is necessary. With that said, what follows is the process of creating an experiment.

An experiment is created in an experiments repo and tied to a ticket (e.g. JIRA) like EXP-3333-longformer. This name will follow the experiment across services; for us, all storage occurs on S3, so in the experiment's bucket, objects will be saved under the EXP-3333-longformer parent directory. Furthermore, in wandb (our tracker), the top level group name will be EXP-3333-longformer.

Next, example stubbed files are copied in and modified to the particulars of the experiment. This includes the config file and user defined function stubs mentioned above. Also included are two docker files; one dockerfile represents the dependencies required to run the pipeline, the other represents the dependencies required to run the stages that execute on AWS Sagemaker: data preparation, training or tuning, and inference. Both of these docker files are made simple by extending from base docker files maintained in the ML pipeline library; the intent is that they only need to include extra libraries required by the experiment. This follows the convention established by AWS’s Deep Learning Containers (DLCs) and, in fact, our base sagemaker container starts by extending one of these DLCs.

There is an important trade off here: we use one monolithic container to run three different steps on Sagemaker. We preferred a simpler setup for experimenters (one dockerfile) versus having to create a different container per Sagemaker step. The downside is that for a given step, the container will likely contain some unnecessary dependencies which make it larger. Let’s look at an example to solidify this.

In our base Sagemaker container, we extend:

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

This gives us pytorch 1.10.2 with cuda 11.3 bindings, transformers 4.17, python 3.8 and ubuntu all ready to run on the GPU. You can see available DLCs here. We then add sagemaker-training, accelerate, evaluate, datasets and wandb. Now when an experimenter goes to extend this image, they only need to worry about any extra dependencies their model might need. For example, a model might depend on detectron2 which is an unlikely dependency among other experiments. So the experimenter would only need to think about extending the base sagemaker container and installing detectron2 and be done worrying about dependencies.

With the base docker containers in place, the files needed for the start of an experiment would look like:

In brief, these files are:

  • settings.ini: A single (gitignored) configuration file that takes all settings for every step of the ML pipeline (copied into the dockerfiles)
  • sagemaker.Dockerfile: Extends the base training container discussed above and adds any extra model dependencies. In many cases the base container itself will suffice.
  • run.Dockerfile: Extends the base run container discussed above and adds any extra run dependencies the experimenter needs. In many cases the base container itself will suffice.
  • run.sh: A shell script that builds and runs run.Dockerfile.
  • build_and_push.sh: A shell script that builds and pushes sagemaker.Dockerfile to ECR.
  • user_defined_funcs.py: Contains the five user defined functions that will be called by the ML pipeline at various stages (copied into the dockerfiles). We will discuss these in detail later.

These files represent the necessary and sufficient requirements for an experimenter to run an experiment on the ML pipeline. As we discuss the ML pipeline, we will examine these files in more detail. Before that discussion, however, let’s look at the interface on S3 and wandb. Assume that we’ve set up and run the experiment as shown above. The resulting directories on S3 will look like:

The run_number will increment with each subsequent run of the experiment. This run number will be replicated in wandb and also prefixed to any deployed endpoint for production so the exact run of the experiment can be traced through training, metrics collection and production. Finally, let’s look at the resulting wandb structure:

I hope that getting a feel for the interface of the experimenter will make it easier to understand the pipeline itself.

The ML pipeline

The ML pipeline will (eventually) expose some generics that specific use cases can extend to modify the pipeline for their purposes. Since it was recently developed in the context of one use case, we will discuss it in that context; however, below I will show what it might look like with multiple:

Let’s focus in on ml_pipeline:

The environment folder will house the files for building the base containers we spoke of earlier, one for running the framework and one for any code that executes on Sagemaker (preprocessing, training/tuning, inference). These are named using the same conventions as AWS DLCs so it is simple to create multiple versions of them with different dependencies. We will ignore the test folder for the remainder of this blog.

The lib directory houses our implementation of the ML pipeline. Let’s zoom in again on just that directory.

Let’s start with run_framework.py since that will give us an eagle eye view of what is going on. The skeleton of run_framework will look like this:
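A minimal sketch of that skeleton; the module paths, config attribute names and step function names below are assumptions:

import importlib

from ml_pipeline.config import MLPipelineConfig  # assumed module path

def run_framework():
    config = MLPipelineConfig()  # parses settings.ini via Python Decouple
    validate_config(config)

    # Import the use-case specific user defined functions named in the config
    use_case = importlib.import_module(f"use_cases.{config.use_case}.user_defined_funcs")

    if config.run_reconciliation:
        run_data_reconciliation(config, use_case.reconcile_data)
    if config.run_preparation:
        run_data_preparation(config)
    if config.run_training:
        run_training(config)
    if config.run_tuning:
        run_tuning(config)
    if config.run_inference:
        run_inference_test(config)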

The settings.ini file a user defines for an experiment will be copied into the same dir (BASE_PACKAGE_PATH) inside each docker container and parsed into an object called MLPipelineConfig(). In our case, we chose to use Python Decouple to handle config management. In this config file, the initial settings are: RUN_RECONCILIATION/PREPARATION/TRAINING/TUNING/INFERENCE so the pipeline is flexible to exactly what an experimenter is looking for. These values constitute the conditionals above.

Note the importlib line. This line allows us to import use-case specific functions and pass them into the steps (shown here is just data reconciliation) using an experimenter-set config value for use case.

The moment the config file is parsed, we want to run validation to identify misconfigurations now instead of in the middle of training. Without getting into too much detail on the validation step, here is what the function might look like:
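A sketch of that validation, with the helper signatures assumed:

def validate_config(config):
    # Surface misconfigurations now rather than an hour into a training run
    _validate_funcs(config.user_defined_funcs)
    _validate_run_num(config.experiment_name, config.run_num)

def _validate_funcs(funcs):
    # Each user defined function must exist and must not simply be defined as `pass`
    ...

def _validate_run_num(experiment_name, run_num):
    # Raise if this run number already has results on S3 for this experiment
    ...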

The _validate_funcs function ensures that functions with those definitions exist and that they are not defined as pass (i.e. a user has created them and defined them). The user_defined_funcs.py file above simply defines them as pass, so a user must overwrite these to execute a valid run. _validate_run_num throws an exception if the settings.ini-defined RUN_NUM already exists on s3. This saves us from common pitfalls that could occur an hour into a training run.

We’ve gotten to the point now where we can look at each pipeline step in detail. You can jump to the second and third post via these links: Part Two: The Data Steps, Part Three: Training and Inference.

Aug 10 2022
Aug 10
Pierce Lamb7 min read

Aug 10, 2022

(I put Custom Workflow Orchestration In Python in an Art Generator and the above popped out)

This post originally appeared on my employer VISO Trust’s blog. It is lightly edited and reproduced here.

On the Data & Machine Learning team at VISO Trust, one of our core goals is to provide Document Intelligence to the audit team. Every Document that passes through the system is subject to collection, parsing, reformatting, analysis, reporting and more. Every day, we work to expand this feature set, increase its accuracy and deliver faster results.

Why we needed workflow orchestration

There are many individual tasks executed which eventually result in what’s provided by Document Intelligence, including but not limited to:

  • Security Control Language Detections
  • Audit Framework Control ID Detections
  • Named Entity Extraction like organizations, dates and more
  • Decryption of encrypted pdfs
  • Translation of foreign language pdfs
  • Document Classification
  • Document Section Detection

Until our workflow orchestration implementation, the features listed above and more were all represented in code inside a single function. Over time, this function became unwieldy and difficult to read, with snippets of ceremony, controls, logging, function calls and more sprinkled throughout. Moreover, this is one of the most important areas of our app, where new features will be implemented regularly, so the need to clean this code up and make it easier to reason about became clear. Furthermore, execution inside this function occurred sequentially despite the fact that some of its function calls could occur in parallel. While parallel execution isn't required in its current state, we knew that features in the near-term roadmap would necessitate it. With these two requirements:

  • task execution that is easier to reason about and
  • the ability to execute in parallel

We knew we needed to either use an existing workflow orchestration tool or write it custom. We began with some rough analysis of what was going on in our main automation function; namely, we formalized each 'step' into a concept called Task and theorized on which Tasks could execute in parallel. At the time of the analysis, we had 11 'Tasks', each of which required certain inputs and produced certain outputs; based on these inputs and outputs, we determined that a number could run in parallel. With this context, we reviewed some of the major open source Python toolkits for workflow orchestration:

Both of these toolkits (Luigi and Apache Airflow) are designed for managing workflows that have tens, hundreds, up to thousands of tasks to complete and can take days or weeks to finish. They have complex schedulers, user interfaces, failure modes, options for a variety of input and output modes and more. Our pipeline will reach this level of complexity someday, but with an 11 Task pipeline, we decided that these toolkits added too much complexity for our use. We resolved to build a custom workflow orchestration toolkit guided by the deep knowledge in these more advanced tools.

Our custom workflow orchestration

The first goal was to generalize all of the steps in our automation service into the concept of a Task. A few examples of a Task would be:

  • detecting a document’s language,
  • translating a foreign language document,
  • processing OCR results into raw text,
  • detecting keywords inside text,
  • running machine learning inference on text.

Just reading this list gives one a feel for how each Task is dependent on a previous Task's output to run. Being explicit about dependencies is core to workflow orchestration, so the first step in our Task concept was defining what inputs a given Task requires and what outputs it will produce. To demonstrate Tasks, we will develop a fake example Task called DocClassifyInference, the goal of which is to run ML inference to classify a given document. Imagine that our model uses both images of the raw pdf file and the text inside it to make predictions. Our Task, then, will require the decrypted PDF and the paginated text of the pdf in order to execute. Further, when it's complete it will write a file to S3 containing its results. Thus, the start of our example Task might look like:
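A sketch of that start, with the exact key names assumed:

class DocClassifyInference(S3Task):
    # Keys this Task reads from the Pipeline state
    input_keys = ["decrypted_pdf", "paginated_text"]
    # Keys this Task promises to add to the Pipeline state
    output_keys = ["doc_classification"]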

DocClassifyInference subclasses S3Task, an abstract class that enforces defining a method to write to s3. S3Task itself is a subclass of the Task class which enforces that subclasses define input keys, output keys and an execute method. The keys are enforced in a Pipeline class:
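A minimal sketch of such a Pipeline, assuming an in-memory dict as the state store and method names of my own choosing:

class Pipeline:
    def __init__(self):
        self._state = {}  # in-memory Task state

    def get(self, key):
        return self._state[key]

    def set(self, key, value):
        self._state[key] = value

    def check_input_keys(self, task):
        # A Task may only run once everything it declared as input is present
        missing = [key for key in task.input_keys if key not in self._state]
        if missing:
            raise ValueError(f"{type(task).__name__} is missing inputs: {missing}")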

This Pipeline will become the object that manages state as our Tasks execute. In our case we were not approaching memory limits so we decided to keep much of the Task state in-memory though this could easily be changed to always write to and read from storage. As a state manager, the Pipeline can also capture ceremony prior to executing any Tasks that downstream Tasks may require.

Continuing on with DocClassifyInference, as a subclass of the abstract class Task, DocClassifyInference will have to implement def execute as well (enforced by Task). This method will take a Pipeline and return a Pipeline. In essence, it receives the state manager, modifies the state and returns it so the next Task can operate on it. In our example case, execute will extract the decrypted pdf and paginated text so they can be used as inputs for a ML model to perform document classification. Let’s look at the entire stubbed out DocClassifyInference:
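A stub along those lines, where write_to_s3 stands in for the method S3Task enforces and _classify elides the actual model call (both names are assumptions):

class DocClassifyInference(S3Task):
    input_keys = ["decrypted_pdf", "paginated_text"]
    output_keys = ["doc_classification"]

    def execute(self, pipeline):
        # Pull the declared inputs out of the state manager
        pdf = pipeline.get("decrypted_pdf")
        pages = pipeline.get("paginated_text")

        # Run ML inference over the pdf images and page text (details elided)
        classification = self._classify(pdf, pages)

        # Persist the result to S3 and set the declared output before handing the state back
        self.write_to_s3(classification)
        pipeline.set("doc_classification", classification)
        return pipeline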

It’s easy to see how DocClassifyInference gets the Pipeline state, extracts what it needs, operates on that data, sets what it has declared it’s going to set and returns the Pipeline. This allows for an API like this:
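Presumably something along these lines, chaining each Task by hand:

pipeline = Pipeline()
pipeline = DecryptPDF().execute(pipeline)
pipeline = PaginatedText().execute(pipeline)
pipeline = DocClassifyInference().execute(pipeline)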

Which of course was much cleaner than what we had previously. It also lends itself to writing easy, understandable unit tests per Task as well as adhering more closely to functional programming principles. So this solves our first goal of making the code cleaner and more easy to reason about. What about parallel processing?

Parallel Processing

Similar to Luigi and Apache Airflow, the goal of our workflow orchestration is to generate a topologically sorted Directed Acyclic Graph of Tasks. In short, having each Task explicitly define its required inputs and intended output allows the Tasks to be sorted for optimal execution. We no longer need to write the Tasks down in sequential order like the API described above, rather we can pass a Task Planner a list of Tasks and it can decide how to optimally execute them. What we’ll want then is a Task Planner that is passed a List of Tasks, sorts the Tasks topologically and returns a list where each member is a list that contains Tasks. Let’s take a look at what this might look like using some of our examples from above:
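For example, a deliberately unordered list handed to the planner (class names taken from the description that follows):

task_list = [
    CreateCSVOutput(),
    KeywordDetection(),
    RunDocInference(),
    PaginatedText(),
    DecryptPDF(),
]
task_plan = get_task_plan(task_list)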

Here I have retained our examples while adding two new Tasks: KeywordDetection and CreateCSVOutput. You can imagine these like matching keywords in the paginated text and modifying the results of RunDocInference & KeywordDetection to create a formatted CSVOutput. When the Task Planner receives this list, we’ll want it to topologically sort the tasks and output a data structure that looks like this:
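A plausible reconstruction of that structure, with each inner list acting as a stage:

expected_task_plan = [
    [DecryptPDF()],                           # no dependencies
    [PaginatedText()],                        # needs the decrypted PDF
    [RunDocInference(), KeywordDetection()],  # both only need earlier outputs, so they can run in parallel
    [CreateCSVOutput()],                      # needs the outputs of the previous stage
]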

In the above List, you can imagine each of its members is a 'stage' of execution. Each stage has one-to-many Tasks; in the case of one, execution occurs sequentially and in the case of many, execution occurs in parallel. In English, the expected_task_plan can be described like so:

  • DecryptPDF depends on nothing and creates a consumable PDF,
  • PaginatedText depends on a consumable PDF and creates a list of strings
    - RunDocInference depends on both and classifies the document
    - KeywordDetection depends on paginated text and produces matches
  • CreateCSVOutput depends on doc classification and keyword detection and produces a formatted CSV of their outputs.

An example of the function that creates the expected_task_plan above might look like:
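A sketch of such a planner; the helper details are compressed and the exact signatures are assumptions:

from graphlib import TopologicalSorter

def get_task_plan(tasks):
    # Map each output key to the Task that produces it, failing on duplicates
    producers = {}
    for task in tasks:
        for key in task.output_keys:
            if key in producers:
                raise ValueError(f"Duplicate output key: {key}")
            producers[key] = task

    # A Task depends on whichever Tasks produce its input keys
    sorter = TopologicalSorter()
    for task in tasks:
        dependencies = {producers[key] for key in task.input_keys if key in producers}
        sorter.add(task, *dependencies)

    # Walk the sorted graph, emitting each group of ready Tasks as a 'stage'
    sorter.prepare()
    plan = []
    while sorter.is_active():
        ready = list(sorter.get_ready())
        plan.append(ready)
        sorter.done(*ready)
    return plan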

This function gets the list of Tasks, ensures that no two Task outputs have identical keys, adds the nodes to a sorter by interrogating the Task input_keys and output_keys and sorts them topologically. In our case the sorter comes from graphlib’s TopologicalSorter which is described here. Getting into what each of these functions are doing would take us too far afield so we will move on to executing a task plan.

With the expected_task_plan shown above, an execute_task_plan() function is straightforward:
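A sketch of that function, assuming Tasks in a parallel stage share and mutate the same Pipeline:

def execute_task_plan(task_plan, pipeline):
    for stage in task_plan:
        if len(stage) == 1:
            # A single Task in this stage: execute it sequentially
            pipeline = stage[0].execute(pipeline)
        else:
            # Multiple Tasks: run each in its own thread
            threads = [TaskThread(target=task.execute, args=(pipeline,)) for task in stage]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()  # re-raises any exception captured in the child thread
    return pipeline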

Here we iterate over the task list, deciding between sequential execution or parallel execution. In the latter case, we utilize Python's threading.Thread class to create a thread per task and use idiomatic methods for starting and joining threads. Wait, then what is TaskThread?

In our case, we wanted to ensure that an exception in a child thread will always be raised to the calling thread so the calling thread can exit immediately. So we extended the threading.Thread class with our own class called TaskThread. Overriding threading.Thread’s .run() method is fairly common (so common that it’s suggested in run()’s comments); we overrode run() to set an instance variable carrying an exception’s content and then we check that variable at .join() time.
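A sketch of that subclass:

import threading

class TaskThread(threading.Thread):
    # A Thread that re-raises child-thread exceptions in the calling thread

    def run(self):
        self.exc = None
        try:
            super().run()  # runs the target function as usual
        except Exception as e:
            self.exc = e   # stash the exception instead of letting it die with the thread

    def join(self, timeout=None):
        super().join(timeout)
        if getattr(self, "exc", None) is not None:
            raise self.exc  # surface the failure at .join() time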

The calling thread can now try/except at .join() time.

Conclusion

With these structures in place, the file containing the automation service’s primary functions was reduced from ~500 lines to ~90. Now when we create our threadpool to consume SQS messages, we get the Task plan like so task_plan = get_task_plan() and pass the task_plan into each thread. Once execution reaches the main function for performing document intelligence, what previously was a large section of difficult-to-read code now becomes:

The introduction of parallel processing of these Task’s shaved consistent time off of performing document intelligence (an average of about a minute). The real benefit of this change, however, will come in the future as we add more and more Tasks to the pipeline that can be processed in parallel.

While we’ve reduced the time-to-audit significantly from the former state-of-the-art, we are definitely not done. Features like the above will enable us to continue reducing this time while maintaining consistent processing times. We hope this blog helps you in your workflow orchestration research.

Apr 09 2021
Apr 09
Pierce Lamb11 min read

Apr 9, 2021

This blog covers how to set up and use Personalized Paragraphs. If you’re looking for how Personalized Paragraphs was built, check out Personalized Paragraphs: Porting Smart Content to Paragraphs for Drupal. If you have any questions you can find me @plamb on the Drupal slack chat, there is also a #personalized_paragraphs channel.

You’ve visited the Personalized Paragraphs Module Page and downloaded the module. In doing that, composer should have also downloaded Smart Content, Entity Usage, and Paragraphs. You can verify this by visiting /admin/modules and checking that the modules are there. If they are not enabled, make sure Entity Usage, Paragraphs/Paragraphs Library, Smart Content, Smart Content Browser and Personalized Paragraphs are enabled. Now what?

Segment Sets

The entry point for using Smart Content is the Segment Set. Segment Sets define how you want to segment traffic via given conditions. We’ll be using the conditions out of the Smart Content Browser module as an example for this blog. As such, imagine that you want to segment traffic based on browser width. For your first segment, perhaps you want to segment visitors based on if their browser is greater than 1024px wide, or less than 1024px (I know this is a silly example, but it is nice for basic understanding). So based on the 1024px breakpoint, you want users above this width to see a certain experience and users below it to see a different one. We’ll define this in a Segment Set.

  • Visit /admin/structure/smart_content_segment_set
  • Click ‘Add Global Segment Set’.
  • Give your segment set a label like ‘Browser Width Segments.’
  • Click ‘Add Segment’
  • Give your segment a title like ‘less-than-1024px’
  • In the condition dropdown look for the ‘Browser’ header and select ‘Width’
  • Click ‘Add Condition’
  • Set the width to ‘less than’, ‘1024’px
  • Click ‘Add Segment’
  • Give your segment a title like ‘greater-than-1024px’
  • In the condition dropdown look for the ‘Browser’ header and select ‘Width’
  • Click ‘Add Condition’
  • Set the width to ‘greater than’, ‘1024’px
  • Click ‘Add Segment’
  • In the label area, add ‘Default Segment’
  • Under ‘Common’ select the condition ‘True’
  • Check ‘Set as default segment’
  • Save

With Browser Width Segments in place, we now have a way of segmenting traffic based on the width of the users browser (I recognize we only needed the less-than-1024px and default segments here, but it helps for learning to show all conditions). Users with browsers less than 1024px wide will see one piece of content and users with browsers greater will see a different piece. We added the default segment for clarity; in a situation where a user does not match any segment, it will display.

Personalized Content

Now we have to decide what content we’d like to personalize based on browser width. Let’s say we have a content type called ‘Homepage’ and we want to personalize the banner area of our homepage. Ideally we would use a paragraph to represent the banner and display different banners based on browser width. The first thing we want to do is create the paragraph that will be our banner.

  • Open the homepage node’s structure page (or whichever node you’re personalizing)
  • Take note of the fields it currently contains that constitute the banner
  • In another tab or window, visit /admin/structure/paragraphs_type/add
  • Give it a name like ‘Personalization — Homepage Banner’
  • Make sure to check the box that says ‘Allow adding to library.’ If this box is missing, Paragraphs Library likely is not enabled; check /admin/modules and see if it’s enabled
  • Inside your new Paragraph, re-create each field that represents the banner from the tab you have open on the homepage.
  • I typically add a boolean field to check if the paragraph is a personalized paragraph and
  • I typically add a campaign ID text field to add a campaign to the paragraph which can be pushed into the dataLayer
  • Neither of these are necessary for this tutorial, but may be to you in the future
  • Save your paragraph
  • Now visit /admin/content/paragraphs (added via Paragraphs Library)
  • Click ‘Add Library Item’
  • Add a label like ‘Homepage Banner — less-than-1024px’
  • Click the paragraph dropdown and select ‘Personalization — Homepage Banner’
  • Fill in the fields
  • Save
  • Click ‘Add Library Item’
  • Add a label like ‘Homepage Banner — greater-than-1024px’
  • Click the paragraph dropdown and select ‘Personalization — Homepage Banner’
  • Fill in the fields
  • Save
  • Do this one more time, but for ‘Homepage Banner — Default Banner’

If you’ve completed these steps, you now have 3 personalized banners that match the 3 segments we created above. Okay, so we have our segment set and our personalized content, now what?

Personalized Paragraphs

The next step is adding a Personalized Paragraph to the node you’re personalizing. In order to do that, we:

  • Add a new field to our homepage node.
  • Select the Add a New Field dropdown, find the ‘Reference Revisions’ header and select ‘Paragraph’
  • Give it a name like ‘Personalized Banner’
  • Save
  • On the ‘Field Settings’ page, under ‘allowed number of values’ select ‘Limited’ — 1 (this may change in the future)
  • Now, on the ‘Edit’ page for this new field, under the ‘Reference Type’ fieldset find and select ‘Personalized Paragraph’
  • Save

If you’ve completed all of these steps (note that you may need to flush caches), you can now load an instance of your homepage node and click ‘edit’ or add a new homepage node, and you should see something that looks like this on the page (if not, click the ‘Add Personalized Paragraph’ button):

The first step is to give your personalized paragraph a name that is unique within the page, like ‘personalized_homepage_banner.’ It is possible to continue while leaving this field blank (and you can change it later); it is used only for identifying a personalized paragraph in front end code. Next we should find our segment set, ‘Browser Width Segments’, in the segment set dropdown and press ‘Select Segment Set.’ After it finishes loading, we should see 3 reactions which match the 3 segments we created earlier. In the paragraph dropdowns, we’ll select the respective paragraph we made for each segment, e.g. ‘Homepage Banner — less-than-1024px’ will go in the segment titled ‘SEGMENT LESS-THAN-1024PX.’ After saving we will be redirected to viewing the saved page.

You may or may not see any change to your page at this point; it depends on your template file for this page. The content itself will be behind content.field_personalized_banner in your respective template file. One way to see that it’s on the page is to search the HTML for whatever you named the paragraph type used by Paragraphs Library. In our example it would be personalization-homepage-banner; this string should appear in the classes of an element in the HTML. The key, however, is that you now have access to the winning paragraph in your template file. Here’s an example of how we might display field_personalized_banner in a template file:

{% if content.field_personalized_banner %}
  {{ content.field_personalized_banner }}
{% endif %}

Assuming you are seeing the content of your Personalized Paragraphs on the page, you can change your browser size to either larger than 1024px or less than, refresh, and you should get the other experience. If you don’t, this is likely because Smart Content stores information in local browser storage about the experience you first received (specifically the _scs variable); it checks this variable before doing any processing for performance reasons. One way around this is to load a fresh incognito window, resize it to the test width and then load your page; another is to go into your local storage and delete the _scs variable.

There are many ways to segment traffic in Smart Content beyond browser conditions, Smart Content provides sub modules for Demandbase, UTM strings and 6sense. I’ve also written a module for Marketo RTP which we will be open sourcing soon. I wrote a blog about it here which can be used as a guide for writing your own connector.

At this point in using Personalized Paragraphs, there are a lot of ways you could go with displaying the front end and you definitely don’t need to read the remainder of this blog. I’ll cover the way my org displays it, but I’ll note that there isn’t some sort of best practice, it is just what works for us. You absolutely do not need to use Personalized Paragraphs in this manner and I encourage you to experiment and find what works for you.

Front end processing

You may have noticed that our personalized banner field is ‘overwriting’ the existing banner fields on our homepage node (as opposed to replacing). That is, our original banner fields and personalized banner field are doing the same ‘thing’ and now we have two of them on the page. The reason for this decision is based on how Smart Content works. When the page is loaded Smart Content decides which paragraph has won then uses ajax to retrieve that paragraph and load it onto the page. If we replaced the original banner with the Personalized Paragraph, the banner would appear to ‘pop-down’ after the page had begun loading whenever the ajax returned. We felt that this experience was worse than the small performance hit of loading both banners onto the page (it also provides a nice fall back should anything fail). This ‘pop-down’ effect of course only occurs when you’re personalizing a portion of the page that contributes to the flow; if it was an element that doesn’t you’d see more of a ‘pop-in’ effect which is less jarring.

Because we are loading two of the same ‘things’ onto the page, we need to use javascript to decide which one gets displayed. We need a way to inform some JS code of whether or not a decision paragraph is on the page (else display fallback) and also differentiate between which decision paragraph we’re operating on (in case there is more than one on the page). The entry point for this goes back to how Personalized Paragraphs was built, to a more controversial point near the end. In a hook_preprocess_field__entity_reference_revisions function we load the Personalized Paragraph onto the page then execute this code:

$para_data = [
  'token' => $token,
];
$has_name = !$para->get('field_machine_name')->isEmpty();
$name = $has_name ? $para->get('field_machine_name')->getValue()[0]['value'] : '';
$variables['items'][0]['content']['#attached']['drupalSettings']['decision_paragraphs'][$name] = $para_data;

In essence, this code stores the information we’re after in drupalSettings so we have access to it in javascript. We get a key, [‘decision_paragraphs’], which contains an associative array keyed by personalized paragraph machine names, with their decision tokens as values. With this information we can now manipulate the front end as we need.

Before I show how we display the default or the winner on the front end, I want to reiterate that this is not a best practice necessarily; it’s just a design decision that works for us. All of the below code is stored in a custom module specifically for personalization. First, we create a js file called ‘general_personalization.js’ that will be attached on any page where Personalized Paragraphs run. To continue following our example, we add this at the top of our homepage template:

{{ attach_library('/general_personalization_js') }}

Following the style much of the smart content javascript is written in, this file defines an object that can be accessed later by other JS files:

Drupal.behaviors.personalization = {};

Drupal.behaviors.personalization.test_for_decision_paragraph =
  function(paragraphs_and_functions, settings) {
    ...
  };

This first function will test if a decision paragraph is on the page:

function(paragraphs_and_functions, settings) {
  if (settings.hasOwnProperty('decision_paragraphs')) {
    paragraphs_and_functions.forEach((display_paragraphs, paragraph_name) => {
      if (!settings.decision_paragraphs.hasOwnProperty(paragraph_name)) {
        var show_default = display_paragraphs['default'];
        show_default();
      }
    });
  } else {
    paragraphs_and_functions.forEach((display_paragraphs, paragraph_name) => {
      var show_default = display_paragraphs['default'];
      show_default();
    });
  }
};

It first tests whether the key ‘decision_paragraphs’ exists in the passed settings array. If it doesn’t, it takes the passed paragraphs_and_functions Map, iterates over each Personalized Paragraph machine name, grabs the function that displays the default experience and executes it. If ‘decision_paragraphs’ does exist, it does the same iteration, checking whether each machine name in the Map is represented under that decision_paragraphs key; if not, it gets that name’s default function and executes it. This means we can change machine names, delete personalized paragraphs, etc. and guarantee that the default experience will display no matter what we do. So how do we call this function with the right parameters?

We create a new file, personalization_homepage.js which is now attached directly underneath the previous file in the homepage template:

{{ attach_library('/general_personalization_js') }}
{{ attach_library('/personalization_homepage') }}

This file will only ever be attached to the homepage template. Inside this file, we create an object to represent the default and personalized experiences for our banner:

var banner_display_functions = {
  'default': personalization_default_banner,
  'personalized': personalization_set_banner
};

The values here are functions defined elsewhere in the file that execute the JS necessary to display either the default or personalized experiences. Next we create a Map like so:

var paragraphs_and_functions = new Map([
  ['personalization_homepage_banner', banner_display_functions],
]);

The map has personalized paragraph machine names as keys and the display object as values. You can imagine that with another Personalized Paragraph on the page, we’d just add it as a member here. An unfortunate side effect of this design is the hardcoding of those machine names in this JS file. I’m sure there is a way around this, but for performance reasons and how often we change these paragraphs, this works for us. With this in place, calling our test_for_decision_paragraph function above is straightforward:

Drupal.behaviors.personalizationHomepage = {
  attach: function (context, settings) {
    if (context === document) {
      Drupal.behaviors.personalization.test_for_decision_paragraph(paragraphs_and_functions, drupalSettings);
    }

    ...
  }
};

This populates test_for_decision_paragraph with the correct values. It is inside context === document to ensure that it executes at only the right time (without this control the default banner will ‘flash’ multiple times). So now the code is in place to test for a decision paragraph and display the default experience if it is not there. What about displaying the winning personalized experience?

We create another function inside general_personalization.js, ‘test_for_winner’:

Drupal.behaviors.personalization.test_for_winner =
  function(current_decision_para_token, paragraphs_and_functions) {
    // Iterate over the display blocks, matching display block UUID to decision_block_token
    paragraphs_and_functions.forEach((display_paragraphs, paragraph_name) => {
      if (drupalSettings.decision_paragraphs.hasOwnProperty(paragraph_name)) {
        var decision_paragraph_token = drupalSettings.decision_paragraphs[paragraph_name].token;
        if (decision_paragraph_token === current_decision_para_token) {
          var show_winner = display_paragraphs['personalized'];
          show_winner($(event.detail.response_html));
        }
      }
    });
  };

(note it may be event.detail.data for you)

Remember the hook_preprocess function in which we attached the decision token of each personalized paragraph keyed by machine name? Our test_for_winner iterates those machine names, extracting each decision token and comparing it to the decision token of the current winning paragraph; when a match is found, it uses that machine name to find the function that displays the personalized experience and executes it. So how do we call this function? Inside the same Drupal.behaviors.personalizationHomepage {…} in personalization_homepage.js we add:

window.addEventListener('smart_content_decision', function (event) {
  Drupal.behaviors.personalization.test_for_winner(event.detail.decision_token, paragraphs_and_functions);
});
}

(note: we’ve made some edits to the broadcasted event object internally. I invite you to print this object and see what it contains.)

And here is where we touch one of the hooks Smart Content offers for front end processing: when a winner is chosen, Smart Content broadcasts a ‘smart_content_decision’ event which contains a bunch of information, including the decision token of the winning content. This is what test_for_winner compares against the existing personalized paragraphs to select a display function. Smart Content will broadcast this event for every winning paragraph (e.g. if we have n personalized paragraphs on the page, the event will broadcast n times, each with that paragraph’s winning token), and test_for_winner lets us know which paragraph it’s currently broadcasting for and execute that paragraph’s personalized display function.

There are a number of other cool things we can do with this broadcasted event, for example, pushing the campaign name of the winning paragraph into the dataLayer, but I will leave this for another time.

I hope that the example we’ve used throughout has helped you to better understand how to use Personalized Paragraphs and the brief tour of our front end design has given you a starting point.

If you’re looking for how Personalized Paragraphs was built, check out Personalized Paragraphs: Porting Smart Content to Paragraphs for Drupal. If you have any questions you can find me @plamb on the Drupal slack chat, there is also a #personalized_paragraphs channel.

Mar 15 2021
Mar 15
Pierce Lamb24 min read

Mar 15, 2021

This blog covers how the Personalized Paragraphs module was built, if you’re looking for how to use Personalized Paragraphs, check out the How To Use Personalized Paragraphs blog. If you have any questions you can find me @plamb on the Drupal slack chat, there is also a #personalized_paragraphs channel.

In 2020, I was tasked by my organization with finding or developing a personalization solution for our Drupal-based website. By personalization, I mean a tool that will match anonymous users into segments and display a certain piece of content based on that segmentation. Brief searching led me into the arms of Smart Content, a platform for personalization developed by the clever folks over at Elevated Third. Smart Content is a toolset for managing segmentation, decisions, reactions etc all within the Drupal framework. As a general platform, it makes no assumptions about how you want to, say, display the content to the user or pass results back to your analytics platform. However, it comes with a number of sub modules so you don’t need to develop these solutions on your own. Out of the box, Smart Content includes ‘Smart Content Block’ which allows you to utilize Drupal’s Block interface to manage your personalized content. There are a number of reasons this was a good idea, but it also presented some difficulties (at least for us).

After installing Smart Content, the most straightforward way to use personalized blocks was to create a Smart Content Decision Block in the block layout builder. However, to get control over where the block was placed (i.e. instead of in a region across many pages), we needed to disable the block, load it independently in a preprocess and attach it to the relevant page’s theme variables; a bit cumbersome. I recognize that there are other options like Block Field out there, but this appeared to be the most out-of-the-box way to use Smart Content Block. As a block-based solution, we found that we had to make changes to the blocks on prod then drag the changes back to our development branches and environments because exporting block config would cause UUID issues on merge. As our use cases grew, this became more cumbersome. In addition, my organization heavily leans on Paragraphs to power content inside of Nodes (and very sparingly uses blocks). After about 6 months of using Smart Content we decided we should see if we could utilize Paragraphs to power personalization.

The funny thing about Paragraphs is that they don’t ‘float free’ of Nodes in the same way that blocks do; at their core they are referenced by Nodes. Or at least so I thought. When we discussed using paragraphs I did some brief research and saw that others had successfully attempted porting Smart Content to Paragraphs. Upon testing this module, I found that it relied on an old, fairly different version of Smart Content and also included a lot of extra code relevant to that organization’s use case. Further, it lacked the extremely well-thought-out interface for adding a segment set and reactions that’s contained in Smart Content Block. However, the key insight its authors included was the use of the Paragraphs Library. Paragraphs Library is an optional sub module of Paragraphs that was quietly added in 2018 and allows users to create Paragraphs that ‘float free’ in just the way we’d need to personalize them. With this in hand, I thought I would try porting the experience of Smart Content Block to Paragraphs.

The Supporting Structure

The porting process began by digging into the smart_content_block sub-module of Smart Content. The entry point is Plugin/Block/DecisionBlock.php which appeared to be an Annotated Block plugin. When constructed, it had a control structure which created a further plugin ‘multiple_block_decision’ which I found defined in Plugin/smart_content_block/Decision. Further, in one of MultipleBlockDecision’s functions, it creates an instance of the display_blocks plugin which was defined in Plugin/smart_content_block/Reaction:

I knew that these three files combined must work together to create the nice user experience that administrating smart_content_block currently has. So I set about to emulate them, but with Paragraphs instead of blocks.

Paragraphs did not come pre-packaged with an obvious Annotation plugin to achieve what I wanted, so I created one. I sought to mimic the one included with Blocks and thus in the Annotation/PersonalizedParagraph.php file, I defined it as: class PersonalizedParagraph extends Plugin {…}. With this in hand I could now create a Plugin/PersonalizedParagraph/DecisionParagraph.php that mimicked smart_content_block’s DecisionBlock:

/**
* Class Decision Paragraph.
*
* @package Drupal\personalized_paragraphs\Plugin\PersonalizedParagraph
*
* @PersonalizedParagraph(
* id = "personalized_paragraph",
* label = @Translation("Personalized Paragraph")
* )
*/

However, before I defined class DecisionParagraph, I knew I needed to extend something similar to BlockBase and implement ContainerFactoryPluginInterface just like DecisionBlock.php does. I opened Core/Block/BlockBase.php and attempted to mirror it as closely as I could. I created personalized_paragraph/Plugin/PersonalizedParagraphBase.php. Here is the comparison between the two:

abstract class BlockBase extends ContextAwarePluginBase implements BlockPluginInterface, PluginWithFormsInterface, PreviewFallbackInterface {

use BlockPluginTrait;
use ContextAwarePluginAssignmentTrait;
...
}

abstract class PersonalizedParagraphsBase extends ContextAwarePluginBase implements PersonalizedParagraphsInterface, PluginWithFormsInterface, PreviewFallbackInterface {

use ContextAwarePluginAssignmentTrait;
use MessengerTrait;
use PluginWithFormsTrait;
...
}

And other than cosmetic function name changes, the classes are largely the same. They implement PluginWithFormsInterface which is defined as:

Plugin forms are embeddable forms referenced by the plugin annotation. Used by plugin types which have a larger number of plugin-specific forms.

Which certainly sounds like exactly what we need (a way to plug one form into another). You may have noticed one difference though, I had to create an interface, PersonalizedParagraphsInterface to mirror BlockPluginInterface. Again, these two files are largely the same, I’ll leave it to the reader to check them out.

At this point, I now had the beginning of a DecisionParagraph.php and the files that back it, Annotation/PersonalizedParagraphs.php, PersonalizedParagraphsBase.php and PersonalizedParagraphsInterface.php. Since DecisionParagraph.php is a plugin, I knew I’d need a Plugin Manager as well. My next step was to create a /Plugin/PersonalizedParagraphsManager.php. This file is as default as it gets when it comes to a Plugin Manager:

class PersonalizedParagraphsManager extends DefaultPluginManager {

/**
* Constructs a new ParagraphHandlerManager object.
*
* @param \Traversable $namespaces
* An object that implements \Traversable which contains the root paths
* keyed by the corresponding namespace to look for plugin implementations.
* @param \Drupal\Core\Cache\CacheBackendInterface $cache_backend
* Cache backend instance to use.
* @param \Drupal\Core\Extension\ModuleHandlerInterface $module_handler
* The module handler to invoke the alter hook with.
*/
  public function __construct(\Traversable $namespaces, CacheBackendInterface $cache_backend, ModuleHandlerInterface $module_handler) {
    parent::__construct(
      'Plugin/PersonalizedParagraph',
      $namespaces,
      $module_handler,
      'Drupal\personalized_paragraphs\Plugin\PersonalizedParagraphsInterface',
      'Drupal\personalized_paragraphs\Annotation\PersonalizedParagraph'
    );

    $this->alterInfo('personalized_paragraphs_personalized_paragraphs_info');
    $this->setCacheBackend($cache_backend, 'personalized_paragraphs_personalized_paragraphs_plugins');
  }
}

You can see in its constructor that it gets created with all the key files we need to create the DecisionParagraph plugin. Note that a plugin manager also requires a module.services.yml which I defined as follows:

services:
  plugin.manager.personalized_paragraphs:
    class: Drupal\personalized_paragraphs\Plugin\PersonalizedParagraphsManager
    parent: default_plugin_manager

I knew at this point I must be really close. However if you recall the screenshot above, I was still missing mirrors of MultipleBlockDecision and DisplayBlocks. My next step was to create /Plugin/smart_content/Decision/MultipleParagraphDecision.php and /Plugin/smart_content/Reaction/DisplayParagraphs.php. While over the course of building Personalized Paragraphs these files would get edited, the class stubs would be identical. This is largely because Smart Content creates Annotated Plugin types for many of its core functions which makes it extremely easy to extend. Comparing MultipleBlockDecision and MultipleParagraphDecision:

/**
* Provides a 'Multiple Block Decision' Decision plugin.
*
* @SmartDecision(
* id = "multiple_block_decision",
* label = @Translation("Multiple Block Decision"),
* )
*/
class MultipleBlockDecision extends DecisionBase implements PlaceholderDecisionInterface {
...
}
/**
* Provides a 'Multiple Paragraph Decision' Decision plugin.
*
* @SmartDecision(
* id = "multiple_paragraph_decision",
* label = @Translation("Multiple Paragraph Decision"),
* )
*/
class MultipleParagraphDecision extends DecisionBase implements PlaceholderDecisionInterface {
...
}

And this is isomorphic in the case of DisplayBlocks.php and DisplayParagraphs.php. With MultipleParagraphDecision and DisplayParagraphs in place, I just had to go change where they were created from multiple_block_decision -> multiple_paragraph_decision in DecisionParagraph and display_blocks -> display_paragraphs in MultipleParagraphDecision. At this point, my /src/ folder structure looked like this:

Very close to the structure of smart_content_block. Okay so now I have all this plugin code defined, but how will Drupal know when and where to create instances of personalized_paragraph?

When and Where does this run?

The first step was to create a Paragraph Type called ‘Personalized Paragraph’. Simple enough. At the time I created this, I did not think it would need any fields, but we will discuss later why it ended up needing one. The Personalized Paragraph Type would be the entry point for a Paragraph inside a node to basically say “hey I’m going to provide personalized content.”

Our first ever use case for personalized content was our homepage banner, so to test my code, I created another Paragraph type called Personalization — Homepage Banner (the reason for this naming convention is that you can imagine many personalization use cases all being grouped together by starting with ‘Personalization -’). The key switch I needed to flip in creating this test Paragraph was this:

“Allow adding to library” meant that this specific Paragraph Type could have members created in the Paragraphs Library that ‘float free’ from any node. With that flipped, I just needed to mirror the fields that produce our homepage banner in this Paragraph Type. Now I could load the Paragraphs Library, /admin/content/paragraphs, and create every personalized paragraph I needed to support personalizing the homepage banner. This step is discussed in more detail in the How To Use Personalized Paragraphs blog.

Now, in order to test the ‘when and where’ question above, I loaded our ‘homepage’ content type and added a new field, ‘Personalized Banner’ that referenced a ‘Paragraph’:

And in the Paragraph Type to reference I selected Personalized Paragraph:

The Personalized Banner field was now telling our homepage node that it would contain personalized content. With this structure in place, I could now programmatically detect that a personalized_paragraph was being edited in the edit form of any homepage node and displayed when a homepage node was viewed. Further, I’d be able to use the Paragraphs I’d added to the library to display when different Smart Content segments were matched.

The Form Creation Journey

I wanted to get the node edit form working first, so in personalized_paragraphs.module, I needed to detect that a Paragraph of type personalized_paragraph was in a form. I created a:

function personalized_paragraphs_field_widget_entity_reference_paragraphs_form_alter(&$element, FormStateInterface &$form_state, $context){...}

Which is a form_alter hook that I knew would run for every Paragraph in a form, so I immediately needed to narrow it to personalized_paragraphs Paragraphs:

  $type = $element['#paragraph_type'];
  if ($type == 'personalized_paragraph') {
    ...
  }
}

So I was hooked into any form that contains a Personalized Paragraph. This captures the ‘when and where’ that I needed to load the plugin code defined above. So the next step was to load the plugin inside our control structure:

if ($plugin = personalized_paragraphs_get_handler('personalized_paragraph')) {
  $build_form = $plugin->buildConfigurationForm([], $form_state);
  $element['subform']['smart_content'] = $build_form;
}

And the code for _get_handler:

function personalized_paragraphs_get_handler($plugin_name) {
  $plugin_manager = \Drupal::service('plugin.manager.personalized_paragraphs');
  $definitions = $plugin_manager->getDefinitions();

  foreach ($definitions as $plugin_id => $definition) {
    if ($plugin_id == $plugin_name) {
      return $plugin_manager->createInstance($plugin_id);
    }
  }
  return false;
}

So what’s going on here? Well, we know we’re acting on the form that builds a Paragraphs edit interface. Once we know that, we can go ahead and load the Annotation plugin we defined in the beginning (personalized_paragraph) using the custom plugin manager we defined (plugin.manager.personalized_paragraphs). This will give us an instance of DecisionParagraph. With that instance, we can call DecisionParagraph’s buildConfigurationForm method passing it an empty array. When it returns, that empty array will be a filled render array which mirrors the smart_content_block user experience exactly, but within a Personalized Paragraph. So all we need to do is attach it in its own key (smart_content) to the element’s ‘subform’ and it will display in the right area.

So what is happening inside buildConfigurationForm? I won’t be going too in depth here as most of this is simply mimicking smart_content_block. Suffice it to say that when the DecisionParagraph is constructed, an instance of MultipleParagraphDecision is also constructed. ->buildConfigurationForm ends up being called in both classes. You can view the code in each to get a sense of how the form render array is built. Now, with this code in place, we end up with an experience exactly like smart_content_block, but inside a Paragraph inside a Node; this is what the personalized paragraph in my homepage type looked like:

This is ultimately what anyone who has used smart_content_block would want out of a Paragraphs-based version. Since we had been using smart_content_block, we had a number of Segment Sets already to test from. Here is the result of selecting our Homepage Customer Block Segment Set:

I would like to digress for a moment here to discuss one of the most difficult bugs I encountered in the process. Getting the ‘Select Segment Set’ ajax to work was an absolute journey. On first implementation, the returned content was an empty element whose class name led me to ManagedFile.php, a class that provides an AJAX/progress-aware widget for uploading and saving a file. This of course was odd because this element was not an upload/file widget; however, this particular Node edit form did have elements like that on the page. After stepping through execution in both Symfony and core’s FormBuilder, what I discovered is this (line 1109 of FormBuilder):

// If a form contains a single textfield, and the ENTER key is pressed
// within it, Internet Explorer submits the form with no POST data
// identifying any submit button. Other browsers submit POST data as
// though the user clicked the first button. Therefore, to be as
// consistent as we can be across browsers, if no 'triggering_element' has
// been identified yet, default it to the first button.
$buttons = $form_state->getButtons();
if (!$form_state->isProgrammed() && !$form_state->getTriggeringElement() && !empty($buttons)) {
  $form_state->setTriggeringElement($buttons[0]);
}

In short, I was pressing ‘Select Segment Set’, the triggering element wasn’t being found as the form was rebuilt in FormBuilder, and the code was just setting it to the first found button on the page (hence ManagedFile.php). I have no objection to the comment or reason for this code block, but it makes it extremely difficult to figure out why your AJAX button isn’t working. If, for example, it triggered a log statement inside the if that said something like “the triggering element could not be matched to an element on the page during form build” it would have saved me multiple days of pain.

FormBuilder attempts to match the triggering element by comparing the name attribute of the pressed button to the name attributes of buttons on the page as it rebuilds the form. The issue was occurring because smart_content_block creates the name from a UUID it generates when MultipleBlockDecision is created. In Personalized Paragraphs, this creation occurs inside a field_widget_entity_reference_paragraphs_form_alter which is called again while the form is rebuilt. As such, a new UUID is generated, and FormBuilder cannot match the two elements.

The solution was to create a name that is unique within the edit form (so it can be matched), but does not change when the form is rebuilt. I added this above ->buildConfigurationForm:

$parent_field = $context['items']->getName();
$plugin->setConfigurationValue('parent', $parent_field);
$build_form = $plugin->buildConfigurationForm([], $form_state);
$element['subform']['smart_content'] = $build_form;

The machine name of the field that contains the personalized paragraph is passed along in the configuration values from DecisionParagraph to MultipleParagraphDecision, where it is extracted and used to create the name attribute of the button. This solved the issue. Okay, now back to the returned Reactions.

The class that builds the Reactions after a Segment Set is selected is DisplayParagraphs; an instance is created for each Reaction. The code that executes this is found in MultipleParagraphDecision, inside the stubDecision() method, and in the buildSelectedSegmentSet() method if the Reactions already exist. The Reactions are the first place we depart from the smart_content_block experience.

Seasoned users of smart_content_block will notice that the 'Add Block' button is missing. One of the most difficult problems I encountered while porting smart_content_block was getting the ajax buttons in the form experience to work correctly. Because of this, I opted to just hide them here (I commented out the code that built them in DisplayParagraphs.php) and instead validate and submit whatever is in the select dropdown at submission time. I liked the simplicity of this anyway, but it means a given Reaction can never contain more than one paragraph. This is an area ripe for contribution inside personalized_paragraphs.

In order to populate the select dropdowns in the Reactions, I first needed to go create some test Paragraphs Library items that would exist in them. I loaded /admin/content/paragraphs, selected ‘Add Library Item’ and then Add -> ‘Personalization — Homepage Banner’ (the Paragraph I created earlier to mimic the content I’m personalizing). I created a few instances of this Paragraph. Now I could go back to DisplayParagraphs.php and figure out how to retrieve these paragraphs.

Looking at the buildConfigurationForm method, it was clear that an array of $options was built up and passed to the form render array, so I needed to simply create some new options. Since we’re dealing with ContentEntities now, this was pretty easy:

$pg_lib_conn = $this->entityTypeManager->getStorage('paragraphs_library_item');
$paragraphs = $pg_lib_conn->loadMultiple();
$options = [];
$options[''] = "- Select a Paragraph -";
foreach ($paragraphs as $paragraph) {
  $maybe_parent = $paragraph->get('paragraphs')->referencedEntities();
  if (!empty($maybe_parent)) {
    $parent_name = $maybe_parent[0]->bundle();
    $options[$parent_name][$paragraph->id()] = $paragraph->label();
  } else {
    $options[$paragraph->id()] = $paragraph->label();
  }
}

The code loads all of the existing paragraphs_library_items and splits them by Paragraph Type for easy selection in the dropdown, which is how it works in smart_content_block. $options is later passed to a render array representing the select dropdown.
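
For context, here is a minimal sketch (the element key, title and $selected_id variable are assumptions, not the module's exact code) of the kind of select element that $options might feed inside DisplayParagraphs::buildConfigurationForm():

$form['paragraph_select'] = [
  '#type' => 'select',
  '#title' => $this->t('Paragraph'),
  // Grouped options: top-level keys whose values are arrays render as optgroups.
  '#options' => $options,
  '#default_value' => $selected_id ?? '',
];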

With this in place, we’re able to add a personalized_paragraph to a node, select a segment set, load reactions for that segment set and select the personalized paragraphs we want to display. Beautiful. What happens when we press Save?

The Form Submission Journey

Due to the way I was loading the Segment/Reaction form into the node edit form, none of the existing submit handlers were called by default. Thankfully the submit function attached to DecisionParagraph, paragraphSubmit, was designed so that it calls all of the nested submit functions, i.e. MultipleParagraphDecision::submitConfigurationForm, which loops while calling DisplayParagraphs::submitConfigurationForm. So all I needed to do was attach paragraphSubmit as a custom handler like so:

function personalized_paragraphs_form_node_form_alter(&$form, FormStateInterface $form_state, $form_id) {
  $node = $form_state->getFormObject()->getEntity();
  $personalized_fields = _has_personalized_paragraph($node);
  if (!empty($personalized_fields)) {
    if ($plugin = personalized_paragraphs_get_handler('personalized_paragraph')) {
      _add_form_submits($form, $plugin);
    }
  }
}
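
The _add_form_submits helper isn't reproduced here; a hedged sketch of the shape it might take (assuming the node form's save button lives at $form['actions']['submit'], which may not match the module's actual helper) is:

function _add_form_submits(array &$form, $plugin) {
  // Append the plugin's handler so it runs after core's own submit handlers.
  $form['actions']['submit']['#submit'][] = [$plugin, 'paragraphSubmit'];
}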

For reference, _has_personalized_paragraph looks like this:

function _has_personalized_paragraph($node) {
  $fields = [];
  foreach ($node->getFields() as $field_id => $field) {
    $settings = $field->getSettings();
    $has_settings = array_key_exists('handler_settings', $settings);
    if ($has_settings) {
      $has_bundle = array_key_exists('target_bundles', $settings['handler_settings']);
      if ($has_bundle) {
        foreach ($settings['handler_settings']['target_bundles'] as $id1 => $id2) {
          if ($id1 == 'personalized_paragraph' || $id2 == 'personalized_paragraph') {
            array_push($fields, $field_id);
          }
        }
      }
    }
  }
  return $fields;
}

I'll note here that it certainly 'feels' like there should be a more Drupal-y way to do this. I'll also note that, at the time of this writing, PPs have not been tested on Paragraphs that nest them more than one level deep; my sense is that this function would fail in that case (another area ripe for contributing to the module).

Okay, so now we know that when someone presses ‘save’ in the node edit form, our custom handler will run.

paragraphSubmit departs pretty heavily from DecisionBlock::blockSubmit. First, since a Node could have an arbitrary number of personalized paragraphs, we must loop over $form_state's userInput and detect all fields that have personalized paragraphs. Once we've narrowed to just the personalized fields, we loop over those and feed their subforms to code similar to what existed in DecisionBlock::blockSubmit.

paragraphSubmit narrows to the form array for a given personalized paragraph and then passes that array to DecisionStorageBase::getWidgetState (a smart_content class), which uses NestedArray::getValue(). Users of this function know you pass an array of parent keys and a form to ::getValue() and it gives back null or a value. When I initially wrote this code, I hardcoded '0' as one of the parents, thinking this would never change. However, one big difference between smart_content_block and personalized_paragraphs is that, by virtue of being a paragraph, a user can press 'Remove', 'Confirm Removal' and 'Add Personalized Paragraph'. In the form array that represents the personalized paragraph, pressing these buttons increments the widget's delta key by 1. So in paragraphSubmit, it will now have a 1 key instead of a 0 key. To handle this, I wrote an array_filter to find the only numerical key in the form array:

$widget_state = $form[$field_name]['widget'];
$filter_widget = array_filter(
  $widget_state,
  function ($key) {
    return is_numeric($key);
  },
  ARRAY_FILTER_USE_KEY
);
$digit = array_key_first($filter_widget);

$parents = [$field_name, 'widget', $digit, 'subform', 'smart_content'];

As noted in the code comments, this will fail if someone attempts to create a field that holds multiple Personalized Paragraphs (array_key_first will return only the first delta). This is another area ripe for contribution in Personalized Paragraphs.

DecisionStorageBase::getWidgetState gets a decision storage representation from the form state and returns it. I added code here to ensure that the decision is always of type ContentEntity and not ConfigEntity (smart content defines both). Next, the code uses the $parents array and the passed-in $form variable to get the actual $element we're currently submitting. It then runs this code:

if ($element) {
  // Get the decision from storage.
  $decision = $this->getDecisionStorage()->getDecision();
  if ($decision->getSegmentSetStorage()) {
    // Submit the form with the decision.
    SegmentSetConfigEntityForm::pluginFormSubmit($decision, $element, $form_state, ['decision']);
    // Set the decision to storage.
    $this->getDecisionStorage()->setDecision($decision);
  }
}

It’s easy to miss, but this line:

SegmentSetConfigEntityForm::pluginFormSubmit($decision, $element, $form_state, ['decision']);

is what submits the current $element to MultipleParagraphDecision's submit handler, which will ultimately call DisplayParagraphs' submit handler ($decision in this case is the instance of MultipleParagraphDecision). So the chain of events is like this:

  • Node_form_alter -> add paragraphSubmit as a custom handler.
  • On submission, paragraphSubmit calls MultipleParagraphDecision::submitConfigurationForm (via ::pluginFormSubmit).
  • This function has a looping structure which calls DisplayParagraphs::submitConfigurationForm for each Reaction (via ::pluginFormSubmit).

Before completing our walk through of paragraphSubmit, let’s follow the execution and dive into these submit handlers.

MultipleParagraphDecision::submitConfigurationForm is largely identical to MultipleBlockDecision::submitConfigurationForm. It gets the SegmentSetStorage for the current submission and loops for each Segment, creating a DisplayParagraphs instance for that segment uuid. It achieves this by calling:

$reaction = $this->getReaction($segment->getUuid());
SegmentSetConfigEntityForm::pluginFormSubmit($reaction, $form, $form_state, [
  'decision_settings',
  'segments',
  $uuid,
  'settings',
  'reaction_settings',
  'plugin_form',
]);

Where $reaction ends up being an instance of DisplayParagraphs for the current segment uuid. ::pluginFormSubmit is called like above which calls DisplayParagraphs::submitConfigurationForm.

This function starts by calling DisplayParagraphs::getParagraphs, which is used all over DisplayParagraphs and modeled after DisplayBlocks::getBlocks. Because the block implementation can use PluginCollections, it's easy for getBlocks to grab whatever block information is stored on the current reaction. I could not find a way to emulate this with paragraphs, so I opted to get paragraph information directly from the form input. If you recall my solution to the ajax button matching problem above (passing the unique machine ID of the parent field backwards via config values), the getParagraphs implementation will look familiar.

First, for any call to ->getParagraphs that is not during validation or submission, the caller passes an empty array, which tells getParagraphs to try to get the Reaction information from the current configuration values (i.e. while it's building dropdowns or sending an ajax response). Second, when called during validation or submission, the caller passes the result of $form_state->getUserInput(). After the non-empty passed array is detected, this code executes:

$field_name = $this->getConfiguration()['parent_field'];
$widget_state = $user_input[$field_name];
$filter_widget = array_filter(
  $widget_state,
  function ($key) {
    return is_numeric($key);
  },
  ARRAY_FILTER_USE_KEY
);
$digit = array_key_first($filter_widget);
$parents = [$digit, 'subform', 'smart_content', 'decision', 'decision_settings', 'segments'];
$reaction_settings = NestedArray::getValue($user_input[$field_name], $parents);
$reaction_arr = $reaction_settings[$this->getSegmentDependencyId()];
$paragraphs[$field_name][$this->getSegmentDependencyId()] = $reaction_arr;

getParagraphs extracts the machine ID of the current parent_field out of its configuration values and uses it to parse the UserInput array. The value is extracted similarly to paragraphSubmit (filter for a numeric key and call ::getValue()) and then an array of reaction information keyed by parent field name and current segment set UUID is created and passed back to the caller.
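
To make the shape of that return value concrete, a hedged illustration (the field name, UUID and inner contents are illustrative only, not the module's exact structure) might look like:

$paragraphs = [
  'field_homepage_banner' => [
    // Keyed by the current segment's UUID.
    'a1b2c3d4-0000-0000-0000-000000000000' => [
      // Whatever was submitted in that Reaction's select dropdown,
      // e.g. the chosen paragraphs_library_item id.
    ],
  ],
];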

submitConfigurationForm then extracts the paragraph ID out of this array and creates an array that will store this information in the configuration values of this instance of DisplayParagraphs (highly similar to DisplayBlocks). At this point control switches back to MultipleParagraphDecision and the $reaction variable now contains the updated configuration values. The reaction information is then set via DecisionBase::setReaction(), the ReactionPluginCollections config is updated and the instance variable MultipleParagraphDecision->reactions is updated. At this point, control then goes back to paragraphSubmit.

Before we step back there, I wanted to note that DisplayParagraphs::getParagraphs is another area ripe for contribution. I skipped over the first portion of this function; it is called in multiple areas of DisplayParagraphs, either to get submitted form values (which we discussed) or to retrieve the existing values that are already in the configuration. As such, the function is built around a main control structure that branches based on an empty user input. This could definitely be done in a cleaner, more readable way.

Okay, back to paragraphSubmit. At this point we have completed everything that was called inside ::pluginFormSubmit which stepped through all of our nested submission code. The $decision variable has been updated with all of that information and the decision is now set like this:

// Submit the form with the decision.
SegmentSetConfigEntityForm::pluginFormSubmit($decision, $element, $form_state, ['decision']);
// Set the decision to storage.
$this->getDecisionStorage()->setDecision($decision);

Now that we have built the submitted decision, we need to save it and inform the paragraph it’s contained in that this decision is attached to it:

if ($this->getDecisionStorage()) {
  $node = $form_state->getFormObject()->getEntity();
  $personalized_para = $node->get($field_name)->referencedEntities();
  if ($personalized_para == null) {
    // Paragraphs never created and saved a personalized_paragraph.
    \Drupal::logger('personalized_paragraphs')->notice("The node: " . $node->id() . " has a personalized paragraph (PP) and was saved, but no PP was created");
  } else {
    $personalized_para = $personalized_para[0];
  }
  if (!$node->isDefaultRevision()) {
    // A drafted node was saved.
    $this->getDecisionStorage()->setNewRevision();
  }
  $saved_decision = $this->getDecisionStorage()->save();
  $personalized_para->set('field_decision_content_token', $saved_decision->getDecision()->getToken());
  $personalized_para->save();
  if ($saved_decision instanceof RevisionableParentEntityUsageInterface) {
    $has_usage = $saved_decision->getUsage();
    if (!empty($has_usage)) {
      $saved_decision->deleteUsage();
    }
    $saved_decision->addUsage($personalized_para);
  }
}

We first get the $node out of the $form_state; recall that we are inside a structure that is looping over all personalized fields, so we use $field_name to get the referenced personalized paragraph out of the field. In an earlier version, the paragraphSubmit handler ran before any other handler; because of this, on a new node the paragraph had not been saved yet and ->referencedEntities returned null. With it executing last this should never happen, but I left a check and log statement for it just in case there is something I have not thought of. Next we check for defaultRevision so we can inform the decision content that it's part of a draft instead of a published node. Finally we save the decision, pass the returned token to the personalized paragraph's hidden field that stores it, and then add a row to the decision_content_usage table, which tracks usage of decisions and their parents.

At this point we have handled the Create and Update states of a personalized paragraph inside a Node edit form. What about the read state? Now that we’ve attached the decision token to the decision_content_token field of our personalized paragraph, we can go back to our field_widget_entity_reference_paragraphs_form_alter and add:

$parents = ['subform', 'field_decision_content_token', 'widget', 0, 'value', '#default_value'];
$decision_token = NestedArray::getValue($element, $parents);
if ($decision_token) {
  $plugin->loadDecisionByToken($decision_token);
}

loadDecisionByToken is a custom function I added to DecisionParagraph.php that looks like this:

public function loadDecisionByToken($token) {
  $new_decision = $this->getDecisionStorage()->loadDecisionFromToken($token);
  $new_decision->setDecision($this->decisionStorage->getEntity()->getDecision());
  $this->decisionStorage = $new_decision;
}

In essence, this takes the attached decision_token, loads the decision it represents out of the database and sets the decisionStorage inside DecisionParagraph to that decision. By virtue of doing this, when ->buildConfigurationForm is later called, it gives us back the form that represents the segment set and reactions from the saved decision. Create, Read, Update… what about Delete?

When it comes to Paragraphs, delete is a fickle mistress. Because you can 'Remove', 'Confirm Removal', 'Add' a Paragraph and then save the Node, Paragraphs must create a new paragraph, which orphans the old one. The bright minds behind Entity Reference Revisions have created a QueueWorker that finds these orphaned paragraphs and cleans them up, kind of like a garbage collector. At the time of this writing, Personalized Paragraphs does not implement something similar, and this is yet another area ripe for contribution. For example, if one saved a Node with a filled decision in a personalized paragraph, then edited the node, removed/confirmed removal/added a new personalized paragraph, filled out the decision and saved, both decisions would still be in the decision_content tables. Now if one deletes that Node, the current personalized paragraph's decision will be deleted, but the old one will not, essentially orphaning that old decision. Here is how delete currently works:

function personalized_paragraphs_entity_predelete(EntityInterface $entity) {
  if ($entity instanceof Node) {
    if ($fields = _has_personalized_paragraph($entity)) {
      foreach ($fields as $field) {
        $has_para = $entity->get($field)->referencedEntities();
        if (!empty($has_para)) {
          $has_token = !$has_para[0]->get('field_decision_content_token')->isEmpty();
          if ($has_token) {
            $token = $has_para[0]->get('field_decision_content_token')->getValue()[0]['value'];
            _delete_decision_content($token);
          }
        }
      }
    }
  }
}

In a hook_entity_predelete, we detect the deletion of a node with personalized paragraphs, iterate those paragraphs and delete the decision represented by the token currently attached to the paragraph. So we'll get the current decision token, but not any old ones. Given that the only way to change the segment set of an existing decision is to 'remove', 'confirm removal', 'add', this will likely happen often. The consequence is that decision tables will grow larger than they need to be, but hopefully we, or an enterprising user, will create a fix for this in the near future.
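
_delete_decision_content itself isn't shown above; purely as a hedged sketch (assuming the handler plugin exposes its decision storage the way the earlier snippets use it, which may not match the module's actual helper), it might take a shape like:

function _delete_decision_content($token) {
  if ($plugin = personalized_paragraphs_get_handler('personalized_paragraph')) {
    // loadDecisionFromToken() and getEntity() are used as in the snippets above.
    $storage = $plugin->getDecisionStorage()->loadDecisionFromToken($token);
    if ($storage && ($decision_entity = $storage->getEntity())) {
      $decision_entity->delete();
    }
  }
}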

Okay so we’ve handled adding the smart_content_block experience to any Node edit form with a personalized_paragraph. What about viewing a personalized_paragraph?

For viewing, we have the old tried and true hook_preprocess_HOOK to the rescue. We deploy a personalized_paragraphs_preprocess_field__entity_reference_revisions so our hook will run for every Paragraph; we quickly narrow to only those paragraphs that reference personalized_paragraphs:

$parents = ['items', 0, 'content', '#paragraph'];
$para = NestedArray::getValue($variables, $parents);
if ($para->bundle() == 'personalized_paragraph') {
  ...
}

Next, we attempt to get the decision token out of the decision_content_token field so we can pass it to DecisionParagraph::build():

if ($para->bundle() == 'personalized_paragraph') {
  if ($plugin = personalized_paragraphs_get_handler('personalized_paragraph')) {
    $has_token = !$para->get('field_decision_content_token')->isEmpty();
    if ($has_token) {
      $token = $para->get('field_decision_content_token')->getValue()[0]['value'];
      $build = $plugin->build($token);
      ...
    }

Where the build function looks like:

public function build($token) {
  $this->loadDecisionByToken($token);
  $decision = $this->getDecisionStorage()->getDecision();

  $build = [
    '#attributes' => ['data-smart-content-placeholder' => $decision->getPlaceholderId()],
    '#markup' => ' ',
  ];

  $build = $decision->attach($build);
  return $build;
}

This is slightly modified from DecisionBlock::build(). We load the decision content that was attached to the personalized_paragraph, then call the DecisionBase::attach() function on that decision. This passes control to a number of functions that create the magic inside smart_content. When attach() returns, we are given an array that smart_content.js will process to decide on and retrieve a winning Reaction. To complete the function:

$has_token = !$para->get('field_decision_content_token')->isEmpty();
if ($has_token) {
  $token = $para->get('field_decision_content_token')->getValue()[0]['value'];
  $build = $plugin->build($token);
  $has_attached = array_key_exists('#attached', $build);
  if ($has_attached && !empty($build['#attached']['drupalSettings']['smartContent'])) {
    $variables['items'][0]['content']['#attributes'] = $build['#attributes'];
    $variables['items'][0]['content']['#attached'] = $build['#attached'];

    $para_data = [
      'token' => $token,
    ];
    $has_name = !$para->get('field_machine_name')->isEmpty();
    $name = $has_name ? $para->get('field_machine_name')->getValue()[0]['value'] : '';
    $variables['items'][0]['content']['#attached']['drupalSettings']['decision_paragraphs'][$name] = $para_data;
  }

We get the $build array back from ->build and verify that it has the appropriate attachments to run smart content. If it doesn't, we log a statement indicating that something in the build function has failed. If it does, we attach the correct pieces of the build array to our variables array. I want to focus on this code block to complete the discussion:

$para_data = [
  'token' => $token,
];
$has_name = !$para->get('field_machine_name')->isEmpty();
$name = $has_name ? $para->get('field_machine_name')->getValue()[0]['value'] : '';
$variables['items'][0]['content']['#attached']['drupalSettings']['decision_paragraphs'][$name] = $para_data;

This code block represents how my organization manages the front end of personalized paragraphs, and I'll admit it's an assumption about how you, the user, might want to manage it. If you've been following along, you'll have noticed that pesky 'Machine Name' field I attached to personalized paragraphs. Here is where it comes into play. We extract the passed name, which should be unique to the page itself; that name and the decision_content_token are attached to drupalSettings so they are available to JavaScript files via drupalSettings. With the name and token available in JavaScript, one can now:

a.) Detect that the decision paragraph loaded (is the decision_paragraphs key in drupalSettings? Does it contain this unique machine name?) and if not, ensure a default experience loads,

b.) Run javascript functions that display the winning experience or the default experience.

Since our method for managing the front end is beyond the scope of how personalized paragraphs was built, I’ll discuss it more in the How To Use Personalized Paragraphs blog.

There's one more function to discuss that gets called as the front end experience is being displayed, and that is DisplayParagraphs::getResponse. When smart_content.js selects a winner, it runs some ajax which calls ReactionController, which loads the winning Reaction and calls its ->getResponse method. I had to slightly modify this method from DisplayBlocks to deal with Paragraphs:

public function getResponse(PlaceholderDecisionInterface $decision) {
  $response = new CacheableAjaxResponse();
  $content = [];
  // Load all the paragraphs that are a part of this reaction.
  $paragraphs = $this->getParagraphs([]);
  if (!empty($paragraphs)) {
    // Build the render array for each paragraph.
    foreach ($paragraphs as $para_arr) {
      $pg_lib_conn = $this->entityTypeManager->getStorage('paragraphs_library_item');
      $para_lib_item = $pg_lib_conn->load($para_arr['id']);
      $has_para = !$para_lib_item->get('paragraphs')->isEmpty();
      if ($has_para) {
        $para_id = $para_lib_item->get('paragraphs')->getValue();
        $target_id = $para_id[0]['target_id'];
        $target_revision_id = $para_id[0]['target_revision_id'];
        $para = Paragraph::load($target_id);
        $render_arr = $this->entityTypeManager->getViewBuilder('paragraph')->view($para);
        // access() with TRUE returns an AccessResult object, so check isAllowed().
        $access = $para->access('view', $this->currentUser, TRUE);
        $response->addCacheableDependency($access);
        if ($access->isAllowed()) {
          $content[] = [
            'content' => $render_arr,
          ];
          $response->addCacheableDependency($render_arr);
        }
      }
    }
  }
  // Build and return the AJAX response.
  $selector = '[data-smart-content-placeholder="' . $decision->getPlaceholderId() . '"]';
  $response->addCommand(new ReplaceCommand($selector, $content));
  return $response;
}

Instead of getting the content to send back from the BlockCollection configuration, I had to grab the ID stored in the config values, which loads a Paragraphs Library Item. That item references a Paragraph, so I grab its ID and load the Paragraph. The render array is created from the loaded Paragraph and sent back to the ajax to be displayed.

Phew, somehow we've gotten to what feels like the end. I'm sure there is something I'm forgetting that I'll need to add later. But if you've made it this far then you are a champion. I hope what the Smart Content folks have created, and my little extension, work for your use case, and that this blog has made you aware of how things work much more quickly than reading and debugging the code alone would have.

If you have any questions you can find me @plamb on the Drupal slack chat, there is also a #personalized_paragraphs channel and a How To Use Personalized Paragraphs blog.

Sep 22, 2020 · Pierce Lamb · 12 min read

Smart Content is a module for Drupal which enables personalized content selection for anonymous and authenticated users. The module supplies the UI and logic for creating and making these selections, as well as some simple browser-based conditions to test, but Smart Content by itself does not provide the data needed to support them. However, there are a couple of modules in its ecosystem that support 3rd party data providers, e.g. Demandbase and FunnelEnvy. The idea here is that if your site is already using one of these data providers to record data about anonymous users, that data can be used to deliver personalized content within Smart Content. Recently, I built a connector from Marketo RTP to Smart Content; I will update this blog with a link once it is a public module. For now, however, I believe detailing how I did it can help others connect Smart Content to any 3rd party marketing API.

The entry point is first understanding what a Response from the Marketing API looks like. For example, in FunnelEnvy, there are two fundamental options: matching the ID of an Audience, or matching the ID of a Variation. In DemandBase, there are myriad dimensions in the response. In Marketo RTP, we have 6 dimensions with a number of sub-dimensions. Either way, this Response needs to be understood so we can start representing it inside Smart Content. One way to look at this response is to query the marketing API in your browser console. For example, in inspector -> console, for RTP, I would type: rtp('get', 'visitor', function(data){console.log(data)}); and observe the results. We'll take a step back here to discuss setting up Smart Content before continuing.

The entry point of Smart Content (SC) is the Segment Set. Administering Segment Sets is found in the Structure -> Smart Content -> Manage Segment Sets menu (once SC is installed). A Segment Set represents some generalized way you want to segment anonymous users on your site. For example, you might title a Segment Set ‘Industry’ and then within the set, create Segments that correlate to industries like ‘banking’ or ‘manufacturing’. Once you’ve created the ‘Industry’ Segment Set, press the ‘edit’ button and you should be brought to a page where you can add Segments. This brings us to the next core piece of SC: the Condition.

You’ll notice that under a Segment you have the ability to create a list of conditions. You can select “If all” (AND) are true or “If any” (OR) are true, then the segment evaluates to true, otherwise, false. SC works by iterating through these segments and checking their conditions; once a segment’s condition(s) evaluate to true, a winner has been found and SC delivers a reaction (personalized content) based on the true segment. These conditions correlate exactly to the API Response data we discussed above. So, in code, we’ll need to create a condition that matches the API we’re using.

At this point I'll assume that you've created a custom module in Drupal to represent your connector; in my case I've named it 'smart_content_marketo_rtp.' Within your module, create a 'src' folder and a 'js' folder. Inside the src folder create a 'Plugin' folder and inside that folder a 'Derivative' folder and a 'smart_content' folder. In the 'smart_content' folder we'll have a 'Condition' folder, and inside that a 'Group' folder and an optional 'Type' folder. The end result should look like this:
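
Roughly, and including the files we will create later in this post, that layout is:

smart_content_marketo_rtp/
  smart_content_marketo_rtp.libraries.yml
  js/
    condition.marketo_rtp.js
    condition_type.array.js
  src/
    Plugin/
      Derivative/
        MarketoConditionDeriver.php
      smart_content/
        Condition/
          MarketoCondition.php
          Group/
            Marketo.php
          Type/
            ArrayNumber.php
            ArrayText.php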

The first piece of code we'll add is a new PHP class in the Condition folder to represent our new condition. In my case I titled it 'MarketoCondition.php'. In addition to that, we also need to add the .libraries.yml file. Here we'll configure the module's JS files which interact with Smart Content's backend. Since we're creating a new condition, we'll follow SC's naming conventions. In the .libraries.yml file add your version of this config:

condition.marketo_rtp:
  header: true
  version: 1.x
  js:
    js/condition.marketo_rtp.js: { }
  dependencies:
    - smart_content/storage
    - smart_content/condition_type.standard

Note the filename you created under js:; you'll also need to add a JS file with this name under your js folder (e.g. condition.marketo_rtp.js). Okay, back to the MarketoCondition.php file.

Here is where we will define our new Condition. It will look like this:

namespace Drupal\smart_content_marketo_rtp\Plugin\smart_content\Condition;

use Drupal\smart_content\Condition\ConditionTypeConfigurableBase;

/**
 * Provides a Marketo condition plugin.
 *
 * @SmartCondition(
 *   id = "marketo",
 *   label = @Translation("Marketo"),
 *   group = "marketo",
 *   deriver = "Drupal\smart_content_marketo_rtp\Plugin\Derivative\MarketoConditionDeriver"
 * )
 */
class MarketoCondition extends ConditionTypeConfigurableBase {

  /**
   * {@inheritdoc}
   */
  public function getLibraries() {
    $libraries = array_unique(array_merge(
      parent::getLibraries(),
      [
        'smart_content_marketo_rtp/condition.marketo_rtp',
      ]
    ));
    return $libraries;
  }

}

Note the definitions in the comments. The syntax here must be preserved as Smart Content reads these entries and uses them internally. Note the filepath in the ‘deriver’ assignment; go ahead and create this PHP class in your ‘Derivative’ folder as well. Make sure to change the string in the getLibraries() method so it matches your module name and your JS file config definition in libraries.yml.

What this file is doing is defining a new Condition for Smart Content; the ‘id’ key defines how it is named when it’s passed in SC, the ‘label’ key defines how it will appear to an end user and the ‘deriver’ key points to a class that will define how SC should interpret all the dimensions in the API response we discussed earlier. In essence “what conditions should be available under the Marketo id?” Finally, overriding the getLibraries() function allows us to attach our custom JS file whenever our new Condition is used. That JS file will describe how to interact with the 3rd party API that powers our new Condition.

Next, let's move to the deriver file defined in the comment. As shown, this file will be in src/Plugin/Derivative and must match the name you put in the comment exactly. This file can be largely isomorphic to other ConditionDerivers from Smart Content; a good example is the Demandbase module. The one method we will care about in creating a custom connector is the getStaticFields() method. This is where the connector will map the marketing API's dimensions to actual Smart Content types. If you've checked out the Demandbase link above, you'll see that the three basic SC types are 'boolean', 'number' and 'textfield.' Hopefully the marketing API's response you're working with fits neatly into these. When I wrote the Marketo RTP connector, the response did not, and I had to write my own custom types. This explains where the types 'arraytext' and 'arraynumber' come from in the getStaticFields() method in this connector:

protected function getStaticFields() {
  return [
    'abm-code' => [
      'label' => 'ABM Code',
      'type' => 'arraynumber',
    ],
    'abm-name' => [
      'label' => 'ABM Name',
      'type' => 'arraytext',
    ],
    'category' => [
      'label' => 'Category',
      'type' => 'textfield',
    ],
    'group' => [
      'label' => 'Group',
      'type' => 'textfield',
    ],
    'industries' => [
      'label' => 'Industry',
      'type' => 'arraytext',
    ],
    'isp' => [
      'label' => 'Internet Service Provider',
      'type' => 'boolean',
    ],
    'location-country' => [
      'label' => 'Country',
      'type' => 'arraytext',
    ],
    'location-city' => [
      'label' => 'City',
      'type' => 'arraytext',
    ],
    'location-state' => [
      'label' => 'State',
      'type' => 'arraytext',
    ],
    'matchedSegments-name' => [
      'label' => 'Segment Name',
      'type' => 'arraytext',
    ],
    'matchedSegments-code' => [
      'label' => 'Segment ID',
      'type' => 'arraynumber',
    ],
    'org' => [
      'label' => 'Organization',
      'type' => 'textfield',
    ]
  ];
}

This will look confusing at first, but if you followed the link above about RTP's 6 dimensions, you'll see that the getStaticFields array members match these 6 dimensions exactly. I've introduced a convention here for any dimension that contains a keyed array: I've used a '-' to separate the dimension itself from the nested key, for example 'abm-code' and 'abm-name.' This dash will be necessary later when we parse out a nested key from a condition. Note that the 'label' key designates the string that users of your connector see when they create a new Condition.

At this point, if you've created your new Condition and your Deriver files, we have one more file to add before seeing our new Condition inside Smart Content. Under the 'Group' folder, create a new PHP class titled after your marketing API, e.g. mine is simply named 'Marketo.php.' This file groups all of the new conditions you defined in your ConditionDeriver under one name. It is a very simple file:

namespace Drupal\smart_content_marketo_rtp\Plugin\smart_content\Condition\Group;

use Drupal\smart_content\Condition\Group\ConditionGroupBase;

/**
 * Provides a condition group for Marketo conditions.
 *
 * @SmartConditionGroup(
 *   id = "marketo",
 *   label = @Translation("Marketo")
 * )
 */
class Marketo extends ConditionGroupBase {

}

The ‘id’ field in the comment links it to MarketoCondition.php and the ‘label’ field defines what users will see as a grouping when they create a new Condition.

At this point, if you did not create any new types in the getStaticFields() method, we can edit our Industry Segment Set, flush caches, and then press the ‘Select a condition’ drop down. You should now be able to scroll through this list and see the new grouping and the dimensions you defined in your Deriver file. If they do not appear, then one of the steps above was done incorrectly.

If you did not create any new Types, you can skip this next section. In the getStaticFields method I pasted above, you can see two new Types, 'arraynumber' and 'arraytext.' To define these new types, we'll create two new PHP classes in the src/Plugin/smart_content/Condition/Type folder: 'ArrayNumber.php' and 'ArrayText.php.' Since these two new types depend on a more primitive type (textfield or number), I can simply extend those more primitive types. As such, my ArrayNumber.php file will look like:

namespace Drupal\smart_content_marketo_rtp\Plugin\smart_content\Condition\Type;

use Drupal\smart_content\Plugin\smart_content\Condition\Type\Number;

/**
 * Provides a 'arraynumber' ConditionType.
 *
 * @SmartConditionType(
 *   id = "arraynumber",
 *   label = @Translation("ArrayNumber"),
 * )
 */
class ArrayNumber extends Number {

  /**
   * {@inheritdoc}
   */
  public function getLibraries() {
    return ['smart_content_marketo_rtp/condition_type.array'];
  }

}

And

namespace Drupal\smart_content_marketo_rtp\Plugin\smart_content\Condition\Type;

use Drupal\smart_content\Plugin\smart_content\Condition\Type\Textfield;

/**
 * Provides a 'arraytext' ConditionType.
 *
 * @SmartConditionType(
 *   id = "arraytext",
 *   label = @Translation("ArrayText"),
 * )
 */
class ArrayText extends Textfield {

  /**
   * {@inheritdoc}
   */
  public function getLibraries() {
    return ['smart_content_marketo_rtp/condition_type.array'];
  }

}

As you can see, since these new types will use all the same operators as their primitive types, the only method we need to override is the getLibraries() method which will pass a custom JS file for evaluating the truth values of our new Types. Note that the ‘id’ field MUST match the type name you gave in your Deriver file. Make sure to add that JS file in libraries.yml and your /js/ folder. The libraries.yml definition will look like this:

condition_type.array:
  header: true
  version: 1.x
  js:
    js/condition_type.array.js: { }
  dependencies:
    - core/drupal

I will not go into too much detail on condition_type.array.js as it's unlikely most readers are defining a new Type. The key for this file is to define new functions that Smart Content will call when it encounters an 'arraytext' or 'arraynumber'. These functions follow specific naming conventions, e.g.:

Drupal.smartContent.plugin.ConditionType['type:arraytext'] = function (condition, value) {...}
Drupal.smartContent.plugin.ConditionType['type:arraynumber'] = function (condition, value) {...}

Where condition is the Smart Content Condition represented in JSON and the value is the value discovered on the page for a given visitor. These functions must return boolean values. You can check out /modules/contrib/smart_content/js/condition_type.standard.js to get a better sense of these functions. Also you can message ‘plamb’ on the Drupal Slack.

If you've gotten this far, then we have one more file to populate. Way back at the beginning we created the file js/condition.marketo_rtp.js, but left it empty. In this file we will tell Smart Content what to do when it comes across a condition of type 'Marketo.' We'll open this file with the following:

(function (Drupal) {

  Drupal.smartContent = Drupal.smartContent || {};
  Drupal.smartContent.plugin = Drupal.smartContent.plugin || {};
  Drupal.smartContent.plugin.Field = Drupal.smartContent.plugin.Field || {};

  ...

}(Drupal));

The primary function in this file will follow Smart Content’s naming conventions:

Drupal.smartContent.plugin.Field['marketo'] = function (condition) {...}

Note that the text 'marketo' matches the id we've been passing around in many other files. When Smart Content evaluates a field of grouping 'marketo', it will execute this function. The first job this function must perform is making sure we can access the marketing API and get a Response. It does that by constructing a Promise so the call can be done asynchronously, and then returning a resolution of that Promise, containing the relevant value, to the Smart Content backend. As such, our function's skeleton will look like this:

Drupal.smartContent.plugin.Field['marketo'] = function (condition) {
  Drupal.smartContent.marketo = new Promise((resolve, reject) => {
    ...
  });
  return Promise.resolve(Drupal.smartContent.marketo).then((value) => {
    ...
  });
}

Let’s first take a look at what’s happening inside the Promise. Here we will be checking that we can call the API and resolve the Promise with its Response:

let attempts = 0;
const interval = setInterval(() => {
  if (attempts < 200) {
    if (typeof rtp === "function") {
      clearInterval(interval);
      rtp('get', 'visitor', function (data) {
        if (data.results) {
          Drupal.smartContent.storage.setValue('marketo', data.results);
          resolve(data.results);
        } else {
          resolve({})
        }
      });
    }
  }
  else {
    clearInterval(interval);
    resolve({});
  }
  attempts++;
}, 10);

All this code is doing is running an interval function through 200 attempts of trying to resolve a given variable as a function. This structure closely models the other Smart Content modules; the main difference is that in the others, the code waits for a JS library to be available on the page rather than for a function to resolve. This is merely an artefact of how Marketo RTP works. Once the rtp variable is recognized as a function, the code can successfully call it the way Marketo intends. The Response data can then be passed into the resolve statement to be dealt with when the function returns.

You might wonder how the Drupal.smartContent.storage.setValue('marketo', data.results); line made it into this code block. I omitted some earlier code for clarity that deals with this. Most Smart Content connectors use a browser's Local Storage to store the results of the marketing API's response. This is because the metadata associated with a user in the API rarely changes. When a user returns to the site, instead of executing the setInterval code and waiting for the rtp function to return, we can simply grab the stored values out of their browser, which will always be faster. Putting the Local Storage code back in, the code block will look like this:

Drupal.smartContent.plugin.Field['marketo'] = function (condition) {
  let key = condition.field.pluginId.split(':')[1];
  if (!Drupal.smartContent.hasOwnProperty('marketo')) {
    if (!Drupal.smartContent.storage.isExpired('marketo')) {
      let values = Drupal.smartContent.storage.getValue('marketo');
      Drupal.smartContent.marketo = values;
    }
    else {
      Drupal.smartContent.marketo = new Promise((resolve, reject) => {
        // run the setInterval code from above
      });
    }
  }
}

So only if the 'marketo' key isn't found in Local Storage is the setInterval code run. You can get a better sense of what these methods are doing here. But wait, what is that let key = condition.field.pluginId.split(':')[1]; line about? Key is used in the return statement, which we will discuss next.

When Smart Content passes around condition information, its key is always the group name ('marketo') appended with the key from the Deriver class that was selected in the condition. Common keys for the RTP connector would be 'marketo:matchedSegments-code' or 'marketo:industries.' The line we mentioned above, let key = condition.field.pluginId.split(':')[1];, is thereby grabbing the string that occurs after the ':' and storing it for use in the return statement. This key will be used to parse the Response structure that comes back from the marketing API. Now we can look at the full return statement:

return Promise.resolve(Drupal.smartContent.marketo).then((value) => {
  // All single value members and arrays containing values
  if (value.hasOwnProperty(key)) {
    return value[key];
  } else {
    // All arrays of arrays
    var is_array_type = marketo_testForArray(condition.field.type);
    if (is_array_type) {
      var refined_key = marketo_getCorrectKey(key);
      if (value.hasOwnProperty(refined_key)) {
        return value[refined_key];
      }
    }
  }
  return null;
});

Promise.resolve().then() guarantees the code in the then block will run when the promise resolves. Since we earlier stored the key the Smart Content condition cares about, we first check if the Marketo API’s returned value simply contains that key. If so we return the value at that key back to Smart Content; simple enough. However, if the key in question is one of our custom array types, we need to detect that, split the main key from the nested key and return the correct value. The two functions called in the else block provide that functionality.

With these pieces in place, we can now test our new connector.

At this point, we would log in to whichever marketing platform we're connecting to and create a new test segment to test against our connector. In Marketo, this means going to Web Personalization -> Segments -> Create New Segment. Every marketing platform is a bit different, but many of them have the option of creating a segment based on a URL parameter, which is easy for local testing. In Marketo we would add a 'Behavioral -> Include Pages' segment and define a URL for testing, e.g. /?test=yes. Under 'domains' we'd make sure our domain is selected and hit save. With Marketo, we can hover over the segment we just created and get the ID. This is the value Smart Content will be matching against when our connector runs.

If we then open a fresh incognito window (remember the Local Storage discussion earlier?) and load /?test=yes we should be matched into the Marketo Segment. We can verify this by opening the console and running rtp('get', 'visitor', function(data){console.log(data)}); again. Expanding the return structure should show our ID in 'matchedSegments.'

To see this matching in a decision block, we would go back to Drupal and load Structure -> Smart Content -> Manage Segment Sets. If you created a Segment Set earlier, use that one or create a new one. For Marketo we will either create or re-use a condition, select Segment ID (Marketo) and paste the segment ID we just copied. Once saved and with caches flushed, we can add a decision block in Structure -> Block Layout -> Place Block, select our new segment set, choose blocks and save. In a new incognito window, we would load /?test=yes and see whatever block we chose.

If you have any questions you can find me @plamb on the Drupal slack chat.

May 15, 2020 · Pierce Lamb · 12 min read

This is Part 2 in a two-part series where we detail how to create custom URLs for Drupal Views Exposed Filters. Part 1 covers how we create/update and delete these URLs; Part 2 covers how to load and process them.

If you have any questions you can find me @plamb on the Drupal slack chat.

Now that we have the original Exposed Filter paths correlated to custom paths in the path_alias table, let’s look at how to load them in the View and make sure they’re loading the right page when clicked on.

When I first worked on this problem, I was using a contrib module for Views called 'Better Exposed Filters' (BEF). I made this choice because I wanted to expose the filters as links (versus a

May 15, 2020 · Pierce Lamb · 8 min read

Part 1 covers when and how to generate custom URLs, Part 2 covers how to load and process these URLs.

If you have any questions you can find me @plamb on the Drupal slack chat.

A core feature of Drupal is the View. Views are pages that list content on a Drupal website. Despite that simple-sounding description, Views can be complex, especially for beginners. One feature we commonly want on a content listing page is the ability for an end user to filter the displayed content dynamically. For example, a car website might display all cars in a database on a listing page and an end-user might want to click an exposed filter called 'Tesla' to show only the Tesla models. Drupal provides this functionality out-of-the-box. Exposed filters in Drupal function by attaching query parameters to the base URL of the View, which the backend can use to appropriately filter the content. For example, if I have a View with the path /analyst-relations that displays content from large technology analysts, one exposed filter might be a link with the title Gartner. The path attached to the Gartner link will look like /analyst-relations?related_firm=5467. This query parameter, ?related_firm=5467, provides all the information Drupal needs to appropriately filter content. However, it is not a very nice-looking, descriptive URL. Ideally the link associated with the Gartner filter is something like /analyst-relations/firm/gartner.

I should note now that I am not an SEO expert and I don't know for certain if custom exposed filter links will affect ranking in search engines. However, when I click a link like /analyst-relations/firm/gartner I have a much better idea of what information will be contained on that page than if I click /analyst-relations?related_firm=5467. Since serving these URLs does not have a high performance cost and they provide a more user-friendly experience, I believe that is reason enough to serve them.

Our goal is to replace all default exposed filter links with custom, descriptive URLs. The first question is, how do we create the custom URLs programmatically? Each URL will need to be unique and based on the content(s) it is related to. One option would be to do this dynamically as a page with exposed filter links is being loaded. Another option is to generate and store the custom URL whenever the relevant content is created/updated/deleted. I preferred the second option as it feels safer, more performant, and Drupal 8/9 comes with the path_alias module which I believe fits this task. I’ll note that this decision is definitely up for debate.

Okay, so we're going to generate these custom URLs at CRUD time for relevant content(s). The quickest way to do that is, in a custom module, utilizing hook_entity_insert, hook_entity_update, and hook_entity_delete. From a technical debt perspective there may be a better way to do this, e.g. by extending Entity classes, but these hooks will get you to a proof-of-concept the quickest. Every time any Entity is created, updated or deleted, these hooks are going to fire. If our custom module is called custom_urls, in our custom_urls.module file we would have:

/**
 * Implements hook_entity_insert().
 */
function custom_urls_entity_insert(Drupal\Core\Entity\EntityInterface $entity) {
  _create_or_update_path_alias($entity);
}

/**
 * Implements hook_entity_update().
 */
function custom_urls_entity_update(Drupal\Core\Entity\EntityInterface $entity) {
  _create_or_update_path_alias($entity);
}

/**
 * Implements hook_entity_delete().
 */
function custom_urls_entity_delete(Drupal\Core\Entity\EntityInterface $entity) {
  _delete_path_alias($entity);
}

Inside _create_or_update_path_alias and _delete_path_alias, the first thing we'll do is narrow down to only the entities we care about, via a helper function called _is_relevant_entity. Exposed Filters are often based on Taxonomy Terms or specific Entity bundles. For our example, inside _is_relevant_entity we will narrow to only the Terms and the Entity bundle we care about:

function _is_relevant_entity(Drupal\Core\Entity\EntityInterface $entity) {
  $entity_arr = [
    'boolean' => FALSE,
    'old_path' => '',
    'new_path' => ''
  ];
  $maybe_term = $entity instanceof Drupal\taxonomy\Entity\Term;
  if ($maybe_term) {
    ...
  } elseif ($entity->bundle() == 'product') {
    ...
  }
  return $entity_arr;
}

$entity_arr will be used to carry information about whether the Entity is relevant, what the generated exposed filter path is, and what the custom URL will be. If you follow the control structure, you can see we're going to use it to determine what the boolean value should be; for our example, we care about Terms and Entities of type product. In our proof-of-concept, it would look something like this:

function _is_relevant_entity(Drupal\Core\Entity\EntityInterface $entity) {
  $entity_arr = [
    'boolean' => FALSE,
    'old_path' => '/analyst-relations',
    'new_path' => ''
  ];
  $maybe_term = $entity instanceof Drupal\taxonomy\Entity\Term;
  if ($maybe_term) {
    $relevant_taxonomies = [
      'related_topics' => '/topic?related_topic=',
      'related_companies' => '?related_firm='
    ];
    $taxonomy_name = $entity->bundle();
    $entity_arr['boolean'] = in_array($taxonomy_name, array_keys($relevant_taxonomies));
    $entity_arr['old_path'] = $entity_arr['old_path'] . $relevant_taxonomies[$taxonomy_name] . $entity->id();
  } elseif ($entity->bundle() == 'product') {
    $entity_arr['boolean'] = TRUE;
    $entity_arr['old_path'] = $entity_arr['old_path'] . '/product?related_product=' . $entity->id();
  }
  return $entity_arr;
}

As you can see, to get to a POC, I've done a lot of hardcoding here. In a fully general and safer solution, we'd load the View and get the old_path and the values in $relevant_taxonomies that way. However, via hardcoding I've generated the exact same paths that the View will create, e.g. /analyst-relations?related_firm=5467. Note that if you don't generalize this and the query keys or path in the View change (they are customizable), this will stop working.

Okay, so back to our _create_or_update_path_alias function. The beginning will look something like this:

function _create_or_update_path_alias($entity) {
  $raw_entity_arr = _is_relevant_entity($entity);
  if ($raw_entity_arr['boolean']) {
    //Update the path alias with the new URL
    $clean_entity_arr = _build_custom_url($entity, $raw_entity_arr);

We use the boolean key to make sure we have an Entity we care about. Next we generate the custom url in _build_custom_url. That function will look like this:

function _get_url_from_regex($title) {
  $replace_whitespace = preg_replace('/\s+/', '-', $title);
  $new_path_caboose = preg_replace('/[^a-zA-Z.-]/', '', $replace_whitespace);
  return $new_path_caboose;
}

function _build_custom_url($entity, $entity_arr) {
  $maybe_product = $entity->bundle() == 'product';
  $raw_entity_url = $entity->url();
  $entity_url_arr = explode('/', $raw_entity_url);
  if ($maybe_product) {
    //It's a product Node
    $new_path_train = '/analyst-relations/product/';
    if (array_key_exists(2, $entity_url_arr)) {
      $new_path_caboose = $entity_url_arr[2];
    } else {
      $new_path_caboose = _get_url_from_regex($entity->label());
    }
  } else {
    //It's a taxonomy term
    $old_path = $entity_arr['old_path'];
    $maybe_firm = strpos($old_path, 'firm') !== FALSE;
    if ($maybe_firm) {
      //Firm filter
      $new_path_train = '/analyst-relations/firm/';
    } else {
      //Topic filter
      $new_path_train = '/analyst-relations/topic/';
    }
    if (count($entity_url_arr) > 1 && $entity_url_arr[1] !== 'taxonomy') {
      $new_path_caboose = $entity_url_arr[1];
    } else {
      $new_path_caboose = _get_url_from_regex($entity->label());
    }
  }
  $new_path = $new_path_train . strtolower($new_path_caboose);
  $entity_arr['new_path'] = $new_path;
  return $entity_arr;
}

In this function, we attempt to create the custom URL from the $entity->url() attached to Products and Taxonomy Terms. If we're unable to, we pass the $entity->label() through some regexes. I've split the regexes into two inside _get_url_from_regex to make it easier to understand what is going on. We take the Entity's label and replace any whitespace in it with a dash, then remove any non-alphabetic characters (other than dots and dashes) from the result. This produces strings that should work as the end (the caboose) of the new path, replacing the id number from the old path. Then, whether we have a product or an appropriate taxonomy term, we create the first part (the train) of the custom URL. Again, this has been hardcoded for alacrity, but like above, in a general solution we'd load the View and create these. And as above, if the View's path changes this will stop working.
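
As a quick illustration (a hypothetical term in the related_companies vocabulary, assuming the label fallback is used rather than an existing alias), a Term labeled 'Forrester Research' would pass through the code like this:

// preg_replace('/\s+/', '-', 'Forrester Research')         -> 'Forrester-Research'
// preg_replace('/[^a-zA-Z.-]/', '', 'Forrester-Research')  -> 'Forrester-Research' (nothing to strip)
// strtolower() in _build_custom_url                        -> 'forrester-research'
// final custom URL                                         -> '/analyst-relations/firm/forrester-research'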

Okay so now we have an array that verifies we have a correct Entity, its old exposed filter path and the new custom path we want it to have. Now we are going to use the entityTypeManager() to query the path_alias table. Let’s view some more of the _create_or_update_path_alias function:

function _create_or_update_path_alias($entity) {
  $raw_entity_arr = _is_relevant_entity($entity);
  if ($raw_entity_arr['boolean']) {
    //Update the path alias with the new URL
    $clean_entity_arr = _build_custom_url($entity, $raw_entity_arr);
    $old_path = $clean_entity_arr['old_path'];
    $new_path = $clean_entity_arr['new_path'];
    $path_alias_conn = \Drupal::entityTypeManager()->getStorage('path_alias');
    $new_path_already_exists = $path_alias_conn->loadByProperties(['alias' => $new_path]);
    if (empty($new_path_already_exists)) {
      $maybe_path_alias = $path_alias_conn->loadByProperties(['path' => $old_path]);
      if (empty($maybe_path_alias)) {
        //Create path alias
      } else if (count($maybe_path_alias) == 1) {
        //Update path alias
      } else {
        //We've somehow returned more than one result for the old path. Something is wrong
        \Drupal::logger('custom_urls')->notice("The path: " . $old_path . ", is returning more than one result in path_alias");
      }
    } else {
      \Drupal::logger('custom_urls')->notice("The generated path: " . $new_path . ", already exists in path_alias. An entity with an identical title was likely created");
    }
  }
}

So we get the connection to the path_alias storage. First we test if the $new_path (the custom URL) already exists there. If it does, we don't do anything and send a message to the logger so we're aware that the current Entity is trying to create a custom URL that already exists. Then we check if the $old_path (the generated exposed filter path) is already in the path_alias table (note that because these paths contain the entity's id, they should only ever conflict in the rare case that, say, a Node and a Term on the same View have the same ID). If it is not there, we create a new path_alias entry using the $old_path and $new_path; if it comes back with 1 result, then we have an update and we update the alias to $new_path; otherwise we've somehow returned more than one result for the $old_path and we notify the logger. Here is the function completely filled out:

function _create_or_update_path_alias($entity) {
  $raw_entity_arr = _is_relevant_entity($entity);
  if ($raw_entity_arr['boolean']) {
    //Update the path alias with the new URL
    $clean_entity_arr = _build_custom_url($entity, $raw_entity_arr);
    $old_path = $clean_entity_arr['old_path'];
    $new_path = $clean_entity_arr['new_path'];
    $path_alias_conn = \Drupal::entityTypeManager()->getStorage('path_alias');
    $new_path_already_exists = $path_alias_conn->loadByProperties(['alias' => $new_path]);
    if (empty($new_path_already_exists)) {
      $maybe_path_alias = $path_alias_conn->loadByProperties(['path' => $old_path]);
      if (empty($maybe_path_alias)) {
        //Create path alias
        $new_path_ent = $path_alias_conn->create([
          'path' => $old_path,
          'alias' => $new_path,
          'langcode' => \Drupal::languageManager()->getCurrentLanguage()->getId()
        ]);
        $new_path_ent->save();
        //Add new URL to cache
        _cache_fancy_url($old_path, $new_path);
      } else if (count($maybe_path_alias) == 1) {
        //Update path alias
        $path_alias_obj = reset($maybe_path_alias);
        $path_alias_obj->set('alias', $new_path);
        $path_alias_obj->save();
        //Drop old URL from cache and add new one
        _cache_fancy_url($old_path, $new_path);
      } else {
        //We've somehow returned more than one result for the old path. Something is wrong
        \Drupal::logger('custom_urls')->notice("The path: " . $old_path . ", is returning more than one result in path_alias");
      }
    } else {
      \Drupal::logger('custom_urls')->notice("The generated path: " . $new_path . ", already exists in path_alias. An entity with an identical title was likely created");
    }
  }
}

But wait, another function snuck in there: _cache_fancy_url($old_path,$new_path). In Part 2 of this series, we will look at how to load and process the custom urls; doing this from the cache is definitely the fastest way to do that, so we create/modify cache entries here. For clarity, I will show that function here:

function _cache_fancy_url($old_path, $new_path) {
  $default_cache = \Drupal::cache();
  $old_path_result = $default_cache->get($old_path);
  if ($old_path_result !== FALSE) {
    //Old path in cache, likely a Term or Product has been modified
    //delete the old entry
    $default_cache->delete($old_path);
  }
  //Add the new entry
  $default_cache->set($old_path, $new_path, Drupal\Core\Cache\CacheBackendInterface::CACHE_PERMANENT);
}

Caching here isn't too important, because when we load the custom URLs, if they aren't in the cache (perhaps after a cache flush) we will set them there; but for the extra performance a 5-line function imparts, it's worth it.

The delete hook is highly similar to the first two. I’ll paste it here and I imagine if you’ve read the above not much explanation is needed:

function custom_urls_entity_delete(Drupal\Core\Entity\EntityInterface $entity) {
  $raw_entity_arr = _is_relevant_entity($entity);
  if ($raw_entity_arr['boolean']) {
    //delete the associated path alias
    $old_path = $raw_entity_arr['old_path'];
    $path_alias_conn = \Drupal::entityTypeManager()->getStorage('path_alias');
    $maybe_path_alias = $path_alias_conn->loadByProperties(['path' => $old_path]);
    if (count($maybe_path_alias) == 1) {
      $path_alias_conn->delete($maybe_path_alias);
      _delete_from_cache($maybe_path_alias);
    } else {
      \Drupal::logger('custom_urls')
        ->notice("The path: " . $old_path . ", was set to delete from path_alias, but it returned " . count($maybe_path_alias) . " results");
    }
  }
}

So now, every time an Entity we care about in our View with exposed filters is created/updated/deleted, we are also creating/updating/deleting and caching its associated custom URL. I prefer this way of creating the custom URLs over creating them dynamically when the page loads, as I feel that executing this extra code at entity CRUD time is more performant than at page load. While I know path_alias was intended for URLs like /node/1, I feel that this usage of the path_alias table matches its general intention: to provide nice aliases for non-nice paths.

We are one big step closer to custom URLs on a View with exposed filters, check out Part 2 to see how to load and process these custom URLs.
