WSCCI Web Services Format Sprint Report

A small team met in Paris at the office of Commerce Guys from June 3-5th to discuss Drupal's web services serialization and syndication format needs. In short, "OK, we are going to have all of this fun new routing capability, now what do we do with it?" More specifically, how do we go about serializing Drupal data for consumption by remote programs (either other Drupal sites as in the case of content staging, other non-Drupal sites, client-side applications, or mobile apps), and what protocols and APIs do we make available to manipulate that data?

In attendance were:

The raw notes from the sprint are available as a Google Doc, but as usual are rather disjoint and likely won't make sense if you weren't there. A more complete report is below.

Executive summary

There are no clear and obvious winners in this space. All of the available options have different serious limitations, either in their format compatibility with Drupal, their access protocols (or lack thereof), or available mature toolchain.

Our recommendation at this time is to make JSON-LD our primary supported web services data format. It is quite flexible, and supports the self-discovery capabilities that we want to support. What it appears to lack is the set of tools and standards provided by the Atom and AtomPub specifications, which provide everything we want except for an actual data payload format. For use cases where the capabilities of Atom (such as Pubsubhubbub support) are necessary, wrapping JSON-LD strings in an Atom wrapper is ugly but technically possible. Alternatively, the JCR/PHPCR XML serialization format can serve as a forward-looking XML-based serialization when Atom functionality and true hypermedia are required.

This will require changes to the Entity system, most of which are already in progress. However, this provides new impetus to complete these changes in a timely manner. In short:

  • "Fields" get renamed to "Properties", and become the one and only form of data on an Entity. Any non-Property data on an Entity will not be supported in any way (except for IDs).
  • Properties become classed objects and include what are currently fields plus what is currently raw entity data (e.g., {node}.uid).
  • Entity Reference (or similar) gets moved into core.
  • All entity relationships are considered intrinsic on one side (the side with a reference field) and extrinsic on the other (the side referenced). That is, all relationships are mono-directional.
  • Every relationship may have a virtual Property assigned to the entity that is linked to, which stores no data but provides a mechanism to look up "all entities that reference me". That is, back-references.
  • Content metadata (e.g., the sticky bit on nodes, OG memberships, etc.) is implemented as a foreign entity with a reference.
  • The responsibility for entity storage will be moved from the Field/Property level to the Entity level. That is, we eliminate per-field storage backends.

Background

There are two broad categories of web services to consider: s2s (Server to Server, Drupal or otherwise) and s2c (Server to Client, where client could be a mobile app, web app, client-side editor like Aloha or CreateJS, etc.). There is of course plenty of overlap. Both markets have different existing conventions, which frequently are not entirely compatible as they have different histories and priorities.

Entity API Revisions

In order to support generic handling of Entity->serialized translations, we need to standardize and normalize how entities are structured. Currently in Drupal 7 entities are largely free-form naked data structures. While Fielded data has a semi-regular form, its API is inadequate and much data is present on an entity via some other means. In order to handle serialization of entities, we need to either:

  1. Allow modules to implement per-property, per-serialization format bridge code. That would result in n*m elements that would need to get written by someone (whether hooks or objects or plugins or whatever).
  2. Provide a single standard interface by which all relevant data on an entity can be accessed, so that a generic implementation may be written to handle all Property types.

Given the extremely high burden the first option would place on module developers, we felt strongly that the second option would be preferable and result in better DX.

The ongoing "Entity Property Metadata in core" effort has already begun this process. What we describe here is not a radical change, but more a tweaking of ongoing work.

The renaming of "Fields" to "Properties" is largely for DX. The word "field" means three different things in Drupal right now: A data fragment on an entity, a column in an SQL table, and a part of a record in Views. With Views likely moving into core for Drupal 8, eliminating one use of the term will help avoid confusion.

We therefore have a data model that looks as follows:

Entity [ Property [ PropertyItem [ primitive data values ] ] ]

Here "Property" is what Drupal 7 called a "Field", and "PropertyItem" is what Drupal 7 called an "Item". This is largely just a rename.

That is, an Entity object is a glorified array of Property objects. A Property object is a glorified array of PropertyItem objects. A PropertyItem object contains some number of primitive values (strings and numbers), but no nested complex data structures. (An array or stdClass object may be PHP-serialized to a string as now, but the serialization system will treat that as an opaque string and not support any additional sub-value structure.)

Additionally, each Entity class and Property class will be responsible for identifying its metadata on demand via a method. That is, much of the information currently captured in hook_entity_info() will move into a metadata() method of the Entity class; the information currently captured in hook_field_schema() and some of that captured in hook_field_info() will move into a metadata() method of the Property class. That allows the necessary information to be available where it is needed, without having to pre-define giant lookup arrays. It also allows that information to vary per instance, as field schema already does now.
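As a rough illustration of the metadata() idea, here is a minimal sketch. The TextProperty class, its constructor argument, and the array keys returned are assumptions made up for this example; nothing here is an agreed API.

```php
<?php
// Hypothetical sketch: a Property that reports its own metadata on demand.
interface PropertyInterface {
  // Returns the stuff that was in hook_field_schema().
  public function metadata();
}

class TextProperty implements PropertyInterface {
  // Per-instance setting (assumed): maximum length of the stored string.
  protected $maxLength;

  public function __construct($maxLength = 255) {
    $this->maxLength = $maxLength;
  }

  public function metadata() {
    // The schema can vary per instance, as field schema does today.
    return array(
      'columns' => array(
        'value' => array('type' => 'varchar', 'length' => $this->maxLength),
      ),
    );
  }
}

$short = new TextProperty(60);
$long = new TextProperty();
echo $short->metadata()['columns']['value']['length'], "\n";
echo $long->metadata()['columns']['value']['length'], "\n";
```

The point of the sketch is only that the description lives on the object that owns the data, so no global lookup array has to be built ahead of time.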

Entity and Property classes will implement PHP magic methods for easier traversal. A preliminary, partial, demonstration-only implementation is as follows:

<?php
class Entity implements IteratorAggregate {
  // Keyed array of Property objects.
  protected $properties = array();

  public function getProperties() {
    return $this->properties;
  }

  public function getIterator() {
    return new ArrayIterator($this->properties);
  }

  public function __get($name) {
    // Returns the Property named $name.
    return $this->properties[$name];
  }

  public function __set($name, $value) {
    // Sets the Property named $name.
    $this->properties[$name] = $value;
  }
}

interface PropertyInterface {
  // Returns the stuff that was in hook_field_schema().
  public function metadata();
}

interface PropertyItemInterface { }

interface ReferencePropertyItemInterface extends PropertyItemInterface {
  public function entity();
}

interface ReferencedPropertyItemInterface extends PropertyItemInterface {
  public function getReferences();
}

class Property implements PropertyInterface, ArrayAccess, IteratorAggregate {
  // Indexed array of PropertyItem objects.
  protected $items = array();

  public function metadata() {
    // Demonstration only: a real Property would describe its schema here.
    return array();
  }

  public function offsetExists($offset) {
    return isset($this->items[$offset]);
  }

  public function offsetGet($offset) {
    // Returns a PropertyItem object.
    return $this->items[$offset];
  }

  public function offsetSet($offset, $value) {
    if ($offset === NULL) {
      $this->items[] = $value;
    }
    else {
      $this->items[$offset] = $value;
    }
  }

  public function offsetUnset($offset) {
    unset($this->items[$offset]);
  }

  // On Properties, as a convenience, the [0] is optional. If
  // you just access a value name, you get the 0th item. That is
  // useful for properties that you know for sure are single-value. However,
  // because the [] version is always there this will never fatal out the way it
  // would if the data structure itself actually changed.
  public function __get($name) {
    return $this->items[0]->$name;
  }

  // Method calls get the same shorthand, which the
  // $node->author->entity() usage example below relies on.
  public function __call($name, $arguments) {
    return call_user_func_array(array($this->items[0], $name), $arguments);
  }

  public function getIterator() {
    return new ArrayIterator($this->items);
  }
}

class Node extends Entity {
  // Convenience method.
  public function author() {
    // Could also micro-optimize and call offsetGet(0).
    return $this->author[0]->entity();
  }
}

class PropertyItem implements PropertyItemInterface {
  // The internal primitive values.
  protected $primitives = array();

  public function __get($name) {
    return $this->primitives[$name];
  }

  public function __set($name, $value) {
    $this->primitives[$name] = $value;
  }

  public function processed($name) {
    // This is pseudo-code only; the real implementation here will not call any functions directly
    // but use something injected as appropriate.  We have not figured out that level of detail yet.
    return filter_format($this->primitives[$name]);
  }
}

class ReferencePropertyItem extends PropertyItem implements ReferencePropertyItemInterface {
  public function entity() {
    // Look up the ID of the entity we're referencing to, load it, and return it.
  }
}

// Individual properties can totally add their own useful methods as appropriate. This is encouraged.
class DateProperty extends PropertyItem {
  public function value() {
    return new DateTime($this->primitives['date_string'], new DateTimeZone($this->primitives['timezone']));
  }
}

class ReferencedPropertyItem extends PropertyItem implements ReferencedPropertyItemInterface {
  public function getReferences() {
    // Returns a list of all entities that referenced TO this entity via this property.
  }
}

// For values that do not store anything, but calculate values on the fly.
interface CalculatedPropertyInterface { /* ... */ }

$entity = new Entity();
foreach ($entity as $property) {
  // $property is an instance of Property, always.
  foreach ($property as $item) {
    // $item is an instance of PropertyItemInterface, always.
    if ($item instanceof ReferencePropertyItemInterface) {
      $o = $item->entity();
      // do something with $o.
    }
    // Do something with $item.
  }
}

// Usage examples. $node is assumed to be an already-loaded Node.
$node
  // __get() returns a Property
  ->updated
  // ArrayAccess returns a PropertyItem
  [0]
  // __get() returns the internal primitive string called timezone.
  ->timezone;

$node
  // __get() returns a Property
  ->updated
  // ArrayAccess returns a PropertyItem
  [0]
  // __set() assigns the value of the internal timezone primitive.
  ->timezone = 'America/Chicago';

$node
  // __get() returns a Property
  ->author
  // If you leave out [], it defaults to 0
  // The entity method returns the referenced user object
  ->entity()
  // If you leave out the [], it defaults to 0
  ->name
  // The actual string name value.
  ->value;

// In practice, often use the utility methods.
$node->author()->label();
?>

By default, when you load an entity you will specify the language to use. That value will propagate down to all Properties and Items, so by default module developers will not need to think about language in each call. If a module developer does care about specific languages further down, additional non-magic equivalent methods will be provided that allow for specific languages to be specified. The details here will have to be worked out with Gabor and the rest of the i18n team.
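Since the details are still open, the following is only a sketch of how a load-time language could become the default for magic access while an explicit method remains available. The TranslatableProperty class, its storage layout, and the getForLanguage() method name are all assumptions for illustration.

```php
<?php
// Hypothetical sketch: the language chosen at load time is the default for
// magic access, with a non-magic method for callers that need another one.
class TranslatableProperty {
  // Values keyed by language code (assumed storage layout).
  protected $values;
  // Language the owning entity was loaded in.
  protected $language;

  public function __construct(array $values, $language) {
    $this->values = $values;
    $this->language = $language;
  }

  // Magic access uses the load-time language.
  public function __get($name) {
    return $this->getForLanguage($name, $this->language);
  }

  // Non-magic equivalent for code that cares about a specific language.
  public function getForLanguage($name, $language) {
    return $this->values[$language][$name];
  }
}

$title = new TranslatableProperty(
  array(
    'de' => array('value' => 'Das Kapital'),
    'en' => array('value' => 'Capital'),
  ),
  'de'
);
echo $title->value, "\n";
echo $title->getForLanguage('value', 'en'), "\n";
```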

When defining a new Entity Type, certain Properties may be defined as mandatory for the structure; the Title and Updated properties for nodes, for instance. These properties will be hard-coded into the definition of the Entity Type, and may be stored differently by the entity storage engine. However, to consuming code that is processing an entity there is no difference between a built-in Property and a user-added Property. An Entity Type is also free to define itself as not allowing user-added Properties (effectively mirroring non-fieldable entities today).

While objects are not as expensive in PHP as they once were back in the PHP 4 days, the number of specialty method calls above MAY lead to performance concerns. We do not anticipate it being a large issue in practice. If it does become one, more direct, less magical methods may be used in high-cost critical path areas (such as calling offsetGet() directly rather than using []) to minimize the overhead.

Extrinsic information

Currently there are a number of values on some entities that, in this model, do not "belong to" that entity. The best examples here are the sticky and promote flags on nodes. This data is properly extrinsic to the node, but for legacy reasons is still stored there. That is information that often should not be syndicated. Organic Groups membership is another example of extrinsic data.

We discussed the need to therefore represent extrinsic data separately from Properties. However, developing yet-another-api seemed like a dead-end. Instead, we decided that the way to resolve intrinsic vs. extrinsic data was as follows:

  • All Properties are intrinsic to the Entity the Property is on.
  • A ReferencedProperty (backlink) is not a part of the Entity itself. That is, the Entity knows about the existence of such linked data, but the data in question is extrinsic to it.
  • Extrinsic data on an Entity should be implemented as a separate Entity type, which references to the Entity it describes.
  • If data links two entities but is extrinsic to both, then an intermediary entity may have a reference to both entities.

For example, core will introduce a BinaryAttribute entity type (or something like that). It will contain only two values: its own ID, and a single-value ReferenceProperty to the entity it describes. There will be two bundles provided by core: Sticky (references to nodes) and Promoted (references to nodes). To mark a node as Sticky, create a BinaryAttribute entity, bundle Sticky. To mark it unsticky, delete that entity. Same for Promoted. (Note: Additional metadata fields, such as the date the sticky was created or the user that marked it sticky, may also be desired. Unmarking an entity may also be implemented not by deleting the flagging entity but having a boolean field that holds a yes or no. That is an implementation detail that we did not explore in full as it is out of scope for this document.)
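To make the create/delete mechanics concrete, here is a minimal in-memory sketch. BinaryAttributeStore and its method names are hypothetical stand-ins for whatever entity storage and lookup API core ends up with; only the "create to mark, delete to unmark, look up by back-reference" behavior comes from the description above.

```php
<?php
// Hypothetical in-memory stand-in for entity storage; not a Drupal API.
class BinaryAttributeStore {
  // Each record plays the role of one BinaryAttribute entity:
  // array('bundle' => ..., 'target' => referenced entity ID).
  protected $attributes = array();

  // Mark: create a BinaryAttribute entity of the given bundle.
  public function mark($bundle, $entityId) {
    if (!$this->isMarked($bundle, $entityId)) {
      $this->attributes[] = array('bundle' => $bundle, 'target' => $entityId);
    }
  }

  // Unmark: delete the BinaryAttribute entity.
  public function unmark($bundle, $entityId) {
    foreach ($this->attributes as $key => $attribute) {
      if ($attribute['bundle'] === $bundle && $attribute['target'] === $entityId) {
        unset($this->attributes[$key]);
      }
    }
  }

  // Back-reference lookup: does an attribute of this bundle point at the entity?
  public function isMarked($bundle, $entityId) {
    foreach ($this->attributes as $attribute) {
      if ($attribute['bundle'] === $bundle && $attribute['target'] === $entityId) {
        return TRUE;
      }
    }
    return FALSE;
  }
}

$store = new BinaryAttributeStore();
$store->mark('sticky', 42);
$stickyAfterMark = $store->isMarked('sticky', 42);
$promoted = $store->isMarked('promoted', 42);
$store->unmark('sticky', 42);
$stickyAfterUnmark = $store->isMarked('sticky', 42);
var_dump($stickyAfterMark, $promoted, $stickyAfterUnmark);
```

Note that the node itself is never touched: the flag lives entirely in the referencing entity, which is what makes it easy to leave out of a syndicated copy.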

After speaking with Workbench Moderation maintainer Steve Persch, we concluded that some such metadata (such as published) is relevant not to entities but to entity revisions. Fortunately that is easy enough to implement by providing an EntityReference Property and an EntityVersionReference Property, the latter of which references by version ID while the former references by entity ID. Which is appropriate in which case is left as an exercise to the implementer of each case.

Although not the intent, this effectively ports the functionality of Flag module into core, at least for global flags. Only a UI would be missing (which is out of scope for this document). It also suggests how per-user flags could be implemented: A UserBinaryAttribute entity type that references to both an entity and a user (specifically).

These changes would open up a number of interesting possibilities, such as much more robust content workflows, the ability to control access to the Sticky and Promoted values without "administer nodes" or even without node-edit capability, etc. We did not fully explore the implications of this change, other than to decide we liked the possibilities that it opened.

Implications for services

The primary relevant reason for all of this refactoring is to normalize the data model sufficiently that we can automate the process of serializing entities to JSON or XML. As above, we want to avoid requiring n*m (or worse, i*j*k) bridge components for each entity type, property type, and output format. It also neatly separates intrinsic and extrinsic data in a way that allows us to include the latter or not as the situation dictates. The other DX and data modeling benefits that it implies are very nice gravy, and exploring the implications of those changes and what additional benefits they offer is left as an exercise for the reader (and for later teams).
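A small sketch of what that automation buys us: with a uniform Entity / Property / PropertyItem shape, one generic routine can walk any entity. The classes below are stripped-down stand-ins, and the getPrimitives() accessor and entity_to_array() function are names invented for this example.

```php
<?php
// Minimal stand-ins for the Entity / Property / PropertyItem model.
class PropertyItem {
  protected $primitives;
  public function __construct(array $primitives) { $this->primitives = $primitives; }
  // Assumed accessor that exposes all primitive values at once.
  public function getPrimitives() { return $this->primitives; }
}

class Property implements IteratorAggregate {
  protected $items;
  public function __construct(array $items) { $this->items = $items; }
  public function getIterator(): Iterator { return new ArrayIterator($this->items); }
}

class Entity implements IteratorAggregate {
  protected $properties;
  public function __construct(array $properties) { $this->properties = $properties; }
  public function getIterator(): Iterator { return new ArrayIterator($this->properties); }
}

// One generic routine handles every Property type; no per-type bridge code.
function entity_to_array(Entity $entity) {
  $out = array();
  foreach ($entity as $name => $property) {
    foreach ($property as $item) {
      $out[$name][] = $item->getPrimitives();
    }
  }
  return $out;
}

$node = new Entity(array(
  'title' => new Property(array(
    new PropertyItem(array('value' => 'Das Kapital')),
  )),
  'updated' => new Property(array(
    new PropertyItem(array('date_string' => '2012-06-05', 'timezone' => 'Europe/Paris')),
  )),
));
$json = json_encode(entity_to_array($node));
echo $json, "\n";
```

A new Property type needs no serializer of its own here, as long as it is built from the same primitives.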

Syndication formats

With a generically mappable data model, we then turned to the question of what to do with it. We identified a number of needs and use cases that we needed to address:

  • Exposing entities in a machine-readable format
  • Exposing collections of entities in a machine-readable format
  • Exposing entities in both raw form suitable for round-tripping back to a node object and in a "processed" format that is safe for anonymous user consumption. (E.g., with public:// URLs converted to something useful, with text formats applied to textual data, etc.)
  • A way to resolve relationships between entities such that multiple related entities could be syndicated in a single string or a series of related strings (linked by some known mechanism). E.g., A node with its author object embedded or not, or with tags represented as links to tag entities or as inline term objects.
  • Every entity (even those that do not have an HTML URI) needs to have a universally accessible canonical URI.
  • Semantically correct use of HTTP hypermedia information (GET, POST, DELETE, etc.; PUT and PATCH are quirky and of questionable use).
  • Data primitives we must support: String, int, float, date (not just as a string/int), URI (special case of string), duration.
  • Compound data types (Fields) are limited to being built on those data primitives; includes "string (contains html)".
  • Data structure inspection: Given "node of type X", what are its fields? Given "field of type Y", what are its primitives?
  • While we were not directly concerning ourselves with arbitrary non-entity data, a format that lent itself to other uses (such as Views that did not map directly to a single entity) is a strong benefit.

Given that set of requirements, we evaluated a number of existing specifications. All of them had serious deficiencies vis a vis the above list.

CMIS
CMIS is a big and robust specification. However, it consists mainly of optional feature sets, which would allow us to implement only a portion of CMIS and punt on the rest of it. CMIS' data model is very traditional: Documents are very simple creatures, and are organized into Directories to form a hierarchy.

CMIS also includes a number of different bindings for manipulation. The basic web bindings are designed to closely mimic HTML forms, right down to requiring a POST for all manipulation operations. They also required very specific value structures that we felt did not map to how Drupal entities are structured nor to how Drupal forms work, making it of little use.

CMIS also includes bindings for AtomPub, which is a much more hypermedia-friendly high-level API for communication. CMIS has no innate concept of internationalization, so that needs to be emulated in the data with separate data properties.

CMIS is based in XML, although a JSON variant is in draft form at this time.

Atom
Atom is an XML-based envelope format. That is, it does not define the format of a single item. Rather, it defines a mechanism for collecting a set of items together feed-like, for defining links to related content, for paging sets of content, etc. The structure of a single content item is undefined, and may be defined by the user. Atom also includes a number of useful extensions, in particular Pubsubhubbub and Tombstone, which allow for push-notifications and push-deletion. That is extremely useful for many content sharing and content syndication situations.

There are a couple of JSON-variants of Atom, including one from Google, but none seem to have any market traction.

AtomPub
AtomPub is a separate IETF spec from Atom the format, although the two are designed to complement each other. AtomPub defines the HTTP-level usage of Atom, as well as the semantic meaning of various links to embed within an Atom document (e.g., link rel="edit", which identifies the URI used to retrieve, update, or delete an entry).

JSON-LD
JSON-LD is not so much a format as a meta-format: it is a way to represent RDF-like semantic information in a JSON document, without firmly specifying the structure of the JSON document itself. That makes it much more flexible than CMIS in terms of supporting an existing data specification (like Drupal's), but also means we need to spend the time to define which semantics we're actually using. That includes determining what vocabularies to use where, and which to custom-define for Drupal.

Our initial thought was to try to map entities as above to CMIS, so that we could leverage the AtomPub bindings that were already defined. We figured that would result in the least amount of "we have to invent our own stuff". However, we determined that would be infeasible. Documents in CMIS are too limited to represent a Drupal entity, even in the more rigid form described above. We would have to map individual Properties to CMIS Documents, and Entities and Language would have to be represented as Directories. However, that would make representing an Entity in a single XML string quite difficult, and/or require custom extensions to the CMIS format. At that point, there's little advantage to using CMIS in the first place.

While CMIS may work very well for low-complexity highly-organized data such as a Document Repository like Alfresco, it is less well suited to highly-complex but low-organization data such as Drupal.

Atom/AtomPub, while really nice and offering almost everything we want, are missing the one most important piece of the puzzle: They are by design mum on the question of the actual data format itself.

We then turned to JSON-LD. It took a while to wrap our heads around it, but once we understood what it was trying to do we determined that it was possible to implement a Drupal entity data model in JSON-LD. While not the most pristine, it is not too bad. We developed a few prototypes before speaking with Lin Clark and ending up with the following prototype implementation:

{
  "@context": {
    "@language": "de",
    "ex": "http://example.org/schema/",
    "title": "ex:node/title",
    "body": "ex:node/body",
    "tags": "ex:node/tags"
  },
  "title": [
    {
      "@value": "Das Kapital"
    }
  ],
  "body": [
    {
      "@value": "Ich habe Durst."
    }
  ],
  "tags": [
    {
      "@id": "http://example.com/taxonomy/term/1",
      "@type": "ex:TaxonomyTerm/Tags",
      "title": "Wasser"
    }
  ]
}

This is still preliminary and will certainly evolve but should get the basic idea across.
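For a sense of how little code is needed to emit a document of that shape, here is a rough sketch. The build_jsonld() function and the layout of its input array are assumptions for illustration; the "ex:node/" term prefix simply mirrors the prototype above.

```php
<?php
// Hypothetical helper: builds a JSON-LD document shaped like the prototype.
function build_jsonld(array $values, $language, $schema_base) {
  $doc = array(
    '@context' => array('@language' => $language, 'ex' => $schema_base),
  );
  foreach ($values as $name => $items) {
    // Map each property name to a term in the example vocabulary.
    $doc['@context'][$name] = 'ex:node/' . $name;
    $doc[$name] = $items;
  }
  return $doc;
}

$doc = build_jsonld(
  array(
    'title' => array(array('@value' => 'Das Kapital')),
    'body' => array(array('@value' => 'Ich habe Durst.')),
    'tags' => array(array(
      '@id' => 'http://example.com/taxonomy/term/1',
      '@type' => 'ex:TaxonomyTerm/Tags',
      'title' => 'Wasser',
    )),
  ),
  'de',
  'http://example.org/schema/'
);
$json = json_encode($doc, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
echo $json, "\n";
```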

Of specific note, JSON-LD has native support for language variation. It's imperfect, but should be adequate to represent Drupal's multi-lingual entities.

Defining what the semantic vocabularies in use will be is another question. Our conclusion there is that the schema information provided by a Property implementation should also include the vocabulary and particular semantics that field should use.

That is not actually as large a burden as it sounds. In most cases it will be reasonably obvious, once standards are developed. For instance, date fields should use ical. In cases where multiple possible vocabularies exist, a Property can make it variable in the same fashion as the field schema itself is currently variable, but only on Property creation (just as it is now). If no vocabulary is specified, it falls back to generic default "text" and "number" semantics.
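The fallback rule could look something like the following sketch. The property_vocabulary() function, the 'vocabulary' and 'type' metadata keys, and the 'ical:dtstart' term are assumptions chosen for the example, not a settled design.

```php
<?php
// Hypothetical: pick the semantic term for a property from its metadata,
// falling back to generic "text" / "number" semantics when none is given.
function property_vocabulary(array $metadata) {
  if (!empty($metadata['vocabulary'])) {
    return $metadata['vocabulary'];
  }
  $numeric_types = array('int', 'float');
  return in_array($metadata['type'], $numeric_types, TRUE) ? 'number' : 'text';
}

// A date property that declares its vocabulary (example term only).
$date = property_vocabulary(array('type' => 'date', 'vocabulary' => 'ical:dtstart'));
// Properties that declare nothing fall back to the generic semantics.
$count = property_vocabulary(array('type' => 'int'));
$note = property_vocabulary(array('type' => 'string'));
echo $date, ' ', $count, ' ', $note, "\n";
```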

As a nice side-effect, this bakes RDF-esque semantics into our data model at a basic level, which should keep all of the semantic-web fans happy. It also will ease integration with CreateJS, VIE, and similar rich client-side editors that can integrate with Aloha, which is already under consideration for Spark and potentially Drupal 8.

This does not, of course, provide the REST/hypermedia semantics we need. As far as we were aware there is no JSON-based hypermedia standard. There are a couple of suggested proposed standards, but none that is actually a standard standard.

Symfony Live Addendum

Following the Sprint, Larry attended Symfony Live Paris, the main developer conference for the Symfony project at which he was a speaker. There, Larry was able to do some additional research with domain experts in this area.

One of the keynote speakers was David Zuelke of Agave, and the topic was (surprise!) REST and Hypermedia APIs. The session video is not yet online, but the slides were 90% the same as this presentation. It is recommended viewing for everyone in this thread. In particular, note the hypermedia section that starts at slide 95. One of the key take-aways from the session (echoed in other articles that we've checked as a follow-up), is that we're not the only ones with trouble mapping JSON to Hypermedia. It just doesn't do it well. XML is simply a better underlying format for true-REST/HATEOAS functionality, and the speaker encouraged the audience to respond to knee-jerk JSON preference with "tough, it's the wrong tool."

After the session, David acknowledged that the situation is rather suboptimal right now (XML is better for Document representation, JSON for Object representation, and we need to do both).

Larry also spoke with Henri Bergius, Midgard developer, author of CreateJS, and future DrupalCon Munich speaker. Henri pointed out that the JCR/PHPCR standard (Java Content Repository and its PHP-based port) does have its own XML serialization format independent of CMIS. After a brief look, that format appears much more viable than CMIS although additional research is needed. It is defined in the JCR specification, section 6.4.

Assuming JCR/PHPCR's XML serialization can stand up to further scrutiny, particularly around multi-lingual needs, it would be a much more viable option for true-HATEOAS behavior as we could easily wrap it in Atom/AtomPub for standardized linking, flow control, subscription, and all of the other things HATEOAS and Atom offer. While Atom would allow JSON-LD to be wrapped as the payload as well, wrapping JSON-LD in Atom would require both producers and consumers to implement both an Atom and a JSON-LD parser, regardless of their language. That would be possible, but sub-optimal.
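The two-parser cost is easy to see in a sketch. The example below uses PHP's DOM extension and json_encode()/json_decode(); the entry values and the application/ld+json media type are assumptions for illustration, and the entry is deliberately minimal (a complete Atom entry would also need an author, among other things). Because the payload is neither XML nor text, Atom requires it to be Base64-encoded inside atom:content.

```php
<?php
// Producer side: wrap a JSON-LD string in a (minimal) Atom entry.
$jsonld = json_encode(array(
  '@id' => 'http://example.com/node/1',
  'title' => array(array('@value' => 'Das Kapital')),
));

$atom_ns = 'http://www.w3.org/2005/Atom';
$dom = new DOMDocument('1.0', 'UTF-8');
$entry = $dom->createElementNS($atom_ns, 'entry');
$dom->appendChild($entry);
$entry->appendChild($dom->createElementNS($atom_ns, 'id', 'http://example.com/node/1'));
$entry->appendChild($dom->createElementNS($atom_ns, 'title', 'Das Kapital'));
$entry->appendChild($dom->createElementNS($atom_ns, 'updated', '2012-06-05T12:00:00Z'));

// Non-XML, non-text media types are carried Base64-encoded.
$content = $dom->createElementNS($atom_ns, 'content', base64_encode($jsonld));
$content->setAttribute('type', 'application/ld+json');
$entry->appendChild($content);
$xml = $dom->saveXML();

// Consumer side: an XML parser first, then a JSON parser.
$in = new DOMDocument();
$in->loadXML($xml);
$payload = $in->getElementsByTagNameNS($atom_ns, 'content')->item(0)->textContent;
$decoded = json_decode(base64_decode($payload), TRUE);
echo $decoded['title'][0]['@value'], "\n";
```

Every consumer, in whatever language, would have to repeat the second half of this: parse the envelope as XML, decode the payload, then parse it again as JSON.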

At this time we are not firm on this conclusion, but given the varied needs of different use cases we are leaning toward recommending the use of PHPCR-in-Atom and JSON-LD as twin implementations. Attempts at implementing both options will likely highlight any flaws in either approach that cannot be determined at this time. Whether one or both ends up in core vs. contrib should be simply a matter of timing and resource availability, as being physically in core should not provide any magical architectural benefit. (If it does, then we did it wrong.) That said, the Entity API improvements discussed above are the same regardless of format, and offer a variety of additional benefits as well.

Acknowledgements

Thank you to everyone who attended the sprint, including those who just popped in briefly during the Tuesday biweekly WSCCI meeting. Thank you also to Steve Persch and Lin Clark for their impromptu help. Thanks to Acquia for sponsoring travel expenses for some attendees. And of course thank you to Commerce Guys for being such wonderful hosts, and to Sensio Labs for bringing Larry in to speak for Symfony Live as that's how we were able to have this sprint in the first place.

Onwards!
