Apr 28 2009
Apr 28

The Field API is major new functionality in Drupal 7 to provide custom data types just like CCK did for previous versions of Drupal. Field API became part of Drupal 7 February, 2009, but there is still plenty of work to do. We need your help, and we’ll help you learn what you need to know to contribute.

The Field API is a great way to get yourself involved directly in Drupal core development. We realize, however, that Field API is a big, new, scary chunk of code that is not the easiest thing to jump into. The Field API team wants to help you get up to speed, so we’ve created the Field API mentoring process. Here’s our promise: If you assign a Field API issue to yourself and take responsibility for it, the Field API mentors will help you learn what you need by participating in the issue and answering questions on IRC.

We’ve created a Field API contributor page at drupal.org that describes (most of) the open issues and their degree of difficulty. If you want to grab one but have some questions about that part of Field API, ask about it over at Field API: What do you want to know? and we can add it to the documentation.

Apr 27 2009
Apr 27
Field API: What would you like to know? | Barry Jaspan

The Field API is major new functionality in Drupal 7 to provide custom data types just like CCK did for previous versions of Drupal. The Field API team has written fairly thorough API documentation but that is more useful as reference material than as a tutorial or HOWTO handbook.

So, here’s your chance to tell the Field API team what you want to know how to do. Comment on this post telling us what we should explain.

One option is to give us the title of HOWTO pages you’d like to read, such as:

  • What is the difference between a Field, Bundle, and Field Instance?
  • How do I add a Text field to a node type?
  • How do I add an Image field to users?
  • How do I get a list of all defined Fields and Field Instances?
  • etc.

Special thanks to chx for having the idea and offering to write some of the articles! (No good deed goes unpunished…)

Apr 25 2009
Apr 25

This is part five of my series, The DX Files: Improving Drupal Developer Experience. Today I am simply posting, verbatim, a message that a potential new Drupal developer sent me. The point here is just to understand how Drupal's DX influences the way it is perceived and the kind of people that want to work with it.

Hi Barry,

I am a new Drupal user/aspiring developer. I stumbled across your presentation "Developer Experience: Because Coders are People Too" on the DrupalCon 2008 Szeged site just last week. I watched the video and just nodded my head in agreement the whole time. Thank you! This validated my own opinions ­ there is something wrong (or at least unusual) about Drupal; it's not just me not getting it.

Anonymous arrays (the poor man's OOP) and un-typed arguments seem like a terrible idea to me. I almost react viscerally against them. I can understand why some have rejected Drupal. Your friend who will not work on Drupal again ­ I am almost there myself.

A little about my background: I have a Berkeley CS degree, worked for HP for 5+ years doing software development and now work at a small company that I cofounded a couple of years ago. I have lots of OOP experience. I have worked on a few 50,000 line+ projects in C++ and Java over the years. To do these medium to largish size projects, it has been my experience that one must do "defensive" programming. Part of this is knowing that the compiler is your friend. One should write code so that mistakes can be caught by the compiler. In Drupal, there are string literals everywhere and anonymous arrays ­ basically ticking time bombs waiting to blow off limbs. I find this frightening. I would never do this in C++ or Java.

I am just starting with Drupal. Part of why I liked your presentation so much is now I realize these are issues ­ something that I never saw acknowledged elsewhere ­ so I can hopefully move on and actually start making progress with Drupal without wondering why I wasn't getting it.

I am reminded of Winston Churchill's famous quotation: "It has been said that democracy is the worst form of government except all the others that have been tried." To me, Drupal is "The worst CMS except all the others that have been tried."

Contributed modules like Views, CCK, Voting API, etc and themes make Drupal more than a CMS. It is a terrific platform. Much easier/better than rolling your own. Java doesn't have a viable competitor, that I am aware of.

So basically I want to know how can I learn more and how can I help? I want to make Drupal sites, because of its huge wealth of contributed modules and the ability to extend the system. I am extremely frustrated by many parts of Drupal. I basically agreed with everything you put in your series of blog posts about DX.

What, as a noob, can I do?

I would like to contribute some how.

Thanks,
Nick

Nov 05 2008
Nov 05

This is part four of my series, The DX Files: Improving Drupal Developer Experience. I started this series with fairly simplistic suggestions. They proved not very popular and some of them I agree were of questionable benefit due to PHP’s nature. I was pleased to discover, however, that they nevertheless had quite an impact on raising the visibility of “Developer Experience” within the Drupal community. I am therefore ready to move on to some of the more complex DX issues in Drupal.

The most important DX change Drupal needs to make is switching from a form-driven model to an API-driven model. There are many parts to such a change. Today’s topic: static caching.

A simple example

Much static caching in Drupal core and contrib is based on the assumption of a “normal” page request in which only a fairly controlled set of operations can occur during a single execution. Consider this fairly typical pattern (from D7 HEAD as of today):

<?php
function taxonomy_get_term_data($tid, $reset = FALSE) {
  static
$terms = array();
  if (!isset(
$terms[$tid]) || $reset) {
   
$terms[$tid] = db_query('SELECT * FROM {term_data} WHERE tid = :tid', array(':tid' => $tid))->fetchObject();
  }
  return
$terms[$tid];
}
?>

This works great so long as no term ever changes during the course of execution. As soon as some code calls other taxonomy “API” functions, the interface can fail. Suppose we have a term whose tid is SOME_TID and whose name is “my name”. Consider:

<?php
 
// load SOME_TID and print its name
 
$term = taxonomy_get_term_data(SOME_TID);
  print
$term->name . "\n"; // change the name and save the term
 
$term->name = 'some new name';
 
taxonomy_save_term($term); // load the term again and print the new (?) name
 
$term = taxonomy_get_term_data(SOME_TID);
  print
$term->name . "\n";
?>

This program outputs “my name” twice, not “my name” and then “some new name.” Bottom line: The taxonomy_save_term() has no effect during a single page request!

It is horrible DX to have “API” functions that behave this way. Sadly, not just the taxonomy module has this problem; the pattern is widespread within Drupal.

The $reset non-solution

The typical solution within Drupal is an optional “clear the cache” argument like $reset for taxonomy_get_term_data(). When $reset is TRUE, the static cache is erased. This would be fine if it were always used by all the taxonomy_*() functions to maintain the API’s specification (in that case, the “taxonomy class” would simply be doing internal housekeeping on “private member variables”), but it isn’t. Callers using the taxonomy “API” need to know to pass the $reset argument themselves.

The problem gets even worse when you consider all the myriad contrib modules that use the same technique. It seems everybody has their own static cache and you need to call the right magic “clear the cache” function to get anything done. The API docs rarely spell this out and it is a PITA.

A real-world example of where this can bite badly: Try writing hook_update_N() functions that create or change CCK types, Views, taxonomy terms, etc. Disaster! Why should I have to figure out to call views_invalidate_cache(), content_clear_type_cache(), etc. etc.?

Static caching-induced bugs

The technique can also lead to bugs/security vulnerabilities when the static cache, conceived to operate in a normal page request, is used in a longer-running program.

Case in point: drupal_validate_form(). Here’s an excerpt:

<?php
function drupal_validate_form($form_id, $form, &$form_state) {
  static
$validated_forms = array();

  if (isset(

$validated_forms[$form_id])) {
    return;
  }
_form_validate($form, $form_id);
 
$validated_forms[$form_id] = TRUE;
}
?>

drupal_validate_form() caches the fact that a form successfully validated. This means that you cannot call the same form twice via drupal_execute() in one execution safely. The second and all successive calls will not call the form’s validate function. Instead, they will always invoke the submit handler.

Practice Statics

Sometimes the operation being cached is not even that expensive or called that often. It’s just being cached “for practice.” Here’s an excerpt from node.module:

<?php
function _node_revision_access($node, $op = 'view') {
  static
$access = array();
  if (!isset(
$access[$node->vid])) {
   
// compute and cache the access for $node and $op here
 
}
  return
$access[$node->vid];
}
?>

This is another example of static caching introducing a potential security hole. The $access static cache is based on $node->vid but not on $op. Suppose an attacker has permission to view a node but not delete it. If the attacker can cause _node_revision_access($node, 'view') to be invoked before _node_revision_access($node, 'delete), he can perform an unauthorized action.

This would be an important security hole in Drupal except it turns out that _node_revision_access() is never called more than once in a single page request. But… wait a minute! If the function is only ever called once per page request, why does it use a static cache at all? It shouldn’t.

This is just ridiculous. The code is unnecessarily complex in order to optimize a code path whose performance was obviously never measured and as a result introduces what would be a security hole but only if the static cache was ever actually used!

Solving the problem

So, what should we do? My suggestions:

  1. Eliminate unnecessary static caching completely. Unless and until you have empirical data proving that a particular static cache makes an important improvement in real-world usage scenarios, just don’t use one. If you already have a static cache without such data, remove it. The drupal_validate_form() and _node_revision_access() examples fall into this category.

  2. Maintain API integrity. If you must implement a static cache inside an API function, fine, but make sure that the API maintains its documented behavior within a page request. This means you have to make sure to update every “read” static cache in your code every time a “write” function is performed. Do NOT leave this to your callers. In the taxnomony example above, taxonomy_save_term() should invalidate or update the taxonomy_get_term_data() cache entry for the affected term.

http://drupal.org/node/254491 proposes a centralized static cache API for Drupal. I haven’t reviewed it recently, but it will probably help. However, even if it works really well, remember that not doing static caching at all will lead to simpler, less error-prone code. If you do not have a solid reason for using a static cache, just skip it until you do.

Aug 12 2008
Aug 12

This is part three of my series, The DX Files: Improving Drupal Developer Experience. This time, I’m suggesting changing some of Drupal’s most basic data structures and APIs by replacing anonymous arrays with well-defined data structures. I fully expect lots of disagreement.

Many of Drupal’s APIs (Form API, Schema API, etc.) use PHP arrays to represent complex structured data. For example, here is a Form API data structure:

<?php
$form
['author'] = array(
 
'#type' => 'fieldset',
 
'#access' => user_access('administer nodes'),
 
'#title' => t('Authoring information'),
 
'#collapsible' => TRUE,
 
'#collapsed' => TRUE,
 
'#weight' => 20,
);
$form['author']['name'] = array(
 
'#type' => 'textfield',
 
'#title' => t('Authored by'),
 
'#maxlength' => 60,
 
'#autocomplete_path' => 'user/autocomplete',
 
'#default_value' => $node->name ? $node->name : '',
 
'#weight' => -1,
 
'#description' => t('Leave blank for anonymous.');
);
?>

Some of the downsides to this representation include:

  • Developer IDEs cannot provide auto-completion or a similar form of assistance while the code is being written.
  • Invalid form properties (those that begin with “#”) cannot be identified at compile- or run-time.
  • It is awkward to associate default values or other automatic behaviors with array structures.
  • Functions that operate on specific kinds of form elements, such as textfield_validate(), are not assured they are being passed an “array of the right type.”

An alternative representation uses typed data structures, specifically a PHP class but without any methods (basically, what C calls a struct). For example:

<?php
$form
= new Form();$form->elements['author'] = $author = new FieldsetElement();
$author->access = user_access('administer nodes');
$author->title = t('Authoring information'),
$author->collapsible = TRUE;
$author->collapsed = TRUE;
$author->weight = 20;$author->elements['name'] = $name = new TextfieldElement();
$name->title = t('Authored by');
$name->maxlength = 60;
$name->autocomplete_path = 'user/autocomplete';
$name->default_value = $node->name ? $node->name : '';
$name->weight = -1;
$name->description = t('Leave blank for anonymous.');
?>

The second version of the code is almost identical to the first except for the change in representation; it is no harder to write and the conversion can even mostly be handled by search-and-replace. However, the object representation addresses (or can address) all the problems with the array representation and provide a variety of other benefits.

As an example, class representation solves the problem of functions not being sure about the kind of data they are passed. In the file that defines text field elements, we might have code like:

<?php
class TextfieldElement extends FormElement {
 
// Maximum length; NULL means no limit;
 
public $maxlength = NULL; // AJAX path for auto-completition; NULL means no auto-complete.
 
public $autocomplete_path = NULL;
}

function

textfield_validate(TextfieldElement $element) {
 
// $element is guaranteed to be a TextfieldElement
}
?>

Now, textfield_validate() is guaranteed to be passed a TextfieldElement.

Note: Yes, we could also change functions like textfield_validate() into methods of class TextfieldElement. Gotta start somewhere, though. Baby steps!

Jul 16 2008
Jul 16

With apologies to Gloria Jones and a variety of others...

Sometimes I feel there has to be a way
To improve securi-tay
To automatically prevent attacks
The bugs we fix seem not to help one bit
To make the exploit-tays
Not come back. They should stay away!
Oh! Tainted bugs!

As part of Acquia's security testing for Acquia Drupal, I've been experimenting with automated methods for detecting security vulnerabilities in Drupal and contributed modules. The time has come to report on my progress. If you want to learn more about this and are going to DrupalCon Hungary 2008, vote for my session proposal.

Data tainting is a kind of dynamic program analysis (it operates while the program is running) that can automatically detect one of the most frequent sources of security vulnerabilities: insufficiently validated user input. The idea behind data tainting is that when data is received from an untrusted source (such as GET request parameters or some parts of the database), it is marked as "tainted." Any data derived from tainted data (such as by string concatenation, passing function arguments, etc.) is also marked tainted. When tainted data is passed to a security-sensitive destination (such as the HTML response to a page request), a taint error is raised. Finally, when tainted data is validated in specific ways, the taint mark is removed so the data can be used freely.

What I am calling "Taint Drupal" is based on Taint PHP work by Wietse Venema along with some Drupal-specific customization particularly regarding the database. For more details, keep reading.

Tainting memory

Here is a simple example:

<?php
function message($arg) {
  return
'Your name is ' . $arg;
}
$name = $_GET['name'];
print
message($name);
?>

All request parameters in $_GET come directly from the user and can contain anything; therefore, they are "born tainted." $_GET['name'] is assigned to $name so, along with the actual value, $_GET['name']'s "taintedness" is also assigned to $name. When $name is passed to the function message(), the $arg argument also inherits the taint value. In the return statement, the string literal 'Your name is ' is not tainted because it comes from a trusted source (the program source code itself); however, when the non-tainted string literal is concatenated with the tainted $arg, the resulting string is tainted by $arg's, so the function result is also tainted. Finally, the print() statement receives this tainted value and, because it is programmed to, generates a taint error:

print(): Argument contains data that is not converted with htmlspecialchars() or htmlentities() in yourscript.php on line 6

As the error message indicates, the problem is that we did not validate $_GET['name'] with htmlspecialchars(). If we change one line of the script to

<?php
$name
= htmlspecialchars($_GET['name']);
?>

the taint flag is removed from $name and print() no longer generates a warning.

Tainting database reads

This works great for data in memory during a single page request, but what about the database? Consider a simplified version of Drupal's node table, the basis of all locally stored content:

CREATE TABLE node (
   id integer auto_increment,
   uid integer,
   title varchar(255),
   body text
);

When a user submits a node, Drupal stores the exact text provided for the title and body; it performs validation during output (there is a very good reason for this approach). Therefore, all data read from the title and body fields should be "born tainted" just like $_GET because the user has complete control over it. The the title or body are output before being run through htmlspecialchars() or some other similar validator, a taint error should result. The uid field, however, is the internally-generated user number for the author of the node. The user has no control over it. If the user is output in HTML (perhaps as part of a CSS style to allow per-user styling), no taint error should occur.

In Taint Drupal, I mark the title and body fields in the database schema as tainted (see, I told you it would come in handy! :-)). Whenever a SELECT query reads in these columns, my custom database driver marks them as tainted, just as if they had come directly from a user (which, ultimately, they did). Since the uid column is not marked tainted, the database driver leaves it alone.

Taint-checking database writes

Alert readers may have noticed I wrote "the user has no control over" the uid value. Well, what if a bug gives the user control? Taint Drupal prevents this by performing the reverse operation of SELECT-tainting: INSERT/UPDATE-taint-checking. Whenever it writes data to a column that is not marked tainted, it verifies that data is not tainted and logs an error if it is.

Demonstration

Here's a perfect example of Taint Drupal in action. Drupal 6.0 contained this code:

<?php
function node_page_edit($node) {
 
drupal_set_title($node->title);
  return
drupal_get_form($node->type . '_node_form', $node);
}
?>

In Drupal 6, the drupal_set_title() function required that the caller validate the title value, typically with check_plain(), so this code represented a XSS vulnerability. A user could set a node's title to something like "<script>alert('XSS!')</script>" and, when some other user or administrator visited the node's Edit page, the script could execute. (This bug was fixed in Drupal 6.1.)

With Taint Drupal, visiting the Edit page for *any* node while this bug exists generates warnings:

  • warning: print() [function.print]: Argument contains data that is not converted with htmlspecialchars() or htmlentities() in .../themes/garland/page.tpl.php on line 7.
  • warning: print() [function.print]: Argument contains data that is not converted with htmlspecialchars() or htmlentities() in .../themes/garland/page.tpl.php on line 70.

page.tpl.php lines 7 and 70 are the two places where the node title is displayed in HTML. So, instead of waiting for someone to discover the vulnerability, Taint Drupal would have pointed it out the moment it was created.

Current status

Taint Drupal is still very much experimental. It depends on patches to Taint PHP, a custom-built version of the PHP interpreter, minimal Drupal core patches, and a custom module. It is also far from complete. The Taint PHP patches are excellent but incomplete; for example, I discovered that the internal PHP function str_replace() was not taint-enabled and as a result all $_POST data processed by the Form API was being de-tainted automatically! (I fixed that.) Not all of Drupal's core tables are yet taint-marked correctly. I am handling SELECT and INSERT queries, but not UPDATE or DELETE queries. There is a lot left to do.

If you'd like more details about this work, vote for my session proposal at DrupalCon 2008.

Jul 02 2008
Jul 02

This is part two of my series, The DX Files: Improving Drupal Developer Experience.

Many Drupal APIs accept a boolean argument (TRUE or FALSE) to determine some behavior. I believe that practice should be banned in all but exceptional cases, instead using a defined constant with a descriptive name.

Here is a perfect example from Drupal core:

<?php
    $output
= node_view($node, FALSE, TRUE);
?>

Now, quick! Who can tell me what passing FALSE as the second argument and TRUE as the third argument means? Unless you are a Drupal guru, you almost certainly have no idea. Compare the previous line of code with this one:

<?php
   $output
= node_view($node, NODE_FULL_VIEW, NODE_IS_PAGE_CONTENT);
?>

Now it is much more clear what is going on. We're displaying the full view of a node and it is the page's primary content. Perhaps you do not know exactly what those constants mean ("A 'full view' as compared to what?", you might ask) but you are certainly better off than if you just see "TRUE" or "FALSE". If you later encountered

<?php
   $output
= node_view($node, NODE_TEASER_VIEW, NODE_IS_NOT_PAGE_CONTENT);
?>

everything would become pretty clear(*).

The end result of this change would be an API that is easier to learn and code that is easier to read, understand, and maintain.

* Disclaimer: I'll grant that it is far from obvious exactly what NODE_IS_PAGE_CONTENT actually causes to happen and in fact in a recent query on #drupal, several senior Drupal developers (myself included) couldn't remember, but it is still way better than "TRUE". This is an unrelated issue with node_view() that ought to be improved.

About Drupal Sun

Drupal Sun is an Evolving Web project. It allows you to:

  • Do full-text search on all the articles in Drupal Planet (thanks to Apache Solr)
  • Facet based on tags, author, or feed
  • Flip through articles quickly (with j/k or arrow keys) to find what you're interested in
  • View the entire article text inline, or in the context of the site where it was created

See the blog post at Evolving Web

Evolving Web