Software architecture evolution: Version Control API & Drupal's git migration

Version Control API is central to Drupal's migration from CVS to git. It's also the single thing that's taken up the most time in the work we've done to date, and there's still a fair bit left to do. But we're now at a point where we need to step back and take a high-level look at the direction it'll finally take, so I thought I'd use where we are as an opportunity to explain the goals and architecture of the module, both historically and looking to the future. Apologies in advance for any of the history I get wrong - I'm sure I'll do it, so please feel free to correct me.

In The Beginning

Version Control API was originally written as a 2007 Google Summer of Code project by Jakob Petsovits (aka jpetso). From the outset, VCAPI was intended to replace Project*'s tight coupling with CVS (via the cvslog module) so that Drupal could get off CVS and on to a different version control system. VCAPI tried to build a system & datastructure similar enough to cvslog that moving over wouldn't be too painful, but at the same time was VCS-agnostic. We could decide later which VCS would fill the gap. (Technically, it would even have been possible for different projects to use a different VCS - though we ultimately decided against that because of the added social and technical complexity.)

Given that VCAPI was intended from the beginning to replace cvslog, it's hardly surprising that they both do essentially the same thing: store representations of VCS repository data in Drupal's database, such that that data is readily accessible for direct use by Drupal. They also map Drupal's users to user data in repositories, thereby allowing for the management of repository ACLs directly in Drupal. (cvslog also integrates directly with Project*, while VCAPI opted to separate that into versioncontrol_project). They then provide output that any drupal.org user would be familiar with - the project maintainers block, the commit activity information in users' profiles, the commit stream, etc. Whereas cvslog was only concerned with integrating with CVS, VCAPI attempted to solve these problems (particularly storing repository data) in an abstracted fashion such that the data from any source control system could be adequately represented in a unified set of Drupal database tables. VCAPI would provide the datastructure, helper functions, hooks, etc., and then "backend" modules (such as the git backend) would implement that API in order to provide integration with a particular source control system.

A quick aside - any good engineer will see "storing representations of VCS repository data in Drupal's database" and trip a mental red flag. It's data duplication, which raises potentially knotty synchronization problems. So let me head that one off: extracting the data was especially necessary with CVS, as it was _far_ too slow and unscalable to make system calls directly against the repository in order to fulfill standard browser requests. And while git is MUCH faster than CVS, the data abstraction layer is still necessary. System calls are slow, and there's disk IO to think about; it's worth trying to avoid tripping those during normal web traffic. More importantly, generating an aggregate picture of versioncontrol-related activity within a given Drupal system, particularly one that has a lot of complex vcs/drupal user mapping and/or a lot of repositories, really requires a single, consistent datastore. Stitching together db- and repo-sourced data on the fly gets infeasible very quickly. Finally, putting the data into a database makes it possible for us to punt on caching, since Views/Drupalistas are accustomed to caching database queries/output.

Anyway, with all this in mind, jpetso made a herculean effort in writing the original 1.x branch of VCAPI. He came up with the original abstracted datastructures and general methodologies that allowed us to replicate the functionality of cvslog in an API that could be reimplemented by different VCSes. More about that history can be seen in g.d.o posts. And at its core, the system worked.

Unfortunately, there were also aspects of the system that were awkward and overengineered. Much of the original API was actually just a querybuilder; many of the abstracted concepts had become so abstract as to be unintuitive to new developers (e.g., there were no "branches" or "tags" in VCAPI - just the meta-concept of "labels"). The underlying problem, though, was an architectural predilection towards an 'API' that did backflips to abstract and accommodate all possible backend behaviors, then own all the UIs, rather than providing crucial shared functionality and readily overridable UIs that backends could extend as needed. You can't work with, let alone refactor, VCAPI without running into this last problem. The module was suffering from an identity crisis - is it an API for the backends? Or an API for third-party systems, like say Project*, which want to utilize the repository tracking features of VCAPI? The crisis was also evident in the querybuilder: the same system was used for building aggregate listings as for retrieving individual items, and optimized for neither.

Enter: OO

jpetso needed to start moving on to other things by 2008, and when he offered the project up for maintainership, I volunteered. After the port to Drupal 6, discussions began about how well-suited VCAPI & backends would be to object orientation. In particular, OO could help to make the API less overbearing and release more control into the backends. And for GSoC 2009, marvil07 made exactly that his goal: porting VCAPI over to OO.

Note - there was other work going on throughout this time period by a variety of people, GSoC and otherwise. I do NOT mean to slight any of that work - it's just that those changes were less central to the evolution of the API itself, and therefore tangential to the focus here.

Prior to marvil07's work, VCAPI was an exemplary instance of Drupal's love for massive arrays. They were used to capture all the data being stored in the database, to send instructions to the querybuilders, as return values for all the various informational hooks implemented by backends...and just about everything else. marvil07's refactor revealed some of the real 'things' VCAPI deals with, in the form of discrete classes:

VersioncontrolRepository - Represents a 'repository' somewhere; at the bare minimum, this includes information like VCS backend, path to the repository root, and any additional information specified by the backend.
VersioncontrolItem - Represents a known versioned item - that is, a file or a directory - in a repository.
VersioncontrolBranch - Represents a known branch in a repository.
VersioncontrolTag - Represents a known tag in a repository.
VersioncontrolOperation - Represents, usually, a commit action in a repository. The 'operation' concept is one of the abstractions that can get confusing.

Each of these classes has two responsibilities - CUD (that's CRUD sans-R), and retrieving other related data (e.g., you could call VersioncontrolRepository::getItem() to retrieve a set of VersioncontrolItems, or VersioncontrolRepository::getLabels() to retrieve a set of VersioncontrolBranch or VersioncontrolTag). CUD was fairly well implemented on each of these classes by the time marvil07's original GSoC project was over. Related data retrieval was a bit more limited.

This set of classes also replaced awkward alters with inheritance as the new way for backends to interact with VCAPI: VersioncontrolGitRepository extending VersioncontrolRepository, VersioncontrolGitBranch extending VersioncontrolBranch, etc. Interfaces were also introduced to tell VCAPI that a particular backend's objects supported specific types of operations - generating repository URLs, for example. The crucial contribution of marvil07's GSoC project was developing this family of classes, which has remained largely unaltered. Unfortunately there wasn't really time to get to refactoring the logic, so much was simply cut from old 1.x procedural functions and moved into an analogous class method.
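To make the inheritance-plus-interfaces pattern concrete, here is a rough sketch. It's in Python rather than PHP purely for brevity, and it is not VCAPI's actual code: the class names VersioncontrolRepository and VersioncontrolGitRepository come from the module, but RepositoryUrlHandler, commit_url(), commit_link() and the constructor arguments are all hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class VersioncontrolRepository:
    """Base repository: the data every backend shares."""

    def __init__(self, name: str, root: str, vcs: str = "generic"):
        self.name = name
        self.root = root
        self.vcs = vcs

    def describe(self) -> str:
        # Default behavior that a backend may override.
        return f"{self.vcs} repository '{self.name}' at {self.root}"


class RepositoryUrlHandler(ABC):
    """Hypothetical capability interface: 'this object can build URLs'."""

    @abstractmethod
    def commit_url(self, revision: str) -> str:
        ...


class VersioncontrolGitRepository(VersioncontrolRepository, RepositoryUrlHandler):
    """Git backend: extends the base class and opts in to URL generation."""

    def __init__(self, name: str, root: str, web_base: str):
        super().__init__(name, root, vcs="git")
        self.web_base = web_base  # assumed base URL of a repository browser

    def commit_url(self, revision: str) -> str:
        return f"{self.web_base}/{self.name}/commit/{revision}"


def commit_link(repo: VersioncontrolRepository, revision: str) -> str:
    # The API checks for the capability rather than asking which backend it is.
    if isinstance(repo, RepositoryUrlHandler):
        return repo.commit_url(revision)
    return revision  # Fall back to the bare revision identifier.
```

The point of the interface check is that the API never needs a "which backend is this?" switch: a backend that can't generate URLs simply doesn't implement the interface, and the caller degrades gracefully.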

By the time we had reached the end of GSoC, I'd grown into the opinion that marvil07's work was an excellent first step. We still had largely the same 1.x logic, just moved into an object-oriented environment. API<->backend interaction via inheritance had helped the identity crisis, but not resolved it entirely. There was some more flexibility for the backends to control logic that had once been the sole domain of the API, but we were still swimming upstream - too many disparate hooks, too much logic in VCAPI that the backends couldn't touch. A good foundation, but far from finished.

The Great Git Migration

When the big discussion about switching VCSes happened in February 2010, we were still gradually fleshing out the skeleton that had been introduced during GSoC 2009. During the discussion, the question was quite rightly raised whether we should even bother with VCAPI, or if we should just use something else (or start from scratch), especially given the wide agreement on wanting "deep integration". (On using VCAPI at all, this bit of the thread is particularly enlightening.) I ended up arguing that VCAPI, while by no means perfect, had already done a pretty good job of tackling the not-inconsiderable datastructure and CRUD questions. Those problems would have to be solved anyway, so starting from scratch would have been a waste. Folks ultimately found that to be a convincing argument, and that's been one of the major principles guiding the migration work thus far.

Another guiding principle also emerged from the initial discussions - if we're going to build our own system, it must be developer-friendly & maintainable. For years, the cruft and complexity of Project* has limited contributions to a very small circle of overworked developers; allowing the migration work to produce similarly impenetrable code would be horribly shortsighted. Consequently, the architectural decisions we've made have been as much motivated by the long-term benefits of architecting a tight, intuitive system as the short-term benefits of just finishing the damn migration already. Let's run through some of the big architecture shifts made thus far:

One of the biggest weaknesses in VCAPI 1.x was the querybuilder. It was an awkward custom job that introduced a few thousand lines of code and was quite difficult to extend. So we replaced the whole thing using the DBTNG backport.
In tandem with the conversion to DBTNG, we did a partial backport of D7's entities. All of the classes from marvil07's original OO refactor (VersioncontrolRepository, VersioncontrolItem, etc.) are now instances of VersioncontrolEntity. Their loading is managed by a family of classes descended from VersioncontrolEntityController; all that can be seen in includes/controllers.inc. This is a great conceptual step forward - it makes a TON of sense to treat most of the objects VCAPI handles as entities.
We took another bite out of the identity crisis by definitively separating mass-loading for listings from targeted loading for data manipulation. Mass-listings are Views' responsibility, pure and simple. Only when you're actually _doing_ something with the API will objects get built from the complex Controller loaders.
We introduced a VersioncontrolBackend class, replacing the array returned from hook_versioncontrol_backend(). This class will increasingly replace procedural logic as a unified behavior object governing everything that VCAPI expects a backend to implement. To that end, the backend acts as a factory for turning data loaded by the VersioncontrolEntityController family into instantiated VersioncontrolEntity objects.
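
The controller/backend split in the last two items can be sketched roughly as follows. Again this is Python for brevity, not the module's PHP, and it is only an illustration under assumptions: the class names are VCAPI's, but build_entity(), the classes map, the load() signature and the list-of-dicts "table" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VersioncontrolEntity:
    """Base entity; holds the raw row it was built from."""
    data: dict
    backend: "VersioncontrolBackend"


class VersioncontrolRepository(VersioncontrolEntity):
    pass


class VersioncontrolGitRepository(VersioncontrolRepository):
    pass


class VersioncontrolBackend:
    """Maps entity types to classes; backends override the map."""
    name = "generic"
    classes = {"repo": VersioncontrolRepository}

    def build_entity(self, entity_type: str, row: dict) -> VersioncontrolEntity:
        # Factory step: turn a loaded row into the backend's own class.
        return self.classes[entity_type](data=row, backend=self)


class VersioncontrolGitBackend(VersioncontrolBackend):
    name = "git"
    classes = {"repo": VersioncontrolGitRepository}


class VersioncontrolEntityController:
    """Loads rows from storage, then asks the right backend to build objects."""

    def __init__(self, rows, backends):
        self.rows = rows          # stand-in for a database table
        self.backends = backends  # backend name -> backend object

    def load(self, entity_type, ids):
        entities = []
        for row in self.rows:
            if row["id"] in ids:
                backend = self.backends[row["vcs"]]
                entities.append(backend.build_entity(entity_type, row))
        return entities
```

The controller knows how to find data; the backend knows which class that data should become. Client code asking for repository 1 gets a VersioncontrolGitRepository back without ever naming the git backend.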

In short, we totally rebuilt VCAPI's plumbing, and with quite an eye towards the future - using DBTNG and Entities will make the D7 port very manageable. And now we're in the final phase of work with VCAPI - fleshing out entity methods, tweaking the datastructure, and dealing with the UI. All the stuff motivating me to write this article, as a way to force myself to think through it all properly.

Looking Forward

First, let's do a quick revisit of VCAPI & backends' purpose. These proceed roughly in order from plumbing -> API -> UI.

Maintain a list of repositories known by the system.
Maintain a mapping between Drupal users and the users known to the repositories.
Maintain ACLs pertaining to those users & repositories, and make the data readily accessible to the hook scripts that actually enforce the ACLs.
Track the contents/activity of a repository into an abstracted, cross-vcs format.
Link repository activity with users.
Provide sane default behaviors that can then be easily adapted to a specific VCS' requirements by the backend module.
Provide a sane API to third-party (non-backend) client code for using or extending VCAPI's data.
Provide overridable & retool-able UIs for administrative functionality.
Provide portable, overridable & retool-able UI elements for listing & statistical information, like commit activity streams.

Now, let's run through that list to see how 1.x stacks up:

Maintain repository list - check, but CRUD is awkward.
User mapping - check, but CRUD is awkward.
ACLs - check.
Repository content tracking - check, but confusing & awkward through over-abstraction.
Repo content<->user link - check.
Sane defaults + backend overridability - nope. 1.x worked mostly by overstuffing logic into the API, and allowed backends to interact by flipping toggles. The rest was done with confusing hooks.
Third-party utility - nope. Third-party code just has the same set of confusing hooks, and not a lot of helpful API.
Admin UI - sorta. Static UI, even hard-coding some assumptions about data sources (e.g., repository "authorization methods"), but with some control afforded to the backends.
Portable UI elements - sorta. Blocks were used, but because there was no Views 2 when 1.x was written, there's just those hardcoded blocks. Moving to Views makes creating portable UI elements FAR easier.

Many of the problems in 1.x are helped, or even solved, by the architectural improvements I've been talking about throughout the article. Now let's break out our current work, the 2.x branch, into the same bullets. And forgive me, but I'm going to break narrative here and mention some details that I haven't previously explained. This IS supposed to be a list to help us actually finish up the work, after all :)

Maintain repository list - check. VersioncontrolRepository(Controller) has probably gotten more love than any other class. One major addition would be support for incorporating backend-specific repo interaction classes, along the lines of svnlib or glip. That would make VCAPI into an excellent platform for doing repository interactions that are way outside the original scope; just load up the repository object from VCAPI, then go to town.
User mapping - unfinished - VersioncontrolAccount is one of the classes that has barely been touched thus far.
ACLs - unchanged since 1.x, and in need of revisiting in light of all the other changes; best addressed at the same time we're revisiting VersioncontrolAccount.
Repository content tracking - almost there. We're going to undo a conflation made in 1.x; see these two issues. VersioncontrolOperation will go away in favor of VersioncontrolCommit, and we'll introduce a separate system for tracking activity (i.e., network operations) that is clearly separated from tracking repository contents.
Repo content<->user link - check. Despite the need for cleanup on VersioncontrolAccount, I believe this linkage is 100%.
Sane defaults + backend overridability - check, thanks to the move to good OO patterns.
Third-party utility - getting there. The advent of the OO API makes navigating VCAPI's internal datastructures much easier, but we still need to think about where & how we allow for alteration. Y'know, where we put our alter hooks.
Admin UI - not yet. We've backtracked from 1.x a bit, taking out some of the more hardcoded UI elements, and we're fixing to replace them with more flexible pieces. For the most part, that means building lots of Views, e.g., this issue. As with everything else in VCAPI, some of the difficulty comes in offering a dual-level API - one to the backends, the other to third parties.
Portable UI elements - zero. We're not going to provide a single block via hook_block() if we can at all avoid it. Views-driven all the way. Complicated, though, because the 'dual-level API' problems mentioned under Admin UI very much apply.

What's now emerging in 2.x is a layered, intelligible API that is thoroughly backend-manipulable, while still presenting third-party code with a consistent, usable interface. And with a repo interaction wrapper like I described above, VCAPI would be a launching point for the "deep integration" we all want. We're not there yet, but we're getting close. There's a general, central push to get a LOT more test coverage (especially testing sample data & standard use cases), without which we'll just never _really_ be sure how well the monstrosity works. There are still some crufty areas - "source item" tracking, "authorization method" for repository account creation - that we need to decide whether we discard, leave, or improve. And we need to come up with a consistent pattern for implementing dual-level Views: every backend needs to be able to generate a list of repository committers or an activity stream, for example, but each backend may be a bit different. So VCAPI provides a sane default, which can then be optionally replaced by a backend-'decorated' version.
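
One possible shape for that default-plus-decoration pattern, sketched in Python for brevity. This is not a settled design, and it leaves Views itself out of the picture entirely: the listing functions, the view_overrides registry, get_view() and the commit dictionaries are all hypothetical.

```python
def default_committer_list(commits):
    """VCAPI-level default: unique committers, ordered by commit count."""
    counts = {}
    for commit in commits:
        counts[commit["author"]] = counts.get(commit["author"], 0) + 1
    # Most active first; ties broken alphabetically for stable output.
    return sorted(counts, key=lambda author: (-counts[author], author))


class VersioncontrolBackend:
    # Hypothetical registry: listing name -> callable producing that listing.
    view_overrides = {}

    def get_view(self, name, default):
        return self.view_overrides.get(name, default)


def git_committer_list(commits):
    """Git-flavored 'decoration': reuse the default, then mark merge authors."""
    mergers = {c["author"] for c in commits if c.get("is_merge")}
    return [f"{a} (merges)" if a in mergers else a
            for a in default_committer_list(commits)]


class VersioncontrolGitBackend(VersioncontrolBackend):
    view_overrides = {"committers": git_committer_list}


def committers(backend, commits):
    # Third-party code calls this one entry point, whatever the backend.
    view = backend.get_view("committers", default_committer_list)
    return view(commits)
```

Third-party code only ever calls committers(); whether it gets the sane default or a backend-decorated version is the backend's business.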

I'm hoping this article helps put the VCAPI & family segment of Drupal's git migration in perspective. With any luck, it also gives enough of a sense of the problems we're grappling with that more folks might want to hop in and help us move everything along. Input on these plans is MORE than welcome.

Author:

Sam

Original Post:

http://blog.samboyer.org/blog/software-architecture-evolution-version-control-api-drupals-git-migration
