Solving Problems Through Collaboration

by Larry Garfield

For the last few weeks, we've been talking about various elements of the recent collaborative calendaring project that we've been working on with the Balboa Park Online Collaborative to help enable complex calendar and date handling in Drupal.  If you're interested in hearing more, come to the Museum Computer Network conference on October 29th to see the session our own Matt Farina will be presenting with the team from BPOC on The Collaborative Calendaring Project. Matt will be providing a full rundown on one of Palantir's most complex projects to date.

One important aspect of the project was data syndication. Although each museum within Balboa Park is able to create its own events independently, the main Balboa Park site needed to have a rollup of all (or most) events happening in the park. That allows visitors to see at-a-glance what is happening in the park when they will be there, across all museums. It also provides a central repository to then re-syndicate that information out to newspapers, tour groups, mobile apps, or anything else that one can connect a Drupal site to. (The list is not short.)

Syndication systems are not new, nor are they new to Drupal. However, none of the usual suspects could handle the setup we had.  This meant that we needed to come up with a new approach for pushing complete nodes in bulk between Drupal sites in a non-Drupal-specific manner on an inconsistent schedule. Fortunately, that happened to be right before DrupalCon San Francisco, and as often happens the community provided.

Existing options

Before beginning development, we looked at a number of existing Drupal options and weighed them against our requirements:

  • The ever-popular RSS format has solid Drupal support, but it doesn't readily support more than a "blob of text". We needed to maintain the full node structure with separate CCK fields, taxonomy, and so forth.
  • Drupal's Deploy module does full node replication, but it routes data through serialized PHP, drupal_execute(), and the Form API in order to handle modules that incorrectly assume that all node data is coming in through a form and then don't use the node API properly. (Please, don't let this happen to you.) It also is very strongly targeted toward site administrators, not content editors, and for one sender/one receiver use cases. We had to handle over a dozen senders to a single receiver. After discussion with Deploy maintainer Greg Dunlap, we decided it wasn't a good solution for this case.
  • New events, when posted, come in large spurts rather than a steady stream. That makes usual pull-based approaches inefficient at best, and likely to miss events at worst.
  • We needed an approach that could, in the future, be extensible beyond a Drupal site on the receiving end.

Collaboration is your friend

In San Francisco we had a number of conversations with other Drupalers who happened to be working on very similar problems. In particular, Kathleen Murtagh of GoingOn had a similar syndication problem for a Drupal application they were developing, although in their case they needed to go from a single sender to multiple recipients. Fortunately, Feeds maintainer Alex Barth had recently started working on Feeds integration with the PubSubHubbub (PuSH) protocol and was presenting on the same topic.

PuSH, in short, allows receiving sites to subscribe to a sending site, optionally using an intermediary "Hub" server. Then when new content is available the sender notifies the Hub, which in turn sends the content to all registered subscribers. PuSH itself is an extension to the Atom format, which, like RSS, is XML-based but is considerably more flexible. Alex had just recently added PuSH subscriber support to the Feeds module, and written a tentative PuSH hub/publisher bridge module called push_hub that wasn't really released yet. That's problem one solved.
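To make the subscribe/notify/push flow concrete, here is a minimal in-memory sketch in Python. It is an illustration only: a real hub speaks HTTP and verifies subscriptions, and the Hub class, topic URL, and subscriber names below are hypothetical, not part of push_hub or Feeds.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A subscriber callback receives the topic URL and the pushed payload.
Callback = Callable[[str, str], None]


class Hub:
    """Toy in-memory stand-in for a PuSH hub (no HTTP, no verification)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        # A receiving site registers interest in a sender's feed (the topic).
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        # The sender notifies the hub, which fans the content out to
        # every registered subscriber. Returns the number of deliveries.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


received = []
hub = Hub()
topic = "http://museum.example/events.atom"  # hypothetical feed URL

hub.subscribe(topic, lambda t, p: received.append(("park-site", p)))
hub.subscribe(topic, lambda t, p: received.append(("mobile-app", p)))

delivered = hub.publish(topic, "<entry>New exhibit opening</entry>")
```

The point of the design is that the sender makes one call and does not need to know who is listening, and subscribers never have to poll.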

The other question then was how to represent the nodes we were PuSHing across the wire. It needed to be some XML-based format in order to fit into Atom, but there was no clear XML format for representing nodes. After some discussion with Kathleen, Alex, and RDF guru Stephane Corlosquet we settled on an RDF-wrapped custom format designed to handle a standard Drupal entity model. Although entirely a custom format for now, the idea was that we could, in time, replace portions of it with more standard RDF vocabularies. That made it more practical for our use, even though, for now, it didn't really leverage any RDF features.

The plan

The end result had a number of moving parts, but they all fit together beautifully. On the sending side is the new Views Atom module, developed mostly by Palantir. It consists of a Views style plugin that generates a PubSubHubbub-compatible Atom feed and a Views row plugin that generates a generic XML representation of entities, specifically nodes, including CCK fields, taxonomy, and basic properties. It is paired with the push_hub module, which handles subscriptions and pushes new content to subscribers using the Queue module, a backport of Drupal 7's new integrated queuing system. Solid Rules integration binds the two together: when new subscription-worthy content is created, a Rule fires that programmatically renders the View to generate the appropriate Atom feed and hands it off to push_hub, which then uses the queue to send those nodes to one or a thousand subscribers.
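As a rough illustration of what the sending side produces, the Python sketch below wraps a node-like structure in an Atom entry and advertises a hub with a rel="hub" link. The payload element names (node, field, term), the NODE_NS namespace, and all URLs are invented for this example; they are not the format that Views Atom actually emits, which is documented in the views_atom module.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
NODE_NS = "http://example.org/ns/node"  # hypothetical payload namespace
ET.register_namespace("", ATOM_NS)
ET.register_namespace("dn", NODE_NS)


def atom(tag: str) -> str:
    """Qualify a tag name with the Atom namespace."""
    return f"{{{ATOM_NS}}}{tag}"


def node_to_entry(node: dict) -> ET.Element:
    """Build one Atom entry whose content carries the structured node."""
    entry = ET.Element(atom("entry"))
    ET.SubElement(entry, atom("id")).text = node["guid"]
    ET.SubElement(entry, atom("title")).text = node["title"]
    ET.SubElement(entry, atom("updated")).text = node["updated"]
    content = ET.SubElement(entry, atom("content"), {"type": "application/xml"})
    # Hypothetical payload: each field and term stays a separate element
    # instead of being flattened into a blob of text.
    node_el = ET.SubElement(content, f"{{{NODE_NS}}}node", {"type": node["type"]})
    for name, value in node["fields"].items():
        ET.SubElement(node_el, f"{{{NODE_NS}}}field", {"name": name}).text = str(value)
    for term in node["terms"]:
        ET.SubElement(node_el, f"{{{NODE_NS}}}term").text = term
    return entry


def build_feed(feed_id: str, title: str, updated: str, hub_url: str, nodes: list) -> str:
    """Serialize a feed that points subscribers at a hub."""
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("id")).text = feed_id
    ET.SubElement(feed, atom("title")).text = title
    ET.SubElement(feed, atom("updated")).text = updated
    ET.SubElement(feed, atom("link"), {"rel": "hub", "href": hub_url})
    for node in nodes:
        feed.append(node_to_entry(node))
    return ET.tostring(feed, encoding="unicode")


sample_node = {
    "guid": "http://museum.example/node/42",
    "title": "Evening Lecture",
    "updated": "2010-10-01T18:00:00Z",
    "type": "event",
    "fields": {"field_location": "Main Hall", "field_price": 5},
    "terms": ["Lectures", "Adults"],
}
feed_xml = build_feed(
    "http://museum.example/events.atom",
    "Museum events",
    "2010-10-01T18:00:00Z",
    "http://museum.example/hub",
    [sample_node],
)
```

In this sketch the feed builder plays the role of the style plugin and node_to_entry plays the role of the row plugin, which mirrors the split described above.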

On the receiving side is the new Feeds Atom module, on which GoingOn took point. It processes an incoming Atom/RDFNode feed (via PuSH or otherwise) and maps each entry directly onto a new or updated node. Fields present on the receiving node type get populated by the matching field on the sending node type, and fields we don't care about get ignored. As long as the fields match up, it Just Works(tm).
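The field-matching rule can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the Feeds Atom code; the function name and the field names are hypothetical.

```python
from typing import Any, Dict, Set


def map_incoming_fields(incoming: Dict[str, Any], receiving_fields: Set[str]) -> Dict[str, Any]:
    """Keep only the incoming fields that exist on the receiving node type."""
    return {name: value for name, value in incoming.items() if name in receiving_fields}


# Fields sent by a museum site (hypothetical names).
incoming_event = {
    "title": "Evening Lecture",
    "field_location": "Main Hall",
    "field_internal_notes": "Needs projector",
}

# Fields defined on the receiving site's event type (hypothetical names).
park_event_fields = {"title", "field_location", "field_price"}

mapped = map_incoming_fields(incoming_event, park_event_fields)
```

Here field_internal_notes is dropped because the receiver does not define it, and field_price is simply left unset because the sender did not provide it.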

Hooks on both sides allow other modules to add data and processing to the feed in a clean fashion. We were even able to build in support for syndicating attached files with duplicate detection so that the same file isn't syndicated multiple times, thanks to Alex's help. In fact, even though it wasn't his client, Alex was an enormous help throughout the project, answering questions and offering suggestions on how to get the most out of the Atom and PuSH formats. We were also able to fix a number of bugs in untested parts of push_hub, as well as in a few other modules, since we were pushing them past their normal comfort zones.

Community FTW

Both Palantir and GoingOn were able to complete their respective projects with more brainpower and less work than they would have been able to on their own. We have them configured in totally different ways but using the same code base, which is an indication that we did something right when architecting the solution. Alex got useful stress-testing and bug fixing for push_hub, which has since been released on Drupal.org as well.

And the community also got the beginnings of a generic, rich XML format for any Drupal entity. It's still in the early stages, but there are an enormous number of potential uses for a standardized round-trippable XML format for Drupal entities. This is just the beginning.

That is how Open Source is supposed to work.

The future

Of course, this is all first-release design. While flexible on the whole, it certainly could go further. Looking forward to Drupal 7, what we'd like to see happen is:

  • Factor the Entities-as-XML code and hooks out to a separate module that can handle both serializing to XML and turning the XML back into a node. That's a simple, single task that should be its own stand-alone system that any module can leverage, and more people can contribute to. Integration with the Entity Metadata module is also a possibility. That may also then include the Views row plugin.
  • Fold the Atom logic from both Views Atom and Feeds Atom into the Atom module. Right now that's a fairly simple module, but it offers a lot of potential if we can expand it to be a central focal point of Atom-goodness (with PubSubHubbub support).
  • Expand the Feeds module to have a separate plugin for processing each row in an incoming feed rather than just one plugin for the entire feed. That gives it more parallelism with Views, and would also allow for complete separation between the Atom and Entities-as-XML plugins that are currently integrated in Feeds Atom.

Three cheers for collaboration: now let's see how far we can take it!

Comments

That's certainly the hope. It's a first step only, but still a good step. Now we just need to build momentum around it to push it further. (Anyone want to fund the Drupal 7 version? :-) )

Did you look at the content distributor module? We're using it successfully to syndicate nodes. It uses XMLRPC instead, and the various Services modules abstract away the details, including pulling in files attached to file and image fields. Since it's done over XMLRPC, it can also be secured by requiring an API Key and a valid user login on the host site.

Why is there no Project page on d.o? Are you planning on contributing it, or keeping it proprietary (which would probably make more business sense)?

Good luck guys, you picked a mega-tough nut to crack.

It's not a single module but a series of modules. The key ones are views_atom and feeds_atom, both of which are on Drupal.org already and linked from the article above. The push_hub module was released a few months ago and is also on Drupal.org.

As a practice, Palantir releases as much of our code as possible and when designing a site we try to architect it to minimize the amount of site-specific code needed. The "business sense" in open source is in establishing your skill and reputation by active involvement in the community, which we fully embrace.

Based on what is described here, it doesn't appear that this approach supports updating nodes. Is that correct and if so, are there plans to support updates?

Regardless, awesome work; very interesting stuff.

Thanks,
Kurt

Actually yes, updates are supported. The views_atom and feeds_atom modules both use Atom's id property to handle uniqueness. We just set up a rule to re-push a node when it's updated and let Feeds take it from there.
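As a rough Python sketch of that idea (not the actual feeds_atom code; the store, function name, and ids are hypothetical), keying imports on the Atom id turns a re-pushed entry into an update rather than a duplicate:

```python
from typing import Dict


def import_entry(store: Dict[str, dict], atom_id: str, values: dict) -> str:
    """Create or update the local record keyed by the entry's Atom id."""
    action = "updated" if atom_id in store else "created"
    store[atom_id] = dict(values)
    return action


nodes_by_atom_id: Dict[str, dict] = {}
entry_id = "http://museum.example/node/42"  # hypothetical Atom id

first = import_entry(nodes_by_atom_id, entry_id, {"title": "Evening Lecture"})
second = import_entry(nodes_by_atom_id, entry_id, {"title": "Evening Lecture (moved)"})
```

After both calls the store still holds a single record for that id, carrying the newer title.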

The blogapi module (core in D6, contrib in D7 I believe) already supports that use case. You don't need this approach for that.

That said, the XML format is documented in the views_atom module. There's a sample file.