Structured Content the Right Way

Author

Joe Kepley

Blobs. Chunks. WYSIWYG vs. structured content. There are two ends to the spectrum, and we discuss how to land somewhere in the middle, providing a quality editing experience while still keeping the constructs of structured content front and center.

“Well maybe if you put it where it goes, you’d be able to find it.”

I have kids. And I may as well have this phrase tattooed on my forehead. My kids take the shortest route to cleaning up by throwing things in the nearest container. It’s easy, but as a result my house seems to have an endless series of bins, cabinets, and shelves that all contain random items. When there’s no plan for where anything goes, it becomes impossible to find anything.

Partly as a reaction to this, when it came time to organize the garage, I went all-out. I hung pegboard everywhere, and made sure each tool had its own hook. I got a set of containers with little drawers and sorted all the screws by type, size, and use.

Unfortunately, now we have the opposite problem: since each item only has one place, it takes more time to make sure everything gets back to the right spot. Worse, if a new item doesn’t fit one of the existing categories, it requires a whole new drawer. The system can’t be maintained because it’s inflexible.

This is the spectrum of structure. Not enough leads to chaos. Too much defies maintenance.

Order vs chaos in your CMS.

The same situation is true in a CMS. The easiest thing is to give editors free rein. Let them simply edit the content in-place on every page with no restrictions. This big blob of text is what we refer to as ‘unstructured content.’ But then the CMS has no structure to draw from, which makes it difficult to re-use the content.

On the opposite end of the spectrum is structuring every element of a page – every heading title, every paragraph, every image. While structurally solid and recognizable to the CMS, the editing environment becomes an endless series of form fields, most of which are irrelevant for the task at hand. And if something new pops up, we’re going to need to redesign the system.

The idea of structured content is that we’re providing the CMS with the right amount of information about our content to do its job. By modeling different attributes, choosing the right set of content tags, and building our content for re-use, we can break our content into the right chunks to make it easier to find and use while making the editor’s job easier at the same time.

Karen McGrane describes the relationship between structured and unstructured content very succinctly:

So I really believe, guys, that we are in a war of Blobs versus Chunks. We are in a war between giant, unstructured blobs of content, and clean, well-structured fields of content that have metadata attached. We are in a war of Blobs versus Chunks. You all are on Team Chunk. We cannot let the blobs win.

There’s a general movement against WYSIWYG systems among content strategists for this reason. (We’re using the term WYSIWYG here, which stands for ‘What You See Is What You Get’. This is a misnomer on the modern web, since the same page can appear a thousand different ways on a thousand different devices. What we’re really talking about is Microsoft Word-like rich-text editing of the sort enabled by TinyMCE or CKEditor.) By treating the editing interface like Microsoft Word and allowing editors to post anything, we open up an avenue for wasted document detail. We write brilliant copy that the CMS can’t read.

DailyPlanetArticle.final.v7.FINAL.docx

Consider this example: suppose we sit a reporter down in front of a blank unstructured document and have them start writing an article.

Martians attack Metropolis, Residents in Panic
by Clark Kent, Daily Planet

METROPOLIS – Residents fled in terror today as a strange craft crashed in the financial district just after 2:00PM

That’s all well and good, and easy to do, but now how can we programmatically read who the author is? Or the dateline? Can we always assume the title is the first line? Or is it always bolded? We can’t rely on any of this because it’s up to the editor to always enter things the same way, and they’ll probably forget or vary things a bit. In short, if we make the editing interface just like a word processing program, editors are going to use it just like a word processing program.

So instead of smart, CMS-readable content – a story that breaks out important fields like title, author, dateline, and location – we get a big, unstructured blob of content.
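To make the difference concrete, here is a minimal Python sketch. The `Article` class, its field names, and the `guess_author` heuristic are all hypothetical and not taken from any particular CMS. With a blob, code has to guess where the byline is; with fields, it just reads it.

```python
from dataclasses import dataclass
from typing import Optional

# The reporter's document as one unstructured blob of text.
BLOB = (
    "Martians attack Metropolis, Residents in Panic\n"
    "by Clark Kent, Daily Planet\n"
    "\n"
    "METROPOLIS – Residents fled in terror today as a strange craft "
    "crashed in the financial district just after 2:00PM"
)


def guess_author(blob: str) -> Optional[str]:
    """Fragile heuristic: assumes the second line always starts with 'by '."""
    lines = blob.splitlines()
    if len(lines) > 1 and lines[1].lower().startswith("by "):
        return lines[1][3:].split(",")[0].strip()
    return None  # the editor formatted the byline some other way


@dataclass
class Article:
    """Hypothetical structured model: one field per piece of information."""
    title: str
    author: str
    publication: str
    dateline: str
    body: str


article = Article(
    title="Martians attack Metropolis, Residents in Panic",
    author="Clark Kent",
    publication="Daily Planet",
    dateline="METROPOLIS",
    body="Residents fled in terror today as a strange craft crashed "
         "in the financial district just after 2:00PM",
)

print(guess_author(BLOB))                         # works while the format holds
print(guess_author(BLOB.replace("by ", "By: ")))  # one small variation breaks it
print(article.author)                             # always available
```

The heuristic only works for as long as every editor types the byline exactly the same way, which is the point of the complaint above.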

The most logical step seems to be to put everything in its own field. But now you’re into one-bolt-per-drawer territory. There’s an art to content modeling, and the core to that art is finding the right balance.

What fields you choose to use and how you represent, name, and arrange those fields is one of the hardest parts of any project, and has the biggest impact on how the system will be used in the future.

If you break out a separate field for every potential item an editor might need, you can create a system so dense and complex that no one can use it. On the other hand, if you simply punt and dump everything into a single WYSIWYG field, you’ve made things easy to understand, but you’ve essentially given up most of the advantages of your CMS.

What’s more, most of the time, we need some flexibility to vary our content. Maybe we want to insert an image, or some special markup. Having a series of special tags to memorize and insert manually, or a field for each paragraph, doesn’t provide a good user experience for the editor. So there are situations where WYSIWYG makes sense as an editing interface.

Striking the right balance.

Is this a limitation of WYSIWYGs? It’s certainly true that WYSIWYG interfaces allow for more flexible content, and that this can lead to unstructured content. But the main issue is how most CMS systems implement WYSIWYG.

In fact, we could argue that our problem isn’t with WYSIWYG as an editing interface, but with the data structure it outputs. It’s not the WYSIWYG, but the HTML.

By far the simplest way to implement a WYSIWYG in a browser is to use the ‘contenteditable’ attribute. This started way back in the day in Internet Explorer, and is now part of the HTML standard and supported in all major browsers. Basically, if you flag an area as ‘contenteditable’, the browser goes from being an HTML renderer to an HTML editor, and we can read back the entered HTML.

This is convenient, but where the problem arises is that nearly all CMS systems simply store this HTML directly, creating the ‘blob’ of content. Different browsers will produce slightly different HTML, so this starts producing some inconsistency. Worse, users can paste in HTML they’ve copied from somewhere else (Word is a particularly bad offender as a source), and it will get saved along with everything else. On the surface, everything will look the same, but if you try to parse the markup, you’re probably heading for trouble.
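As a rough illustration of why parsing stored HTML gets risky, consider three fragments that all render as a bold word. The exact markup any given browser or word processor emits is an assumption here, and `BoldFinder` is a deliberately naive, hypothetical extractor built on Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Three fragments that look the same on screen but differ in markup.
# Which source produces which variant is assumed for illustration.
FRAGMENTS = [
    "<p><b>Martians</b> attack</p>",
    "<p><strong>Martians</strong> attack</p>",
    '<p><span style="font-weight: bold">Martians</span> attack</p>',
]


class BoldFinder(HTMLParser):
    """Naive extractor that only recognizes <b> as bold."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <b> elements we are currently inside
        self.found = []  # text collected while inside <b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found.append(data)


def bold_text(html: str) -> list:
    """Return the text the naive extractor believes is bold."""
    finder = BoldFinder()
    finder.feed(html)
    finder.close()
    return finder.found


results = [bold_text(fragment) for fragment in FRAGMENTS]
print(results)  # only the first fragment is recognized as bold
```

Every extra variant means another special case in the parser, and pasted content keeps adding variants.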

Chunky blobs: WYSIWYG as structured content.

So what we really need to do this well is a WYSIWYG that will produce structured data instead of an HTML tag soup. If we had that, we’d have a hybrid – a ‘chunky blob’. If you have an enforced schema on your WYSIWYG data, you could still re-use and index data inside of a WYSIWYG.

The mind-boggling thing is that nearly every CMS currently on the market gets this wrong. Part of the problem is that it’s technically challenging to create, and it isn’t something that’s readily apparent in a sales demo. The ‘good’ WYSIWYG and the ‘bad’ WYSIWYG can have the same editing UI but handle data completely differently on the back end.

In fact, a ‘good’ editing UI may actively reject things that are pasted in if they don’t comply with the proper schema. The WYSIWYG interface seems ‘pickier’, which is a good thing in the long run. But it’s not something that will pop up until long after a CMS contract has been signed, so it’s not something that naturally works its way into a product roadmap.
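A sketch of what that ‘pickiness’ could look like, assuming a made-up schema of allowed tags and attributes. The `SCHEMA` table and the `check_paste` helper are hypothetical; a real editor would enforce its own schema, and probably in the browser rather than in Python.

```python
from html.parser import HTMLParser

# Hypothetical schema: the tags editors may use and the attributes each allows.
SCHEMA = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href"},
    "img": {"src", "alt"},
}


class SchemaChecker(HTMLParser):
    """Collects every tag or attribute that falls outside SCHEMA."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in SCHEMA:
            self.violations.append(f"tag <{tag}> is not allowed")
            return
        for name, _value in attrs:
            if name not in SCHEMA[tag]:
                self.violations.append(
                    f"attribute '{name}' is not allowed on <{tag}>"
                )


def check_paste(html: str) -> list:
    """Return a list of schema violations; an empty list means the paste is OK."""
    checker = SchemaChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


clean = check_paste("<p>Hello <em>world</em></p>")
messy = check_paste('<p class="MsoNormal"><font color="red">Hi</font></p>')
print(clean)
print(messy)
```

Whether the editor rejects the paste outright or strips the offending markup is a design choice; either way, nothing outside the schema reaches storage.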

There’s a pretty short list of enterprise-scale CMS systems that handle WYSIWYG content as schema-enforced, structured data – eZ Publish comes to mind, and as a bonus it also provides the ability to expand the tag set – so it’s important to look for this functionality before it becomes a hidden iceberg, finally striking months down the line when you’re looking to re-use content.

You can’t click a link in your print version.

So let’s look at a situation where this might matter. Suppose our example article is properly modeled with the following fields. (A real news article will have many more, but these will do for now.)

Title
Byline
Dateline
Body

We’ve published it on the web, and it includes the main story, a couple of embedded images, and links to two related stories, as well as to background information on other sites.

Now, since we’re a modern newspaper, we have an app. In the app, it would be better if the links to our content were restructured to use the app’s resource locator (myapp:// instead of http://), and it would be better to serve the app a smaller version of the images. We’ve also made the editorial decision to present sidebar content as its own panel in the app, rather than inline with the content.

We’re also going to send a version for print. In the print version, all of the links should be suppressed, perhaps with the URLs added to a sidebar, and the images should use the largest resolution we have available.

All of these transformations can be accomplished in roundabout ways with unstructured WYSIWYG data, but you’re going to wind up putting in more work to get less consistent results than if you had used structured data in the first place. The better route is to create inline tags that structure the content without sacrificing editorial flexibility.
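Here is one way the web, app, and print outputs could fall out of a structured body. This is a sketch under stated assumptions: the node classes, the `dailyplanet.example` host, the image URL pattern, and the mapping from site paths to myapp:// URLs are all invented for illustration.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, List, Tuple, Union
from urllib.parse import urlsplit


@dataclass
class Text:
    value: str


@dataclass
class Link:
    url: str
    label: str


@dataclass
class Image:
    image_id: str
    alt: str


Node = Union[Text, Link, Image]

OWN_HOST = "dailyplanet.example"                         # assumed site host
IMAGE_URL = "https://img.example/{image_id}-{size}.jpg"  # assumed image service


def image_src(node: Image, size: str) -> str:
    return IMAGE_URL.format(image_id=node.image_id, size=size)


def to_app_url(url: str) -> str:
    """Rewrite links to our own site as myapp:// and leave other sites alone."""
    parts = urlsplit(url)
    if parts.netloc == OWN_HOST:
        return "myapp://" + parts.path.lstrip("/")
    return url


def render_html(nodes: List[Node], size: str,
                rewrite: Callable[[str], str] = lambda url: url) -> str:
    """Shared HTML renderer; channels differ in image size and link rewriting."""
    out = []
    for node in nodes:
        if isinstance(node, Text):
            out.append(escape(node.value))
        elif isinstance(node, Link):
            href = escape(rewrite(node.url))
            out.append(f'<a href="{href}">{escape(node.label)}</a>')
        else:
            src = escape(image_src(node, size))
            out.append(f'<img src="{src}" alt="{escape(node.alt)}">')
    return "".join(out)


def render_print(nodes: List[Node]) -> Tuple[str, List[str]]:
    """Suppress links, collect their URLs for a sidebar, use the largest images."""
    out, sidebar = [], []
    for node in nodes:
        if isinstance(node, Text):
            out.append(node.value)
        elif isinstance(node, Link):
            out.append(node.label)
            sidebar.append(node.url)
        else:
            out.append(f"[image: {image_src(node, 'large')}]")
    return "".join(out), sidebar


body = [
    Text("Residents fled in terror as a "),
    Link("http://dailyplanet.example/stories/strange-craft", "strange craft"),
    Text(" crashed downtown. "),
    Image("crash-site", "Crash site"),
]

web = render_html(body, "medium")
app = render_html(body, "small", to_app_url)
print_text, print_sidebar = render_print(body)
```

The same stored body feeds all three outputs. None of the renderers has to parse HTML to find out what is a link and what is an image.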

Additionally, using a schema for your WYSIWYG content enables you to extend the field to add new capabilities in the future. Your editors want to embed YouTube video in their articles? Create a tag that correctly builds an embedded YouTube player for the web output, and ignores it for print. If YouTube changes their player format, and you already have it in 1,000 articles? No problem, just change the template that renders your player.
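The video case might look something like this. It is a minimal sketch: the tag name, the `TEMPLATES` table, and the replacement player markup are hypothetical, and the embed URL pattern is an assumption about YouTube's player format rather than something to rely on.

```python
from string import Template

# Hypothetical render templates, keyed by inline tag name and output channel.
# The embed URL pattern is an assumption about YouTube's player format.
TEMPLATES = {
    ("youtube", "web"): Template(
        '<iframe src="https://www.youtube.com/embed/$video_id" '
        'allowfullscreen></iframe>'
    ),
    ("youtube", "print"): Template(""),  # print simply ignores the video
}


def render_tag(name: str, attrs: dict, channel: str) -> str:
    """Render one stored inline tag for the given output channel."""
    return TEMPLATES[(name, channel)].substitute(attrs)


# What the article actually stores: the tag name and the video ID, no markup.
stored_tag = {"name": "youtube", "attrs": {"video_id": "abc123"}}

web_html = render_tag(stored_tag["name"], stored_tag["attrs"], "web")
print_html = render_tag(stored_tag["name"], stored_tag["attrs"], "print")

# If the player format changes, only the template changes.
# The stored articles are untouched.
TEMPLATES[("youtube", "web")] = Template(
    '<div class="player" data-video="$video_id"></div>'
)
new_web_html = render_tag(stored_tag["name"], stored_tag["attrs"], "web")
```

Because the articles store only the video ID, the 1,000 existing articles pick up the new player the next time they are rendered.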

Proper tools, proper planning, proper experience.

Editors, administrators, authors, and contributors are key stakeholders for every web project. And even the most structured, well-modeled content will fail if a new site requires a major change in an editor’s existing workflow.

Which is where the balance ultimately comes in – by providing editors with an experience that fits their workflow, we can ensure that they are able to contribute high-quality, well-modeled content. Instead of throwing out the existing WYSIWYG tools, we can instead add structure and schema to WYSIWYG content – providing them with an easy way to create rich and interesting content while still preserving a proper data structure.

In the end, the balance between order and chaos can be used to enable editors to be a part of your content management best practices, rather than the dark corner that the best practices are structured around.