As part of our work on open content, and how to design systems that support authoring and translation that are both useful and usable, we have been thinking about the role of metadata and, by extension, search. This post contains some incomplete thoughts - a line in the sand, more than anything - and, six months from now, will provide something for all of us to laugh at. Possibly, we will all be able to laugh at this sooner than that. Time can be cruel.
In other words, I am firmly reserving the right to recant any or all of what I'm saying here. I'd love to hear different viewpoints on this.
Keep Data Simple
This sounds - and is - pretty basic, right up until it's time to implement an actual system. As soon as building starts, people "just need this one field."
In building data systems, additional fields are the equivalent of scope creep.
Humans Should Only Enter Metadata In Precisely Defined Circumstances
We'll get to this in more detail later in this post, but whenever possible, metadata should be derived from the data.
In some cases, this is simple: the author of a piece of content is easy to derive. Ditto for the date a piece was created.
A good example of metadata that should be entered by a human is a license.
But when a person remixes content that uses different licenses, the pool of licenses available for the remix should be derived.
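To make the derivation idea concrete, here is a minimal sketch of computing the license pool for a remix by intersecting what each source license permits. The compatibility table is illustrative only - it is a simplification, not legal guidance, and the license names and rules are assumptions for the example.

```python
# Hypothetical compatibility table: which licenses a remix may carry,
# given the license of one of its source works. Illustrative only.
REMIX_OPTIONS = {
    "CC-BY": {"CC-BY", "CC-BY-SA", "CC-BY-NC"},
    "CC-BY-SA": {"CC-BY-SA"},
    "CC-BY-NC": {"CC-BY-NC"},
}

def derive_license_pool(source_licenses):
    """Intersect the remix options allowed by each source license."""
    pools = [REMIX_OPTIONS[lic] for lic in source_licenses]
    result = set.intersection(*pools) if pools else set()
    return sorted(result)

# A remix of CC-BY and CC-BY-SA sources can only be CC-BY-SA here.
print(derive_license_pool(["CC-BY", "CC-BY-SA"]))  # ['CC-BY-SA']
```

The point is that the human chooses a license once, for original work; everything downstream is computed, not re-entered.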
Your Picture Is My Image Is Her Binary
System-defined metadata can be useful, but it will be most useful to the people who designed and built the system, as they are the ones who define the system-specific meanings of the metadata terms.
In other words, your metadata will be useful to you, but it might not be useful to your users. For better or worse, metadata is rooted in language, and words carry baggage and connotations that, among a large group of individuals, make a universal meaning elusive at best.
With this in mind, the "best" metadata is often good search.
But Community Tagging Is Awesome
No, it isn't. Community tagging creates the appearance of structure and organization when what you really have is a chunky stew of chaos.
If you can get enough people contributing tags, then - maybe - you will be able to pull some signal from the noise, but that assumes both a large user base and robust search technology.
Faceted Search: Blech or Ugh?
In designing search systems for sites, faceted search can be useful for providing structure when sifting through content. However, is faceted search something that we actually appreciate, or something that we have grown accustomed to?
On Google, how often do you use faceted search, or go beyond the options that you can access via the advanced search UI?
If faceted search went away, or was replaced with facets generated from metadata that could be derived from the core dataset, what would be lost? Anything?
Look at your own search habits and identify when faceted search actually saved you time. In those situations, was it essential, or could full-text search have done the same job?
Search Has Its Limitations
But with all that said, search has its limitations.
Understanding how stemming works (or doesn't work) is essential to interpreting the results we get.
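A toy suffix-stripper makes the point. Real stemmers (Porter, Snowball) are far more nuanced than this, but even a crude sketch shows both the benefit (word variants collapse to one stem) and the risk (over-stemming creates false matches):

```python
def naive_stem(word):
    """Strip a common English suffix, keeping a stem of at least 3 letters.
    A deliberately crude illustration, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Benefit: variant forms match each other after stemming.
print(naive_stem("translating") == naive_stem("translated"))  # True

# Risk: over-stemming. A search for "new" would now match "news".
print(naive_stem("news"))  # 'new'
```

Every search engine makes trade-offs like these silently, and they differ by language - which is why results over translated, multilingual content can surprise us.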
And this becomes even more complex when we work with translated content in multiple languages.
"Just In Time" Metadata
There are times and places where good, structured metadata is essential. By separating the metadata requirements from the actual dataset (and keeping the core data as simple as possible), you help ensure that the quality of your underlying data remains high.
Implementing a metadata structure around data is firmly in the domain of a context-specific application.
In terms of open educational resources, this allows for easier reuse of the data. If a piece of content was written in the US, a school looking to reuse that content in the UK won't care about the Common Core alignment of the resource.
To put this another way, inflicting a metadata standard on your data (as opposed to applying metadata within an application that uses the data) makes your data both less portable and less useful.
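As a sketch of what "metadata within an application" might look like in practice: keep the core record minimal, and let each application maintain its own metadata keyed by content id. All field names and ids here are hypothetical.

```python
# Core dataset: just the content plus metadata that is derived or
# universally needed. (Illustrative fields and ids.)
core = {
    "lesson-042": {
        "title": "Introduction to Fractions",
        "author": "jdoe",
        "created": "2013-05-01",
        "body": "...",
    }
}

# Application-layer metadata, kept outside the core data. A US-facing
# application might track Common Core alignment; a UK application would
# simply keep its own table and never see this one.
us_app_metadata = {
    "lesson-042": {"common_core": ["3.NF.A.1"]},
}

def lesson_with_local_metadata(content_id, app_metadata):
    """Merge core content with one application's metadata at read time."""
    return {**core[content_id], **app_metadata.get(content_id, {})}

print(lesson_with_local_metadata("lesson-042", us_app_metadata)["common_core"])
```

Because the Common Core alignment never touches the core record, the lesson travels cleanly to any other application or country; each context layers on only the categorization it needs.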
In listening to people who are writing and using open content, a key barrier we hear about repeatedly is portability (there are others as well, and these other issues will get their own posts).
A barrier to portability - and really, to the usability of authoring and translation platforms that support open content - is the premature and often unnecessary application of metadata into the underlying data. If we keep the data as clean as possible - which means resisting the urge to apply metadata without a compelling need - we can simplify both portability and usability. Metadata should be applied as part of an application that uses the data, when there is a clearly defined need to categorize the data. And then, the categorization should be done by people who know what they are doing.
It doesn't matter how good your categorization system is if it is applied to your data inconsistently, and/or if no one uses your data.
Image Credit: "faceted" taken by jenny downing, published under an Attribution license.