On Tagging, Part 1

In my current job, I manage content – a lot of content. And a part of that job is trying to figure out ways to help folks find and navigate the articles that we publish. At a high level, there are 2 primary ways that people currently go about these tasks – taxonomy and search. From the user experience viewpoint, search is pretty self-explanatory, so I don’t think there’s much need to go into it here. However, it is worth pointing out that one of the primary values of search is that it’s inherently dynamic. You can attach a degree of confidence to search (some search engines more than others) as a way of getting to content because the engines are constantly being updated with new information.

Taxonomy, on the other hand, tends to be relatively static in nature. This has value in the sense that it is relatively predictable and can thus be used in constructing UI metaphors such as menus and tag clouds. However, its static nature also makes it a royal pain to maintain (not to mention that I’m just really bored with UI metaphors like menus and tag clouds). As the taxonomy evolves, the amount of energy required to apply changes across the content set grows exponentially. Additionally, because this type of content organization generally relies on a person to associate a piece of content with a set of taxonomy elements, it tends to only work well when the content’s author also controls the taxonomy. Now, in some cases, this isn’t a problem because the information being organized is inherently static – in my case, an example is the magazine issue that an article is associated with. In other cases, however, the static nature of the taxonomy can be much more problematic – for example, topic (or tag).

In managing tags for all MSDN Magazine articles, I have 2 major problems to solve.

  1. I have many authors who write articles – and none of them own the set of tags that are ultimately associated with their articles.
  2. I want to support an effort to apply a standard taxonomy across the entire MSDN network (yes, I realize that I technically said “network” twice).  This brings with it 3 sub-problems.
    1. A standardized, network-scoped taxonomy is much larger than a typical tag cloud.
    2. The taxonomy is likely to be pretty fluid as it spreads out across the network, as general trends change, or as new technologies launch.
    3. When I apply the taxonomy (or any changes to the taxonomy over time), I need a way to review and possibly update content that has already been published.

When initially approaching this problem, the thinking was that we would simply hire a vendor to go back through the thousands of articles that we’ve published over the years and manually tag everything.  It took all of 3 seconds to realize how ridiculous that idea was.  So then my friend Greg said, “frankly, I don’t see why we’re spending our time worrying about tagging at all if the goal is about helping people find articles on certain topics – search does a great job at that already.”  And that got me thinking about what really matters in managing a taxonomy (tag set).  In my case, I believe that it’s important to diligently maintain and evolve the taxonomy – and to present it in interesting ways (much more interesting ways than we currently do) as a navigation metaphor.  I also think it goes without saying that it’s important to create the content.  However, what if I could get out of the way with respect to the associations between content and tags, and instead let search fill that role?

So, just to give the idea a test, I got a copy of the work-in-progress standard taxonomy, iterated through the tags, and for each tag asked Live Search for all of the MSDN Magazine articles that it believed were relevant.
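In rough terms, the driver is just a loop over the taxonomy, with the per-tag work handled by the GetArticleIDsForTag function below (the taxonomy variable and the msdn subdomain here are placeholders for illustration, not the real inputs):

var tagMap = new Dictionary<string, IEnumerable<string>>();

// "taxonomy" stands in for the working list of standard tag names;
// "msdn" is the magazine subdomain plugged into the site-scoped query.
foreach (var tag in taxonomy)
   tagMap[tag] = GetArticleIDsForTag("msdn", tag);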

private static IEnumerable<string> GetArticleIDsForTag(
   string magazineName, string tag) {
   var shortIDs = new List<string>();

   // Live Search SOAP client; give each request up to two minutes
   // before the channel times out.
   var s = new MSNSearchPortTypeClient();
   s.InnerChannel.OperationTimeout = new TimeSpan(0, 0, 2, 0);

   var searchRequest = new SearchRequest {
      Query = string.Format(SEARCH_Q, magazineName, tag),
      AppID = SEARCH_APP_ID,
      CultureInfo = "en-US"
   };

   // Page through the results 50 at a time until we either run out
   // of results or hit the engine's 1000-result cap.
   var t = 0;
   while (t < 1000) {
      searchRequest.Requests = new SourceRequest[] {
         new SourceRequest {
            Source = SourceType.Web,
            Count = 50,
            Offset = t
         }
      };

      var searchResponse = s.Search(searchRequest);
      var response = searchResponse.Responses[0];

      // Pull the short ID out of each result URL, skipping blanks
      // and duplicates.
      foreach (var result in response.Results) {
         var shortID = ExtractShortID(result.Url);

         if (!string.IsNullOrEmpty(shortID) && !shortIDs.Contains(shortID))
            shortIDs.Add(shortID);
      }

      t += response.Results.Count();

      if (response.Results.Count() == 0)
         break;
   }

   return shortIDs;
}

A couple of things to point out here. First, while it’s hidden away in a const, the query sent to the search engine looks like the following:

private const string SEARCH_Q =
 "site:http://{0}.microsoft.com/en-us/magazine/ meta:Search.Magazine.PageType(article) \"{1}\"";

In addition to scoping the search to the magazine site’s iroot, I’ve also limited the search to just magazine articles by inserting a meta tag into each article page that looks like the following:

<meta name="Search.Magazine.PageType" content="article" />

In addition to the query URL syntax, I needed to deal with a couple of realities about how search processes and returns results. Specifically, it caps the number of returned results at 1000, and it returns a maximum of 50 results at a time (if you try setting the page size greater than 50, you’ll get 10 per page instead). Therefore, I explicitly configured my search to return 50 items per page and iterated until I either ran out of results or ran up against the 1000-item limit. For the purposes of auto-tagging, I hypothesized that any results beyond the first 1000 are likely noise anyway – and in fact, as I analyze the results, that number may turn out to be far lower.

The final step in this block is to process the results. In my case, I need to capture what’s known as the short ID – the cryptic file name that’s generated for all pages published on MSDN. For that, I ran the following regular expression and extracted the capture.

private static readonly Regex URLAnalyzer =
 new Regex(
    @"(?:\/?(?<languageServed>\w{2}-\w{2}))?\/magazine\/(?:\w+\/)*(?<shortid>\w+)(?:\((?<languageRequested>\w{2}-\w{2})?[,\s]*(?<version>\w{2}\.\w{2})?\))?\.aspx",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.Compiled);
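The ExtractShortID helper called from the loop above isn’t shown here; as a rough sketch, it presumably just applies URLAnalyzer and returns the shortid capture (or an empty string when there’s no match):

private static string ExtractShortID(string url) {
   // Sketch only: run the URL through URLAnalyzer and pull out the
   // "shortid" capture group.
   var match = URLAnalyzer.Match(url);
   return match.Success ? match.Groups["shortid"].Value : string.Empty;
}

Run against a URL like http://msdn.microsoft.com/en-us/magazine/cc123456.aspx (a made-up short ID), the shortid group captures “cc123456”.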

In the end, I get a map of tags to short IDs. The next step is to take that map, load it into my data warehouse where I can correlate it with all sorts of other statistics, and then start building some really interesting visualizations on top of it. I’ll get into those topics in later posts, but the point of this post is simply to discuss some alternate ways to think about tagging and some strategies for testing those ideas.

I’m still experimenting with how well this technique works in practice, so don’t think that I’ve landed on this as some kind of panacea. Specifically, I’m still working through 2 main issues: 1) helping search improve the signal-to-noise ratio, and 2) ensuring that search indexes my pages in a reasonable timeframe so that an article can be auto-tagged soon after it’s published.
