Generating the Collection Data

As I mentioned previously, getting your data design right is critical in creating a highly functional and aesthetically pleasing collection.  In that same post, I also mentioned that it’s good to test your data design using the pivot viewer before creating any of your trade card images.  This will help to validate that you’ve chosen the right facet categories and that your sample sizes look like about what you would expect across your various pivots – particularly in histogram mode.

Pivot (both the rich client and Silverlight control) expects XML data written to a defined XML schema.  Additionally, your XML should be saved as a file with a *.cxml extension (more about additional requirements for hosting on the Web or from Azure later).  The best resource for understanding more about this format (as well as download the actual XSD files) is the pivot Web site – more specifically, the Collection XML Schema page.  For the MSDN Magazine collection, the collection definition XML looked like the following:

<?xml version="1.0" encoding="utf-8"?>
<Collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:p=http://schemas.microsoft.com/livelabs/pivot/...
    Name="MSDN Magazine Articles"
    p:Icon="msdnmagazine_icon.ico"
    SchemaVersion="1.0"
    xmlns="http://schemas.microsoft.com/collection/metadata/2009">
    <FacetCategories>
        <FacetCategory Name="Popularity" Type="String" p:IsFilterVisible="true"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false">
            <Extension>
                <p:SortOrder Name="Popularity Group">
                    <p:SortValue Value="Viewed Often" />
                    <p:SortValue Value="Viewed Occasionally" />
                    <p:SortValue Value="Viewed Infrequently" />
                </p:SortOrder>
            </Extension>
        </FacetCategory>
        <FacetCategory Name="Issue Date" Type="DateTime" Format="MMMM yyyy"
            p:IsFilterVisible="true"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Topic" Type="String"
            p:IsFilterVisible="true"
            p:IsMetaDataVisible="true"
            p:IsWordWheelVisible="true" />
        <FacetCategory Name="Authors" Type="String"
            p:IsFilterVisible="true"
            p:IsMetaDataVisible="true"
            p:IsWordWheelVisible="true" />
        <FacetCategory Name="Column / Feature" Type="String"
            p:IsFilterVisible="true"
            p:IsMetaDataVisible="true"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Column Name" Type="String"
            p:IsFilterVisible="true"
            p:IsMetaDataVisible="true"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Has Code" Type="String"
            p:IsFilterVisible="true"
            p:IsMetaDataVisible="true"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Issue" Type="String"
            p:IsFilterVisible="false"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Code Link" Type="String"
            p:IsFilterVisible="false"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Title" Type="String"
            p:IsFilterVisible="false"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Subhead" Type="String"
            p:IsFilterVisible="false"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false" />
        <FacetCategory Name="Page Views" Type="String"
            p:IsFilterVisible="false"
            p:IsMetaDataVisible="false"
            p:IsWordWheelVisible="false" />
    </FacetCategories>
    <Items ImgBase="msdnmagazine_deepzoom\msdnmagazine.dzc">
        <Item Id="1"
            Name="Thread Pools: Scalable Multithreaded ..."
            Img="#0" Href="http://msdn.microsoft.com/...">
            <Description>
                <!-- long description here -->
            </Description>
            <Facets>
                <Facet Name="Authors">
                    <String Value="Ron Fosner" />
                </Facet>
                <Facet Name="Code Link">
                    <String Value="http://code.msdn...." />
                </Facet>
                <Facet Name="Column / Feature">
                    <String Value="feature" />
                </Facet>
                <Facet Name="Column Name">
                    <String Value="Feature Article or Unknown " />
                </Facet>
                <Facet Name="Has Code">
                    <String Value="Has Code" />
                </Facet>
                <Facet Name="Issue">
                    <String Value="October 2010" />
                </Facet>
                <Facet Name="Issue Date">
                    <DateTime Value="2010-10-01T00:00:00" />
                </Facet>
                <Facet Name="Page Views">
                    <String Value="Unknown" />
                </Facet>
                <Facet Name="Popularity">
                    <String Value="Viewed Infrequently" />
                </Facet>
                <Facet Name="Subhead">
                    <String Value="Thread Pools" />
                </Facet>
                <Facet Name="Title">
                    <String Value="Scalable Multithreaded ..." />
                </Facet>
                <Facet Name="Topic">
                    <String Value="C++" />
                    <String Value=".NET" />
                    <String Value="Concurrency" />
                </Facet>
            </Facets>
        </Item>

        <!-- Many, many, many more items -->
    </Items>
</Collection>

As you can see, most of the configuration information that tells the Pivot viewer how to setup its various windows (filters, the details pane, etc.) is defined up top in the FacetCategories section.  I’ll quickly point out a couple features of the collection schema that I’ve taken advantage of for the magazine collection, but won’t spend too much time here because it’s all really well documented on the Pivot Web site.

  • Depending on the level of control that you desire, you may end up needing to use elements from 2 different Pivot-related schemas.  All of the basic elements are defined in the http://schemas.microsoft.com/collection/metadata/2009 namespace, which is my default namespace.  More detailed settings, such as determining whether a facet category is visible in the filter pane, are controllable through elements and attributes defined in the http://schemas.microsoft.com/livelabs/pivot/collection/2009 namespace (prefixed “p” in my document).
  • I’ve defined a custom sort order for the popularity group.  As you can see, this is really simple to do – just put the various text strings in the order that they should show up in.  The key here is that you need to know what your possible values are – so not a good idea to try this when you could have a lot of different values or when you don’t know all of the possible values up front.  Then again, if you have a large set of facets in a category, this should in itself be a “smell” – you want to keep your set of facets relatively small within a facet category so that your user is not having to scroll through an exhaustive list just to filter.  For the magazine collection, I broke this rule a bit for the authors facet category – but I did this intentionally for 2 reasons:
    • Looking at previous traffic data showed that people actually did care to filter by specific authors
    • Within the filter pane, the Pivot viewer enables sorting by the number of articles an author has written and not just by the alphabetical list of authors – and I thought this was a nice way to bring the ‘regular’ writers (and based on our data, our more searched out writers) up to the front.
  • If you compare the above XML with the actual Pivot viewer, you’ll see that I’ve defined more facet categories than I’m actually displaying (for example, “Subhead”).  This has to do with my production workflow (more on that later) – these are not actually needed by the Pivot viewer.
  • The “Topic” facet has multiple string values in it.  This is great for enabling an article to show up across multiple topics, but it has the downside of double-counting the article in histogram mode.  Because of my goals, this was an acceptable level of imprecision – however, you may have different requirements, so be aware.

And now the part about actually generating the data…

There are 4 different data sources that feed into the process that ultimately yields the above XML:

  • The MSDN Magazine metadata management application (SharePoint lists)
  • A legacy MSDN content item with the mappings of topic friendly names to short names (emphasis on legacy)
  • The MSDN back end database for getting ‘short ids’ from the longer guid identifiers
  • The previous month’s page traffic report, courtesy of Omniture

The workflow is managed using a SSIS package and looks like the following:

pivotWorkflow

There’s (unfortunately) some pre-processing work that needs to happen on the Omniture data due to the fact that apparently, “csv” to the Omniture developers translates into “CSV that we assume someone will be reading using only Excel”.  As such, the file contains a bunch of header data (again, stuff that would look great if you were using Excel to view the file, but stuff that only gets in the way when you’re trying to programmatically load the data).  Even more insidious, and the part that required the true ‘cleanup’ by me (SSIS can be configured to simply ignore header rows, so that’s more about me complaining than anything else), was in the fact that the – again, CSV – file outputted formatted numbers, and guess what character was used for the thousands separator?  You guessed it – the comma!  As a result, my script task reads in the text and runs the following regular expression code to replace the formatted numbers with unformatted numbers.

Regex regex = new Regex("\"(\\d+),(\\d+)\"",
    RegexOptions.CultureInvariant | RegexOptions.Compiled);
string regexReplace ="$1$2";
string result = regex.Replace(fileText, regexReplace);

The Omniture cleanup then simply removes extra data columns from the report and then saves the cleaned up file over the original one in the working directory.  While all this has been happening, we’ve also copied the XML representation of all article content from the magazine metadata management application.  This XML looks like the following:

<?xml version="1.0" encoding="utf-8"?>
<Index>
    <Mag Name="MSDN Magazine"
        homepageguid="ebc3411f-9739-4103-a534-76783c4aa573"
        issuesguid="320c2adc-3dc5-4fbb-b197-e2dfbc766d35"
        issuestocguid="f3f75834-f3c2-4973-8d63-375fe8231b5e"
        rootguid="bc353eb1-58d9-4e7f-b70a-2900729b6b18">
        <Yr Name="2010" Dir="10"
            guid="7aaba6cc-42b6-4432-85ac-592e48b79e30"
            tocguid="cdb28296-4fe9-473b-841a-d34a17643c88" codeguid="">
            <Iss Note="" Year="2010" Dir="10" Name="October"
                Date="10/1/2010" guid="7a430b75-f398-4b6f-9d70-41888e7b95b2"
                tocguid="bdc12809-ffaa-4ff7-86e1-0c4ea451e8eb" Id="25;#">
                <Art Note="" Sub="Thread Pools" type="feature"
                    Ti="Scalable Multithreaded Programming with Thread Pools"
                    Au="Ron Fosner" Sum="Getting the most out of ..."
                    guid="26c5f253-ff52-4e94-bbc5-1a8f8d6ab031"
                    shortid="gg232758" codeguid="" Name=""
                    coderef="http://code.msdn.microsoft.com/mag201010Thread"
                    codename="http://code.msdn.microsoft.com/mag201010Thread"
                    WebTitle="Scalable Multithreaded Programming ..."
                    Id="345;#" IssuePublishDate="2010-10-01T00:00:00">
                    <Tag>C++</Tag>
                    <Tag>.NET</Tag>
                    <Tag>Concurrency</Tag>
                </Art>
                <!-- Many, many, many more articles -->
            </Iss>
        </Yr>
    </Mag>
</Index>

Once I’ve got that copied over, I perform several post-processing steps on it using the other data sources I’ve mentioned above.  I’ll walk through them in order.

1. Resolve Short IDs

If you look at the above XML, you can see that the example article has an attribute ‘shortid’ defined.  As you probably figured, this is what shows up on the URL – and as such, what shows up in the Omniture traffic reports.  Most of the more recently added articles have this short id defined – however, when I started the project, I discovered that nearly all articles which were more than 2 years old had only a surrogate guid defined, and as such, I created this command to find the guid in the backend content database and extract the short id.

2. Get the Friendly Tag Name

This is just a product of history – I would explain it to you if I even knew the original background.  But for whatever reason, the value of the Tag elements does not match to the “Friendly Name” that gets displayed in the UI.  It doesn’t look too bad in the example above, but there are some really cryptic ones in there – promise.  Anyway, the mapping data is stored as a content item which can be publically accessed through the MTPS content Web services.  Here’s the code to match everything up.

class UpdateTagDescriptions
{
    private const string MAGAZINE_TOPICS_ASSET = "ff754346";
    private const string MAGAZINE_TOPICS_LOCALE = "en-us";
    private const string MAGAZINE_TOPICS_VERSION = "MSDN.10";
    private const string XMLNS_XHTML = "http://www.w3.org/1999/xhtml";

    private static XmlNamespaceManager _namespaceMgr;

    public UpdateTagDescriptions() {
        _namespaceMgr = new XmlNamespaceManager(new NameTable());
        _namespaceMgr.AddNamespace("x", XMLNS_XHTML);
    }

    public void Run(XDocument masterXmlDoc) {
        //get the topics from the MTPS content Web service
        var proxy = new ContentServicePortTypeClient();
        var request = new getContentRequest
        {
            contentIdentifier = MAGAZINE_TOPICS_ASSET,
            locale = MAGAZINE_TOPICS_LOCALE,
            version = MAGAZINE_TOPICS_VERSION
        };

        // Specify that we want the XHTML primary document
        // contents and code stream
        var documents = new requestedDocument[1];
        documents[0] = new requestedDocument {
            type = documentTypes.primary,
            selector = "Mtps.Code" };

        request.requestedDocuments = documents;

        var response = proxy.GetContent(null, request);
        var r = response.primaryDocuments[3].Any;

        Console.WriteLine("Building tag map...");

        var tagMap = new Dictionary<string, string>();
        var topicElements = r.SelectNodes("//x:topic", _namespaceMgr);
        if(topicElements == null)
            throw new Exception("No topic elements found.");

        foreach (XmlElement topicElement in topicElements)
        {
            var name = topicElement.GetAttribute("name");
            var tag = topicElement.GetAttribute("tag");
            tagMap.Add(tag, name);
        }

        Console.WriteLine("Total number of elements in the tag map: "
            + tagMap.Count);

        foreach (var mappingItem in tagMap) {
            var tagElements = masterXmlDoc.XPathSelectElements(
                "//Tag[.='" + mappingItem.Key + "']", _namespaceMgr);

            Console.WriteLine("Replacing {0} {1} tags with {2}",
                tagElements.Count(), mappingItem.Key, mappingItem.Value);

            foreach (var tagElement in tagElements)
                tagElement.Value = mappingItem.Value;
        }
    }
}

3. Embellish with Omniture Metrics

Here is where the Omniture data actually gets added to the working copy of the XML file pulled from the metadata management application.  Basically, the code does 3 things:

  1. create a popularity group map based on the total number of ‘art’ elements in the metadata management app’s XML
  2. run through the cleaned up Omniture report CSV from top to bottom (the data comes out of Omniture sorted from top ranked (most viewed) to bottom)
  3. Add XML data noting the popularity group as well as some of the raw statistics (e.g. page views)

The code looks like the following:

class EmbelishWithPageMetrics
{
    public static Regex ShortIdRegEx = new Regex("/(\\w+).aspx",
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    public void Run(XDocument masterXmlDoc, string omnitureFile) {
        Console.WriteLine("Embelishing with Omniture page metrics...");

        // get total number of articles and setup the popularity group
        var articleElements = masterXmlDoc.XPathSelectElements("//Art");
        var popularityGroupMap = new PopularityGroupMap(
            articleElements.Count());

        // update each article with popularity group and page views data\
        using (var textStream = new FileStream(omnitureFile, FileMode.Open)) {
            using (var sr = new StreamReader(textStream)) {
                var recordNumber = 1;
                while (!sr.EndOfStream) {
                    var line = sr.ReadLine();
                    var cells = line.Split(new[] {','});

                    var shortId = ExtractShortIdFromUrl(cells[1]);

                    recordNumber += UpdateOmnitureFacetsForArticle(shortId,
                        recordNumber, int.Parse(cells[2]),
                        popularityGroupMap, masterXmlDoc);
                }
            }
        }
    }

    private static string ExtractShortIdFromUrl(string url)
    {
        var shortIdMatch = ShortIdRegEx.Match(url);
        return shortIdMatch.Groups[1].Value;
    }

    private static int UpdateOmnitureFacetsForArticle(string shortId,
        int sortRank, int pageViews, PopularityGroupMap popularityGroupMap,
        XDocument masterXmlDoc)
    {
        var matchingArticle = masterXmlDoc.XPathSelectElement(
            "//Art[@shortid='"+ shortId +"']");
        if (matchingArticle == null) {
            return 0;
        }    

        matchingArticle.SetAttributeValue("OmniturePopularityGroup",
            popularityGroupMap.GetGroup(sortRank));
        matchingArticle.SetAttributeValue("OmniturePageViews", pageViews);

        return 1;
    }
}

and the popularity group map looks like this:

class PopularityGroupMap
{
    private int[] _groupUpperBounds;

    public PopularityGroupMap(int totalItems)
    {
        //sets up distribution of 1:10%, 2:60%, 3:30%
        _groupUpperBounds = new[] {
            (int)(totalItems * .1),
            (int)(totalItems * .7) };
    }

    public int GetGroup(int rank)
    {
        if (rank <= _groupUpperBounds[0])
            return 1;
        return rank <= _groupUpperBounds[1] ? 2 : 3;
    }
}

After the post processing adds this data to the XML, getting to the final CXML (collection schema) is the matter of a simple XSLT.  I was planning to post the XSLT here, but it’s around 300 lines and I’m sure I would have to make it twice that long to keep lines from running off the page, so instead, I’ll reference it in bitbucket as soon as I get the solution posted.

Anyways, the final step is to run the pauthor.exe tool to create the DeepZoom images.  Pauthor is a command line tool provided by Live Labs which, among other things, let’s you create a DHTML template and bind it to items in your CXML to create your image pyramids.  I’ll talk about this at more length in my next post.

About Howard Dierking

I like technology...a lot...
This entry was posted in MSDN Magazine, Pivot. Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • Anonymous

    For the version of Pivot that I used to build the magazine collection, no, there was not a way to have hierarchical facets.  I’m not sure if this feature was added in the Silverlight 5 version – take a look at the docs and let me know. 

  • Nidhi Singh Mnit

    Sir , i want to Show Sub categories falling under a category in different way , so that my subcategories shud be clearly visible .But what can i see the approach that i follow by that the subcategry come at the same level to that of Main Category . Is there a way to do any thing using Pivot Viewer

  • discount paintings

    superb posts. Mind sharing how you get the info for blog posts