Monday, November 9, 2009

SQL PASS 09 - Semi-Structured Data in SQL Server

A couple coworkers and I just had the great opportunity to present a post-conference session at SQL PASS 2009. The session topic was "Managing Unstructured and Semi-Structured Data with SQL Server"; I own the semi-structured features in SQL Server, so I presented that part of the session.

For simplicity, think of "semi-structured data" as structured data that's not relational. To manage semi-structured data, SQL Server 2005 introduced XML, which is powerful and general (as someone once said, tongue in cheek, "XML is like violence: if it doesn't solve your problem, you're not using enough of it"). However, XML isn't ideal for certain situations, and doesn't integrate perfectly with relational databases.

Enter SQL Server 2008, which introduces some really useful features for managing semi-structured data more relationally. Briefly, if you find yourself managing the following kinds of data, take a look at these features:

  • Trees or hierarchies (e.g., product categories, org charts, conversation threads): check out hierarchyid, a new type representing nodes in a tree.
  • Properties that apply to only some items (e.g., product descriptions, custom document properties, tagging): check out the SPARSE column option, which gives you NULLs that take no space, constant-time addition of new columns, and up to 30,000 columns in a table.
  • Tables where only some rows need indexing (e.g., columns with many NULLs, "hot" rows, heterogeneous row groups): check out filtered indexes, where you specify which rows are indexed. This makes the index much smaller, yields faster look-ups, and avoid unnecessary index maintenance. Filtered indexes are cool enough that I'll make a post about just them sometime soon.

For those interested, here's a slide deck and demo T-SQL code on these features, focusing on how you can use them to solve real-world problems. If you have any feedback or suggestions — about the deck, the demo code, or the features themselves — please post a comment. For example, here are some of the feedback and takeaways I got from attendees at SQL PASS:

  • One attendee wondered how hierarchyid compares to left-right value tables (a.k.a. nested set model, or modified preorder traversal) for representing trees. Briefly, hierarchyid has the same sub-tree properties, so is equally good at descendant queries. Furthermore, it has three advantages: more compact (one column averaging 10 bytes per value can represent ~100M nodes); constant-time node insertion anywhere in the tree; and easier, more efficient level queries (e.g., (grand)children of a node, all nodes with level 3).
  • People seem to like column sets, especially how they automatically shred XML inserts and updates into the proper sparse columns.
  • People also seem to really like filtered indexes. One "wow" example was choosing which rows were worth indexing based on your workload. To illustrate, suppose table SoldAt tracks which products are sold at which stores. If only some stores have enough traffic to warrant indexing, you can create an index for just those stores with CREATE INDEX SoldAt_Key_Stores ON SoldAt(...) WHERE StoreId IN (1, 4, ...).
  • Some people mentioned that they often need to manually update filtered statistics (such as are generated automatically for filtered indexes) to ensure good plans. This is something we can look into.

I'd love to hear more, so please comment away!

Read more...