Monday, March 22, 2010

Tracking Multiple Estimates with Project

MS Project is a powerful tool. However, it has certain expectations about tasks, and deviating from them can be difficult. For instance, Project presumes tasks have a single work amount and duration.

I prefer multiple estimates. I typically ask developers for both a realistic and a pessimistic estimate (i.e., "should be done by" and "could take as long as"). I also track a weighted estimate somewhere between the two, based on my confidence in the given estimates. Finally, I want a projected duration based on how long the work has taken so far (e.g., if a task is half done after a week, the projection should be two weeks).
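To make the arithmetic concrete, here is a minimal Python sketch of the weighted and projected figures. It is not taken from the template: the names `TaskEstimate`, `weighted` and `projected` are hypothetical, and the linear blend by a 0-to-1 confidence value is only one assumed way to land "somewhere between" the two estimates.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    """Hypothetical container for one task's estimates (any consistent time unit)."""
    realistic: float    # "should be done by"
    pessimistic: float  # "could take as long as"
    confidence: float   # assumed scale: 1.0 trusts the realistic figure, 0.0 the pessimistic one

    def weighted(self) -> float:
        """Assumed weighting: a linear blend of the two estimates by confidence."""
        return (self.confidence * self.realistic
                + (1.0 - self.confidence) * self.pessimistic)


def projected(elapsed: float, fraction_done: float) -> float:
    """Extrapolate the total duration from the pace observed so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be greater than 0 and at most 1")
    return elapsed / fraction_done


task = TaskEstimate(realistic=5, pessimistic=10, confidence=0.5)
weighted_estimate = task.weighted()       # halfway between 5 and 10
projection = projected(elapsed=1, fraction_done=0.5)  # half done after one week

print(weighted_estimate, projection)
```

The projection simply assumes the remaining work proceeds at the same pace as the work done so far, which matches the "half done in a week means two weeks" example.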

Since Project expects only one work/duration value, tracking progress and making predictions against multiple estimates is hard. You can create extra columns for them, but these don't integrate well with many features, such as Gantt charts. And, as far as I can tell, the projected duration isn't a built-in option.

To work around these issues, I created a Project 2010 template that lets me track work items with multiple estimates. The key advantages are:

  • Easy pessimistic/optimistic/etc. estimates, Gantt charts, and end dates.
  • Projected estimate based on how long tasks have taken so far.
  • Simple steps to track progress, whether reported as percent done or time left.

Here are links to the template and to instructions. I think it's fairly easy to use, but I'm very interested in any feedback. If you give it a try, let me know!

Edit: A commenter requested a Project 2007 version of the template, so I've put one here. However, not having actually used it myself in Project 2007, I can't guarantee how well it'll work.


Tuesday, January 26, 2010

Economics of Extraction

As I stated in my last post, managing unstructured data is increasingly crucial. A commonly cited estimate is that upwards of 80% of enterprise data is unstructured, including office documents, email, etc. The need to manage the information represented by these bits isn't just theoretical. It's painfully real. Real enough that people pay lots of money for third-party solutions to help them do this.

To give an idea of exactly how much money, here are prices for some of the top players in this roughly $2.5 billion market:

  • Autonomy IDOL Server: $220K bundled
  • SAS Enterprise Miner: $100K-$400K in 2001
  • Open Text Enterprise 2.0: $600K for 1,000 users
(Thanks to Naveen Garg, a fellow PM, for these data.)

Consider these numbers in the context of my last post. Not only is there a good conceptual argument for extraction in databases, there's also a clear customer need. If there weren't pain, vendors couldn't charge hundreds of thousands of dollars for a solution. And that's what customers are willing to pay for a solution that's not fully integrated into the database: a separate system to buy, maintain, and support.

Imagine what they'd think of true unstructured data management as a first-class database feature.


Wednesday, January 6, 2010

Databases Need Extraction

Databases are traditionally awful at managing unstructured data, such as office documents, media files, or large blocks of text. Yet users still want to store documents in them. The reason is that documents often have associated structured data already in the database. For instance, an MP3 has an embedded artist, album, and title, which are typically mirrored in a database so they can be queried. Likewise, a resume may be a Word document, but the applicant's name and contact information were probably typed into a form and stored in a table.

Managing documents and their associated structured data separately is painful. Just consider the common approach of keeping the file in the file system and storing its path in the database. What if the artist embedded in the MP3 changes? Or the file is moved, or deleted? Issues like consistency control, synchronizing backups, queries over both structured and unstructured data, and even supporting multiple systems can become a nightmare. So people start putting documents in the database.
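As a small illustration of the consistency problem, the Python sketch below stores a file's path in a SQLite table, then deletes the file behind the database's back. The `songs` table, the `find_dangling_paths` helper, and the file name are all hypothetical; the point is only that the database has no way of noticing the change unless someone goes looking.

```python
import os
import sqlite3
import tempfile


def find_dangling_paths(conn):
    """Return (id, path) rows whose file no longer exists on disk."""
    rows = conn.execute("SELECT id, path FROM songs").fetchall()
    return [(song_id, path) for song_id, path in rows if not os.path.exists(path)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (id INTEGER PRIMARY KEY, artist TEXT, path TEXT)")

with tempfile.TemporaryDirectory() as music_dir:
    # Stand-in for an MP3 kept in the file system.
    path = os.path.join(music_dir, "track01.mp3")
    with open(path, "wb") as f:
        f.write(b"placeholder bytes, not a real MP3")

    conn.execute("INSERT INTO songs (artist, path) VALUES (?, ?)", ("Some Artist", path))
    dangling_before = find_dangling_paths(conn)

    # Someone deletes the file directly; the row in the database is untouched.
    os.remove(path)
    dangling_after = find_dangling_paths(conn)

print(dangling_before)
print(dangling_after)
```

A periodic scan like this can detect the drift, but it cannot prevent it, and it says nothing about an artist tag edited inside the file itself.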

This is a call to arms for database people: if there's a compelling reason for users to store data in a database, the database should help manage it. So what's involved in managing unstructured data?

At a high level, there are two types of management tasks over unstructured data:
  1. Managing the bits: let users efficiently insert, update, and delete documents, as well as seek within them, stream them, etc. This includes other typical data services, such as backup and restore.
  2. Managing the information represented by the bits: an MP3 has an artist, beats-per-minute, and lyrics, while a resume mentions schools, companies, and skills. To fully manage such documents, users need to query this information.

Recently, databases have gotten better at managing unstructured bits. For example, SQL Server has FILESTREAM, which improves streaming performance, and Remote Blob Storage, which stores unstructured data in a dedicated file server.

However, managing the information represented by the bits requires extraction. Specifically, it requires cracking open the document, extracting or inferring interesting content, then exposing it as queryable structured data. This means extraction is not a side task; it is an integral part of managing unstructured data.

Full-text search in databases is a step in this direction, but there's a lot more work to do. When we put MP3s in a table, we should be able to query them by artist, title, or even lyrics. When we store a resume, we should be able to find related job descriptions, or join the schools it mentions to our Employee table to find old classmates who work here.
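Here is a toy Python sketch of the resume scenario, assuming SQLite and a deliberately naive regular expression as the "extractor". The table names, the `extract_schools` and `store_resume` helpers, and the sample data are all hypothetical, and real extraction would need far more than a regex. It only shows the shape of the idea: crack open the document at insert time, expose what was found as rows, and join against existing structured data.

```python
import re
import sqlite3

# Naive, assumed pattern: "University of X" or "X University" / "X College".
SCHOOL_PATTERN = re.compile(
    r"\b(?:University of [A-Z][a-z]+|[A-Z][a-z]+ (?:University|College))\b"
)


def extract_schools(text):
    """Return the distinct school names found in free text, sorted."""
    return sorted(set(SCHOOL_PATTERN.findall(text)))


def store_resume(conn, applicant, body):
    """Store the document and, alongside it, the structured data extracted from it."""
    cur = conn.execute(
        "INSERT INTO resumes (applicant, body) VALUES (?, ?)", (applicant, body)
    )
    resume_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO resume_schools (resume_id, school) VALUES (?, ?)",
        [(resume_id, school) for school in extract_schools(body)],
    )
    return resume_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resumes (id INTEGER PRIMARY KEY, applicant TEXT, body TEXT)")
conn.execute("CREATE TABLE resume_schools (resume_id INTEGER, school TEXT)")
conn.execute("CREATE TABLE employees (name TEXT, school TEXT)")
conn.executemany(
    "INSERT INTO employees (name, school) VALUES (?, ?)",
    [("Alice", "Stanford University"), ("Bob", "Reed College")],
)

resume_text = "Jane Doe. B.S., University of Washington, 2004. M.S., Stanford University, 2006."
store_resume(conn, "Jane Doe", resume_text)

# Join the extracted schools to the employee table to find old classmates.
classmates = conn.execute(
    """
    SELECT r.applicant, e.name, e.school
    FROM resumes AS r
    JOIN resume_schools AS s ON s.resume_id = r.id
    JOIN employees AS e ON e.school = s.school
    ORDER BY e.name
    """
).fetchall()

print(classmates)
```

In this sketch the application does the extraction before the insert. The argument here is that the database itself should own that step, so the extracted rows stay consistent with the document when it is updated or deleted.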

It's estimated that about 80% of the data out there is unstructured, and thus probably not in databases. If we're going to take a serious shot at managing it, then databases need extraction.