Wednesday, January 6, 2010

Databases Need Extraction

Databases are traditionally awful at managing unstructured data, such as office documents, media files, or large blocks of text. Yet users still want to store documents in them. The reason is that documents often have associated structured data already in the database. For instance, an MP3 has an embedded artist, album, and title, which are typically mirrored in a database so they can be queried. Likewise, a resume may be a Word document, but the applicant's name and contact information were probably typed into a form and stored in a table.

Managing documents and their associated structured data separately is painful. Just consider the common approach of keeping the file in the file system and storing its path in the database. What if the artist embedded in the MP3 changes? Or the file is moved, or deleted? Issues like consistency control, synchronizing backups, queries over both structured and unstructured data, and even supporting multiple systems can become a nightmare. So people start putting documents in the database.

This is a call to arms for database people: if users have a compelling reason to store data in a database, the database should help manage it. So what's involved in managing unstructured data?

At a high level, there are two types of management tasks over unstructured data:
  1. Managing the bits: let users efficiently insert, update, and delete documents, as well as seek within them, stream them, etc. This includes other typical data services, such as backup and restore.
  2. Managing the information represented by the bits: an MP3 has an artist, beats-per-minute, and lyrics, while a resume mentions schools, companies, and skills. To fully manage such documents, users need to query this information.

Recently, databases have gotten better at managing unstructured bits. For example, SQL Server has FILESTREAM, which improves streaming performance, and Remote Blob Storage, which stores unstructured data on a dedicated file server.

However, managing the information represented by the bits requires extraction. Specifically, it requires cracking open the document, extracting or inferring interesting content, then exposing it as queryable structured data. This means extraction is not a side task; it is an integral part of managing unstructured data.
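To make the three steps concrete (crack open, extract, expose), here is a minimal sketch in Python using the standard sqlite3 module. It assumes the MP3 carries a legacy ID3v1 tag, which is a fixed 128-byte trailer; real files often use ID3v2 instead, which this sketch ignores. The table name, the helper functions, and the sample song are all made up for illustration, and the "MP3" is synthetic bytes rather than real audio.

```python
import sqlite3

def make_id3v1_tag(title, artist, album, year="2010"):
    """Build a 128-byte ID3v1 trailer (synthetic test data, not a real MP3)."""
    def pad(text, size):
        return text.encode("latin-1")[:size].ljust(size, b"\x00")
    # Layout: "TAG" + title(30) + artist(30) + album(30) + year(4) + comment(30) + genre(1)
    return (b"TAG" + pad(title, 30) + pad(artist, 30) + pad(album, 30)
            + pad(year, 4) + pad("", 30) + b"\xff")

def extract_id3v1(blob):
    """Crack open the blob; return title/artist/album, or None if there is no ID3v1 tag."""
    tag = blob[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None
    def text(start, size):
        raw = tag[start:start + size].split(b"\x00", 1)[0]
        return raw.decode("latin-1").strip()
    return {"title": text(3, 30), "artist": text(33, 30), "album": text(63, 30)}

def insert_song(conn, blob):
    """Store the bits and, alongside them, whatever structured data extraction finds."""
    meta = extract_id3v1(blob) or {"title": None, "artist": None, "album": None}
    conn.execute(
        "INSERT INTO songs (content, title, artist, album) VALUES (?, ?, ?, ?)",
        (blob, meta["title"], meta["artist"], meta["album"]),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (id INTEGER PRIMARY KEY, content BLOB, "
    "title TEXT, artist TEXT, album TEXT)"
)

fake_audio = b"\x00" * 64  # placeholder bytes standing in for audio frames
insert_song(conn, fake_audio + make_id3v1_tag("Example Song", "Example Artist", "Example Album"))
insert_song(conn, fake_audio)  # a blob with no tag still gets stored, just without metadata

# The extracted fields are now ordinary queryable columns.
rows = conn.execute(
    "SELECT title FROM songs WHERE artist = ?", ("Example Artist",)
).fetchall()
```

Here extraction runs in application code at insert time, so the columns can drift if the blob is later updated by some other path. That drift is exactly the argument for having the database itself own the extraction step.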

Full-text search in databases is a step in this direction, but there's a lot more work to do. When we put MP3s in a table, we should be able to query them by artist, title, or even lyrics. When we store a resume, we should be able to find related job descriptions, or join the schools it mentions to our Employee table to find old classmates who work here.
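As a rough illustration of the resume case, the sketch below extracts school names from resume text and joins them to an employee table, again with Python and sqlite3. Everything in it is an assumption for the sake of the example: the table layout, the names, and especially the extractor, which is just a case-insensitive match against a small hypothetical list of schools. A real extractor would need entity recognition and name normalization.

```python
import sqlite3

# Hypothetical dictionary of school names; stands in for a real entity extractor.
KNOWN_SCHOOLS = ["Example State University", "Sample Institute of Technology"]

def extract_schools(text):
    """Return the known schools mentioned in the text (naive substring match)."""
    lowered = text.lower()
    return [school for school in KNOWN_SCHOOLS if school.lower() in lowered]

def store_resume(conn, applicant, body):
    """Store the resume text and the schools extracted from it."""
    cur = conn.execute(
        "INSERT INTO resume (applicant, body) VALUES (?, ?)", (applicant, body)
    )
    resume_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO resume_school (resume_id, school) VALUES (?, ?)",
        [(resume_id, school) for school in extract_schools(body)],
    )
    return resume_id

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (name TEXT, school TEXT);
CREATE TABLE resume (id INTEGER PRIMARY KEY, applicant TEXT, body TEXT);
CREATE TABLE resume_school (resume_id INTEGER, school TEXT);
""")
conn.executemany(
    "INSERT INTO employee (name, school) VALUES (?, ?)",
    [("Alice", "Example State University"), ("Bob", "Other College")],
)

store_resume(conn, "Carol", "B.S. in Computer Science, Example State University, 2004.")

# Join the extracted schools to the employee table to find potential classmates.
classmates = conn.execute("""
    SELECT r.applicant, e.name, e.school
    FROM resume r
    JOIN resume_school rs ON rs.resume_id = r.id
    JOIN employee e ON e.school = rs.school
    ORDER BY e.name
""").fetchall()
```

Once the extracted schools sit in a table, the classmate lookup is a plain join. The hard part is the extraction, not the query.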

It's estimated that about 80% of the data out there is unstructured, and thus probably not in databases. If we're going to take a serious shot at managing it, then databases need extraction.
