Kevin Hegg: Are we going to see major changes in data management?

Row-oriented Relational Database Management Systems (R-RDBMS) have grown in popularity since the early 1980s to the point where the overwhelming majority of data management by the early 2000s was handled by RDBMS's. There have always been alternatives to R-RDBMS's, but until recently none of the alternatives has provided sustainable technical advantages or gained significant market share. So, what's different now, and is it going to impact the life of a software developer?

Early in my career I had the opportunity to program IBM's SQL/DS and shortly after that Oracle (2.0 or 3.0, I forget). Relational database development was simple for me. In late 1986 Sybase released SQL Server 1.0. I was assigned to a project that was going to use it, but since the project had large-scale data and significant performance requirements I needed special training. I spent a couple of weeks at Sybase learning SQL Server internals and there was no going back. Since then I have kept up with Sybase, Microsoft SQL Server, and Oracle database internals and have spent a lot of time designing and tuning databases. At this point in my career I feel that I can squeeze every drop of performance out of any R-RDBMS. Now that we have established that I am an experienced database guy, let's resume the discussion. :-)

As R-RDBMS's matured, vendors began throwing everything into a single product. Michael Stonebraker discusses the "one size fits all" approach and the problems with the approach here and here. He lays the foundation for when and why R-RDBMS's start to fall apart. In my personal experience building applications to process sensor data and financial data feeds, it took a lot of expertise, effort, and cost to tune commercial R-RDBMS's to meet the performance requirements.

With the rapidly decreasing cost and increasing capacity of CPU, RAM, and disk storage, the alternatives to R-RDBMS have become much more attractive. Cheaper hardware has also increased the appetite to process and store substantially more data, and this is now testing the limits of R-RDBMS technology. As solutions scale into the petabyte range, many of the traditional data modeling techniques like Ralph Kimball's dimensional modeling are less successful. The same can be said for indexing (bit-map) and physical partitioning schemes. What worked for 10 GB - 10 TB doesn't work nearly as well for 10 TB - 10+ PB. Also, the problems become more severe as your processing requirements approach real-time.

Over the last couple of years the number of columnar storage solutions has increased noticeably, led by Google's BigTable, Sybase IQ, Stonebraker's research with C-Store, etc. Much of the benefit from columnar storage comes from the ability to compress data and to substantially reduce the I/O's needed to complete a query. Google decided not to use SQL for a query language while other columnar solutions stuck with SQL, but what this showed is that SQL is also reaching its limits.
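To make the compression and I/O point concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any of the products above are implemented: the table, the column names, and every function name are hypothetical, and "values touched" is only a rough stand-in for disk I/O.

```python
# Hypothetical toy table: (id, state, amount). Not taken from any real system.
ROWS = [
    (1, "VA", 10.0),
    (2, "VA", 12.5),
    (3, "VA", 7.5),
    (4, "MD", 20.0),
    (5, "MD", 5.0),
    (6, "DC", 1.0),
]
NAMES = ("id", "state", "amount")


def to_columns(rows, names):
    """Pivot row tuples into one list per column (a column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


def values_touched_row_store(rows):
    """Assumption: a row store reads whole rows, even if one column is needed."""
    return sum(len(row) for row in rows)


def values_touched_column_store(columns, needed):
    """Assumption: a column store reads only the columns the query names."""
    return sum(len(columns[name]) for name in needed)


COLUMNS = to_columns(ROWS, NAMES)

# Equivalent of "SELECT SUM(amount) FROM t" against the column layout.
total_amount = sum(COLUMNS["amount"])

row_cost = values_touched_row_store(ROWS)
column_cost = values_touched_column_store(COLUMNS, ["amount"])

# Values of the same column sit together, so repeats compress well.
state_rle = run_length_encode(COLUMNS["state"])

if __name__ == "__main__":
    print("sum(amount):", total_amount)
    print("values touched, row layout:", row_cost)
    print("values touched, column layout:", column_cost)
    print("run-length encoded state column:", state_rle)
```

In this toy the aggregate touches a third of the values in the column layout, and the six state values shrink to three run-length pairs. Real engines use far more sophisticated encodings and storage formats, but the shape of the win is the same.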

Next, Stonebraker published his research on H-Store. He proposes that pure OLTP applications can see dramatic improvements in performance from massive simplification of the database engine and performing in-memory, distributed processing of data. Also, he proposes to do away with SQL as the query language. Werner Vogels, who is well-respected in the distributed systems community, expresses some scepticism. While he likes Stonebraker's challenge to provide 50x improvements, he is worried that Stonebraker is only solving the scale-up problem when instead he should be focused on the scale-out problem, similar to what Google and Amazon do. That was my concern initially, but the more I think about it, the less I believe that H-Store is necessarily unable to scale out. I think it is just a matter of time. Now that there is a working implementation of H-Store, they can focus on scaling out.

This led into a (contentious, if you read the blog comments) debate between Curt Monash and Stonebraker about how many different types of databases should be supported. The exact categorizations don't matter so much. What does matter is that the R-RDBMS world is "hitting the wall" with increasing regularity with the "one size fits all" solution and that is driving the database market to come up with a variety of alternate solutions for each category of data management. I don't believe that R-RDBMS's offer a good solution for managing XML, semi-structured, and unstructured data, especially as the amount of data increases and the processing requirements approach real-time. Also, I don't believe that SQL is the correct language for processing this data.

I disagree with some of Stonebraker's comments though. He thinks the low-end OLTP market will go almost entirely to open-source databases. I don't believe it is that simple. First, brand loyalty should not be underestimated. The cost difference between low-end commercial and open-source R-RDBMS's isn't significant enough to drive people in one direction or another. Second, the cost and complexity of swapping out R-RDBMS's for legacy systems far outweighs the license cost savings. Third, I think that solutions like Amazon's SimpleDB and Microsoft's SQL Server Data Services will be a more attractive option for the low-end than open-source. Not only do they eliminate the software license fees, but they eliminate the hardware and system/database administration costs. Head-count reduction is far more important to many organizations than software license reduction.

I agree with Stonebraker that the current R-RDBMS vendors are at risk of getting caught in the middle as we undergo change in the data management market. Also, I agree that the R-RDBMS's are getting too complex and this complexity is unnecessary. Finally, there is one conclusion that I would like to add. If solutions like H-Store are able to eliminate transactions, concurrency management, and other complexities then the benefits will be so great that they will quickly permeate into the mid-range solutions.

How is the developer's life going to change? First, they will potentially have to unlearn a lot of relational database design and programming habits. For an H-Store type of solution, this could result in a substantial reduction in the amount and complexity of the code. It will bring us a lot closer to being able to automatically generate the data access layer from a data model since much of the programmer intervention is due to transaction and concurrency management. Second, if we find a suitable replacement for SQL then we can potentially eliminate another big pain, Object/Relational Mapping. Am I the only one who thinks that every O/RM tool completely sucks? I know what problem they are trying to solve. I just don't think they are solving the problem. Yes, I can build a working application with them, but I feel so dirty afterwards. I feel like I am just trading one problem for another. Third, in the area of columnar storage and non-relational data solutions I think we will have to be prepared for more developer effort in the short-term. The Google-imitators are just now learning that BigTable and MapReduce types of solutions are no free ride. The lack of tool support and best practices is something that will be fixed over time, but in the short-term it will be an issue.

When are the major changes going to happen? They already are happening. When is it going to impact the average customer or developer? I am not smart enough to accurately predict this, but I think it is close enough that I am paying attention. If and when changes start to happen, I want to be ready.

1 Comments:

At 9:18 PM, Anonymous said...: Great post, Kev. I'm not really up on the latest in the database world but feel like I learned a few things by reading your post. Looks like I've got some catching up to do, so thanks.

And yeah, I, too, feel dirty after using an ORM tool. I've played around with a number of them but never used one in a production app because none of them feel right to me. I feel like Goldilocks at the dinner table but none of the porridges hit my sweet spot.

Friday, March 14, 2008
