The next wave in BI

Topic: TPC-H fun with Greenplum

Date: 07/06/2010

By: Peter Guarino

Subject: Not sure what you are trying to prove

Are you evaluating the performance of free databases for hosting large databases, or just databases that run only on a single node? Wouldn't it be more instructive to compare Greenplum SNE against other free databases? Otherwise, shouldn't you be comparing Greenplum's full version against similarly priced implementations from other vendors, single node or not? The real advantage of Greenplum, and of other databases like it, is its superior cost-performance ratio and its ability to scale inexpensively. OK, so you might have to run Greenplum on two or three servers to get the same performance as Sybase IQ on a single node. If it's cheaper, so what?

A database that largely fits into main memory and resides completely on SSD storage is really just a toy example and of no particular value. Worse yet, making general inferences from a single data point leads to flawed conclusions. Better conclusions could be drawn by varying the data set sizes, to factor out accidental advantages that one engine might have over another at one particular size. For example:
* include large data sets that greatly exceed main memory, by 10-100x
* include examples that stress I/O, CPU and memory in various ways
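
As a rough illustration of the sizing point above, the sketch below picks TPC-H scale factors that put the raw data at 10x, 30x and 100x of a server's main memory. It assumes that scale factor 1 is roughly 1 GB of raw data and that only the commonly published scale factors are used. The function name, the default multiples and the 16 GB example are hypothetical, chosen only for illustration.

```python
# Commonly published TPC-H scale factors (assumption: SF 1 is about 1 GB raw data).
STANDARD_SCALE_FACTORS = [1, 10, 30, 100, 300, 1000, 3000, 10000]


def scale_factors_for(ram_gb, multiples=(10, 30, 100)):
    """Return standard scale factors whose raw size reaches ram_gb * multiple.

    Hypothetical helper: for each multiple, pick the smallest standard scale
    factor that is at least as large as the target size, skipping duplicates.
    """
    chosen = []
    for multiple in multiples:
        target_gb = ram_gb * multiple
        candidates = [sf for sf in STANDARD_SCALE_FACTORS if sf >= target_gb]
        if not candidates:
            raise ValueError(f"no standard scale factor reaches {target_gb} GB")
        if candidates[0] not in chosen:
            chosen.append(candidates[0])
    return chosen


if __name__ == "__main__":
    # Example: a server with 16 GB of main memory.
    print(scale_factors_for(16))
```

Running every engine at each of the returned sizes gives several data points instead of one, which is the minimum needed before drawing any general conclusion.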

It is important to know enough about the databases involved to avoid badly tuned engines that would skew the results. In particular, the "out of memory" errors you encountered most likely pertain to Postgres' work memory (work_mem), not main memory or swap. It is a tuning parameter that leads to particularly bad performance if improperly set. Bad query plans indicate that you did not run ANALYZE on the tables after loading them. That is a must for Greenplum, as it does not run autovacuum the way the Postgres 8.3 and 8.4 engines do.
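
A minimal sketch of the post-load steps described above, assuming the standard TPC-H table names and the Postgres-style work_mem setting. The helper name and the 256MB value are placeholders, not recommendations, and the right value depends on the installation. The script only prints the statements, so they can be reviewed before being fed to psql or a similar client.

```python
# The eight tables of the TPC-H schema.
TPCH_TABLES = [
    "region", "nation", "supplier", "customer",
    "part", "partsupp", "orders", "lineitem",
]


def post_load_statements(tables=TPCH_TABLES, work_mem="256MB"):
    """Build the SQL to run after a bulk load (hypothetical helper).

    ANALYZE refreshes the planner statistics for each table; the final
    statement sets the per-operation work memory for the session.
    The work_mem value here is an arbitrary placeholder.
    """
    statements = [f"ANALYZE {table};" for table in tables]
    statements.append(f"SET work_mem = '{work_mem}';")
    return statements


if __name__ == "__main__":
    print("\n".join(post_load_statements()))
```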