Dave's Code Blog: January 2005

Generating Code

In conjunction with tedia2sql, I also wrote some Python code to take those custom files that our hacked version of tedia2sql generates, and use them to generate Java data objects. In this case, the data objects have getters and setters, and 'getFooAsString()' getters, and are derived from a base class that has insert, update and delete methods to call our internal database layer (that sits on top of PostgreSQL). That's a mouthful, but the upshot is that I enter the stuff into Dia, save the diagram, run one script, and can start writing database client code.

Automating away tedious stuff is awesome. (Or put another way: it's great to be lazy.)

tedia2sql

We're up to designing our database. We're porting an existing database from Filepro (?) to PostgreSQL, and wanted to have a diagram of our ERD. After a brief bit of poking around, we found tedia2sql, a Perl program that takes a diagram out of Dia and generates DDL for you. It's Open Source, of course, so we were able to have one of our crack Perl coders (i.e. not me) modify it to generate some files used by an internal program. Just what the doctor ordered.

It's interesting, too, in that it uses Dia's class diagram widgets, (which are more functional than its database widgets). But the docs are quite clear, and it was very easy to put into Dia the objects that tedia2sql needed, and the PostgreSQL DDL worked just fine. (tedia2sql also supports MySQL, Sybase, Oracle, DB2, and InnoDB.) Another nice thing tedia2sql does is automatically generate join tables, if you specify a n:n relationship between two tables. Very helpful.

PostgreSQL Docs

One other PostgreSQL tidbit: I'm pretty impressed with the docs. They're fairly thorough, fairly readable, and aimed at someone without lots of pg experience. (In fact, they're aimed at someone without much db experience.) Good job, folks.

PostgreSQL Performance

Today I started poking around in PostgreSQL, doing some performance testing. I loaded it up with a million rows in one table, and a half-million in another, and started doing some selects to see how fast it would run. And played around with 'explain' to see what pg thinks it should do.

The first interesting thing I saw is that at some point, it'll switch from doing an index scan to do a sequential scan (i.e. a full table scan). In my case, I'm simulating 5 years worth of data, and doing a query across a date range, and seeing results come back in several seconds. 'explain' shows that if the query is over more than a week or so, it'll do a full table scan rather than use the index.

So I look at my data, and realize that I'm generating the dates randomly. In reality, the dates will be in physical order on the disk in roughly date order (i.e. written as they're created). So I redid my data, reran the explain, and now even a full year range uses the index. Neat stuff.

Digging deeper, I find that 'analyze' generates statistics on each field in each table, and put them in the pg_stats table. One of the columns is called 'correlation', which indicates if like data can be expected to be found close together or not. Changing the load order on the table changed the correlation value from almost 0 to almost 1, meaning the pg figured if it needed rows with similar values from that table, it could expect to find them close together on disk, meaning that an index scan would deal with data that in most cases would already be in memory.

So when my dates were random, correlation was 0, meaning that if (for example) the table had 15000 pages on disk, and I wanted 15000 rows (which pg could also estimate), then it would be reading all of the pages anyway, and would choose a sequential scan. When correlation is 1, 15000 rows might live in 500 pages, meaning reading the index and those 500 pages would be much quicker than a table scan, so the index scan would be chosen. (Hope this makes sense.)

Google Likes Me

As part of my khubd thing here, I googled for khubd, found 19,800 pages, with my recent blog entry at the top! I wondered if Google put it at the top knowing that it was mine, but I searched from a coworker's machine, and got me at the top as well. Now, this blog has very few readers at present (at least, judging by what I see on bloglines.com), so I'm wondering if I'm getting some kind of karma from being on blogspot, or something. Anyway, I find this quite curious.

khubd, Round 2

Plugged in MP3 player, khubd hung again. Googled for a while, and found nothing that jumped out at me to try. A couple of people reported having this problem, and confirmed its association with plugging in a USB device.

Dave's Code Blog

Wednesday, January 26, 2005

Generating Code

tedia2sql

Friday, January 07, 2005

PostgreSQL Docs

PostgreSQL Performance

Wednesday, January 05, 2005

Google Likes Me

khubd, Round 2

About Me

Previous Posts

Archives