Today I started poking around in
PostgreSQL, doing some performance testing. I loaded it up with a million rows in one table, and a half-million in another, and started doing some selects to see how fast it would run. And played around with 'explain' to see what pg thinks it should do.
The first interesting thing I saw is that at some point, it'll switch from doing an index scan to do a sequential scan (i.e. a full table scan). In my case, I'm simulating 5 years worth of data, and doing a query across a date range, and seeing results come back in several seconds. 'explain' shows that if the query is over more than a week or so, it'll do a full table scan rather than use the index.
So I look at my data, and realize that I'm generating the dates randomly. In reality, the dates will be in physical order on the disk in roughly date order (i.e. written as they're created). So I redid my data, reran the explain, and now even a full year range uses the index. Neat stuff.
Digging deeper, I find that 'analyze' generates statistics on each field in each table, and put them in the pg_stats table. One of the columns is called 'correlation', which indicates if like data can be expected to be found close together or not. Changing the load order on the table changed the correlation value from almost 0 to almost 1, meaning the pg figured if it needed rows with similar values from that table, it could expect to find them close together on disk, meaning that an index scan would deal with data that in most cases would already be in memory.
So when my dates were random, correlation was 0, meaning that if (for example) the table had 15000 pages on disk, and I wanted 15000 rows (which pg could also estimate), then it would be reading all of the pages anyway, and would choose a sequential scan. When correlation is 1, 15000 rows might live in 500 pages, meaning reading the index and those 500 pages would be much quicker than a table scan, so the index scan would be chosen. (Hope this makes sense.)