06 June 2007

Memorable SQL statistics

I was casually reading the PostgreSQL FAQ, looking at neater ways of extracting DATABASE & TABLE definitions than multiply JOINed SELECTs across pg_* system tables, when I ran across this little snippet:
To uniquely number rows in user tables, it is best to use SERIAL rather than OIDs because SERIAL sequences are unique only within a single table. and are therefore less likely to overflow. SERIAL8 is available for storing eight-byte sequence values.
OK, so here I am hypothetically wanting to stick more than 4 billion objects into my databases.

Firstly, let’s think about such a table, with lines of text packed into it, an INTEGER (well, SERIAL) & a VARCHAR for each row — averaging, say, 60 characters per row. That’s 4 billion times 69 bytes, or roughly 280GB in one TABLE. A ‘SELECT * FROM’ would take a fair while.

Well... that’s OK, I can buy a disk that large, & there’s those SERIAL8s to play with, to give me 16 billion billion items in one TABLE.

Or to put it another way, an example TABLE as above resting in a mere million terabytes or so. So if we add 1,000 records a second to this TABLE, we’re looking at many, many lifetimes (at a mere 86 million records a week, 31 billion a year, so roughly 8 million years) just waiting for it to load.

Hmmm. “Sorry, do you have disks with a 16 megayear MTBF?” Ducking, running very fast.

Good thing PostgreSQL limits itself to 32TB, no? (-:

...er, & that I won't be hitting the 400GB-a-row limit, either. The concept of two rows consuming the largest hard disk I can buy is a bit staggering. Backups will not be fun.

I should mention that MySQL looks a lot easier to get these basic stats out of; SHOW DATABASES, SHOW [FULL] TABLES, SHOW INDEX FROM etc. Must try that next. (-:

No comments: