Ross Boylan wrote:
> On Sun, 2008-03-30 at 20:27 +0200, Joris Dobbelsteen wrote:
>
>> Ross Boylan wrote:
>>
>>> I have a postgres server for which du reports
>>> 1188072 /var/lib/postgresql/8.2/main
>>> on Linux system.
>>> The server has only one real database, which is for bacula. When I dump
>>> the database, it's 73Mg.
>>>
>>> This is immediately after I did a full vacuum and restarted the server.
>>>
>>> Also,
>>> bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
>>> limit 15;
>>> relname | relpages
>>> ---------------------------------+----------
>>> file_jpfid_idx | 27122
>>> file_pathid_idx | 17969
>>> file_jobid_idx | 17948
>>> file_pkey | 14580
>>> file_fp_idx | 12714
>>> file | 11558
>>> file_filenameid_idx | 9806
>>> filename | 3958
>>> filename_name_idx | 2510
>>> filename_pkey | 1367
>>> path | 966
>>> path_name_idx | 950
>>> path_pkey | 151
>>> pg_attribute_relid_attnam_index | 46
>>> pg_proc | 45
>>>
>>> It seems very strange to me that there is such a difference in size
>>> between the dump and the database: the data store is almost 15 time
>>> larger than the dump.
>>>
>>> Is this to be expected (e.g., from the indices taking up disk space)?
>>> Is there anything I can do to reclaim some disk space
>>>
>> There are a few factors you need to take into account:
>>
>> * Data storage in the database is packed into blocks and contains
>> header data. Since data needs to be put into blocks there is a
>> potential for waisting space. If you are unlucky it can become
>> nearly a single row in the worst case.
>> * You need to vacuum often, to ensure obsolete rows are removed and
>> space can be reused.
>> * Tables are not reduced in size and only grown. I thinks cluster
>> and vacuum full will reduce the size of your table.
>> * Indexes are not in the backup, they are derived from the table
>> data on a restore.
>> If you remove the indexes you are left with 150~200 MB of data (I
>> guessed).
>> Doing reindex will rebuild the index and get rid of all the bloat
>> it has been collected during use. _I recommend you try this_, as
>> your indexes on the file table look quite huge.
>>
> reindexing had a huge effect. After reindex the top tables (file,
> filename and path) I now see
> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC limit 15;
> relname | relpages
> ---------------------------------+----------
> file | 11558
> filename | 3958
> filename_name_idx | 2383
> file_jpfid_idx | 2145
> file_fp_idx | 1787
> file_jobid_idx | 1427
> file_pathid_idx | 1427
> file_pkey | 1427
> file_filenameid_idx | 1427
> filename_pkey | 1367
> path | 966
> path_name_idx | 871
> path_pkey | 151
> pg_attribute_relid_attnam_index | 46
> pg_proc | 45
> and du now reports 451M. That still seems a bit large, given the size
> of the sql dump, but it's almost 2/3 lower than it was before.
>
> Thanks so much!
>
> I guess I need to figure out how to reindex automatically.
>
Take a threshold, e.g. look which indexes are towards the table size, or
something. The bloat is mostly causes by continues updates to the
indexes on every insert, update and delete command. The index needs to
split pages that might be merged back some time later. Doing frequent
vacuums might, or might not, prevent this. Even in theory you will see
that algorithms allow trees to grow quite large up to a certain constant
factor. This is in order to have a good limit on the amount of work that
must be done on a operation on the index.
>> But the most important factor for you will be the following:
>>
>> * Backups are compressed. Since you store filenames and paths these
>> will have a very high amount of regularity and therefore are very
>> good targets for compression. This can save a huge amount of data.
>> If you take a compression factor of 50%~70% you will reach your 70 MB.
>>
> I don't see how this is relevant, since my dump file was plain text
> (sql).
>
From the top contenders, about half are indexes, so you are stuck with
~200 MB of data in the tables.
Postgresql has some wasted space due to placement of the tuples in a
block and overhead for each block and row. I don't know those values,
but they are in the range of 24 bytes per tuple, I believe. Secondly a
block is 8 KB by default and tuples cannot be stored into multiple
blocks (thats what toast to work around).
All in all: Lookup tuple sizes, if they are small than the overhead from
postgresql can be a big factor. If you are huge you loose on portions of
unoccupied space in blocks. I believe pg_statistics will provide this
information.
Another factor is representation in the SQL dump might be more efficient
than in the database, but this highly depends on your data set. For
example, a int8 takes 8 bytes in a table, while it takes between 1 and
~20 in a SQL dump.
How the plain SQL dump becomes this small I cannot explain without much
much more details.
>> Hope this helps...
>>
>
> It was a huge help.
>
Glad it was,
- Joris