Finding directory sizes

I want to archive directories to CD. I have many of them in
various places, I don't care if one from /data/web is on the
same CD as one from /home as long as the specified directory is
not split any further.

The important point is that there are things I need to exclude, such as
log files.

I'm currently getting the size by running du through a piped open and
munging the result. Is there a better way?

I did think of using stat to add up every file, but if I'm talking
a few hundred per directory, is that wise? And how would I exclude
files, considering that each main directory set has more than one file
pattern to exclude? (this has 2, others have 3 or 4)


On Mon, 23 Aug 2004 06:48:43 GMT, Zebee Johnstone said:
>...
>open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";
>
>I did think of using stat to add up every file, but if I'm talking
>a few hundred per directory, is that wise? And how would I exclude
>files, considering that each main directory set has more than one file
>pattern to exclude? (this has 2, others have 3 or 4)

I'd suggest using File::Find with an appropriate callback sub. It's in
the standard distribution, and the docs have a few recipes.


In comp.lang.perl.misc on 23 Aug 2004 07:07:30 GMT
Damian James <> wrote:
> On Mon, 23 Aug 2004 06:48:43 GMT, Zebee Johnstone said:
>>...
>>open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";
>>
>>I did think of using stat to add up every file, but if I'm talking
>>a few hundred per directory, is that wise? And how would I exclude
>>files, considering that each main directory set has more than one file
>>pattern to exclude? (this has 2, others have 3 or 4)
>
> I'd suggest using File::Find with an appropriate callback sub. It's in
> the standard distribution, and the docs have a few recipes.

I'm not sure what you mean by "appropriate callback sub".

Do you mean use File::Find recursively to run stat on every file?

As far as I know, if you do that, you can't pass parameters to
the sub that's processing the files, so suddenly everything's global?

And as I say, is running stat on every file in dirs that have hundreds
of files the right way to go? And how do I exclude ones I don't want?
I know the patterns I want to exclude; how do I pass those to the
File::Find subroutine?

Zebee Johnstone wrote:
> In comp.lang.perl.misc on 23 Aug 2004 07:07:30 GMT
> Damian James <> wrote:
>
>>On Mon, 23 Aug 2004 06:48:43 GMT, Zebee Johnstone said:
>>
>>>...
>>>open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";
>>>
>>>I did think of using stat to add up every file, but if I'm talking
>>>a few hundred per directory, is that wise? And how would I exclude
>>>files, considering that each main directory set has more than one file
>>>pattern to exclude? (this has 2, others have 3 or 4)
>>
>>I'd suggest using File::Find with an appropriate callback sub. It's in
>>the standard distribution, and the docs have a few recipes.
>
>
> I'm not sure what you mean by "appropriate callback sub".
>
> Do you mean use File::Find recursively to run stat on every file?
>
> As far as I know, if you do that, you can't pass parameters to
> the sub that's processing the files, so suddenly everything's global?

Do not have an irrational fear of using package variables and local().
(Have only a rational fear.) Some time ago someone motivated by
irrational fear actually modified File::Find itself not to use package
variables for its globals but instead to use file-scoped
lexicals (still global in the programming sense). Because local()
doesn't work on lexicals, this person just unthinkingly removed all the
local()s. In so doing they, of course, broke the re-entrancy of File::Find.

However, that said, you only need to use global variables (meaning
file-scoped lexicals or package-scoped variables) if you want the
callback to be a named subroutine. If you use an anonymous subroutine
then it acts as a closure, meaning it can see lexically scoped variables
that were in scope where the anonymous sub was defined.

sub do_find {
    my $foo = 'something';
    my $wanted = sub {
        # do stuff with $foo
    };
    find($wanted, '/foo', '/bar');
}
> and as I say, is running stat on every file in dirs that have hundreds
> of files the right way to go?

Well obviously you have to do this in some way - but on Win32 IIRC the
implementation of stat() is (was?) pathological. If speed is of the
essence on Win32 then spawn a native windows recursive directory lister
and parse the output.
> and how to exclude ones you don't want?

return if ....

Or to exclude whole directories

$File::Find::prune = 1 if ...
> I know the patterns I want to exclude, how do I pass those to the
> File::Find subroutine?
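Putting those pieces together, here is a minimal sketch of the closure
approach: the exclusion patterns are made up, and dir_sizes() is just an
illustrative wrapper, not anything from File::Find itself. Because the
callback is an anonymous sub, it can see @exclude and %size without any
package globals.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Per-directory byte totals, skipping files that match any pattern.
sub dir_sizes {
    my ($root, @exclude) = @_;
    my %size;
    find(sub {
        my $name = $_;
        return if grep { $name =~ $_ } @exclude;     # skip excluded files
        $size{$File::Find::dir} += -s _ if -f $name; # -f fills the stat cache
    }, $root);
    return %size;
}

my %size = dir_sizes('.', qr/^access_log/, qr/^error_log/);
print "$_: $size{$_} bytes\n" for sort keys %size;
```

Each directory set can then pass its own list of two, three, or four
patterns as extra arguments.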

Zebee Johnstone <> wrote in message news:<>...
> I want to archive directories to CD. I have many of them in
> various places, I don't care if one from /data/web is on the
> same CD as one from /home as long as the specified directory is
> not split any further.
>
> The important point is that there are things I need to exclude, such as
> log files.
>
> I'm currently getting the size by using du in an open, and munging
> the result, is there a better way?
>

Unless you can demonstrate through benchmarking that this is a faster
approach than another such as using 'stat', I don't see why you need
to open a filehandle connection to read a file when you are simply
interested in the file's name and size.

Zebee Johnstone wrote:
> Damian James <> wrote:
>> I'd suggest using File::Find with an appropriate callback sub. It's
>> in the standard distribution, and the docs have a few recipes.
>
> I'm not sure what you mean by 'appropriate callback sub".

The "wanted()" function that _you_ need to provide such that File::Find
knows what to do with each file.
> Do you mean use File::Find recursively

No need for that. That is the beauty of File::Find: it will recurse
automatically without _you_ doing all the leg work.
> to run stat on every file?

Try "-s" instead.
> and as I say, is running stat on every file in dirs that have
> hundreds of files the right way to go? and how to exclude ones you
> don't want? I know the patterns I want to exclude, how do I pass
> those to the File::Find subroutine?

Did you look at the documentation and examples that come with File::Find?

It matches the option of the same name in /usr/bin/find. See the
man page for 'find'. (A bit of history: the perl script find2perl
accepts the same command line arguments as /usr/bin/find, and
outputs a perl script to implement that command.)

File::Find calls the 'wanted' function for everything it comes across.
After your wanted() function returns, if the thing being looked at
is a directory, File::Find will process that directory recursively
unless $prune is set. Setting $prune while looking at a plain file
does nothing. Setting $prune while looking at a directory says to
pretend that the directory is empty.
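A small sketch of that $prune behaviour; the 'CVS' directory name is
only an example, and total_size() is a made-up helper:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Total the bytes under $root, skipping any directory named $skip.
sub total_size {
    my ($root, $skip) = @_;
    my $total = 0;
    find(sub {
        if (-d $_ and $_ eq $skip) {
            $File::Find::prune = 1;   # pretend this directory is empty
            return;
        }
        $total += -s $_ if -f $_;
    }, $root);
    return $total;
}

print total_size('.', 'CVS'), " bytes\n";
```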
-Joe

In comp.lang.perl.misc on 23 Aug 2004 04:46:12 -0700
Jim Keenan <> wrote:
>
> Unless you can demonstrate through benchmarking that this is a faster
> approach than another such as using 'stat', I don't see why you need
> to open a filehandle connection to read a file when you are simply
> interested in the file's name and size.

I'm not opening du for each file, only for directories.

If I use File::Find, I have to go through every single file, then later
work out how to decide which directories to keep together and which to
split.

Using du on directories I can go:
start at root.
Check all directories one level below root, get their size.
If one of them is too big to fit on a CD, then go down one level and
do it again, recursing if necessary (though it usually isn't needed).

This gives me the smallest number of directories to then fit on CD.
It won't be the most efficient use of CD space, but then the efficient
use of human time to find things and get them back is more important
than a few meg here and there.
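That top-down procedure could be sketched roughly like this. The 650 MB
capacity and the plan() helper are assumptions, and it shells out to du
just as the open() in the original post does:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $CD_KB = 650 * 1024;   # assumed CD capacity, in KB

# Walk down from $dir: any subdirectory that fits goes on the list;
# any that is too big gets split one level further.
sub plan {
    my ($dir, $plan) = @_;
    opendir my $dh, $dir or die "can't opendir $dir: $!\n";
    for my $sub (grep { !/^\.\.?$/ && -d "$dir/$_" } readdir $dh) {
        my ($kb) = split ' ', `du -sk "$dir/$sub"`;
        if ($kb <= $CD_KB) {
            push @$plan, ["$dir/$sub", $kb];   # fits on one CD
        } else {
            plan("$dir/$sub", $plan);          # too big: split further
        }
    }
    return $plan;
}

print "$$_[0]: $$_[1] KB\n" for @{ plan('.', []) };
```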

If I use File::Find to look at every single file, then I have to do some
kind of later munging to work out that directory split so as to keep as
much as possible of the directories below root together.

So root might have /web /home /other and /web might have 15 sites all
smaller than a CD (some quite small, some quite large), plus there's
/web/web2 which has a similar mix of sites below it, but /web/web2
itself is larger than a CD. Meanwhile /home has at least one
directory below it that is too big to fit on a CD, so it has to be
split, and the directories below do too.

But I don't know in advance which will have to be split and which won't.
If /web/web2/website1 is big enough to take up a CD on its own, I don't
want to split it.

Yes, if I have to recurse, then I have to re-do du on that directory,
so if there's a reasonable way to record the info for each file only
once and then do the splitting, that might be better.

Zebee Johnstone <> wrote:
> yes, if I have to recurse, then I have to re-do du on that directory,
> so if there's a reasonable way to record the info for each file only
> once and then do the splitting that might be better.

When you use 'du -s', you're doing all the work, but throwing away the
intermediate information. You could just keep the entire output of
'du', then parse the directories you need.
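For instance, something like this; du_sizes() is a hypothetical helper,
and it assumes GNU du's tab-separated "size<TAB>path" output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run du once over the whole tree, keep every line, and answer any
# later per-directory size query from the hash.
sub du_sizes {
    my ($root) = @_;
    my %kb;
    for (`du -k "$root"`) {
        chomp;
        my ($size, $dir) = split /\t/, $_, 2;
        $kb{$dir} = $size;
    }
    return %kb;
}

my %kb = du_sizes('.');
print "$_: $kb{$_} KB\n" for sort keys %kb;
```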

Alternatively, you could create a hash in memory that reflected the
filesystem, along with the sizes of every subdirectory. Just form the
subtree as you're parsing.

Then you can check the size of the top-level. If it's too big, just
look for all the sub-components and check their size...

(Using straight filenames instead of a hierarchical hash would have been
easier in some ways, but if the directory was oversized, it would have
been much harder to find all the subdirectories.)

#!/usr/bin/perl -w
use strict;
use File::Find;
use File::Spec;

my $size_key = "SIZE%:/_-SIZE"; # just a string that is unlikely
                                # to match the name of a subdirectory
my %tree;                       # nested hash mirroring the directory tree
find(sub {
    return unless -f $_;
    my $node = \%tree;
    $node = $node->{$_} ||= {} for File::Spec->splitdir($File::Find::dir);
    $node->{$size_key} += -s _; # accumulate bytes per directory
}, shift || '.');

A method like this is a big win whenever the cost of the treewalk and
the stats are large. If it takes 2 minutes to do the top-level 'du',
and you might have to repeat it a few times to get what you want, the
extra overhead in the program to do it once is worth it.
