For bulk file work, files average about 8 MB, which is not that big. For backup work (tar/rsync/scp etc.), up to 600 GB per file. The biggest "transactions", however, are database work: up to 80,000,000 record traversals per process, several times a day.

Linux (ext3 and ext4), HP-UX (JFS) and AIX (JFS), plus several NFS-based processes (small files).

Data files are mostly plain CSV, or CSV file(s) inside a ZIP. I'm not doing a lot of XML. Binary, HTML and other formats occasionally. The CSV is not fixed-length, but some binary files are (though we are pushing the organizations that give us those to move to CSV/UTF-8).

Various "printer report" files (mainframe printer files, each line with a prefix for carriage control & such.

I have a small collection of utilities I use to crank through 'em. For example, for some printer report files, I have a program that accepts an Excel spreadsheet describing the layout and generates a C program to parse the report and reformat it to a fixed format for importing into Excel or a database. I also have a few programs that analyze files to help determine their contents and format.

My scripts routinely process files of around 1 GB (gzip-compressed). The content is either 28-column CSV (ca. 250 bytes per line) or 500-column fixed-width (ca. 2k bytes per record) transaction data. Both types get converted to tab-separated output plus two administrative columns and then bulk loaded into database tables, as the bulk loader does not like to talk to a fifo or pipe, unfortunately.

The content of the files is ASCII text; all packed decimals in the fixed-width files have already been decoded to numbers.
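A minimal sketch of that conversion step. The two administrative columns chosen here (a load date and a source-file ID) and the naive comma splitting are illustrative assumptions, not the actual production script; real-world CSV with quoted fields would call for Text::CSV_XS.

```perl
use strict;
use warnings;

# Convert one comma-separated line to tab-separated, prepending two
# administrative columns (hypothetical examples: load date, file ID).
sub csv_to_tsv {
    my ($line, $load_date, $file_id) = @_;
    chomp $line;
    # Naive split: assumes no embedded commas or quotes in the fields.
    # The -1 limit keeps trailing empty fields.
    my @fields = split /,/, $line, -1;
    return join("\t", $load_date, $file_id, @fields) . "\n";
}

print csv_to_tsv("a,b,c", "2024-01-31", "42");
```

For the gzip-compressed inputs, the same loop would typically read from a `gzip -dc file.gz |` pipe and write the finished file the bulk loader expects.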

1. The UniProt (= SwissProt + TrEMBL) protein info database, updated monthly. We put these data files into a database. Uniprot.org also makes this data available in XML form (same URL as below), but I find those too large to download/handle/process. The (smaller) .dat files are regular text files:

A few years back I wrote a script to scan a broken filesystem for video files by combing through the raw disk device. The RAID was 13 TB, and the individual files saved went from a couple GB to 50 or 60 GB.

Beyond that, I wrote a few disk benchmark scripts that work fine indeed... Perl moves data around more than fast enough to saturate the speediest storage.

On the basis of the replies so far (many thanks to all respondents), a file handling utility that catered for files up to 256 terabytes, and individual lines and records up to 64k, would likely cater for most people's everyday requirements?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Sounds like it would be massive overkill for "most people's everyday requirements".

Maybe, but once you go beyond 4 GB you have to start dealing with 64-bit integers, which at 16 million TB is really overkill :)

So, since I also need to keep track of the length of each record/line, I figured that using the lower 48 bits for offsets (256 TB max) and the upper 16 bits for the length (64k) means that I can manipulate 'record descriptors' which are 64 bits each.

Not only are these easily manipulated as 'integers', they are also a cache-friendly size, which might yield some performance benefits.
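The packing scheme above can be sketched like so (a toy illustration of the 48/16 split, not the actual utility's code; it assumes a 64-bit Perl build):

```perl
use strict;
use warnings;

# Pack a 48-bit file offset and a 16-bit record length into one
# 64-bit 'record descriptor': length in the top 16 bits, offset
# in the low 48 bits.
sub make_descriptor {
    my ($offset, $length) = @_;
    die "offset too large for 48 bits" if $offset >= 2**48;
    die "length too large for 16 bits" if $length >= 2**16;
    return ($length << 48) | $offset;
}

# Recover (offset, length) from a descriptor.
sub descriptor_fields {
    my ($desc) = @_;
    return ($desc & 0xFFFF_FFFF_FFFF, $desc >> 48);
}

my $d = make_descriptor(1_234_567_890_123, 4096);
my ($off, $len) = descriptor_fields($d);
```

Because Perl's bitwise operators work on unsigned 64-bit integers (outside `use integer`), the shift and mask round-trip cleanly.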

In an ideal world, the split point would be a runtime option which might allow (say) dealing with genomic stuff where individual sequences can be substantially bigger than 64k; but overall file sizes tend to be much smaller. But I cannot see an easy way to make that decision at runtime.


CSV and formatted Excel spreadsheet files of up to 5 MB, all containing insurance claims data, and each insurance company uses its own format. The files get parsed by some Perl scripts into a standard format which goes to the database.

Otherwise, a variety of small Excel spreadsheets (a few hundred rows at most), too small and too prone to change to warrant building a proper database for them: the data in these spreadsheets is used to produce insurance certificates, extracts of cover, ... thanks to Template::Toolkit and LaTeX.
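For the certificate production, the flow is roughly spreadsheet row → Template::Toolkit → LaTeX source. A bare-bones sketch, where the template text and the variable names (`insured`, `cover`) are invented for illustration:

```perl
use strict;
use warnings;
use Template;

# Fill a LaTeX fragment from a hashref of policy data.
# [% ... %] is Template::Toolkit's default tag style; the doubled
# backslashes produce a LaTeX line break in the output.
my $template = 'Insured: [% insured %] \\\\ Cover: [% cover %]';

my $tt  = Template->new();
my $out = '';
$tt->process(\$template, { insured => 'ACME Ltd', cover => 'Fire' }, \$out)
    or die $tt->error;
```

The resulting `$out` would then be written to a `.tex` file and run through LaTeX to produce the certificate.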

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

I work for a firm that does marketing for 90+ car dealerships around the US. A big part of what we do to prep for data mining is standardizing and importing data from assorted dealership systems, some archaic and some not (ADP, R&R, Advent, Arkona, Quorum, Scorekeeper, etc.), and this sometimes requires processing service files of up to 200-300 MB with hundreds of thousands of records. The theoretical maximum could be even larger. Input format might be CSV or more of a vertical text format (key-value), depending on how we're acquiring the data, but it's always text and never fixed-length. We use custom Perl scripts / MySQL for the most part, and we recently upgraded to a pretty fast server with 4 GB RAM (Cari.net; their pricing and service are pretty good, and we also had our previous server there). The OS is of course some popular Unix variant that I forget.

EDIT: We also import sales, leases, and a variety of other stuff, but the service file is just the largest part of that. I imagine the databases in uncompressed form could run upwards of 500 MB to a GB each over time.
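That "vertical" key-value text format amounts to one field per line, with blank lines separating records. A hedged sketch of parsing it into hashrefs (the colon separator, blank-line delimiters, and field names here are assumptions for illustration, not any dealership system's actual layout):

```perl
use strict;
use warnings;

# Parse "Key: value" lines into one hashref per record, where
# records are separated by blank lines (assumed layout).
sub parse_vertical {
    my ($text) = @_;
    my @records;
    for my $chunk (split /\n\s*\n/, $text) {
        my %rec;
        for my $line (split /\n/, $chunk) {
            my ($k, $v) = $line =~ /^\s*([^:]+):\s*(.*)$/ or next;
            $rec{$k} = $v;
        }
        push @records, \%rec if %rec;
    }
    return \@records;
}

my $recs = parse_vertical("VIN: 123\nMake: Ford\n\nVIN: 456\nMake: Jeep\n");
```

Once each record is a hashref, mapping a given system's field names onto the standard schema before the MySQL load is a simple per-vendor lookup table.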

Currently it's mostly little bits of nothing: <10 MB of XML, or rarely more than 50 MB of log files. In my previous job I did a lot of log-file analysis for a major ISP/web hoster, where mail and FTP server logs measured 500-800 MB from just a couple of hours on one box, and they had a couple of hundred boxes.