Here's a question I've been mulling over. Are pack files worth using in a game? I've looked into using PhysicsFS, but I don't much like its global state. I don't need any compression in my pack files, as just about everything will already be compressed (PNG for images, Vorbis for audio, VP8 for video, protobuf binary objects for units/objects/maps, etc.), and I'm more concerned about read/write times. If I used a pack file, I'd just need a format with random access (like zip). Here are the pros and cons of an uncompressed pack file as I see them:

Pros

(Slightly) harder for users to muck with

It's only one file (it's kinda nice having things grouped in one file)

Cons

(Slightly) harder for me to work with

Increased save times when modifying the file (which the game won't do, but I will in my editor, so it's a con for me, though users won't experience this)

???

Faster read times? (I've heard it can help not thrash the hard drive so much, but is this really much of a concern today on modern operating systems, and does it really help a significant amount?)

Does anyone have much experience with the pros/cons of using pack files? Are there any significant pros to using pack files, and are there any significant cons to just using the normal file system?

(Slightly) harder for me to work with

Increased save times when modifying the file (which the game won't do, but I will in my editor, so it's a con for me, though users won't experience this)

In my engine, I only use packs/archives for retail/shipping builds. In development builds, each asset is stored in a separate file in the data directory. When making a shipping build, the data directory is "zipped" into an archive, and the code is compiled to use a different asset-loading class. This lets me have the benefits of archives on the end-user's machine, while keeping the benefit of easy content iteration during development.

IMO, I find working with "files" much harder than working with assets. Yes, in development my assets are stored as files, so when I say Load("foo.texture"), that does turn into CreateFile("data/foo.texture", GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0)...

However, because I never think of these assets as "files", I'm free to change the behind the scenes behavior. Maybe I'll look in the OS's file system first, and then in a patch archive, and then in the shipping archive. Maybe I'll pre-load a chunk of an archive that contains several assets at once, then when the user asks for any of those assets, I've already got the data sitting there, etc...
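A minimal sketch of that layered lookup, with entirely hypothetical names (this is not Hodgman's actual code): sources are tried in priority order, so loose files on disk can shadow a patch archive, which shadows the shipping archive.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Blob = std::vector<unsigned char>;
// A source is anything that can try to produce an asset's bytes by name
// (a loose-file directory, a patch archive, the shipping archive, ...).
using Source = std::function<std::optional<Blob>(const std::string&)>;

class AssetLoader {
public:
    // Sources are tried in the order they were added, highest priority first.
    void AddSource(Source s) { sources_.push_back(std::move(s)); }

    std::optional<Blob> Load(const std::string& assetName) const {
        for (const auto& src : sources_)
            if (auto blob = src(assetName))
                return blob;
        return std::nullopt; // not found in any layer
    }

private:
    std::vector<Source> sources_;
};
```

Because callers only ever say `Load("foo.texture")`, the behind-the-scenes source list can change freely between development and shipping builds.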

Also, because I don't think of them as files, I don't author them as files. I never copy files into the data directory, and I never use file->save as to create any of the data files. Instead, we have a content directory, which does contain files, and a data-compiler, which scans the content directory for modifications, compiles any modified files, and writes them into the data directory. This means that for example, if I want to change my textures to use a different DXT compression algorithm, or I decide that materials should be embedded into level files, then I change the data-compiler's rules, and the data directory can be recompiled.

The association between an asset name (e.g. "foo.texture") and the content file path (e.g. "d:\myProject\content\foo.png") isn't hard-coded; our compilation routines are written in C#, the build steps are described in Lua, and regular expressions are used to find a suitable file to use as input when building an asset. For example, a Lua script can tell the asset-compiler that:

* If the asset "foo.geo" is required, then use the GeometryBuilder plugin (a C# class) and load "temp/foo.daebin" as input.

* If "temp/foo.daebin" is required, then use the DaeParser plugin and search the content directory recursively for "foo.dae" (a COLLADA model).
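A rough C++ sketch of that rule-matching idea (the plugin names GeometryBuilder and DaeParser come from the post; everything else here is hypothetical): each rule maps a requested asset name, via a regular expression, to a builder plugin and an input path built from the captured base name.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>
#include <vector>

struct BuildRule {
    std::regex assetPattern;  // matches the requested asset name
    std::string builder;      // plugin to run, e.g. "GeometryBuilder"
    std::string inputFormat;  // input path template; $1 = captured base name
};

// Returns (builder, input path) for the first rule matching the asset name,
// or nullopt, which a real tool would report as a build error.
std::optional<std::pair<std::string, std::string>>
FindRule(const std::vector<BuildRule>& rules, const std::string& asset) {
    for (const auto& r : rules) {
        std::smatch m;
        if (std::regex_match(asset, m, r.assetPattern))
            return std::make_pair(r.builder, m.format(r.inputFormat));
    }
    return std::nullopt;
}
```

With a rule like `{ std::regex(R"((\w+)\.geo)"), "GeometryBuilder", "temp/$1.daebin" }`, asking for "foo.geo" resolves to the GeometryBuilder plugin with "temp/foo.daebin" as input, mirroring the example above.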

These kinds of data compilers can also be used to ensure that only the data that's actually used by the game ends up in the data directory. Instead of compiling every file inside the content directory, we only compile the required files. We start off with the hard-coded asset-names used in the game's source code (ideally this number is quite small), then we find the linked assets (e.g. a material links a texture and a shader), and repeat until we've got a full dependency tree.

Another neat feature you can add to a system like this is asset-refreshing -- the data-compiler is already scanning for file modifications to rebuild new data, so when it re-builds a file it can check if the game is currently running, and if so, send a message to the game instructing it to reload the modified asset.

In the industry, every company I've worked at for the past 6 years has used some kind of automated asset pipeline like this, and I just can't imagine going back to manually placing files in the game's data directory -- to me, it seems like a lot more of a hassle.

Faster read times? (I've heard it can help not thrash the hard drive so much, but is this really much of a concern today on modern operating systems, and does it really help a significant amount?)

Assuming a non-SSD drive, it can give a significant reduction in loading times. With 1000 separate files, the OS can keep each one defragmented, so that each individual file load can be done without wasteful seek periods. However, the OS doesn't know the order in which you want to load all of those files, so you'll pay a seek penalty in-between each file and likely won't benefit from automatic pre-caching. If you pack all the files end-to-end, in the order in which you want to load them, you can spend more time reading and less time seeking.
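As an illustration of that layout (a toy in-memory format, not any real pack file): an index of (name, offset, size) entries "up front", with each asset's bytes packed back-to-back in load order, so a sequential read touches them in exactly the order the game wants them while the index still allows random access.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Pack {
    struct Entry { uint64_t offset, size; };
    std::map<std::string, Entry> index; // small pointer table at the front
    std::vector<uint8_t> data;          // all assets, end-to-end
};

// Append assets in the order the game will load them, so reading `data`
// sequentially touches them in exactly that order (fewer seeks on a HDD).
Pack BuildPack(const std::vector<std::pair<std::string, std::vector<uint8_t>>>& assets) {
    Pack p;
    for (const auto& [name, bytes] : assets) {
        p.index[name] = { p.data.size(), bytes.size() };
        p.data.insert(p.data.end(), bytes.begin(), bytes.end());
    }
    return p;
}

// Random access is still possible through the index.
std::vector<uint8_t> ReadAsset(const Pack& p, const std::string& name) {
    const Pack::Entry& e = p.index.at(name);
    auto first = p.data.begin() + static_cast<std::ptrdiff_t>(e.offset);
    return { first, first + static_cast<std::ptrdiff_t>(e.size) };
}
```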

As for modern OSes helping out, either way, make sure you're using the OS's native file system API (e.g. on Windows, CreateFile/ReadFileEx instead of fopen/fread). By using these APIs you can take advantage of modern features like threadless background loading (DMA), file (pre)caching, or memory mapping.

Wow, thanks a ton for the great insights Hodgman! I'm definitely looking at implementing a data compiler like that. The auto-asset-refresh sounds *really* nice. That, and it lets the artists maintain their normal workflow when it comes to updating assets. And I think I'll do what you do: use pack files in the release builds and the filesystem for development builds. Abstracting the data storage and using a swappable loading class would be nice.

Properly packed, you can reduce load times. That is by far the biggest compelling reason. Ideally packed you have a small pointer table up front followed by all the data that gets memory-mapped and copied into place as fast as the OS streams it in. However, do that wrong and it will be SLOWER than a traditional load. Profile and proceed with careful measurements.

Making it harder for end users to reverse engineer is perhaps the most invalid reason. If that is your motivation then stop.

Properly packed, you can have independent resource bundles that can be worked on and replaced as individual components. A great example of this is The Sims, where you can download tiny packs of clothes, people, home lots, and more. People generate custom content all the time and upload their hair models, body models, clothing models, the associated textures and whatnot, all in their own little bundle.

Many comprehensive systems will use dual-load systems, first checking the packaged resources and then checking the file system for updated resources. That enables you to make changes without rebuilding all the packages. Even better systems will watch the file system and automatically update when changes are detected. This is extremely useful when there are external tools, such as string editors, tuning editors, and various resource editors so you can see your changes immediately in game.

I'm very interested in this.
I initially started by referring to resources (possibly the same thing as "asset names"), but I had a few collisions here and there, so I later switched to using file names directly. I didn't like this and I don't like it now; I want to go back to asset names in the future, but I'm still unsure how to deal with naming collisions while providing a fine degree of flexibility.
Perhaps it would just be better to adopt stricter naming conventions?
Any suggestions on rules for resource->file mappings?

@Madhed:
Regarding the generally very interesting paper by Jan Wassenberg, one should note that it contains a lot of very useful information for some cases, and a lot of careful consideration in general. If one develops for a console or considers streaming data from CD, the paper hits the spot 100%. Some of the techniques described (e.g. duplicating blocks) are a big win when you read from a medium where seeking is the end of the world (such as a DVD), or when you can't afford clobbering some RAM.
On the other hand, if one targets a typical Windows desktop PC with "normal" present time hardware, almost all of the claims and assumptions are debatable or wrong (that was already the case in 2006 when the paper was written).

What is indisputably right is that it's generally a good idea to have one (or few) big files rather than a thousand small ones.
Other than that, one needs to be very careful about which assumptions are true for the platform one develops on.

On a typical desktop machine which typically has half a gigabyte or a gigabyte of unused memory (often rather 2-4 GiB nowadays, or more), you absolutely do not want to bypass the file cache. If speed (and latency, and worst case behaviour) is of any concern, you also absolutely do not want to use overlapped IO.

Overlapped IO rivals memory mapping in raw disk throughput if the file cache is disabled and if no pages are in cache. This is cool if you want to stream in data that you've never seen and that you don't expect to use again. It totally sucks otherwise, because the data is gone forever once you don't use it any more. With memory mapping, you pull the pages from the cache the next time you use the data. Even with some seeks in between (if only part of a large file is in the cache), pulling the data from the cache is no slower and usually faster (much to my surprise -- this is counterintuitive, but I've spent some considerable time on benchmarking that).

Ironically, overlapped IO runs at about 50% of the speed of synchronous IO, if it is allowed to use the cache (this is, unlike under e.g. Linux, actually possible under Windows). Pulling data from the cache into the working set synchronously peaks at around 2 GiB/s on my system (this is surprisingly slow for "doing nothing", a memcpy at worst, but it beats anything else by an order of magnitude).

Asynchronous IO will silently, undetectably, unreliably, and differently between operating systems and versions, and depending on user configuration, revert to synchronous operation. Also, if anything "unexpected" happens, queueing an overlapped request can suddenly block for 20 or 40 milliseconds or more (so much for threadless IO, which means your render thread stalls during that time). This is not unique to Windows; Linux has the exact same problem. If the command queue is full or some other obscure limit (that you don't know about and that you cannot query!) is hit, your io_submit blocks. Surprise, you're dead.

What you ideally want is to memory map the entire data file and prefault as much of it as you can linearly at application start (from a worker thread).

If you, like me, own a "normal, inexpensive" 3-4 year old hard disk, you can observe that this will suck a 200 MiB data file into RAM in 2 seconds, with few or no seeks at all. If you, like me, also have a SSD, you can verify that the same thing will happen in well under a second. Either way, it's fast and straightforward. If your users, like pretty much everyone, have half a gigabyte of unused memory, the actual read later will be "zero time" without ever accessing the disk.
This is admittedly the best case, not the worst case. But the good news is that the worst case is no worse than otherwise. The best (and average) case, on the other hand, is much better.

@samoth
fair point. I just wanted to point out the paper since that was the first thing that sprang to my mind when reading the thread title. I haven't actually implemented or verified the results but found the paper interesting enough to share.

I use PhysFS myself and I think it works great. It allows you to not use an archive, but instead mount an actual folder.

This means in development you can still be using PhysFS and be working with the resources on disk directly, and then create an archive and switch to using the archive by mounting the .pak file or whatever.

PhysFS has really nice FileIO functions too.

I find it's also pretty easy to write a little batch script or shell script that creates the archive from a folder in one click if you add something and want to see changed results.

I initially started by referring to resources (possibly the same thing as "asset names"), but I had a few collisions here and there, so I later switched to using file names directly. I didn't like this and I don't like it now; I want to go back to asset names in the future, but I'm still unsure how to deal with naming collisions while providing a fine degree of flexibility.
Perhaps it would just be better to adopt stricter naming conventions?
Any suggestions on rules for resource->file mappings?

My build tool scans the entire content directory and builds a map of filenames to paths. If the same filename appears at multiple paths (e.g. content/lvl1/foo.png, content/lvl2/foo.png), then a boolean is set in the map, indicating that this name->path mapping is conflicted.

When evaluating build rules, this table is used to locate input files on disk. If an entry from the table is used that has its conflict flag set, then the tool spits out an error (describing the two paths) and refuses to build your data. This is similar to bad code spitting out assertion failures and refusing to run.
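A bare-bones sketch of such a table, with made-up names: inserting the same bare filename from two different paths flips the conflict flag, and resolving a conflicted name fails loudly instead of silently picking one.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct PathEntry {
    std::string path;
    bool conflicted = false;
};

class ContentMap {
public:
    void Add(const std::string& filename, const std::string& fullPath) {
        auto [it, inserted] = entries_.try_emplace(filename, PathEntry{fullPath});
        if (!inserted && it->second.path != fullPath)
            it->second.conflicted = true; // same name found at two paths
    }

    // Throws (a build error) rather than guessing which file was meant.
    const std::string& Resolve(const std::string& filename) const {
        const PathEntry& e = entries_.at(filename);
        if (e.conflicted)
            throw std::runtime_error("ambiguous content name: " + filename);
        return e.path;
    }

private:
    std::map<std::string, PathEntry> entries_;
};
```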

Because my particular data-compiler is designed to always be running, it listens to changes to the content directory, and if you create a duplicate file, I can pop up one of those annoying bubbles from the system tray, letting you know you've just made a potential conflict before you even try to build your data.

Regarding naming conventions, I can somewhat enforce these by specifying them in my build rules. For example, if I wanted to disallow "foo.texture" and enforce "foo_type.texture", where "type" is some kind of abbreviation, I can set up only rules that contain "_type". Let's say one of my "types" is "colour+alpha", and that I want the artists to author colour and alpha separately.

Whereas, if someone sets up a material to link to something like "foo.texture", which doesn't follow the convention, the data compiler follows this sequence:

* Build: data/foo.texture
* Error: No rule matches "data/foo.texture"

Thank you very much, I think I understand the basic principles. I am currently doing something similar to the example you describe. It now appears that not allowing conflicts to happen is better than fixing them after the fact.

I use a system similar to Java classpaths/jars (or PhysFS?). In fact I have a virtual filesystem with multiple layers of archives or directories which can be mounted. The important part is that there are layers of archives. For example:

When I mount the archives/directories in the order data->patch->directory, I get the final virtual filesystem:

/data/texture/tex1.png (Version 1.2)
/data/texture/tex2.png (Version 1.0)
/data/scripts/script1.lua (Version 1.1)

This comes in really handy when delivering patches or exchanging single files for debugging purposes (at least for a hobby dev).

The main motivations are quicker load times and more convenience of distribution.

Suppose you have 10,000 files. This is a fairly low number of individual objects, even for a game without much content.

On a desktop OS, your on-access AV program must scan every file. This is usually very time consuming.

You'll probably distribute your game as an archive (e.g. zip) anyway, so it doesn't make any difference. Your packer may be less efficient than zip, or more, but it doesn't really matter.

The overhead of having large numbers of files in the OS filesystem is quite significant, particularly when you remember that EVERYONE has on-access AV scanners!

Development convenience can be provided by having dev-builds search the filesystem first (and the resources.zip second) for files.

---

There is no security / reverse engineering benefit, because it is just as easy for a cracker to modify your big zip file as it would be if they were individual files. If you want to discourage casual reverse-engineering (or graphics ripping, etc.), then rename your .zip file to .zpi or something.

Another point is that many third-party resources (textures, sounds, models, etc.) come with a license which forces you to deliver the resources in a protected way. A resource file (!= simple zip) is at least a basic protection.

Another point is that many third-party resources (textures, sounds, models, etc.) come with a license which forces you to deliver the resources in a protected way. A resource file (!= simple zip) is at least a basic protection.

Ha, no it's not. Any game that gains popularity is not protected (I mean really... Spore, anyone?), and any game that doesn't gain popularity isn't worth overcomplicating in the name of unnecessary and ineffective protection. And I highly doubt media licenses would seriously force me to deliver them in a "protected" pack file (I could be wrong, but I'd be surprised)... The reason I listed "(Slightly) harder for users to muck with" as a pro is not so much because it would stop Average Joe from replacing a texture (because if Pro-Hacker Henry hacks the file anyway, all he needs to do is release a program and Average Joe can now do everything Pro-Hacker Henry can do), but because I think some modders get a kick out of reverse engineering things and in a way it helps develop a modding community for the game, which (if done correctly) I think can be a good thing. [edit: Hodgman has pointed out that this has come across as quite arrogant; please read my post below, as that was not my intention (and realize I am not talking about the legal issues of fulfilling a contract here; I'm talking more about what frob said above)]

Anyway, sorry, I'm not trying to start a holy war here. You've all brought up some great points. I hadn't thought about anti-virus programs, and I didn't realize Windows struggled with lots of files in one folder (I'd probably categorize them in subfolders anyway, but it's good to know).

Sorry to whoever voted down Cornstalks, but I had to undo your vote.
I am the author of MHS (Memory Hacking Software) and I have a very informed view on this topic.

Hackers can't really be prevented. Instead, what we anti-cheat specialists aim for is simply to minimize the spread of cheats, which is basically what Cornstalks was trying to illustrate.

I have models that I use as test material for my own engine which I got from a site, but in the back of my mind I know for a fact that they ripped that content out of a Final Fantasy game illegally. The model is a raw hack of data.

Fine. I am still going to use that data as test material so I can gauge the progress of my engine. Some of my data I know to be illegally ripped from Halo as well but I was not the one who ripped it.

But this is peanuts compared to some of the things I myself have ripped from games, which include entire levels, not just individual models.
If you were to proclaim that you had any form of unhackable resource in any product you made I would simply laugh. The digital age + protection? Give me a break.

No, we can't stop hackers. But you really don't realize how effective the small stuff is.
As mentioned by Cornstalks, the “slightly harder for users to muck with” situation is actually extremely effective when combating hackers. I have first-hand experience talking with hackers of all levels and I can personally confirm that hackers without much skill give up very easily.
In my own engine I have a custom compression system that acts as a deterrent for hackers. Why?
Only a little has changed from the standard libraries, but in order to handle that change you still have to rewrite the entire decompression system.
Even if some people realize that, very few of them are willing to actually do it. “Meh, I will just hack something else,” is how most will reply.

They end up creating more basic-level hacks and then keeping them for themselves. Why? Because there is no prestige in releasing a hack that everyone else can make.

The benefits in deterring the basic and obvious cheats are actually quite huge and very very frequently underestimated.

I downvoted it because of this arrogant dismissal of the reality that lawyers don't understand technology. I can paraphrase it as "Ashaman73, I've no experience with such legal requirements so I'll declare that they don't exist".

Another point is that many third-party resources (textures, sounds, models, etc.) come with a license which forces you to deliver the resources in a protected way. A resource file (!= simple zip) is at least a basic protection.

Ha, no it's not. Any game that gains popularity is not protected (I mean really... Spore, anyone?), and any game that doesn't gain popularity isn't worth overcomplicating in the name of unnecessary and ineffective protection. And I highly doubt media licenses would seriously force me to deliver them in a "protected" pack file (I could be wrong, but I'd be surprised)...

The point was that if you're legally obliged to use 'protected' files then you're forced to jump through this hoop. Yes, your 'protected' files can easily be opened, but that doesn't change the fact that if there's a legal requirement to use 'protected' files, then you may have to. And yes, such legal hoops do exist and are an important detail in the real world.

For example, anyone can rip a copy-protected DVD easily, but the fact that the DVD has weak anti-copying measures means that you've crossed a particular legal line in the sand, which makes the lawyers job much easier when trying to prosecute pirates. Even though this copy-protection is useless in stopping copies from being made, publishers use it anyway as it becomes a legal weapon (the data was "protected" and you "broke" that protection).

I downvoted it because of this arrogant dismissal of the reality that lawyers don't understand technology.

I will apologize for the arrogance, as I honestly didn't intend for it to come across as arrogant as I suppose it has (when I read "protection" I immediately thought of Spore, which has always been a funny example of epic failure to me, hence the "ha" part). I am indeed sorry for that.

I will say I am surprised that it sounds like (from Hodgman and Ashaman) it's not uncommon for contracts to require things to be packed and protected... I understand, of course, that if a contract states that, that is what needs to be done; I was not disagreeing with that. My point was more in line with frob's point that "Making it harder for end users to reverse engineer is perhaps the most invalid reason." Sure, it can prevent newbs from messing around with things, but if that were my goal I'd use a simple encryption/obfuscation scheme instead, which could be implemented on top of either a pack file or raw files on the file system. I see "packing things into a file" and "encrypting/obfuscating them" as two different problems with two different goals, though they can be used in combination.