Think about what happens with the bits on the disk when you want to 'insert' something at byte 128 of a 2 gigabyte file.
–
MichaelT Apr 29 '14 at 18:00

You mean with no operating system and no file system in between? Then it will not work. With the other two in place, I have no idea why it cannot work.
–
User Apr 29 '14 at 18:01


Take 500 dominos and lay them end to end in a line. Now try to insert one into that line without moving the others.
–
GrandmasterB Apr 29 '14 at 18:14


@MichaelT In my dream world, you should only have to insert another block into the string of blocks that make up the file and distribute the contents of the current first block onto the first two blocks. Granted, this would require the file system implementers to handle odd-sized blocks - but in the situations where you do need this operation, it would improve efficiency so much it's not even funny.
–
Kilian Foth Apr 29 '14 at 18:16


@User the questions of filesystem fragmentation and how Ext4 works move firmly into the realm of SuperUser. Please remember to fully specify your problem or they'll be asking about bytes again. You are asking about blocks and file systems and logical volume managers and the like.
–
MichaelT Apr 29 '14 at 19:40

6 Answers

Given that most file systems store the contents of files in individual blocks that are not necessarily contiguous on the physical disk, but linked via pointer structures, it seems that such a mode - "inserting" rather than "appending" or "overwriting" - ought to be possible, and could certainly be made more efficient than what we have to do now: read the entire content, edit the stream of bytes and re-write the entire content.

However, for better or worse, the UNIX semantics of file systems were designed along the "rough and simple" paradigm in the 1970s: they allow you to do everything, but not necessarily in the most efficient way possible. Nowadays it is almost unthinkable to introduce a new file-opening mode into the Virtual File System layer and have any hope of the major file systems adopting support for it. This is a pet peeve of mine, but unfortunately one unlikely to be resolved any time soon.

Block level storage complicates the question a step further. Sticking with the OP's original example, the two versions of the string ought to fit within a single block. The bytes need to be written out sequentially and that's what necessitates shifting the tail of the string down by whatever the amount inserted.
–
GlenH7 Apr 29 '14 at 18:22

It would only be efficient if you had to insert exactly the amount of data that can be stored in one block, exactly at the boundary between two existing blocks.
–
Idan Arye Apr 29 '14 at 19:20

Kilian Foth seems to be right. I asked a professor about this and he told me much the same: the "rough and simple" design allows portability and thus is more widely used. Not many file systems allow insertion, and even fewer operating systems expose it, in order to conform to a portable interface. @GlenH7 Two people who edited my question made it look like I was asking about bytes, and they reverted my clarification. The real question is about the interface we use.
–
User Apr 30 '14 at 10:53

Yes, the blocks are linked via pointers, so the file contents don't have to be stored contiguously; but when they are stored contiguously, the hardware can read block after block without having to slow down at all. If it had to follow pointer after pointer, the read head would constantly be moving. That's why defragmentation helps speed up your computer: it puts the block pointers for files in contiguous blocks. Then the command is not "read block 1, read block 3, read block 9, read block n" - it becomes "read blocks 1 through n". The hardware can do that much, much more efficiently.
–
Dunk Apr 30 '14 at 19:31

First you need to read everything after the insert point, then write it back shifted further along by however much space you're going to insert. Only then can you write your "insert" data into the correct place. It is an extremely poor-performing operation, hence not natively supported.
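The read-tail, shift, write-back sequence described above can be sketched in a few lines of Python. This is a minimal illustration of the work an application must do by hand today, not how any file system does it internally; `insert_bytes` is a name invented for this example:

```python
import os
import tempfile

def insert_bytes(path, offset, data):
    """Insert `data` at `offset` by re-writing the tail of the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        tail = f.read()      # everything after the insert point
        f.seek(offset)
        f.write(data)        # the inserted bytes...
        f.write(tail)        # ...followed by the shifted tail

# demo on a throwaway temp file
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"abdefg")
insert_bytes(path, 2, b"c")
with open(path, "rb") as f:
    print(f.read())  # b'abcdefg'
os.remove(path)
```

For a multi-gigabyte file, the `tail = f.read()` line is exactly the problem: everything after the insertion point has to pass through memory and be written back out.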

SSDs still read and write a page at a time. So to write the 1 byte you want to insert, you'd have to write an entire page of data along with updating all the corresponding file system tables/pointers. I would not be surprised if the initial file systems did have an insert-like operation but their designers realized that it added far more overhead than it saved.
–
Dunk Apr 30 '14 at 19:17

Theoretically, you could implement a file format that would allow this sort of thing. For maximum flexibility, though, you'd need to store a pointer to the next byte along with every byte in the file. Assuming a 64-bit pointer, 8 of every 9 bytes of your file would be internal pointers, so it would take 9,000 bytes of space to store 1,000 bytes of actual data. Reading the file would also be slow, since you'd need to read each byte, read the pointer, follow the pointer to the next byte, and so on, rather than reading large, contiguous blocks of data from disk.
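For illustration, here is what that byte-granularity linked list looks like as a toy in-memory sketch (the `ByteNode` class and helpers are invented for this example). The pointer splice itself is cheap, but reaching the splice point, and every subsequent read, means chasing pointers one byte at a time:

```python
class ByteNode:
    """One node per byte: a 1-byte payload plus a 'next' pointer.
    On disk that pointer would be 8 bytes, which is where the
    9,000-bytes-for-1,000-bytes overhead in the answer comes from."""
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def from_bytes(data):
    """Build the chain back to front so each node links to the next."""
    head = None
    for b in reversed(data):
        head = ByteNode(b, head)
    return head

def insert_at(head, index, value):
    """O(index) walk to the spot, then an O(1) pointer splice:
    no tail re-write, but every lookup must follow the chain."""
    if index == 0:
        return ByteNode(value, head)
    node = head
    for _ in range(index - 1):
        node = node.next
    node.next = ByteNode(value, node.next)
    return head

def to_bytes(head):
    """Reading means following every pointer, one byte at a time."""
    out = bytearray()
    while head is not None:
        out.append(head.value)
        head = head.next
    return bytes(out)

head = from_bytes(b"abdefg")
head = insert_at(head, 2, ord("c"))
print(to_bytes(head))  # b'abcdefg'
```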

Obviously, this sort of approach isn't practical. You could, though, split the file up into, say, 32 KB blocks. That would make it relatively easy to add 32 KB of data at any 32 KB boundary in the file. It wouldn't make it any easier to add a single byte as the 5th byte of the file. If you reserve some free space in every block, though, you could allow small additions of data that affect only the data in that single block. You'd pay a penalty in file size, of course, but potentially a reasonable one. Figuring out how much space to reserve and how to split blocks, though, tends to be much easier for a particular application than for a general-purpose system: what works in one context may be very bad in another, depending on the file access and modification characteristics.
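A toy version of the reserve-slack-per-block idea might look like the sketch below. All the names and the tiny block size are invented for the demo; a real system would use far larger blocks and on-disk structures:

```python
CAP = 16     # total block capacity (a real system might use 32 KB)
SLACK = 4    # free space reserved per block for future inserts

class BlockedFile:
    """Toy blocked file: each block is loaded with CAP - SLACK bytes,
    so a small insert usually touches only one block. A block that
    outgrows its capacity is split in two."""

    def __init__(self, data=b""):
        fill = CAP - SLACK
        self.blocks = [bytearray(data[i:i + fill])
                       for i in range(0, len(data), fill)] or [bytearray()]

    def insert(self, offset, data):
        i = 0                                  # find the block holding offset
        while i < len(self.blocks) - 1 and offset > len(self.blocks[i]):
            offset -= len(self.blocks[i])
            i += 1
        blk = self.blocks[i]
        blk[offset:offset] = data              # edit confined to this block
        if len(blk) > CAP:                     # slack exhausted: split block
            mid = len(blk) // 2
            self.blocks[i:i + 1] = [blk[:mid], blk[mid:]]

    def read(self):
        return b"".join(bytes(b) for b in self.blocks)

f = BlockedFile(b"abcdefghij" * 3)   # 30 bytes -> blocks of 12, 12, 6
f.insert(4, b"XY")                   # fits in block 0's slack
print(f.read()[:8])  # b'abcdXYef'
```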

In fact, many systems that spend a lot of time interacting with files implement something like what I've described above when they implement their particular file abstraction. Databases, for example, will generally implement some concept of a "block" as the smallest unit of I/O that they can work with and will generally reserve some amount of space for future growth so that updating a row in a table only affects the one block on which that data is stored rather than rewriting the entire file. Different databases, of course, have different implementations with different trade-offs.

I'd also mention that "seek to the block at the 1-gigabyte mark of a 2-gigabyte file" could take a fair bit of time with the linked-list-of-bytes implementation.
–
MichaelT Apr 29 '14 at 18:35

The issue of what happens during insertions is one that causes a lot of consternation among people designing de-duplication for storage systems.
–
Blrfl Apr 29 '14 at 19:40

Thanks for understanding that I did not mean to talk about bytes but about the bigger picture.
–
User Apr 30 '14 at 8:38

Picture the file as a sequence of bytes at consecutive offsets, say 0x00 through 0x06, and suppose you want to insert C at position 0x04. That requires shifting bytes 4 - 6 down one byte so you can insert the new value. If you don't, you're going to over-write the value that's currently at 0x04, which is not what you want.

So the reason you have to re-write the tail of the file after you insert a new value is that there isn't any space within the file to accept the inserted value. Otherwise you would over-write what was there.
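Python's `bytearray` makes the difference between over-writing and inserting easy to see (using a hypothetical seven-byte file):

```python
original = bytearray(b"abdefgh")   # hypothetical file contents

# Over-writing at offset 2 clobbers the value that was there:
clobbered = original.copy()
clobbered[2] = ord("c")
print(bytes(clobbered))  # b'abcefgh' -- the 'd' is gone

# Inserting requires shifting everything from offset 2 onward:
inserted = original.copy()
inserted[2:2] = b"c"
print(bytes(inserted))   # b'abcdefgh' -- tail shifted down one byte
```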

Addendum 1: If you wanted to replace the value of b with C, then you do not need to re-write the tail of the string. Replacing a value with a like-sized value doesn't require a rewrite.

Addendum 2: If you wanted to replace the string ab with C then you would need to re-write the rest of the file as you've created a gap in the file.

Addendum 3: Block-level constructs were created to make large files easier to handle. Instead of having to find 1M worth of contiguous space for your file, you now only need to find 1M worth of available blocks to write to.

In theory, you could construct a filesystem that did byte-by-byte linking, similar to what blocks provide. Then you could insert a new byte by updating the to/from pointers at the appropriate point. I would hazard a guess that the performance of that would be pretty poor.

You can't insert another domino within the line of dominoes without causing everything to tumble over. You have to create the space for the new domino by moving the others down the line. Moving dominoes down the line is the equivalent of re-writing the tail of the file after the insertion point.

Suppose a, b, C and d are not characters but gigabytes of characters. Could you address this in your answer? I like the picture, but I also think people would approach inserting 1000 dominos into 2000 dominos differently than 1 domino into 6 dominos.
–
User Apr 29 '14 at 19:22

@User - GB of characters instead of bytes fundamentally changes the nature of your question and now blocks for storage must be considered. At a simple level, the answer is the same. You can't insert something within a contiguous series of "whatevers" without creating space.
–
GlenH7 Apr 29 '14 at 19:33

When you do direct access to a file, you are using a low-level interface that can be used to build more sophisticated structures. Consider building a database over your data that allows the types of access you need, including insertion.

It would be less expensive if you only need to iterate through the file rather than do random accesses to a specified offset. If you need random access by offset into the file, you will need to update the index for all bytes beyond the insertion point.

In general, you will pay in indexing data structures, memory to store the index, and extra disk accesses to update it.
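One classic way to get cheap insertion on top of plain byte-level file access, along the lines this answer suggests, is a piece table: keep the original file immutable, collect inserted bytes in an append-only buffer, and maintain an index of (buffer, start, length) spans. The sketch below is a simplified in-memory version; the names are invented here, and real implementations also persist the table and handle deletions:

```python
class PieceTable:
    """Sketch of a piece table: the base file stays read-only, inserted
    bytes go into an append-only buffer, and the logical file is an
    ordered list of (buffer, start, length) spans."""

    def __init__(self, base):
        self.base = base                      # original file contents
        self.added = bytearray()              # append-only insert buffer
        self.pieces = [("base", 0, len(base))]

    def insert(self, offset, data):
        """Insert `data` at logical `offset` (must be within the file)
        by splitting one piece. No bytes of the base file move."""
        start = len(self.added)
        self.added += data
        new_pieces = []
        for buf, off, length in self.pieces:
            if 0 <= offset <= length:         # insertion lands in this piece
                if offset > 0:
                    new_pieces.append((buf, off, offset))
                new_pieces.append(("add", start, len(data)))
                if offset < length:
                    new_pieces.append((buf, off + offset, length - offset))
                offset = -1                   # mark insertion as done
            else:
                new_pieces.append((buf, off, length))
                if offset >= 0:
                    offset -= length
        self.pieces = new_pieces

    def read(self):
        """Materialize the logical file by walking the spans."""
        out = bytearray()
        for buf, off, length in self.pieces:
            src = self.base if buf == "base" else self.added
            out += src[off:off + length]
        return bytes(out)

pt = PieceTable(b"abdefg")
pt.insert(2, b"c")
print(pt.read())  # b'abcdefg'
```

The trade-off is exactly the one the answer describes: insertion becomes cheap, but you pay in index structures, the memory to hold them, and extra work on every read.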

Insertion into a file is not implemented in most file systems because it is considered an "expensive" (time- and space-consuming) operation with potentially long-term "expensive" repercussions and additional failure modes.

A file system with insertion semantics would probably either use shift&insert (potentially very expensive when you insert at the front of a large file, but no/few long-term side-effects) or some sort of generalized heap allocation with variable length allocation sizes (very ill-behaved performance in some cases [imagine the interactive users' faces if they try to save a file during a stop-the-world GC!]).

If you want to experiment, you can easily build a file I/O abstraction in Java or Python that implements insertion. If you succeed and it has well-behaved performance characteristics, you have the basis for an excellent research paper. Good luck.

this does not seem to offer anything substantial over the prior 6 answers
–
gnat Apr 29 '14 at 19:22

You can write all the software you want, but it won't change the way the hardware works. Hardware works by reading/writing in blocks/pages. In an HDD, if the data is not contiguous, then the read head has to move, which slows file access time dramatically. Any insert operation would, by the very fact of being an insert, have to be stored elsewhere and not contiguously. So sure, insert will possibly be faster (for very large files), but reading will be much slower.
–
Dunk Apr 30 '14 at 19:24