I started work on a disk driver, but now I have some doubts regarding how to transfer data from the disk driver to the relevant process, in a reasonably efficient way. I imagine a read request will proceed as follows.

Process X requests to read some data from a disk. This request is routed via the VFS to the disk driver, which adds the request to its queue of pending operations.

Process X should now block until the data has been read from the disk.

At some point the disk driver starts reading the data from the disk. The question is now, how do I transfer the data back to the relevant process?

In relation to the last part, I have the following things in mind.

For DMA transfers, the data buffers must be below 4 GB, i.e. on a 64-bit system I cannot necessarily read data directly into its final destination.

I imagine reading the data into a buffer, after which it must be copied to its final destination. The question is now: how large should these buffers be? For ATA DMA I can in principle read 512 MB of data in each DMA transfer. Transfers that large will be rare, though, and I am not going to allocate such a large buffer; in that case I would split the request into a sequence of smaller transfers.
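The splitting itself is just arithmetic. Here is a minimal sketch, assuming a hypothetical 64 KiB bounce buffer and 512-byte sectors (both sizes are illustrative, not from the original post):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bounce-buffer size: 64 KiB per DMA transfer. */
#define DMA_BUF_SIZE (64 * 1024)
#define SECTOR_SIZE  512

/* Split a large read into transfers that each fit the bounce buffer.
   Fills one (LBA, sector count) pair per transfer and returns how many
   transfers are needed. */
static size_t split_request(uint64_t lba, size_t sectors,
                            uint64_t *chunk_lba, size_t *chunk_cnt,
                            size_t max_chunks)
{
    size_t per_chunk = DMA_BUF_SIZE / SECTOR_SIZE; /* sectors per transfer */
    size_t n = 0;

    while (sectors > 0 && n < max_chunks) {
        size_t cnt = sectors < per_chunk ? sectors : per_chunk;
        chunk_lba[n] = lba;
        chunk_cnt[n] = cnt;
        lba += cnt;       /* next transfer continues where this one ends */
        sectors -= cnt;
        n++;
    }
    return n;
}
```

A 300-sector request then becomes three transfers of 128, 128, and 44 sectors, each queued to the driver in turn.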

Let us now imagine that the data has been read into a buffer and needs to be copied to its final destination. I imagine the ATA driver being a thread in kernel space, which means it does not necessarily share a virtual address space with the process that requested the data. How do I handle this situation? I could of course temporarily map the necessary frames into the kernel address space, but that seems a little too complicated.

I am generally confused about how to properly implement this entire approach, and there are probably issues I haven't even considered yet.

Why not just shift the "problem" to userspace? I.e. allow the userspace process to specify that the physical address should be <4GB when requesting additional memory (e.g. as a flag to your "mmap" or equivalent). That means any copying is handled by userspace and is not a concern of the kernel.

Alternatively, re-map the userspace memory to <4GB when the read is performed. As long as the read is page-aligned, you don't even have to copy anything; even if the read fails, the contents of the buffer passed to "read" (or equivalent) beyond the number of bytes successfully read are undefined. A suitably advanced memory manager should be able to identify <4GB pages that no longer need to be accessible to DMA and move them out to memory above 4GB if/when available <4GB memory is running low.

Additionally, most Intel CPUs from the "Sandy Bridge" generation onwards (and a smattering of earlier chips) and AMD chips going back even further have IOMMU support (Intel VT-d, AMD-Vi), which will allow you to remap RAM above 4GB into the range that 32-bit DMA devices can use.

You should probably also maintain a read-ahead cache. If access patterns indicate linear access, it might be prudent to read ahead of the current request so you have the data handy when it is requested (e.g. you see requests for sectors 2, 3, and 4, so why not load sector 5 into the buffer?).
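A sequential-access detector can be very small. The following sketch tracks where the last request ended; the thresholds and window sizes are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-device read-ahead state: where the next sequential request would
   start, and how many sequential requests we have seen in a row. */
struct ra_state {
    uint64_t next_lba;
    unsigned streak;
};

/* Returns how many extra sectors to prefetch (0 = none).
   Thresholds and the growth/cap policy are arbitrary for illustration. */
static size_t readahead_hint(struct ra_state *ra, uint64_t lba, size_t cnt)
{
    size_t prefetch = 0;

    if (lba == ra->next_lba) {
        ra->streak++;
        if (ra->streak >= 2)                  /* two sequential hits in a row */
            prefetch = (size_t)ra->streak * 4; /* grow the window */
        if (prefetch > 64)
            prefetch = 64;                    /* cap the window */
    } else {
        ra->streak = 0;                       /* random access: reset */
    }
    ra->next_lba = lba + cnt;
    return prefetch;
}
```

With requests for sectors 2, 3, and 4 (one sector each), the third call reports a streak and suggests prefetching the following sectors, exactly the "why not load sector 5" case.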

I see no need for zero-copy disk access. Typically, disk access is far slower than RAM access (which is far slower than cache access), so just load the data into a kernel-internal buffer and copy the requested data back to the user. This also makes it much more convenient: you can size and align the internal buffer to your needs (e.g. page size and alignment for AHCI, if I recall correctly) and just cut out the data you need for whatever driver requested it. For example, if you want to open a file on an ext2 partition, you will have to read the inode, which will be a request for a random block of 128 (?) bytes. So the disk access will not return to userspace at all; the inode stays in the kernel (for now, until fstat() is called).

The way I do it is to assign a semaphore and a set of buffers to each drive and use them to move data to and from the disks. Just as nullplan said, you don't really need to optimize disk accesses that much, since they are 'slow as hell' compared to what the CPU can process, even modern SSDs.

Thanks for all the comments so far. Maybe I am complicating something that is relatively straightforward.

I will allocate some buffers in the disk driver and use these for transferring the data. However, I am still unsure how to copy the data to its final destination, because the final destination is potentially a virtual address in another address space. As suggested, I could shift the problem to user space and let the process copy the data from the buffers to its final destination, but I see two potential problems with this.

(1) I would need a sufficiently large buffer to store all the data at once, unless I want to wake the thread every time the buffer is full.

(2) I would be forced to wake the thread as soon as the buffer is full, so that it can be emptied and reused for the next operation in the disk driver's queue. If the process has low scheduling priority, this might not be ideal.

Another option is of course to let the disk driver run in whatever address space is currently relevant to the operation in progress. This might not be too difficult to handle, but I need to think about the potential issues.

The IOMMU suggestion is also interesting, this is definitely something I will look into.

Read-ahead is already in the back of my mind, but it will be part of the VFS (which I also haven't written yet) and not the disk driver itself.

You have to proxy the returned data through your VFS process (for one thing, it has to adjust the current file offset, which may not be a multiple of the sector size). I suggest setting up a DMA buffer area (<4G in RAM) and mapping that buffer in your VFS task. That way, when the disk driver finishes, your VFS will get the sector "instantly". You'll still have to copy the data from there into the blocked process' address space, though, for two reasons:

1. It's very unlikely that the process reads a multiple of the sector size from a sector-aligned file position and that its buffer also happens to be on a page boundary. Thus you'll need to copy the sector's data anyway.

2. Your driver may read other data from the disk into the same buffer position, overwriting the previous sector's data; therefore mapping the DMA area directly into the process' buffer is not a good idea.

If you want better performance, you can do what Linux does: use all free memory as a disk cache. When the sector is ready in the DMA buffer, copy (and do not map) it into the VFS' memory along with the device id and LBA. The next time you want to read a sector, consult the cache first; if you're lucky, you don't have to call the disk driver at all. Another advantage is that (if the alignment requirements are met) you can map your VFS cache into the process' read buffer directly, because it's not in the DMA area and therefore won't be modified. Implementing read-ahead then simply becomes a matter of reading more sectors into the cache (you don't have to worry about whether the caller's buffer is full).

So the steps are:
1. process calls read(buf, size)
2. VFS calculates the LBA using the file offset and its knowledge of the FS in question
3. VFS calls the disk driver (and if async, goes on with other requests)
4. disk driver reads the sector into RAM (with DMA, using a physical address); when ready, it notifies the VFS
5. VFS copies the sector from the DMA buffer (which is mapped in its address space) into the disk cache
6. VFS copies the data from the cache into 'buf' (or changes the caller's mapping so that its 'buf' points to the same physical page as the disk cache)
7. process is woken and can read the data in 'buf'.
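The (device id, LBA)-keyed cache used in steps 5-6 can be sketched as a direct-mapped table. Everything here (slot count, eviction by simple overwrite, the names) is illustrative; a real cache would hash and track LRU:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define CACHE_SLOTS 16

struct cache_entry {
    int      valid;
    int      dev;               /* device id */
    uint64_t lba;
    uint8_t  data[SECTOR_SIZE];
};

static struct cache_entry cache[CACHE_SLOTS];

/* Returns cached sector data, or NULL on a miss (then ask the driver). */
static uint8_t *cache_lookup(int dev, uint64_t lba)
{
    struct cache_entry *e = &cache[lba % CACHE_SLOTS];
    if (e->valid && e->dev == dev && e->lba == lba)
        return e->data;
    return NULL;
}

/* Called when the driver signals completion: copy the sector out of the
   DMA buffer into the cache, since the DMA area may be reused. */
static void cache_insert(int dev, uint64_t lba, const uint8_t *dma_buf)
{
    struct cache_entry *e = &cache[lba % CACHE_SLOTS];
    e->valid = 1;
    e->dev = dev;
    e->lba = lba;
    memcpy(e->data, dma_buf, SECTOR_SIZE);
}
```

The lookup happening before any driver call is exactly the "if you're lucky, you don't have to call the disk driver at all" case.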

Oh, and I almost forgot: consider the case when you read 8 bytes from a file at seek offset 508, for example. The first 4 bytes will be at the end of sector X, and the second 4 bytes at the beginning of sector Y, yet you have to provide a contiguous 8 bytes in the reader's buffer. The file could be fragmented, so you can't assume Y = X+1. To make it more interesting, imagine that your disk driver loads sector X into the DMA buffer at offset 0-511, then the scheduler picks another driver (while yours is blocked waiting for the seek to sector Y to complete), which reads into the DMA buffer at 512-2047, and then your original disk driver reads sector Y into the DMA buffer at offset 2048. In that case you'll have to copy bytes 508-511 and 2048-2051 of the DMA buffer into the reader's 'buf'...
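The offset arithmetic for splitting such a read across sectors is mechanical. A minimal sketch (names are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* One piece of a byte-granular read: which file sector it lives in, and
   which byte range inside that sector is needed. */
struct span {
    uint64_t sector;  /* sector index within the file */
    size_t   in_off;  /* first byte needed inside that sector */
    size_t   len;     /* number of bytes to copy from it */
};

/* Decompose a byte read (off, len) into per-sector spans.
   Returns the number of spans produced. */
static size_t byte_read_spans(uint64_t off, size_t len,
                              struct span *out, size_t max_spans)
{
    size_t n = 0;
    while (len > 0 && n < max_spans) {
        size_t in_off = (size_t)(off % SECTOR_SIZE);
        size_t avail  = SECTOR_SIZE - in_off;     /* bytes left in sector */
        size_t take   = len < avail ? len : avail;
        out[n].sector = off / SECTOR_SIZE;
        out[n].in_off = in_off;
        out[n].len    = take;
        n++;
        off += take;
        len -= take;
    }
    return n;
}
```

Each span's sector index is then translated to an LBA by the filesystem code (which is where fragmentation, i.e. Y != X+1, comes in), and the byte ranges are copied into the reader's buffer back to back.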

It is not clear whether the OP is talking about a microkernel or a monolithic one. At least some answers seem to assume a microkernel. If that is the case, having each read go through a separate VFS process is a horrible idea.

It is not clear whether the OP is talking about a microkernel or a monolithic one.

True, but that doesn't really matter. The VFS can be part of a monolithic kernel or a separate process in a microkernel; either way it has to receive the data from the driver (a) and pass it to the process (b). There's a difference between (a) and (b) which may not be obvious: the disk driver returns sectors, but the application reads an arbitrary number of bytes from an arbitrary file offset. Somebody has to cut the correct amount of data out of the sector data, and that's the VFS' job.

Code:

+---+---+---+---+
|   |   |   |   |   Sectors in the disk cache (not necessarily in proper order)
+---+---+---+---+
      #######       The requested data, not sector aligned and may overlap multiple sectors

The question is more whether the OS implements paging or not. If not, then there's simply no way other than copying. If there is paging, then the kernel can juggle mappings of the disk cache into the buffer if the alignment is okay. For example, Minix is a microkernel, but it does not implement paging; therefore, when the data is ready, it copies it to the destination buffer right away. Linux, on the other hand, is monolithic but has paging: it copies the data into the globally mapped disk cache first, switches to the destination buffer's address space, and copies the data from there into the destination buffer. One may think the Minix way is faster, but actually it's not. More precisely, that's true for strictly one read, but having a disk cache eliminates subsequent sector reads and is therefore much faster in the long run.

Quote:

At least some answers seem to assume a microkernel. If that is the case, having each read go through a separate VFS process is a horrible idea.

I don't understand. In a microkernel the VFS *is* a separate userspace process; it couldn't be otherwise. This really does carry a task-switch penalty, but that's normal. It's well known that security and stability (separating subsystems and pushing them into userspace) have a toll that all microkernels have to pay.

zity wrote:

When the data resides in the VFS cache, it can easily be copied to its final destination

Yes, and you can assume the data is always there (if not, you tell the disk driver to read it, and only continue when the driver is done). In other words, you create two abstraction layers:
1. the disk driver only cares about reading sectors into the disk cache and flushing sectors from the cache to disk
2. the file read/write functions only deal with reading and writing the disk cache.
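The two layers can be demonstrated with a toy in-memory model. Everything here is illustrative (tiny 16-byte "sectors", an array standing in for the hardware, made-up function names); the point is only the layering: layer 1 moves whole sectors into the cache, layer 2 serves byte-granular reads purely from the cache:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 16   /* tiny sectors to keep the example small */
#define NUM_SECTORS 4

static uint8_t disk[NUM_SECTORS][SECTOR_SIZE];  /* stands in for hardware */
static uint8_t cache_data[NUM_SECTORS][SECTOR_SIZE];
static int     cache_valid[NUM_SECTORS];

/* Layer 1: disk driver interface. Sectors in, sectors out; knows nothing
   about files or byte offsets. */
static void cache_fill(uint64_t lba)
{
    if (!cache_valid[lba]) {
        memcpy(cache_data[lba], disk[lba], SECTOR_SIZE); /* the "DMA" */
        cache_valid[lba] = 1;
    }
}

/* Layer 2: file read. Arbitrary offset and length, served entirely from
   the cache; missing sectors are filled on demand. */
static void file_read(uint64_t off, void *buf, size_t len)
{
    uint8_t *dst = buf;
    while (len > 0) {
        uint64_t lba = off / SECTOR_SIZE;
        size_t in    = (size_t)(off % SECTOR_SIZE);
        size_t left  = SECTOR_SIZE - in;
        size_t take  = left < len ? left : len;
        cache_fill(lba);                 /* only continue once cached */
        memcpy(dst, &cache_data[lba][in], take);
        dst += take;
        off += take;
        len -= take;
    }
}
```

Note that file_read never touches `disk` directly; only cache_fill does, which is exactly the separation described above.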

There's a difference between (a) and (b), which may not be obvious: disk driver returns sectors, but application reads arbitrary number of bytes from an arbitrary file offset. Somebody has to cut out the correct amount of data from the sector data, and that's the VFS' job.

Couldn't it be the application's job instead? If you limit applications to reading/writing aligned 4kB blocks, you can optimize for the extremely common case where the filesystem uses blocks in some multiple of 4kB and cut out a lot of unnecessary work in the VFS.

Actually, it's pretty common to do this in language standard libraries, so most applications would see no difference at all.

At least some answers seem to assume a microkernel. If that is the case, having each read go through a separate VFS process is a horrible idea.

I don't understand. In a microkernel the VFS *is* a separate userspace process; it couldn't be otherwise. This really does carry a task-switch penalty, but that's normal. It's well known that security and stability (separating subsystems and pushing them into userspace) have a toll that all microkernels have to pay.

Sure, the VFS probably resides in a different process but that does not mean that reads need to go through the VFS. For example, managarm has the concept of "passthrough" operations that directly go to the FS/disk driver. Operations like open() or mmap() always go through the VFS, while read()/write() (almost) never do. The FS (and not the VFS) can easily handle caching and maintain the file offset.

Could be, but is it worth it? It seems very inconvenient. The last time I saw something like that was with OS/390 and VMS; both _optionally_ supported fixed-length records in files (variable per file, so they could differ from the underlying sector size). And let's admit, those were designed more than 40 years ago. Amiga, DOS, BeOS, MacOS (Mach), Windows, Linux, SCO UNIX, and all the BSDs allow any byte position and any buffer size, so it seems to me like a must-have feature these days.

Quote:

If you limit applications to reading/writing aligned 4kB blocks, you can optimize for the extremely common case where the filesystem uses blocks in some multiple of 4kB and cut out a lot of unnecessary work in the VFS.

True, but 1) see above on limiting the app, and 2) unfortunately you can't guarantee that all file systems are 4k-aligned. Imagine, for example, a FAT partition whose cluster size is not a multiple of 4096, or whose root directory starts on a sector that is not a multiple of 8. I think the best we can do is use a "shortcut" path when the fs consists of 4k blocks and the buffer is also aligned, and a slower but general path if not. That way we get the performance gain you mentioned without limiting the OS to specially formatted file systems.
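The shortcut-vs-general decision reduces to an alignment predicate. A minimal sketch, with illustrative names and assuming 4 KiB pages:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* The zero-copy mapping shortcut is only safe when the fs block size,
   the file offset, the request length, and the user buffer address are
   all page-aligned; otherwise fall back to the general copying path. */
static int can_map_directly(uint32_t fs_block_size, uint64_t file_off,
                            size_t len, uintptr_t user_buf)
{
    if (fs_block_size == 0 || fs_block_size % PAGE_SIZE != 0)
        return 0;   /* e.g. FAT clusters that are not 4096-multiples */
    if (file_off % PAGE_SIZE != 0 || len % PAGE_SIZE != 0)
        return 0;   /* the request itself is not block aligned */
    if (user_buf % PAGE_SIZE != 0)
        return 0;   /* destination buffer is not page aligned */
    return 1;       /* fast path: remap cache pages into the buffer */
}
```

Only when all four conditions hold can the VFS remap cache pages into the caller's buffer; everything else, including the 8-bytes-at-offset-508 case, takes the copying path.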

Quote:

Actually, it's pretty common to do this in language standard libraries, so most applications would see no difference at all.

Yes, that's a possibility. You can do that in the standard library if you send all fs and storage characteristics to the process. I haven't examined this solution, because my standard library does not store file offsets and knows nothing about filesystems, let alone storage devices (so it does not know the sector size). Has anybody implemented this? Would you mind sharing your experience with us, pros and cons? It sounds like an interesting idea, but I have questions about locking (see below).

Korona wrote:

The FS (and not the VFS) can easily handle caching and maintain the file offset.

Okay, but where exactly would that FS be in a microkernel? I think in the same process as the VFS. Other than that, you are right that read and write do not strictly have to go through the VFS; they only have to access the same data as open/close (like the file offset, which could be in the process' address space too). But I think positioning or reading past the end of the file could be problematic if you omit the FS/VFS for reads/writes (though not impossible to solve).

I'm not sure how to implement locking if the file offset is handled by the standard library in the process' memory, though (for example, when one process is writing the first 64k of a file and several others are reading the same file, and only the readers with file offset + read size < 64k should be blocked; I mean the F_SETLK fcntl command).

This is how I've implemented it: I have an FS process, which includes the VFS functions and the disk cache too. It has a VFS node for every open file and directory (vnode, containing the device reference, partition position, etc.) and a table for every opened file (openfiles, with a file offset field and references to the file's vnode and the opener's pid). On the standard library side I only have a simple integer table, which basically contains global openfiles indices, nothing more. When a process calls read() with fd 3, for example, the standard library translates fd 3 with that table (let's say to 123) and uses that for the syscall. That way my VFS can use the openfiles index 123 received from the syscall, and it does not need to know that, to the process, that's fd 3. Hope this makes sense.
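The libc-side translation table described above can be sketched in a few lines. The table size and the -1 free-slot convention are illustrative:

```c
#include <assert.h>

#define MAX_FDS 16

/* Per-process table: -1 marks a free slot, anything else is a global
   openfiles index as assigned by the FS server. */
static int fd_table[MAX_FDS];

static void fd_init(void)
{
    for (int i = 0; i < MAX_FDS; i++)
        fd_table[i] = -1;
}

/* Called after the FS server returns a global index from open():
   find a free local fd and bind it to the global index. */
static int fd_install(int global_idx)
{
    for (int fd = 0; fd < MAX_FDS; fd++) {
        if (fd_table[fd] == -1) {
            fd_table[fd] = global_idx;
            return fd;
        }
    }
    return -1; /* process is out of fds (EMFILE) */
}

/* Called by read()/write() to build the syscall argument. */
static int fd_translate(int fd)
{
    if (fd < 0 || fd >= MAX_FDS)
        return -1; /* EBADF */
    return fd_table[fd];
}
```

The FS server then indexes its openfiles table with the translated value and never needs to know the process-local fd numbering.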

At least some answers seem to assume a microkernel. If that is the case, having each read go through a separate VFS process is a horrible idea.

I don't understand. In a microkernel the VFS *is* a separate userspace process; it couldn't be otherwise. This really does carry a task-switch penalty, but that's normal. It's well known that security and stability (separating subsystems and pushing them into userspace) have a toll that all microkernels have to pay.

Sure, the VFS probably resides in a different process but that does not mean that reads need to go through the VFS. For example, managarm has the concept of "passthrough" operations that directly go to the FS/disk driver. Operations like open() or mmap() always go through the VFS, while read()/write() (almost) never do. The FS (and not the VFS) can easily handle caching and maintain the file offset.

No, there's no reason why read() and write() have to go through the VFS at all on a microkernel. Forcing them to go through the VFS just adds unnecessary overhead with no benefit to security or stability (on a kernel with capability-based message passing at least; on a kernel with entirely connectionless message passing it's a lot trickier).

Under UX/RT, read(), write(), seek(), and the like will be considered transport-layer calls, which use kernel IPC to communicate directly with the other end, whereas open(), close(), and the like will be VFS-layer calls, which call the VFS using read() and write() over a permanently open file descriptor. The only time read() and write() will call the VFS component is on directories (which will be built in memory by the VFS based on those of the underlying filesystem). All message passing will be done over the FD-oriented transport layer, with no concept of low-level non-file-based message passing exposed by the standard library at all, although lower-level read()/write()-like calls that expose the underlying message registers and long message buffer will be provided (the traditional, register-oriented, and buffer-oriented APIs will all be fully interoperable, with messages being copied between formats as necessary).

Octocontrabass wrote:

bzt wrote:

There's a difference between (a) and (b), which may not be obvious: disk driver returns sectors, but application reads arbitrary number of bytes from an arbitrary file offset. Somebody has to cut out the correct amount of data from the sector data, and that's the VFS' job.

Couldn't it be the application's job instead? If you limit applications to reading/writing aligned 4kB blocks, you can optimize for the extremely common case where the filesystem uses blocks in some multiple of 4kB and cut out a lot of unnecessary work in the VFS.

Actually, it's pretty common to do this in language standard libraries, so most applications would see no difference at all.

Under UX/RT, a read()-family call in a client will just request a specified number of bytes, and the server will then use a write()-family call to reply with the data; a write()-family call in a client will send the raw data to be written, with the server obtaining it through a read()-family call. It will be up to the server to handle the data appropriately. There's no good reason to overcomplicate the transport or VFS layers with concepts like disks and devices, so UX/RT will leave handling of disks and the like up to individual servers (libraries will be provided to make writing servers easier).

bzt wrote:

Korona wrote:

The FS (and not the VFS) can easily handle caching and maintain the file offset.

Okay, but where exactly would that FS be in a microkernel? I think in the same process as the VFS. Other than that, you are right that read and write do not strictly have to go through the VFS; they only have to access the same data as open/close (like the file offset, which could be in the process' address space too). But I think positioning or reading past the end of the file could be problematic if you omit the FS/VFS for reads/writes (though not impossible to solve).

Requiring all filesystems to be in the same process as the VFS sounds like a pretty severe limitation to me. It kind of defeats many of the security and stability advantages of a microkernel.

bzt wrote:

I'm not sure how to implement locking if the file offset is handled by the standard library in the process' memory, though (for example, when one process is writing the first 64k of a file and several others are reading the same file, and only the readers with file offset + read size < 64k should be blocked; I mean the F_SETLK fcntl command).

AFAIK Unix advisory locking normally just affects the locking APIs and doesn't cause read() or write() to block, so that shouldn't cause any problem with bypassing the VFS in read()/write(). Under UX/RT seek() will just send the requested offset directly to the server, and the offset will be tracked by the standard library (there will be an ftell()-like function to obtain the current offset).

There are a lot of things here which seem not to be well defined. We need to clarify these before we continue. Let me suggest a simple exercise:

Draw a picture with boxes: requester, fs server, disk driver.

Now place each of the following in one of them: open() handler, read()/write() handler, seek() handler, common filesystem code, filesystem-specific code, offset checks, locking checks, LBA calculation, buffer/sector splitting, disk cache, vnode table.

What's not clear to me:

"VFS" - what do you mean by Virtual FileSystem? For me that's the layer between the file abstraction (stdlib file functions) and the disk driver (sector read/write functions). Filesystem specific code is part of that (regardless of the implementation details), because they have to know and manipulate VFS structures intimately (by that I mean they use a much-much more specific and detailed interface than the user processes to interact with vnodes, regardless how that is actually implemented (direct memory access, messages, whatever)). This layer does not necessarily mean one process, could be more.

"read()/write() going through the VFS" - for me that means to pass the message to the _same_ process (server, service, task etc.) which have been used to open the file and which in turn sends another message further to another task. What do you mean exactly by saying going through the VFS?

"communicate directly with the other end" - what does this mean? Disk driver sending messages directly to the requesting process? Is "going through the VFS" an opposite of "communicate directly"?

"It will be up to the server to handle the data appropriately" - what server that is if not the one responsible for the VFS? How does this relate to the previous statement? That phrase "communicate directly" suggests the request does not go through any server, so who is going to handle the data?

"Requiring all filesystems to be in the same process as the VFS sounds like a pretty severe limitation to me." - why? What limitation does it pose? What else would you suggest? If the filesystems are handled in separate processes, which is the one that you call "server"? One common VFS process and one for each specific FS would double the overhead of message passing. And if not, you suggest to repeat all common "appropriate data handling" code in each and every FS specific process?

"Under UX/RT seek() will just send the requested offset directly to the server, and the offset will be tracked by the standard library" - if offset is tracked on the standard library's side (in the requester process' address space) then why do you need seek() to send the offset to the server in the first place? And if you send the offset to the server anyway, why don't you just handle it there, on the server side? What is the benefit of creating a redundant copy of it in stdlib?

"doesn't cause read() or write() to block" - imho should if file is not opened with NONBLOCKING flag, but okay, let's say they just return EBUSY or EAGAIN errno (or any other method to inform the requester that read()/write() failed because of locking), it doesn't really matter. The point is, how do you detect the requested file block is locked condition if open file structures are not stored in a central place, but in stdlibs, each in it's own address space?

Please don't think this is an attack; it's not. I just see many contradictions (and a design must be straightforward and free of contradictions), which is why I'm asking.
