Implementation Details

Choice of the Emulation Methods

The intent of the project was to get reliable read-write access to
NTFS partitions. There are several possible
ways to achieve that:

Virtual Machine Running the Original W32 Subsystem

Create a virtual-hardware PC and run the original W32 binaries,
including their boot loader etc. Disk device access would be passed in as
a virtual IDE disk (= hard disk drive). The file access API would be
implemented either by escaping from the virtual machine through some
trapped instruction while using the W32 file access API, or by using the
standard W32 SMB (Server Message Block) network access through some
virtual network card. The latter network solution is close to what is
already available today: running a full-blown, disk-sharing, real
Microsoft Windows NT inside a virtual
machine emulator such as VMware.

"ntoskrnl.exe" Inside Virtual Address Space

This solution was chosen by the project. The binary filesystem driver and
also the ntoskrnl.exe binary file are required.
Unfortunately ntoskrnl.exe expects native
PC hardware, which is missing when it runs inside a regular UNIX user space
process; instructions touching such hardware must therefore be trapped and
emulated or ignored, case by case.

Also, the initialization code of ntoskrnl.exe is not executed by this
project since it expects to get full PC hardware access privileges. As
a result some data structures never get initialized by it (accesses to them
need to be trapped later at the runtime stage). Some of the missing
initializations are solved by wrapping API functions.

Filesystem Driver Inside Virtual Address Space

Unlike the previous method, here we do not use
even ntoskrnl.exe, as the complete kernel part of
W32 is emulated from the project source
files. The cdfs.sys driver was successfully run
in this manner in former versions of this project, but the possibility
to run without ntoskrnl.exe was dropped since it
had no licensing gains (you need the original
Microsoft Windows NT files at least for
the filesystem driver itself) and the emulation of undocumented parts
that can be reused from the ntoskrnl.exe binary was
a pain.

Sandboxing of W32 Filesystem

The emulated W32 environment running the original W32 filesystem driver
is separated from the rest of the UNIX OS. This achieves the following goals:

Restartable: The W32 driver can be restarted in a clean state if it crashes

Secure: Malicious W32 code cannot affect the security of the UNIX OS

Stable: Buggy W32 code cannot crash any part of the UNIX OS

Sandboxing is provided with the following attributes:

standalone UNIX process with separate memory space

chroot(2) into an empty directory to prevent any UNIX OS filesystem access

setuid(2) to its own user/group to prevent interaction with UNIX processes

setrlimit(2) to limit system resources available to the W32 environment

the only connection with the UNIX OS is by CORBA/ORBit RPC
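
The attribute list above can be illustrated by a minimal C sketch. It is
not taken from the project sources: the function names, the choice of
limits and the order of the calls are assumptions made for illustration
only. The sketch relies on the fact that chroot(2) requires root
privileges, so it has to happen before the privilege drop.

```c
#define _DEFAULT_SOURCE          /* for chroot(2) on glibc */
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: cap one resource of the current process.
 * Both soft and hard limits are lowered, so they cannot be raised again
 * by an unprivileged process. Returns 0 on success, -1 on failure. */
static int sketch_limit(int resource, rlim_t max)
{
    struct rlimit lim;

    if (getrlimit(resource, &lim) != 0)
        return -1;
    /* On Linux RLIM_INFINITY compares greater than any finite limit. */
    if (lim.rlim_cur > max)
        lim.rlim_cur = max;
    if (lim.rlim_max > max)
        lim.rlim_max = max;
    return setrlimit(resource, &lim);
}

/* Hypothetical helper: enter the jail and drop privileges.
 * The group is changed before the user, because after setuid(2) the
 * process would no longer be allowed to change its group. */
static int sketch_enter_sandbox(const char *empty_dir, uid_t uid, gid_t gid)
{
    if (chroot(empty_dir) != 0 || chdir("/") != 0)
        return -1;
    if (setgid(gid) != 0 || setuid(uid) != 0)
        return -1;
    return 0;
}
```

A sandbox slave would call such helpers once at startup, before any W32
code is loaded; every failure should be treated as fatal since a partially
entered sandbox gives no protection.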

[Figure: Project Components Architecture]

This level of security is almost the same as that provided by
emulated virtual machines such as
VMware.

[Figure: Sandboxing Scheme]

The project can also be used in non-sandboxed mode with the
--no-sandbox option, as it is easier to debug
without CORBA/ORBit RPC. In this case the
DirectorySlave/FileSlave
options are used directly instead of their
DirectoryParent/FileParent
peers.

"patched" vs. "unpatched" Libraries

A library is called patched if we require
loading its original binary code file. The project needs to patch it to be
able to trap all the function entry points. The only currently
patched library of this project is
ntoskrnl.exe.

A library is called unpatched if no original
binary code is needed since all of its functions are completely emulated by
the native implementations of this project.
The typical unpatched representative is
hal.dll, as it specializes in hardware
dependent code and therefore must be completely replaced by this project
running in the GNU/Linux operating system environment. Early versions of
this project also had a full unpatched native implementation of
ntoskrnl.exe but this no longer applies.

Memory Management

Original Microsoft Windows NT
architecture uses two address space areas – user space and kernel space.
User space is mapped in the range 0x00000000
to 0x7FFFFFFF, kernel space is mapped in the
range 0x80000000
(KERNEL_BASE in ReactOS sources) to
0xFFFFFFFF. All these virtual memory ranges
represent addresses after their MMU (Memory Management Unit) mapping, of
course. More discussion can be found in the
description
by Microsoft.

This project runs in a single virtual address space used both for the UNIX
user space process part and for the W32 kernel space. Therefore this
project defines that the W32 kernel runs in the whole range
0x00000000 to
0xFFFFFFFF since no special assumptions can be made
about the mapping of the UNIX user space process. No W32 user space
exists in this project. This approach also removes the need for any special
memory moving operations between W32 kernel space and W32 user space memory
areas (such as MmSafeCopyToUser()).
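
The difference between the two layouts can be shown with a small C sketch.
All identifiers below are hypothetical and exist only for this
illustration; in particular sketch_copy_to_user() is not the project's
implementation of any W32 function, it merely shows why a kernel-to-user
copy degenerates to a plain memory copy once there is no W32 user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Start of kernel space in the original 32-bit NT layout. */
#define SKETCH_NT_KERNEL_BASE UINT32_C(0x80000000)

/* Original NT layout: only the upper 2 GB belong to the kernel. */
static bool sketch_nt_is_kernel_address(uint32_t addr)
{
    return addr >= SKETCH_NT_KERNEL_BASE;
}

/* Layout described above: every address counts as W32 kernel space. */
static bool sketch_flat_is_kernel_address(uint32_t addr)
{
    (void)addr;
    return true;
}

/* With no W32 user space there is no boundary to check when copying,
 * so a "copy to user" is an ordinary memcpy() that always succeeds. */
static int sketch_copy_to_user(void *dest, const void *src, size_t count)
{
    memcpy(dest, src, count);
    return 0;
}
```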

Supported Binary Formats

The native W32 binary format is identified as
PE-32 (Portable Executable 32-bit). Such
files have all the usual extensions such as
.sys, .exe,
.dll etc. PE-32
loading support was already implemented by ReactOS; its memory mapping
specifics just had to be ported to the GNU/Linux environment by this project.
This loading support does not (yet) cover importing of debug symbols from
W32 .PDB (Program DataBase) files in a GNU/Linux
ABI (Application Binary Interface) compatible way.

This project also supports transparent loading of the UNIX
.so (Shared Object file) binary format. If you
have W32 source files for some W32 library you can try to compile them with
GCC to get a shared library with GNU/Linux ABI compatible debug information
(GCC option -ggdb3 recommended). Beware of
possible compilation problems, as Microsoft
C code expects exception handling to be
supported by the compiler (definitely not the case for the plain C compiler
of GCC). All the exception catching code should be discarded, as any
generated exceptions are always fatal when
such a driver is running in the scope of this project. You can use the
following script of this project to compile W32 filesystem source files as
a UNIX .so:
src/w32-mod/ext2fsd.so-build.sh
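
How a loader can tell the two formats apart is sketched below in C. This
is an illustration based only on the publicly documented file signatures
(the ELF magic bytes, and the DOS "MZ" header whose field at offset 0x3C
points to the "PE\0\0" signature); it is an assumption that a check of
this kind is involved, and the identifiers are hypothetical. The sketch
does not distinguish 32-bit from 64-bit images.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sketch_format { SKETCH_UNKNOWN, SKETCH_PE, SKETCH_ELF };

/* Classify a module image by the signature at its start. */
static enum sketch_format sketch_detect_format(const unsigned char *buf,
                                               size_t len)
{
    /* ELF files (UNIX .so) start with 0x7F 'E' 'L' 'F'. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return SKETCH_ELF;

    /* PE files start with a DOS header: "MZ", and at offset 0x3C
     * a little-endian 32-bit file offset of the "PE\0\0" signature. */
    if (len >= 0x40 && buf[0] == 'M' && buf[1] == 'Z') {
        uint32_t off = (uint32_t)buf[0x3c]
                     | (uint32_t)buf[0x3d] << 8
                     | (uint32_t)buf[0x3e] << 16
                     | (uint32_t)buf[0x3f] << 24;

        /* Bounds check first: the offset comes from untrusted data. */
        if (off <= len - 4 && memcmp(buf + off, "PE\0\0", 4) == 0)
            return SKETCH_PE;
    }
    return SKETCH_UNKNOWN;
}
```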

Be aware of some differences between using a
PE-32 binary format file and a
.so format file.
PE-32 files use the appropriate W32 specific
cdecl/stdcall/fastcall call types, whereas a
.so must be compiled completely with the standard
UNIX cdecl call type semantics.
With a .so, native function implementations do not need
to be explicitly exported by captivesym as they
are resolved automatically by the UNIX dynamic system linker. It may
therefore be surprising that you will have to fix all such missing symbol
exports when you advance during development from the debugging
.so file to the production version with the
original PE-32 binary file.

At Most One Mounted Filesystem

The project technically supports only one (exactly one...) mounted
filesystem device and only one filesystem driver. It would not be
complicated to support multiple disks and multiple loaded filesystem
modules, but as they would share the address space it would only bring
possible complications to bug reports and to the bug solving
itself. It was considered saner to support multiple W32
mounted disks by completely separate project instances running in
different UNIX processes, communicating from their sandboxes via the
CORBA sandbox interface. This sandboxing
feature is not yet deployed although its code is already prepared.

The project also does not support any state cleanup that would make it
possible to load filesystem A,
clean up A and load a different
filesystem B in the same process address
space. This is in line with avoiding the possible debugging
complications noted above. Despite this you still must call the function
captive_shutdown() to flush all the pending
filesystem buffers to the disk. After calling
captive_shutdown() the process address space is
no longer usable for any further project operations and the process is
expected to be terminated in a manner compatible with its driving
CORBA sandbox interface control master.

Each sandbox executing the untrusted W32 binary filesystem driver code
is connected through its
CORBA sandbox interface at three points: the upper
layer libcaptive-specific filesystem API,
the bottom layer GIOChannel
device access, and the transfer of GLib logging
messages/warnings/errors out of the sandbox to the user.

Multithreading and Multiple Processors

The W32 platform relies on thorough parallelism in its architecture. It
must lock all its objects to maintain coherence in the presence of
multithreading and multiple processors. Since the author of this project
considers any parallel execution a serious obstacle for debugging, the whole
project architecture was designed to prevent any nondeterministic behaviour.
Therefore this project always emulates a uniprocessor
Microsoft Windows NT kernel
(KeNumberProcessors symbol is always 1),
everything runs in the single initial thread/process and all the filesystem
operations are performed as synchronous
("synchronous" by flags
FILE_SYNCHRONOUS_IO_ALERT,
FO_SYNCHRONOUS_IO,
IRP_SYNCHRONOUS_API,
IRP_SYNCHRONOUS_PAGING_IO,
forced TRUE result of
IoIsOperationSynchronous()
etc.).
For several cases needed only by ntfs.sys,
asynchronous access
(STATUS_PENDING return code) had to be supported: parallel
execution is emulated by GLib
g_idle_add_full() with
g_main_context_iteration() called during
KeWaitForSingleObject().
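
The idea of replacing a second thread by deferred functions that are
driven from inside the wait call can be sketched in plain C. To stay
self-contained the sketch swaps the GLib main loop for a tiny hand-rolled
queue; sketch_idle_add() and sketch_iteration() only play the roles that
g_idle_add_full() and g_main_context_iteration() have in the text above,
and every identifier here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_IDLE 16

typedef void (*sketch_idle_fn)(void *data);

/* Fixed-size ring buffer of deferred functions. */
static struct { sketch_idle_fn fn; void *data; } sketch_queue[SKETCH_MAX_IDLE];
static size_t sketch_head, sketch_count;

/* Defer work instead of handing it to a second thread. */
static bool sketch_idle_add(sketch_idle_fn fn, void *data)
{
    size_t tail;

    if (sketch_count == SKETCH_MAX_IDLE)
        return false;
    tail = (sketch_head + sketch_count) % SKETCH_MAX_IDLE;
    sketch_queue[tail].fn = fn;
    sketch_queue[tail].data = data;
    sketch_count++;
    return true;
}

/* Run one deferred function; returns false when nothing is pending. */
static bool sketch_iteration(void)
{
    sketch_idle_fn fn;
    void *data;

    if (sketch_count == 0)
        return false;
    fn = sketch_queue[sketch_head].fn;
    data = sketch_queue[sketch_head].data;
    sketch_head = (sketch_head + 1) % SKETCH_MAX_IDLE;
    sketch_count--;
    fn(data);   /* may itself queue further work */
    return true;
}

struct sketch_event { bool signalled; };

/* Stand-in for the wait call: run deferred work until the event is set.
 * Returns false if the work runs out first; with real threads such
 * a wait would simply never return. */
static bool sketch_wait_for_event(struct sketch_event *ev)
{
    while (!ev->signalled)
        if (!sketch_iteration())
            return false;
    return true;
}

/* Example deferred completion routine: signals the event it is given. */
static void sketch_complete(void *data)
{
    ((struct sketch_event *)data)->signalled = true;
}
```

Execution stays fully deterministic: deferred functions run only at
well-defined points (inside the wait) and always in the order in which
they were queued.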

Since real W32 parallel threading may still be needed in the future,
all the code that would be affected by W32
multithreading capability is marked by a
TODO:thread comment.

Multiple processors (SMP) support will never need to be implemented
since uniprocessor W32 kernels apparently run the filesystem driver modules
fine. As this project implements only the uniprocessor W32 kernel all the
processor locking functions and structures such as
KSPIN_LOCK etc. can be safely implemented as
no-operations.
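
A minimal C sketch of such no-operation locking follows. The types and
function names are hypothetical stand-ins and not the project's actual
declarations. The lock word is kept only so that unbalanced calls can be
caught by assertions; it never blocks anything, and the sketch pretends
that the interrupt level never changes.

```c
#include <assert.h>

typedef unsigned long sketch_spin_lock;   /* stand-in for KSPIN_LOCK */
typedef unsigned char sketch_irql;        /* stand-in for the IRQL type */

static void sketch_initialize_spin_lock(sketch_spin_lock *lock)
{
    *lock = 0;
}

/* With a single thread on a single processor nobody else can hold the
 * lock, so acquiring never waits. */
static void sketch_acquire_spin_lock(sketch_spin_lock *lock,
                                     sketch_irql *old_irql)
{
    assert(*lock == 0);     /* a held lock here means a recursion bug */
    *lock = 1;
    *old_irql = 0;          /* sketch: report the lowest level */
}

static void sketch_release_spin_lock(sketch_spin_lock *lock,
                                     sketch_irql old_irql)
{
    (void)old_irql;
    assert(*lock == 1);     /* releasing an unheld lock is a bug */
    *lock = 0;
}
```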

Asynchronous callbacks registered for
IO_WORKITEMs are passed as GLib idle
functions by g_idle_add_full(). Although they
will probably never be executed during the project's non-interactive batch
executions, it is the responsibility of the W32 driver implementation to
complete all the pending tasks before its W32 shutdown. Such W32 shutdown
is done during cleanup of the project's execution by
captive_shutdown().

Paranoia Checks

A common approach in software development is to implement
many internal sanity checks during the development stage but to produce the
most optimized final release product without those debugging checks.

Facilities for these practices can be seen in the standard
C include files for example as function
assert() which gets disabled by the
NDEBUG symbol used during the final optimized
executable compilation. This project uses Gnome GLib messaging subsystem
offering sanity checks discarded by symbols
G_DISABLE_ASSERT and
G_DISABLE_CHECKS.
Microsoft also produces two versions of
its products – regular customers use the "free build" (also
called "retail") while the programmers should develop their code
on the "checked build" product releases.

As this project will always run unknown binary code of proprietary W32
filesystem drivers, the code can never be trusted. Such code even runs in
the same unprotected address space as its controlling UNIX code. Since
there is not enough documentation for the W32 components of the system, and
the documentation that exists is often misleading, the emulation can never
be considered 100% accurate. Even in the final releases all the sanity checks
implemented in this project should remain active as all the project's code
always interacts with unknown and untrusted W32 binaries.
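
A check that stays active in release builds can be sketched in C as
follows. The macro is hypothetical and only resembles the kind of
precondition checks GLib offers (such as g_return_val_if_fail());
unlike assert() it does not depend on NDEBUG, and the function using
it is an invented example of an emulated entry point.

```c
#include <assert.h>
#include <stdio.h>

/* Always-on precondition check: logs the failed condition and makes the
 * enclosing function return `retval`. Not affected by NDEBUG. */
#define SKETCH_REQUIRE(cond, retval)                              \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "%s:%d: check failed: %s\n",          \
                    __FILE__, __LINE__, #cond);                   \
            return (retval);                                      \
        }                                                         \
    } while (0)

/* Invented example: validate everything the untrusted driver passes in
 * before acting on it. Returns 0 on success, -1 on a failed check. */
static int sketch_set_file_size(const void *file_object, long long new_size)
{
    SKETCH_REQUIRE(file_object != NULL, -1);
    SKETCH_REQUIRE(new_size >= 0, -1);
    /* ... the real work would go here ... */
    return 0;
}
```

Returning an error instead of aborting is a design choice of this sketch;
it lets the caller decide whether the failed check is fatal.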

Microsoft Windows NT code is written in
a foolproof style: it accepts even invalid input values, which
it usually silently corrects. This makes long-term debugging a pain as it
hides the sources of problems. "Checked build" releases were probably
designed to fix this flaw by strict consistency checks, but they did not
reach this goal as such checks are usually missing in the code.

This project has strict consistency checks across all the code to make
the debugging phase easy enough. A failed sanity check is not always
a bug – sometimes it just means the real W32 binary code is more
lenient than could be expected according to the documentation, and
such a sanity check gets removed for the next version build. In other cases
a failed sanity check means the execution path for some unexpected
combination of arguments was not yet implemented by this project. It may
also mean a bug, of course...

Last but not least – never miss a possible sanity check, as its
later removal is an order of magnitude cheaper than an uncaught
invalid assumption. A failed assertion is not always a bug, although it
has to be fixed, of course.

STATUS_LOG_FILE_FULL

After writing approximately 1 MB of data to an NTFS test partition, the
NTFS driver returns the
STATUS_LOG_FILE_FULL error code for any further write requests.
Apparently this is caused by the fact that this project is
single-threaded and ignores the spawning
of a parallel journalling thread during ntfs.sys
initialization.

Fortunately ntfs.sys will clear its
journalling log file during filesystem unmount. This project will therefore
remount the volume if STATUS_LOG_FILE_FULL
is detected, to work around the missing journalling thread.
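
The remount workaround can be sketched in C with a toy volume model. The
volume structure, its capacity and all function names are hypothetical;
only the numeric value of STATUS_LOG_FILE_FULL is taken from the public
NTSTATUS definitions. The sketch retries exactly once, so a request that
cannot fit even into an empty log still reports the error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STATUS_SUCCESS       UINT32_C(0x00000000)
#define SKETCH_STATUS_LOG_FILE_FULL UINT32_C(0xC0000188)

/* Toy volume: the journal fills up and only an unmount empties it. */
struct sketch_volume {
    size_t log_used;
    size_t log_capacity;
    unsigned remounts;
};

/* Stand-in for passing the write request to the filesystem driver. */
static uint32_t sketch_raw_write(struct sketch_volume *vol, size_t len)
{
    if (len > vol->log_capacity - vol->log_used)
        return SKETCH_STATUS_LOG_FILE_FULL;
    vol->log_used += len;
    return SKETCH_STATUS_SUCCESS;
}

/* Unmount and mount again; the unmount clears the journal. */
static void sketch_remount(struct sketch_volume *vol)
{
    vol->log_used = 0;
    vol->remounts++;
}

/* Write with one transparent remount-and-retry on a full journal. */
static uint32_t sketch_write(struct sketch_volume *vol, size_t len)
{
    uint32_t status = sketch_raw_write(vol, len);

    if (status == SKETCH_STATUS_LOG_FILE_FULL) {
        sketch_remount(vol);
        status = sketch_raw_write(vol, len);
    }
    return status;
}
```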

Similar behaviour can be seen when writing compressed files:
the file gets written uncompressed and its compression will proceed only
during the final filesystem unmount.

ParentConnector Volume Remounter

The sandbox master component of this project controls the restarting of
its sandbox slaves containing the W32 filesystem. The goal of the
ParentConnector component is to transparently
provide a persistent view of files and directories across restarts of the
sandboxed slaves.

In the case of read-only operations this would be simple, as we would only
need to save the state of the currently opened filesystem objects with their
read file/directory offsets. Write operations can be handled like the
read-only ones as long as all the operations are successful. In the case of
a W32 filesystem crash we lose all the past write operations. If we redid
all the write operations we could very easily trigger the same crash.
Therefore we write the message:

Filesystem crash broke dirty object: FILE/PATH/NAME

to syslog and refuse any further operations with this
object.

[Figure: Parent Connector]

HANDLE represents a W32 object open in an
existing W32 filesystem. A HANDLE is created
on demand according to the saved state of the object (such as its
pathname). Even the whole VFS sandbox slave
is spawned on demand if some object operation requests it.

A W32 filesystem crash can obviously occur at any moment - it generates
the GObject signal abort. A successful filesystem unmount
(even as part of a remount operation) must first be preceded by the
detach signal to close all existing
W32 HANDLEs. After they are closed the filesystem
gets the unmount request. Only if all the close operations
succeeded, including the final filesystem unmount, can the signal
cease be activated to notify all the
dirty (written) objects that they are now clean. During this
cease signal the project will also
flush the sandbox commit buffer to its
underlying media.

Objects that were never written remain in the clean
state and they can be transparently reopened even if a W32 filesystem crash
occurs.
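
The object life cycle described in this section can be summarized by
a small C state machine. It is a sketch only: the real component works
with GObject signals, while here the abort, detach and cease signals are
modelled as plain functions, and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_state { SKETCH_CLEAN, SKETCH_DIRTY, SKETCH_BROKEN };

struct sketch_object {
    enum sketch_state state;
    bool handle_open;   /* a W32 HANDLE currently exists in the slave */
};

/* Any operation needs a HANDLE, which is (re)created on demand from the
 * saved state. Broken objects refuse all further operations. */
static bool sketch_operate(struct sketch_object *obj, bool is_write)
{
    if (obj->state == SKETCH_BROKEN)
        return false;
    obj->handle_open = true;
    if (is_write)
        obj->state = SKETCH_DIRTY;
    return true;
}

/* "abort": the slave crashed, so HANDLEs are gone and uncommitted writes
 * are lost; dirty objects become permanently broken. */
static void sketch_on_abort(struct sketch_object *obj)
{
    obj->handle_open = false;
    if (obj->state == SKETCH_DIRTY)
        obj->state = SKETCH_BROKEN;
}

/* "detach": close the HANDLE ahead of an unmount. */
static void sketch_on_detach(struct sketch_object *obj)
{
    obj->handle_open = false;
}

/* "cease": all closes and the unmount succeeded, so written data is safe
 * and dirty objects become clean again. */
static void sketch_on_cease(struct sketch_object *obj)
{
    assert(!obj->handle_open);   /* detach must have come first */
    if (obj->state == SKETCH_DIRTY)
        obj->state = SKETCH_CLEAN;
}
```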