> The algorithm I had in mind was:
> Read N bytes (or the whole file, I'm not sure about it yet).
> If this byte is part of a prefix instruction, parse it, else continue
> to the opcode, and so on and so on.

Imagine how a physical CPU processes instructions. There is no room
for guessing; decoding an instruction is a straightforward process.

> Though, I ran into several "problems" in mind:
>
> 1. Which data structure should store the values I read? A hash table
> or a tree? Or a combination of both (a trie)? Should the tree be
> balanced? If not, will it cost in efficiency, or will balancing it
> cost in efficiency?

Whenever you start decoding an instruction, create a data structure for
it. Put instruction prefixes into that structure, for later use.
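In code, such a per-instruction record might look like this (a minimal Python sketch; the names and the `start_decode` helper are my own, not from any particular disassembler -- only the prefix byte values come from the x86 encoding):

```python
# A minimal sketch: accumulate legacy prefix bytes into a
# per-instruction record before reading the opcode byte.
from dataclasses import dataclass, field

# x86 legacy prefixes: operand/address size (66/67), LOCK (F0),
# REPNE/REP (F2/F3), and the segment overrides (CS/SS/DS/ES/FS/GS).
PREFIX_BYTES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,
                0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}

@dataclass
class Insn:
    prefixes: list = field(default_factory=list)  # raw prefix bytes, in order
    opcode: int = 0                               # first opcode byte
    length: int = 0                               # bytes consumed so far

def start_decode(code: bytes, offset: int) -> Insn:
    """Consume any prefixes, then record the first opcode byte."""
    insn = Insn()
    i = offset
    while i < len(code) and code[i] in PREFIX_BYTES:
        insn.prefixes.append(code[i])
        i += 1
    insn.opcode = code[i]
    insn.length = i - offset + 1
    return insn
```

For example, `start_decode(b"\x66\x90", 0)` records the `0x66` operand-size prefix and `0x90` as the opcode; later stages (ModRM, displacement, immediate) would extend the same record.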

> 2. What about invalid instructions? Should I strip them the moment I
> detect they're invalid, or should they be stored FFU?

The separation of instructions from data and garbage is one of the most
annoying tasks in a disassembler. In real-life x86 code you should also
be prepared for overlaid instructions, where a jump into the middle of
another instruction results in the execution of a different instruction.
There may exist different interpretations of the same bytes, depending
on the entry points taken.
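A tiny hand-decoded example of such overlaid instructions (my own illustration; verify the encodings against an x86 opcode table):

```python
# The same three bytes, decoded from two different entry points.
code = bytes([0xEB, 0xFF, 0xC0])

# From offset 0: EB FF is "jmp short -1" relative to the next
# instruction, so the target is 2 + (-1) = 1 -- inside the jmp itself.
# From offset 1: FF C0 decodes as "inc eax".
# Both readings are valid; only the entry point decides which one runs.
assert code[0:2] == b"\xeb\xff"   # jmp short, target: offset 1
assert code[1:3] == b"\xff\xc0"   # inc eax
```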

Even if an instruction decodes fine by itself, it may be invalidated
later by contextual constraints, e.g. when it turns out to be
unreachable or when the following bytes contain invalid instructions.

> 3. Which data structure should hold the final result of the
> disassembled instruction?

For x86 code I used a fixed structure, containing only the essential
information, so that later passes can bypass decoding the instruction
again.

More important is the storage of the (possibly) valid instruction
sequences (address ranges), which form the basic blocks (BBs) in control
flow analysis. You'll have to start from known entry points into the
code, adding further entry points from references in immediate or
indirect instruction arguments. Every jump, call or interrupt terminates
a BB. You cannot be sure that a call or interrupt returns at all, nor
that it will always return to the point where the call was made.
Sometimes the bytes after a call contain subroutine arguments, which are
processed and skipped by the callee, or the return address is modified
in some other way. The Microsoft coders have tons of dirty tricks in
their pockets, which can (or shall?) confuse any intruder into their
code. I'd suggest that you start with 32 (or 64?) bit protected mode
code (flat model), which tends to contain fewer tricky constructs than
I found in 16-bit and real-mode code. Otherwise the Intel segmentitis
will bite you...
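The BB discovery described above can be sketched as a worklist over entry points (a toy Python sketch; the four-opcode decoder and all names are my own simplification, nowhere near a real x86 decoder):

```python
def decode(code, i):
    """Return (length, kind, target) for a tiny opcode subset."""
    op = code[i]
    if op == 0x90:                                  # nop: falls through
        return 1, "fall", None
    if op == 0xC3:                                  # ret: ends the block
        return 1, "stop", None
    if op == 0xEB:                                  # jmp short rel8
        rel = int.from_bytes(code[i+1:i+2], "little", signed=True)
        return 2, "jmp", i + 2 + rel
    if op == 0x74:                                  # je rel8
        rel = int.from_bytes(code[i+1:i+2], "little", signed=True)
        return 2, "branch", i + 2 + rel
    raise ValueError(f"unknown opcode {op:#x}")

def find_blocks(code, entries):
    """Trace from known entry points; every jump/ret terminates a BB."""
    worklist, seen, blocks = list(entries), set(), []
    while worklist:
        start = worklist.pop()
        if start in seen:
            continue
        seen.add(start)
        i = start
        while True:
            length, kind, target = decode(code, i)
            i += length
            if kind == "fall":
                continue                   # still inside the same BB
            if kind in ("jmp", "branch"):
                worklist.append(target)    # branch target: new entry point
            if kind == "branch":
                worklist.append(i)         # conditional: fall-through too
            blocks.append((start, i))      # the BB ends here
            break
    return sorted(blocks)
```

For the byte string `90 74 01 C3 90 C3` and entry point 0, this yields the three blocks (0, 3), (3, 4) and (4, 6): the conditional jump at offset 1 terminates the first block and contributes both its target and its fall-through as new entry points.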

After a first pass over all initial and added entry points, you'll
typically end up with many white spots in the segments, which will have
to be explored in subsequent passes, by heuristics or guided by the
user. I strongly suggest that you have a look at IDA, the best
Interactive DisAssembler I've ever seen :-)

Later on you'll have to deal with the data inside an executable. Much
time will be spent reasoning about the data type of non-code bytes,
which may be [arrays of data structures containing] pointers, and
tracking the data flow between registers, the stack and other writeable
data areas. But that's too much for a first approach to the universe of
disassemblers ;-)

> 5. Should the disassembler itself be multi-threaded, or one program
> which does everything step by step? And if it will be multi-threaded,
> how can I handle or parse different instructions, or handle
> synchronization?

I cannot see much use for multiple threads in a disassembler, except
for separating a GUI from the disassembler itself. Everything else
will lead to killing dead-end threads from other threads, over and over
again, and you'll lose control over the analyzed areas of definitely
known content (code, data, junk). Not to mention self-modifying code...

You'd better prepare to stream the results of your analysis into a file,
allowing you to resume the analysis later.
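For instance (a minimal sketch; the JSON layout and the function names are my own invention, not an established format):

```python
# Periodically dump the classified address ranges to disk and reload
# them on the next run, so an interrupted analysis can be resumed.
import json
import os

def save_state(path, blocks):
    """Write the list of [start, end) block ranges as JSON."""
    with open(path, "w") as f:
        json.dump({"blocks": blocks}, f)

def load_state(path):
    """Reload a previous run's state, or start empty if none exists."""
    if not os.path.exists(path):
        return {"blocks": []}
    with open(path) as f:
        return json.load(f)
```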