This file is intended for those interested in writing
64-bit programs for the AMD64 and EM64T processors running on x64 (64-bit Windows),
using GoAsm (assembler), GoRC (resource compiler) and GoLink (linker).
It may also be of interest to those writing 64-bit assembler programs for
Windows using other tools.

Introduction to 64-bit programming

Despite the differences between the 64-bit processors and their 32-bit counterparts, and between
the x64 (Win64) operating system and Win32, using GoAsm to write 64-bit Windows programs
is just as easy as it was in Win32.

In fact, you can readily use the same source code to create executables for both platforms if
you follow a set of rules.

You can also convert existing 32-bit source code to 64-bit,
and some of the work required to do this can be done automatically using
AdaptAsm.

Although 32-bit and 64-bit executables are based on the same PE (Portable Executable) format,
in fact there are a number of major differences. The extent of those differences means that
32-bit code will only run on Win64 using the Windows on Windows (WOW64) subsystem. This works
by intercepting API calls from the executable and converting the parameters to suit Win64.
64-bit code will not work at all on 32-bit platforms.

The executable contains a flag which tells the system at load-time whether it is 32-bit or
64-bit. If the x64 loader sees a 32-bit executable, WOW64 kicks in automatically. This means
that 32-bit and 64-bit code cannot be mixed within the same executable.

The significance of the above is that the programmer has to choose between:-

Making one version of the application (Win32). This will work on both platforms.

Making two versions of the application (one for Win32 and one for Win64).

For those who are interested in PE file internals, here is a summary of the main differences
between 32-bit and 64-bit executables:-

The PE file format for Win64 files is called "PE+".

The "size of optional header" field in the COFF header is 0F0h in a PE+ file and 0E0h in a
PE file.

The "machine type" in the COFF header is not 14Ch (as it is for x86 processors), but is
8664h (for the AMD64 processor).

The "magic number" at the beginning of the optional header is 20Bh instead of 10Bh.

The "majorsubsystemversion" in a PE+ file is 5 instead of 4 in a PE file.

The executable "image" (the code/data as loaded in memory) of a Win64
file is limited in size to 2GB. This is because the AMD64/EM64T processors use relative addressing
for most instructions, and the relative address is kept in a dword. A dword is only capable of
holding a relative value of ±2GB.

The import address table (where the loader overwrites the addresses of external calls such
as the addresses of APIs in system Dlls) is enlarged to 64-bits, as is the import look-up table.
This is because the address of external calls could be anywhere in memory.

The preferred image base, SizeofStackReserve, SizeofStackCommit, SizeofHeapReserve
and SizeofHeapCommit fields in the optional header are enlarged from 4 to 8 bytes.

The default base address in Win64 is 400000h as in Win32 files.

64-bit executables which provide properly for full Win64 exception handling contain a .pdata
section holding the tables required for this.
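The header differences listed above can be checked with a few lines of script. The following is only a minimal sketch in Python: the names classify_pe and make_stub are hypothetical, and the field offsets used (the dword at 3Ch pointing to the "PE" signature, followed by a 20-byte COFF header) come from the published PE format rather than from this file.

```python
import struct

# Values quoted in the list above.
MACHINE_X86 = 0x14C
MACHINE_AMD64 = 0x8664
MAGIC_PE = 0x10B
MAGIC_PE_PLUS = 0x20B


def classify_pe(image):
    """Return ("PE" or "PE+", size of optional header) for the given file bytes."""
    # The dword at 3Ch in the DOS header gives the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4
    # COFF header: machine type is the word at +0, size of optional header the word at +10h.
    (machine,) = struct.unpack_from("<H", image, coff)
    (optional_size,) = struct.unpack_from("<H", image, coff + 16)
    # The optional header follows the 20-byte COFF header and starts with the magic number.
    (magic,) = struct.unpack_from("<H", image, coff + 20)
    if machine == MACHINE_AMD64 and magic == MAGIC_PE_PLUS:
        return ("PE+", optional_size)
    if machine == MACHINE_X86 and magic == MAGIC_PE:
        return ("PE", optional_size)
    raise ValueError("unrecognised machine type or magic number")


def make_stub(machine, optional_size, magic):
    """Hypothetical helper: build just enough header bytes to exercise classify_pe."""
    image = bytearray(0x40 + 4 + 20 + 2)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, 0x40)
    image[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", image, 0x44, machine)
    struct.pack_into("<H", image, 0x44 + 16, optional_size)
    struct.pack_into("<H", image, 0x44 + 20, magic)
    return bytes(image)
```

To try it on a real file, the bytes could be read with open(path, "rb").read() and passed to classify_pe.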

Here are the main differences between Win32 and Win64 of relevance to the assembler or
Windows programmer:-

Calling convention. Win32 uses the STDCALL convention whereas Win64 uses the FASTCALL
convention. In STDCALL all parameters which are sent to an API are PUSHed on the stack.
In Win32 the stack pointer (ESP) is reduced by 4 bytes for each PUSH. In STDCALL it is the
responsibility of the API to restore the stack to equilibrium.
In FASTCALL, the first four parameters are sent to the API in registers (in this order: RCX,RDX,R8 and
R9), but the fifth and subsequent parameters are PUSHed on the stack.
In Win64, the stack pointer (RSP) is reduced by 8 bytes for each PUSH. Unlike STDCALL, it is not
the responsibility of the API to clear up the stack. Instead this must be done by the caller
to the API. The caller must also ensure that there is space on the stack for the API to store
the parameters which are passed in registers. In practice this is achieved by reducing the stack
pointer by 32 bytes just before the call.
Note that in GoAsm all the work required by the FASTCALL calling convention is done
automatically if you use INVOKE or ARG followed by INVOKE. See
coding to comply with FASTCALL calling convention.
The use of ARG and INVOKE is described in the relevant part of the
GoAsm manual.
Note that GoAsm does not yet do this for parameters which need
to be sent in the XMM registers (ie. in floating point instructions).
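As an illustration of where FASTCALL parameters end up, here is a small sketch in Python (not GoAsm code). The function name parameter_locations is hypothetical. The stack offsets are an assumption taken from the Microsoft x64 calling convention rather than from this file: they are measured from RSP immediately before the CALL executes, and only integer and pointer parameters are considered.

```python
REGISTER_PARAMS = ["RCX", "RDX", "R8", "R9"]
REGISTER_AREA = 32  # bytes reserved by the caller for the four register parameters


def parameter_locations(count):
    """Return where each of `count` integer/pointer parameters sits at the CALL."""
    locations = []
    for index in range(count):
        if index < len(REGISTER_PARAMS):
            locations.append(REGISTER_PARAMS[index])
        else:
            # Fifth and later parameters lie above the 32-byte register area, 8 bytes each.
            offset = REGISTER_AREA + 8 * (index - len(REGISTER_PARAMS))
            locations.append(f"[RSP+{offset:X}h]")
    return locations
```

For a six-parameter API this gives RCX, RDX, R8, R9, [RSP+20h] and [RSP+28h].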

Windows uses the FASTCALL convention to call the window procedures and other callback
procedures in your application. This means that your window procedures will pick up the parameters
in a different way under Win64. Also the window procedures no longer have to
restore the stack to equilibrium.
Note that GoAsm will implement these things automatically if you use FRAME...ENDF.
The use of FRAME...ENDF is described in the relevant part of the
GoAsm manual.

All functions using a stack frame (including window procedures) need to follow certain
rules if they wish to make use of exception handling. The tools also need to add exception
frame records to the executable. This will also be handled automatically by the "Go" tools.
Note that this is not yet available.

Register volatility. In Win32, window procedures and other callback procedures have to restore the values in the
EBP,EBX,EDI and ESI registers before returning to the caller (if the values in those registers
are changed). This is something that is also done by the Windows APIs (these registers
will not change when you call an API). These are called the "non-volatile" registers.
In Win64, this list of registers is extended to RBP,RBX,RDI,RSI,R12 to R15 and
XMM6 to XMM15.
The "volatile" registers are those which may be changed by APIs, and which you do
not need to save and restore in your window procedures and other callback procedures.
In Win32 the general purpose volatile registers were EAX,ECX and EDX. These have now
been extended to RAX,RCX,RDX, and R8 to R11.

You might not have expected this, but in 64-bit assembly for the AMD64, pointers to
code and data whose addresses are within the executable are still only 32-bits.
This ties in with the fact that RIP-relative addressing limits the size of the
executable to 2GB. Pointers to external addresses, such as functions in Dlls, are 64 bits
wide so that the function can be anywhere in memory (see call address sizes).

The main differences are the expanded register range, some changes to instructions, and the use of
RIP-relative addressing. The notes below refer to the AMD64 in 64-bit mode. In this mode the
AMD64 can also run 32-bit executables naturally.

The AMD64 adds several new registers to those available in the 86 series of processors, and
also adds new ways to address the existing registers.

The EAX,EBX,ECX,EDX,ESI,EDI,EBP and ESP "general purpose" registers are all enlarged to 64-bits.
The enlarged registers are accessed using RAX,RBX,RCX,RDX,RSI,RDI,RBP and RSP.

You can still access the low dword of these registers (ie. the least significant 32 bits) by using
the existing names EAX,EBX,ECX,EDX,ESI,EDI,EBP and ESP.

You can still access the lowest word of these registers (ie. the least significant 16 bits) by using
the existing names AX,BX,CX,DX,SI,DI,BP and SP.

You can still access the first byte of RAX,RBX,RCX and RDX (ie. the least significant 8 bits)
by using the existing names AL,BL,CL,DL as in the 86 processor. But you can now also address
the first byte of the "index" registers by using SIL,DIL,BPL and SPL. So for example SIL is
the least significant 8 bits of the index register RSI.

You can still access the second byte of RAX,RBX,RCX and RDX (bits 8 to 15) by using
the existing names AH,BH,CH,DH as in the 86 processor. However, the opcodes for this have been
altered in the AMD64 processor. They now clash with the opcodes required to address the
byte versions of the extended registers R8 to R15. So you cannot use AH,BH,CH,DH and
R8B to R15B in the same instruction.

There are eight new 64-bit registers (the "extended registers") named R8 to R15.

The low dword of these registers (ie. the least significant 32 bits) can be addressed
using the R8D to R15D forms.

The low word of these registers (ie. the least significant 16 bits) can be addressed
using the R8W to R15W forms.

The first byte of these registers (ie. the least significant 8 bits) can be addressed
using the R8B to R15B forms.

There are 8 new XMM (128-bit) registers named XMM8 to XMM15.

The 64-bit MMX registers (MM0 to MM7) are still available. As in the 86 processor they are
also used as floating point registers (ST0 to ST7) for the x87 floating point instructions.

There are some instructions which are not available in the AMD64. The opcodes are now
used for other purposes. The full list is contained in the AMD64 manuals, but includes
AAA, AAD, AAM, AAS, DAA and PUSH and POP operations using CS,DS,ES and SS.

Instructions are enlarged to allow for the new registers and register forms of address,
for example:-

MOV RAX,immediate ;move a 64-bit number into the 64-bit register
JRCXZ >L1 ;if RCX is zero jump forward to L1

The string instructions are now enlarged to allow for 64-bit addressing, for example:-

The repeat prefixes REP, REPZ and REPNZ use RCX rather than ECX.
The loop instructions LOOP, LOOPZ and LOOPNZ use RCX rather than ECX.
The table look-up instruction XLATB uses RBX rather than EBX.

Apart from the above, the only new instruction of any note usable by programmers is MOVSXD
which can move 32-bits of data from a register or from memory into a 64-bit register, sign extending
bit 31 into all higher bits. There are also a handful of new system instructions.

In the AMD64, each PUSH and POP instruction moves the stack pointer by 8 bytes instead
of 4 bytes as in the 86 processor. This means that PUSH 32-bit register is no longer
a recognised instruction on the AMD64. To help with compatibility of source code, GoAsm treats (for example) PUSH EAX
as equivalent to PUSH RAX. In /x86 mode, GoAsm treats PUSH RAX as equivalent to PUSH EAX.
So it does not really matter which you use.

PUSH immediate on the AMD64 takes a 32-bit immediate (number) value and sign extends bit 31
into all higher bits. There is no single instruction capable of taking a 64-bit immediate value and
PUSHing that onto the stack. For this reason PUSH ADDR THING is not a recognised instruction
on the AMD64 (the offset value is treated as an immediate). The problem here is that the actual
immediate value of any particular offset is unknown until link-time, and at assemble-time it is
impossible for the assembler to know whether the offset is above 7FFFFFFFh and so would
be affected by the sign extension.

Therefore in GoAsm, PUSH ADDR THING makes use of the R11 register
and takes advantage of the shorter RIP-relative addressing of LEA with the following coding:-

LEA R11,[THING]
PUSH R11

The 3DNow! instructions are still available in the AMD64. It's not clear whether
these instructions are now available on processors supporting Intel EM64T technology.

Some instructions in the AMD64 processor which address data or code use RIP-relative addressing
to do so. The relative address is contained in a dword which is part of the instruction. When
using this type of addressing, the processor adds three values: (a) the contents of the dword
containing the relative address, (b) the length of the instruction and (c) the value of RIP (the
current instruction pointer) at the beginning of the instruction. The resulting value is then
regarded as the absolute address of the data or code to be addressed by the instruction. Since
the relative address can be a negative value, it is possible to address data or code earlier
in the image from RIP as well as later. The range is roughly ±2GB, depending on the
instruction size. Since relative addressing cannot address outside this range, this is the
practical size limit of 64-bit images.
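The three-value addition described above can be worked through numerically. The following Python sketch uses hypothetical function names and made-up addresses. It treats the dword as a signed value, which is what allows addresses earlier in the image to be reached.

```python
def rip_relative_target(rip_at_start, instruction_length, displacement):
    """Add (a) the signed dword, (b) the instruction length and (c) RIP at the start."""
    displacement &= 0xFFFFFFFF
    if displacement & 0x80000000:
        displacement -= 0x100000000  # interpret the dword as a negative value
    return (rip_at_start + instruction_length + displacement) & 0xFFFFFFFFFFFFFFFF


def displacement_for(rip_at_start, instruction_length, target):
    """Work out the dword an assembler or linker would have to store to reach `target`."""
    delta = target - (rip_at_start + instruction_length)
    if not -0x80000000 <= delta <= 0x7FFFFFFF:
        raise ValueError("target is outside the range of RIP-relative addressing")
    return delta & 0xFFFFFFFF
```

For example, a 7-byte instruction at 401000h holding the dword 0FF9h addresses 402000h, and to address 400000h from the same instruction the dword would need to be 0FFFFEFF9h.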

RIP-relative addressing happens "behind the back" of the user. The processor uses it if the
opcodes contain certain values (in the ModRM byte, the Mod field equals 00 binary, and the r/m
field equals 101 binary). You cannot control this except by changing the type of
instructions you use. Generally here are the rules which govern whether or not an instruction
uses RIP-relative addressing:-

Addresses in data cannot use RIP-relative addressing since the value of RIP cannot be
known at the time when those addresses are set. Instead, an absolute address for insertion
is calculated at link-time. So for example the following instructions do not use
RIP-relative addressing but instead use absolute addresses:-

Note that in practice, the absolute address is contained in a dword and not in a qword. This is why
in the above examples data and code addresses can be contained within a dword data declaration.
This restriction is feasible because the practical image size is limited to 2GB anyway because
of the restrictions imposed by RIP-relative addressing.

Offsets converted to immediate values either at assemble-time or at link-time use
absolute addressing rather than relative addressing. For example the following instructions
do not use RIP-relative addressing but instead use absolute addresses:-

Note in the case of an external call, the relative address points to the Import
Address Table. Since the table is now enlarged to 64-bits, it is possible to call a code label
anywhere in memory.

LEA uses RIP-relative addressing, for example:-

LEA RBX,MyDataLabel3 ;load into RBX address of data label

RIP-relative addressing is not used where the data or code label is supplemented by
an index register. Although this may seem odd, the reason appears to be that adding
information about the register to the opcodes means that the processor can no longer
recognise the instruction as one which uses RIP-relative addressing (in the ModRM byte,
the Mod field no longer equals 00 binary, and the r/m field no longer equals 101 binary).
This means that the following instructions use absolute addresses rather than RIP-relative
ones:-

Because RIP-relative addressing is not being used here, for these types of instructions to work properly,
the Image Base should be well below 7FFFFFFFh.
These types of instructions would need to be adjusted if using a larger Image Base or when linking with the /LARGEADDRESSAWARE option.

Bearing in mind that the image size is limited to 2GB by the above arrangements, it might be
thought that the advantages of RIP-relative addressing are somewhat limited. This seems to
be the case. It appears that the only advantage is that it lessens the number of relocations
which would need to be carried out by the loader if a DLL is loaded at an address which is
unexpected. The loader then would need to adjust all absolute addresses to suit the actual
image base, but relative addresses would not have to be altered since they refer to other
parts of the virtual image of the executable itself. However, it is good practice for the
programmer to choose a suitable image base at link-time to avoid the need for relocations in
a DLL in the first place. A good example of this is the system DLLs themselves. They all
have a different image base which effectively avoids any prospective clashes of the image
in memory which would require relocation at load-time.

A direct CALL to a code label will be coded as an E8 RIP-relative call, using a dword to provide the offset from RIP.
The destination of this call might be an internal code label (ie. a procedure or function
within the executable itself). Or it might be to an external code label, such as
an API in a system Dll or to a code label exported by another exe or Dll. The first
destination of a call to an external code label is to the Import Address Table which
is part of the executable itself. This table is written over by the loader when the
executable starts. Therefore during run-time the table contains the absolute addresses
in virtual memory of the eventual destination of the call. In a 64-bit executable,
the table contains 64-bit values, so the E8 RIP-relative call is capable of calling a procedure
or function anywhere in memory.

Calls to memory addresses either held in a label, or in registers, or in
memory pointed to by registers, however, are dealt with in a different way. They are
not channelled through the Import Address Table. These calls must also permit the
destination of the call to be anywhere in memory. In order to achieve this they must
themselves use 64-bit absolute addresses. Examples of these types of calls are:-

Using the switched type indicator

The above change of a data type may require a corresponding change to a type indicator. The letter P is reserved as a type indicator in all situations when
GoAsm might expect to find one. So you can have this switch:-

#if x64
P = 8
#else
P = 4
#endif

P can be switched to the equivalent of any of the pre-defined type
indicators that is B, W, D, Q or T. In this case it is switched either
to Q (value 8) or to D (value 4). Therefore you can control the size
of the instruction with it, for example:-

MOV P[RDI],0 ;zero a qword at RDI if 64-bit, dword at EDI if 32-bit
LOCAL POINTERS[10]:P ;make 80 byte local pointer buffer if 64-bit, 40 byte if 32-bit

Alignment requirements

The requirements of the system in Win64 for correct alignment of the stack pointer,
data, and structure members are much stricter than in Win32. Wrong alignment
can cause at best a loss of performance and at worst, an exception or program exit.

Stack alignment

The stack pointer (RSP) must be 16-byte aligned when making a
call to an API. However, this is organised automatically by GoAsm if you use
INVOKE (see automatic stack alignment).

Data alignment

All data must be aligned on a "natural boundary". So a byte can be byte-aligned, a word
should be 2-byte aligned, a dword should be 4-byte aligned, and a qword should be 8-byte
aligned. A tword should also be qword aligned. GoAsm deals with alignment automatically
for you when you declare local data (within a FRAME or USEDATA area). But you will need
to organise your own data declarations to ensure that the data is properly aligned. The
easiest way to do this is to declare all qwords first, then all dwords, then all words
and finally all bytes. Twords (being 10 bytes) would put out the alignment for later
declarations, so you could declare all those first and then put the data back into
alignment ready for the qwords by using ALIGN 8.
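The effect of declaration order on alignment can be checked with a short calculation. This Python sketch is only an illustration: the names natural_boundary, misaligned_items and align_up are hypothetical, and it assumes that data is placed back to back in the order declared, with no padding unless you ask for it.

```python
def natural_boundary(size):
    """Boundary required by the rules above: 1, 2, 4 or 8, with a 10-byte tword qword aligned."""
    return 8 if size >= 8 else size


def misaligned_items(declarations, start=0):
    """Return the names of (name, size) declarations which do not fall on their natural boundary."""
    offset = start
    misaligned = []
    for name, size in declarations:
        if offset % natural_boundary(size):
            misaligned.append(name)
        offset += size
    return misaligned


def align_up(offset, boundary):
    """Next offset at or after `offset` which is a multiple of `boundary` (what ALIGN 8 achieves)."""
    return (offset + boundary - 1) // boundary * boundary
```

Declaring a byte, then a dword, then a qword leaves the dword and the qword misaligned. Sorting the same declarations largest first leaves nothing misaligned. A qword following a tword lands at offset 10 and is misaligned until ALIGN 8 moves it to offset 16.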

As for strings, in accordance with the above rules, Unicode strings must be
2-byte aligned, whereas ANSI strings can be byte aligned.

When structures are used they need to be aligned on the natural boundary of the
largest member. All structure members must also be aligned properly, and the structure
itself needs to be padded to end on a natural boundary (the system can write in
this area). Because of the importance of this, from Version 0.56 (beta), GoAsm aligns structures
automatically for you. See automatic alignment and padding of
structures and structure members for more.

Windows often uses structures to send and receive information using the APIs. In 64-bits
these structures are likely to be significantly different from their 32-bit counterparts
because of the enlargement of many data types to 64-bits.
See changes to Windows data types.
Take for example the WNDCLASS structure which is used when you want to register a window class:-

A number of the members are now qwords, whereas previously they were dwords as you can
see from the 32-bit version below. The class style at offset +0h remains a dword, but then
in the 64-bit version, padding of four bytes is required because the next member is a
qword. This complies with the requirement that structure members are aligned on their natural
boundary. A qword is used to provide space for the pointers firstly to the window procedure
itself at +8h, to menu name at +38h and to the window class name at +40h. This is despite
the fact that 64-bit programming as implemented by Win64 for the AMD64 processor only uses 32-bit
pointers where those pointers give the addresses of internal data. Presumably the reason
for this is that the same structures are being used here as are used for the IA64 family of
processors (which use 64-bit pointers to internal data). Handles in the structure are also
enlarged to 64-bits.

It is also a requirement that the structure is enlarged so that it ends on
the natural boundary of its largest member. This is achieved by adding the
necessary padding at the end of the structure. So PAINTSTRUCT becomes:-

In practice it was found that the system wrote to the area of padding at +44h when
using PAINTSTRUCT in certain circumstances. This shows the importance of complying
with these rules (otherwise you could find that data after the structure could be
written over).

Note that the beginning of structures must be aligned on the natural boundary
of the largest member as well. All the above rules ensure, therefore, that qwords
in the structure are always qword aligned.
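The alignment and padding rules for structures can be applied mechanically. The sketch below is in Python and is not a description of how GoAsm does it: the function name layout is hypothetical, and the member names and sizes are my reading of the 64-bit WNDCLASS and PAINTSTRUCT definitions in the Windows headers. It places each member on its natural boundary and pads the end of the structure to the boundary of the largest member.

```python
def layout(members):
    """members is a list of (name, size, alignment). Returns ({name: offset}, total size)."""
    offsets = {}
    offset = 0
    largest = 1
    for name, size, alignment in members:
        # Pad so that the member starts on its natural boundary.
        offset = (offset + alignment - 1) // alignment * alignment
        offsets[name] = offset
        offset += size
        largest = max(largest, alignment)
    # Pad the end so that the structure ends on the boundary of its largest member.
    total = (offset + largest - 1) // largest * largest
    return offsets, total


WNDCLASS64 = [
    ("style", 4, 4),
    ("lpfnWndProc", 8, 8),
    ("cbClsExtra", 4, 4),
    ("cbWndExtra", 4, 4),
    ("hInstance", 8, 8),
    ("hIcon", 8, 8),
    ("hCursor", 8, 8),
    ("hbrBackground", 8, 8),
    ("lpszMenuName", 8, 8),
    ("lpszClassName", 8, 8),
]

PAINTSTRUCT64 = [
    ("hdc", 8, 8),
    ("fErase", 4, 4),
    ("rcPaint", 16, 4),       # a RECT, which is four dwords
    ("fRestore", 4, 4),
    ("fIncUpdate", 4, 4),
    ("rgbReserved", 32, 1),   # 32 reserved bytes
]
```

Running layout on WNDCLASS64 puts the window procedure pointer at +8h, the menu name at +38h and the class name at +40h. For PAINTSTRUCT64 the members end at +44h and the structure is padded to 48h bytes.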

Automatic alignment and padding of structures and structure members

As we have seen, correct alignment of structures and structure members is crucial
for proper operation of 64-bit code. Unfortunately the Windows header files
containing the structure definitions do not necessarily contain the necessary
padding to achieve such alignment.

So from Version 0.56 (beta), GoAsm does this work automatically for you as follows:-

GoAsm always pads if necessary to ensure that structure members are on their
natural boundary. So in the MSG structure example below, the padding at +0Ch
could be left out. It would be inserted automatically.

GoAsm always adds padding at the end of a structure so that the structure
ends on a natural boundary. So in the example below the padding at +2Ch could be
left out. It would be inserted automatically.

The symbols created when using a structure are automatically adjusted to suit
the alignment and padding which is applied.

You can see what alignment and padding GoAsm has added to your source code if you
specify /l in GoAsm's command line. This will create a list file. Also you can
view the effect in a debugger.

Structures - the overall picture

If you are writing source code for both 32 and 64-bit versions of your program, this
will be made much easier if you use conditional assembly to switch the correct structures
at assemble-time, and then instead of filling the structures using the offset values, you
fill them using the member names. Using this method, GoAsm finds the correct offset for
you automatically. This technique has been used in the demonstration file Hello64World3.

You can use conditional assembly to switch whole banks of structures in one go. These
can be contained in include files containing 32-bit structures and 64-bit structures
respectively.

Since GoAsm aligns and pads the structures automatically for you, you can use
the 64-bit structure definitions already available in include files, or you can make
your own from the Windows header files using Wayne J Radburn's
xlatHinc utility.

One main thing to remember is that all Windows handles are 64-bits so the APIs will provide them
in RAX rather than in EAX.

The same goes for Windows pointers. For example you may ask Windows for some memory. The address
of the memory will be returned in RAX and not in EAX.
So this means that:-

ARG 4h,3000h,EDX,0
INVOKE VirtualAlloc ;reserve and commit edx bytes of read/write memory
MOV [EAX],66666666h ;insert a number at the beginning of that memory

is bad 64-bit coding, whereas

ARG 4h,3000h,EDX,0
INVOKE VirtualAlloc ;reserve and commit edx bytes of read/write memory
MOV [RAX],66666666h ;insert a number at the beginning of that memory

is good.

Since all pointers to internal data and code labels are 32-bits, in theory it is possible
to use the 32-bit versions of the general purpose registers (EAX to ESP) for all such pointers
so for example, you could use MOV [ESI],AL instead of MOV [RSI],AL.

However, I do advise against this for the following five reasons:-

It means you have to keep track of which pointers are internal ones and which are
external ones. You must allow for the external ones being 64-bits.

You may need two sets of procedures which are oft-used in your program, one using
32-bit register pointers and one using 64-bit register pointers.

The string instructions such as LODSB, MOVSW, STOSD, CMPSQ and SCASB use RSI and RDI
in a 64-bit program rather than ESI and EDI. And the repeat prefixes REP, REPZ and REPNZ
use RCX instead of ECX.

In a 64-bit program, using the 32-bit versions of these registers as pointers codes one byte
larger than using the 64-bit versions. This is because in a 64-bit program, MOV [RSI],AL is
the default and to convert this to MOV [ESI],AL requires a 67h override byte.

You can still use the same source code to make both 32-bit and 64-bit programs provided
you only use the general purpose registers, RAX to RSP. This is because when you use the /x86
switch with GoAsm these registers are automatically regarded as EAX to ESP instead.

You can automate the required changes to existing 32-bit code using AdaptAsm.

If you need to use the R8 to R15 registers, remember that R8 to R11 are volatile (they will
not be maintained by the APIs). If you use the non-volatile R12 to R15 registers within window
procedures and callback procedures then you must ensure that they are restored after use. This
can be done by using PUSH at the beginning and POP at the end of the procedure which uses them, or
by using the USES statement.

When passing parameters to an API using INVOKE, you may need to take into account that
in the FASTCALL calling convention the parameters have to be sent to the API in the RCX,RDX,R8 and
R9 registers. Therefore you would not wish to pass parameters in registers which will be overwritten
by GoAsm (you will get an error message if you try to do this).

For example this is bad and will show an error:-

INVOKE MessageBoxW,RDX,R8,R9,R10

It's bad because if it were allowed, it would translate to:-

MOV R9,R10
MOV R8,R9
MOV RDX,R8
MOV RCX,RDX

so it can be seen that the contents of the registers are being overwritten before they
are being used to establish the parameters.
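The overwriting can be followed step by step by treating each MOV as a simple copy. This is a Python sketch with a hypothetical function name, and the register contents are made-up labels for the four MessageBoxW parameters.

```python
def run_moves(registers, moves):
    """Apply (destination, source) MOV pairs in order and return the final register contents."""
    state = dict(registers)
    for destination, source in moves:
        state[destination] = state[source]
    return state


# Contents before the (disallowed) INVOKE MessageBoxW,RDX,R8,R9,R10
before = {"RCX": "?", "RDX": "hwnd", "R8": "text", "R9": "caption", "R10": "style"}

# The four MOVs shown above, in the same order.
moves = [("R9", "R10"), ("R8", "R9"), ("RDX", "R8"), ("RCX", "RDX")]

after = run_moves(before, moves)
# RCX, RDX, R8 and R9 all end up holding "style"; the other three values are lost.
```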

Better would be:-

INVOKE MessageBoxW,R10,R9,R8,RDX

Which translates to:-

MOV R9,RDX
MOV RDX,R9
MOV RCX,R10

Note that GoAsm does not bother to code MOV R8,R8.
Even better would be:-

INVOKE MessageBoxW,RCX,RDX,R8,R9

which requires no further code to pass the parameters since they are already in the correct
registers. So this is very efficient code.

Take care when mixing the 64-bit registers and their 32-bit counterparts because the processor
can change the contents of the whole 64-bit register when this is
not obvious. This is because when writing results to a 32-bit register the processor will
zero-extend the result into the whole 64-bits of the register. So, for example:-

MOV RAX,-1 ;fill RAX with 0FFFFFFFF FFFFFFFFh
AND EAX,0F0F0F0Fh ;(apparently) work only on EAX

but the processor will zero extend the result into RAX, in other words it will zero
the whole of the high dword of RAX. The result in RAX is 00000000 0F0F0F0Fh not 0FFFFFFFF 0F0F0F0Fh as
expected. This happens irrespective of the value of bit 31 of RAX (this is not the same as sign-extension).
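The zero-extension rule can be modelled in a few lines. This is a Python sketch with hypothetical function names. The 16-bit case is included for contrast only; it relies on the processor leaving the upper bits alone when a 16-bit register is written, which is not stated in this file.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def write_32bit_result(old_register, result):
    """Writing a 32-bit register: the high dword of the 64-bit register is zeroed."""
    return result & MASK32


def write_16bit_result(old_register, result):
    """Writing a 16-bit register: bits 16 to 63 keep their old contents."""
    return (old_register & MASK64 & ~MASK16) | (result & MASK16)


def and_eax(rax, immediate):
    """Model AND EAX,immediate and return the resulting contents of RAX."""
    return write_32bit_result(rax, (rax & MASK32) & immediate)


def and_ax(rax, immediate):
    """Model AND AX,immediate and return the resulting contents of RAX."""
    return write_16bit_result(rax, (rax & MASK16) & immediate)
```

With RAX filled with 0FFFFFFFF FFFFFFFFh, and_eax with 0F0F0F0Fh gives 00000000 0F0F0F0Fh, whereas and_ax with 0F0Fh gives 0FFFFFFFF FFFF0F0Fh.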

A similar thing happens when using other instructions. Here is an example with XOR:-

You can take advantage of zero-extension in various ways. Some examples are given in
some tips to reduce the size of your code. Take also this example, where
the structure RECT (which is four dwords) contains values which must be passed to the API MoveWindow
as qwords:-

Here only 32-bit registers are used to extract the information from the RECT structure, but
we know that the high part of the 64-bit versions of those registers is set to zero.

It is possible that there is a performance loss in relying on zero-extension. Some of the
documentation suggests that the processor has to carry out an additional operation to zero
the high bits of the register.

You may wonder about the difference between the following instructions:-

MOV D[THING],12345678h
MOV Q[THING],12345678h

These code differently and do different things. The dword version places the value 12345678h
into the dword at the label THING as you would expect. The qword version does the same, but
also zeroes the dword at THING+4. This is because it sign-extends the 32-bit value into
the qword at the label THING. So if the high bit (bit 31) of the value is set, the qword version
will fill THING+4 with 0FFFFFFFFh. In other words, the 32-bit value in these instructions is
regarded as a signed number, and written to memory accordingly.
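The difference between the two forms is easiest to see by looking at the eight bytes at THING afterwards. In this Python sketch the function names are hypothetical and THING is modelled as an 8-byte buffer initially filled with 0AAh, so that untouched bytes are visible.

```python
import struct


def store_dword(memory, offset, immediate):
    """Model MOV D[THING],immediate: only four bytes are written."""
    struct.pack_into("<I", memory, offset, immediate & 0xFFFFFFFF)


def store_qword(memory, offset, immediate):
    """Model MOV Q[THING],immediate: the 32-bit value is sign-extended to eight bytes."""
    immediate &= 0xFFFFFFFF
    if immediate & 0x80000000:
        immediate -= 0x100000000  # treat the 32-bit value as a signed number
    struct.pack_into("<q", memory, offset, immediate)


def thing_after(store, immediate):
    """Return the 8 bytes at THING as hex after applying one of the stores above."""
    memory = bytearray(b"\xAA" * 8)
    store(memory, 0, immediate)
    return memory.hex()
```

So thing_after(store_dword, 0x12345678) leaves the upper four bytes as they were, thing_after(store_qword, 0x12345678) zeroes them, and thing_after(store_qword, 0x87654321) fills them with 0FFh.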

The stack pointer (RSP) must be 16-byte aligned when making a call to an API. With some
APIs this does not matter, but with other APIs wrong stack alignment will cause an exception.
Some APIs will handle the exception themselves and align the stack as required
(this will, however, cause performance to suffer). Other APIs (at least on early builds
of x64) cannot handle the exception and unless you are running the application under debug
control, it will exit.

Because of this requirement, the Win64 documentation states that you can only call an API
within a stack frame. This is because it is assumed that only within a stack frame can the
stack be guaranteed to be aligned properly. A call out of the stack frame will misalign the
stack by 8 bytes.
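The arithmetic behind the 8-byte misalignment can be sketched as follows. This is Python with hypothetical function names and a made-up stack address. It is not the coding which GoAsm inserts; it only shows why the number of PUSHes matters and what rounding RSP down to a 16-byte boundary does.

```python
def rsp_after_pushes(rsp, pushes):
    """Each PUSH in 64-bit code reduces RSP by 8 bytes."""
    return rsp - 8 * pushes


def aligned_for_call(rsp):
    """True if RSP is on a 16-byte boundary, as required when calling an API."""
    return rsp % 16 == 0


def align_down_16(rsp):
    """Round RSP down to the next 16-byte boundary (the stack grows downwards)."""
    return rsp & ~0xF


# Hypothetical RSP on entry to a procedure: the caller was aligned, and the
# CALL then pushed an 8-byte return address.
rsp_on_entry = 0x12FF58
```

Here RSP on entry is 8 bytes out. After one PUSH it is aligned again, after two it is not, and align_down_16 would bring 12FF48h down to 12FF40h.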

This requirement is very restrictive to assembler programmers, and causes compilers a big
headache. GoAsm's solution to this problem is to insert special coding before and after each
API call (when INVOKE is used) to ensure that the stack is always properly aligned at the time
of the call. This liberates the assembler programmer, and means that:-

Calls to APIs (using INVOKE) can be made anywhere in your code. They can be made from
procedures called by other procedures without worrying about the stack pointer.

PUSHes and POPs can be used in the usual way to save and restore registers, memory addresses
and contents of memory without having to worry that this puts the stack out of alignment.

You can use the same source code both for 32-bit and 64-bit versions of your application
(there is no requirement for stack alignment in 32-bits).

The overhead for aligning the stack at the time of each API call is an additional nine bytes per
API, which seems a small price to pay for the advantages gained. To keep down the size of the code as
much as possible, GoAsm takes a number of opportunities to optimise the code particularly
when inserting the parameters. See some optimisation done by GoAsm for
details. See also coding to achieve automatic stack alignment.

Bringing together all those considerations and also those set out above, it is perfectly possible
to use the same source code to create executables for both 32-bit and 64-bit platforms.

To recap, here are the rules which must be followed to do this:-

When calling APIs use INVOKE in your code instead of CALL.

When passing parameters to APIs use ARG in your code instead of PUSH, alternatively
give the parameters after INVOKE.

Use FRAME .. ENDF in your code when using LOCAL data or picking up parameters sent to a window
procedure (or other similar callback procedure).

If you want to use the new registers R8-R15, XMM8-XMM15, or the new 8, 16 and 32-bit addressed
registers (such as SIL, R8W or R8D), make sure they are used only within switched 64-bit source code using conditional
assembly.

Use the 64-bit form of the general purpose registers (RAX,RBP,RBX,RCX,RDX,RDI,RSI, and RSP)
for pointers. When GoAsm assembles for 32-bit, it will automatically reduce these
registers to their 32-bit counterparts.

If you have used PUSHFD and POPFD to save and restore the flags, change this to
PUSHF and POPF or PUSH FLAGS and POP FLAGS.

Ensure that structures, data sizes, and type indicators are correct for 32/64-bit use, if necessary
by using conditional assembly.

Use /x64 in the command line to create a 64-bit executable, and /x86 in the command line
to create a 32-bit executable.

The "Go" tools will do the rest of the work.

Note that /x86 should not be used in the command line for Win32 source code (use it only for
32/64-bit switchable source code).

See the file Hello64World3 for example source code which can make
either a simple Win32 "Hello World" Window program or a Win64 one.
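
As a minimal sketch of source written to these rules (it is not taken from Hello64World3; the labels hInst, Caption and Message are hypothetical, and it assumes that x64 can be tested with #IF in conditional assembly), the same lines could be assembled with either /x64 or /x86:-

```asm
;assemble with /x64 for Win64 or with /x86 for Win32
DATA
#IF x64
hInst   DQ 0                    ;handles are 64 bits wide in Win64
#ELSE
hInst   DD 0                    ;and 32 bits wide in Win32
#ENDIF
Caption DB 'Hello',0
Message DB 'Hello World',0
;
CODE
START:
INVOKE GetModuleHandleA,0       ;INVOKE rather than CALL for an API
MOV [hInst],RAX                 ;RAX is reduced to EAX when /x86 is used
INVOKE MessageBoxA,0,ADDR Message,ADDR Caption,40h  ;40h=MB_ICONINFORMATION
INVOKE ExitProcess,0
```

Only the data declaration needs a switch here, because the size of a handle differs between the two platforms.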

Bringing together all the above considerations, this is what you need to do to convert existing
32-bit source code to 64-bit source.

Change all CALLs to APIs to INVOKE. Do not change any CALLs to non-APIs.

If you have used PUSH to send parameters to an API in your 32-bit source, change this to
ARG. Do not use ARG for any other PUSHes.

Change all the 32-bit general purpose registers used as pointers (that is, within
square brackets) to their 64-bit counterparts (RAX,RBP,RBX,RCX,RDX,RDI,RSI, and RSP). This
will keep your code shorter, and ensure that pointers to external data work properly.
Remember also to use only RSI, RDI and RCX with your string instructions and repeat prefixes.
See choice of registers.

Ensure that registers which contain system handles and other values provided by the system
are changed to their 64-bit counterparts (RAX,RBP,RBX,RCX,RDX,RDI,RSI, and RSP).

Adjust all other register use as required. Generally for other use, the existing
registers will work perfectly well, but do not mix the use of 32-bit and 64-bit registers
because of zero-extension of results. There is no need to change
PUSHes and POPs of registers. These changes are done automatically by GoAsm because the opcodes
are the same (for example PUSH EAX is regarded as the same as PUSH RAX and vice versa).

Ensure that structures, data sizes, and type indicators are correct for 64-bit use.

Check that your JECXZ instructions are changed to JRCXZ if appropriate.

Since 64-bit code tends to be a little larger than 32-bit code, when you
re-assemble your code using the /x64 switch, you may find that some
short jumps have to be re-organised.
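
A minimal sketch of how a short piece of 32-bit source might look after these changes. CopyText, Source and Dest are hypothetical names, and the 32-bit originals are shown in the comments:-

```asm
CopyText:               ;on entry RCX holds the number of bytes to copy
JRCXZ >L1               ;was JECXZ >L1
MOV RSI,ADDR Source     ;was MOV ESI,ADDR Source
MOV RDI,ADDR Dest       ;was MOV EDI,ADDR Dest
REP MOVSB               ;string instruction now uses RSI, RDI and RCX
L1:
RET                     ;CopyText is not an API, so calls to it stay as CALL
```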

AdaptAsm comes packaged with GoAsm and I originally wrote it to help to convert
source code used for other assemblers to GoAsm syntax. I have now extended it to
help towards the conversion of 32-bit source code to 64-bit source code. This works
both on GoAsm source code and also source code for other assemblers.

What AdaptAsm does when helping to adapt a file to 64-bits using the /x64 switch

CALLs to APIs are changed to INVOKE (CALLs to non-APIs are not affected).
AdaptAsm does this by looking at lists of APIs in ".h.txt" files in the same
folder as AdaptAsm.exe. See the ".h.txt" files for
more information about these files.
This works with all types of calls even if enclosed in square brackets and
even if dependent on a define (equate) or a switch, for example:-
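
The kinds of call meant here might look like the following in 32-bit source, before conversion. This is only a sketch: MSGBOX and UNICODE_BUILD are hypothetical names, and the exact text written out by AdaptAsm is not shown.

```asm
CALL MessageBoxA        ;plain call to an API
CALL [MessageBoxA]      ;call enclosed in square brackets
;
#define MSGBOX MessageBoxA
CALL MSGBOX             ;call dependent on a define (equate)
;
#IFDEF UNICODE_BUILD
CALL MessageBoxW        ;call dependent on a switch
#ELSE
CALL MessageBoxA
#ENDIF
```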

Changing PUSH to ARG for the parameters sent to the API. AdaptAsm does this by
counting the correct number of parameters back from the CALL and comparing this with
the correct number of parameters in the lists of APIs in ".h.txt" files in the same
folder as AdaptAsm.exe. See the ".h.txt" files for
more information about these files.
Here are some simple examples:-
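
A sketch of the change, using MessageBoxA (which takes four parameters) as the API. Caption and Message are hypothetical labels, and the precise layout produced by AdaptAsm may differ:-

```asm
;32-bit source before conversion
PUSH 40h                ;uType
PUSH ADDR Caption       ;lpCaption
PUSH ADDR Message       ;lpText
PUSH 0                  ;hWnd
CALL MessageBoxA
;
;the same lines after conversion for 64-bits
ARG 40h                 ;uType
ARG ADDR Caption        ;lpCaption
ARG ADDR Message        ;lpText
ARG 0                   ;hWnd
INVOKE MessageBoxA
```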

These files are text files containing lists of APIs and the number of parameters
required by each API. AdaptAsm looks inside its own folder for such h.txt files.
The "h.txt" files are created from Microsoft header
files using a clever javascript file ApiParamCount.js, written by Leland M George of
West Virginia, who has kindly donated it to the public domain. This js file is shipped
with AdaptAsm together with some ready-made h.txt files containing the most commonly
used APIs. If your program uses APIs declared in other header files you can make your
own "h.txt" files using the js file. There are two ways to use the js file:-

Either drag and drop the header file onto the js file (an h.txt file will
be made in the same folder)

From the command line using the following command (for example):-
cscript ApiParamCount.js WinNT.h
or
wscript ApiParamCount.js WinNT.h
Both commands start the Windows Scripting Host, which handles JavaScript files
outside Web page environments.

As well as switching to 64-bit or 32-bit assembly, specifying /x64 or /x86 in GoAsm's command line
also permits these words to be tested in conditional assembly. So, for example, you can switch
two different generalised window procedures in this way:-
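
A minimal sketch of such a switch, assuming x64 is tested with #IF. The bodies of the two procedures are left out, and the parameter names are hypothetical:-

```asm
#IF x64
WndProc FRAME hWnd,uMsg,wParam,lParam
;64-bit version of the window procedure goes here
;(this version may use R8 to R15)
XOR RAX,RAX
RET
ENDF
#ELSE
WndProc FRAME hWnd,uMsg,wParam,lParam
;32-bit version of the window procedure goes here
XOR EAX,EAX
RET
ENDF
#ENDIF
```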

In 32-bits this is good coding because there is a dword at [SYSTEM_INFO+4h]. The dword here holds
the system's memory page size (this assumes the structure was filled in using a call to the
GetSystemInfo API).
In 64-bits this is bad because the value at +4h is still a dword, but you are now sending a
qword to VirtualFree and not just a dword. This should be coded as follows instead:-
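
One way to do this is sketched below, assuming the address of the memory is held in a hypothetical qword called lpMem. The dword is loaded into EAX, which zero-extends into RAX, and the full register is then passed to the API:-

```asm
MOV EAX,[SYSTEM_INFO+4h]                ;read the page size as a dword only
INVOKE VirtualFree,[lpMem],RAX,4000h    ;4000h=MEM_DECOMMIT, high dword of RAX is zero
```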

Here the call puts a 32-bit value into the dword SIZEOF_WORKAREA which is correct. However
assembling and running the same code in a 64-bit system would overwrite the next dword in
memory as well (a qword is sent not a dword). So you need to enlarge SIZEOF_WORKAREA to
a qword.
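
A minimal sketch of the enlarged declaration, assuming x64 is tested with #IF so that the same source still suits 32-bits:-

```asm
DATA
#IF x64
SIZEOF_WORKAREA DQ 0    ;the API writes a qword here in 64-bits
#ELSE
SIZEOF_WORKAREA DD 0    ;a dword is enough in 32-bits
#ENDIF
```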

Forgetting that all calls are now to 64-bit values.

This can easily be forgotten when using tables to control movement of execution around
your code. Take the case of a simple table of labels for example:-

DATA
Table DD CODELABEL,2h
CODE
CALL [Table]

or

DATA
Table DD CODELABEL,2h
CODE
MOV RSI,ADDR Table
CALL [RSI]

This will call a 64-bit address with CODELABEL's address in the low dword and 2 in the high
dword. This will produce an error at run-time. The solution for internal calls is to code as
follows:-
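
A minimal sketch of one such fix, assuming the table entries are simply enlarged to qwords so that each one holds a complete 64-bit address:-

```asm
DATA
Table DQ CODELABEL,2h   ;each entry is now a qword, so the next one is at Table+8
CODE
CALL [Table]            ;reads all 64 bits of CODELABEL's address
```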

Forgetting that all POPs are now to qwords.

Your existing 32-bit source code may POP into dwords in memory. For example:-

DRAW_RECTANGLE:
PUSH [RECT],[RECT+4] ;save left and top of rectangle
; code to adjust rectangle
; and then draw it
POP [RECT+4],[RECT] ;restore top and left of rectangle for future use
RET

In 64-bits a RECT structure is still 4 dwords just as it was in 32-bits. However
the second POP in the above code would rub out the second dword in the structure
because the POP is in fact 64-bits, not 32-bits.

Correct coding for 64-bits would be:-

DRAW_RECTANGLE:
PUSH [RECT],[RECT+4] ;save left and top of rectangle
; code to adjust rectangle
; and then draw it
POP RAX ;restore top of rectangle for future use
MOV [RECT+4],EAX ;insert dword only
POP RAX ;restore left of rectangle for future use
MOV [RECT],EAX ;insert dword only
RET

In the GoAsm command line, filename is the name of your asm file written either as a 64-bit source file or
a 32/64 switchable source file. Use /x86 instead of /x64 when assembling a 32/64 switchable
source file to make a 32-bit version.

The object file created by GoAsm can be sent to GoLink or another linker in the usual way.
GoLink automatically senses whether the object file is 32 or 64-bit and creates the
correct type of executable to suit.

You cannot mix 32-bit and 64-bit object files. GoLink will show an error if you try to
do this.

You do not necessarily need to make 64-bit executables on a 64-bit machine. This is because the
DLL names given to GoLink simply tell the linker that the DLL contains the APIs used by
the application and these tend to be the same between the two platforms. If your application
calls APIs specific to the 64-bit system however, this does not work.

GoAsm always aims to produce the tightest possible code from your source. In the case of x64,
GoAsm has not yet taken up all opportunities to optimise the code. This is because there are still
some unknowns, such as effects on performance of optimised code on x64.

The optimisations and refinements are listed here to help you when you look at the code produced
by GoAsm in the debugger.

GoAsm optimisations and refinements in all code

None of these affect the flags or adversely affect performance.

MOV 64-bit register,ADDR label changed to LEA 64-bit register,label. This saves
5 opcodes. One important difference between the two instructions is that the MOV version uses
an absolute relocation (hence in theory it needs to leave space for a 64-bit value to be inserted by
the linker). The LEA instruction uses RIP-relative addressing and so it can
do the same job but requires only a 32-bit space for the relative address.

PUSH or ARG ADDR Non_Local_Label also uses LEA as well as the R11 register as follows:-

LEA R11,ADDR Non_Local_Label
PUSH R11

See explanation for this. Note that this will also take place with INVOKE when pushing arguments with ADDR,
which also includes use of pointers to a string or raw data (for example 'Hello' or <'H','i',0>).

The following, however, does affect the flags:-

PUSH or ARG ADDR Local_Label is coded as follows:-

PUSH RBP
ADD D[RSP],+/-Displacement

Additional optimisations and refinements only when INVOKE is used

These may affect the flags, which does not matter when calling an API. Those that rely on
zero-extension may require another operation from the processor, but it
is assumed that this does not matter when calling an API. It is more important to keep the
code size down.

A register parameter containing zero is optimised using XOR 32-bit register. This is a
saving of between 7 and 8 bytes over the MOV equivalent.

A register parameter containing a number (an "immediate") which can fit into 32-bits is
changed to use a 32-bit register, saving between 1 and 5 bytes depending on the register and
the number.

A register parameter containing -1 is achieved by using OR 64-bit register,-1 saving
6 bytes.

If the parameter is already in the correct register no further code is emitted
because it is not required.

PUSH RSP ;save current RSP position on the stack
PUSH [RSP] ;keep another copy of that on the stack
AND SPL,0F0h ;adjust RSP to align the stack if not already there
;
; parameters dealt with here
;
SUB RSP,20h ;adjust RSP to provide placeholders
CALL TheAPI
LEA RSP,[RSP+xxh] ;get RSP back to correct place for next
POP RSP ;restore RSP to its original value

or

PUSH RSP ;save current RSP position on the stack
PUSH [RSP] ;keep another copy of that on the stack
OR SPL,8h ;adjust RSP to align the stack if not already there
;
; parameters dealt with here
;
SUB RSP,20h ;adjust RSP to provide placeholders
CALL TheAPI
LEA RSP,[RSP+xxh] ;get RSP back to correct place for next
POP RSP ;restore RSP to its original value
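
To see why a saved copy of RSP is always found again, here is a worked trace of the first sequence. It is only a sketch under stated assumptions: RSP is taken to be 12FF48h on entry, the API has no stack parameters, and xxh is taken to be 28h (the 20h of placeholders plus 8).

```asm
                        ;RSP=12FF48h (not 16-byte aligned)
PUSH RSP                ;RSP=12FF40h, [12FF40h]=12FF48h
PUSH [RSP]              ;RSP=12FF38h, [12FF38h]=12FF48h
AND SPL,0F0h            ;RSP=12FF30h (now 16-byte aligned)
SUB RSP,20h             ;RSP=12FF10h (still aligned at the call)
CALL TheAPI             ;RSP=12FF10h again on return
LEA RSP,[RSP+28h]       ;RSP=12FF38h, pointing at the second copy
POP RSP                 ;RSP=12FF48h, the original value
```

Had RSP been 12FF40h on entry, the two copies would be at 12FF38h and 12FF30h, the AND would leave RSP unchanged at 12FF30h, and the LEA would again bring it to 12FF38h, where this time the first copy is stored. Either way the POP finds the original value.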

Note that it is possible that some of these optimisations may adversely affect performance.

Using the 64-bit registers (RAX to RSP) as pointers to memory (for example MOV [RSI],AL)
saves a byte over using the 32-bit versions (for example MOV [ESI],AL). This is because in such
instructions a 67h override byte is needed for the 32-bit version.
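
The two forms are shown below with their usual encodings in 64-bit mode, as a sketch for comparison. The byte values are the standard processor encodings and are not taken from a GoAsm listing:-

```asm
MOV [RSI],AL            ;88 06     (2 bytes)
MOV [ESI],AL            ;67 88 06  (3 bytes, 67h is the address-size override)
```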

The opposite is the case when you use registers to hold immediates (numbers). In those
cases using the enlarged registers (RAX to RSP) and the extended registers (R8 to R15) or any
of the new register addressing methods, adds at least a byte to each instruction. For
example, MOV RAX,23456h is 2 bytes larger than MOV EAX,23456h. The contrast is even greater
using larger numbers which are above 7FFFFFFFh because these have to be coded as full 64-bit
numbers if you use a 64-bit register. So for example MOV RAX,80234560h codes 5 bytes larger
than MOV EAX,80234560h. If the number you wish to move will fit into a byte, then even greater
savings can be achieved, for example MOV AL,88h codes as 2 bytes, but MOV RAX,88h is 10 bytes.
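
As a sketch, here are the shortest standard encodings of some of these instructions in 64-bit mode. They are not taken from a GoAsm listing, and an assembler is free to choose a longer form:-

```asm
MOV EAX,23456h          ;B8 56 34 02 00                 (5 bytes)
MOV RAX,23456h          ;48 C7 C0 56 34 02 00           (7 bytes)
MOV EAX,80234560h       ;B8 60 45 23 80                 (5 bytes)
MOV RAX,80234560h       ;48 B8 60 45 23 80 00 00 00 00  (10 bytes)
MOV AL,88h              ;B0 88                          (2 bytes)
```

The 48h byte is the prefix which selects a 64-bit operand. Note that writing to EAX also zeroes the high dword of RAX, whereas writing to AL leaves the rest of RAX unchanged.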

DEC and INC (with a register) now use two opcodes, whereas in x86 processors they were very
frugal, using only one opcode. But there is still an advantage in using this over SUB register,1
or ADD register,1 which is one byte longer. SUB or ADD can still be used if you need to test
the carry flag after the instruction.

In 64-bit programming LEA register,Label is 5 opcodes shorter than
MOV register,ADDR Label yet they achieve the same result. In GoAsm source code however, you
can use either since GoAsm automatically uses the shortest form.

PUSH ADDR THING codes as 9 bytes, whereas if you use LEA RAX,THING followed by PUSH RAX instead,
this is 8 bytes. However, it changes the content of the RAX register.

A good way to fill a register with -1, is to use OR register,-1 which in the case
of a 64-bit register is 4 bytes, a saving of 6 bytes over MOV register,-1. However, OR
affects the flags, but MOV does not.

Compares in the range -80h to +7Fh code as 4 bytes (eg. CMP RDX,-80h to RDX,7Fh) but outside
that range they code as 7 bytes (so eg. CMP RDX,80h is 7 bytes).

You can still use LEA to do intra-register arithmetic for example LEA RAX,[RAX+RAX*2] which
multiplies RAX by three. This codes as 4 bytes.
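
For comparison, a sketch of the standard encodings of some of the instructions mentioned in these hints (again these are the usual processor encodings, not output copied from GoAsm):-

```asm
OR RAX,-1               ;48 83 C8 FF            (4 bytes, changes the flags)
CMP RDX,7Fh             ;48 83 FA 7F            (4 bytes)
CMP RDX,80h             ;48 81 FA 80 00 00 00   (7 bytes)
LEA RAX,[RAX+RAX*2]     ;48 8D 04 40            (4 bytes, RAX=RAX*3)
```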