Even better, apparently the 68k is a common platform for teaching assembly instructions. Thank you all for making this community available.

Since I know nothing about assembly (and because I don't want to purchase IDA in order to disassemble my controller's memory), I thought that I would learn assembly by writing a disassembler for the 68k code (in the form of an S-record) that runs on the controller of interest.

First, I found a list of all of the assembly code instructions and how they are encoded in binary:

I then went through this document and extracted all of the formats for what I perceived as relevant instructions. I had to understand how effective addressing was encoded too. Similarly, some instructions do not allow certain effective addressing modes, so I tried to account for this.

I wrote a C++ program which interprets the S-record and compares it against the possible binary instruction encodings, taking the first match that it finds and writing it to another text file. If anyone is interested, I'll find a way to get this program onto git (just note that I wrote it for myself, so it is not easily applicable to a wide range of uses).
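For anyone following along, here is a minimal sketch of the S-record reading step. It is not the program described above; the names `SRecord` and `parseSRecordLine` are made up for illustration. It assumes the usual Motorola layout: `S`, a type digit, a byte count, a 2/3/4-byte address for S1/S2/S3, the data, and a checksum chosen so that the low byte of the sum of all bytes after the type is 0xFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One decoded data record: load address plus payload bytes.
// (Hypothetical type, for illustration only.)
struct SRecord {
    std::uint32_t address = 0;
    std::vector<std::uint8_t> data;
};

// Value of one hex digit, or -1 if the character is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Parse one S1/S2/S3 line into `out`. Returns false for other record
// types, malformed lines, or a checksum mismatch.
bool parseSRecordLine(const std::string& line, SRecord& out) {
    if (line.size() < 4 || line[0] != 'S') return false;
    std::size_t addrBytes = 0;
    switch (line[1]) {
        case '1': addrBytes = 2; break;  // 16-bit address
        case '2': addrBytes = 3; break;  // 24-bit address
        case '3': addrBytes = 4; break;  // 32-bit address
        default: return false;           // header, count, or termination record
    }
    // Decode the hex pairs that follow "Sx"; stop at the first non-hex
    // character (for example a trailing carriage return).
    std::vector<std::uint8_t> bytes;
    for (std::size_t i = 2; i + 1 < line.size(); i += 2) {
        int hi = hexValue(line[i]);
        int lo = hexValue(line[i + 1]);
        if (hi < 0 || lo < 0) break;
        bytes.push_back(static_cast<std::uint8_t>(hi * 16 + lo));
    }
    if (bytes.empty()) return false;
    // The count covers everything after itself: address + data + checksum.
    std::size_t count = bytes[0];
    if (count < addrBytes + 1 || bytes.size() < count + 1) return false;
    // Low byte of the sum over count, address, data, and checksum must be 0xFF.
    unsigned sum = 0;
    for (std::size_t i = 0; i <= count; ++i) sum += bytes[i];
    if ((sum & 0xFF) != 0xFF) return false;
    out.address = 0;
    for (std::size_t i = 1; i <= addrBytes; ++i) {
        out.address = (out.address << 8) | bytes[i];
    }
    out.data.assign(bytes.begin() + 1 + addrBytes, bytes.begin() + count);
    return true;
}
```

Calling `parseSRecordLine("S10510004E7527", rec)` should give address 0x1000 and the two data bytes 4E 75 (an RTS). Checking the checksum up front at least rules out a corrupted dump as the cause of "no match" lines.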

I compared the output of my program when given the s-records created by EASy68k from the example 1, 2, and 4 to the original instructions and they matched well (excepting the recovery of stored variables).

So far, I have learned quite a lot about how assembly code works in order to create this.

However, my original interest is in understanding the instructions recovered from about 1 million memory addresses. My code fails to match certain lines in the instruction:

example:
0x084020: 540704: no match 0000000001001110 0000000000000000 1110101001100000 0000000000010000 0000001110110000 0000000010001111 0000001001110001 0000000110011100

May I please capture anyone else's interest in this project (or disassembly in general)? I believe that I would benefit from conversation about assembly encoding and disassembly.

Last edited by Obeisance on Wed Sep 28, 2016 12:40 am, edited 1 time in total.

So far, my code operates on a few assumptions and I can identify a few faults.

Assumptions:
1) All instructions are consecutive in memory
2) All instructions must start on even addresses

Thus, when I find a match for an instruction I move forward by the number of bytes that the instruction is composed of before checking for the next instruction. I do not think that I explicitly force the number of bytes in an instruction to be even, though. If I incorrectly identify an instruction then the code from that point forward will be incorrectly interpreted.

If I do not find a match for a set of bytes, I step forward by 2 bytes to the next even memory address.

Faults that I am aware of:
1) If an instruction is incorrectly identified, all subsequent disassembly is potentially incorrect
2) I have found places where a branch instruction leads to an odd-numbered memory address -> even after properly accounting for the two's complement displacement, this is inappropriate. I believe that I should not have to force the number to be even, thus this is likely due to an incorrect starting address for interpretation (off by two bytes, maybe?)
3) Some branch instructions lead to the middle of another instruction -> also likely due to incorrect interpretation prior to that address
4) Some lines which are in areas where I expect that instructions should be listed are not matched to an instruction
5) Stored data or variables are not distinguished from instructions in binary -> my script will attempt to fit them to code even if that was not their original purpose
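For fault 2, it may help to pin down the branch arithmetic. This is a minimal sketch, assuming the standard Bcc/BRA/BSR encoding: the low byte of the opcode word is an 8-bit two's complement displacement, a low byte of $00 means a 16-bit displacement follows in the next word, and the displacement is added to the address of the opcode word plus 2. The function name `branchTarget` is hypothetical, and the 32-bit form (low byte $FF on the 68020 and later) is left out.

```cpp
#include <cstdint>

// Target address of a 68k Bcc/BRA/BSR instruction.
//   opcodeAddr: address of the opcode word
//   opcode:     the 16-bit opcode word (0x6xxx)
//   ext:        the following 16-bit word, used only when the low byte is 0x00
// The 32-bit displacement form (low byte 0xFF, 68020+) is not handled here.
std::uint32_t branchTarget(std::uint32_t opcodeAddr,
                           std::uint16_t opcode,
                           std::uint16_t ext) {
    std::int32_t disp;
    unsigned d8 = opcode & 0xFFu;
    if (d8 == 0x00) {
        // Sign-extend the 16-bit displacement.
        disp = static_cast<std::int32_t>(ext ^ 0x8000) - 0x8000;
    } else {
        // Sign-extend the 8-bit displacement.
        disp = static_cast<std::int32_t>(d8 ^ 0x80u) - 0x80;
    }
    // The displacement is relative to the opcode address plus 2.
    return opcodeAddr + 2u + static_cast<std::uint32_t>(disp);
}

// An odd target is never valid for 68k code and suggests a misdecode upstream.
bool isOddTarget(std::uint32_t target) {
    return (target & 1u) != 0;
}
```

As a sanity check, the word $60FE (a BRA.S with displacement -2) at address $1000 should resolve back to $1000, which is the classic branch-to-self infinite loop.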

I believe that I can fix problem 3 by listing every branch landing address in a separate file during disassembly and then comparing each subsequent disassembly attempt to that file to make sure that I don't try to create an instruction which has a landing point in its middle. A more advanced version of the disassembler could follow the process flow of the code and only attempt to disassemble code which fits along the possible branch paths (but this strategy may be a tad too complicated for me to execute).
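The landing-point check can be sketched as below. This is only an illustration and makes one change to the idea as described: it assumes the landing addresses are held in an in-memory `std::set` rather than in a separate file. The name `bisectedByLanding` is made up.

```cpp
#include <cstdint>
#include <set>

// True if any recorded landing address falls strictly inside the
// instruction occupying [start, start + length). A landing address equal
// to `start` is fine: that is a branch to the instruction itself.
bool bisectedByLanding(const std::set<std::uint32_t>& landings,
                       std::uint32_t start,
                       std::uint32_t length) {
    // First landing address greater than `start`.
    std::set<std::uint32_t>::const_iterator it = landings.upper_bound(start);
    return it != landings.end() && *it < start + length;
}
```

With a sorted set each query is a logarithmic-time lookup, so the check stays cheap even with many landing addresses recorded.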

I could also, as a starting point, create a file which flags instructions which branch to odd locations (or which are on odd memory locations) as erroneous in order to comprehend the magnitude of each issue.

Other than that, I'm not sure how to tackle the other problems. Especially the one where an instruction cannot be identified - perhaps there are some operation codes which can be user defined in the end application. Or maybe there is an error in the listing of allowable effective addressing modes for certain instructions. I'm not even sure how I could easily guess what potential instructions could go in these places... maybe I'll let the script create a text file which lists potential candidates for a given non-fitted memory address and manually compare the binary.

As I said in my last post, shorter programs seem to be disassembled successfully by my current code. Somehow handling the longer codes will require more clever algorithm control.

Any thoughts about how I could fix the issues? Are my assumptions reasonable?

Many disassemblers follow the instructions to decide what is code and what is data. Begin at the starting address and disassemble the code. When a branch instruction is found follow both possible paths. Keep a table of memory accessed by the code. These memory areas are probably variables.
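A minimal sketch of that flow-following approach, using a worklist instead of recursion. The `Decoded` struct and `Decoder` callback are hypothetical stand-ins for whatever the real instruction matcher returns. Only statically known targets are followed, so indirect jumps through registers are not covered.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

// What the traversal needs to know about one decoded instruction.
// (Hypothetical interface, for illustration only.)
struct Decoded {
    std::uint32_t length = 0;            // size in bytes; 0 means "no match"
    bool fallsThrough = true;            // false for BRA, JMP, RTS, ...
    std::vector<std::uint32_t> targets;  // statically known branch/jump targets
};

// Given an address, describe the instruction found there.
using Decoder = std::function<Decoded(std::uint32_t)>;

// Return the start address of every instruction reachable from `entry`.
std::set<std::uint32_t> traceCode(std::uint32_t entry, const Decoder& decode) {
    std::set<std::uint32_t> visited;
    std::vector<std::uint32_t> work;
    work.push_back(entry);
    while (!work.empty()) {
        std::uint32_t pc = work.back();
        work.pop_back();
        if (visited.count(pc) != 0) continue;  // already disassembled
        Decoded d = decode(pc);
        if (d.length == 0) continue;           // undecodable: leave for manual review
        visited.insert(pc);
        // Follow every known target, and the next instruction if control
        // can fall through (conditional branches, subroutine calls, ...).
        for (std::size_t i = 0; i < d.targets.size(); ++i) {
            work.push_back(d.targets[i]);
        }
        if (d.fallsThrough) work.push_back(pc + d.length);
    }
    return visited;
}
```

Anything in the image that never ends up in the returned set is a candidate for data, or for code that is only reached indirectly, such as exception handlers. Seeding the worklist with the addresses from the vector table as additional entry points is one way to pick some of those up.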


Thank you for your advice. That is quite a challenging proposition for me.

I believe that I would have to effectively interpret the intent of the code (nearly a full simulation, as far as I can tell) in order to properly follow all of the branch instructions.

For instance, in the disassembly that I have so far I have seen two ways of accessing a particular subroutine:

Code:

LEA ($532).L,A2 ;load address into A2
....
JSR (A2) ;jump to the address pointed to by A2

In order to properly follow the first version, I would need to keep track of what is in each address register as I walk through the code. I would need to prepare for the case where someone used some other command to populate the register used for the JSR command.
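Short of a full simulation, a rough heuristic for this particular pattern is to remember the last constant that `LEA (xxx).L,An` put into each address register and use it when a `JSR (An)` shows up. The sketch below assumes the standard encodings ($41F9 with the register number in bits 11-9 for the LEA, $4E90 plus the register number for the JSR). The function name is made up, every other word is stepped over as if it were a one-word instruction, and nothing invalidates a register when some other instruction overwrites it, so the results are hints to check by hand rather than facts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scan 16-bit words for LEA (xxx).L,An followed later by JSR (An) and
// return the call targets that could be resolved.
// Simplifications (assumptions of this sketch):
//   - any other word is skipped as if it were a one-word instruction
//   - other writes to An are not tracked, so a stale value may be reported
std::vector<std::uint32_t> resolveIndirectJsr(const std::vector<std::uint16_t>& words) {
    bool known[8] = {false, false, false, false, false, false, false, false};
    std::uint32_t value[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    std::vector<std::uint32_t> targets;
    std::size_t i = 0;
    while (i < words.size()) {
        std::uint16_t w = words[i];
        if ((w & 0xF1FF) == 0x41F9 && i + 2 < words.size()) {
            // LEA (xxx).L,An: the 32-bit address follows in two words.
            unsigned reg = (w >> 9) & 7u;
            value[reg] = (static_cast<std::uint32_t>(words[i + 1]) << 16) | words[i + 2];
            known[reg] = true;
            i += 3;
        } else if ((w & 0xFFF8) == 0x4E90) {
            // JSR (An): report the target if the register value is known.
            unsigned reg = w & 7u;
            if (known[reg]) targets.push_back(value[reg]);
            i += 1;
        } else {
            i += 1;
        }
    }
    return targets;
}
```

For the example above, the word sequence $45F9 $0000 $0532 ... $4E92 would report $532 as a subroutine entry point, which could then be added to the list of landing addresses.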

As a first step towards fixing my code's shortcoming, I left the disassembler to step through the code without regard to process flow, but I tried to account for jump/branch operations and possible data storage points. I made a change to my code where it tracks jump command landing points which are of higher address than the current instruction of interest which is being decoded (erasing those lower because the disassembler is still stepping forward through the code and will not visit lower addresses again). When an instruction would be bisected by a landing point, the disassembler rejects the interpretation. This increased my code's run time from about 4 minutes to about 10 minutes (lots of time eaten up by reading and writing a text file to store the landing points for jump routines). It still cannot capture the indirect JSR or JMP effective addressing modes, though, because I'm not tracking what is in the registers.

I did a similar accounting for non-jump commands which reference effective addresses containing a ($XXXX) type mode. These may be potential data addresses, thus the disassembler rejects instructions which occupy these addresses. Unfortunately, I had to ignore the LEA operation because I have found at least one instance where it is used to set up a jump, rather than loading the pointer to a data address. Another issue- this increased the run time of the disassembler to about an hour and a half.

The changes I made are rather daft, and I think that I may have to try to implement the more complicated version which tracks along the code's operation path. In doing so, I fear that the disassembly will miss some code which may not normally be accessed- for instance the exception handling code. I have also seen infinite loop code written in places where the PC should never have access to (I assume this was placed here as a fail-safe).

On the note of the exception handling code, the beginning of the disassembly points to 0x000000 as the beginning of the first vector table (made up of 256 long words, the VBR is sometimes set to another location in memory in my disassembly) which indicate what addresses contain code that each exception type should go to. It is unfortunately uninteresting- all addresses in the vector table (apart from the SP and PC reset values, and the Trace exception) point to the same location. This means that the code which I have failed to interpret in the disassembly (even with operation codes 1111 or 1010) is not likely some programmer defined special function. I may be in over my head.
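To check how uniform the vector table really is, the entries can be read directly from the memory image. This is a small sketch, assuming the image is a flat byte vector whose index equals the address, and that entries are 32-bit big-endian values with vector 0 holding the reset SP and vector 1 the reset PC. The names `readLongBE` and `handlerHistogram` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Read a 32-bit big-endian value (68k byte order) from a memory image.
// Uses at(), so reading past the end of the image throws std::out_of_range.
std::uint32_t readLongBE(const std::vector<std::uint8_t>& image, std::size_t offset) {
    return (static_cast<std::uint32_t>(image.at(offset)) << 24) |
           (static_cast<std::uint32_t>(image.at(offset + 1)) << 16) |
           (static_cast<std::uint32_t>(image.at(offset + 2)) << 8) |
            static_cast<std::uint32_t>(image.at(offset + 3));
}

// Count how many of vectors 2..255 point at each handler address.
// `tableBase` is the image offset of the table (0 after reset, or the
// value loaded into the VBR for a relocated table).
std::map<std::uint32_t, int> handlerHistogram(const std::vector<std::uint8_t>& image,
                                              std::size_t tableBase) {
    std::map<std::uint32_t, int> histogram;
    for (std::size_t n = 2; n < 256; ++n) {
        ++histogram[readLongBE(image, tableBase + 4 * n)];
    }
    return histogram;
}
```

Running the same histogram at each address that the code loads into the VBR would show whether any of the relocated tables is more interesting than the one at 0x000000.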

2) I find that the disassembly, to truly follow the flow of the code, would need to have a full simulation of the function of the target microcontroller. For example, I find that there is a function in the code which is called and copies a large block of memory from one location to another location in memory and then uses a JMP command to push the program counter to that location. Within the copied/relocated code, there are positional references (ex. VBR command) which are relative to the moved location (i.e. they are not valid references to code in the locations as it is read from the hex file). This means that in order for a disassembler to follow the flow of the code and all of the branching it entails, it would need to interpret the intent of the copying algorithm as well as performing the copy. Only then could it read the instructions meant for those memory addresses and interpret them correctly.
