
stalebread queries: "Me and a team of other students have been tasked with reverse engineering a massive C/C++ (mostly C) computer game of about half a million lines. We have most of the source, but no clue of how to approach a task of this magnitude. Anyone have suggestions of programs, or techniques we could use to understand the structure of the game?"

I would presume that the code came from a liquidation/auction/takeover and the human capital that produced it is no longer available. First, I would try to hire one of the original sw architects to do some consulting. Who knows? They might have some email files that could be considered "part of the software".

"Human capital" is a rather common economics term to refer to those skills and knowledge that enable an employee to produce the desired works. Use the wiki, Luke. In this case, it is the experience and serenity which makes the Tao Master of programming worth several novice salaries:)

Why, yes! It is legal. In fact, the right to reverse engineer a piece of software or hardware for interoperability is protected in the US, IIRC. Hence Intel clones, PC clones, Samba, etc.

But the article poster has access to the source code, something not usually associated with 'reverse engineering'. Products are still protected by patents, copyright and trademarks, and writing Samba (for example) after seeing Microsoft's code would open one up to legal woes.

C-Scope [sourceforge.net] is a cool, free, class browsing tool that can make vi feel like a full-featured IDE. If you're an OS kind of person, take a look at this before you jump into the commercial tools.

I presume you mean reverse engineering in the program understanding sense. In which case the way to go about it is to sit down and read the source code, taking notes as you go. You should then set yourself some maintenance tasks - modifying the source code is the best way to find out if you understand it or not.

...a maintenance task, not a coding task. S/he is probably looking for a UML model, as I implied elsewhere in the thread. IBM Rational, Gentleware, Borland and some FOSS projects have software just for this sort of thing: modeling all of the classes, structs, member variables and functions along with displayable relationships (using arrows, lines, and nesting). What's more, some of these tools can be used to modify programs within the model, and then update the source code (forward-engineering). They can also

Yep, lots of luck finding a single one of these tools that works on C code. Although making pretty pictures can certainly be a good way to get an overview of the software, and maybe students need that kind of assistance. Personally I think something like C-Scope is more than enough.

In theory, C should only be a problem if it was coded without regard for OOP. And even then, structs will likely abound... you can pull those into the model with the built-in reverse engineering and use that as the nucleus for modeling the rest of the program either by hand or with the help of scripts. For instance, in Rose you could write a script to represent .c files and functions as stereotyped components and classes... and maybe even show what sort of data gets passed between functions. UML assumes OOP,

Don't know the exact quote, but Sun Tzu has some wisdom that applies. (Paraphrased) "Commanding many is the same as commanding few; it is generally a matter of organization."

So I would read some general stuff on how to do this (Practical C Programming has a short chapter for example, but you probably want a book all about it). I would then meticulously apply what they do with their samples of a few to a few thousand lines to the whole thing, as the parent post suggests.

If he already has the source, then this problem may be easy enough to make asking Slashdot unnecessary. However, there are instances in which asking Slashdot is necessary. If they didn't have most of the source, for example. Or, for example, in this article [slashdot.org], where an IT guy was asked to make an infrastructure for over one million email accounts that must scale perfectly and have 99.9% uptime. Show me a university that trains students for that.

It sounds like you are wanting to refactor the code, or port it to another platform. If you are missing some of the code, then you'll have to reverse engineer that portion of it.

As for how to approach it - I think it depends on the size of your team, and what goals you set for the effort. Are you just wanting to learn? Or do you want to improve performance? Or make it work on another platform? What are the goals for this project?

Once you know those details, they might give you an idea where to begin.

Yeah, the "most of the source code" part is a bit scary. If they really are talking about reverse engineering from executables they are in for a hell of a time. The state of the art is a project I work on now and then, Boomerang [sourceforge.net], and it isn't for the faint of heart. I've been hearing for years about people who are working on decompilation tools that are integrated into IDA Pro [datarescue.com] but I've yet to see it. The time where you can enter a binary, press a button and get back compilable, maintainable source code is still a long long way off. But that's good, friends of mine do commercial decompilation work.

To understand how games are made in the first place. What kind of a game is it? Is it a single player game, or multiplayer game? If it's multiplayer you'll have to watch out for code designed to keep the game logic at a fixed rate; all other code will be built on top of that. Single-player games, on the other hand, don't have to worry about all the intricacies of keeping the various game clients in sync.
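
For anyone who hasn't seen fixed-rate logic before, here is a minimal sketch of the usual accumulator pattern, so you know what shape to look for in the engine. It is written in Python for brevity and is not taken from the game in question; `run_fixed_rate`, `update`, `render` and the list of frame times are all hypothetical names for illustration.

```python
def run_fixed_rate(update, render, frame_times, tick=1.0 / 30.0):
    """Advance game logic in fixed `tick` steps, whatever the frame duration.

    frame_times is a hypothetical list of elapsed seconds per rendered frame;
    a real game would read a clock instead.
    Returns the number of logic updates performed.
    """
    accumulator = 0.0
    updates = 0
    for elapsed in frame_times:
        accumulator += elapsed
        # Run as many fixed-size logic steps as the elapsed time covers.
        while accumulator >= tick:
            update(tick)
            accumulator -= tick
            updates += 1
        # Rendering happens once per frame, independent of the logic rate.
        render()
    return updates
```

In C code the same idea tends to show up as a loop that compares a timer against a constant step before calling the main update routine, so searching for the timer calls is often a quick way to find the heart of the engine.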

So it really depends on the kind of game it is. Since I'm assuming you know this, I would suggest trying to first think how you would write the game yourself, and then see if you find any similarities between your ideas for the engine structure and the game's.

Sounds like MUD/MUSH code to me. I can't think of many games with 500K lines of source code written exclusively in C/C++ that aren't MUDs.
If that's the case, itistoday is right in that MUD code is built around a game timer (for fixed rate logic, MUDs use 'ticks', generally firing off around once every 30 seconds), and it's also built around clever use of sockets to allow multiple realtime connections, which is generally one of the harder things to accomplish (read, was).
If I wanted to get to the nitty

...it's improved mightily since I last used it. Granted, it was reverse-engineering some Java code, but it wouldn't do squat unless it could compile the whole thing (I assume it created a symbol table/parse tree and based its analysis on that). Which made it useless for documenting portions of a product, or one that was in flux and not in a cleanly-compilable state. Sure, you could stub out everything, but if you're talking an entire package that isn't available, then it's more work than it's worth. Perso

I used it quite nicely under Windows (Cygwin, I believe) to hack some AIX code (a big QT/Motif app). The AIX version was so old that Source Nav wouldn't compile there, but it ran on cygwin and I used to go to the file-shares where the AIX source code was and it ran quite nicely and got me out of a lot of trouble.

First run the code under a profiler. This will give you some idea of where it spends its time. Running under a first-class debugger (SoftICE?) will also help because you can haul off stack-traces and see what's been called from where.

I agree it's very useful. Callgrind && kcachegrind usually give a lot of insight into the code. I wonder if callgrind can be hacked to produce nice seq. UML diagrams (it knows when objects are created and what messages are sent -- should be a very nice addition to the analysis).

Just because you have the code doesn't mean you know how the system is assembled and how all the components work together. "Reverse Engineering" is pretty loosely defined, but if you take it literally, it's just that: reversing the engineering process. From the description of the question, the poster is looking to take the finished product (the source for this game) and move back up to the high level design phase. This means analyzing the module interconnections, class hierarchy, and that sort of stuff. It doesn't necessarily mean they want to "port" or "compile" it.

Reminded me of a porting project I had a few years back. We had pretty much all the source code from the game, but only binaries for the proprietary libraries it relied heavily on. In fact, it was only when we started pulling the thing apart that we found out just how much stuff was hidden away in the libraries. For example, a lot of the code seemed very disjointed - most sections didn't seem to be called from anywhere. We found that all entities (player, enemies, effects, menus, sounds, and so on) were set

>We have most of the source, but no clue of how to approach a task of this magnitude.

Reverse engineering is generally thought of as a "cleanroom" technique that involves having the binary and/or specification but not the source. If you have the source, then you're just reading/rewriting it (or perhaps just copying it and doing s/Old Name/Our Cool New Game That's Nothing Like Old Name/).

>Anyone have suggestions of programs, or techniques we could use to understand the structure of the game?

>... We have most of the source, but no clue of how to approach a task of this magnitude.

>Reverse engineering is generally thought of as a "cleanroom" technique that involves having the binary and/or specification but not the source. If you have the source, then you're just reading/rewriting it

If you have the source but not the spec, and you're working on recreating the spec, then you're reverse engineering.

You can't call it "reading" and "writing." Too many computer people think that "reading" and "writing" are fast, straightforward, linear techniques that result in a disorganized mess - which for them is probably the truth. They would never understand that putting serious mental effort into creating a coherent, useful work could be called "writing," or that a systematic, intelligent effort to understand something could be called "reading."

Graphviz [graphviz.org] and GNU GLOBAL [gnu.org] used in combination give a graphical and web-accessible view of a large, unknown software system. This will give your team a high-level view of the modules and how they interact. This will make it easier to discern the system design.
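
To make the Graphviz half of that concrete, here is a minimal sketch in Python. It assumes you have already extracted a list of (caller, callee) pairs with whatever cross-referencing tool you settle on; the function name `to_dot` and the sample edges are made up for illustration. It only emits DOT text, which Graphviz can then render.

```python
def to_dot(edges, name="calls"):
    """Render (caller, callee) pairs as Graphviz DOT text.

    Duplicate edges are collapsed and the output is sorted so that
    repeated runs over the same code base give identical files.
    """
    lines = [f"digraph {name} {{"]
    for caller, callee in sorted(set(edges)):
        lines.append(f'    "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical edges; replace with pairs extracted from the real code.
    print(to_dot([("main", "init"), ("main", "game_loop"), ("game_loop", "update")]))
```

Save the output to a file such as calls.dot and run `dot -Tsvg calls.dot -o calls.svg` to get a picture. For half a million lines you will want one graph per module rather than one for the whole program, or the result is unreadable.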

What's the goal of your project? To figure out how they do one particular thing? To figure out how the code works in a general way? To understand 1 subsystem? For the first, I'd try and find the functions called around when it occurs, and use a debugger to step through what happens.

For the second, I'd study the interface files and use cscope. Figure out what is calling what, and see how it's interlinked.

For the third, you need to do the same as above on a local level- between files of the module. Then d

if you have access to (ir)rational rose, running your code through that will probably speed up a lot of this process. otherwise, a combination of cccc and doxygen with the appropriate config files will give you about the best start you can hope for. hopefully, the code has reasonable documentation. if not, you're basically screwed --- you'll have to work out the use cases and reconstruct your software from there.

C/C++ more correctly but rarely known as C++/C is C++ written in the style of C, and is a wicked waste of Bjarne's time.

The guy behind xapian.org/xapian.com, Olly Betts, knows how to write C++ with proper and repeated use of the base classes, iterators and templates and to be frank, his C++ looks almost like Perl, and it is a delight to read.

When somebody says C/C++ it can easily be something less evil, like C++ that uses some libraries written in C, or C++ with a C API binding. You know, kind of like when somebody says their app is written in Java/C. Many projects do use multiple languages, and C and C++ are both very popular.

If C++ was statically deterministic during initialization and the templated versions did not grow exponentially, then it might still be a viable language even though it is an academic version of IBM's PL/I.
C was and still is a great generic universal (somewhat portable) assembler language, which is why it is at the core of so many OSs.
Why didn't they just generate C from Smalltalk instead of writing C++? Gee, I bet that is/was already being done!
And I don't even code in Smalltalk, but am stuck in the J

It sounds like you are unable to build the complete system and run it, since you're missing functionality. This removes the possibility of using runtime tracing tools.

The first thing I would do is run something like Doxygen [doxygen.org] over it to generate a cross-referenced description of the structures. It won't give you a global view of things, but it will give you a decent browsable view of the code itself. Another response mentioned GNU GLOBAL [gnu.org] which may work better for you. Yet another possibility is LXR [linux.no], though it may not work as well in C++. Regardless, a nice thing about Doxygen is that, when used with GraphViz, you can get useful diagrams generated showing class containment and file inclusion graphs.

After you have that, get out your paper and pencil, and start drawing and manually tracing things. That's how I go about coming up to speed on new code I can't execute and step through. Eventually transfer that knowledge into a text file (or, nowadays, a wiki) so that others can benefit from it.

Doxygen is a great help in understanding someone's code. If you have Graphviz installed, and all the options turned on, it will generate call graphs, header dependencies, and even UML diagrams of your classes and structures, along with an html-ized view of the code. And best of all it's free.

I would second this suggestion. I would also suggest you turn on the options to fully cross-reference and include the source code, as well as generate the Graphviz diagrams (as suggested by another reply to the parent post).

From the description it sounds like you are trying to understand how the program works (as you have the source code), rather than 'reverse engineering', which in its usual meaning assumes you don't have it.

So my suggestion is start by getting it compiled, up and running;) You can then use the debugger to breakpoint the code and follow it through. You say you have most of the source code. Is the rest available as libraries to link to? Otherwise you could create 'fake' libraries just to get it compiling and running.

Another thing that might help is asking yourself what are the important events in the game and finding where they are handled (where is the score updated, where is damage worked out, spawn/death events handled). Sometimes games use global event handlers/listeners so different parts of the code can hook into the important stuff. Find where these events are handled or callbacks registered and it should give you a quick insight into the more interesting parts of the code;)
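
As a rough starting point for that hunt, something like the following Python sketch can list candidate sites. It is only a crude textual scan, not a parser: it looks for identifiers followed by an opening parenthesis whose names contain a keyword you supply. The function name `find_sites` and the keywords in the usage note are assumptions for illustration, so pick terms that match the game's own vocabulary.

```python
import re

# An identifier immediately followed by "(", i.e. a call or a definition.
IDENT_CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def find_sites(source, keywords):
    """Return (line_number, identifier) pairs whose name contains a keyword.

    Matching is case-insensitive and purely textual, so comments and
    string literals can produce false positives.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in IDENT_CALL.finditer(line):
            ident = match.group(1)
            if any(k in ident.lower() for k in lowered):
                hits.append((number, ident))
    return hits
```

Run it over each file with keywords such as "score", "damage", "spawn" or "callback", then set breakpoints on the hits that look promising.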

I've been through this sort of exercise several times in my career so far. 500k LOC is too much for a small team to get a handle on in any reasonable amount of time, so don't feel too helpless... Your professor is throwing you guys to the wolves and seeing what you are able to accomplish.

As for the actual suggestions, read on:

First, you'll need a tool to generate some form of cross reference for the entire codebase... I'd recommend Doxygen (hack the config file to generate the inheritance and call

My assumption is that you're to reverse engineer the software, but have been given fragments of the source as a guide, yet still have to show your methodologies so as to prove that you didn't just re-write the source.

I'd start by actually reading the source - building it if you can. Run profilers on it and try to get some kind of visual representation of the underlying code tree. If you have source, try using something like DOXYGEN [doxygen.org] to autogen some documentation (and structure) out of it. Someone menti

We are evaluating some tools along these lines. The ones we liked most are RedHat's Source-Navigator [sourceforge.net] (GPL) and Scitool's Understand for C++ [scitools.com] ($$$).

Source-Navigator seems to be slow compared to Understand C++; I'm sure this has to do with the way they index the DB.
On the other hand, the Linux version of Understand C++ needs some polishing IMHO (too many crashes on Debian/sarge).

As for report-generating tools that just index and cross-reference the whole project, Gonzui [sourceforge.net] is a pretty good one.

First of all this is not a massive code base for a commercial computer game, it's about average. Many games get into the 1-2 million lines of code. Having said that, most games also have teams that are probably much larger than your group of students. I'm not exactly sure what you're trying to do here. As many ppl have said, reverse engineering something that you already have the source for is not really reverse engineering at all. However if I make the (somewhat suspect) assumption that your objective is to e

How about forward-engineering it? Try and add a feature to the game that doesn't already exist (and doesn't make use of any built-in scripting system/tools, etc.) That's the only way you'll really learn what's going on in there. Reading code in and of itself doesn't mean you understand what it does. But if you have to change the code, there's no doubt that you'll figure out how (at least a part of) it works!

It's pretty simple, just time consuming. I've seen a few reverse engineering books floating around: "Reversing," "Exploiting Software." Since it's mostly stdC, it shouldn't be nearly as difficult to reverse engineer. Other languages can make things more complicated (Multiple calling mechanisms, more dynamic memory allocation, etc..).

Tools:

OllyDbg - Awesome usermode debugger, probably better suited than softice for this particular task. You can add assembly wherever you want, and it will create patches for the exe that can be automagically applied. It's also FREE.

TSearch - Can search memory at runtime, set breakpoints, disassemble code on the stack, and dynamically insert new assembly at runtime. Nice for understanding the flow of the software as it runs, and identifying interesting variables and structures.

REC Decompiler - Awesome decompiler that produces a high level representation of the code. Not a replacement for your brain, but can save a lot of time tracing over assembly code to understand the purpose of a function.

WinPCap & Ethereal - For reversing game protocols, and understanding client-server interaction. Sometimes it's nicer to just figure out where the host name/IP string is located in the binary and replace it with 127.0.0.1, then write a little proxy program to sit in between the client and the server.

HVIEW: Hex editor with the ability to disassemble.

(Use Cygwin or mingw for the following)
strace: Traces signals, system calls, and spits them out to the screen.

nm: Dump binary symbol table and names.

I've definitely forgotten a plethora of other useful tools (especially the binutils ones), but the above consist of some of my favorites.

For a game, you'll probably be dealing mostly with OllyDbg, HVIEW, REC, and winpcap/proxy. I'd recommend using nm to get a list of all of the symbols in the program, and then maybe split up and assign each student some number of symbols to understand and rewrite in C. Then they can use HVIEW or OllyDbg to navigate to those symbols, and try translating them. If they have a difficult time, have them use REC to get a higher level representation they can cheat off of.
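
The "split up the symbols" step is easy to script. Below is a minimal Python sketch that assumes the default `nm` text output (address, type letter, name) saved to a string; the function names `text_symbols` and `assign` and the student names are hypothetical. It keeps only code symbols (type `T` or `t`) and deals them out round-robin.

```python
def text_symbols(nm_output):
    """Return the names of code (text-section) symbols from `nm` output."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols look like "<address> <type> <name>"; undefined
        # ones ("U name") have no address and are skipped here.
        if len(parts) == 3 and parts[1] in ("T", "t"):
            names.append(parts[2])
    return names


def assign(names, students):
    """Deal symbols out round-robin so each student gets a similar share."""
    shares = {student: [] for student in students}
    for index, name in enumerate(sorted(names)):
        shares[students[index % len(students)]].append(name)
    return shares
```

Note that demangled C++ names can contain spaces, so this simple three-column split would skip them. Grouping by source file or name prefix instead of alphabetical round-robin would probably give each student a more coherent chunk.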

So what are you trying to do? Do you want to modify this source to do something new? Do you want to document it, or represent it in some other way? You're still not telling us what you mean by 'reverse engineer'. What is your goal here?

>So what are you trying to do? Do you want to modify this source to do something new? Do you want to document it, or represent it in some other way? You're still not telling us what you mean by 'reverse engineer'. What is your goal here?

We want to do both. Right now we're at the point where we're trying to document and understand the code. Eventually our goal will be to modify the source to add some features.

I see. Well my advice from the previous post still stands. Don't try to address the codebase as a whole, but confine yourselves to the functional (rather than architectural) areas that you're interested in. Presumably you have some specific types of modification in mind so start by creating a list of each functional area that you think each modification will impact. You probably don't even need to look at the code for this part. For example, if you want to add a 'boost' feature to a racing game, then you'll

I was thinking about that tool for Perl that uses Devel::DProf and a diagramming program (GraphViz I think) to draw flow charts of the program as it runs, showing which routines are called, that would be great. See graphviz.org's resources section for some links to profilers

I wonder if something like that is available for C++. Found ROCASE [ubbcluj.ro] which looks like a CASE tool that can "reverse-engineer" (analyze) C++ files and automatically format diagrams for you to help understand the code structure. Post back he

Usually you can find a tool that will dump out all the system calls with arguments. See what the program is doing, and maybe write some scripts to analyze the log files. It gives you a genuinely useful perspective that you'll never get from reading the code.
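
A minimal sketch of such a log-analysis script, in Python, assuming the log is in strace-style text where each call starts with the syscall name and an opening parenthesis (optionally preceded by a pid). The function name `count_syscalls` and the sample log are made up for illustration, and other tracers will need a different pattern.

```python
import re
from collections import Counter

# Optional "[pid 1234] " or "1234 " prefix, then the syscall name and "(".
SYSCALL = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(log_text):
    """Count how often each system call appears in a trace log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SYSCALL.match(line)
        # Signal and exit lines ("--- ... ---", "+++ ... +++") do not match.
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `count_syscalls(text).most_common(10)` gives a quick picture of whether the program is dominated by file reads, socket traffic, or something else, which tells you which part of the code to read first.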

For programs that primarily do file processing, you can get a similar understanding by analyzing the input files and the output files.

For database programs you often can get the DBMS to log the transactions or the SQL.

Primary Questions
What is your target environment, language?
What is your objective?
What is your time line?
What are your available resources and what is their available time to apply to this effort?
Additional questions?
Do you have a working version of the original installed Game?
Can you get the missing source or if not possibly "Reverse Engineer" it?

It becomes very apparent which people read about the scene, and which people do it, and/or at least know enough to ask for more details. After reading through various posts, it was somewhat easy to pick out the people who read about reverse engineering. Mostly the choice of tools, and the attitudes they had, reflected the general material about reverse engineering. Also I found the "I need mores", which I describe as people who don't want to guess. "Tell me exactly what you are doing, and I will tel