It worked before, and all I did was add in hotboot. I think it has something to do with the 'control' value in comm.c, as that's the only call accept_new makes on any variable. I saw in a modified SWR that control was closed like this:

My other problem is that hotboot is saving items inside of corpses. Someone looted a corpse, a hotboot was done, and the items appeared back inside the corpse and on the player. I added the fixes to fread_obj so that it shouldn't save corpses, but it apparently still is. Any help?

You should use gdb to figure out where it's crashing. If you're not getting core dumps, you should figure out how to turn them on. See e.g. Nick Gammon's gdb guide for how to enable large core dumps.
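For what it's worth, the core size limit can also be raised from inside the server at startup instead of from the shell. This is only a minimal sketch, assuming a POSIX system; `enable_core_dumps` is a made-up name for illustration, not something from SWR or the hotboot snippet.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit so that a crash can leave a core file for gdb to open.
 * Returns 0 on success, -1 on error. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

If the hard limit itself is 0, this will not help, and the limit has to be raised in the shell or system configuration instead.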

with the 'control' value in comm.c, as that's the only call accept_new makes on any variable

What does this mean, more precisely?

Well, anyhow, debugging this without gdb is kind of like looking for a needle in a haystack... I've already said what is happening: a descriptor is being closed that shouldn't be, and is being added to the list of descriptors to check at network update. Why or where that happens will be very hard to tell without gdb...

You should compare your implementation of the hotboot snippet with the actual snippet to make sure that you did everything exactly as you were supposed to. Try doing it with a stock version of your codebase and see if you can get it to work there.

Set a breakpoint at the line where it exits, wait for it to crash, and then examine the stack. You should be able to find it by searching for the string "poll" in comm.c -- you're looking for a call to perror, I believe.

control is assigned about 13 lines up. It's probably the listening socket for the mud, which will be checked using either select() or poll() for incoming connections.

If that socket has already been closed, and you try to close it again, the universe will be unhappy. Likewise, if you try to close it and it hasn't yet been opened, life will be bad. C is annoying that way; the paranoid among us tend to check almost everything possible before accessing a pointer.
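One way to be paranoid about it is to route every close through a small wrapper that remembers the descriptor is gone. A rough sketch, assuming a descriptor that is not open is stored as -1; `close_once` is a hypothetical name, not part of the stock code.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: close *fd only if it looks like a live descriptor,
 * then mark it as closed so a second call becomes a harmless no-op.
 * Returns 0 if closed, 1 if there was nothing to close, -1 on error. */
int close_once(int *fd)
{
    int rc;

    if (fd == NULL || *fd < 0)
        return 1;               /* never opened, or already closed */

    rc = close(*fd);
    *fd = -1;                   /* forget the number even if close() failed */
    return rc == 0 ? 0 : -1;
}
```

Something like `close_once(&control);` would then be safe to call from more than one code path, provided control is set to -1 until the listening socket is actually opened.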

Banner said:

My other problem is that hotboot is saving items inside of corpses. Someone looted a corpse, a hotboot was done, and the items appeared back inside the corpse and on the player. I added the fixes to fread_obj so that it shouldn't save corpses, but it apparently still is. Any help?

That's a classic exploit which happens when you don't have atomic code. You go to move an object from the corpse to your inventory. The code copies the object to you, then deletes it from the corpse. If it crashes between the two operations (and the resulting hotboot code saves both inventories and reboots), you end up with two copies.

Some of your choices are:
1. Live with item duplication bugs and try to prevent things that cause crashes.
2. Reverse the order so it deletes first and then copies... that would result in item loss, rather than duplication.
3. Devise a persistent scheme for doing item transfers. If you used a database, you'd wrap both operations with a transaction, which would roll back on a crash. Not using a database, you could open a file that describes the transaction in progress so if you crash before it completes, the recovery code on the other end of the hotboot could look for such files and pick up where it left off.
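The file-based version of option 3 could look roughly like this. It is a sketch only: the journal path, the function names, and the idea of identifying objects by a numeric id are all assumptions made for illustration, and it skips fsync() and the actual replay logic.

```c
#include <stdio.h>

/* Hypothetical journal path; not taken from any real codebase. */
#define XFER_JOURNAL "transfer.journal"

/* Record the intended move BEFORE touching either inventory. */
int journal_begin(long obj_id, const char *from, const char *to)
{
    FILE *fp = fopen(XFER_JOURNAL, "w");
    int ok;

    if (fp == NULL)
        return -1;
    ok = fprintf(fp, "%ld %s %s\n", obj_id, from, to) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* The move finished cleanly: drop the journal. */
int journal_commit(void)
{
    return remove(XFER_JOURNAL);
}

/* Called by the recovery code after a reboot. Returns 1 and fills the
 * outputs if a move was interrupted, 0 if there is no journal, and -1 if
 * the journal is unreadable. from and to must hold at least 64 bytes. */
int journal_pending(long *obj_id, char *from, char *to)
{
    FILE *fp = fopen(XFER_JOURNAL, "r");
    int n;

    if (fp == NULL)
        return 0;
    n = fscanf(fp, "%ld %63s %63s", obj_id, from, to);
    fclose(fp);
    return n == 3 ? 1 : -1;
}
```

The transfer itself would sit between `journal_begin` and `journal_commit`; if `journal_pending` returns 1 on startup, the recovery code knows which object was in flight and can make sure it exists in exactly one place.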

That's one of the reasons I prefer to let the driver crash. If you engineer it so things bounce themselves, then you don't have as much incentive to actually FIX the bugs. If your players have been sending you email for 6 hours because their game is sitting in gdb, frozen on a seg-fault, I (at least) am more likely to sit down and fix the cause of the problem, rather than slapping a band-aid on and letting it run again.

That's a classic exploit which happens when you don't have atomic code. You go to move an object from the corpse to your inventory. The code copies the object to you, then deletes it from the corpse. If it crashes between the two operations (and the resulting hotboot code saves both inventories and reboots), you end up with two copies.

No no, someone looted the corpse, I did a hotboot, and the items respawned in both places. I didn't say anything crashed or an emergency hotboot happened.

Quixadhal said:

control is assigned about 13 lines up. It's probably the listening socket for the mud, which will be checked using either select() or poll() for incoming connections.

If that socket has already been closed, and you try to close it again, the universe will be unhappy. Likewise, if you try to close it and it hasn't yet been opened, life will be bad. C is annoying that way; the paranoid among us tend to check almost everything possible before accessing a pointer.

Heh. I'd tell you my solution but you explicitly said you didn't want to hear it already

Welcome to the hell of running dangerous code.

What exactly does that even mean? I never said anywhere I didn't want to hear anything. And if your comment is referring to emergency copyover, that's not the case. It works fine since I fixed it for hotboot, and it worked fine on copyover. The error I'm receiving has nothing to do with it, and if it does, then tell me so I can fix it instead of giving me cryptic messages. I don't see how switching to hotboot suddenly creates a new descriptor error that involves emergency copyover.

I dunno, maybe I'm pointing out the fact that you have been given the answer many times already and for some reason you seem unwilling to go forward with it or ask for clarifications on how to set breakpoints etc. Instead, you repeat your question over and over again, starting new threads too.

I guess that I'm not sure how to help you anymore. You don't seem willing to do what you need to do to solve your problem.

how switching to hotboot suddenly creates a new descriptor error that involves emergency copyover.

And that is indeed the question.

As much as we like to think that snippets of code are nice plug-and-play blocks you can put in and pull out at will, the DikuMUD codebase was never designed to be that modular, and the C programming language does nothing to make it any simpler to be modular. If you don't understand what the code does, it's twice as hard to figure out where it's going wrong.

My suggestion is to read through all the code that's involved in accepting connections, in closing them, and in performing the actual copyover, both setting it up before the exec, and recovering afterwards. You should also know exactly how signal handlers interact with each of these cases. Man pages are your friends.

If you understand the flow of the code, you'll know exactly where to set breakpoints and what values you are interested in observing when those breakpoints are hit. Otherwise, you're just throwing darts at a dartboard with the lights turned off.

The thing is that it would be extremely easy to get a start here if only we knew which descriptors were being put into the FD_SETs ... and the best way to do that is to have breakpoints at the point where the error message comes up, and examine which descriptors are being polled ... Otherwise, this is all just an exercise in futility. Needle and haystack, darts and darkness, whichever metaphor you like they all apply.
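Short of a breakpoint, a cheap way to see which descriptor is the bad one is to test each of them just before they go into the set, since select() fails with EBADF when any descriptor in a set is not open. A sketch under that assumption; `fd_is_open` and `report_bad_fds` are hypothetical names, and the array stands in for whatever descriptor list the mud walks.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Returns 1 if fd refers to an open descriptor, 0 otherwise. */
int fd_is_open(int fd)
{
    if (fd < 0)
        return 0;
    /* F_GETFD fails with EBADF when fd is not an open descriptor. */
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Log every descriptor in the list that is not open; return how many. */
int report_bad_fds(const int *fds, int count)
{
    int bad = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (!fd_is_open(fds[i])) {
            fprintf(stderr, "descriptor %d is not open\n", fds[i]);
            bad++;
        }
    }
    return bad;
}
```

This only says which descriptor is stale, not who closed it, so it narrows the search rather than replacing gdb.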

Sorry, DavidHaley. I looked at the guide on GDB that was linked to and it tells how to set breakpoints, so I didn't need to ask. I was given the answer once, not "many times". And I was clarifying by asking for a confirmation. I haven't gotten around to trying it yet between school and work, but when I do, I'll be sure to call you up and we can calm these waters. I'm unsure as to why you're so hostile anyway.

Quixadhal, thanks for the nice-mannered explanation. I'll try what's been suggested and report back. Much thanks.

Actually, I did suggest that you use gdb several times, and you were not very responsive. I suggested it one last time above, and you said nothing at all, instead asking another person, yet again, what you should do to solve your problem. I don't understand why you ask questions, are given suggestions, and then you neither follow up on those suggestions nor explain why they don't work for you, just to come back and ask questions again. If you didn't have time to try it, you could have just said so, but completely ignoring it was rather rude on your part given that I am spending time to help you.

Anyhow, it is pointless to continue until we have that extra information, so I guess all we can do now is wait for that.

EDIT: while I agree that I am frustrated, I don't see any hostility on my part. If you felt that repeating myself was such a hostile act, I apologize, but then again to me it looked like you were completely ignoring the suggestion, or perhaps simply missed it.

Heh. I'd tell you my solution but you explicitly said you didn't want to hear it already

Welcome to the hell of running dangerous code.

What exactly does that even mean? I never said anywhere I didn't want to hear anything. And if your comment is referring to emergency copyover, that's not the case. It works fine since I fixed it for hotboot, and it worked fine on copyover. The error I'm receiving has nothing to do with it, and if it does, then tell me so I can fix it instead of giving me cryptic messages. I don't see how switching to hotboot suddenly creates a new descriptor error that involves emergency copyover.

Well you said it with the emergency copyover thing. I suppose I just saw this thread and thought it was still the same topic. Personally I can't see how you'd hold that in isolation from switching "copyover" to "hotboot" ( the same thing! ) since that emergency handler is still going to invoke exactly the same process. You already know where I stand on such hackery in code and that I've been very active in discouraging people from using such things. But you said you didn't want that type of answer on the subject. Thing is, what you're running into could well be related, and, well, I can't help the feeling that the dangers involved have been ignored.

Just kinda figured I would go ahead and add my 1/2 cent about this thing. We used to use the emergency copyover code in our mud, and yes, for most things it worked out okay and just copied over through the segmentation violation, or seg fault. But in the instance where you are trying to save characters/objects/rooms, or any file for that matter, and then load from them, it's possible that the saving gets interrupted, and that causes a continuous crash/emergency copyover cycle, basically resulting in an infinite loop. The other problem was when the error was in a file that's loaded on startup: since not everything was loaded but we then tried saving things, we ran into many, many issues. These things happened to us at least a dozen times in the year and a half that we used the emergency copyover code. If this doesn't deter you, then I suggest you put in a few "failsafes", one being a check to make sure everything has been loaded before bothering to "resave" everything before the emergency copyover is called.
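A failsafe along those lines might be as small as two flags checked before the emergency save starts. This is a sketch with made-up names, not the emergency copyover snippet's own code, and it only guards a single process: the flags do not survive the exec, so stopping a crash/copyover loop across reboots would need a counter passed along or kept on disk.

```c
#include <signal.h>

/* Hypothetical flags. sig_atomic_t keeps them safe to touch from a
 * signal handler. */
static volatile sig_atomic_t boot_complete = 0;    /* all files loaded */
static volatile sig_atomic_t emergency_active = 0; /* recovery already running */

/* Call once at the very end of startup loading. */
void mark_boot_complete(void)
{
    boot_complete = 1;
}

/* Returns 1 if an emergency save may go ahead, 0 if it should be skipped.
 * The first successful call latches, so a second crash during the save
 * does not start another save on top of it. */
int may_emergency_save(void)
{
    if (!boot_complete)
        return 0;   /* world only partly loaded: saving would write junk */
    if (emergency_active)
        return 0;   /* already crashed once while recovering */
    emergency_active = 1;
    return 1;
}
```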

Well, basically, if you are crashing because objects get corrupted, then saving those and thinking you can safely and fully restore from them is essentially suicide as far as stability is concerned. At best, you will lose just the portion that got corrupted; at worst you can lose huge chunks of data if you start writing to files and introduce more and more inconsistencies due to the corrupted data.

Another way to help get around this is to never write to the file until you have the entire string ready to be written. This requires writing to an intermediate destination (i.e. a string in memory, or a temp. file on disk) and then overwriting the actual file only at the very end.
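The temp-file variant can be sketched like this, assuming a POSIX filesystem where rename() replaces the target in one step. `save_atomically` and the ".tmp" suffix are hypothetical choices for illustration.

```c
#include <stdio.h>

/* Write data to "<path>.tmp" first and only replace the real file once the
 * whole write has succeeded. Returns 0 on success, -1 on failure; on
 * failure the original file is left untouched. */
int save_atomically(const char *path, const char *data)
{
    char tmp[512];
    FILE *fp;
    int ok;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;              /* path too long for the buffer */

    fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;

    ok = fputs(data, fp) != EOF;
    if (fclose(fp) != 0)        /* fclose flushes; a failure means a short write */
        ok = 0;

    if (!ok) {
        remove(tmp);            /* discard the partial temp file */
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

If the mud dies halfway through the write, the worst case is a stray .tmp file next to an intact older save, rather than a truncated player file.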

But yes, emergency copyover is very dangerous. But like Samson pointed out it has already been said that Banner is totally ok with that and doesn't want to reconsider it, because for his purposes it is better than the alternative.