@jflippen PHP is blocking by its very nature, so it couldn’t continue moving forward if one part is “stuck”. (Unless of course it’s hitting the maximum execution timeout, which shouldn’t happen, but I don’t know everyone’s environment.)

@wayne-workman Thanks Wayne. Is it possible that the block is timing out if the file is taking too long on the remote server to create the md5sum, and that is why it keeps trying to replace the same file on each replication cycle? Is there a place I can change the timeout as a troubleshooting technique? (Or I might go with George’s technique and echo the two variables out to a log during that pass to check whether it is truly grabbing the $hashRem value.)
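For what it’s worth, the execution limit is easy to inspect and raise from inside a script while troubleshooting. This is a generic PHP sketch, not anything from FOG itself; note that CLI-run PHP defaults to no limit, while web-invoked scripts take max_execution_time from php.ini:

```php
<?php
// Generic troubleshooting sketch (not FOG code): inspect and raise the
// script execution limit. CLI PHP defaults to 0 (no limit); web-invoked
// scripts take max_execution_time from php.ini.

echo 'Current max_execution_time: ' . ini_get('max_execution_time') . "s\n";

set_time_limit(0);                  // 0 = no limit, affects this run only
ini_set('max_execution_time', '0'); // equivalent ini-based override
```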

@jflippen Just as an idea (first let me say I’m not a programmer): look through the code for an example of the replication agent writing to a log file, clone that, and place it in the correct location in the code to write both md5 hash values into the log. Once the FOG server has restarted, it should log that information into the replicator log file. I’ve had to do something similar in the past to reverse engineer some of the magic Tom does with his code.
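Something along these lines is what I have in mind; the helper name, log path, and $hashLoc are all made up for illustration ($hashRem is the variable already mentioned in this thread):

```php
<?php
// Hypothetical helper (not FOG's actual logger): call it right before the
// compare so both hash values end up in the replicator log.

function logHashes(string $path, string $hashLoc, string $hashRem): void
{
    $logfile = '/opt/fog/log/fogreplicator.log'; // adjust to your install
    $line    = sprintf(
        "[%s] compare %s: local=%s remote=%s\n",
        date('Y-m-d H:i:s'),
        $path,
        $hashLoc,
        $hashRem
    );
    file_put_contents($logfile, $line, FILE_APPEND);
}

// Example call at the point of comparison (assuming the file exists):
logHashes('/images/WIN10/d1p2.img', md5_file('/images/WIN10/d1p2.img'), 'abc123');
```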

@jgallo @Tom-Elliott
In another thread I was looking into the code a bit and Tom verified that the following code compares the files with one being hashed on one server and the other being hashed on the other.
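For anyone following along, the shape of that comparison is roughly this. An illustrative sketch only, not the actual FOG code; the gethash endpoint and hostname are my assumptions:

```php
<?php
// Illustrative sketch (not FOG's actual code): hash the local copy, ask
// the remote node for the hash of its copy, then compare the two.
// The gethash endpoint and hostname below are assumptions.

$path    = '/images/WIN10/d1p2.img';
$hashLoc = md5_file($path);

// Fetch the remote node's hash of the same file:
$hashRem = trim((string) file_get_contents(
    'http://remote-node/fog/status/gethash.php?file=' . urlencode($path)
));

if ($hashLoc !== $hashRem) {
    // On a mismatch the replicator would delete the remote copy and resend it.
    echo "Files do not match on server\n";
}
```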

Tom, is it possible that the function is checking for the $hashRem variable before the other server finishes its md5 hash, and therefore comes up with a mismatch? Still doesn’t explain why it won’t delete the file and replace it, though…

I got an undefined variable error in the replication log at the bottom. As you can see, replication is working, but it still states that the d1p2.img file doesn’t match. Also, the file on the storage node with the error was perfectly fine on an earlier check. I didn’t update any images today.

@jflippen I too updated and noticed it is no longer in a replication loop. Interestingly, as I observed the replication log go through all my storage nodes, the first storage node still has the same file that does not match, and so does one other storage node. So I’m waiting for the next round of replication to occur and will continue to observe. The update seems to fix the processes that @Tom-Elliott was talking about earlier, and replication now checks the other storage nodes as it goes down the line.

@hanz @Tom-Elliott I updated this morning and can confirm that my server will now check the other nodes after the first one in line has files to replace. However, I believe it is still having the issue where the log says the files are different even though the md5sum is the same (and it doesn’t actually delete said file when the log says it will, then transfer a new copy).

@tom-elliott You are the man, sir… The replication does indeed appear to be working as expected… I removed files from one node to confirm a complete transfer. The only question I have concerns the odd output in the Image Replicator log showing the following:

The files are already there and are not actually being deleted and resent as the log indicates… I only attached the logs from 2 nodes, but they all behave the same and produce the same output when an image is already present. This does seem to only affect the largest partition.

Mind updating again to the latest working version? I didn’t push a version change yet; I just pushed a quick fix that I hope will help out.

Essentially, I’m setting the variable for checking running processes. However, the variable was accidentally unset within its running scope, so I re-added it so it actually gets used. Maybe this is why the weirdness was happening? It was always checking the running process of the first item in the list, so while it did the checks on the other nodes, only the first node was being checked for a running process (hence why disabling that first node would allow the next node, and so forth, to start replicating properly).
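Not the real code, but a contrived PHP illustration of the bug class being described, where the tracking variable is never refreshed inside the loop, so every iteration tests the first node:

```php
<?php
// Contrived illustration (not FOG's actual code) of the scope bug:
// the process-check variable is only set once, so every loop iteration
// keeps checking the first node in the list.

$nodes     = ['Ashford', 'Ramage', 'NodeC'];
$procCheck = null;

foreach ($nodes as $node) {
    if ($procCheck === null) {
        $procCheck = $node; // bug: set on the first pass, never refreshed
    }
    // Fix: re-set the variable within its running scope each iteration:
    // $procCheck = $node;
    echo "Checking running process for: {$procCheck}\n"; // always 'Ashford'
}
```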

It’s a long shot, and I don’t have a means to test myself. Please just run a git pull and see if that helps out at all.

@tom-elliott I can verify through multiple attempts that the images are being replicated to the first storage node, but then the replication stops completely. On every other attempt, it seems to want to replicate the biggest (main) partition again, as you can see below in the log: d1p3 for UEFI images or d1p2 for MBR images.

Furthermore… after disabling the Ashford node it does in fact go to the second “Ramage” node, but it stops there and doesn’t try to replicate to the other nodes whatsoever… (I have 9 nodes altogether, including the main server.)

It doesn’t appear to be a permissions issue. I’m not seeing an entire image as the problem, usually just the largest file, but it’s inconsistent as to which image or storage node will report “Files do not match on server”.

@tom-elliott No worries, I was just throwing some observations out there. My thought process was that since we are dealing with replication and storage groups with various master servers, maybe somewhere down the line permissions are not being set properly, which would explain why the replication service doesn’t see the files as consistent. As I’m as lost as any other person with this issue, lol, I’m hoping that some observational input could at least steer you in some direction to troubleshoot.

I just uploaded an image to a location that has a storage group defined with two storage nodes in that group. Tailing the replication log, it eventually shows that files need to replicate. Replication occurs, but the log just sits there as if replication had completed to all the nodes. Eventually, once the replication service restarts per the sleep-time settings in FOG, replication starts fresh and eventually goes back to looping on the same files and the same storage node. Hope that helps.

@jgallo 755 for fog:root should be fine permissions to allow deletes. The 7 is the “user” (fog) permissions, which is read, write, and execute. The first 5 is the group permissions, here the root group, which is read and execute. The second 5 is “other”, which is also read and execute.
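To make the digit breakdown concrete, a tiny PHP illustration (the path is a placeholder; chmod and fileperms are standard PHP functions):

```php
<?php
// 0755 = user rwx (7), group r-x (5), other r-x (5).
// The directory path below is a placeholder.

$dir = '/images/WIN10';

chmod($dir, 0755); // owner fog keeps full control; group/other read+execute

// Read the mode back in octal to confirm:
echo substr(sprintf('%o', fileperms($dir)), -4), "\n"; // e.g. "0755"
```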

I’m not saying you didn’t know this already, just pointing out that the permissions being 777 or 755 should not matter, especially as capture doesn’t have a problem moving the files in place.

Of course any information might help lead to a more suitable solution, but this shouldn’t be a problem at all. It’s worrying, as I don’t know where to begin troubleshooting this issue, especially since the first replication process seems to work perfectly fine.

@ablohowiak Do you think this could be a permissions issue? I think someone mentioned that in a different post, but here is my observation, from the original upload of an image to uploading an updated version of that image:

permissions of folder on original image - 777 with user:group - fog:fog

I’m currently manually setting the permissions back to their original settings, as if it were a fresh upload, then I’m going to upload an updated version, re-check the permissions, and see whether any changes occurred. What I’m hoping for: with the image folders’ permissions restored to their original state, a replication loop won’t occur. If it does not occur, then I will go ahead and upload and continue to observe.

I’m not replicating the snapins just images. Things worked okay when the storage group was small, and I wasn’t replicating images across storage groups.

Now, as storage nodes are added, the images sometimes get there, but usually things get hung up on a node that won’t delete and re-copy an image. The process loops before making it through all the storage nodes.

I have noticed that it has not been deleting the files when the log says it will. I can manually delete them via FTP with the fog username/password. Also, the snapins update on the nodes just fine when I re-upload a snapin, so it seems to be an issue only with the images. I think what they are referring to is that, since they have multiple nodes, as do I, the image never replicates to the other nodes because it keeps finding a fault on the first one in line and never updating it.
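For reference, the manual FTP delete I’m doing looks roughly like this (standard PHP FTP functions; the host, password, and path are placeholders for your node):

```php
<?php
// Manual delete over FTP using the fog credentials (placeholders below).
$conn = ftp_connect('storage-node.example.com');
ftp_login($conn, 'fog', 'your-fog-password');

if (ftp_delete($conn, '/images/WIN10/d1p2.img')) {
    echo "Deleted over FTP, so the credentials and permissions look fine\n";
}

ftp_close($conn);
```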

I’ve been going through this thread and it seems nearly identical to what we are experiencing. I think the only difference is that that person had only one node, whereas @ablohowiak and I have multiple storage nodes across various storage groups. rsync worked like a champ, so rsync’ing between the master nodes and replicating to the storage groups works fine as well.