[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840412#action_12840412
]
Konstantin Shvachko commented on HDFS-955:
------------------------------------------
h3. The Problem
Our recovery logic for the IMAGE_NEW file was originally intended for checkpoint recovery,
and it works in that case. But it does not work for recovery from a saveFSImage() failure.
The storage directory may contain four files: IMAGE, EDITS, EDITS_NEW, and IMAGE_NEW.
Here are the steps we perform during checkpoint:
0. Initially storage directory has IMAGE and EDITS files only.
1. Start checkpoint. NN creates EDITS_NEW, and starts streaming edits into it.
2. Upload IMAGE_NEW from SNN to NN storage directory.
3. When upload is done, rename EDITS_NEW -> EDITS.
4. Rename IMAGE_NEW -> IMAGE. Back to the initial state.
Here is the time-line of which combination of files represents the _current_ state of the file
system relative to the events above.
IMAGE + EDITS --- (1) --- IMAGE + EDITS + EDITS_NEW --- (2) --- IMAGE + EDITS + EDITS_NEW
--- (3) --- IMAGE_NEW + EDITS --- (4) --- IMAGE + EDITS
The recovery procedure:
- If EDITS_NEW.exists, then we know NN failed after 1 or 2, but before 3, and our recovery
strategy is to discard IMAGE_NEW.
- If ! EDITS_NEW.exists && IMAGE_NEW.exists, then NN failed after 3, but before 4,
and we recover by upgrading IMAGE_NEW to IMAGE.
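
As a rough illustration of the two rules above, here is a minimal Python sketch. It is not the HDFS Java code: the storage directory is modelled as a plain set of file names, and the function name recover_checkpoint is made up for this example.

```python
def recover_checkpoint(files):
    """Apply the checkpoint recovery rule to a set of file names.

    `files` is a set such as {"IMAGE", "EDITS", "EDITS_NEW"}; a new set
    describing the directory after recovery is returned.
    """
    files = set(files)
    if "EDITS_NEW" in files:
        # NN failed after step 1 or 2 but before 3: discard IMAGE_NEW.
        files.discard("IMAGE_NEW")
    elif "IMAGE_NEW" in files:
        # NN failed after step 3 but before 4: upgrade IMAGE_NEW to IMAGE.
        files.discard("IMAGE_NEW")
        files.add("IMAGE")
    return files


# Crash during the upload (between steps 2 and 3).
after_upload_crash = recover_checkpoint({"IMAGE", "EDITS", "EDITS_NEW", "IMAGE_NEW"})
# Crash between steps 3 and 4.
after_rename_crash = recover_checkpoint({"IMAGE", "EDITS", "IMAGE_NEW"})
```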
Now let's see what happens when we save image during startup or saveNamespace.
Here are the steps we perform when we call saveFSImage():
0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW, which have
been all loaded and digested in NN RAM.
1. Create EDITS_NEW if missing.
2. Save IMAGE_NEW.
3. Empty EDITS and EDITS_NEW.
4. Rename EDITS_NEW -> EDITS.
5. Rename IMAGE_NEW -> IMAGE.
We use the same recovery procedure here as in checkpointing, which leads to a data loss in
the following failure scenario.
If we fail after 3 but before 4, then we will discard IMAGE_NEW, because EDITS_NEW.exists.
But the latest updates in EDITS and/or EDITS_NEW have already been wiped out, and we lose these
edits forever.
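
The failure scenario can be replayed with a small Python model. This is only a sketch under simplifying assumptions: each file is a list of operations held in a dict, the names old_save_fs_image, old_recover and load are hypothetical, and the sample operations are made up.

```python
def load(store):
    """Namespace the NN would reconstruct: IMAGE plus any edits logs."""
    return store["IMAGE"] + store.get("EDITS", []) + store.get("EDITS_NEW", [])


def old_save_fs_image(store, namespace, fail_after):
    """Run steps 1..fail_after of the current saveFSImage, then 'crash'."""
    def step1():
        store.setdefault("EDITS_NEW", [])            # create EDITS_NEW if missing

    def step2():
        store["IMAGE_NEW"] = list(namespace)         # save IMAGE_NEW

    def step3():
        store["EDITS"] = []                          # empty EDITS and EDITS_NEW
        store["EDITS_NEW"] = []

    def step4():
        store["EDITS"] = store.pop("EDITS_NEW")      # rename EDITS_NEW -> EDITS

    def step5():
        store["IMAGE"] = store.pop("IMAGE_NEW")      # rename IMAGE_NEW -> IMAGE

    for step in [step1, step2, step3, step4, step5][:fail_after]:
        step()


def old_recover(store):
    """The checkpoint recovery rule, reused for saveFSImage."""
    if "EDITS_NEW" in store:
        store.pop("IMAGE_NEW", None)
    elif "IMAGE_NEW" in store:
        store["IMAGE"] = store.pop("IMAGE_NEW")


store = {"IMAGE": ["mkdir /a"], "EDITS": ["mkdir /b"], "EDITS_NEW": ["mkdir /c"]}
namespace = load(store)                  # what the NN holds in RAM
old_save_fs_image(store, namespace, fail_after=3)
old_recover(store)                       # discards IMAGE_NEW because EDITS_NEW exists
lost = [op for op in namespace if op not in load(store)]
```

In this model `lost` ends up holding the operations that lived only in EDITS and EDITS_NEW, which is the data loss described above.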
The main reason the checkpointing logic does not work for saving is that IMAGE_NEW has different
semantics in these two cases.
- In checkpoint IMAGE_NEW = IMAGE + EDITS
- In saveFSImage IMAGE_NEW = IMAGE + EDITS + EDITS_NEW
h3. The Solution
Different images should be represented by separate files and treated differently. I'll denote
them
- IMAGE_CKPT = IMAGE + EDITS the checkpoint image
- IMAGE_LAST = IMAGE + EDITS + EDITS_NEW the last saved image
So the checkpoint process will create IMAGE_CKPT and will work with it exactly as before,
no changes here.
saveFSImage will save NN's memory state into IMAGE_LAST, and should consist of the following
steps:
0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW and IMAGE_CKPT.
1. Save image into IMAGE_LAST.
2. Remove EDITS, IMAGE_CKPT, and EDITS_NEW - in the order listed.
3. Rename IMAGE_LAST -> IMAGE.
4. Create empty EDITS.
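
For illustration only, the four steps could be sketched against a real directory as follows. This is Python rather than the NN's Java code; the function name save_fs_image and the byte-string image are assumptions, and a real implementation would also need to sync files and handle I/O errors.

```python
import os
import tempfile
from pathlib import Path


def save_fs_image(storage_dir, image_bytes):
    """Sketch of the proposed saveFSImage steps on one storage directory."""
    storage_dir = Path(storage_dir)
    image_last = storage_dir / "IMAGE_LAST"

    # 1. Save image into IMAGE_LAST.
    image_last.write_bytes(image_bytes)

    # 2. Remove EDITS, IMAGE_CKPT, and EDITS_NEW, in the order listed.
    for name in ("EDITS", "IMAGE_CKPT", "EDITS_NEW"):
        (storage_dir / name).unlink(missing_ok=True)

    # 3. Rename IMAGE_LAST -> IMAGE.
    os.replace(image_last, storage_dir / "IMAGE")

    # 4. Create empty EDITS.
    (storage_dir / "EDITS").write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    for name in ("IMAGE", "EDITS", "EDITS_NEW", "IMAGE_CKPT"):
        (d / name).write_bytes(b"old")
    save_fs_image(d, b"namespace")
    remaining = sorted(p.name for p in d.iterdir())
    image = (d / "IMAGE").read_bytes()
    edits = (d / "EDITS").read_bytes()
```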
It is important to note that checkpoint cannot start once saveFSImage started, because NN
is in safe mode, and because it holds the NN lock. If the upload of IMAGE_CKPT has started
(stage c-2) it will proceed concurrently with the save. But rollEdits() (stage c-3) will fail
if called during saveFSImage.
Here is the time-line of which combination of files represents the _current_ state of the file
system relative to the events above.
IMAGE + EDITS + EDITS_NEW --- (1) --- IMAGE + EDITS + EDITS_NEW --- (2) --- IMAGE_LAST ---
(3) --- IMAGE --- (4) --- IMAGE + EDITS
The recovery procedure for saving image is:
- If EDITS.exists && IMAGE_LAST.exists, then we know NN failed after 1 but before
2, and we recover by discarding IMAGE_LAST.
- If ! EDITS.exists && IMAGE_LAST.exists, then NN failed during or after 2, and we
recover by applying 2, 3, and 4.
- If ! EDITS.exists && ! IMAGE_LAST.exists, then NN failed after 3, and we recover
by applying 4.
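
One way to sanity-check these three rules is to crash the save at every possible point in a small model and confirm that recovery always restores the full namespace. The Python sketch below does that under stated assumptions: files are lists of operations in a dict, step 2 is split into its three removals, and the names save_steps, recover_save and load are hypothetical.

```python
def load(store):
    """Namespace the NN would reconstruct: IMAGE plus any edits logs."""
    return store["IMAGE"] + store.get("EDITS", []) + store.get("EDITS_NEW", [])


def save_steps(store, namespace):
    """The proposed saveFSImage as a list of atomic operations on `store`."""
    return [
        lambda: store.__setitem__("IMAGE_LAST", list(namespace)),     # step 1
        lambda: store.pop("EDITS", None),                             # step 2
        lambda: store.pop("IMAGE_CKPT", None),                        # step 2
        lambda: store.pop("EDITS_NEW", None),                         # step 2
        lambda: store.__setitem__("IMAGE", store.pop("IMAGE_LAST")),  # step 3
        lambda: store.__setitem__("EDITS", []),                       # step 4
    ]


def recover_save(store):
    """Recovery procedure for saving image, as listed above."""
    has_edits = "EDITS" in store
    has_last = "IMAGE_LAST" in store
    if has_edits and has_last:
        # Failed after 1 but before 2: discard IMAGE_LAST.
        del store["IMAGE_LAST"]
    elif has_last:
        # Failed during or after 2: apply 2, 3, and 4.
        for name in ("EDITS", "IMAGE_CKPT", "EDITS_NEW"):
            store.pop(name, None)
        store["IMAGE"] = store.pop("IMAGE_LAST")
        store["EDITS"] = []
    elif not has_edits:
        # Failed after 3: apply 4.
        store["EDITS"] = []


results = []
for crash_after in range(7):   # 0 = crash before step 1, 6 = save completed
    store = {
        "IMAGE": ["mkdir /a"],
        "EDITS": ["mkdir /b"],
        "EDITS_NEW": ["mkdir /c"],
        "IMAGE_CKPT": ["mkdir /a", "mkdir /b"],
    }
    namespace = load(store)
    for op in save_steps(store, namespace)[:crash_after]:
        op()
    recover_save(store)
    results.append(load(store) == namespace)
```

In this model every entry of `results` is true, i.e. no crash point loses edits. The model does not cover a checkpoint upload running concurrently with the save.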
There is a slight complication for WinFS. It will not let us remove IMAGE_CKPT at stage 2
if the checkpointer is still writing into it. In this case we will ignore the failure, and
quit the procedure, delaying the rest of the steps for the future. The correct state (rename
IMAGE_LAST to IMAGE) will be restored either when checkpoint finishes or if the NN restarts
due to a failure.
> FSImage.saveFSImage can lose edits
> ----------------------------------
>
> Key: HDFS-955
> URL: https://issues.apache.org/jira/browse/HDFS-955
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.20.1, 0.21.0, 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
> Attachments: hdfs-955-moretests.txt, hdfs-955-unittest.txt, PurgeEditsBeforeImageSave.patch
>
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage function
> (implementing dfsadmin -saveNamespace) can corrupt the NN storage such that all current edits
> are lost.