[HBase-user] Major compaction cannot remove deleted rows until the region is split. Strange!

Tianying Chang:

Hi,

We saw a very strange case in one of our production clusters. A couple of regions cannot get their deleted rows or delete markers removed, even after major compaction. However, when the region triggered a split (we set 100G for auto split), the deletion worked: the 100G region became two 10G daughter regions, and all the delete markers were gone.

Also, the same region in the slave cluster (through replication) has a normal size, at about 20+G.

BTW, the delete markers in the regions are mostly deleteFamily, if it matters.

This is really weird. Does anyone have any clue about this strange behavior?

Thanks
Tian-Ying
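(For readers who want to check for themselves: delete markers are invisible to a normal scan, but a raw scan returns them. Below is a minimal sketch against the 0.94-era Java client; the table name "mytable" is a placeholder.)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class RawScanDeletes {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");
        Scan scan = new Scan();
        scan.setRaw(true);      // also return delete markers and deleted cells
        scan.setMaxVersions();  // all versions, not just the newest
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          System.out.println(r);  // KeyValue output includes the type, e.g. DeleteFamily
        }
        scanner.close();
        table.close();
      }
    }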


Frank Luo:

What if you manually trigger a major compaction on that particular region? Does it run, and are the delete markers removed?
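(A hedged sketch of issuing that request through the 0.94 Java client; the region name below is a placeholder, and the request is asynchronous, so completion shows up in the region server logs rather than in the caller. The hbase shell major_compact command is equivalent.)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ForceMajorCompact {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        // Takes a table name or a single region name (placeholder here).
        admin.majorCompact("usertable,somerowkey,1464390000000.d1f0beef");
        admin.close();
      }
    }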


Tianying Chang:

Yes, I tried manually triggering a major compaction on the problematic region, and the result was the same: the region did not get any deleted rows removed, so its size stayed at 90+G. I also verified that after a split, major compaction works as expected, i.e. the region becomes two 10+G regions.

Thanks
Tian-Ying

On Fri, May 27, 2016 at 2:54 PM, Frank Luo wrote:

What if you manually trigger a major compaction on that particular region? Does it run, and are the delete markers removed?


Michael Stack:

Is this 0.94, Tian-Ying?

It looks like only the DeleteFamily is retained; do you see instances where there may have been versions older than the DeleteFamily that are also retained post-major-compaction?

Tianying Chang:

Yes, it is 94.26. At a quick glance, I didn't see any put older than the delete marker's TS, which could go back as far as a couple of weeks, since it seems major compaction had not worked on the region for a long time.

Also, it is really strange that if the region is split, everything seems to work as expected. We also noticed that the same region replicated on the slave side is totally normal, i.e. at 20+G...

On Fri, May 27, 2016 at 3:13 PM, Stack wrote:


Is this 0.94, Tian-Ying?

It looks like only the DeleteFamily is retained; do you see instances where there may have been versions older than the DeleteFamily that are also retained post-major-compaction?

Michael Stack:

If you move the region to another server, does that work?

Looking in the 0.94 codebase, I see this in Compactor#compact:

    // For major compactions calculate the earliest put timestamp
    // of all involved storefiles. This is used to remove
    // family delete marker during the compaction.
    if (majorCompaction) {
      tmp = fileInfo.get(StoreFile.EARLIEST_PUT_TS);
      if (tmp == null) {
        // There's a file with no information, must be an old one
        // assume we have very old puts
        earliestPutTs = HConstants.OLDEST_TIMESTAMP;
      } else {
        earliestPutTs = Math.min(earliestPutTs, Bytes.toLong(tmp));
      }
    }

The above is followed by this log line:

    if (LOG.isDebugEnabled()) {
      LOG.debug("Compacting " + file +
          ", keycount=" + keyCount +
          ", bloomtype=" + r.getBloomFilterType().toString() +
          ", size=" + StringUtils.humanReadableInt(r.length()) +
          ", encoding=" + r.getHFileReader().getEncodingOnDisk() +
          (majorCompaction ? ", earliestPutTs=" + earliestPutTs : ""));
    }

This prints out earliestPutTs. Do you see that in the logs? Are you running with DEBUG? Does the earliest put ts preclude our dropping the DeleteFamily?

Looking more in the code, we retain deletes in the following circumstances:

    this.retainDeletesInOutput = scanType == ScanType.MINOR_COMPACT || scan.isRaw();

So, for sure we are running a major compaction?

Otherwise, I have to dig in a bit more here... This stuff is a little involved.

St.Ack

Tianying Chang:

Hi Stack,

Based on the log, the major compaction was run, and it took 5+ hours. I also manually ran major_compact from the hbase shell explicitly to verify.

I just moved the region to a different RS and issued a major_compact on that region again; let me see if the major compaction succeeds, and I will report back.

Thanks
Tian-Ying

On Sun, May 29, 2016 at 4:35 PM, Stack wrote:


If you move the region to another server, does that work?


Tianying Chang:

After moving the region and issuing a major compaction on it, its size shrank from 99G down to 24G. So it looks like the region was in a bad state it could not recover from, and a close/open fixed the issue. From the region size metric graph, we can see that major compaction stopped working on March 31, so some bug caused the region to enter the bad state... Unfortunately, we don't have DEBUG enabled, and that was the last region with the issue, so it is hard to figure out which bug caused the bad state...

Thanks
Tian-Ying


Michael Stack:

Interesting. So moving it to another RS makes it major-compactable? That would seem to indicate that some state kept in RS memory is preventing the major compaction from running. Is moving the region a workaround for you until we figure out what it is, Tian-Ying?

St.
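(A hedged sketch of that workaround via the 0.94 client API; the encoded region name and destination server name are placeholders. A move closes the region on its current region server and reopens it elsewhere, which discards any per-region state held in RS memory.)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MoveRegion {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        admin.move(Bytes.toBytes("d1f0beef0123456789abcdef01234567"),     // encoded region name (placeholder)
                   Bytes.toBytes("rs2.example.com,60020,1464300000000")); // destination server (placeholder)
        admin.close();
      }
    }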


Hao zhang:

Hi Stack,

We found this in our production cluster too. I took a look at the code and found one case which makes major compaction not work (see the sketch after this message):

1. Put some rows.
2. Scan.
3. Delete.
4. The scanner isn't closed correctly, so its read point stays in the region's scannerReadPoints.
5. Major compact. It can't work, because we have a scanner with a smaller read point than the delete.

But if the region is moved to a new RS, scannerReadPoints is rebuilt from the biggest memstoreTs in the store files, so major compaction will work. I think the try...catch block in Scanner.leaseExpired() needs a finally block to close the region scanner correctly.
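(An illustration of the mechanism described above, with assumed names rather than actual HBase code: the region keeps one read point per open scanner, and compaction may only drop a delete whose MVCC write number is below the smallest of them, so a single leaked scanner pins every later delete.)

    // Illustration only (assumed names): one leaked entry in
    // scannerReadPoints holds back delete cleanup for the whole region.
    long smallestReadPoint = Long.MAX_VALUE;
    for (long readPoint : scannerReadPoints.values()) {
      smallestReadPoint = Math.min(smallestReadPoint, readPoint);
    }
    // A delete marker written at mvcc >= smallestReadPoint must be kept so
    // the (possibly dead) scanner still sees a consistent view; a scanner
    // leaked at read point P therefore pins every delete written after P.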


Anoop John:

So you feel the call s.close() would have created some exception, and the removal of the entry from scannerReadPoints would not have happened. Hmm... sounds possible.

    public synchronized void close() {
      if (storeHeap != null) {
        storeHeap.close();
        storeHeap = null;
      }
      if (joinedHeap != null) {
        joinedHeap.close();
        joinedHeap = null;
      }
      // no need to synchronize here.
      scannerReadPoints.remove(this);
      this.filterClosed = true;
    }

We may need to change this close impl and make sure the removal from scannerReadPoints happens in a finally block.

-Anoop-
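(A minimal sketch of the change Anoop describes; this is the assumed shape of such a fix, not a committed patch.)

    public synchronized void close() {
      try {
        if (storeHeap != null) {
          storeHeap.close();
          storeHeap = null;
        }
        if (joinedHeap != null) {
          joinedHeap.close();
          joinedHeap = null;
        }
      } finally {
        // Even if closing a heap throws, drop this scanner's read point so
        // the region's smallest read point can advance and major compaction
        // can remove deletes again.
        scannerReadPoints.remove(this);
        this.filterClosed = true;
      }
    }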


We added some logging in our production cluster:

    2016-06-06,21:37:28,443 INFO org.apache.hadoop.hbase.regionserver.HRegion: scannerReadPoints size is 0 and smallest read point is 4037995
    2016-06-06,21:37:51,429 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed openScanner java.io.IOException: Could not seek StoreFileScanner
    2016-06-06,21:38:02,103 INFO org.apache.hadoop.hbase.regionserver.HRegion: scannerReadPoints size is 1 and smallest read point is 4038037
    2016-06-13,17:17:10,516 INFO org.apache.hadoop.hbase.regionserver.HRegion: scannerReadPoints size is 1 and smallest read point is 4038037

When a new RegionScanner is created, it adds a scanner read point to scannerReadPoints. But if we get an exception after adding the read point, the read point stays in the region server, and deletes after this MVCC number will never be compacted away. I have created an issue about this: https://issues.apache.org/jira/browse/HBASE-16012
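(A hedged sketch of the failure mode those logs show, using assumed names; this is not the actual HBASE-16012 patch. The read point is registered before the store scanners are opened, so a failure like "Could not seek StoreFileScanner" has to roll the registration back.)

    // Sketch with assumed names, not the actual patch: register the read
    // point, then roll it back if scanner construction fails.
    scannerReadPoints.put(this, this.readPt);
    try {
      initializeStoreScanners(scan);  // hypothetical initializer; may throw IOException
    } catch (IOException e) {
      // Without this rollback, the dead scanner's read point pins the
      // region's smallest read point forever, as in the logs above.
      scannerReadPoints.remove(this);
      throw e;
    }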

2016-06-10 3:08 GMT+08:00 Stack <stack@duboce.net>:

Tian-ying:

Can you try the suggestion below to see if it helps?

Do you see the lease expired logs?

St.Ack
