Re: machine won't start

From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Wed, 4 Jul 2012 11:34:09 -0700 (PDT)

:There was an interesting debate a couple of days/weeks ago on OpenBSD
:about support for disks larger than 2TB. It turned out that they can
:be used just fine without GPT, but multiboot capability is mostly lost
:as the job is done in disklabel (their fdisk can't do that)
:http://marc.info/?l=openbsd-misc&m=133857397722515&w=2
:
What we do in our 'fdisk -IB' formatting sequence is cap the LBA
slice values at all-1's and CHS values at 1023/255/63 (I think that
is all-1's too). We do not wrap the CHS or LBA values as that
creates massive edge cases when the size of the disk sometimes
just barely wraps and makes the BIOS think the disk is really tiny
instead (this has happened to me!).
The DragonFly OS then detects that a slice is using capped values
and silently uses the HD-reported values instead. Or, more to the
point, the DragonFly disklabel code detects the situation and
properly allows the disklabel to be sized to the actual media size
instead of restricting it to the capped LBA values for the slice.
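To make the cap-versus-wrap distinction concrete, here is a minimal sketch in Python. The helper names are hypothetical and this is not the actual fdisk or disklabel code; it only assumes that the slice size field is 32 bits wide and counts 512-byte sectors, and it ignores the slice start offset for simplicity.

```python
LBA_MAX = 0xFFFFFFFF  # all-1's in a 32-bit slice table field

def wrapped_size(sectors):
    """What a wrapping fdisk would store: only the low 32 bits."""
    return sectors & LBA_MAX

def capped_size(sectors):
    """What a capping fdisk stores: the real value, or all-1's if too big."""
    return min(sectors, LBA_MAX)

def effective_size(slice_size, media_sectors):
    """Sketch of the detection step: a capped field means 'trust the media'."""
    return media_sectors if slice_size == LBA_MAX else slice_size

# Hypothetical disk that is just barely over 2TB (2^32 sectors plus 1000).
media = 2**32 + 1000
print(wrapped_size(media))                        # 1000
print(capped_size(media))                         # 4294967295
print(effective_size(capped_size(media), media))  # 4294968296
```

The wrapped value is the failure case described above: a disk that just barely wraps ends up looking like it has only 1000 sectors.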
But, as you can see, the results are mixed. Even though capping the
LBA values instead of wrapping them is the officially-supported
methodology, some BIOSes can't handle it. Fortunately nearly all
BIOSes that would otherwise barf on the situation do allow you to go
in via the BIOS setup and manually set the access mode to LBA or
LARGE yourself.
--
BIOS issues are also the reason why most fdisk implementations use
such weird CHS values for the bootable slice.
    sysid 165,(DragonFly/FreeBSD/NetBSD/386BSD)
        start 63, size 78156225 (38162 Meg), flag 80 (active)
            beg: cyl 0/ head 1/ sector 1;
            end: cyl 1023/ head 255/ sector 63
(NOTE the 'start 63': sector numbers start at '1', not '0', for
fdisk reporting... blame Intel.)
fdisk sector numbers start at 1, so the slice start of '63' winds
up being only 512-byte aligned. The reason the start is weird
like this is that the maximum sectors/track is 63, and many
BIOSes (again, their old decrepit CHS probing) blow up if the
slice is not on a cylinder boundary. A lot of BIOSes also blow
up if sectors/track is not set to 63, so we lose no matter what we
do.
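The numbers in that fdisk output can be checked with a little arithmetic. A minimal sketch in Python, assuming the conventional 255-head, 63-sectors/track translation geometry and 512-byte sectors (the function name is made up for illustration):

```python
HEADS = 255
SECTORS_PER_TRACK = 63
SECTOR_SIZE = 512

def chs_to_lba(cyl, head, sector):
    """Standard CHS-to-LBA conversion; CHS sector numbers start at 1."""
    return (cyl * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)

start = chs_to_lba(0, 1, 1)        # the 'beg' values of the slice
byte_offset = start * SECTOR_SIZE

print(start)                # 63
print(byte_offset)          # 32256, i.e. 32K minus 512 bytes
print(byte_offset % 512)    # 0: 512-byte aligned
print(byte_offset % 4096)   # 3584: not 4K aligned
```

So a slice that begins at the first sector of head 1 lands 512 bytes short of a 32K boundary on the media.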
--
Newer advanced-format drives with 4K sector sizes are instantly
inefficient when the resulting filesystems, even if they are
aligned relative to the slice, wind up not being aligned relative
to the physical media. This forces the HD itself to issue a
read-before-write when handling pure media writes from the
filesystem, resulting in very poor performance.
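A rough way to see the cost, as a Python sketch rather than a model of any particular drive's firmware: count how many 4K physical sectors a write covers only partially, since each of those has to be read, merged and written back. The function name and the example offsets are illustrative assumptions.

```python
PHYS = 4096  # physical sector size of an advanced-format drive

def partial_sectors(offset, length, phys=PHYS):
    """Count physical sectors that a write covers only partially."""
    end = offset + length
    first = offset // phys
    last = (end - 1) // phys
    partial = 0
    for s in range(first, last + 1):
        s_start, s_end = s * phys, (s + 1) * phys
        # A sector is partial if the write starts after its beginning
        # or stops before its end.
        if offset > s_start or end < s_end:
            partial += 1
    return partial

# A 4K filesystem block at the start of a slice that begins at sector 63:
print(partial_sectors(63 * 512, 4096))   # 2
# The same block when the media offset is a multiple of 4K:
print(partial_sectors(64 * 512, 4096))   # 0
```

With the 63-sector slice start, every 4K block straddles two physical sectors, so a write that should touch one sector forces read-before-write on two.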
Some disk manufacturers (namely Seagate) apparently tried detecting
filesystem alignment on the fly but it just created an immense
mess (including w/GPT compatibility slices). I think most disk
drive manufacturers are finally settling into requiring media
accesses to be physically aligned if the consumer wishes the
accesses to be efficient.
Again, this applies only to advanced-format drives, but once the
kinks are worked out by BIOS manufacturers I expect a lot of HD
vendors will move most of their lines to advanced-format (4K
physical sectors), because 4K physical sectors allow them to put an
inter-sector gap back in and because it boosts linear transfer
rates by 30-50%.
In any case, DragonFly solves the alignment issue in its disklabel64
partition format (which has been our default for a few years now)
by detecting that the slice table is mis-aligned and correcting
for it in the disklabel. Plus disklabel64 uses a very large
initial alignment... not just 4K. It's more like ~1MB.
    # data space:   39077083 blocks # 38161.21 MB (40014933504 bytes)
    #
    # NOTE: If the partition data base looks odd it may be
    #       physically aligned instead of slice-aligned
    #
    diskid: 7f45e4eb-9af2-11e1-a2f9-01012e2fd933
    label:
    boot2 data base:      0x000000001000
    partitions data base: 0x000000100200
    partitions data stop: 0x000951237000
    backup label:         0x000951237000
    total size:           0x000951238200    # 38162.22 MB
    alignment: 4096
    display block size: 1024        # for partition display only

    16 partitions:
    #          size     offset    fstype   fsuuid
      a:    1048576          0    4.2BSD    #    1024.000MB
      b:   16777216    1048576      swap    #   16384.000MB
      d:   21251288   17825792    HAMMER    #   20753.211MB
      a-stor_uuid: a5cff4d1-9af2-11e1-a2f9-01012e2fd933
      b-stor_uuid: a5cff4e0-9af2-11e1-a2f9-01012e2fd933
      d-stor_uuid: ac45d623-9af2-11e1-a2f9-01012e2fd933
The 'partitions data base' in the DragonFly disklabel64 format
is using an offset of 0x100200, which is 1MB+512 bytes. The
extra 512 bytes is correcting for the unaligned fdisk slice
the partition is sitting in (which the partition code probes
dynamically). The 1MB is geared towards LVM partitioning for
future soft-RAID setups. LVM tends to want very large alignments
which it then cuts down as you set up soft-RAID configurations
within it.
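The 0x100200 figure can be reproduced with a short calculation. This is only a sketch of the idea in Python, not the actual disklabel64 code; the helper name is hypothetical, and the inputs are the 63-sector slice start from the fdisk output and the 4096-byte alignment from the label.

```python
SECTOR = 512
ALIGN = 4096                 # 'alignment' field from the label
SLICE_START = 63 * SECTOR    # slice offset on the media, per fdisk
NOMINAL_BASE = 0x100000      # the ~1MB initial alignment

def aligned_base(slice_start, base, align):
    """Smallest in-slice offset >= base whose media offset is aligned.

    Pads the base so that slice_start + result is a multiple of align.
    """
    pad = -(slice_start + base) % align
    return base + pad

base = aligned_base(SLICE_START, NOMINAL_BASE, ALIGN)
print(hex(base))                       # 0x100200
print((SLICE_START + base) % 4096)     # 0
print((SLICE_START + base) % 32768)    # 0
```

For this particular slice the padded base also happens to land on a 32K boundary of the physical media, because 63 sectors plus one more sector is exactly 32K.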
So hard drives care about reasonable alignment, where ~32K is
usually plenty good enough, and lvm/dm cares about larger
partitioning alignments, where ~1MB is usually plenty good enough
for that purpose.
The better alignments probably also help SSDs though it will
depend on the firmware. SSDs tend to be more dynamic in the
way they handle write-combining, but I think the 32K base media
alignment still helps (i.e. the first slice is offset by 63
sectors and we add one more sector, the 1MB being irrelevant),
so a physical device sees a ~32K base alignment for most I/O
operations. In any case, SSDs have better write-combining
algorithms anyway but might still react better to a ~32K base
alignment than to a ~32K-512 bytes base alignment.
-Matt
Matthew Dillon
<dillon@backplane.com>