Topic: Problems booting ArcaOS

This is a problem I have been trying to sort for a few weeks but without success.

When the system has been powered off for a while, e.g. overnight, ArcaOS often does not boot fully at the first attempt - it can take 3 or 4 attempts. If "Loading OS2DASD.DMD" is displayed for longer than the usual flicker, the system then stops booting with "Loading OS2LVM.DMD" displayed. This behaviour started around 6 weeks ago with the occasional failure on the first cold boot and has gradually got to the stage where booting ArcaOS can now take up to 4 attempts on most cold boots. Sometimes, just to add to my confusion, ArcaOS boots on the first attempt with no problem.

I also have eCS2.1 and 2.2beta2 installed on this system. There is no problem cold booting either of those installations, both boot at first attempt.

The config.sys files for eCS2.1, eCS2.2b2 and ArcaOS are identical - except for the drive letter involved - with regard to DEVICE and BASEDEV drivers, and the driver files used are the same on all 3 systems.
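For anyone wanting to double-check that kind of claim, here is a minimal Python sketch that compares only the DEVICE and BASEDEV statements of two config.sys files while ignoring the drive letter. The function names and the sample statements are made up for illustration; it assumes each driver statement sits on one line and that a drive letter is always followed by a backslash.

```python
import re

# Lines that start with DEVICE= or BASEDEV= (REM'd lines do not match the anchor).
DRIVER_LINE = re.compile(r"^\s*(BASEDEV|DEVICE)\s*=\s*(.+?)\s*$", re.IGNORECASE)
# A drive letter such as C: directly followed by a backslash; switches like /A:0 are left alone.
DRIVE_LETTER = re.compile(r"(?<![\w/])[A-Za-z]:(?=\\)")


def driver_lines(config_text):
    """Return the set of normalised DEVICE/BASEDEV statements in a config.sys text."""
    found = set()
    for line in config_text.splitlines():
        match = DRIVER_LINE.match(line)
        if not match:
            continue
        keyword = match.group(1).upper()
        value = DRIVE_LETTER.sub("?:", match.group(2)).upper()
        value = " ".join(value.split())  # collapse runs of whitespace
        found.add(f"{keyword}={value}")
    return found


def compare(config_a, config_b):
    """Return (statements only in A, statements only in B), each sorted."""
    a, b = driver_lines(config_a), driver_lines(config_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Hypothetical fragments, not taken from the poster's system.
    ecs = "BASEDEV=OS2AHCI.ADD\nDEVICE=C:\\OS2\\BOOT\\TESTCFG.SYS\n"
    arca = "BASEDEV=OS2AHCI.ADD\nDEVICE=E:\\OS2\\BOOT\\TESTCFG.SYS\n"
    print(compare(ecs, arca))
```

Two empty lists mean the driver statements match apart from the drive letter. It says nothing about whether the driver files themselves are the same build.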

If the problem affected all 3 installations then I would suspect either hardware or a flaky driver but the problem only seems to affect ArcaOS - it has not shown up when cold booting either of the eCS installations.

Could your hard drive be developing a bad spot where ArcaOS is installed?

I am very much afraid this does indeed look like what Dave is suggesting, a bad sector on your hard disk... Boot from the ArcaOS DVD and run CHKDSK on the boot volume to see if that does any good. But if I were you I would make a backup of any data.

Download the hard disk manufacturer's disk test utilities and run them against the HD. If there is a HD problem this should indicate what and where it is - at least the various disk tools that I have do.

If it is a sector going bad you should be able to move the data on that sector to a spare sector and mark the sector as bad. I just wish we had a SMART monitor that worked with ArcaOS because the SMART output would also indicate problems - for example, the number of relocated sectors.

Edit to add: since you have eCS on the same disk, the SmartMon from that should work, assuming you have SMART enabled for the disk.
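To show what "the SMART output would indicate problems" can look like in practice, here is a rough Python sketch that reads an attribute table and reports on the sector-related attributes. It assumes the text is in the layout printed by smartmontools' `smartctl -A` (ten columns, raw value last); the sample table is invented for illustration, and the choice of attributes 5, 197 and 198 as the ones to watch is my assumption, not something from the posts above.

```python
# Attribute IDs to watch (an assumption for this sketch).
WATCHED = {
    5: "reallocated sectors",
    197: "pending sectors",
    198: "offline uncorrectable sectors",
}


def parse_smart_attributes(text):
    """Parse a smartctl -A style table into {id: (name, raw_value_string)}."""
    attrs = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            in_table = False  # a blank line ends the table
            continue
        if fields[0] == "ID#":
            in_table = True  # header row; data rows follow
            continue
        if in_table and len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def summarise(attrs):
    """Return one status line per watched attribute."""
    lines = []
    for attr_id, meaning in WATCHED.items():
        if attr_id not in attrs:
            lines.append(f"{attr_id} ({meaning}): not reported by this drive")
            continue
        raw = attrs[attr_id][1]
        first = raw.split()[0]
        if not first.isdigit():
            status = "raw value is not a plain number"
        elif int(first) > 0:
            status = "check the disk"
        else:
            status = "ok"
        lines.append(f"{attr_id} ({meaning}): raw={raw} -> {status}")
    return lines


# Invented sample, only to show the expected layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
"""

if __name__ == "__main__":
    for line in summarise(parse_smart_attributes(SAMPLE)):
        print(line)
```

A drive that does not implement one of these attributes simply shows up as "not reported", which is not in itself a sign of a fault.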

I doubt it has something to do with a broken hard disk. I am seeing the same problem here, but it happens only occasionally. I suspect it is yet another SMP issue, as I never see this problem with only one core (which you could try as a test).

Not sure what the above line "can't monitor Offline_Uncorrectable count - no Attribute 198" is telling me...

Being a bit "naughty" I have installed the ArcaOS kernel (14.201) and acpi (3.23.07) files in use to the eCS installations for testing to see if these ArcaOS only components have any bearing on the problem. This will take a few weeks...

I also see that I am still using the eCS version of os2ahci.add on my eCS installations rather than the ArcaOS version. However, they are both v1.32...

Attribute 198 is the Uncorrectable Sector Count (which on a spinning rust disk is saying there are no more spare sectors to recover data to).

198 Uncorrectable Sector Count

The total number of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem. (or Off-Line Scan Uncorrectable Sector Count - Fujitsu)
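Since it is the rise in the value that matters rather than a single reading, one way to use this is to note the raw value at intervals and look for the first increase. A minimal Python sketch, assuming you have collected the readings yourself as (label, raw value) pairs; the labels and numbers below are invented:

```python
def first_rise(readings):
    """Return (previous, current) for the first increase in raw value, or None.

    readings is a list of (label, raw_value) pairs in chronological order.
    """
    for previous, current in zip(readings, readings[1:]):
        if current[1] > previous[1]:
            return previous, current
    return None


if __name__ == "__main__":
    # Hypothetical log of attribute 198 raw values.
    log = [("week 1", 0), ("week 2", 0), ("week 3", 3)]
    print(first_rise(log))
```

A result of None means the value never went up across the readings supplied; it does not prove the disk is healthy.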

I think you have hit the problem of SSDs on our platform - no way of controlling the wear levelling that is necessary to prevent overuse of some cells.

For some reason I thought that current/recent SSDs handled wear levelling themselves rather than requiring outside intervention but I guess I misread something, somewhere.

If that is not the case then it could well be the problem I am seeing. It would explain why it affects ArcaOS which is my usual bootup option and not the eCS installations which rarely get booted these days.

Yet another reason to change to an operating system that has the necessary tools. Trouble is I have not yet found an alternative OS that I like as much as OS/2 based systems...

I think that all the SSDs have the ability to control wear levelling built into the firmware. The very expensive ones have a triggering mechanism also built in while the budget ones rely on the driver to supply the trigger information - at least that is what a friend said was necessary on his Linux box with the SSD. He had to add a switch to the ahci driver during the mount operation.

I have started to standardise on Seagate FireCuda drives for my OS/2 boxes. It helps boot times with the built-in SSD, and the firmware is intelligent enough to make sure everything works as it should. Not as fast as a pure SSD but fast enough - works well with my RYZEN processor based motherboards. But we need USB 3.0 and 3.1 drivers to make modern boards useful - I have had to turn one RYZEN box over to Zorin OS because of the lack of USB drivers; only the USB mouse and keyboard ports work.

You are referring to TRIM support... With the quality of today's SSDs I do not know if you REALLY need TRIM support. The German computer magazine c't did a test a few years ago with SSDs being written to for months without a single failure...

You could also be looking at a defective SSD... How old is the SSD? In the last half year I have known two people running only Windows on SSDs whose drives just stopped working from one moment to the next... And that was Windows 7, which has TRIM support...

Every SSD has done that by itself for ages. Even CF cards have done it since the beginning. Only SD cards and maybe USB sticks don't do it. So no problem with any SSD.

Quote

trim

We do not have such a utility. But as said before, it has not really been necessary for years. If you really like, you can create a new partition and copy the data over every 5 years or so if that makes you feel better, but....

Quote

flushing the delay write cache

AFAIK the command has been implemented in the driver for many years, probably since the beginning of os2ahci. There are some comments about it in the XWP shutdown code from some experts, which lead me to believe that such issues were addressed many years ago, 2009 or so.

Quote

In the end you end up with some cells that are over worked and don't retain their status as long as they should.

Hard to believe, as the whole wear levelling business is the job of the controller on the drive. It's not influenced by any OS or OS driver. Of course there are differences in quality, reliability and durability, as with all other electronic components. Pick a trusted manufacturer and you can hardly damage it with OS/2 usage patterns. Which does not mean there will never be a failure. If you are the one in millions ....

Tell me that when os2ahci actually sees and works with the ASMedia AHCI chips. At the moment I have several boards that have 6 or more SATA sockets, but only 2 of them are usable; the other 4 are connected to ASMedia chips.