So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I wasn't there to see it happen, I just knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or can one be enabled that only keeps the last hour of activity?

ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now. I found a file named kern.log that is 1.7 GB in size and a kern.log.1 that is about 650 MB. These are the messages:
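A log that size is too big to open whole; it may help to inspect only the tail, or count the error lines. Here is a sketch that works on a scratch file (`/tmp/kern.sample` stands in for `/var/log/kern.log`, and the "PCIe Bus Error" string is the usual corrected-AER kernel message):

```shell
# Create a tiny stand-in for a multi-GB kern.log so this is safe to run anywhere
printf 'kernel: usb 1-1: new full-speed USB device\nkernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer\n' > /tmp/kern.sample

tail -n 200 /tmp/kern.sample                 # look only at the last 200 lines
grep -c 'PCIe Bus Error' /tmp/kern.sample    # count the PCIe error lines
```

On a systemd-based install, `journalctl -k --since "1 hour ago"` gives roughly the "last hour of activity" view asked about above.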

thanks in advance

Look at the syslog: click the Ubuntu button at the top left, type "sy", then click on System Log in the results.

When I do that, it gives me a stream of those messages from my previous post.

First of all, a big thank you to fullzero and everyone contributing to this distro!

I've been struggling with the Genoil crash issue and the lack of a watchdog implementation for the past few days, and I have a band-aid solution that seems to actually be working quite well; perhaps it can help others in the community:

Essentially you need to split the Genoil output to a file, grep it (we only care about 'error' instances), and then use this output as the input for a monitoring script that kills and restarts the misbehaving process.

So we have 2 scripts launched in screen as daemons: an "ltail" script and an "ett" script.

Finally, I also send the output of ltail to timestamp.log to track how many times Genoil fails per hour. Aiming at roughly 1 crash per hour, this gives me about 130 MH/s out of 5x GTX 1060, which is a good 20+ MH/s higher than Claymore... and most importantly it gives stable hashing despite the OC-introduced errors. The recovery is literally seconds.

Oh yeah, and I also run $ tail -f ~/eth/Genoil-U/timestamp.log in a screen, as well as watch -n 5 'sensors | grep Core' in another screen, to fine-tune the OC vs. crashes per hour vs. temps.

Hope this helps, and I hope the message is not too chaotic. Cheers!
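The actual "ltail"/"ett" scripts aren't shown in the post, so here is a minimal sketch of the same kill-and-restart idea; the function name, log paths, and miner command line are all invented for illustration:

```shell
#!/usr/bin/env bash
# Minimal sketch of the grep-for-errors watchdog described above.
# watch_once, the log paths, and the miner command are assumptions,
# not the poster's actual "ltail"/"ett" scripts.

watch_once() {                               # one crash/restart cycle
    local log=$1 stamps=$2; shift 2
    : >"$log"
    "$@" >>"$log" 2>&1 &                     # launch the miner in the background
    local pid=$!
    until grep -qi 'error' "$log"; do        # poll the miner output for errors
        kill -0 "$pid" 2>/dev/null || break  # stop polling if the miner exited
        sleep 1
    done
    date >>"$stamps"                         # timestamp the failure for crashes/hour
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
}

# On a real rig you would run something like this inside a screen session:
#   while true; do watch_once ~/genoil.log ~/timestamp.log ./ethminer <args>; done
```

Polling the log with grep avoids the tail -f plumbing and still catches errors within about a second, matching the "recovery is literally seconds" observation.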

Also, I couldn't find how I can see the current mining process. I did see the screen -r commands, but that implies killing the current process and restarting it. I'd like to be able to see, from SSH, the current mining process without killing it. Is this possible?

If you want to monitor the mining process via screen you're going to have to kill the initial gnome-terminal. There's no way around that, as screen can only reconnect to an existing screen session.

This shouldn't be a big deal if you have a stable rig. You only need to do it once per reboot. My process is:

1. From my desktop where I monitor my rigs I initiate a constant ping:

Code:

ping -t 10.20.30.40 # -t is the Windows continuous-ping flag; on Linux, plain `ping` already runs continuously. Substitute your rig's IP: find it in your router, by running nmap on your LAN subnet, or by running ifconfig from a guake terminal on the rig if you have a monitor connected

2. Boot the rig
3. Wait until I begin to get ping responses from the rig, thus indicating Ubuntu has booted and the rig has network connectivity
4. SSH into the rig (user: m1 password: miner1)
5. Initiate a screen session:

Code:

screen -S [name for your rig, make one up or call it "rig"]

6. Start nvidia-smi dmon to watch for the mining process to begin (by waiting until this happens you know the OC settings, fan speed settings, etc. have been applied. Running those commands from within screen isn't 100% consistent IME, as I always see error messages when I try it that way. It's best to let those settings commands run from gnome-terminal as Ubuntu first boots, IMO).

Code:

nvidia-smi dmon

7. Wait until you see wattage go up and GPU utilization go up to 100% (which indicates that the oneBash script concluded and opened the mining process). Exit nvidia-smi with CTRL + c
8. Find the PID for gnome-terminal:

Code:

ps aux | grep gnome-terminal

9. Kill it:

Code:

kill [PID from step 8]

10. Restart mining:

Code:

bash '/media/m1/1263-A96E/oneBash'

It might seem like a lot of steps, but it takes all of 120 seconds and you shouldn't need to do it very often once your rig is dialed in. You're losing maybe 1 minute's worth of hashes on avg of every week? Pretty negligible considering the convenience of monitoring from another workstation, and you're not using up system resources by using Teamviewer. This also lets you go completely headless if you buy a dummy HDMI plug. I just updated from 16 to 17 and didn't need to haul my extra monitor upstairs to do it. Easy peasy.
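Steps 8-10 can also be wrapped in a tiny helper; `kill_by_name` is an invented name (not part of nvOC), and the oneBash path is the one quoted above:

```shell
#!/usr/bin/env bash
# Hypothetical condensation of steps 8-10; kill_by_name is an invented helper.

kill_by_name() {
    # Find the oldest process whose command line matches $1
    # (same idea as `ps aux | grep`) and send it SIGTERM.
    local pid
    pid=$(pgrep -o -f "$1") || return 1
    kill "$pid"
}

# On the rig (inside screen, over SSH) you would then run:
#   kill_by_name gnome-terminal && bash '/media/m1/1263-A96E/oneBash'
```

`pgrep -o` picks the oldest match, which avoids accidentally grabbing the grep/pgrep process itself the way `ps aux | grep` sometimes does.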

I just checked whether they are seated correctly on the motherboard and on the cards, and they are. I did an lspci command and it looks like id a2eb is the first GPU on the rig; it has its own power cord to the power supply on the card and on the riser. The card does work, but it has these errors:

Looks like I was wrong about a2eb being the first GPU. I removed the GPU completely and I'm still getting these errors as soon as I boot; it won't even go into the GUI any more.

I have been trying to solve a problem for three days. I changed BIOS versions (0325, 0608, 0610) and risers, Above 4G decoding is enabled, and I updated the NVIDIA drivers to 381.22; nothing helps. Does anybody have ideas?

My guess is your mobo is trying to use / is using SLI. Are you using an M.2 SSD?

There should be some setting in the BIOS related to SLI; disable it. What slots are you using, and are you using risers? If so, on which GPUs?

If you are using risers; how are they powered?

Hi, no, I don't use an M.2 SSD. I use version 006s risers with the molex socket.

I managed to solve the problem. I modified /etc/default/grub

m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'
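The post doesn't show which grub line was actually changed, but a common edit for rigs whose kernel logs are flooded with corrected PCIe/AER errors is adding pci=noaer to the kernel command line (an assumption on my part, not the poster's confirmed fix). Sketched here against a scratch copy so it is safe to try:

```shell
# Demonstrated on /tmp/grub.demo; on the rig you would edit /etc/default/grub
# itself, then run `sudo update-grub` and reboot.
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > /tmp/grub.demo
sed -i 's/quiet splash/quiet splash pci=noaer/' /tmp/grub.demo
cat /tmp/grub.demo    # GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
```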

Those look like the errors I was getting with some crappy PCIe extenders I had recently ordered. Here's a closeup of the inadequate soldering on the bit that goes in the slot; the other end of the riser is probably similar. Click for the full-res original:

I'll be sending these back. There were no reviews when I bought them, but since then someone else has left a 1-star review.

I saw your post about the soldering; I checked mine for the same thing and the solder joints look good. I have another riser set and tried swapping the parts one at a time, and still kept getting the same errors. I plugged the riser straight into the PSU using one of the molex cables that came with it and there was no change (actually, that's when it no longer booted to the GUI). I used the same riser and mobo with an AMD GPU and ethOS and it never reported the error, but SimpleMining OS did, and those errors started showing. Windows with the Nvidia GPUs and that same riser didn't have an issue. It's just weird. I truly have no clue as to what is going on.

ERROR: Error assigning value 85 to attribute 'GPUTargetFanSpeed' (m1-desktop:0[fan:3]) as specified in assignment '[fan:3]/GPUTargetFanSpeed=85' (Unknown Error).
ERROR: Error assigning value 85 to attribute 'GPUTargetFanSpeed' (m1-desktop:0[fan:4]) as specified in assignment '[fan:4]/GPUTargetFanSpeed=85' (Unknown Error).

I'm running on a Gigabyte Aorus Gaming 7 Z270, 4 Zotac 1070 AMP Edition and 1 Gigabyte 1070 Gaming G1. I cannot OC GPU 3 and 4 in the NVIDIA X Server Settings either. And I found another thing:

/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/eth/Genoil-U/ethminer)
/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/cpp-ethereum/build/libethcore/libethcore.so)
/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/cpp-ethereum/build/libethash-cl/libethash-cl.so)
/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/cpp-ethereum/build/libethash-cl/libethash-cl.so)
Genoil's ethminer 0.9.41-genoil-1.1.7

Is this ok?

Edit: all I had to do was modify xorg.conf with the GPUs' BusIDs as found in NVIDIA X Server Settings. I've seen a post saying to change them to the BusIDs found in nvidia-smi, but that did not work at all. Hope this will help.
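For anyone hitting the same fan-speed errors, here is a hypothetical sketch of the fix from the Edit above. The BusID in each xorg.conf "Device" section must be the decimal PCI:bus:device:function that NVIDIA X Server Settings reports; nvidia-smi prints the bus number in hexadecimal, which is likely why those values did not work. "PCI:5:0:0" and "Device3" are example values only:

```shell
# Example xorg.conf Device section with an explicit (decimal) BusID.
# Stored in a variable and printed so the snippet is easy to inspect.
xorg_snippet='Section "Device"
    Identifier "Device3"
    Driver     "nvidia"
    BusID      "PCI:5:0:0"
EndSection'
printf '%s\n' "$xorg_snippet"
```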

Solved my power issue. OK, so rewind: I was a rookie and started my rig with 1 card. fullzero fixed that issue for me by showing me how to copy my xorg file over. I think that was also my power issue.

Essentially I re-imaged the SSD, and now I have power and my cards are doing 28.5 ETH and 500 SC each.

For anyone else who started with 1 card and copied xorg over, you might want to re-image as well. I was on 25.5 ETH and 550 SC before.

Is it possible to mine a different coin with each GPU? At the moment one can select a single coin for the full rig, but sometimes you have a heterogeneous rig and would like to mine different currencies with each GPU, or even diversify the coins with a homogeneous rig.
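Nothing in the thread suggests nvOC exposes this, but most CUDA miners can be pinned to a subset of the GPUs with the standard CUDA_VISIBLE_DEVICES variable, so two miner processes could each work on a different coin. The miner binaries and flags below are placeholders, not real nvOC commands:

```shell
# Sketch: per-process GPU masking via CUDA_VISIBLE_DEVICES. On a real rig
# you would launch two miners, each seeing only its own GPUs:
#
#   CUDA_VISIBLE_DEVICES=0,1 ./ethminer <args> &      # coin A on GPUs 0-1
#   CUDA_VISIBLE_DEVICES=2,3 ./other_miner <args> &   # coin B on GPUs 2-3
#
# Demonstration that each child process gets its own mask:
CUDA_VISIBLE_DEVICES=0,1 bash -c 'echo "this process would see GPUs: $CUDA_VISIBLE_DEVICES"'
```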

I have been GPU mining for over a year now (mostly ETH on AMD rigs) using ethOS. In my experience, 70-80% of the hardware problems I have had have been related to poor-quality risers. It's hard to find a source of decent-quality ones: they are all made in China, seemingly with little or no quality control. They are a very cheap, high-volume item, so unfortunately for us it's just luck whether you get good ones. I have a friend who ordered a bag of 10 and 7 of them were faulty right away; then an eighth failed after 24 hours.

So if you find a reliable source - buy twice as many as you think you will need!

/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/eth/Genoil-U/ethminer)
/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/cpp-ethereum/build/libethcore/libethcore.so)
/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/cpp-ethereum/build/libethash-cl/libethash-cl.so)
/home/m1/eth/Genoil-U/ethminer: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by /home/m1/cpp-ethereum/build/libethash-cl/libethash-cl.so)

This is not a problem.

Don't connect a monitor directly to the motherboard; only connect a monitor to the primary GPU (the one in the 16x slot closest to the CPU).

If you aren't connecting a monitor directly to the motherboard; I would try reimaging the USB key.