hadeb05

Due to jumbo frames, hadeb05 can stop receiving DHCP discover messages from the TRBs. Solution: reload the sk98lin network driver and restart the DHCP server:

ifconfig eth0 down

modprobe -r sk98lin; modprobe sk98lin

ifconfig eth0 192.168.0.1 netmask 255.255.255.0 up

rcdhcpd restart

hadeb07

hadeb07 parameters:

SuSE 10.2

Hard disks: sda+sdb = 0.32TB, sdc+sdd = 1TB. The last two disks (sdc, sdd) were added to the machine later to serve as backup disks. They are not firmly mounted inside the machine, so take care when moving it. Nagios monitors the temperature of both disks.

Two 3GHz processors

2GB memory

hadeb04 (remote backup system)

hadeb04 serves as a remote backup system.

Software: rsnapshot, executed once a day (crontab -e)

Config file: /etc/rsnapshot.conf

Backup disk: /data/hadeb04/backup

Test mode: rsnapshot -t hourly

Fixes for rsnapshot on hadeb04:

WARNING: Could not lchown() symlink

Reason: perl Lchown module is missing

Fix: perl -MCPAN -e 'install qw(Lchown)'

ERROR: rsync returned error 12 in rsync_cleanup_after_native_cp_al()

Reason: old versions of rsync cannot hold the entire file list in memory at once when there are too many files to be rsynced.

Fix: upgrade to rsync 3.0.0 or newer. This uses an incremental recursion mode to avoid the need to hold the entire file list in memory.
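Whether the installed rsync is affected can be checked from its version number. The following is a minimal sketch, not part of the original setup; the function names are made up for illustration, and it assumes the first line of `rsync --version` has the usual form "rsync  version 3.1.2  protocol version 31".

```shell
# Succeed if the given rsync version string is 3.0.0 or newer,
# i.e. if that rsync supports incremental recursion.
rsync_is_incremental() {
    major=${1%%.*}              # text before the first dot, e.g. "3"
    [ "$major" -ge 3 ] 2>/dev/null
}

# Print the version of the installed rsync, taken from the first line
# of `rsync --version`. Some releases prefix the number with "v".
installed_rsync_version() {
    rsync --version | awk 'NR == 1 { sub(/^v/, "", $3); print $3 }'
}
```

Possible usage: rsync_is_incremental "$(installed_rsync_version)" || echo "rsync is older than 3.0.0"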

512 : SEMMNI, the maximum number of semaphore sets in the entire system (changed)

net.core.rmem_max=10485760 : Receive socket buffer size

net.core.wmem_max=10485760 : Send socket buffer size
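SEMMNI limits how many shared memory segments the event builder can use, because two semaphore sets are required per segment (see the "No space left on device" entry under Possible errors). The sketch below works out that limit from a kernel.sem string. The function names are hypothetical, and it only assumes that SEMMNI is the fourth field of kernel.sem.

```shell
# Largest number of shared memory segments that a kernel.sem setting allows.
# Argument: the four kernel.sem fields, e.g. "250 32000 32 128".
# The fourth field is SEMMNI; each segment needs two semaphore sets.
max_segments() {
    semmni=$(printf '%s\n' "$1" | awk '{ print $4 }')
    echo $(( semmni / 2 ))
}

# Succeed if the requested number of segments (second argument) fits.
segments_fit() {
    [ "$2" -le "$(max_segments "$1")" ]
}
```

With the standard setting "250 32000 32 128" this gives 64 segments; with SEMMNI raised to 512 it gives 256. The live value can be checked with max_segments "$(sysctl -n kernel.sem)".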

Possible errors:

"No space left on device". This error occurs when the event builder application tries to open more than 128 sets of semaphores (the standard setting is kernel.sem="250 32000 32 128"). 128 sets correspond to 64 shared memory segments, since two semaphore sets are required per memory segment. With the standard setting, daq_evtbuild -m 65 therefore fails with this error.

"File exists". This error occurs when semaphores left over from a previous run of daq_evtbuild were not cleaned up properly. Remove them with ipcrm -s semid (or /home/hadaq/bin/ipcrm.pl).

Howto:

List open semaphores: ipcs -s

List open shared memory segments: ipcs -m

List all: ipcs -a

Remove semaphore: ipcrm -s semid

Remove all open semaphores: /home/hadaq/bin/ipcrm.pl
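If ipcrm.pl is not at hand, the two steps above (list, then remove) can be combined by hand. The following is only a sketch and not the contents of ipcrm.pl, which are not shown here; the function name is made up, and it assumes the usual `ipcs -s` column layout "key semid owner perms nsems".

```shell
# Read `ipcs -s` output on stdin and print the semid of every semaphore
# set owned by the given user. Header and separator lines are skipped
# because their second column is not a number.
semids_of() {
    awk -v owner="$1" '$2 ~ /^[0-9]+$/ && $3 == owner { print $2 }'
}
```

Possible usage, removing all semaphore sets of user hadaq: ipcs -s | semids_of hadaq | while read -r id; do ipcrm -s "$id"; done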

RAID Array Controller

The Adaptec RAID controller was replaced on 04.06.2009.

Adaptec Storage Manager is a Java tool for controlling RAID arrays. Start it as root by executing /usr/StorMan/StorMan.sh

How to rebuild a degraded logical device with a failed segment:

Click on lxhadeb01.gsi.de (Logical system) in Enterprise view,

then click on Controller in Physical devices

Go to Actions -> Rescan

Wait until rescan is finished. You will see that the failed disk is taken out of the logical device.

Configuration

Many configuration files are overwritten by cfagent, which is started as a cron job at reboot and once per day (/etc/cron.d/gsi). To stop it, comment out the corresponding lines in /etc/cron.d/gsi.

To enable remote logins for a new user, add the user to /etc/security/access.conf (note that access.conf is also overwritten!)
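As an illustration, the entry can be added with a small helper. This is a sketch under assumptions: it uses the standard pam_access rule format "permission : users : origins", the function name is made up, and the actual layout of the access.conf on these machines may differ.

```shell
# Append a "+ : user : ALL" rule to an access.conf-style file unless an
# identical line is already present. Arguments: file path, user name.
allow_login() {
    file=$1
    rule="+ : $2 : ALL"
    grep -qxF -- "$rule" "$file" 2>/dev/null || printf '%s\n' "$rule" >> "$file"
}
```

Possible usage: allow_login /etc/security/access.conf newuser. In pam_access the first matching rule wins, so a rule appended after a catch-all deny line has no effect and would have to be moved above it by hand.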

IPMI Module

The IPMI module provides remote access to the machine. It is connected to the ITM 'yellow' network. Currently the machine hades30.gsi.de sits in the 'yellow' network and is used to access the IPMI module.

Backup-system for lxhadesdaq

The IT department cannot guarantee that lxhadesdaq will be up and running again within a few hours of a hardware failure. Therefore, a backup system should be prepared to replace lxhadesdaq in case of hardware failure.

The backup system (hadeb06) will fetch all DAQ-related files from lxhadesdaq nightly via rsync.

Directories to be rsynced on lxhadesdaq:

/home/hadaq

/home/scs

/var/diskless/linuxvme

/var/diskless/etrax

/etc/hosts

/etc/dhcp3/dhcpd.conf

/tftpboot

to be continued
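The nightly transfer for the directories listed so far could look roughly like the sketch below, run on hadeb06. It only prints the rsync commands so they can be reviewed first. The target directory DEST and the function name are assumptions, not part of the original plan.

```shell
# Print one rsync command per path to be mirrored from lxhadesdaq.
# -a preserves permissions and times, -R keeps the full source path
# below DEST, --delete removes files that no longer exist on the source.
SRC_HOST=lxhadesdaq
DEST=${DEST:-/data/lxhadesdaq-mirror}   # hypothetical target directory

PATHS="/home/hadaq /home/scs /var/diskless/linuxvme /var/diskless/etrax
/etc/hosts /etc/dhcp3/dhcpd.conf /tftpboot"

print_sync_commands() {
    for p in $PATHS; do
        echo "rsync -aR --delete $SRC_HOST:$p $DEST/"
    done
}
```

Possible usage: print_sync_commands to inspect the commands, print_sync_commands | sh to run them, for example from a nightly cron job.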

Lustre mount

The new kernel 2.6.22-gsi-lustre was installed by Thomas Roth, and the Lustre cluster is mounted at /lustre_alpha.