Posted by Vide on November 23, 2007
Today I was debugging a problem where keepalived failed to notice that a real server behind a virtual IP it manages had died.
The problem was really strange because the check was very simple:
real_server 192.168.1.65 3306
{
TCP_CHECK
{
connect_port 3306
bindto 192.168.1.65
connect_timeout 2
}
}
I wrote this configuration after reading the keepalived.conf man page, which lists these three options for TCP_CHECK without going into detail. I assumed that bindto IPADDR specifies the IP address to connect to for the check. I was wrong: with this configuration, keepalived doesn’t notice anything at all when the real server dies. My guess is that the “bindto” option instead selects the local IP address (on the LVS director) that the check binds to when it connects to the external IP:port.
So, changing the configuration to look like this:
real_server 192.168.1.65 3306
{
TCP_CHECK
{
connect_port 3306
connect_timeout 2
}
}
fixed the problem. Keepalived is a great product and works quite well, but its documentation is a bit disappointing.
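To make the difference between the address being checked and the address being bound to more concrete, here is a minimal Python sketch of a TCP connect check. It is not keepalived's implementation, only an illustration under my own assumptions: the function name tcp_check and its bind_to parameter are made up for this example, and bind_to is simply passed as the source_address of Python's socket.create_connection.

```python
import socket


def tcp_check(host, port, timeout=2.0, bind_to=None):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    bind_to (hypothetical name) is a LOCAL address used as the source of the
    connection. It is not the address being checked; that is always `host`.
    """
    # Port 0 lets the operating system pick any free local source port.
    source = (bind_to, 0) if bind_to else None
    try:
        with socket.create_connection(
            (host, port), timeout=timeout, source_address=source
        ):
            return True
    except OSError:
        # Covers refused connections, timeouts, and bind failures alike.
        return False
```

In this sketch, tcp_check("192.168.1.65", 3306) corresponds to the working configuration: the target comes from the real_server line and no source address is forced.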
Posted in Fixes, LVS, Linux, Networking | Tagged: fix, keepalived, LVS, tcp_check problem
Posted by Vide on November 15, 2007
We have a couple of Dell PowerEdge SC1435 servers (dual Opteron) with lspci output like this:
00:01.0 PCI bridge: Broadcom HT1000 PCI/PCI-X bridge
00:02.0 Host bridge: Broadcom HT1000 Legacy South Bridge
00:02.1 IDE interface: Broadcom HT1000 Legacy IDE controller
00:02.2 ISA bridge: Broadcom HT1000 LPC Bridge
00:03.0 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:03.1 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:03.2 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:07.0 PCI bridge: Broadcom Unknown device 0140 (rev a2)
00:08.0 PCI bridge: Broadcom Unknown device 0142 (rev a2)
00:09.0 PCI bridge: Broadcom Unknown device 0144 (rev a2)
00:0a.0 PCI bridge: Broadcom Unknown device 0142 (rev a2)
00:0b.0 PCI bridge: Broadcom Unknown device 0144 (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
03:0d.0 PCI bridge: Broadcom HT1000 PCI/PCI-X bridge (rev c0)
03:0e.0 IDE interface: Broadcom BCM5785 (HT1000) PATA/IDE Mode
On these machines it may happen that, when there is disk activity, the SATA disk just disconnects, causing the processes using the disk to freeze for 30-60 seconds. The output in /var/log/messages looks something like this:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 312500000 512-byte hdwr sectors (160000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
The solution is to put
pci=noacpi
in your Grub/Lilo configuration as a parameter of the kernel you’re using. I’ve experienced this problem with kernels 2.6.18 and 2.6.20, both 32- and 64-bit.
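After rebooting, it is worth confirming that the kernel really received the parameter. A small Python sketch that reads /proc/cmdline, where Linux exposes the boot command line, could look like this. The helper names has_kernel_param and read_cmdline are made up for this example.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole word in the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    if has_kernel_param(read_cmdline(), "pci=noacpi"):
        print("pci=noacpi is active")
    else:
        print("pci=noacpi is NOT active")
```

This only shows that the option was passed to the kernel, not that it has any effect on the SATA problem.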
EDIT:
I spoke too soon: it seems that the trick doesn’t work, so we are back where we started with this SATA problem on these machines. Any ideas from the web?
Posted in Fixes, Linux | Tagged: Dell, fix, Linux, sata problem