Sunday, August 16, 2009

Service restored: treat with caution

atreus is running again, but an attempt to upgrade the kernel failed for reasons as yet undetermined (I'm finding that about 50% of kernel upgrade attempts fail at the moment, and am really missing the days of Linux 2.2).

Logs from the problem period suggest that memory exhaustion was the cause; I believe this then just made the system grind to a halt in the usual fashion. It doesn't appear to have been an external attack, which means it might happen again. I'll monitor things over the next few days, and if I'm really lucky and have been a good boy, Linux will give us a new stable kernel that can boot on my hardware.

Update

atreus had to be hard-rebooted (possibly because of a SYN flood attack, although I'm unconvinced); it was completely unresponsive on the terminal. It then took most of two hours to verify the RAID set, and it will hopefully now be able to complete booting. With luck we'll be back up less than 24 hours after we went down.

atreus status

atreus is currently unavailable on all services; I'm waiting for an engineer from my ISP to reach the site so we can investigate.

Friday, July 17, 2009

atreus service restored

atreus is now connected and running again. At some point soon I may need to reboot it to confirm that it will come back up more smoothly after a power cycle in future, but I will announce that in the usual way.

atreus status

The data centre that housed atreus was hit by a power failure last night. atreus has since been moved to another data centre run by the same ISP (big thanks to Jon Morby and the fido.net guys for their work here).

There is an issue with the primary kernel image not being functional, so it's being brought up on the secondary image, which apparently still works fine on the ground. It isn't yet correctly plugged back into the network, which should happen in the next few minutes, at which point I can log in and figure out what's going on, and fix the kernel rebooting problem. Even on a power cycle, with hardware RAID and a reasonably resilient fs, we shouldn't have any further problems as a result of this outage.

Thursday, July 16, 2009

Current downtime

Some time between about 19:45 and 21:15 BST today (16th July), atreus became unreachable, apparently from everywhere. At this stage, given what I'm seeing off the network, I think it's most likely that we've had a kernel crash, although there are other possibilities.

There isn't much I can do right now, but I'll investigate this further tomorrow morning.