We’ve been bringing up a number of new servers at the San Francisco data center. We’ve got some great Core2Duo machines which draw between 0.75A and 1.10A but have pretty substantial horsepower. So far so good, and almost all of the machines went in happy and stayed that way.
An interesting aspect of hosting servers at a data center on the west coast is that there’s plenty of space and lots of connectivity, but fairly scarce power. This matters for two reasons. First, you need to be careful not to put too many servers in one cabinet, or you’ll blow your 20A fuse and every server in the cabinet shuts off at once. Second, and nearly as scary, you may not get enough cooling from the building’s overtaxed and potentially undersized air conditioners, and slowly cook your servers to an early death.
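To put rough numbers on the cabinet limit, here is a back-of-the-envelope sketch in Python. The 20A figure and the 0.75A to 1.10A draw come from above; the 80% derating factor is my own assumption (a common rule of thumb for continuous loads, not something our data center specified), and the function name is made up for illustration. It also ignores startup inrush and anything else sharing the circuit.

```python
def max_servers(circuit_amps, per_server_amps, derate=0.8):
    """Return how many servers fit on one circuit.

    derate is an assumed safety factor: only this fraction of the
    circuit rating is treated as usable for continuous load.
    """
    usable_amps = circuit_amps * derate
    # int() truncates, so a partial server never counts as a whole one
    return int(usable_amps / per_server_amps)


if __name__ == "__main__":
    # Plan around the worst-case draw (1.10A), not the best case (0.75A).
    print("worst case:", max_servers(20, 1.10))
    print("best case: ", max_servers(20, 0.75))
```

Planning around the worst-case draw is the conservative choice: the difference between the two answers is several servers per cabinet, and the fuse only cares about the moment everything is busy at once.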
[We use a great system called syslog which collects all kinds of system stats and logs in one place for all of our servers – it makes it simple to collect and plot data like ‘fan speed, cpu temperature, and cpu workload of all machines for the last few days, data points every 15 minutes’.]
You can see on the plot below that on first power-up, “sf3” was running hot.
This is one of two machines where the factory hard drives were flaky and I replaced them with drives from another vendor. Because the new drives are full-on SATA and our cases only supply standard ATX-style 4-pin drive power, I needed a jumper cable. This is just a few inches of 4 wires, plus connectors on the ends. It turns out I’d done two dumb things.
I left these jumper cables hanging down a little near the motherboard, an inch away from a fan exhaust. I’d also stashed the folded-up spare IDE cable in an evidently unused space within the case.
On power-up the CPU was running 30F too hot, as shown in the graph below. Once I was home from the data center and had time to build the thermal report for all the servers, the two machines with SATA jumpers stood out: sf3 was the extreme example, but sf9 was running 20F too hot. A half-hour drive each way plus a few minutes of fiddling with the cables and baffles, and the CPU temp was back under control.
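For the curious, the "stood out" check in a thermal report can be as simple as comparing each machine to the fleet median. This is a minimal Python sketch, not the report I actually ran: the function name, the 15F margin, and every temperature reading below are invented for illustration (chosen only so that sf3 and sf9 land 30F and 20F over the rest).

```python
from statistics import median


def flag_hot_hosts(temps_f, margin_f=15.0):
    """Return {host: degrees F above the fleet median} for hot hosts.

    temps_f maps host name to CPU temperature in Fahrenheit.
    margin_f is an assumed threshold, not a vendor spec.
    """
    baseline = median(temps_f.values())
    return {
        host: round(temp - baseline, 1)
        for host, temp in temps_f.items()
        if temp - baseline >= margin_f
    }


if __name__ == "__main__":
    # Hypothetical readings; only the host names sf3 and sf9 are real.
    readings = {
        "sf1": 100.0, "sf2": 101.0, "sf3": 131.0, "sf4": 101.0,
        "sf5": 102.0, "sf6": 99.0, "sf9": 121.0,
    }
    print(flag_hot_hosts(readings))
```

Using the median rather than the mean keeps one or two cooking machines from dragging the baseline up and hiding themselves.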