Hot and heavy – rackmount servers and thermal management

25 Jan

We’ve been bringing up a number of new servers at the San Francisco data center. We’ve got some great Core 2 Duo machines that draw between 0.75A and 1.10A each but have pretty substantial horsepower. So far so good: almost all of the machines went in happy and stayed that way.

An interesting aspect of hosting servers at a data center on the west coast is that there’s plenty of space and lots of connectivity, but fairly scarce power. This matters for two reasons. First, you need to be careful not to put too many servers in one cabinet, or you’ll blow your 20A fuse and every server on that circuit shuts off at once. Second, and nearly as scary, the building’s overtaxed and potentially undersized air conditioners may not give you enough cooling, and you slowly cook your servers to an early death.
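
To put rough numbers on the first point, here is a minimal sketch of the cabinet power budget. The 20A circuit and the 0.75A to 1.10A per-server draw come from the figures above; the 80% derating factor is my own assumption (a common rule of thumb for continuous loads), and the function name is just for illustration.

```python
def max_servers(circuit_amps, per_server_amps, derate=0.8):
    """Return how many servers fit on one circuit.

    derate is an ASSUMED safety factor (0.8 = load the circuit to 80%);
    pass derate=1.0 to see the absolute limit with no headroom.
    """
    usable_amps = circuit_amps * derate
    # int() truncates, so a partial server never counts.
    return int(usable_amps / per_server_amps)


if __name__ == "__main__":
    for draw in (0.75, 1.10):
        print(f"{draw:.2f}A per server: {max_servers(20, draw)} servers per cabinet")
```

Planning around the worst-case draw (1.10A) rather than the typical one is the cautious choice, since the cost of guessing wrong is a whole cabinet going dark.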

[We use a great system called syslog which collects all kinds of system stats and logs in one place for all of our servers - it makes it simple to collect and plot data like 'fan speed, cpu temperature, and cpu workload of all machines for the last few days, data points every 15 minutes'.]
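
As a rough illustration of the "data points every 15 minutes" idea, here is a small sketch that averages raw readings into 15-minute buckets per machine. It assumes the readings have already been pulled out of the logs as (timestamp, host, temperature) tuples; that shape and the function name are hypothetical, not what our logging setup actually emits.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def bucket_readings(readings, minutes=15):
    """Average readings into fixed-size time buckets per host.

    readings: iterable of (datetime, host, cpu_temp_f) tuples (ASSUMED shape).
    Returns {(host, bucket_start): mean temperature}.
    """
    buckets = defaultdict(list)
    for ts, host, temp in readings:
        # Round the timestamp down to the start of its bucket.
        start = ts.replace(minute=ts.minute - ts.minute % minutes,
                           second=0, microsecond=0)
        buckets[(host, start)].append(temp)
    return {key: mean(vals) for key, vals in buckets.items()}
```

The result maps straight onto one plotted line per host, with one point per bucket.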

You can see on the plot below that on first power-up, “sf3” was running hot.

This is one of two machines where the factory hard drives were flaky and I replaced them with drives from another vendor. Because the new drives are full-on SATA and our cases only supply standard ATX-style 4-pin drive power, I needed a jumper cable. This is just a few inches of 4 wires, plus connectors on the ends. Turns out I’d done two dumb things.

I left these jumper cables hanging down a little near the motherboard, an inch away from a fan exhaust. I’d also stashed the folded-up spare IDE cable in an evidently unused space within the case.

On power-up we found, as shown in the graph below, that the CPU was running 30°F too hot. Once I was home from the data center and had time to build the thermal report for all the servers, the two machines with SATA jumpers stood out: sf3 was the extreme example, but sf9 was running 20°F too hot. A half-hour drive each way plus a few minutes of fiddling with the cables and baffles, and the CPU temps were back under control.
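
The "stands out in the report" check is easy to automate. Here is a minimal sketch, assuming you already have one current CPU temperature per host: it compares each machine against the fleet median and flags anything well above it. The 15°F margin, the function name, and the sample readings are all made up for illustration; they are not the real numbers from our report.

```python
from statistics import median


def hot_hosts(temps_f, margin_f=15.0):
    """Return {host: degrees above fleet median} for hosts running hot.

    temps_f: {host: cpu_temp_f}. margin_f is an ASSUMED threshold.
    """
    baseline = median(temps_f.values())
    return {host: temp - baseline
            for host, temp in temps_f.items()
            if temp - baseline >= margin_f}


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    sample = {"sf1": 100, "sf2": 102, "sf3": 131, "sf4": 101, "sf9": 121}
    for host, excess in sorted(hot_hosts(sample).items()):
        print(f"{host}: {excess:.0f}F above fleet median")
```

Using the median rather than the mean keeps one or two badly overheating machines from dragging the baseline up and hiding themselves.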

24 hours of temperature plots

One Response to “Hot and heavy – rackmount servers and thermal management”

  1. Ram February 21, 2008 at 8:26 pm #

    Hi,

    I would like to know more about thermal management in data centers and tools used to control.

    Pl. reply.

    Thank You..!

    Ram
