4 Common Server Hardware Failure Causes & Troubleshooting
Park Place Hardware Maintenance
As a System Administrator or Data Center Manager, you’ve probably paid your midnight dues at least once in your career. Whether it’s coming into the data center in the middle of the night or spending hours poring over logs and troubleshooting server hardware to find the cause of a failure, data center management can be a headache.
Whether you’re doing some preemptive research or you’re currently in the midst of server troubleshooting, this quick guide will help you gain clarity about the most common server problems.
Server Hardware Failure Statistics
You may be tasked with maintaining a corporate data center or providing client hosting, but either way, outages can leave a pit in your stomach. When downtime strikes, your servers and networking hardware are the usual culprits. In fact, 80% of all outages in data centers result from server hardware.
By far the most common form of server hardware failure is hard drive malfunction. In fact, 80.9% of all failures come from HDD malfunctions, so it’s always the first place to look.
The likelihood of failure also climbs as the server ages. The average server hardware failure rate starts at 5% in year one and reaches 18% by year seven, so aging hardware is definitely something to watch.
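To get a feel for what those rates mean for a fleet, here is a minimal Python sketch. It assumes, purely for illustration, that the annual failure rate rises in a straight line between the two figures quoted above (5% at year one, 18% at year seven). Real failure curves are rarely that tidy, and the function names are hypothetical.

```python
def estimated_failure_rate(age_years: float) -> float:
    """Estimate the annual failure rate for a server of a given age.

    ASSUMPTION: linear growth between the two data points in the text
    (5% at year 1, 18% at year 7). Ages outside that span are clamped.
    """
    year_lo, rate_lo = 1, 0.05
    year_hi, rate_hi = 7, 0.18
    age = min(max(age_years, year_lo), year_hi)
    return rate_lo + (rate_hi - rate_lo) * (age - year_lo) / (year_hi - year_lo)


def expected_failures(server_count: int, age_years: float) -> float:
    """Expected number of servers failing in a year, under the same assumption."""
    return server_count * estimated_failure_rate(age_years)
```

For a hypothetical fleet of 200 four-year-old servers, expected_failures(200, 4) works out to roughly 23 failures a year under that straight-line assumption.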
Park Place Technologies offers multivendor IT server support for your post-warranty equipment. If you want to prolong the useful life of your hardware while maintaining peace of mind, contact us for a quote today!
4 Types of Server Failures
When it comes to server problems, there are four main categories that you should consider to quickly resolve any issues.
1. Hard Drive Failure
Spinning disks are notoriously fault prone. While the median lifespan of an HDD is just over six years, plenty of things can, and do, go wrong before then.
Causes of Hard Drive Failures
There are three common causes behind the failure of hard drives:
- Mechanical failure
- Electronic failure
- Logical failure
Common identifiers for mechanical issues include clicking and scratching noises. Common causes are being dropped, jarred, or exposed to unfavorable environmental conditions. Electronic failure can happen during voltage spikes or if overheated. Last, logical failures can happen from data corruption, improper registry changes, or accidental drive formatting.
Beyond swapping in a new drive or trying different cables (which could lead to data loss), admins can use command-line tools like fsck on Linux machines and chkdsk on Windows to check for and repair logical errors.
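As a rough illustration, the Python sketch below builds a report-only invocation of either tool, so nothing is repaired until you have seen the findings. The helper name and the example targets are hypothetical. Note that fsck passes -n through to the filesystem-specific checker, and most (not all) checkers treat it as "report problems, change nothing". Running chkdsk without /f or /r likewise only reports the volume status.

```python
import platform
from typing import List, Optional


def build_check_command(target: str, system: Optional[str] = None) -> List[str]:
    """Return a report-only disk check command for the given OS.

    `target` is a device or volume such as "/dev/sdb1" or "D:" (examples only).
    `system` defaults to the OS this script is running on.
    """
    system = system or platform.system()
    if system == "Windows":
        # No /f or /r switch: chkdsk reports status without fixing anything.
        return ["chkdsk", target]
    # -n asks the filesystem checker to answer "no" to every repair prompt.
    return ["fsck", "-n", target]
```

The resulting list can be handed to subprocess.run() with administrative privileges. On Linux, the file system should be unmounted before it is checked.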
Of course, building in redundancy via RAID or a distributed parallel filesystem can help prevent these failures from becoming an issue. Opting for solid state drives (SSDs) also mitigates some of these risks, especially mechanical failures.
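To show why redundancy helps, here is a toy Python sketch of the single-parity idea that RAID levels such as RAID 5 build on: the parity block is the XOR of the data blocks, so any one missing block can be rebuilt from the survivors. The block contents are made up, and real arrays add striping, rotation of parity across drives, and much more.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # stored on a fourth drive

# Simulate losing the second drive, then rebuild it from what is left.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'BBBB'
```

A single parity block only covers one failed drive at a time, which is one reason failed disks should be replaced promptly.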
2. Motherboard Failure
Motherboards are perhaps the most difficult common server problem to deal with. It can be hard to tell whether the failure is due to the motherboard itself or another piece of hardware that’s connected to it.
Causes of Motherboard Issues
There are three common causes of motherboard malfunctions:
- Overheating
- Electrical failure
- Physical damage
Overheating, one of the most common server hardware issues, happens for a few reasons. Blockage in the fans can prevent the cooling system from functioning properly. A warm or humid environment can cause thermal throttling. Depending on your current data center infrastructure management stack, you can generally monitor air quality and temperature events before they cause system failures.
Electrical failure can occur due to short circuiting if any metal touches the motherboard while it’s running, such as accidental contact during a hot swap. A static charge on a technician’s finger or even a loosely fitted component can also cause circuit malfunctions. Power surges and spikes are also common culprits, so it’s important to use surge protectors.
Physical damage to your server and storage infrastructure components is less common in data centers. Impacts on the rack or a liquid spill can spell disaster, but at least they are easier to diagnose.
There’s always the possibility that the hardware has simply reached its end of life (EOL). A quality motherboard can last 10 to 20 years, so if you’re running legacy equipment in your data center then this could be a factor.
3. Power Source Failures
Blackouts, brownouts, fluctuations caused by severe weather, and poor electrical infrastructure inside your building or data center can cause unexpected power outages. In turn, power source failures can lead to frustrating errors, server crashes, and irreversible damage to your IT operations.
Causes of Power Supply Problems
Some of the more common causes of power supply disruption include the following:
- Environmental
- PSU hardware issues
- Faulty connections
Lightning strikes, power outages caused by storms, and other environmental factors can create problems for supplying power to servers. The best way to protect against power outages is an uninterruptible power supply (UPS), an especially crucial tool for decreasing your server hardware failure rates.
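When sizing a UPS, a back-of-the-envelope runtime estimate is a useful starting point. The Python sketch below divides usable battery energy by the load. Every number and name in it is a hypothetical placeholder, and it ignores battery aging and the way capacity drops under heavy load, so treat the result as an optimistic upper bound rather than a vendor figure.

```python
def estimate_runtime_minutes(battery_wh: float, load_watts: float,
                             inverter_efficiency: float = 0.9) -> float:
    """Rough UPS runtime: usable energy (Wh) divided by load (W), in minutes.

    ASSUMPTIONS: constant load, a fixed inverter efficiency (0.9 is a
    placeholder), and a battery that delivers its full rated capacity.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    usable_wh = battery_wh * inverter_efficiency
    return usable_wh / load_watts * 60
```

For example, a hypothetical 1,000 Wh battery feeding a 500 W load at 90% efficiency gives about 108 minutes on paper. Check the manufacturer’s runtime charts before relying on any such figure.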
It’s also possible to have power issues within the server itself. The power supply unit that provides power to the motherboard can also malfunction, either through a fault in the unit itself or in the cabling. Sometimes all it takes is replacing a cable or even unplugging it and plugging it back in.
4. Air Quality and Temperature Failures
The final piece of the puzzle is carefully controlling the climate inside your data center. A proper HVAC system is just as important to server maintenance as hardware and patching.
Causes of Temp/Air Quality Issues
Common server hardware issues can be the result of the following environmental factors:
- Overheating
- Dust
- Humidity
Overheating can contribute to the thermal throttling that we discussed above, and it’s the main reason that server rooms are usually kept between 64 and 81 degrees F (18-27 C). Dust can clog fans and heatsinks and can subsequently lead to overheating. Humidity also needs to be controlled. Moisture in the air and electronics don’t mix well, and humidity can create problems such as hardware corrosion or short circuiting.
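If your sensors expose raw readings, a simple range check against the 18 to 27 C band mentioned above can flag trouble early. The Python sketch below is only an illustration: the function names and labels are hypothetical, and how you collect the readings depends on your monitoring stack.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32) * 5 / 9


def classify_temperature(temp_c: float, low: float = 18.0,
                         high: float = 27.0) -> str:
    """Label a reading against the 18-27 C range quoted in the text."""
    if temp_c < low:
        return "too cold"
    if temp_c > high:
        return "too hot"
    return "ok"


# Hypothetical inlet readings in Celsius from three racks.
readings = {"rack-a": 22.5, "rack-b": 29.1, "rack-c": 17.0}
alerts = {rack: classify_temperature(t) for rack, t in readings.items()
          if classify_temperature(t) != "ok"}
```

Here alerts ends up holding rack-b as "too hot" and rack-c as "too cold". A similar check could be applied to humidity, using thresholds from your equipment vendor.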
Avoid Server Failure with a Trusted Partner
Troubleshooting your server hardware is frustrating, but it doesn’t have to happen at all when you have the right data center and networking optimization partner in your corner. Park Place Technologies has been providing third-party data center maintenance for over 30 years and can help you maximize your uptime.
Whether your operations are better suited for post-warranty support paired with 24/7 data center hardware monitoring, or fully managed server management services, we can provide the support you need.
Contact us today to learn how we can help your team do more with less!