How to Check Hard Disk Health Without Downtime

Especially if you are running Windows Server in a production environment, you will want to check the health of your hard disks without downtime. After all, most servers are “always-on” with only a few opportunities to take them offline. Fortunately, there are several ways to check your hard drives without shutting down the server or interrupting its services.

Step 1: Check event viewer logs for disk warnings

If serious disk issues are simmering, you will often find disk warnings in the Event Viewer logs. Unfortunately, the term ‘warning’ was not chosen carefully; ‘error’ would have been more accurate, since in our experience a logged disk ‘warning’ nearly always indicates a hardware fault, such as a bad sector. Note that many disk errors, including bad sectors, don’t show up in Windows until the disk has exhausted its ability to self-repair. Ideally you want to catch a failing drive long before that, and Steps 3 and 4 describe a method for doing exactly that. Before we get to internal disk checks, we look at a more common issue: file system corruption.
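Looking through the logs can also be done from the command line, which helps on servers without a user interface. The following is a minimal Python sketch that only builds a wevtutil query for errors and warnings in the System log. It assumes the event provider is named 'disk'; the event ID hints are a short illustrative list rather than a complete catalog, and the function names are hypothetical.

```python
# Event IDs often seen from the disk provider when hardware misbehaves.
# Assumption: an illustrative selection only, not an exhaustive list.
DISK_EVENT_HINTS = {
    7: "bad block reported by the device",
    11: "controller error",
    51: "error during a paging operation",
    153: "I/O operation was retried",
}


def build_disk_event_query(max_events=20):
    """Return a wevtutil command that lists recent disk errors and warnings."""
    # Level 2 is Error and Level 3 is Warning in the Windows event schema.
    xpath = "*[System[Provider[@Name='disk'] and (Level=2 or Level=3)]]"
    return [
        "wevtutil", "qe", "System",
        "/q:" + xpath,
        "/c:" + str(max_events),  # limit the number of events returned
        "/rd:true",               # newest events first
        "/f:text",                # human-readable output
    ]


def describe_event(event_id):
    """Return a short hint for a known disk event ID."""
    return DISK_EVENT_HINTS.get(event_id, "no hint available")
```

On the server itself, the returned list could be passed to subprocess.run(cmd, capture_output=True, text=True) and the output reviewed for the event IDs listed above.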

Step 2: Check for file system corruption and inconsistencies

The good old chkdsk command was carried over to Windows from MS-DOS, and for good reason: it’s a very important tool. The file system may become corrupt for many different reasons; the most common ones are a sudden loss of power, blue screens, and driver bugs. To check file system consistency without downtime, use:

chkdsk C:

Run without options, chkdsk only reads the volume and reports what it finds, so the volume stays online. If you suspect the disk may have bad sectors, use the following command instead, but note that it requires the volume to be taken offline. For the system boot disk, the scan runs before Windows completes booting:

chkdsk D: /b

The /b parameter tells chkdsk to scan and test every single disk sector. Note that the scan may report no errors even if there were bad sectors, because the disk applies its self-repair mechanisms whenever sectors are accessed. Scanning every sector forces the disk to check each one, which does not happen in normal operation. Therefore, even a successful /b scan with no bad sectors reported may hide the fact that some sectors were bad and were safely replaced by the drive internally. Modern drives ship with spare space to accommodate a certain number of bad sectors. When the spares are all used up, however, the drive starts reporting the issue to the operating system as a read or write error; hence the need to check the Event Viewer logs.
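When chkdsk is run from a script, its exit code indicates the outcome. The following is a small Python sketch that translates the documented exit codes into messages; the function names are hypothetical, and the wording of the messages is paraphrased.

```python
# Exit codes returned by chkdsk, paraphrased from the command's documentation.
CHKDSK_EXIT_CODES = {
    0: "No errors were found.",
    1: "Errors were found and fixed.",
    2: "Disk cleanup was performed, or was skipped because /f was not specified.",
    3: "The disk could not be checked, or errors were left unfixed "
       "because /f was not specified.",
}


def interpret_chkdsk_exit_code(code):
    """Return a readable message for a chkdsk exit code."""
    return CHKDSK_EXIT_CODES.get(code, "Unknown exit code %d" % code)


def needs_attention(code):
    """Anything other than a clean result deserves a closer look."""
    return code != 0
```

A scheduled job could, for example, run chkdsk through subprocess.run, pass the returncode to interpret_chkdsk_exit_code, and send an alert whenever needs_attention returns True.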

Step 3: Check internal hard drive error reports

A simple way to check all drives is to run this command:

wmic diskdrive get status

However, you will likely find it a little too primitive, as it only shows “OK” for each drive listed without any additional information. A much more detailed report can be obtained through the disk’s SMART mechanism.
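The status check becomes a little more useful when the model and serial number are requested as well, for example with wmic diskdrive get model,serialnumber,status /format:csv. The following Python sketch filters such CSV output for drives whose status is anything other than OK. The function name is hypothetical, and it assumes the output has already been captured as text with a header row containing Model, SerialNumber, and Status.

```python
import csv
import io


def find_unhealthy_drives(wmic_csv_text):
    """Return (model, serial, status) for every drive whose status is not OK."""
    # wmic pads its CSV output with blank lines, so drop those first.
    lines = [line for line in wmic_csv_text.splitlines() if line.strip()]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    unhealthy = []
    for row in reader:
        status = (row.get("Status") or "").strip()
        if status.upper() != "OK":
            model = (row.get("Model") or "").strip()
            serial = (row.get("SerialNumber") or "").strip()
            unhealthy.append((model, serial, status))
    return unhealthy
```

Because the columns are looked up by header name, the order in which wmic prints them does not matter.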

Which hard drive is it?

Before we dig further, we need to know on which disk each partition is stored. Using diskpart.exe or, if you have a full user interface, Windows Disk Management, you can check which partitions are stored on which drives. Ideally you will have tagged each disk in your server with its serial number so you know which one to pull out if need be.

Here’s an example. We run diskpart, list the disks, select disk 0, and then look up its details:

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online         3726 GB      0 B        *
  Disk 1    Online          447 GB      0 B
  Disk 2    Online          238 GB  1024 KB        *
  Disk 3    Online         2794 GB      0 B        *
  Disk 4    Online         7452 GB  1024 KB        *
  Disk 5    Online         7452 GB      0 B        *

DISKPART> select disk 0

Disk 0 is now the selected disk.

DISKPART> detail disk

TOSHIBA MG03ACA400 ATA Device
Disk ID: {0820091B-E651-405F-8CF8-F87426A34014}
Type   : SATA
Status : Online
Path   : 0
Target : 0
LUN ID : 0
Location Path : PCIROOT(0)#PCI(1100)#ATA(C00T00L00)
Current Read-only State : No
Read-only  : No
Boot Disk  : No
Pagefile Disk  : No
Hibernation File Disk  : No
Crashdump Disk  : No
Clustered Disk  : No

  Volume ###  Ltr  Label        Fs     Type        Size     Status     Info
  ----------  ---  -----------  -----  ----------  -------  ---------  --------
  Volume 0     X   Toshiba4TB   NTFS   Partition   3725 GB  Healthy

The drive letter X and volume label show up in the disk details, so we know which partition is on the disk. What is not so good about diskpart is that it doesn’t tell us the disk’s serial number. If you happen to have many “TOSHIBA MG03ACA400” drives in your server, which is common practice, especially for RAID setups, it’s going to be difficult to narrow down the affected drive.
That’s why we typically recommend using the Disk Information screen in BackupChain instead, which shows all the relevant disk information on one screen, including the disk’s serial number.

Working with smartmontools

A useful and free tool for this purpose is smartmontools. It also works on Core and Hyper-V installations of Windows Server without a user interface, straight from the command line and without any dependencies. It’s a simple install; for help, issue this command:

smartctl.exe -h

Below is a sample report for hard disk drive #9. The device name is sdj because it’s the 10th drive, counting from 0. The letters map to disk numbers as follows:

abcdefghij
0123456789

Note that disk numbering starts at 0 in Windows, hence disk #0 is sda. To get the report for drive #9, we pass /dev/sdj as the device parameter, as shown below.
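The letter mapping described above can be expressed in a few lines of Python. This is a sketch with a hypothetical function name, and it deliberately covers only disks 0 through 25, the range that fits a single letter.

```python
import string


def smartctl_device_for_disk(disk_number):
    """Map a Windows disk number (0-25) to the matching /dev/sdX name."""
    if not 0 <= disk_number <= 25:
        # Higher disk numbers need a different naming form, not covered here.
        raise ValueError("only disk numbers 0 through 25 are supported")
    return "/dev/sd" + string.ascii_lowercase[disk_number]
```

For example, smartctl_device_for_disk(0) gives /dev/sda and smartctl_device_for_disk(9) gives /dev/sdj.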

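For a quick look at a drive’s health, the most useful items in the report are the overall-health result, the Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable attributes, and the error log: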

C:\Program Files\smartmontools\bin>smartctl.exe -a /dev/sdj
smartctl 6.6 2017-11-05 r4594 [x86_64-w64-mingw32-2016-1607] (sf-6.6-1)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST14000NM001G-2KJ103
Serial Number:    ZL2H****
LU WWN Device Id: 5 000c50 0db6cb3f3
Firmware Version: SN03
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Sep 07 14:51:45 2021 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1234) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   044    Pre-fail  Always       -       160452326
  3 Spin_Up_Time            0x0003   091   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   045    Pre-fail  Always       -       33400174
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       532
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   045   040    Old_age   Always       -       33 (Min/Max 33/34)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       176
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       268 (195 211 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       74804549797
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2556893914

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       198         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

If the drive has logged errors at some point in the past, they will be listed in the above report. In that case, it’s best to replace the drive well before it fails.
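Reading a report of this length by eye gets tedious when there are many drives. The following Python sketch pulls the overall-health result and a few sector-related attributes out of the smartctl -a text and lists anything suspicious. It assumes the attribute table has the same column layout as the sample above; the choice of watched attributes and the function names are illustrative assumptions.

```python
import re

# Attributes whose raw value should normally stay at zero.
# Assumption: an illustrative selection, not a vendor recommendation.
WATCHED_IDS = (5, 187, 197, 198)

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
_ATTRIBUTE_LINE = re.compile(
    r"\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_report(text):
    """Return (overall_health, {attribute_id: (name, raw_value)})."""
    match = re.search(
        r"overall-health self-assessment test result:\s*(\S+)", text)
    health = match.group(1) if match else None
    attributes = {}
    for line in text.splitlines():
        m = _ATTRIBUTE_LINE.match(line)
        if m:
            attributes[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return health, attributes


def warning_signs(text):
    """List findings that suggest the drive deserves a closer look."""
    health, attributes = parse_smart_report(text)
    findings = []
    if health is not None and health != "PASSED":
        findings.append("overall health is %s" % health)
    for attribute_id in WATCHED_IDS:
        if attribute_id in attributes:
            name, raw = attributes[attribute_id]
            if raw > 0:
                findings.append("%s raw value is %d" % (name, raw))
    return findings
```

For the sample report above, warning_signs would return an empty list, since the drive passed and all watched raw values are zero.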

Step 4: Run Internal Disk Tests Without Downtime

Unlike a full chkdsk surface scan, the drive’s internal firmware allows short and long tests to be run while the disk is in use. To run a test, use one of these commands:

smartctl -t short /dev/sda
smartctl -t long /dev/sda

With the ‘short’ parameter, the command runs a very quick self-test without downtime. The ‘long’ parameter runs a much more comprehensive test. This is also the test that hard drive manufacturers usually ask you to run if you suspect a disk issue and want to send the drive back for a replacement. Note that the above example uses /dev/sda, which means we want drive #0 tested.

A long scan could take many hours to finish, but it won’t interrupt the services running on your server. To check on the scan, run this command; it reports what percentage of the test remains and whether any errors were found:

smartctl -a /dev/sda

Note that if the disk is in seriously bad shape, the scan may cause the disk to fail while it’s being scanned. Sometimes the controller may freeze up and lose connectivity to the drive. In that case a reboot will be needed, and you should immediately recover any files that are still recoverable and replace the drive as soon as possible.
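Checking on a self-test can be scripted by reading the self-test log section of the smartctl -a output. The following is a minimal Python sketch, assuming the log entries have the same layout as in the sample report shown earlier; the function names are hypothetical.

```python
import re

# Matches one entry of the "SMART Self-test log" table, for example:
# "# 1  Extended offline    Completed without error       00%       198         -"
_LOG_ENTRY = re.compile(
    r"^#[ \t]*(\d+)[ \t]+(.+?)[ \t]{2,}(.+?)[ \t]+(\d+)%"
    r"[ \t]+(\d+)[ \t]+(\S+)$"
)


def parse_self_test_log(report_text):
    """Return the self-test log entries found in smartctl -a output."""
    entries = []
    for line in report_text.splitlines():
        m = _LOG_ENTRY.match(line.strip())
        if not m:
            continue
        num, test, status, remaining, hours, lba = m.groups()
        entries.append({
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_percent": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": None if lba == "-" else lba,
        })
    return entries


def last_test_ok(report_text):
    """True if the most recent logged test (entry # 1) completed cleanly."""
    entries = parse_self_test_log(report_text)
    return bool(entries) and entries[0]["status"] == "Completed without error"
```

If last_test_ok returns False after a long test, the status and first_error_lba fields of the first entry show what went wrong and where.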

Other Hints

In the case of mechanical drives, if you have physical access to the server, you can sometimes spot a disk problem by listening. When a drive hits a bad sector, it goes into a cycle where it retries reading the sector many times over, and each retry can produce an audible click. These repeated clicks may indicate that the disk is about to fail, so it makes sense to run the above scans in that case.

 

Don’t Wait for Hard Drives to Fail, Prevention is Always Better

Download BackupChain today and use the fully functional trial to take all backup features for a test drive. If you suspect a drive may be failing, we don’t recommend taking disk images, as they may place too much mechanical stress on the drive; file-level backup is generally the better option. Apart from file server backup, disk imaging, and disk cloning, BackupChain offers a wide range of backup features, such as Virtual Machine Backup, VMware Backup, Hyper-V Backup, VirtualBox Backup, and Windows Server Backup.
