Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults
Written by Dong Tang, Peter Carruthers, Zuheir Totari, Michael Shapiro
Thursday, 01 December 2005 17:00
IEEE 11th Pacific Rim International Symposium on Dependable Computing (PRDC 2005), December, 2005. [Slides]
One provision of the Solaris operating system's fault management architecture is automatic memory page retirement (MPR), which is intended to reduce the impact on system reliability, availability, and serviceability (RAS) of permanent memory faults that generate either correctable or uncorrectable errors. MPR allows memory pages suffering from correctable errors to be removed from usage pools without interrupting user applications running on the system. It also allows memory pages suffering from uncorrectable errors to be isolated from use, limiting the impact to the affected user processes and avoiding an outage of the entire system. This study applies analytical models, with parameters calibrated by field experience, to quantify how much this operating-system-level self-healing technique can reduce system interruptions, yearly downtime, and the number of service actions caused by permanent hardware faults in typical low-end and midrange server systems. The results show that deploying MPR yields significant improvements in all three of these system RAS metrics.
Last Updated on Friday, 01 September 2006 06:42