Handling System Aborts and System Failures

Computer system interruptions, while rare, are unfortunately inevitable. It really isn’t a question of whether you will have one, but when.

For our purposes here, system interruptions consist of system aborts or failures, system hangs, and hardware failures. My goal is to give you general guidelines and procedures for handling such interruptions. You may already have written procedures in place; if so, I hope the procedures described here can extend your documentation and lead to smoother transitions from a down system to one that is usable and stable. I will begin by defining each type of interruption.

A system abort or system failure is MPE’s defense mechanism. When the operating system determines that an unexpected and uncorrectable event has occurred that may cause data integrity problems, it aborts the system to protect itself. A message appears on the console with the abort or failure number, all system activity is halted, and a red halt light should appear on the front of the system, although this varies depending on the model of HP 3000 you are running. Furthermore, on an MPE/iX system there should be a rolling hex display on the console. A memory dump can tell which program, process, session, or job was executing when the system aborted.

A system hang occurs when the machine appears to be running and the green run light is still lit, but you cannot access it. This can mean that a process is consuming all CPU time, leaving none available even for high-priority system processes. Another type of hang occurs when the entire system is paused, waiting for some resource to become available. These resources can range from system buffers to console messages to I/O on a disk drive that is offline. Hung systems are the most difficult problems to diagnose.

Hardware failures can be categorized into two groups. Some failures in hardware contained in the system, such as device adapters and controllers, are detected by the CPU itself (not MPE) and will immediately halt the system. You may have seen these with such messages as “WCS Parity Error” or “Machine Check,” or simply a rolling hex display on the console that includes “FLT DEAD”. These types of errors should be reported to your hardware support supplier. Other times, hardware failures are not detected at the CPU level, but MPE detects the resulting corruption and causes a system abort (see above).
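
The three definitions above can be summarized as a small triage sketch. This is only an illustration, not a diagnostic tool: the function and type names are hypothetical, the hardware strings are the ones quoted above, and it assumes that the console text of an abort contains the words "system abort" or "system failure".

```python
from enum import Enum


class Interruption(Enum):
    HARDWARE_FAILURE = "hardware failure detected by the CPU"
    SYSTEM_ABORT = "system abort or system failure"
    SYSTEM_HANG = "system hang"
    UNKNOWN = "unknown"


# Console strings quoted in the text for CPU-detected hardware faults.
HARDWARE_MARKERS = ("WCS PARITY ERROR", "MACHINE CHECK", "FLT DEAD")
# Assumption: an abort message on the console contains one of these phrases.
ABORT_MARKERS = ("SYSTEM ABORT", "SYSTEM FAILURE")


def classify(console_text: str, run_light_lit: bool, responsive: bool) -> Interruption:
    """Guess the interruption type from what the operator can observe."""
    text = console_text.upper()
    if any(marker in text for marker in HARDWARE_MARKERS):
        return Interruption.HARDWARE_FAILURE
    if any(marker in text for marker in ABORT_MARKERS):
        return Interruption.SYSTEM_ABORT
    # Run light still lit but the system cannot be accessed: treat as a hang.
    if run_light_lit and not responsive:
        return Interruption.SYSTEM_HANG
    return Interruption.UNKNOWN
```

Note that a hardware fault that MPE catches (rather than the CPU) shows up here as a system abort, which matches the second group described above.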

What should you do if you experience a system interruption? First, don’t panic. Most system aborts are isolated events that do not recur and can be ignored. A motto I frequently use, “Once is a fluke, twice is a trend,” applies to system aborts. If, however, you have a rash of system aborts, they certainly should not be ignored.

When you call to report a system interruption you should be prepared to answer the following questions:

What is the system abort message or system failure number?
Has this situation occurred recently or is this the first time?
What values are displayed on the hex status on either the console or the system status panel?
Are there any unusual messages on the console?
What lights are lit on the system?
Have you applied any patches recently?
Has any new hardware been installed or replaced?
Have you installed any new applications or utilities lately?
Are any of the disk drives offline?
Can you get a control-A or control-B prompt on the console?
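
If you keep your own incident log, the questions above map naturally onto a simple record. The sketch below is one possible layout, not a required format; the class and field names are hypothetical. Each field starts out unanswered so you can see at a glance what still needs to be collected before you call.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class InterruptionReport:
    """One field per question on the checklist; None means not yet answered."""
    abort_or_failure_number: Optional[str] = None
    occurred_before: Optional[bool] = None
    hex_status_values: Optional[str] = None
    unusual_console_messages: Optional[str] = None
    lights_lit: Optional[str] = None
    recent_patches: Optional[bool] = None
    hardware_installed_or_replaced: Optional[bool] = None
    new_applications_or_utilities: Optional[bool] = None
    disk_drives_offline: Optional[bool] = None
    console_prompt_available: Optional[bool] = None

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical example: only two questions answered so far.
report = InterruptionReport(lights_lit="red halt light", disk_drives_offline=False)
missing = report.unanswered()
```

An answer of "no" (False) counts as answered; only fields left as None are reported as missing.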

Depending on the answers to the above questions, it may be suggested that you take a memory dump. A memory dump writes the system state to one or more tapes and is often the only way to get to the root of a problem. Depending on the model and the amount of memory in the system, this process can take from 15 to 45 minutes, which may weigh heavily in the decision. Can you afford for your system to be down for that length of time for a possible one-time occurrence?

If a memory dump is deemed necessary, the following procedure can be followed at the system console:

Press ctrl-B to obtain a CM> prompt.
Enter “TC”. Be sure not to enter “RS” or the memory dump will not be valid.
If you have autoboot enabled you must interrupt the autoboot process to take a memory dump.
Boot from the primary boot path and answer “Y” to “Interact with IPL”.
Next, mount a write-enabled scratch tape on the tape drive that is configured as the alternate boot device. This would be the same drive you load an SLT from.
At the ISL prompt, type DUMP. You will be prompted for a freeform identification string for the dump. This is simply a comment. A good comment would include your company name and the date and time.
After the dump is complete, the system will automatically reboot again. This time you can bring your system up normally. If you have waiting batch jobs that would be difficult to resubmit, you will want to perform a START RECOVERY; otherwise, a START NORECOVERY would be the best method for rebooting.
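
For sites that keep a printed or on-line run book, the procedure above can be captured as a short checklist generator. This is a minimal sketch that restates the steps as written; the function names and the layout of the identification string are my own assumptions, and nothing here talks to the console.

```python
from datetime import datetime
from typing import List


def dump_steps(autoboot_enabled: bool) -> List[str]:
    """Return the console steps for taking a memory dump, in order."""
    steps = [
        "Press ctrl-B to obtain a CM> prompt",
        "Enter TC (do not enter RS, or the dump will not be valid)",
    ]
    if autoboot_enabled:
        steps.append("Interrupt the autoboot process")
    steps += [
        'Boot from the primary boot path and answer Y to "Interact with IPL"',
        "Mount a write-enabled scratch tape on the alternate boot device",
        "At the ISL prompt, type DUMP and enter the identification string",
    ]
    return steps


def dump_id(company: str, when: datetime) -> str:
    """Build a dump comment from the company name plus date and time."""
    return f"{company} {when:%Y-%m-%d %H:%M}"


def start_command(jobs_hard_to_resubmit: bool) -> str:
    """Pick the start option to use after the dump completes."""
    return "START RECOVERY" if jobs_hard_to_resubmit else "START NORECOVERY"
```

For example, `dump_id("ACME", datetime(2001, 3, 4, 5, 6))` returns `"ACME 2001-03-04 05:06"`, where "ACME" stands in for your own company name.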

Reading a memory dump can be a fairly CPU-intensive process and may have a negative impact on your system. In most cases we will ask you to send us the dump via an overnight delivery service so we can view it on a system in our lab. We have found this to be the most effective method for reading memory dumps.

While HP 3000 hardware is extremely durable and the MPE operating system is probably the most reliable and stable in the business, failures are inescapable. Please make sure you have an action plan to follow when one occurs.