Using the Standalone Analysis Tool (SAT)
All too often we get a support call or an email about a system interruption with very scant details: a string of Bxxx DEAD hex codes, or a system abort number with subsystem and status codes. The worst case is a system hang with no status codes at all. The severity of the hang may range from a running system that no one can access to a total eclipse, where even a CTRL-B is ignored. The general consensus is, ‘I won’t take the time for a DUMP until it happens again’. This stance is acceptable if a reasonable amount of time passes before the next event, but it can come back to bite you if ‘the problem’ starts to recur with a vengeance. At that point, even a small amount of data from the first event would have been very valuable. So the question is, what are the alternatives when:
- Time is of the essence, and
- There is too much data to collect manually, given the first issue.
The HP3000 platform has a powerful combination to address these issues: the remote console facility and the Standalone Analysis Tool (SAT). This article focuses on using the remote console and SAT as first-line diagnostic tools. For a more comprehensive treatment of how to handle unexpected system interruptions, you are encouraged to review the article titled “Handling System Aborts and System Failures”.
The remote console facility has two major features. First, as its name implies, it allows console control of the system remotely. Second, when used with a terminal emulator, it allows large amounts of data to be collected in a short amount of time. Please contact us if you need help configuring remote console access on your system.
The SAT commands to enter, one trace per CPU (three CPUs in this example), are:
cpu 0 ; tr,i,d
cpu 1 ; tr,i,d
cpu 2 ; tr,i,d
exit
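The command list above follows a simple pattern: one `cpu N ; tr,i,d` line per processor, then `exit`. If you keep a cheat sheet for several systems with different CPU counts, a small script can generate it. The following is only a sketch; the function name `sat_trace_commands` is our own invention, and it assumes CPUs are numbered from 0 upward as in the example session.

```python
def sat_trace_commands(cpu_count):
    """Return the SAT command lines that trace every CPU, then leave SAT.

    Assumption: CPUs are numbered 0..cpu_count-1, as in the sample session.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # One "select CPU, then stack trace" line per processor.
    commands = [f"cpu {n} ; tr,i,d" for n in range(cpu_count)]
    commands.append("exit")
    return commands


if __name__ == "__main__":
    # Print the list for a three-CPU system, ready to paste or print out.
    print("\n".join(sat_trace_commands(3)))
```

The output can be kept next to the console or pasted into the terminal emulator one line at a time at the nmsat prompt.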
After a system abort, the console shows in inverse video:
SYSTEM ABORT 0 FROM SUBSYSTEM 0
SYSTEM HALT 7, $0000
With Status Codes of:
FLT DEAD FLT B907 FLT 0100
At this point, press CTRL-B. (On 9x9 systems, ensure that the key is in the SERVICE position; on 9x8 systems, verify that the toggle switch on the back of the system is in the SERVICE position.)
CM> tc
Wait for the system to reset. Interrupt the AUTOBOOT process if necessary by pressing any key within 10 seconds when prompted.
Main Menu: Enter command or menu > boot pri
Interact with IPL (Y or N)?>
ISL> sat
...
Processor: 00 HPA: fffa0000 IVA: 00148000 Config: TRUE
Processor: 01 HPA: fffa2000 IVA: 00970000 Config: TRUE
Processor: 02 HPA: fffa4000 IVA: 00976000 Config: TRUE
Current CPU: 0 Original CPU: 0 Monarch CPU: 0 MP array at: c8000
Main memory: 80000000
HPDIR table: 1000000, len 1000000 HPDIROF table: 2000000, len 800000
IPDIR table: 2822000
RGLOB: 0 ICS: 0
Last PIN: 5 ON ICS DISP running
Processing dumpworthy file NMLOGMON.PUB.SYS, SID $128, #946 symbols...Done
Processing dumpworthy file NMCONSOL.PUB.SYS, SID $129, #456 symbols...Done
$2 ($0) nmsat > cpu 0 ; tr,i,d
PC=a.00178648 idle_disable_int+$8
NM* 0) SP=81e41390 RP=a.002c5c44 dispatcher+$790
NM 1) SP=81e41390 RP=a.00177800 iexit
--- Interrupt Marker
(end of NM stack)
$4 ($0) nmsat > cpu 1 ; tr,i,d
PC=a.00178648 idle_disable_int+$8
NM* 0) SP=81fd0390 RP=a.002c5c44 dispatcher+$790
NM 1) SP=81fd0390 RP=a.00177800 iexit
--- Interrupt Marker
(end of NM stack)
$6 ($0) nmsat > cpu 2 ; tr,i,d
PC=a.0018304c system_abort
NM* 0) SP=41855428 RP=a.00182dd8 ?system_abort+$8
export stub: a.008b61a0 make_pcall_from_debug+$330
NM 1) SP=41855428 RP=a.007a5d2c func_code_eval+$998
NM 2) SP=41855128 RP=a.007aa868 func_evaluate+$594
NM 3) SP=41854ee8 RP=a.007c7dd0 operand_search+$6a4
NM 4) SP=41852428 RP=a.007cada8 getvalue+$558
NM 5) SP=4184c7e8 RP=a.007c9e20 getaddrvalue+$d0
NM 6) SP=4184b6a8 RP=a.007c9900 getfactor+$c4
NM 7) SP=4184a568 RP=a.007c93cc getterm+$c0
NM 8) SP=41848c28 RP=a.007c8eb0 getsimpexpr+$1b0
NM 9) SP=41847ae8 RP=a.007c8934 getanyexpression+$f0
NM a) SP=418469a8 RP=a.007a28b0 scn_calc+$44
NM b) SP=41845068 RP=a.00726ca0 do_the_command+$104
NM c) SP=41844768 RP=a.00727b94 secondary_cmd_loop+$204
NM d) SP=41844668 RP=a.007280f8 main_cmd_loop+$98
NM e) SP=418443e8 RP=a.010465c4 nm_debug+$ca0
NM f) SP=41843ae8 RP=a.00e2d098 dbg_tell_the_owner+$36c
NM 10) SP=418438a8 RP=a.001ae65c dbg_break_handler+$688
NM 11) SP=418436e8 RP=a.0036c858 hpe_debug+$504
NM 12) SP=418435a8 RP=a.00383cd4 recovery_counter+$5c
NM 13) SP=41843468 RP=a.0013b038 hpe_interrupt_marker_stub
--- Interrupt Marker
NM 1) SP=418433e8 RP=a.00a5a064 hxdebug+$e4
--- End Interrupt Marker Frame ---
$a ($2f) nmsat > exit
ISL> start norecovery
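Once a session like the one above has been captured to a text file by the terminal emulator, it can be split into one stack trace per CPU for a quick first look. The sketch below is an illustration only: the name `parse_sat_capture` is hypothetical, and the patterns are inferred solely from the sample output in this article, so other SAT versions or commands may need different expressions. CPU numbers are kept as text because we make no assumption about their radix.

```python
import re

# Patterns inferred from the sample session above (assumptions, not a SAT spec).
PROMPT = re.compile(r"nmsat\s*>\s*(.*)$")        # "$2 ($0) nmsat > cpu 0 ; tr,i,d"
CPU_CMD = re.compile(r"^cpu\s+(\w+)\s*;")        # the "cpu N ;" part of a command
PC_LINE = re.compile(r"^PC=(\S+)\s+(\S+)")       # "PC=a.0018304c system_abort"
FRAME = re.compile(r"^NM\*?\s+[0-9a-f]+\)\s+SP=\S+\s+RP=\S+\s+(\S+)")


def parse_sat_capture(text):
    """Map each CPU id (as text) to its PC routine and list of frame routines."""
    traces = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        prompt = PROMPT.search(line)
        if prompt:
            # A new command starts; only "cpu N ; ..." commands open a trace.
            cmd = CPU_CMD.match(prompt.group(1).strip())
            current = cmd.group(1) if cmd else None
            if current is not None:
                traces[current] = {"pc": None, "frames": []}
            continue
        if current is None:
            continue
        pc = PC_LINE.match(line)
        if pc:
            traces[current]["pc"] = pc.group(2)
            continue
        frame = FRAME.match(line)
        if frame:
            traces[current]["frames"].append(frame.group(1))
    return traces
```

Calling `parse_sat_capture(open("capture.txt").read())` (with `capture.txt` standing in for your own capture file) returns a dictionary keyed by CPU, which makes it easy to see at a glance which processor was executing `system_abort`.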
Gathering information from SAT in this manner takes only a few minutes and is surely worth the effort. By comparison, a memory dump can take an hour or more, depending largely on the amount of memory in the system, the number and type of disks in the system volume set that contain virtual memory, and the speed of the tape drive to which memory is dumped. As the sample output above shows, there is far too much information here to write down by hand. Clearly, connecting to the remote console (or using telnet to reach the GSP on an A-class or N-class system) via a terminal emulator is the way to go. You can capture the entire screen contents and email them to us, both to record the event and for analysis. Alternatively, we can log on to the remote console ourselves and collect the data directly.
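When emailing a capture, it helps to put the abort details in the message itself. The sketch below pulls the abort number, subsystem, halt code, and status codes out of captured console text. It is only an illustration: `parse_abort_banner` is a name we made up, and the patterns assume the banner looks like the SYSTEM ABORT example shown earlier in this article, with each status code being four hex digits after `FLT`.

```python
import re


def parse_abort_banner(text):
    """Extract system abort details from captured console text.

    Assumption: the banner matches the layout of the example in this
    article; fields that are not found are returned as None.
    """
    abort = re.search(r"SYSTEM ABORT\s+(\w+)\s+FROM SUBSYSTEM\s+(\w+)", text)
    halt = re.search(r"SYSTEM HALT\s+(\w+),\s*(\S+)", text)
    # Each status code is assumed to be four hex digits following "FLT".
    codes = re.findall(r"\bFLT\s+([0-9A-Fa-f]{4})\b", text)
    return {
        "abort": abort.group(1) if abort else None,
        "subsystem": abort.group(2) if abort else None,
        "halt": halt.group(1) if halt else None,
        "halt_detail": halt.group(2) if halt else None,
        "status_codes": codes,
    }
```

The values are returned as text exactly as they appear on the console, so nothing is lost or reinterpreted before the capture reaches whoever analyzes it.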