| SciCo | DAQ Ace | Monitoring Ace | CO | (Operations Manager) |
|---|---|---|---|---|
| Rei Tanaka | Ian Vollrath | Andrew Ivanov | Guenakh Mitselmakher | Mary Convery |
Start of Shift Notes: Inherit Store #3275. ODH alarm now being investigated by Steve and Cryo tech.
Sat Mar 6 16:40:29
first got busy timeout from svx06:
(MLE) b0dap73.fnal.gov:Thread-27:4:34:41 PM->Busy Timeout: VRB_ISL_06
(MLE) b0dap73.fnal.gov:Thread-27:4:34:41 PM->Requested Halt-Recover-Run issued
[errmon] CT: 2004.03.06 16:34:41 34'41" 1 crate/s: b0svx06(16), busy.[RXPT]
(MLE) b0svx06:Messenger:4:34:36 PM->Silicon Timeout:BUSY- Slots: 08:fa00 10:fa20 12:fa40 16:f800 18:f820 20:f840
which was followed by eb vrb error:
(RC) 16:34:45 Halt -> HALTED
CT: 2004.03.06 16:34:50 34'50" 1 crate/s: b0svx06(16), busy.[RXPT]
b0l3pcom1.fnal.gov:main:4:34:48 PM->Host b0eb16.fnal.gov, task tRec_0 SCPU-P1-E-CantResetVrb: Reset of VRB in slot 10 failed. VRB_BUS_ERROR: Alignment or VME bus error.
--> Additional Information: Attention!!!. Event Builder SCPU_CANT_RESET_VRB Error !!! THE BAD VRB CRATE is: b0eb16 THE BAD VRB MODULE is in SLOT: 10
reset vrb. back running.- ian :: (run 179683)
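
When many such busy timeouts pile up, the slot list in the Silicon Timeout line can be pulled out mechanically. This is a minimal sketch, assuming the `Slots:` field is always a whitespace-separated list of `slot:hexword` pairs as in the one message quoted above; `parse_busy_slots` and `SLOTS_RE` are hypothetical names, not part of any DAQ tool, and the hex words are not interpreted.

```python
import re

# Hypothetical pattern, inferred from the single message quoted in this log:
# "Slots:" followed by one or more " NN:hhhh" tokens.
SLOTS_RE = re.compile(r"Slots:((?:\s+\d{2}:[0-9a-fA-F]{4})+)")


def parse_busy_slots(message: str) -> list[tuple[int, int]]:
    """Return (slot number, status word) pairs from a Silicon Timeout message."""
    match = SLOTS_RE.search(message)
    if match is None:
        return []  # no slot list present in this message
    pairs = []
    for token in match.group(1).split():
        slot, word = token.split(":")
        pairs.append((int(slot), int(word, 16)))
    return pairs


msg = (
    "(MLE) b0svx06:Messenger:4:34:36 PM->Silicon Timeout:BUSY- "
    "Slots: 08:fa00 10:fa20 12:fa40 16:f800 18:f820 20:f840"
)
for slot, word in parse_busy_slots(msg):
    print(f"slot {slot:2d}: 0x{word:04x}")
```

Messages without a `Slots:` field give an empty list, so the helper can be run over a whole error log without special-casing other message types.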
got error:
(RC) 16:40:35 Halt -> HALTED
(MLE) b0l2de00:SpyAlpha:4:40:38 PM-> L1Mon: saw 210 L1 DMA transfers, expect 1 (buffer number 0) L1Mon: Dumping data for 1 word.
hrr worked.- ian :: (run 179683)
again
WHA has a high HV bar at about 120% and has yellow color.- Andrew
All PMTs are green though. Not clear where the problem is.
Lost ICICLE heartbeat. Restarted ICICLE. It has still been purple on iFix for about 10 min now.- Andrew
if either of these two conditions is not met, page silicon SPL immediately! - rainer
-- Sat Mar 6 18:24:34 comment by...Andrew --
Both conditions are met. Cooling is functional and the power has been cut.
It all started with the ODH alarm at the end of my shift. "ODH alarm in col hall" appeared on the status panel, but no FIRUS message was issued. On opening the "ODH C-Hall" display under the "Misc. Menu" in iFIX, I found sensor NE_B was reading out -2.6% O2; it had obviously broken, maybe power was no longer going to the sensor or the signal connection was bad.
Steve Gordon and I looked into this, and I also talked to Dervin a couple of times. Ideally, we would have bypassed the sensor in alarm with a Burndy bypass "key", but after much searching we concluded all bypass keys were already in use. We tried to use the chassis bypass key for the alarm chassis that did collision hall ODH alarms, but this only silenced the alarm and did not keep the HVAC system from going into purge mode (instead of HVAC Full). All we could do was what we had already done, which was to bypass the entire ODH system, both in the collision hall and the assembly building (not the clean rooms).
This did not strike me as a good solution safety-wise, so I continued to investigate ways to bypass only the one sensor. Finally, I tried disabling the sensor readout in 4mation. Bad move! Since the ODH sensors are in Quadlog A, any tampering results in the entire Quadlog shutting down.
FIRUS came up with two alarms: flammable gas in the gas sheds, and a fire alarm in the collision hall. Strangely, the action in B0 was that of a flammable gas condition in the collision hall--that is, all detector power and HV were turned off, and the solenoid went into slow dump. Perhaps this is explained by ODH being bypassed. I explained the false alarm to the fire chief when the fire trucks arrived (I had also called dispatch immediately but they had already dispatched). Now we are still in the process of recovery--most everything is on at this point, but Jim Humbert is having problems bringing up the solenoid.
I've explained to Jim, and I'll make sure he passes it on to new crews, that since the whole ODH system is still bypassed, we have to consider the process system techs as keeping a fire watch on the ODH sensors until we're able to bypass the bad sensor.
- Steve Hahn
| CES wire plots and CPR wire plots are "hotter" than reference plots |
| CES wire plot, seems a bit "hot" |
run control tip of the day: if you can't see the error messages when you shepherd a crate, open an xterm and type: kdestroy kticket- ian
CEM CHA WHA HV problems:
When we first brought up the Gammas and I restarted PISABOX, all three of these were yellow. Now after a second pass of readouts, they all appear to be slightly better--CEM is green, and the bad channels in CHA WHA are slightly less out of tolerance. All the remaining problems now are in two Gammas--the NE CHA and WHA 1. Of course, the Gamma HV values are perfect. I've recommended to Mary that we leave this for the moment and see if it continues to get better, but we should call the calorimeter HV pager if it does not get better by tomorrow morning.
- Steve Hahn
have a FISION read/write error from b0cot01 slot 17. tried shepherding -> it failed. tried reboot from vxworks -> failed. also having problems coldstarting b0cot16. still have some time to work on it since solenoid is still hosed. waiting for response from daq pager carrier.- ian
rather tried "reset" from vxworks -> failed.
VME BERR received!
*** Error reading slot 17 channel 26 calibration reg ***
*** In slot 17, found DspVer=255, expected 37
FISION: errno = 0x5 (0x5): VME BERR received!
*** Error writing slot 17 channel 27 calibration reg ***
FISION: errno = 0x5 (0x5): VME BERR received!
*** Error reading slot 17 channel 27 calibration reg ***
*** In slot 17, found DspVer=255, expected 37
FISION: errno = 0x5 (0x5): VME BERR received!
(...)
ISION: errno = 0x5 (0x5): VME BERR received!
*** Error reading slot 17 channel 95 calibration reg ***
*** In slot 17, found DspVer=255, expected 37
Processing a COT configuration message
iteration = 0
flag20
flag21
DTH requested = 225 < 350 !!
Cannot configure readout - check readout list.
Readout process not started. Error status: 1
Enabling TAXI to send data to VRB
Error Processing Readout List !
Sent ack to Run Control: 1
ERROR ERROR Resetting card in slot 17 failed
Got error message through error_q
for cot16 get a bunch of these:
FISION: errno = 0x16 (0x16): architecture init. failure!
FISION: errno = 0x16 (0x16): VME controller not found (at 0xc1400000)!
FISION: errno = 0x16 (0x16): architecture init. failure!
bill badgett said he'd try looking into this. also paged tdc expert regarding cot01 slot 17 problem.
The Universe/Tundra VME/PCI interface on b0cot16 is not accessible from the MVME CPU. I suggest cycling the power on b0cot16. If this does not work, we will have to replace the MVME card.
ok. how do we power cycle?
mike lindgren is looking into this
tdc expert is working on cot01 slot 17
Power cycling COT crates individually can be done via one of the iFix PSM displays.
ok. mike reset it. now "VME controller not found" error doesn't appear upon coldstarting. only thing is now crate is "margarine" in local client gui but says "ready" in green when i check the status.
| Run Number | Data Type | Physics Table | Begin Time | End Time | Live Time | L1 Accepts | L2 Accepts | L3 Accepts | Live Lumi, nb-1 | GR | SC | RC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179683 x2BDE3 | BEAM | PHYSICS_2_02 [2,424,431] | 23:51:25 | 17:52:37 | 16:45:43 | 881,025,509 | 15,870,935 | 3,323,844 | 2029.546 | 1 | 1 | 1 |
| Totals | 23:55:02 | 16:45:43 | 881,025,509 | 15,870,935 | 3,323,844 | 2029.546 |
| Totals | |
|---|---|
| Date: | 2004.03.06 |
| Shift: | eve |
| Delivered luminosity: | 416.1 nb-1 |
| Acquired luminosity: | 120.3 nb-1 |
| Efficiency: | 28.9 % |
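
The efficiency entry is consistent with acquired luminosity divided by delivered luminosity, expressed in percent. A minimal sketch of that arithmetic, assuming this is how the summary value is derived; `data_taking_efficiency` is a hypothetical name used only for illustration:

```python
def data_taking_efficiency(acquired_nb: float, delivered_nb: float) -> float:
    """Return acquired / delivered luminosity as a percentage.

    Both arguments are integrated luminosities in nb^-1.
    """
    if delivered_nb <= 0:
        raise ValueError("delivered luminosity must be positive")
    return 100.0 * acquired_nb / delivered_nb


# Values taken from the shift totals table above.
eff = data_taking_efficiency(acquired_nb=120.3, delivered_nb=416.1)
print(f"Efficiency: {eff:.1f} %")
```
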
- Around 15:50 we got an ODH (Oxygen Deficiency Hazard) alarm in the collision hall due to a broken sensor located at North-East B.
- Trying to bypass this sensor by disabling the sensor readout in 4-mation caused the entire Quadlog to shut down.
- At around 17:41, a flammable gas error in B0 caused all detector power to go off and a solenoid slow dump. Fire trucks arrived.
- Recovery process since then. Silicon is safe. Solenoid is up again.

Still people are working on:
- CEM CHA WHA HV problems (should page tomorrow morning if in same condition).
- Shower max calibration.
- COT crate b0cot17.

Plan:
- Recover, do calibrations and then take more data!
- Store dump is foreseen Sunday morning or afternoon depending on p-bar stack. The p-bar stack is already 140E10 at 24:00, i.e. 7E10/hour stack rate.- Rei Tanaka
Comments on the current done-timeout problems:
b0wcal06: ADMEM in slot 18 is not responding to VME reads
to its Level 2 Buffers (but is not completely
unresponsive to VME accesses)
This is usually solved by re-downloading the
Et look-up table, which also controls the L2 buffers.
Please call ADMEM experts for further instructions.
b0cot17: There are systemic problems in this crate related
to inter-card synchronization. This usually
indicates a problem with the Tracer broadcasting
various DAQ signals on the backplane. Often a
Tracer reseat will help, but that's not an option
right now.
I'd suggest turning off the crate for some extended
period of time (say at least 15 minutes) and
try again; if this doesn't work an access will
be needed to massage or replace the Tracer - W.Badgett :: (run 179697)