2004 CDF E-Log -- Eve shift. Sat Mar 6, 2004
SciCo: Rei Tanaka
DAQ Ace: Ian Vollrath
Monitoring Ace: Andrew Ivanov
CO: Guenakh Mitselmakher
Operations Manager: Mary Convery


Start of Shift Notes:  

Inherited Store #3275. The ODH alarm is now being investigated by Steve and a Cryo tech.

Sat Mar 6 16:40:29
first got busy timeout from svx06: 

(MLE) b0dap73.fnal.gov:Thread-27:4:34:41 PM->Busy Timeout: VRB_ISL_06  
(MLE) b0dap73.fnal.gov:Thread-27:4:34:41 PM->Requested Halt-Recover-Run issued [errmon] 
CT: 2004.03.06 16:34:41 
34'41" 1 crate/s: b0svx06(16),  busy.[RXPT] 
(MLE) b0svx06:Messenger:4:34:36 PM->Silicon Timeout:BUSY- Slots:  08:fa00 10:fa20 12:fa40 16:f800
18:f820 20:f840 


which was followed by eb vrb error: 


(RC)   16:34:45 Halt -> HALTED 
CT: 2004.03.06 16:34:50 
34'50" 1 crate/s: b0svx06(16),  busy.[RXPT]
b0l3pcom1.fnal.gov:main:4:34:48 PM->Host b0eb16.fnal.gov, task tRec_0 

SCPU-P1-E-CantResetVrb: Reset of VRB in slot 10 failed. 
VRB_BUS_ERROR: Alignment or VME bus error. -->   
 Additional Information:  

 Attention!!!. Event Builder SCPU_CANT_RESET_VRB Error !!!  
 THE BAD VRB CRATE is: b0eb16  
 THE BAD VRB MODULE is in SLOT: 10   

reset vrb. back running.
 - ian :: (run 179683)
Sat Mar 6 16:41:17 CDF ODH (Oxygen) Alarm in Collision Hall. Inner North-East value is -2.6. It should be around 21. Steve Hahn is working on it. Ops Manager contacted. It looks like a false alarm due to a broken sensor. Cryo tech has bypassed it. - Rei Tanaka
Sat Mar 6 16:42:06
got error: 

(RC)   16:40:35 Halt -> HALTED 
(MLE) b0l2de00:SpyAlpha:4:40:38 PM-> 
L1Mon: saw 210 L1 DMA transfers, expect 1 (buffer number 0) 
L1Mon: Dumping data for  1 word. 

hrr worked.
 - ian :: (run 179683)
-- Sat Mar 6 16:42:56 comment by...ian --  
again

Sat Mar 6 17:06:02
WHA has a high HV bar at about 120% and has yellow color.
 - Andrew
-- Sat Mar 6 17:09:23 comment by...Andrew --  
All PMTs are green though. Not clear where the problem is.

-- Sat Mar 6 17:10:18 comment by...SciCo --  Paged WHA HV expert following the instruction.
Sat Mar 6 17:20:42
 - 15:30-17:00 status plots - Andrew
Sat Mar 6 17:23:24
Lost ICICLE heartbeat. Restarted ICICLE. It is still purple  
on iFix for about 10 min now.
 - Andrew
Sat Mar 6 17:46:57 Alarms are off, and HV on many detectors is off - G. Mitselmakher
Sat Mar 6 17:52:56 Run 179683 Terminated at 2004.03.06 17:52:37 - RunControl
Sat Mar 6 17:53:17 Run 179683 TERMINATE: run aborted due to hv trips - Ian X2080
Sat Mar 6 17:59:14 Got Low Level Gas Warning. All HV tripped. Solenoid is ramping down. Paged Silicon, COT. - Rei Tanaka
-- Sat Mar 6 18:10:11 comment by...Rei Tanaka --  Silicon and COT responded. Cooling for silicon is OK. Also notified MCR about CDF status (HV all off, Solenoid ramping down).
Sat Mar 6 18:18:11

silicon status

  • my understanding is that cooling is functional
  • my understanding is that power to the silicon has been cut (looking at the log files)

    given these points the silicon appears to be safe (judging from remote and talking to shift crew).

    if either of these two conditions is not met, page silicon SPL immediately!  - rainer
    -- Sat Mar 6 18:24:34 comment by...Andrew --  

    Both conditions are met. Cooling is functional and the power
    has been cut.

    Sat Mar 6 18:19:33 Steve is trying to recover power supplies down stairs. Shall ramp up the solenoid then. - Rei Tanaka
    -- Sat Mar 6 18:31:06 comment by...Rei Tanaka --  Ops Manager (Mary) is here. We also notified Mike Lindgren about the situation.
    -- Sat Mar 6 18:32:52 comment by...Rei Tanaka --  Silicon expert is here. Rainer is also in contact, and indeed is here.
    Sat Mar 6 18:32:02 JJ also called us if we need help. Thanks JJ. - Rei Tanaka
    Sat Mar 6 18:38:35 The COT ASD power and HV are back on after the power outage. The HV IFIX controls are working and the control room monitor is updating properly. - R. Wagner
    Sat Mar 6 18:41:27 Paged Muon (Phil Schlabach). We can put back Muon HV when all crates are ON. - Rei Tanaka
    Sat Mar 6 19:15:30 Still in recovery process. HV on all crates downstairs is up. Solenoid is still off.  - Rei Tanaka
    Sat Mar 6 19:55:37

    My long mea culpa - by Steve Hahn

    It all started with the ODH alarm at the end of my shift. "ODH alarm in col hall" appeared on the status panel, but no FIRUS message was issued. On opening the "ODH C-Hall" display under the "Misc. Menu" in iFIX, I found sensor NE_B was reading out -2.6% O2; it had obviously broken, maybe power was no longer going to the sensor or the signal connection was bad.

    Steve Gordon and I looked into this, and I also talked to Dervin a couple of times. Ideally, we would have bypassed the sensor in alarm with a Burndy bypass "key", but after much searching we concluded all bypass keys were already in use. We tried to use the chassis bypass key for the alarm chassis that did collision hall ODH alarms, but this only silenced the alarm and did not keep the HVAC system from going into purge mode (instead of HVAC Full). All we could do is what we already had done, which was bypass the entire ODH system, both in the collision hall and the assembly building (not the clean rooms).

    This did not strike me as a good solution safety-wise, so I continued to investigate ways to bypass only the one sensor. Finally, I tried disabling the sensor readout in 4mation. Bad move! Since the ODH sensors are in Quadlog A, any tampering results in the entire Quadlog shutting down.

    FIRUS came up with two alarms: flammable gas in the gas sheds, and a fire alarm in the collision hall. Strangely, the action in B0 was that of a flammable gas condition in the collision hall--that is, all detector power and HV were turned off, and the solenoid went into slow dump. Perhaps this is explained by ODH being bypassed. I explained the false alarm to the fire chief when the fire trucks arrived (I had also called dispatch immediately but they had already dispatched). Now we are still in the process of recovery--most everything is on at this point, but Jim Humbert is having problems bringing up the solenoid.

    I've explained to Jim, and I'll make sure he passes it on to new crews, that since the whole ODH system is still bypassed, we have to consider the process system techs as keeping a fire watch on the ODH sensors until we're able to bypass the bad sensor.

     - Steve Hahn
    Sat Mar 6 20:00:42
    CES wire plots and CPR wire plots are "hotter" than reference plots
     - G.Mitselmakher
    Sat Mar 6 20:02:52
    CES wire plot, seems a bit "hot"
     - G.Mitselmakher
    Sat Mar 6 20:29:24 Problems powering up solenoid. Awaiting expert help (Bob Sanders expected around 2100). Silicon experts still working and central calorimeter gamma supplies still not stabilized, so we are not taking data. We will start an AAA_SHOTSETUP run to exercise the DAQ. - convery
    Sat Mar 6 20:32:20 We made use of the down time to do measurements with an optical power meter:
    - on the end of Muon fibers before going into the L2 Muon Pulsar board (trigger room)
    - at the output of the L2 out of Muon Matchbox and prematch box cards in the Muon Trigger crate.
    - we tried cleaning the L2 out connectors in the Muon Trigger crate with dry air
    - we redid the measurement on L2out connectors in the Muon Trigger crate: we see only small to no improvement. More details of the measurements to be given in the Pulsar e-log  - Ted & Burkard
    Sat Mar 6 20:46:10 Run 179684 TERMINATE: oops - we need to clean up the event builder due to silicon experts (my bad rsw) - 2080 ian
    Sat Mar 6 20:59:52
    run control tip of the day: 
    
    if you can't see the error messages when you shepherd a crate, open an xterm and type: 
    
    kdestroy 
    kticket
     - ian
    Sat Mar 6 21:04:23

    CEM CHA WHA HV problems:

    When we first brought up the Gammas and I restarted PISABOX, all three of these were yellow. Now after a second pass of readouts, they all appear to be slightly better--CEM is green, and the bad channels in CHA WHA are slightly less out of tolerance. All the remaining problems now are in two Gammas--the NE CHA and WHA 1. Of course, the Gamma HV values are perfect. I've recommended to Mary that we leave this for the moment and see if it continues to get better, but we should call the calorimeter HV pager if it does not get better by tomorrow morning.

     - Steve Hahn
    Sat Mar 6 21:51:10 Run 179685 TERMINATE: aborting the run - 2080 ian
    Sat Mar 6 21:56:33
    have a VISION read/write error from b0cot01 slot 17. tried shepherding -> it failed. tried
    reboot from vxworks -> failed. 
    
    also having problems coldstarting b0cot16. still have some time to work on it since solenoid is
    still hosed. waiting for response from daq pager carrier.
     - ian
    -- Sat Mar 6 22:01:55 comment by...ian --  
    rather tried "reset" from vxworks -> failed.

    -- Sat Mar 6 22:05:52 comment by...rainer --  from frontend CPU
    VME BERR received!
    *** Error reading slot 17 channel 26 calibration reg ***
    *** In slot 17, found DspVer=255, expected 37
    FISION: errno = 0x5 (0x5): VME BERR received!
    *** Error writing slot 17 channel 27 calibration reg ***
    FISION: errno = 0x5 (0x5): VME BERR received!
    *** Error reading slot 17 channel 27 calibration reg ***
    *** In slot 17, found DspVer=255, expected 37
    FISION: errno = 0x5 (0x5): VME BERR received!
    (...)
    FISION: errno = 0x5 (0x5): VME BERR received!
    *** Error reading slot 17 channel 95 calibration reg ***                        
    *** In slot 17, found DspVer=255, expected 37                                   
    Processing a COT configuration message                                          
    iteration = 0                                                                   
    flag20                                                                          
    flag21                                                                          
    DTH requested = 225 < 350 !!                                                    
    Cannot configure readout - check readout list.                                  
    Readout process not started. Error status: 1                                    
    Enabling TAXI to send data to VRB                                               
    Error Processing Readout List !                                                 
    Sent ack to Run Control: 1 ERROR ERROR                                          
    Resetting card in slot 17 failed                
    
    Got error message through error_q          
    

    -- Sat Mar 6 22:13:29 comment by...ian --  
    for cot16 get a bunch of these:
    
    
    FISION: errno = 0x16 (0x16): architecture init. failure!
    FISION: errno = 0x16 (0x16): VME controller not found (at 0xc1400000)!
    FISION: errno = 0x16 (0x16): architecture init. failure!
    
    bill badgett said he'd try looking into this. 
    
    also paged tdc expert regarding cot01 slot 17 problem.

    Sat Mar 6 22:00:07 Run 179686 TERMINATE: cot crates in error - 2080 ian
    -- Sat Mar 6 22:12:51 comment by...Rei Tanaka --  Paged DAQ, then TDC primary pager.
    -- Sat Mar 6 23:25:43 comment by...J. Nachtman --  DAQ pager never went off...
    Sat Mar 6 22:11:54 Solenoid is ramping up. - Rei Tanaka
    -- Sat Mar 6 23:20:34 comment by...Rei Tanaka --  Ramped to nominal value (4650Amps, 13700Gauss).
    Sat Mar 6 22:15:55
    The Universe/Tundra VME/PCI interface on  
    b0cot16 is not accessible from the MVME CPU. 
    
    I suggest cycling the power on b0cot16. 
    If this does not work, we will have to  
    replace the MVME card.
     - W.Badgett
    -- Sat Mar 6 22:22:48 comment by...ian --  
    ok. how do we power cycle?

    -- Sat Mar 6 22:30:25 comment by...ian --  
    mike lindgren is looking into this

    -- Sat Mar 6 22:30:57 comment by...ian --  
    tdc expert is working on cot01 slot 17

    -- Sat Mar 6 22:32:35 comment by...Rei Tanaka --  Mike is going down stairs to power cycle this crate.
    -- Sat Mar 6 22:33:40 comment by...WB --  
    Power cycling COT crates individually can 
    be done via one of the iFix PSM displays.
    

    -- Sat Mar 6 22:45:47 comment by...ian --  
    ok. mike reset it. now "VME controller not found" error doesn't appear upon coldstarting. only
    thing is now crate is "margarine" in local client gui but says "ready" in green when i check the
    status.

    -- Sat Mar 6 23:42:16 comment by...W.Badgett --  The margarine color is expected, as cot16 was out of sync with the rest of the runControl clients.
    Sat Mar 6 22:54:30 Abort gap rate is high (pinkies in Tevmon). Called MCR to do scraping. Do not forget to set HV to stand-by state for Silicon, COT, CES, CCR, CPR, CMU, CMP, CMX and BMU before calling MCR ! - Rei Tanaka
    -- Sat Mar 6 23:06:57 comment by...Rei Tanaka --  Called MCR again for scraping to reduce the abort gap losses. They are not allowed to change parameters globally if abort gap < 20kHz. They are now monitoring and making minor changes. Now it is 17kHz.
    -- Sat Mar 6 23:23:54 comment by...Rei Tanaka --  MCR called us. They will keep watching the abort gap losses. Told them that CDF can survive with the current condition (< 18kHz).
    Sat Mar 6 23:05:54
     - 17:00-23:00 beam status plots
    Sat Mar 6 23:09:07 Run 179688 TERMINATE: junk junk - 2080 ian
    Sat Mar 6 23:09:35 Run 179688 ACTIVATE: junk j8unk - 2080 ian
    Sat Mar 6 23:21:26 silicon checked out OK . Friendly reminder that silicon can not be biased with abort gap losses >= 20 kHz. Bring it to standby if TevMon alarms and inform MCR. if collimators will be moved, bring other detectors to standby as written on the whiteboard. - rainer :: (run )
    Sat Mar 6 23:23:17 Run 179690 TERMINATE: more junk - 2080 ian
    Sat Mar 6 23:31:18 Run 179693 TERMINATE: junk - 2080 ian
    Sat Mar 6 23:55:13 Ops manager neglected to request beam-on calibrations, etc, to checkout detector after power was off. Finally did showermax calibration herself only to discover failures in ccal01 slot 5 smxr ch 1 and pcal02 slot 6 smxr ch 6. Powercycled NW arch and fixed ccal01. Powercycled W plug twice but pcal02 still bad - will require an access. - convery
    -- Sun Mar 7 00:53:26 comment by...convery --  DBANA plot in owl shift elog
    Sat Mar 6 23:55:49
    Run Number Data Type Physics Table Begin Time End Time Live Time L1 Accepts L2 Accepts L3 Accepts Live Lumi, nb-1 GR SC RC
    179683 x2BDE3 BEAM PHYSICS_2_02 [2,424,431] 23:51:25 17:52:37 16:45:43 881,025,509 15,870,935 3,323,844 2029.546 1 1 1
    Totals 23:55:02 16:45:43 881,025,509 15,870,935 3,323,844 2029.546
     - End of Shift Report
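
A rough cross-check of the run 179683 numbers above, as a minimal sketch. It assumes the begin time 23:51:25 is on the previous day (the end time is earlier on the clock than the begin time) and that "Live Time" is h:mm:ss; the helper names are made up for this note and are not part of any CDF tool.

```python
# Sketch only: live fraction and trigger reduction factors for run 179683.
# Assumption: the run began at 23:51:25 on the previous day and ended at 17:52:37.

def hms_to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def elapsed_seconds(begin: str, end: str) -> int:
    """Wall-clock seconds from begin to end, allowing one midnight crossing."""
    begin_s = hms_to_seconds(begin)
    end_s = hms_to_seconds(end)
    if end_s < begin_s:
        end_s += 24 * 3600  # the run crossed midnight
    return end_s - begin_s

elapsed = elapsed_seconds("23:51:25", "17:52:37")
live = hms_to_seconds("16:45:43")
live_fraction = live / elapsed

l1_accepts, l2_accepts, l3_accepts = 881_025_509, 15_870_935, 3_323_844
l1_to_l2 = l1_accepts / l2_accepts  # L1 accepts per L2 accept
l2_to_l3 = l2_accepts / l3_accepts  # L2 accepts per L3 accept

print(f"elapsed {elapsed} s, live {live} s, live fraction {live_fraction:.3f}")
print(f"L1/L2 = {l1_to_l2:.1f}, L2/L3 = {l2_to_l3:.2f}")
```

The midnight handling only covers a single day boundary, which is enough for this run under the assumption stated above.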
    Sat Mar 6 23:58:51 Luminosity summary
    Totals
    Date:2004.03.06
    Shift:eve
    Delivered luminosity: 416.1 nb-1
    Acquired luminosity: 120.3 nb-1
    Efficiency: 28.9
     - Rei Tanaka
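
The efficiency quoted above is acquired over delivered luminosity, in percent. A minimal sketch of the arithmetic, using only the two numbers from the summary (variable names are illustrative):

```python
# Sketch only: shift data-taking efficiency from the luminosity summary.
delivered_nb = 416.1  # delivered luminosity, nb-1
acquired_nb = 120.3   # acquired luminosity, nb-1

efficiency_percent = 100.0 * acquired_nb / delivered_nb
print(f"Efficiency: {efficiency_percent:.1f}")
```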
    Sun Mar 7 00:00:05 Shift Summary:
    - Around 15:50 we got an ODH (Oxygen Deficiency Hazard) alarm in the collision hall
      due to a broken sensor located at North-East B. 
    - Trying to bypass this sensor by disabling the sensor readout in 4-mation 
      caused the entire Quadlog to shut down. 
    - At around 17:41, flammable gas error in B0 caused all detector power off  
      and solenoid slow dump. Fire trucks arrived. 
    - Recovery process since then. Silicon is safe. Solenoid is up again. 
    
    Still people are working on: 
    - CEM CHA WHA HV problems (should page tomorrow morning if in same condition). 
    - Shower max calibration. 
    - COT crate b0cot17. 
    
    Plan: 
    - Recover, do calibrations and then take more data! 
    - Store dump is foreseen Sunday morning or afternoon depending on p-bar stack. 
      The p-bar stack is already 140E10 at 24:00, i.e. 7E10/hour stack rate. 
    
     - Rei Tanaka
    Sun Mar 7 00:01:26 To be explicit about abort gap losses, here is current policy
    from Ron Moore to Accelerator Operators (27-Feb-04):

    "We (CDF and I) have developed a modified procedure for dealing
    with proton abort gap losses when their silicon is on.
    The aim is to reduce the number of operator intervention to move
    collimators and to reduce CDF DAQ downtime when they turn
    off during collimator movement.

    * WARNING limit for CDF silicon: 18 < B0PAGC < 20 kHz
    Silicon high voltage stays on, so data-taking continues,
    but MCR crews should evaluate ways to reduce
    losses, e.g., verify collimators are not far away from
    nominal positions.

    Minor tuning of the Tev (tunes, chromaticity, etc) is OK
    with the CDF Silicon high voltage is up.
    Moving a collimator is NOT OK in this range...
    CDF Silicon high voltage needs to be at standby for collimator
    moves...they would prefer to keep taking data with losses
    in this range.

    If one or two attempts at tuning fail, wait until B0PAGC
    reaches/exceeds 20 kHz before trying to move collimators.

    * ALARM limit for CDF Silicon: B0PAGC > 20 kHz

    CDF Silicon high voltage turned down to standby without hesitation.

    MCR should take action to diagnose/reduce abort gap losses.
    Tuning the Tev and moving collimator(s) both OK here since the
    silicon high-voltage is already at standby. Notify both
    experiments prior to moving a collimator to give them
    a chance to lower whatever high voltage they need.

    Ron Moore"


    On the CDF side, TevMon is set to turn pink when losses
    reach 18 kHz but the popup message does not instruct SciCo to
    immediately call the MCR and request any specific action.
    Instead, the SciCo should instruct the ACEs to make plots of
    losses and look at trends and compare to recent runs to anticipate
    where current run might be headed. In a case where the abort gap
    losses look particularly unusual, the SciCo can call the MCR,
    or the silicon pager, or the ops manager to discuss the situation.
    (This rule applies of course to any of the acnet variables that
    CDF monitors as indicators of beam quality.)

    If TevMon turns red because abort gap losses exceed 20 kHz,
    CDF turns silicon to standby and informs MCR that we can not
    take data with silicon until situation is corrected. MCR should
    hopefully follow Ron Moore's directions above.
    Silicon pager should also be notified.
     - jj
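
The thresholds in the policy above can be written down as a small lookup, as a minimal sketch. The function and constant names are made up for illustration and are not TevMon code. The memo gives the warning band as 18 < B0PAGC < 20 kHz and the alarm as > 20 kHz; treating exactly 18 and exactly 20 kHz as inclusive is an assumption here (it follows "reaches/exceeds 20 kHz" in the memo).

```python
# Sketch only: classify abort gap losses (B0PAGC, in kHz) per the policy text.
# Assumption: both boundaries are inclusive (>= 18 is warning, >= 20 is alarm).

WARNING_KHZ = 18.0
ALARM_KHZ = 20.0

def classify_abort_gap(b0pagc_khz: float) -> str:
    """Return 'normal', 'warning' or 'alarm' for a B0PAGC reading in kHz."""
    if b0pagc_khz >= ALARM_KHZ:
        return "alarm"
    if b0pagc_khz >= WARNING_KHZ:
        return "warning"
    return "normal"

# Actions paraphrased from the memo and the CDF-side notes above.
ACTIONS = {
    "normal": "no action described in the policy",
    "warning": ("silicon HV stays on; plot losses and trends; MCR may do minor "
                "Tev tuning; no collimator moves"),
    "alarm": ("silicon HV to standby; inform MCR and the silicon pager; tuning "
              "and collimator moves are OK after both experiments are notified"),
}

def recommended_action(b0pagc_khz: float) -> str:
    """Look up the paraphrased action for a B0PAGC reading in kHz."""
    return ACTIONS[classify_abort_gap(b0pagc_khz)]

for reading in (17.0, 18.5, 20.5):
    print(reading, classify_abort_gap(reading), "->", recommended_action(reading))
```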
    -- Sun Mar 7 05:51:25 comment by...rainer --  the scico asked me for advice what to do about abort gap losses. I agreed it would be a good idea to contact MCR because if scraping was the weapon of choice, it would have been a good idea to do it while we were dead in the water anyway - however I did not _suggest_ scraping as we let MCR decide what they deem necessary, and it turned out they had other means to get the losses down. would have probably been a good idea to remember the protocol I helped setting up myself.
    -- Sun Mar 7 17:22:50 comment by...Rei Tanaka --  Thanks JJ and Rainer for reminding us of the procedure.
    Indeed, MCR did exactly what they are expected to do for the pinkie (18kHz < abort gap loss < 20kHz) condition.
    Here is MCR e-Log.
    23:04:40- The abort gap losses are ~18kHz (bouncing from 16kHz or so). This is just above the minimum limit mentioned by Ron Moore in his memo to us about abort gap losses and tuning (Ron says that CDF is supposed to continue running until losses reach 20kHz, but above 18kHz we should look at tunes and chromaticities to try and bring them down). CDF has asked us to take a look at the losses, so we are. - djinn
    -- Sat Mar 6 23:50:43 comment by...spb -- CDF called about the Abort gap losses. They were between 17.5 and 18.5 KHz, they requested we do minor tune changes. Raised Horz tune by .0005 and lowered vert tune by .0004, see first 2 graphics. The FTP, 3rd plot, is the Abort gap and Lostp during the tuning and beyond. Looks like we did make some improvement. We tuned for about 30 minutes and informed CDF that we were done. On the 4th plot, datalogger during the store, the cyan high peak is where we started to tune.
    Sun Mar 7 00:07:12
    Comments on the current done-timeout problems: 
    
      b0wcal06:   ADMEM in slot 18 is not responding to VME reads  
                  to its Level 2 Buffers (but is not completely  
                  unresponsive to VME accesses) 
                  This is usually solved by re-downloading the  
                  Et look-up table, which also controls the L2 buffers 
                  Please call ADMEM experts for further instructions. 
    
      b0cot17:    There are systemic problems in this crate related  
                  to inter-card synchronization.   This usually  
                  indicates a problem with the Tracer broadcasting  
                  various DAQ signals on the backplane.  Often a  
                  Tracer reseat will help, but that's not an option  
                  right now. 
                  I'd suggest turning off the crate for some extended  
                  period of time (say at least 15 minutes) and  
                  try again;   if this doesn't work an access will  
                  be needed to massage or replace the Tracer
     - W.Badgett :: (run 179697)