2004 CDF E-Log -- Owl shift. Sat Feb 28, 2004
SciCo             DAQ Ace        Monitoring Ace   CO           (Operations Manager)
S.Miscetti/Guram  Ian Vollrath   Alison Lister    Brian Mohr   jj Schmidt


Start of Shift Notes:  

Trying to put Silicon in DAQ.
Test new trigger table PHYSICS_2_03[1,431,435]

Sat Feb 28 00:07:05 Run 179468 Activated at 2004.02.28 00:06:27 - RunControl
Sat Feb 28 00:07:06 Run 179468 ACTIVATE: PHYSICS_2_03[1,431,435] with silicon - Ian x2080
Sat Feb 28 01:00:59
getting L1 done timeouts every ~3-5min on average. however, none yet resulting from e120 ...
which they tell me is a good thing. also have had a few (~4) reformatter errors.
 - ian :: (run 179468)
Sat Feb 28 01:07:48
done timeout from b0cmx01 
hrr recovered
 - ian :: (run 179468)
Sat Feb 28 01:23:54 Run 179468 Terminated at 2004.02.28 01:23:06 - RunControl
Sat Feb 28 01:23:56 Run 179468 TERMINATE: run ended for silicon work - Ian x2080
Sat Feb 28 01:29:34
We have taken a first look at the new trigger table,
comparing run 179468 with the previous low-lum run
(run 179467). From Xmon, we see that the global rates of
L1, L2, L3 look reasonable. L2 processing looks better.
We decided to keep the table until 8 in the morning.
 
 - S.Miscetti
Sat Feb 28 01:34:52
COT cooling alarm which went away on its own after a couple of minutes.
"Cryo guy" said this is a "normal" behaviour when the Silicon is put into a run later than the rest.


(guess this has been seen often before but thought I would put it in e-log anyway for
information)
 - alison
Sat Feb 28 01:46:15 Run 179469 Activated at 2004.02.28 01:44:50 - RunControl
Sat Feb 28 01:46:16 Run 179469 ACTIVATE: PHYSICS_2_03[1,431,435] - Ian x2080
Sat Feb 28 01:46:17 Run 179469 Terminated at 2004.02.28 01:46:10 - RunControl
Sat Feb 28 01:46:42 Run 179469 TERMINATE: bad bad bad - Ian x2080
Sat Feb 28 01:46:59
 - alison (2 hour plots)
Sat Feb 28 01:57:21 Run 179468 RUNSTATUS:
Marked Bad, explanation:
L3T too many reformatter errors (5%)
 - cdfscico
Sat Feb 28 02:06:05 Run 179470 TERMINATE: bad bad - Ian x2080
Sat Feb 28 02:09:33
 - alison
Sat Feb 28 02:17:01 Run 179471 Activated at 2004.02.28 02:16:34 - RunControl
Sat Feb 28 02:21:06 Run 179471 Terminated at 2004.02.28 02:19:26 - RunControl
Sat Feb 28 02:21:07 Run 179471 ACTIVATE: bad bad - Ian x2080
Sat Feb 28 02:21:08 Run 179471 TERMINATE: bad bad - Ian x2080
Sat Feb 28 02:31:39 Run 179472 Activated at 2004.02.28 02:30:45 - RunControl
Sat Feb 28 02:32:26 Run 179472 ACTIVATE: PHYSICS_2_03[1,431,435] after some si fixes - Ian x2080
Sat Feb 28 02:46:44
about 80% of L3 data flow states are in an "error state". however, we have no errors and are running
~smoothly. maybe this is a result of the numerous reformatter errors we have been having.
 - ian :: (run 179472)
Sat Feb 28 02:49:57

Update on Silicon Massacre

  • yesterday, we had attempted to upgrade the SRC firmware - we are still running with old firmware in the SVX SRC, for which there is no spare.
  • we knew the old firmware was plagued with L2 w/o L1 fatal errors, out of which we knew we could HRR out.
  • lester downloaded some firmware with additional diagnostics into the b0svx06/ISL SRC.
  • checkouts w/o silicon appeared normal - with silicon, the SVX got into a dramatic state which kicked SVX into a high current state - so high that the detector temperature increased and the temperature alarm tripped.
  • we decided to undo all of the changes and revert to the old situation - old SRC in SVX, latest firmware w/o diagnostics in ISL as before.
  • unfortunately, we are now plagued by numerous L1 DONE timeouts from various ladders.
  • we suspected corrupted GLINK senders and power cycled FIB crates in the collision hall. to no avail.
  • we suspected corrupted GLINK receivers and power cycled VRB crates. to no avail.
  • almost by accident, we swapped one VRB. that seems to have helped, at least for individual sources of DONE timeouts.
  • unfortunately, there are instantaneous spurts of reformatter errors, likely connected to the DONE timeouts, which result in the runs being marked bad.
  • we have currently no idea how this could come about - the fact that swapping VRBs seems to help is almost ridiculous.
  • nonetheless we swapped 3 times successfully, and we had to steal one VRB from the EVB upgrade crate because there is no way to get into FCC after hours, and we burned the two spares we have available in the Si office.
  • all this was discussed with Ops and ok'ed.
  • one EVB cleanup was my fault - I forgot to put a cable back in.
 - Rainer, for Sal, Marcel, Pete and Lester and Steve consulting.
    -- Sat Feb 28 02:50:48 comment by...rainer --  ps.: to be continued ... tomorrow we'll try to brainstorm with the experts.
    -- Sat Feb 28 02:51:24 comment by...rainer --  See silicon elog
    -- Sat Feb 28 03:33:30 comment by...ian --  
    in the past hour or so have had L1 done timeouts:
    
    e480: 1
    e200: 7
    e420: 12
    
    none from e120.
    
    reformatter error rate for this run so far is: 1.78% and decreasing.
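
A quick tally of the L1 done timeout counts listed in this comment, as a minimal sketch. It assumes "the past hour or so" means exactly 60 minutes, which the entry does not state precisely; the variable names are illustrative only.

```python
# L1 done timeout counts per crate, copied from the comment above.
timeouts = {"e480": 1, "e200": 7, "e420": 12, "e120": 0}

# Assumption: "the past hour or so" is taken as exactly 60 minutes.
window_min = 60.0

total = sum(timeouts.values())
mean_interval_min = window_min / total
worst = max(timeouts, key=timeouts.get)

print(f"total timeouts: {total}")
print(f"mean interval between timeouts: {mean_interval_min:.1f} min")
print(f"dominant source: {worst} ({timeouts[worst] / total:.0%} of all timeouts)")
```

Under that assumption the mean spacing comes out at the low end of the "every ~3-5min" estimate noted earlier in the shift for run 179468.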

    Sat Feb 28 03:09:51
     - alison
    Sat Feb 28 04:06:11
     - alison
    Sat Feb 28 05:05:53
     - alison
    Sat Feb 28 06:05:38
     - alison
    Sat Feb 28 06:27:04
     - alison
    -- Sat Feb 28 06:28:36 comment by...alison --  
    All of B1W3 went down to zero at the same time around 6.05
    These are only the AVDD plots.

    -- Sat Feb 28 07:14:15 comment by...rainer --   another fit of CAEN madness - fixed by hockerization. details see silicon elog.
    Sat Feb 28 07:10:36
     - alison
    Sat Feb 28 07:14:17
    PSM alarm: 1RR18D (CMX-Muon crate) Channel 0 was slightly high (just over 6V). Alarm went away on
    its own.
     - alison
    Sat Feb 28 07:15:19
    got trigger inhibit from IFIX: SVX HV. no signs of any trips, alarms, etc. low voltage bar of
    SVX HV on IFIX had disappeared. brought SVX HV to standby. contacted expert. brought SVX HV back up
    at expert's request. same problem - i.e. trigger inhibit without anything else. brought SVX HV back
    down at expert's request. rainer called ... came back in. turns out there was an inhibit and some
    power supply problems. some crates had to be "hockerized"
     - ian :: (run 179472)
    -- Sat Feb 28 07:17:07 comment by...ian --  
    note that the above is a summary of what's been going on for the past while (got first inhibit at
    ~5:45am)

    -- Sat Feb 28 07:17:34 comment by...ian --  
    we are back running as of 7:12am

    Sat Feb 28 07:46:59
    CPR trip: sections 0 through 5 West. 
    "ON" for that section recovered the trip.
     - alison
    Sat Feb 28 07:48:05
    got error: 
    
    46'34" 1 crate/s: b0svx06(16),  in error.[RXPT]b0svx06:Messenger:7:46:23 AM->SRC Fatal Error:Sl 5
    Too Many L1A 2 L1A to Buff  
    
     -->   
     Additional Information:  
    
     Attention !!!. FERML_SRC_FATALITY ERROR !!!  
    
     SRC Fatal Error from b0svx06: Sl 5 Too Many L1A 2 L1A to Buff  
    
    hrr worked
     - ian :: (run 179472)
    -- Sat Feb 28 07:51:57 comment by...ian --  
    again

    Sat Feb 28 07:56:09
    Run Number Data Type Physics Table Begin Time End Time Live Time L1 Accepts L2 Accepts L3 Accepts Live Lumi, nb-1 GR SC RC
    179468 x2BD0C BEAM PHYSICS_2_03 [1,431,435] 00:06:27 01:23:06 00:59:48 56,227,513 1,047,617 180,634 135.953 0 1 1
    179472 x2BD10 BEAM PHYSICS_2_03 [1,431,435] 02:30:45 03:05:45 180,917,362 3,043,852 554,559 354.951 1
    Totals 07:55:02 04:05:34 237,144,875 4,091,469 735,193 490.904
     - End of Shift Report
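
The Totals row of the run summary above can be cross-checked against the two per-run rows. This is only a sketch: the dictionary keys are made-up labels, and only the columns that are unambiguous in the table (L1/L2/L3 accepts and live luminosity in nb-1) are included.

```python
# Per-run counters copied from the run summary table above.
runs = {
    179468: {"l1": 56_227_513, "l2": 1_047_617, "l3": 180_634, "lumi_nb": 135.953},
    179472: {"l1": 180_917_362, "l2": 3_043_852, "l3": 554_559, "lumi_nb": 354.951},
}

# Totals row as reported in the table.
reported = {"l1": 237_144_875, "l2": 4_091_469, "l3": 735_193, "lumi_nb": 490.904}

# Sum each column over the runs and compare with the reported total.
sums = {key: sum(run[key] for run in runs.values()) for key in reported}
for key, value in sums.items():
    # Small tolerance because the luminosity values are floats.
    matches = abs(value - reported[key]) < 1e-6
    print(f"{key}: summed={value} reported={reported[key]} match={matches}")
```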
    Sat Feb 28 08:01:42 Shift Summary:
    We started the shift with Store #3261 in at an
    instantaneous luminosity of 3.9*10^31.
    
    The previous shift had just started the new run (179468) with the low
    luminosity trigger table PHYSICS_2_03[1,431,435]. Silicon experts were
    swapping boards in order to reduce the Silicon timeouts. We compared
    XMON with the previous run; it did not look too different, so we ran with
    the new table.
    
    There were a lot of problems while swapping Si boards. At 2AM Rainer completed
    the fix and we started a run with a reasonable number of L3 reformatter
    errors (<1%). We still have 4% dead time due to L1 Done timeouts. The situation
    is unclear and this morning the Si experts will keep working on this.
    
    We also had at least 1 hour of downtime due to a Si trigger inhibit from
    IFIX, although there were no signs of any trips. Rainer came back in
    and discovered some power supply problems. This is now fixed.
    
    SI experts at work. 
    

    End of Shift Numbers
    CDF Run II

    Runs                   
    Delivered Luminosity   941.5  
    Acquired Luminosity    498.0  
    Efficiency             52.9
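
The efficiency figure above is consistent with acquired divided by delivered luminosity. A minimal sketch of that arithmetic, assuming both numbers are in the same unit (the log does not state it) and that the efficiency is quoted in percent:

```python
# End-of-shift numbers copied from the summary above.
delivered = 941.5  # delivered luminosity, unit as in the log (not stated)
acquired = 498.0   # acquired luminosity, same unit assumed

# Assumption: efficiency is the acquired/delivered ratio, in percent.
efficiency_pct = 100.0 * acquired / delivered

print(f"Efficiency: {efficiency_pct:.1f}")
```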
    
    
     - S.Miscetti
    Sat Feb 28 08:18:49 Run 179472 ACTIVE: Got "L3_REF_ERROR_HIGH_RATE Error: L3 Instantaneous error rate is 1.3838321 per cent." and then "FERML_SRC_FATALITY ERROR !!! SRC Fatal Error from b0svx06: Sl 5 L2A w/o L1A" - Vadim x2080
    Sat Feb 28 08:23:48
     (entry outside this shift's time range ) - Susana.
    Sat Feb 28 08:38:50 Run 179472 ACTIVE: Fatal one again: first "FrontEnd Crate Error Condition from: VRB_ISL_06", then RXPT Error, and "FERML_SRC_FATALITY ERROR !!! SRC Fatal Error from b0svx06: Sl 5 Too Many L1A 2 L1A to Buff" - Vadim x2080