2004 CDF E-Log -- Day shift. Sat Feb 28, 2004
SciCo: rainer (kaori)
DAQ Ace: vadim khotilovich
Monitoring Ace: susana cabrera
CO: franco semeria
Operations Manager: JJ


Start of Shift Notes:  

Store # 3261, Inst Lum 2.7e31, Stack 137, stacking around 7m/h 
Run 179472 in progress. 
COT HV (SL12 off, SL345 reduced gain) 
Silicon IN (but L1 done timeout happens.  3-4%) 
Trigger table is the new physics table: PHYSICS_2_03 [1,431,435] 
Plan: - Take data  
      - Silicon people continue to investigate L1 DTO 
      - Silicon D-mode calibration (end of store) 
      - alpha clustering test (end of store) 

 

Sat Feb 28 08:42:54 Run 179472 ACTIVE: in the last half an hour already had two L2 decision timeouts from b0l2de00:SpyAlpha - Vadim x2080
Sat Feb 28 08:46:20 Run 179472 ACTIVE: "Done Timeout detected in crates: VRB_SVX_02" + "FrontEnd Crate Error Condition from: VRB_SVX_02" + "done timeout for crate/s: b0svx02 Done TO.1 crate/s: b0svx02" + "FERML_SRC_FATALITY ERROR !!! SRC Fatal Error from b0svx02: Sl 5 Too Many L1A 2 L1A to Buff" - Vadim x2080
Sat Feb 28 09:00:01 Checked with MCR: no firm plan yet for the time of Ron's helix opening. They will notify us if that is going to happen. Till then, we keep taking data. - kaori(temp scico/ops)
Sat Feb 28 09:04:12 Run 179472 ACTIVE: again a chain of events leading to "FERML_SRC_FATALITY ERROR !!! SRC Fatal Error from b0svx06: Sl 5 2 L1A to Buff " - Vadim x2080
Sat Feb 28 09:07:04 Run 179472 ACTIVE: "CER_SVXMON_HALT_RECOVER_RUN_ERROR !!! Stuck Cellid S/B2/W7/L1/C0-2." - Vadim x2080
Sat Feb 28 09:07:54 Run 179472 ACTIVE: just two minutes later: "CER_SVXMON_HALT_RECOVER_RUN_ERROR !!! Stuck Cellid S/B1/W4/L4/C7-13 ." - Vadim x2080
Sat Feb 28 09:08:17
 - Susana
-- Sat Feb 28 09:24:45 comment by...kaori --  Called MCR (talked to Chip Edstrom) asking if they were doing something that would cause the abort gap (yellow) to behave like this. He said they were not doing anything, but he will look into it.
-- Sat Feb 28 10:04:22 comment by...kaori --  MCR called back, just telling us that they are still investigating.
Sat Feb 28 09:10:09 Run 179472 Terminated at 2004.02.28 09:09:11 - RunControl
Sat Feb 28 09:10:10 Run 179472 TERMINATE: ending this miserable run - Vadim x2080
Sat Feb 28 09:13:47 We are ending this run because Steve N. and Rainer would like to try a few things for the L1 DTO. - kaori
-- Sat Feb 28 12:10:34 comment by...rainer --  

L1 DONE Timeout Saga Summary

  • a systematic power cycle of all VRB crates seems to have solved the problem.
  • leading 'theory' is that somewhere in the VRB chain a 'noise source' was introduced after the power cycle for the SRC firmware upgrade - a VRB might have come up in a funky state etc.
  • a systematic power cycle of all (permissible) crates seems to have cleared the cause of interference.
  • diagnostics are now in place so that, if the L1 DONE TO recurs, we can investigate further.

    See entry here and silicon elog for details.


    Sat Feb 28 09:19:49 Solenoid alarmed briefly, and the alarm stopped. No trip. Talked to the Cryo person. He said he is looking into it further, but it seems the current just went over briefly and came back down. It is stable now. (We are not taking data at the moment.) - kaori
    Sat Feb 28 09:56:54 Run 179473 Activated at 2004.02.28 09:56:16 - RunControl
    Sat Feb 28 09:57:16 Run 179473 ACTIVATE: PHYSICS_2_03 [1,431,435] - Vadim x2080
    Sat Feb 28 10:06:13
     - Susana
    Sat Feb 28 10:11:01 Run 179473 ACTIVE: similar to what we had before: "FrontEnd Crate Error Condition from:VRB_ISL_06" + "RXPT Error for b0svx06" + "FERML_SRC_FATALITY ERROR !!! SRC Fatality Error for b0svx06: Sl 5 Too Many L1A 2 L1A to Buff" - Vadim x2080
    Sat Feb 28 10:29:42
    L2 decision timeout: 
    
    (MLE) b0l2de00:SpyAlpha:10:29:04 AM-> 
    CListMon: Dumping CLIST data: 
    Word       upper 32 bits  lower 32 bits 
     0: 0x80055401	0x20000011  
     1: 0x00055401	0x00000011  
     2: 0x01055401	0x20000011  
     3: 0x01055401	0x00000011 
     - Vadim :: (run 179473)
    Sat Feb 28 10:56:11
    After discussing with Kevin Pitts, Kirsten, Rainer, Steve N. this morning,  
    following is the current plan for the rest of this store: 
    
    1) Keep the new default physics table  PHYSICS_2_03 [1,431,435], continue 
       to take data with Silicon, COT (compromised HV condition). 
    
    2) We will NOT do the uber prescale test for this store.  Reason being 
    that (as agreed before), silicon people would like to see at least one store 
    with 'stable' silicon running condition before doing the test. 
    This (current store) is the store to see if 'stable' or not.  
    If all goes well, we plan to do the uber prescale test at the end of  
    the next store.  
    
    3) 2h before the end of the store, take silicon D-mode calibration. 
    
    4) - assuming beam is stable, keep silicon HV on. 
       - take ~45 min test run (until L2_PS10_L1_EM8 has 100K events): 
           use run_config:  AAA_NOSILICON 
           use trigger table: TEST_ALPHA_CLUSTERING_NOSPIKES[2,430,403] 
    
    5) - at any other time during the remainder of the store, take physics runs 
         as in 1) (with silicon), assuming the beam condition is good.
     - kaori
    Sat Feb 28 11:08:49
     - Susana.
    Sat Feb 28 12:02:52
     - R.J. Tesarek
    -- Sat Feb 28 12:12:58 comment by...R.J. Tesarek --  
    Odd Abort Gap Behavior:
    Abort gap (yellow) and halo (green) for protons (left) and antiprotons (right). The proton variables are derived from the same counters, but different from the antiproton variables. Common behavior between proton and antiproton abort gap variables indicates that the conditions are similar for both devices. The differing behavior between the halo and abort gap indicates that the abort gap behavior is not due to a failure of the electronics (common readout). The effect observed is real.

    -- Sat Feb 28 12:27:13 comment by...kaori --  Darren/Chip (MCR) called to tell us that they investigated many possible things that might cause the strange abort gap behaviour, but so far they have not found anything. They suspected it may be caused by a failure of the electronics at our end. Rick assured us (see above) that it cannot be the electronics, and informed MCR. D0 reports that they see no strange structure in the abort gap (according to the MCR log entry).
    Sat Feb 28 12:13:38
     - Susana
    Sat Feb 28 12:30:14
     - R.J. Tesarek
    -- Sat Feb 28 12:34:06 comment by...R.J. Tesarek --  
    Abort Gap Variables:
    Abort gap variables from the CDF halo counters (B0PAGC, yellow), the CDF beam shower counters (E/W coincidence, B0MSC3 cyan) and from the telescope located at E0 (E0LABT green). All variables show the same behavior. MCR folk posted a comment about the D0PHTL (D0 proton halo) and D0AHTL (D0 antiproton halo) not showing the same behavior. Neither do the CDF halo monitors (ref previous plots).
    Sat Feb 28 12:36:57
    L2 decision timeout: 
    
    (MLE) b0fcal00:Messenger:12:15:47 PM->Runtime Error 1, Event 1867650: Bunch counter mismatch,
    mismatch count = 1 
    
    (MLE) b0dap73.fnal.gov:Thread-7076:12:34:29 PM->Requested Halt-Recover-Run issued [errmon] 
    (RC)   12:34:31 Halt -> HALTED 
    (MLE) b0l2de00:SpyAlpha:12:35:22 PM-> 
    L1Mon: saw 210 L1 DMA transfers, expect 1 (buffer number 0) 
    L1Mon: Dumping data for  1 word. 
    Word       upper 32 bits  lower 32 bits 
       0: 0x00000000	0x00001000  
       1: 0x00000000	0x00802016  
    .... a lot of lines .... 
      418: 0x00000000	0x00000000  
      419: 0x00000000	0x00000000 
     - Vadim :: (run 179473)
    Sat Feb 28 12:43:03
    Rick Vidal plot for halo counters gated on abort gaps. (similar to Ricky T's above..)
     - JJ
    Sat Feb 28 12:48:39 Dan Cyr calls:

    carries TDC and BMU hardware pager, but wants to be called on his cell 1-630-715-3704 in case there is a problem. - Rainer


    Sat Feb 28 13:09:55 had TEVMon Error due to another spike in the abort gap losses - we informed MCR that we think the effect is real. This spike reached 18 kHz, close enough to the 20 kHz alarm limit to warrant some concern about silicon safety.  - Rainer
    Sat Feb 28 13:11:56
     - Susana
    Sat Feb 28 13:14:46
    s
     - s
    -- Sat Feb 28 13:17:01 comment by...Susana. --  
    TEVMON indicates "Silicon in danger state", because of RMS
    of B0PAGC, at the time of the spike in the plot.
    MCR is aware of the spikes and is investigating their origin.

    -- Sat Feb 28 13:58:42 comment by...JJ --  
    I think it is more fair to say that MCR has been unable
    to find any obvious source of this structure and is now
    mostly an observer (as are we).

    -- Sat Feb 28 14:20:06 comment by...JJ --  I take it all back. Ops have pasted a plot showing a correlation with M:B1WRP. Now to figure out what B1WRP is.
    -- Sat Feb 28 14:38:19 comment by...JJ --  M:B1WRP from the .MRING data logger sounds like a "main ring" variable..?? In any case, I made a plot of this variable for the last few hours - see entry further down - and see no correlation at all. The mystery remains.
    Sat Feb 28 14:09:45
     - Susana.
    Sat Feb 28 14:16:17
    Many red lines on SL1DMonitor
     - franco
    -- Sat Feb 28 14:26:49 comment by...rainer --  talked to XMON expert about the finding above as well as in the previous run - the reference fits are old (from before the COT problem), so the reference is outdated. Also, L1/L2 are performing fine otherwise (DT < 3%).
    Sat Feb 28 14:39:36
    M:B1WRP versus abort gap loss. No obvious correlation.
     - JJ
    Sat Feb 28 14:39:51
    b0dap84.fnal.gov:ConsumerErrorRe:2:38:35 PM->Runtime Error 1, Event 34437, RunNum 179473:
    SvxMon Halt Recover Run: Stuck Cellid in 5 events in Silicon/S/B1/W5/L4/C7-13 . -->   
    
     Additional Information:  
    
     Attention !!!. CER_SVXMON_HALT_RECOVER_RUN_ERROR !!!  
    
     Stuck Cellid S/B1/W5/L4/C7-13 .  
     AUTO HRR will be issued 
     - Vadim :: (run 179473)
    -- Sat Feb 28 14:42:14 comment by...Vadim --  the error repeated after about 3 min
    Sat Feb 28 14:50:02 Run 179472 RUNSTATUS:
    Marked Bad, explanation:
    L3T high (~2.8%) reformatter error rate - however happened in 'spikes' 
     - cdfscico
    Sat Feb 28 15:21:32
     - Susana.
    Sat Feb 28 15:31:51 Update on plan from Run Coordinator's eLog:

    "Sat Feb 28 14:35:13-
    The plan for the near future is: Keith Gollwitzer will begin
    reverse proton studies at 20:00 tonight which will last till 1 am.
    Ron Moore will come in at 8 pm and open the helix by 15%. He will
    scrape, if needed, prior to the studies. There will be a 1 hour access
    after the store for D0. - DJC"


    Note by JJ: CDF currently has no requests for a 1 hour access.  - JJ
    Sat Feb 28 15:46:40
    s
     - s
    -- Sat Feb 28 15:47:27 comment by...Susana. --  
    SVX rad evolution during the shift. It looks OK.

    Sat Feb 28 15:50:25
    Date        Time      BLM          Dose
    2004.02.28  15:49:48  W Inner BLM  833.51 RADS
    2004.02.28  15:49:48  W Outer BLM   46.43 RADS
    2004.02.28  15:49:48  E Inner BLM  141.16 RADS
    2004.02.28  15:49:48  E Outer BLM  833.11 RADS
    Integrated dosage  - Susana
    -- Sat Feb 28 15:51:48 comment by...Susana. --  
    Sorry !

    Sat Feb 28 15:55:57
    Run Number     Data Type  Physics Table             Begin Time  End Time  Live Time  L1 Accepts   L2 Accepts  L3 Accepts  Live Lumi, nb-1  GR SC RC
    179472 x2BD10  BEAM       PHYSICS_2_03 [1,431,435]  02:30:45    09:09:11  04:11:23   245,277,495  4,067,790   727,345     458.159          0 1 1
    179473 x2BD11  BEAM       PHYSICS_2_03 [1,431,435]  09:56:16              05:47:06   308,872,158  4,572,366   904,715     471.246          1
    Totals                                                          15:55:02  09:58:29   554,149,653  8,640,156   1,632,060   929.405
     - End of Shift Report
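The Totals row of the run table can be cross-checked by summing the per-run columns. The following is a minimal sketch, not part of the e-log tooling: the dictionary keys and helper names (`parse_hms`, `format_hms`, `shift_totals`) are made up for illustration, and the numbers are copied from the two run rows above.

```python
from datetime import timedelta

# Per-run values copied from the end-of-shift table above.
# The key names are hypothetical; they are not the e-log's own field names.
RUNS = [
    {"run": 179472, "live_time": "04:11:23", "l1": 245_277_495,
     "l2": 4_067_790, "l3": 727_345, "live_lumi_nb": 458.159},
    {"run": 179473, "live_time": "05:47:06", "l1": 308_872_158,
     "l2": 4_572_366, "l3": 904_715, "live_lumi_nb": 471.246},
]


def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Format a timedelta as 'HH:MM:SS' (hours may exceed 24)."""
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def shift_totals(runs):
    """Sum live time, trigger accepts and live luminosity over all runs."""
    live = sum((parse_hms(r["live_time"]) for r in runs), timedelta())
    return {
        "live_time": format_hms(live),
        "l1": sum(r["l1"] for r in runs),
        "l2": sum(r["l2"] for r in runs),
        "l3": sum(r["l3"] for r in runs),
        "live_lumi_nb": sum(r["live_lumi_nb"] for r in runs),
    }


if __name__ == "__main__":
    print(shift_totals(RUNS))
    # The second token in the Run Number cell is the run number in hex.
    print(format(179472, "X"), format(179473, "X"))
```

With the values above, the summed live time, accepts and live luminosity agree with the Totals row, and the hex strings match the x2BD10 / x2BD11 tags next to the run numbers.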
    Sat Feb 28 15:57:25 MCR calls - plan is to keep the store as long as possible, until luminosity is at 10E30. At 8pm, there will be approx. 30 min of orbit helix studies - they recommend taking the detector to standby, but there are no additional hazards - normal beam study. After the store, D0 has asked for an access of ~1 hr. MCR would like to know whether we would piggyback on the D0 access. - Rainer
    Sat Feb 28 15:58:24 MCR calls - they would like to have a plot of the cooling temperature versus time of the rack which houses the B0PAGC electronics. Will call R. Tesarek. - Rainer
    -- Sat Feb 28 16:11:31 comment by...JJ --  
    Unable to reach Tesarek. Rick (or Muge?) - if you see this entry would you respond and let us know if there is a rack temp that makes any sense to monitor?

    -- Sat Feb 28 22:05:24 comment by...jj --  Please ignore. We found the right rack, I believe, plus all our monitoring has been exonerated. (See next shift eLog.)
    Sat Feb 28 16:02:35 Shift Summary:
     
    Started shift with store #3261 at Inst Lum 27e30, now 21E30. stack  
    now at 173.9. new trigger table PHYSICS_2_03 in effect. started 
    
    History: 
    
    - Silicon people performed some diagnostics and eventually got 
      system back to 'normal' behavior. No DONE TO or  
      reformatter errors  seen in ongoing run 179473. 
    
    - Strange Abort Gap loss behavior over time. Informed MCR about 
      our concern and the fact that we think the oscillations are  
      real. 
    
    Plan: 
    
    - MCR will keep store as long as possible down to 10E30. 
    - Open helix study at 8pm for approx 30min. take detector  
      to standby. afterwards resume data taking. 
    - We have no need for 1 hour access at end of store. D0 will  
      go in. 
    - pending request: MCR wishes to see plot of B0PAGC electronics 
      rack temperature vs. time (cooling water etc.) 
    
    
    

    End of Shift Numbers
    CDF Run II

    Runs                    179472,179473
    Delivered Luminosity   675.4  
    Acquired Luminosity    571.6  
    Efficiency             84.6
    
    
     - Rainer
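The efficiency quoted in the End of Shift Numbers is consistent with acquired luminosity divided by delivered luminosity, expressed in percent. A minimal sketch of that arithmetic follows; the function name `efficiency_percent` is made up for illustration, and the assumption that the log's number is computed exactly this way comes only from the figures agreeing.

```python
def efficiency_percent(acquired, delivered):
    """Return acquired / delivered as a percentage.

    Both arguments must be in the same luminosity units; the log does not
    state the units for these two numbers, so none are assumed here.
    """
    if delivered <= 0:
        raise ValueError("delivered luminosity must be positive")
    return 100.0 * acquired / delivered


if __name__ == "__main__":
    # Values copied from the End of Shift Numbers above.
    print(f"{efficiency_percent(571.6, 675.4):.1f}")  # prints 84.6
```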