Title: CSC Online Error Monitoring with the DDU
1 CSC Online Error Monitoring with the DDU
J. Gilmore CSC-DPG 41 July 17, 2008
2 DDU Overview
- Functions
  - Merge data from 15 CSCs
  - Perform online data unpacking and status monitoring in real time (CRC, word count, format quality, BXN, L1A number, buffer status, link status)
  - Send CSC status to FMM
- Large Buffer Capacity
  - 2.5 MB buffer
  - Average DDU data volume estimated to be 0.4 kB per L1A at LHC (at 10^34 luminosity)
  - Buffer can hold over 6000 events
- Status info accessed via VME
- 15 Optical Fiber Inputs. Reads a 20-degree slice through an endcap
GbE/SPY To Local DAQ
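The "over 6000 events" figure follows from the two buffer numbers quoted on this slide. A quick check, assuming decimal units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes), which the slide does not specify:

```python
# Back-of-the-envelope check of the DDU buffer depth quoted on this slide.
# ASSUMPTION (not stated in the slides): decimal units, 1 MB = 1e6 B, 1 kB = 1e3 B.
BUFFER_BYTES = 2_500_000   # 2.5 MB buffer
BYTES_PER_L1A = 400        # 0.4 kB average DDU data volume per L1A


def events_in_buffer(buffer_bytes: int, bytes_per_event: int) -> int:
    """Number of average-sized events that fit in the buffer."""
    return buffer_bytes // bytes_per_event


print(events_in_buffer(BUFFER_BYTES, BYTES_PER_L1A))  # 6250
```

With these numbers the buffer holds 6250 average-sized events, consistent with "over 6000".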
3 Data Unpacking in the DDU
- Scan data for evidence of SEUs, determine if Reset is needed
  - Data errors are an indicator for SEU
  - Requires Hard Reset, report it to FMM
- Monitor front-end data for event sync loss
  - Requires Sync Reset, report it to FMM
- Watch for buffer warning signals, avoid Overflows!
  - Set FMM Warning as needed, at half-to-3/4 full (many events!)
  - Beyond 90% full DDU will set FMM Busy
  - As buffers get near empty, DDU returns to FMM Ready
  - Note that Buffer Overflows will lead to other errors if not Reset: Sync loss, Data corruption, Timeout errors
- Diagnose cause and source of problems
  - Track which CSCs have set which error types
  - Report Reset Required states via VME Interrupt
  - Tracking for chronic problems in offline log files
- Provide VME registers for diagnostics and monitoring
- Include status and error information in the DDU Trailer
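The buffer-driven FMM behaviour described on this slide can be read as a small state machine with hysteresis. The sketch below is illustrative only, not the DDU firmware logic. The 50% Warning level is the lower end of the "half-to-3/4" range, the 10% "near empty" level is an assumption (the slides give no number), and holding Warning/Busy until the buffer drains is also an assumed reading. The names `FmmState` and `next_fmm_state` are hypothetical.

```python
from enum import Enum


class FmmState(Enum):
    READY = "Ready"
    WARNING = "Warning"
    BUSY = "Busy"


# Thresholds as fractions of buffer capacity.
WARN_LEVEL = 0.50   # slide: Warning at "half-to-3/4 full"; lower bound used here
BUSY_LEVEL = 0.90   # slide: Busy beyond 90% full
READY_LEVEL = 0.10  # ASSUMPTION: "near empty" is not quantified in the slides


def next_fmm_state(current: FmmState, fill: float) -> FmmState:
    """Return the FMM state for a buffer fill fraction (0.0 to 1.0)."""
    if fill > BUSY_LEVEL:
        return FmmState.BUSY
    if current is FmmState.READY:
        return FmmState.WARNING if fill >= WARN_LEVEL else FmmState.READY
    # ASSUMPTION: once in Warning or Busy, hold that state until nearly drained.
    if fill <= READY_LEVEL:
        return FmmState.READY
    return current


state = FmmState.READY
for fill in (0.20, 0.60, 0.95, 0.50, 0.05):
    state = next_fmm_state(state, fill)
    print(f"{fill:.2f} -> {state.value}")
# Prints Ready, Warning, Busy, Busy, Ready (one per line, after the fill value).
```

The hysteresis keeps the state from toggling on every event while the buffer hovers around a threshold.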
4 Reported Error Categories I
- Configuration failures
  - Constants loaded on a board are not correct
  - Caused by communication errors, bad timing or hardware
  - Often leads to data errors: Timeout, bad DAV, sync loss, buffer overflow, dead or hot channels, format errors, data corruption
- Format error, Consistency error or Not Present
  - An expected format marker is not detected in the proper position
  - Can cause DDU to misidentify a board header/trailer word
  - May show as missing board in event
  - May show as bad L1A, CRC or word count
  - Caused by config fail, bad hardware or signal timing/quality
- Hot/dead channels or Empty/Missing CSC
  - Caused by HV, config fail, bad hardware or signal timing/quality
  - Can lead to buffer overflows
  - Missing CSCs are caused by LV-off or disabled CSCs
- DAV-LCT mismatch
  - A CFEB was triggered but it failed to send data
  - Caused by config fail, bad hardware or signal timing/quality
  - Can lead to buffer overflows or Timeout errors
5 Reported Error Categories II
- Full FIFO at DMB (ALCT or CFEB buffer overflow)
  - Caused by config fail, bad hardware or signal timing/quality
  - Can cause Sync loss, Data corruption, or Timeout
- L1A Number Mismatch Errors
  - Fundamental sign of sync loss
  - Caused by problem with hardware or signal timing/quality
  - Possibly SEU related
- CRC error: bit error detected in transmission
  - Generally a minor concern, affecting only one event
  - Only serious if it affects multiple Header/Trailer bits
  - May be an indicator of a deeper problem
  - CSC electronics have a CRC at every level to detect bit errors: CFEB, ALCT, TMB, DMB and DDU
- Overall severity of an error is hard to predict
  - Cases that appear as Critical require a Reset as they usually lead to more errors, but sometimes may be self-correcting
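To illustrate how a CRC exposes a bit error in transmission, here is a minimal sketch. The slides do not give the CRC polynomial or width used by the CSC electronics, so the standard CRC-32 from Python's `zlib` stands in for it. The payload and the helper names `flip_bit` and `crc_ok` are made up for the example.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated transmission error)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def crc_ok(payload: bytes, transmitted_crc: int) -> bool:
    """Receiver-side check: recompute the CRC and compare with the transmitted one."""
    return zlib.crc32(payload) == transmitted_crc


payload = bytes(range(32))       # stand-in for one board's data words
sent_crc = zlib.crc32(payload)   # the sender transmits this alongside the data

print(crc_ok(payload, sent_crc))                # True: clean transmission
print(crc_ok(flip_bit(payload, 77), sent_crc))  # False: single bit error detected
```

A failed check says only that some bit in that block changed, which fits the slide's point that a CRC error usually affects a single event.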
6 Event Quality Indicators from DDU
- The Single Error flag in DDU trailer: Do Not Analyze Event
  - Any events with non-perfect data checks will get this
  - Minor bit errors or format problems, SCA Full
- Single Warning if problem might not affect the data payload
  - Clean single-bit error in a header/trailer-word marker
  - Fiber receiver/link error that may have occurred between events
  - DCM phase-lock-loss that may occur between events
- The Critical Error Sync Lost case: Data Integrity Failure
  - L1A mismatch detected twice on one CSC
    - Two different boards in the same event
    - Separate occurrences in two different events
  - Buffer Overflow at DMB or DDU
  - Note: offline analysis might not see the loss in data integrity
    - At the full point, a buffer still has many good events to read out before the compromised data is observed, and sTTS actions can conceal all this
- The Critical Error Hard Reset case: Unpacker Failure Likely
  - Anything that corrupts the data irreversibly
  - Violation of event boundaries, can't determine end-of-CSC data stream
  - Anything that looks like an SEU, e.g. repeated trivial errors
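The "L1A mismatch detected twice on one CSC" criterion for Sync Lost can be sketched as a per-CSC counter. This is illustrative only: the class name, the board labels, and the idea of comparing each board against a single DDU L1A number are assumptions made for the example, not the DDU's documented logic.

```python
from collections import Counter


class L1AMismatchTracker:
    """Counts L1A mismatches per CSC and flags Sync Lost on the second one."""

    SYNC_LOST_THRESHOLD = 2  # slide: "L1A mismatch detected twice on one CSC"

    def __init__(self) -> None:
        self.mismatches = Counter()  # csc_id -> mismatches seen since last reset

    def check_event(self, ddu_l1a: int, board_l1as: dict) -> set:
        """Compare each board's L1A number with the DDU's; return CSCs in Sync Lost.

        board_l1as maps (csc_id, board_name) -> L1A number reported by that board.
        """
        for (csc_id, _board), l1a in board_l1as.items():
            if l1a != ddu_l1a:
                self.mismatches[csc_id] += 1
        return {csc for csc, n in self.mismatches.items()
                if n >= self.SYNC_LOST_THRESHOLD}

    def reset(self) -> None:
        """Clear the counters, as a Sync Reset would."""
        self.mismatches.clear()


tracker = L1AMismatchTracker()
# Event 1: two boards of CSC 3 disagree with the DDU; CSC 7 has a first mismatch.
lost = tracker.check_event(100, {(3, "TMB"): 99, (3, "ALCT"): 98,
                                 (7, "DMB"): 99, (5, "DMB"): 100})
print(sorted(lost))  # [3]
# Event 2: CSC 7 mismatches again, a separate occurrence in a different event.
lost = tracker.check_event(101, {(7, "DMB"): 100})
print(sorted(lost))  # [3, 7]
```

Both routes named on the slide end in the same state: two boards in one event (CSC 3) or two occurrences in different events (CSC 7).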
7 Summary
- The DDU performs online CSC error monitoring in real time
- The monitor status is in the DDU Trailer for every event
- The DDU monitoring results are useful for offline data quality checking
- Details of DDU monitoring status can be found here
  - http://www.physics.ohio-state.edu/cms/ddu/ddu2_pro.html#tr-1
8 DDU Error Table I
- (1) Error bits resulting in RESET REQUIRED persist until the RESET occurs. Questionable cases (in gold) indicate that a reset is only required for mitigation of recurring errors. TBD: sync/hard reset distinctions.
- (2) Found inside an event, i.e. between Beginning-Of-Event (Header1 signature) and End-Of-Event (combination of Trailer1 and Trailer2 signatures), at least one of the following: Extra DMB_Header1, Extra DMB_Header2, Lone Word, Extra TMB/ALCT_Trailer, Extra DMB_Trailer1, DMB_Trailer2.
- (3) Missing TMB/ALCT_Trailer word, missing DMB Header word, Wrong First word, or Extra Control words.
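As a rough illustration of the "extra control word" condition in footnote 2, the sketch below scans the word types of one CSC's data block between Beginning-Of-Event and End-Of-Event and reports any control word seen more than once. The labels, the split of TMB/ALCT_Trailer into two names, and the assumption that the words have already been classified are all hypothetical. "Lone Word" detection is not covered.

```python
from collections import Counter

# HYPOTHETICAL labels for the control words of one CSC's data block.
CONTROL_WORDS = (
    "DMB_Header1", "DMB_Header2",
    "ALCT_Trailer", "TMB_Trailer",
    "DMB_Trailer1", "DMB_Trailer2",
)


def extra_control_words(word_types: list) -> list:
    """Return control-word labels that occur more than once inside one event."""
    counts = Counter(w for w in word_types if w in CONTROL_WORDS)
    return [w for w in CONTROL_WORDS if counts[w] > 1]


event = ["DMB_Header1", "DMB_Header2", "DATA", "DATA",
         "DMB_Header1",                      # unexpected second Header1
         "TMB_Trailer", "DMB_Trailer1", "DMB_Trailer2"]
print(extra_control_words(event))  # ['DMB_Header1']
```

An empty list means no duplicated control word was found in that block.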
9 DDU Error Table II
10 DDU Error Table III