Title: Using FPGAs to Generate Gigabit Ethernet Data Transfers
1. Using FPGAs to Generate Gigabit Ethernet Data Transfers: The Network Performance of DAQ Protocols
Dave Bailey, Richard Hughes-Jones, Marc Kelly
The University of Manchester, www.hep.man.ac.uk/~rich/ then Talks
2. Collecting Data over the Network
- Detector elements, e.g. calorimeter planks
- Aim for a general-purpose DAQ solution for CALICE (CAlorimeter for the LInear Collider Experiment)
- Take the ECAL as an example.
- At the end of the beam spill the planks send all their data to the concentrators
- Concentrators pack the data and send it to one processing node
- Classic bottleneck problem for the switch: many input bursts contend for one output link (see the worked example below)
[Diagram: planks connect over custom links (???) to concentrators, which feed Ethernet switches; the switch output link to the processing nodes is the bottleneck queue, 1 burst / node.]
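As a hedged illustration of why the output link queues (the node count and burst size below are illustrative assumptions, not CALICE parameters): if N concentrators burst into the switch at once, the single output link must drain them sequentially.

    /* Sketch: drain time of the switch output-link bottleneck.
     * All numbers are illustrative assumptions, not CALICE parameters. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps = 1e9;        /* 1 GigE output link        */
        const int n_concentrators = 10;     /* assumed                   */
        const double burst_bytes = 1e6;     /* assumed 1 MB burst each   */

        /* All bursts arrive together; the output link serialises them. */
        double drain_s = n_concentrators * burst_bytes * 8.0 / link_bps;
        printf("queue drain time: %.1f ms\n", drain_s * 1e3);  /* 80.0 ms */
        return 0;
    }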
3. XpressFX Virtex4 Network Test Board
- XpressFX Development Card from PLD Applications
- 8-lane PCI-e card
- Xilinx Virtex4FX60 FPGA
- DDR2 memory
- 2 SFP cages (1 GigE)
- 2 HSSDC connectors
4. Overview of the Firmware Design
- Virtex4FX60 has
  - 16 RocketIO Multi-Gigabit Transceivers
  - Large internal memory
  - 2 PPC CPUs
- Ethernet Interface
  - Embedded MAC
  - RocketIO
- Packet Buffer logic
  - Allows routing of input
  - Prioritising of output
- Packet State Machine
  - Packet Generator
  - State Machines
- VHDL model of the HC11 CPU (Green Mountain Computer Systems)
  - Controls the MAC state machines
  - Reserves the PPC for data processing
5. The State Machine Blocks
- Packet Generator
  - CSRs (set by the HC11) for
    - Packet length
    - Packet count
    - Inter-packet delay
    - Destination address
    - Request-Response
- RX State Machine
  - Decode request packet
  - Checksum (RFC 768; a software reference follows this list)
  - Action memory writes
  - Queue other requests
- FIFO
- TX State Machine
  - Process request
  - Construct reply
  - Fragment if needed
  - Checksum
- Packet Analyser State Machine
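The RFC 768 checksum is the standard Internet ones'-complement sum. A minimal software reference in C (the FPGA computes the same sum in logic; this is not the VHDL implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 768 / RFC 1071 ones'-complement checksum over a byte buffer.
     * For UDP the sum also covers the pseudo-header; that part is
     * omitted here for brevity. */
    uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {                     /* sum 16-bit words */
            sum += (uint32_t)data[0] << 8 | data[1];
            data += 2;
            len -= 2;
        }
        if (len == 1)                         /* pad an odd final byte */
            sum += (uint32_t)data[0] << 8;

        while (sum >> 16)                     /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;                /* ones' complement */
    }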
6. The Receive State Machine
[State diagram. States: Idle, Read Header, Fill Fifo, Read Cmd, Check Cmd, Do Cmd, Write Mem. Transition conditions: packet in queue; empty packet; correct/wrong packet type; FIFO written; FIFO has address cmd; good/bad cmd; is / is not a memory write; all bytes received; write finished; end of packet.]
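A minimal C sketch of one plausible reading of the diagram (state names are from the slide; the rx_flags fields are hypothetical stand-ins for the FPGA status signals):

    #include <stdbool.h>

    /* States from the slide's RX diagram. */
    typedef enum { IDLE, READ_HEADER, FILL_FIFO, READ_CMD,
                   CHECK_CMD, DO_CMD, WRITE_MEM } rx_state_t;

    /* Snapshot of the FPGA status signals (names are hypothetical). */
    struct rx_flags {
        bool packet_in_queue, empty_packet, correct_packet_type,
             fifo_written, good_cmd, is_memory_write, all_bytes_received;
    };

    /* One step of the receive state machine, as read from the slide. */
    static rx_state_t rx_step(rx_state_t s, struct rx_flags f)
    {
        switch (s) {
        case IDLE:        return f.packet_in_queue ? READ_HEADER : IDLE;
        case READ_HEADER: if (f.empty_packet) return IDLE;
                          return f.correct_packet_type ? FILL_FIFO : IDLE;
        case FILL_FIFO:   return f.fifo_written ? READ_CMD : FILL_FIFO;
        case READ_CMD:    return CHECK_CMD;   /* FIFO has address + cmd */
        case CHECK_CMD:   return f.good_cmd ? DO_CMD : IDLE;  /* bad cmd */
        case DO_CMD:      return f.is_memory_write ? WRITE_MEM : IDLE;
        case WRITE_MEM:   return f.all_bytes_received ? IDLE : WRITE_MEM;
        }
        return IDLE;
    }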
7. The Transmit State Machine
[State diagram. States: Idle, Check Cmd, Send Header cmd, Send Memory, Send Xsum, All Sent?, Update Counter, End Pkt. Transition conditions: cmd in FIFO; cmd needs no data / cmd requires data; header cmd sent; more data to send; max packet size or byte count done; all bytes have been sent; Xsum sent; end of packet.]
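A companion C sketch of the transmit side, showing the fragmentation loop implied by "Max packet size or byte count done" (the helper stubs and the MAX_PAYLOAD constant are hypothetical):

    #include <stddef.h>

    #define MAX_PAYLOAD 1472u   /* assumed max data bytes per frame */

    /* Placeholder stubs for the FPGA's TX datapath. */
    static void send_header(void)           { /* Send Header cmd */ }
    static void send_memory_bytes(size_t n) { (void)n; /* Send Memory */ }
    static void send_xsum(void)             { /* Send Xsum */ }

    /* Reply to a request for 'total' bytes, fragmenting into frames. */
    static void tx_reply(size_t total)
    {
        size_t sent = 0;
        while (sent < total) {                  /* All Sent? */
            size_t chunk = total - sent;
            if (chunk > MAX_PAYLOAD)
                chunk = MAX_PAYLOAD;            /* max packet size */
            send_header();
            send_memory_bytes(chunk);
            send_xsum();
            sent += chunk;                      /* Update Counter */
        }
        /* End Pkt: back to Idle */
    }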
8. The Test Network
- Used for testing raw Ethernet frame generation by the FPGA
- Test data collection with Request-Response protocols
[Diagram: requesting node, responding nodes and the FPGA concentrator, connected through a Cisco 7609 with 1 GE and 10 GE blades.]
9. Request-Response Latency: 1 GE
- Request sent from a PC
  - Linux kernel 2.6.20-web100_pktd-plus
  - Intel e1000 NIC
  - Interrupt coalescence OFF on the PC
  - MTU 1500 bytes
- Response frames generated by the FPGA code
- Latency 19.7 µs, well behaved
- Latency slope 0.018 µs/byte
- B2B expect 0.0182 µs/byte:
  - Mem 0.0004 µs/byte
  - PCI-e 0.0018 µs/byte
  - 1 GigE 0.008 µs/byte (8 bits/byte at 1 Gbit/s)
  - FPGA 0.008 µs/byte
- Smooth to 35,000 bytes (a latency-scan sketch follows)
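A hedged sketch of a requester-side latency scan like the one plotted here (not the authors' harness; send_request()/wait_response() are placeholder stubs standing in for the request-response framing):

    #include <stddef.h>
    #include <stdio.h>
    #include <time.h>

    /* Placeholder stubs: send one request frame and block until the
     * full response of 'bytes' data bytes has arrived. */
    static void send_request(size_t bytes)  { (void)bytes; }
    static void wait_response(size_t bytes) { (void)bytes; }

    /* Round-trip latency vs. response size, as plotted on the slide. */
    int main(void)
    {
        for (size_t bytes = 64; bytes <= 35000; bytes += 1000) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            send_request(bytes);
            wait_response(bytes);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e3;
            printf("%zu bytes: %.2f us\n", bytes, us);
        }
        return 0;
    }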
10. FPGA → PC, ethCal_recv Frame Jitter
- 12 µs frame spacing (line speed)
- Peak separation 4-5 µs, no coalescence
11. Test the Frame Spacing from the FPGA
- Frames generated by the FPGA code
- Interrupt coalescence OFF on the PC
- Frame size 1472 bytes
- 1M packets sent.
- Plot the mean observed frame spacing vs the requested spacing (measured as in the sketch below)
  - Appears to have an offset of -1 µs?
  - Slope close to 1, as expected
- Packet loss decreases with packet rate.
  - Packets are lost in the receiving host
  - A larger effect than for UDP/IP packets
  - UDP/IP losses are linked to scheduling
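A minimal sketch of receiver-side spacing measurement on Linux, assuming a raw AF_PACKET socket (the actual ethCal_recv tool is not shown here; capturing all EtherTypes is a simplifying assumption, and userspace timestamps carry the very kernel jitter the slide discusses):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        /* ETH_P_ALL captures every EtherType; a real test would bind
         * to the test frames' own EtherType and interface. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket (needs root)"); return 1; }

        unsigned char buf[2048];
        struct timespec prev = {0}, now;

        for (int i = 0; i < 1000; i++) {
            if (recv(fd, buf, sizeof buf, 0) < 0) break;
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (prev.tv_sec) {
                double us = (now.tv_sec - prev.tv_sec) * 1e6 +
                            (now.tv_nsec - prev.tv_nsec) / 1e3;
                printf("%.2f\n", us);   /* inter-frame spacing in us */
            }
            prev = now;
        }
        close(fd);
        return 0;
    }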
12. The Test Network
- Used for testing raw Ethernet frame generation by the FPGA
- Test data collection with Request-Response protocols
- This time use 10 GE hosts
- But does 10 GE work on a PC??
[Diagram: as before, requesting node, responding nodes and the FPGA concentrator through a Cisco 7609 with 1 GE and 10 GE blades.]
13. 10 GigE Back2Back UDP Throughput
- Motherboard: Supermicro X7DBE
- Kernel 2.6.20-web100_pktd-plus
- NIC: Myricom 10G-PCIE-8A-R Fibre
  - rx-usecs=25, coalescence ON
- MTU 9000 bytes
- Max throughput 9.4 Gbit/s
- Notice the rate for 8972-byte packets (a sender sketch follows this list)
- 0.002% packet loss in 10M packets, in the receiving host
- Sending host: 3 CPUs idle
  - For <8 µs between packets, 1 CPU is >90% in kernel mode, inc. ~10% soft int
- Receiving host: 3 CPUs idle
  - For <8 µs between packets, 1 CPU is 70-80% in kernel mode, inc. ~15% soft int
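A hedged C sketch of the sending side of such a UDP test (8972 data bytes plus the 20-byte IP and 8-byte UDP headers fill the 9000-byte MTU; the address, port and pacing are placeholder assumptions, and real tools busy-wait rather than nanosleep for µs-scale gaps):

    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(5001);                        /* placeholder */
        inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);  /* placeholder */

        /* 8972 bytes + 20 IP + 8 UDP = 9000-byte MTU datagram */
        static char payload[8972];

        struct timespec gap = { 0, 8000 };   /* 8 us inter-packet wait */
        for (long i = 0; i < 10000000L; i++) {
            sendto(fd, payload, sizeof payload, 0,
                   (struct sockaddr *)&dst, sizeof dst);
            nanosleep(&gap, NULL);
        }
        close(fd);
        return 0;
    }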
14. Scaling of Request-Response Messages
- Requests from the 10 GE system
- Interrupt coalescence OFF on the PC
- Frame size 1472 bytes
- 1M packets sent.
- Request 10,000 bytes of data
  - The host does fragment collection, like the IP layer (see the sketch after this list)
- Sequential requests
  - Time to receive all responses scales with the round-trip time.
  - As expected for sequential requests
- Grouped requests
  - Collection time increases by 24.6 µs per node.
  - From the network alone expect 12.3 + 1 = 13.3 µs
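A minimal sketch of host-side fragment collection "like the IP layer", assuming each response frame carries a byte offset and length in a small, hypothetical header:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define RESPONSE_BYTES 10000u   /* total data requested */

    /* Hypothetical per-fragment header: byte offset + payload length. */
    struct frag_hdr { uint32_t offset; uint16_t len; };

    static uint8_t  assembly[RESPONSE_BYTES];
    static uint32_t bytes_seen;

    /* Accumulate one fragment; returns true when the response is
     * complete.  (Ignores duplicates/overlaps for brevity.) */
    bool collect_fragment(const struct frag_hdr *h, const uint8_t *payload)
    {
        if ((uint32_t)h->offset + h->len > RESPONSE_BYTES)
            return false;                      /* out of range: drop */
        memcpy(assembly + h->offset, payload, h->len);
        bytes_seen += h->len;
        return bytes_seen >= RESPONSE_BYTES;
    }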
15. Sequential Request-Response
- Interrupt coalescence OFF on the PCs
- MTU 1500 bytes
- 10,000 packets sent.
- Histograms similar
  - Strong 1st peak
  - Second peak 5 µs later
  - Small group 25 µs later
- Ethernet occupancy for 1500 bytes (1538 bytes on the wire with preamble, headers, FCS and inter-frame gap):
  - 1 Gig: 12.3 µs
  - 10 Gig: 1.2 µs
16. Grouped Request-Response
- Interrupt coalescence OFF on the PCs
- MTU 1500 bytes
- 10,000 packets sent.
- Histograms multi-modal
  - Second peak 7 µs later
  - Small group 25 µs later
17. Conclusions
- Implemented MAC and PHY layers inside a Xilinx Virtex4 FPGA
- Learning curve was steep; had to overcome issues with
  - the Xilinx CoreGen design
  - clock-generation stability on the PCB
- The FPGA easily drives 1 Gigabit Ethernet at line rate
- Packet dynamics on the wire are as expected
- Loss of raw Ethernet frames in the end host is being investigated
- Request-Response style data collection looks promising
- Developing a simple network test system
  - Planned upgrade to operate at 10 Gbit/s
- Work performed in collaboration with the ESLEA (UK e-Science) and EU EXPReS projects
19. 10 GigE UDP Throughput vs Packet Size
- Motherboard: Supermicro X7DBE
- Linux kernel 2.6.20-web100_pktd-plus
- Myricom NIC 10G-PCIE-8A-R Fibre
  - myri10ge v1.2.0, firmware v1.4.10
  - rx-usecs=0, coalescence ON
  - MSI=1
  - Checksums ON
  - tx_boundary=4096
- Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries
- Model the data transfer time as t = C + m × Bytes (checked in the sketch below)
  - C includes the time to set up the transfers
  - Fit is reasonable: C = 1.67 µs, m = 5.4 × 10⁻⁴ µs/byte
  - Steps consistent with C increasing by 0.6 µs
- The Myricom driver segments the transfers, limiting the DMA to 4096 bytes; PCI-e chipset dependent!
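A small check of the fitted model (constants are from this slide; the step term encodes the observation that C grows by about 0.6 µs each time the driver must split the DMA at a 4096-byte tx_boundary):

    #include <math.h>
    #include <stdio.h>

    /* t = C + m * Bytes, with C growing ~0.6 us per extra 4096-byte
     * DMA segment (fitted values from the slide). */
    static double transfer_time_us(double bytes)
    {
        const double C = 1.67;       /* us, setup time           */
        const double m = 5.4e-4;     /* us/byte, fitted slope    */
        const double step = 0.6;     /* us per extra DMA segment */
        double extra_segments = floor(bytes / 4096.0);
        return C + step * extra_segments + m * bytes;
    }

    int main(void)
    {
        for (double b = 1000; b <= 9000; b += 1000) {
            double t = transfer_time_us(b);
            printf("%5.0f bytes: t = %5.2f us -> %5.2f Gbit/s\n",
                   b, t, b * 8.0 / (t * 1000.0));
        }
        return 0;    /* build with -lm for floor() */
    }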