MSc assignments contributing to the StaccatoLab program

Author: Kees van Berkel,

- part-time full professor of SAN (*System Architecture and Networking*),
- within the department *Computer Science and Mathematics* of the *Technical University Eindhoven*,
- email: c.h.v.berkel@tue.nl

Date: 2018 January 3

**Introduction**

- As background and motivation for the assignments below, see the StaccatoLab research program.
- All assignments include a pre-study phase (5 ECTS), assuming the Embedded Systems program.
- Successful completion of the VLSI Programming course (2IMN35) is a considerable advantage, but not required.
- Experience with Python programming is a considerable advantage, but not required.
- The list of MSc assignments below may not be up to date. Feel free to contact me for the latest status.


**Background**

- Large dataflow graphs may take advantage of multiple FPGAs, ultimately up to thousands of FPGAs.
- Inter-FPGA communication of dataflow tokens is much more involved than intra-FPGA communication. There are multiple options for inter-FPGA communication, involving different resources, with different constraints, and with different performance impacts.

**Goal**

- dataflow *models* of inter-FPGA traffic (latency, slack, throughput), for different options: FPGA-FPGA, FPGA-DRAM-FPGA, and FPGA-DRAM-network-DRAM-FPGA;
- tool support for dataflow graph partitioning, user-directed and/or automated.
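As a toy illustration of such a traffic model (all names and numbers are ours, not StaccatoLab's): when the cut edges of a partitioning share a single inter-FPGA link, the sustainable graph iteration rate is bounded by the link bandwidth divided by the bytes communicated per iteration.

```
# Toy inter-FPGA traffic model (illustrative only, not StaccatoLab code):
# the throughput of a partitioned graph is bounded by its slowest link.

def max_iterations_per_sec(cut_edges, link_bw_bytes_per_sec):
    """cut_edges: list of (tokens_per_iteration, bytes_per_token) pairs
    sharing one inter-FPGA link. Returns the iteration rate the link
    can sustain, ignoring latency and protocol overhead."""
    bytes_per_iteration = sum(t * b for t, b in cut_edges)
    return link_bw_bytes_per_sec / bytes_per_iteration

# Two cut edges (8 and 2 tokens/iteration, 4-byte tokens) on a 10 GB/s link:
print(max_iterations_per_sec([(8, 4), (2, 4)], 10e9))  # prints 250000000.0
```

A real model would add per-token latency and the slack (buffering) needed to hide it, which is where the graph-cycle constraints below come in.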

**Approach**

- pre-study: design example dataflow graph(s);
- pre-study: develop inter-FPGA traffic models, based both on theory and on experiments;
- pre-study: define the dataflow graph partitioning problem (nb. graph cycles!);
- analyze the performance impacts of a given dataflow-graph partitioning;
- develop tool support for user-directed graph partitioning;
- evaluate existing graph partitioning algorithms;
- select/modify a graph partitioning algorithm and implement it in Python;
- evaluate the partitioning algorithm on selected dataflow graphs.
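The partitioning objective behind these steps can be sketched in a few lines of Python (a hypothetical graph representation, not the StaccatoLab API): minimize the tokens per iteration that cross FPGA boundaries.

```
# Sketch: evaluate the inter-FPGA traffic cost of a given partitioning
# of a dataflow graph (hypothetical representation, not StaccatoLab code).

def cut_cost(edges, partition):
    """edges: list of (src, dst, tokens_per_iteration);
    partition: dict mapping node -> FPGA id.
    Returns total tokens/iteration crossing FPGA boundaries."""
    return sum(tokens for src, dst, tokens in edges
               if partition[src] != partition[dst])

# Toy 4-node graph: a -> b -> c -> d, plus a feedback edge d -> b.
edges = [("a", "b", 8), ("b", "c", 8), ("c", "d", 8), ("d", "b", 1)]

# Two candidate 2-way partitionings:
p1 = {"a": 0, "b": 0, "c": 1, "d": 1}   # cuts b->c (8) and d->b (1)
p2 = {"a": 0, "b": 1, "c": 1, "d": 1}   # cuts only a->b (8)
print(cut_cost(edges, p1), cut_cost(edges, p2))  # 9 8
```

Note how p2 is cheaper in traffic but keeps the feedback cycle b-c-d on one FPGA, which matters because a cycle cut by a high-latency link limits throughput.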

**Results**

- dataflow models for inter-FPGA traffic;
- a study of relevant dataflow graph partitioning algorithms;
- a tool for user-directed graph partitioning;
- a tool that analyzes the performance impact of dataflow graph partitioning;
- a tool that proposes a dataflow partitioning with minimal impact on the overall graph throughput.

**Background**

- The current StaccatoLab dataflow graph simulator is programmed in Python, but offers a throughput of only about 100k events (token productions) per CPU second (on a Mac notebook) for audio and image processing applications.
- Running a dataflow graph on a large FPGA should easily reach 100M - 100G events/sec, but offers limited support for debugging.

**Goal**

- support the development and debugging of large dataflow graphs,
- .. by means of a simulator with a throughput of 1-10M events/sec on a laptop,
- .. while retaining (most of) the current (Jupyter-based) debug support.

**Approach**

- pre-study: analyze throughput limitations of the existing simulator;
- pre-study: explore the *potential* of compiled-C code alternatives;
- develop a compiled-C simulator for single-rate dataflow graphs;
- interface this simulator to current Python-based StaccatoLab graphs and the Jupyter interfaces;
- evaluate throughput gains;
- optimize and extend this simulator to cyclo-static dataflow graphs.
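For reference, the single-rate execution semantics such a simulator must reproduce can be captured in a few lines of Python (a toy model, not the StaccatoLab simulator): a node fires when every input edge holds a token, consuming one token per input edge and producing one per output edge.

```
# Toy single-rate dataflow simulator (illustrative, not StaccatoLab code):
# counts token productions ("events") over a number of scheduling rounds.

def simulate(nodes, edges, initial_tokens, rounds):
    """nodes: list of node names; edges: list of (src, dst) pairs;
    initial_tokens: dict mapping edge -> token count."""
    tokens = dict(initial_tokens)
    events = 0
    for _ in range(rounds):
        for n in nodes:
            inputs = [e for e in edges if e[1] == n]
            # fire n once if every input edge currently holds a token
            if not all(tokens[e] > 0 for e in inputs):
                continue
            for e in inputs:
                tokens[e] -= 1          # consume one token per input edge
            for e in edges:
                if e[0] == n:
                    tokens[e] += 1      # produce one token per output edge
                    events += 1
    return events

# Two-node ring a <-> b with one initial token per edge:
# both nodes fire every round, i.e. 2 events per round.
edges = [("a", "b"), ("b", "a")]
print(simulate(["a", "b"], edges, {("a", "b"): 1, ("b", "a"): 1}, 1000))
```

The interpretive per-edge bookkeeping in this inner loop is exactly what a compiled-C backend would replace.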

**Results**

- a simulator extension offering a speed-up by one or two orders of magnitude.

**Background**

- Can the behavior of a dataflow *sub*graph be modeled by an "equivalent" single dataflow node? What kind of dataflow *sub*graphs can be modeled by an "equivalent" cyclo-static dataflow node?
- A practical value of such abstraction could be in replacing a subgraph by an equivalent node, thereby speeding up dataflow simulation.
- From an application viewpoint it may be sufficient to require "data equivalence": the node produces the same output data stream(s), given the same input streams.
- When the *sub*graph operates in a more dynamic environment, "flow equivalence" may (or may not) be important.

**Goal**

- a description of dataflow sub-graph categories ("patterns") that can be transformed into a CSDF node;
- a tool that supports such abstraction transformation;
- a demonstration on e.g. a deeply-pipelined 1D FFT as part of a 2D FFT.

**Approach**

- pre-study: design a collection of single-rate dataflow subgraphs, each with a data-equivalent CSDF node, and analyze their flow behaviors;
- pre-study: study relevant literature on forms of program equivalence;
- define flow equivalence, assuming the StaccatoLab execution model;
- specify several classes of dataflow sub-graphs for which flow-equivalent nodes can be generated;
- develop a tool that generates a StaccatoLab node description (finite state machine + output functions) for such a class of dataflow subgraphs.
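A minimal sketch of what such a generated node could look like (a hypothetical representation, not the actual StaccatoLab node format): a CSDF node cycles through a fixed sequence of phases, each with its own consumption and production rates, plus an output function.

```
# Sketch of a cyclo-static dataflow (CSDF) node: rates vary cyclically
# over a fixed phase sequence. Hypothetical, not the StaccatoLab format.

class CSDFNode:
    def __init__(self, consume, produce):
        # consume/produce: per-phase token rates, one entry per phase
        self.consume, self.produce = consume, produce
        self.phase = 0

    def try_fire(self, in_fifo, out_fifo):
        c = self.consume[self.phase]
        if len(in_fifo) < c:
            return False                     # not enough input tokens
        data = [in_fifo.pop(0) for _ in range(c)]
        p = self.produce[self.phase]
        out_fifo.extend([sum(data)] * p)     # toy output function
        self.phase = (self.phase + 1) % len(self.consume)
        return True

# Phase 0 consumes 1 token and emits nothing; phase 1 consumes 1 token
# and emits it twice -- a stand-in for an abstracted two-node pipeline.
node = CSDFNode(consume=[1, 1], produce=[0, 2])
inp, out = [3, 4, 5, 6], []
while node.try_fire(inp, out):
    pass
print(out)  # [4, 4, 6, 6]
```

The generated finite state machine mentioned above corresponds to the phase counter here; the output functions correspond to the per-phase computation on the consumed tokens.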

**Results**

- some theory + a practical tool + examples.

**Background**

- A pulsar is a highly magnetized, rotating neutron star or white dwarf. It radiates a beam of electromagnetic emission, much like a lighthouse. Rotation periods range from milliseconds to seconds.
- Searching for pulsars is based on so-called
*coherent de-dispersion*, and is extremely computationally intensive. See e.g. link.
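For intuition: interstellar dispersion delays a pulse more at lower observing frequencies, and de-dispersion must undo this for many trial dispersion measures, which is what makes the search so costly. The standard cold-plasma delay formula, in a small sketch (the function name is ours; the constant is the textbook value):

```
# Dispersion delay of a pulsar signal across a frequency band
# (standard cold-plasma formula; a sketch, not StaccatoLab code).

def dispersion_delay_ms(dm, f_lo_ghz, f_hi_ghz):
    """Extra arrival delay (ms) at f_lo relative to f_hi, for a
    dispersion measure dm in pc/cm^3 and frequencies in GHz."""
    return 4.148808 * dm * (f_lo_ghz ** -2 - f_hi_ghz ** -2)

# DM = 50 pc/cm^3 across a 1.2-1.5 GHz band: ~52 ms of smearing,
# i.e. many pulse periods for a millisecond pulsar.
print(round(dispersion_delay_ms(50, 1.2, 1.5), 2))  # 51.86
```

Coherent de-dispersion applies the corresponding phase correction to the raw voltage data in the frequency domain, one FFT-multiply-inverse-FFT pass per trial DM.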

**Goal**

- a scalable dataflow graph for pulsar search;
- .. optimized for implementation on (multiple) FPGA(s);
- an "apples-to-apples" benchmark against existing GPU-based implementations (goal: 10x lower power consumption).

**Approach**

- pre-study: create a reference solution for the pulsar-search algorithm, hopefully with help from colleagues involved in link;
- pre-study: design a dataflow graph for one (or more) kernels of the pulsar-search algorithm;
- explore the limitations of GPU implementations and the potential of FPGA implementations (and vice versa);
- design a scalable dataflow graph for pulsar search;
- demonstrate this graph on an FPGA, using real radio-astronomy data;
- perform benchmarking.

**Results**

- a scalable dataflow graph for pulsar search;
- benchmarking FPGA versus GPU for pulsar search.

**Background**

- Molecular dynamics is a computer simulation method for studying the physical movements of atoms and molecules, and is thus a type of N-body simulation.
- The Lennard-Jones potential is a mathematically simple model that approximates the interaction between a pair of neutral atoms or molecules.
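For reference, the potential is V(r) = 4 eps ((sigma/r)^12 - (sigma/r)^6); a minimal Python sketch of the pair evaluation an MD kernel would perform per interacting atom pair (names are ours):

```
# Lennard-Jones pair potential (textbook formula; sketch, not MD code):
#   V(r) = 4 * eps * ((sigma/r)**12 - (sigma/r)**6)

def lj_potential(r, eps=1.0, sigma=1.0):
    sr6 = (sigma / r) ** 6          # (sigma/r)^6, reused for the ^12 term
    return 4.0 * eps * (sr6 * sr6 - sr6)

# V(sigma) = 0; the minimum V = -eps lies at r = 2**(1/6) * sigma.
print(lj_potential(1.0), round(lj_potential(2 ** (1 / 6)), 6))  # 0.0 -1.0
```

In an N-body setting this evaluation (and the corresponding force) is repeated for every pair within a cutoff radius, which is the parallelism an FPGA dataflow accelerator would exploit.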

**Goal**

- acceleration of molecular dynamics (assuming the Lennard-Jones potential)
- .. by means of an FPGA co-processor programmed by a dataflow-graph;
- an improved understanding of the "dataflow-graph as accelerator" software architecture.

**Approach**

- pre-study: create a reference solution for a molecular-dynamics simulator (preferably based on existing work);
- pre-study: design a dataflow graph for one (or more) molecular-dynamics kernels;
- analyze available (forms of) parallelism, both qualitatively and quantitatively;
- define a suitable accelerator and specify the host-accelerator interface;
- program this accelerator as a dataflow graph;
- implement and demonstrate this accelerator on an FPGA board;
- perform benchmarking.

**Results**

- a simple dataflow-based molecular-dynamics simulator that can be scaled up to multiple (large) FPGAs;
- a demonstrator on a medium-scale FPGA;
- a performance model (simulation throughput versus available resources);
- a template for using a dataflow-graph as an accelerator.