
MSc assignments contributing to

the StaccatoLab program


Author: Kees van Berkel,

• part-time full professor of SAN (System Architecture and Networking),
• within the Department of Mathematics and Computer Science of Eindhoven University of Technology,
• email: c.h.v.berkel@tue.nl

Date: 2018 January 3

Introduction

• As background and motivation for the assignments below, see the StaccatoLab research program.
• All assignments include a pre-study phase (5 ECTS), assuming the Embedded Systems program.
• Successful completion of the VLSI Programming course (2IMN35) is a considerable advantage, but not required.
• Experience with Python programming is a considerable advantage, but not required.
• The list of MSc assignments below may not be up to date; feel free to contact me for the latest status.

# Mapping dataflow graphs on multiple FPGAs

Background

• Large dataflow graphs may take advantage of multiple FPGAs, ultimately up to thousands of FPGAs.
• Inter-FPGA communication of dataflow tokens is much more involved than intra-FPGA communication. There are multiple options for inter-FPGA communication, involving different resources, with different constraints, and with different performance impacts.

Goal

• dataflow models of inter-FPGA traffic (latency, slack, throughput), for different options: FPGA-FPGA, FPGA-DRAM-FPGA, and FPGA-DRAM-network-DRAM-FPGA;
• tool support for dataflow graph partitioning, user-directed and/or automated.
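As a starting point for such models, a classical result for single-rate dataflow graphs applies: steady-state throughput is bounded by the minimum, over all cycles, of (tokens on the cycle) / (total latency of the cycle). A brute-force Python sketch of this bound; the edge representation and names are illustrative, not the StaccatoLab API:

```python
def throughput_bound(edges):
    """Throughput bound of a single-rate dataflow graph.

    edges: list of (src, dst, tokens, latency) tuples.
    Returns min over simple cycles of tokens/latency (firings per time
    unit), or None if the graph is acyclic.
    """
    succ = {}
    for s, d, t, l in edges:
        succ.setdefault(s, []).append((d, t, l))
    best = None

    def dfs(start, node, tokens, latency, visited):
        nonlocal best
        for d, t, l in succ.get(node, []):
            if d == start:  # closed a simple cycle back to the start node
                ratio = (tokens + t) / (latency + l)
                best = ratio if best is None else min(best, ratio)
            elif d not in visited:
                dfs(start, d, tokens + t, latency + l, visited | {d})

    for n in list(succ):
        dfs(n, n, 0, 0, {n})
    return best
```

Adding inter-FPGA latency to an edge that lies on a cycle lowers this bound unless extra slack (initial tokens) is added on that cycle — one reason the partitioning problem must pay special attention to graph cycles.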

Approach

• pre-study: design example dataflow graph(s);
• pre-study: develop inter-FPGA traffic models, based both on theory and on experiments;
• pre-study: define the dataflow graph partitioning problem (nb. graph cycles!);
• analyze the performance impacts of a given dataflow-graph partitioning;
• develop tool support for user-directed graph partitioning;
• evaluate existing graph partitioning algorithms;
• select/modify a graph partitioning algorithm and implement it in Python;
• evaluate the partitioning algorithm on selected dataflow graphs.
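The partitioning steps above could start from something as simple as a greedy cut-minimization pass over a two-FPGA partition. A minimal sketch, assuming per-edge token rates; names are illustrative, not the StaccatoLab API:

```python
def cut_traffic(edges, partition):
    """Total token traffic on edges that cross the FPGA boundary.

    edges: list of (src, dst, tokens_per_iteration).
    partition: dict node -> FPGA id (0 or 1).
    """
    return sum(rate for src, dst, rate in edges
               if partition[src] != partition[dst])

def greedy_improve(edges, partition):
    """One-node-at-a-time greedy moves until no move reduces the cut.

    Sketch only: it ignores balance and per-FPGA resource budgets,
    which a real partitioner must respect.
    """
    best = cut_traffic(edges, partition)
    improved = True
    while improved:
        improved = False
        for node in list(partition):
            partition[node] ^= 1          # tentatively move the node
            cost = cut_traffic(edges, partition)
            if cost < best:
                best, improved = cost, True
            else:
                partition[node] ^= 1      # revert the move
    return partition, best
```

A real partitioner must additionally account for the throughput impact of cut edges that lie on graph cycles, per the pre-study step above.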

Results

• dataflow models for inter-FPGA traffic;
• a study of relevant dataflow graph partitioning algorithms;
• a tool for user-directed graph partitioning;
• a tool that analyzes the performance impact of dataflow graph partitioning;
• a tool that proposes a dataflow partitioning with minimal impact on the overall graph throughput.

# StaccatoLab simulator acceleration

Background

• The current StaccatoLab dataflow graph simulator is programmed in Python, but offers a throughput of only about 100k events (token productions) per CPU second (on a Mac notebook) for audio and image processing applications.
• Running a dataflow graph on a large FPGA should easily reach 100M - 100G events/sec, but offers limited support for debugging.

Goal

• support the development and debugging of large dataflow graphs,
• .. by means of a simulator with a throughput of 1-10M events/sec on a laptop,
• .. while retaining (most of) the current (Jupyter-based) debug support.

Approach

• pre-study: analyze throughput limitations of the existing simulator;
• pre-study: explore the potential of compiled-C code alternatives;
• develop a compiled-C simulator for single-rate dataflow-graphs;
• interface this simulator to current Python-based StaccatoLab graphs and the Jupyter interfaces;
• evaluate throughput gains;
• optimize and extend this simulator to cyclo-static dataflow graphs.
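The core of a single-rate dataflow simulator is small; the challenge in this assignment is making it fast. A minimal Python sketch of the single-rate firing rule (a node fires when every input edge holds a token; each token production counts as one event). Names are illustrative, not the StaccatoLab API:

```python
def simulate(graph, initial_tokens, max_events):
    """Execute a single-rate dataflow graph until max_events.

    graph: dict node -> list of successor nodes (one edge per pair).
    initial_tokens: dict (src, dst) -> initial token count.
    Returns the number of token productions (events) executed.
    """
    tokens = dict(initial_tokens)
    preds = {n: [] for n in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            tokens.setdefault((src, dst), 0)
            preds[dst].append(src)
    events = 0
    progress = True
    while progress and events < max_events:
        progress = False
        for node in graph:
            # single-rate firing rule: one token on every input edge
            if all(tokens[(p, node)] > 0 for p in preds[node]):
                for p in preds[node]:
                    tokens[(p, node)] -= 1
                for d in graph[node]:
                    tokens[(node, d)] += 1
                    events += 1
                progress = True
    return events
```

A compiled-C version of this loop, with pre-resolved edge indices instead of dictionary lookups, is where the targeted one-to-two orders of magnitude would have to come from.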

Results

• a simulator extension offering a speed-up by one or two orders of magnitude.

# Abstraction of dataflow sub-graphs as dataflow nodes

Background

• Can the behavior of a dataflow subgraph be modeled by an "equivalent" single dataflow node? What kind of dataflow subgraphs can be modeled by an "equivalent" cyclo-static dataflow node?
• A practical value of such abstraction could be in replacing a subgraph by an equivalent node and thereby speeding up dataflow simulation.
• From an application viewpoint it may be sufficient to require "data equivalence": the node produces the same output data stream(s), given the same input streams.
• When the subgraph operates in a more dynamic environment, "flow equivalence" may (or may not) be important.

Goal

• a description of dataflow sub-graph categories ("patterns") that can be transformed into a CSDF node;
• a tool that supports such abstraction transformation;
• a demonstration on e.g. a deeply-pipelined 1D FFT as part of a 2D FFT.

Approach

• pre-study: design a collection of single-rate dataflow subgraphs with a data-equivalent CSDF node, and analyze their flow behaviors;
• pre-study: study relevant literature on forms of program equivalence;
• define flow equivalence, assuming the StaccatoLab execution model;
• specify several classes of dataflow sub-graphs for which flow-equivalent nodes can be generated;
• develop a tool that generates a StaccatoLab node description (finite state machine + output functions) for such a class of dataflow subgraphs.
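For reference, a CSDF node can be viewed as exactly such a finite state machine: the phase index is the state, and each phase has its own consumption and production rates. A minimal sketch with a single input and output edge (illustrative interface, not StaccatoLab's node description):

```python
class CSDFNode:
    """A cyclo-static dataflow node: phase i consumes consume[i] tokens
    from its input edge and produces produce[i] tokens on its output
    edge, then advances to the next phase (cyclically)."""

    def __init__(self, consume, produce):
        assert len(consume) == len(produce)
        self.consume, self.produce = consume, produce
        self.phase = 0  # the finite-state-machine state

    def can_fire(self, tokens_in):
        return tokens_in >= self.consume[self.phase]

    def fire(self, tokens_in):
        assert self.can_fire(tokens_in)
        c, p = self.consume[self.phase], self.produce[self.phase]
        self.phase = (self.phase + 1) % len(self.consume)
        return tokens_in - c, p  # (remaining input tokens, tokens produced)
```

The abstraction tool would have to generate such a phase schedule (plus output functions) from a given subgraph, such that the node is flow-equivalent to the subgraph it replaces.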

Results

• some theory + a practical tool + examples.

# A dataflow graph for pulsar search

Background

• A pulsar is a highly magnetized, rotating neutron star or white dwarf. It radiates a beam of electromagnetic emission, much like a lighthouse. Rotation periods range from milliseconds to seconds.
• Searching for pulsars is based on so-called coherent de-dispersion and is extremely computationally intensive. See e.g. link.
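To give a feel for the numbers: the arrival-time delay between two observing frequencies grows linearly with the pulsar's dispersion measure (DM). A back-of-the-envelope sketch using the standard dispersion constant of about 4.1488e3 MHz^2 pc^-1 cm^3 s (the function and its parameter names are illustrative):

```python
def dispersion_delay(dm, f_lo_mhz, f_hi_mhz):
    """Arrival-time delay (seconds) of the low frequency relative to
    the high frequency, for dispersion measure dm in pc cm^-3 and
    observing frequencies in MHz."""
    k_dm = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s
    return k_dm * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)
```

Coherent de-dispersion undoes this frequency-dependent delay within each channel, which is what makes the search so computationally intensive: it must be repeated over many trial DM values.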

Goal

• a scalable dataflow graph for pulsar search;
• .. optimized for implementation on (multiple) FPGA(s);
• an "apples-to-apples" benchmark with existing GPU-based implementations (goal: 10x lower power consumption).

Approach

• pre-study: create a reference solution for the pulsar-search algorithm, hopefully with help from colleagues involved in link;
• pre-study: design a dataflow graph for one (or more) kernels of the pulsar-search algorithm;
• explore the limitations of GPU implementations and the potential of FPGA implementations (and vice versa);
• design a scalable dataflow graph for pulsar search;
• demonstrate this graph on an FPGA, using real radio-astronomy data;
• perform benchmarking.

Results

• a scalable dataflow graph for pulsar search;
• benchmarking FPGA versus GPU for pulsar search.

# A dataflow accelerator for molecular dynamics

Background

• Molecular dynamics is a computer simulation method for studying the physical movements of atoms and molecules, and is thus a type of N-body simulation.
• The Lennard-Jones potential is a mathematically simple model that approximates the interaction between a pair of neutral atoms or molecules.
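For reference, the Lennard-Jones potential is V(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6): strongly repulsive at short range, weakly attractive at long range, with minimum -eps at r = 2^(1/6)*sigma. A minimal sketch in reduced units:

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6).

    Reduced units by default; V(sigma) = 0 and the minimum -epsilon
    lies at r = 2**(1/6) * sigma.
    """
    s6 = (sigma / r) ** 6      # (sigma/r)^6; its square is (sigma/r)^12
    return 4.0 * epsilon * (s6 * s6 - s6)
```

In an N-body simulation this pairwise evaluation dominates the compute load, which is what makes it an attractive target for a deeply pipelined dataflow implementation.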

Goal

• acceleration of molecular dynamics (assuming the Lennard-Jones potential)
• .. by means of an FPGA co-processor programmed by a dataflow-graph;
• improved understanding of the "dataflow graph as accelerator" software architecture.

Approach

• pre-study: create a reference solution for a molecular-dynamics simulator (preferably based on existing work);
• pre-study: design a dataflow graph for one (or more) molecular-dynamics kernels;
• analyze available (forms of) parallelism, both qualitatively and quantitatively;
• define a suitable accelerator and specify the host-accelerator interface;
• program this accelerator as a dataflow graph;
• implement and demonstrate this accelerator on an FPGA board;
• perform benchmarking.

Results

• a simple dataflow-based molecular-dynamics simulator that can be scaled up to multiple (large) FPGAs;
• a demonstrator on a medium-scaled FPGA;
• a performance model (simulation throughput versus available resources);
• a template for using a dataflow-graph as an accelerator.