$~$

MSc assignments contributing to

the StaccatoLab program

$~$

Author: Kees van Berkel,

part-time full professor of SAN (System Architecture and Networking),
within the department Computer Science and Mathematics of the Technical University Eindhoven,
email: c.h.v.berkel@tue.nl

Date: 2018 January 3

Introduction

As background and motivation for the assingments below: see the StaccatoLab research program.
All assignments include a pre-study phase (5 ects), assuming the Embedded Systems program.
Successful completion of the VLSI Programming course (2IMN35) is a considerable advantage, but not required.
Experience with Python programming is a considerable advantage, but not required.
The list of MSc assignments below may not be up to date. Feel free to contact me on the latest status.

Table of Contents¶

1 Mapping dataflow graphs on multiple FPGAs

2 StaccatoLab simulator acceleration

3 Abstraction of dataflow sub-graphs as dataflow nodes

4 A dataflow graph for pulsar search

5 A dataflow accelerator for molecular dynamics

Mapping dataflow graphs on multiple FPGAs¶

Background

Large dataflow graphs may take advantage of multiple FPGAs, ultimately up to thousands of FPGAs.
Inter-FPGA communication of dataflow tokens is much more involved than intra-FPGA communication. There are multiple options for inter-FPGA communication, involving different resources, with different constraints, and with different performance impacts.

Goal

dataflow models of inter-FPGA traffic (latency, slack, throughput), for different options: FPGA-FPGA, FPGA-DRAM-FPGA, and FPGA-DRAM-network-DRAM-FPGA;
tool support for dataflow graph partitioning, user-directed and/or automated.

Approach

pre-study: design example dataflow graph(s);
pre-study: develop inter-FPGA traffic models, based both on theory and on experiments;
pre-study: define the dataflow graph partitioning problem (nb. graph cycles!);
analyze the performance impacts of a given dataflow-graph partitioning;
develop tool support for user-directed graph partitioning;
evaluate existing graph partitioning algorithms;
select/modify a graph partitioning algorithm and implement it in Python;
evaluate the partitioning algorithm on selected dataflow graphs.

Results

dataflow models for inter-FPGA traffic;
a study of relevant dataflow graph partitioning algorithms;
a tool for user-directed graph partitioning;
a tool that analyzes the performance impact of dataflow graph partitioning;
a tool that proposes a dataflow partitioning with minimal impact on the overall graph throughput;

StaccatoLab simulator acceleration¶

Background

The current StaccatoLab dataflow graph simulator is programmed in Python, but offers a throughput of only about 100k events (token productions) per cpu second (on a Mac notebook), for audio and image processing applications.
Running a dataflow graph on a large FPGA should easily reach 100M - 100G events/sec, but offers limited support for debugging.

Goal

support the development and debugging of large dataflow graphs,
.. by means of a simulator with a throughput of 1-10M events/sec on a laptop,
.. while retaining (most of) the current (Jupyter-based) debug support.

Approach

pre-study: analyze throughput limitations of the existing simulator;
pre-study: explore the potential of compiled-C code alternatives;
develop a compiled-C simulator for single-rate dataflow-graphs;
interface this simulator to current Python-based StaccatoLab graphs and the Jupyter interfaces;
evaluate throughput gains;
optimize and extend this simulator to cyclo-static dataflow graphs.

Results

a simulator extension offering a speed-up by one or two orders of magnitude.

Abstraction of dataflow sub-graphs as dataflow nodes¶

Background

Can the behavior of a dataflow subgraph be modeled by an "equivalent" single dataflow node? What kind of dataflow subgraphs can be modeled by an "equivalent" cyclo-static dataflow node?
A practical value of such abstraction could be in replacing a subgraph by an equivalent node and thereby speeding up dataflow simulation.
From an application viewpoint it may be sufficient to require "data equivalence": the node produces the same output data stream(s), given the same input streams.
When the subgraph operates in a more dynamic environment, "flow equivalence" may (or may not) be important.

Goal

a description of dataflow sub-graph categories ("patterns") that can be transformed into a CSDF node;
a tool that supports such abstraction transformation;
a demonstration on e.g. a deeply-pipelined 1D FFT as part of a 2D FFT.

Approach

pre-study: design a collection of a single-rate dataflow subgraphs with a data-equivalent CSDF node, and analyze their flow behaviors;
pre-study: study relevant literature on forms of program equivalence;
define flow equivalence, assuming the StaccatoLab execution model;
specify several classes of dataflow sub-graphs for which flow-equivalent nodes can be generated;
develop a tool that generates a StaccatoLab node description (finite state machine + output functions) for such a class of dataflow subgraphs.

Results

some theory + a practical tool + examples.

A dataflow graph for pulsar search¶

Background

A pulsar is a highly magnetized, rotating neutron star or white dwarf. It radiates a beam of electro-magnetic emission, much like a light house. Rotation frequencies ranging from milliseconds to seconds.
Searching pulsars is based on so-called coherent de-dispersion, and is extremely computationally intensive. See e.g. link.

Goal

a scalable dataflow graph for pulsar search;
.. optimized for implementation on (multiple) FPGA(s);
an "apple-to-apple" benchmark with existing GPU-based implementations. (Goal: 10x lower power consumption).

Approach

pre-study: create a reference solution for pulsar-search algorithm, hopefully with help from colleagues involved in link;
pre-study: design a dataflow graph for one (or more) kernels of the pulsar-search algorithm;
explore the limitations of GPU implementations and the potential of FPGA implementations (and vv);
design a scalable dataflow graph for pulsar search;
demonstrate this graph on an FPGA, using real radio-astronomy data;
perform benchmarking.

Results

a scalable dataflow graph for pulsar search;
benchmarking FPGA versus GPU for pulsar search.

A dataflow accelerator for molecular dynamics¶

Background

Molecular dynamics is a computer simulation method for studying the physical movements of atoms and molecules, and is thus a type of N-body simulation.
The Lennard-Jones potential is a mathematically simple model that approximates the interaction between a pair of neutral atoms or molecules.

Goal

acceleration of molecular dynamics (assuming the Lennard-Jones potential)
.. by means of an FPGA co-processor programmed by a dataflow-graph;
improved understanding the use of "dataflow-graph as accelerator" software architecture.

Approach

pre-study: create a reference solution for a molecular-dynamics simulator (preferably based on existing work);
pre-study: design a dataflow graph for one (or more) molecular-dynamics kernels;
analyze available (forms of) parallelism, both qualitatively and quantitatively;
define a suitable accelerator and specify the host-accelerator interface;
program this accelerator as a dataflow graph;
implement and demonstrate this accelerator on an FPGA board;
perform benchmarking.

Results

a simple dataflow-based molecular-dynamics simulator that can be scaled up to multiple (large) FPGAs;
a demonstrator on a medium-scaled FPGA;
a performance model (simulation throughput versus available resources);
a template for using a dataflow-graph as an accelerator.