GPU strong and branching bisimilarity checking

This page is dedicated to checking strong and branching bisimilarity of Labelled Transition Systems (LTSs) entirely on an NVIDIA GPU with CUDA compute capability 2.0 or higher. Bisimilarity checking is a useful computation for offline model checking, in which the state space is fully known and stored on disk, and temporal properties must be checked against it.

Download links for both the tools and a set of models for experimentation are provided on this page.
To generate the state spaces of the models involved directly on a CPU, one of the model checking toolsets mCRL2, CADP, or PRISM is required, depending on the particular model.

In the paper, strong and branching bisimilarity checking on the CPU is done with the LTSmin toolset, in particular the ltsmin-reduce-dist tool.

REQUIRED PACKAGES: CUDA driver + SDK. To produce the state spaces of the models used on a CPU: CADP, mCRL2, and PRISM.

SOFTWARE: The tools required for GPU bisimilarity checking can be downloaded here.

Models and State Spaces

Below, we present the characteristics of the analysed state spaces, together with a brief description of each model and how to obtain it.

Models | # States | # Transitions | # Tau-transitions | # Labels | Max. branching factor | # Blocks after strong bisim. checking | # Blocks after branching bisim. checking | Model description | Source | Download link
---|---|---|---|---|---|---|---|---|---|---
BRP250 | 219,560,265 | 266,598,503 | 250,622,012 | 751 | 249 | 101,251,152 | 62,506 | Bounded Retransmission Protocol with packet length 250, 250 retransmissions | CADP toolbox | BRP250
BRP250 h2 | 219,311,020 | 266,248,888 | 261,265,415 | 377 | 62,498 | n.a. | 19,347 | BRP250 + hiding | CADP toolbox | -
coin8.3 | 87,887,872 | 583,015,424 | 231,413,760 | 34 | 16 | 20,170,219 | n.a. | Shared coin protocol from the randomised consensus algorithm of Aspnes and Herlihy | PRISM model checker | coin8.3
cwi_33949 | 33,949,609 | 165,318,222 | 74,133,306 | 31 | 17 | 122,035 | n.a. | Unknown | VLTS benchmarks | cwi_33949
cwi_33949 h1 | 33,946,699 | 165,312,102 | 741,127,186 | 315 | 256 | n.a. | 12,463 | cwi_33949 + hiding | VLTS benchmarks | -
cwi_33949 h2 | 32,941,081 | 158,945,770 | 113,385,586 | 17 | 303,973 | n.a. | 3,181 | cwi_33949 + hiding | VLTS benchmarks | -
cwi_7838 | 7,838,608 | 59,101,007 | 22,842,122 | 20 | 13 | 966,470 | 62,031 | Unknown | VLTS benchmarks | cwi_7838
diningcrypt14 | 18,378,370 | 164,329,284 | 0 | 71 | 14 | 18,378,370 | n.a. | Dining Cryptographers (14 cryptographers) | mCRL2 toolset | diningcrypt14
diningcrypt14 h2 | 1,761,930 | 15,721,478 | 4,186,011 | 37 | 17 | n.a. | 497,493 | diningcrypt14 + hiding | mCRL2 toolset | -
firewire_dl_800.36 | 129,267,079 | 293,634,127 | 0 | 16 | 5 | 33,862,719 | n.a. | Full firewire protocol with integer semantics, deadline = 800, delay = 36 | PRISM model checker | firewire_dl_800.36
firewire_dl_800.36 h2 | 129,267,079 | 293,634,127 | 80,538,644 | 10 | 5 | n.a. | 25,566,321 | firewire_dl_800.36 + hiding | PRISM model checker | -
mutualex7.13 | 76,217,344 | 653,708,608 | 0 | 148 | 14 | 76,217,344 | n.a. | Mutual Exclusion protocol, 7 processes, 13 states per process | PRISM model checker | mutualex7.13
mutualex7.13 h1 | 76,217,344 | 612,570,560 | 119,111,552 | 112 | 14 | n.a. | 41,140,224 | mutualex7.13 + hiding | PRISM model checker | -
mutualex7.13 h2 | 76,217,344 | 561,982,176 | 227,433,696 | 76 | 14 | n.a. | 31,761,120 | mutualex7.13 + hiding | PRISM model checker | -
SCSI_C_6 | 73,570,112 | 403,891,758 | 0 | 38 | 13 | 73,570,112 | n.a. | SCSI controller with 6 disks | CADP toolbox | SCSI_C_6
SCSI_C_6 h1 | 73,570,112 | 403,357,405 | 164,184,383 | 29 | 12 | n.a. | 41,140,224 | SCSI_C_6 + hiding | CADP toolbox | -
SCSI_C_6 h2 | 73,570,112 | 403,357,405 | 183,745,076 | 20 | 12 | n.a. | 31,761,120 | SCSI_C_6 + hiding | CADP toolbox | -
Szymanski5 | 79,518,740 | 375,297,913 | 113,335,720 | 90 | 5 | 31,271,358 | n.a. | Szymanski Mutual Exclusion protocol, instance 5 | BEEM benchmarks (translated to mCRL2) | Szymanski5
Szymanski5 h1 | 79,518,740 | 375,297,913 | 113,335,720 | 90 | 5 | n.a. | 11,611,589 | Szymanski5 + hiding | BEEM benchmarks (translated to mCRL2) | -
Szymanski5 h2 | 79,518,740 | 375,297,913 | 274,382,458 | 46 | 5 | n.a. | 1,270,702 | Szymanski5 + hiding | BEEM benchmarks (translated to mCRL2) | -
vasy_11026 | 11,026,932 | 24,660,513 | 2,748,559 | 119 | 13 | 882,341 | 775,618 | Unknown | VLTS benchmarks | vasy_11026
vasy_12323 | 12,323,703 | 27,667,803 | 3,153,502 | 119 | 13 | 996,774 | 876,944 | Unknown | VLTS benchmarks | vasy_12323
vasy_6020 | 6,020,550 | 19,353,474 | 17,526,144 | 51 | 12 | 607,168 | 256 | Unknown | VLTS benchmarks | vasy_6020
vasy_6120 | 6,120,718 | 11,031,292 | 3,152,976 | 125 | 16 | 6,492 | 349 | Unknown | VLTS benchmarks | vasy_6120
vasy_8082 | 8,082,905 | 42,933,110 | 2,535,944 | 211 | 48 | 408 | 290 | Unknown | VLTS benchmarks | vasy_8082
vasy_8082 h2 | 8,082,905 | 42,933,110 | 22,487,495 | 107 | 48 | n.a. | 152 | vasy_8082 + hiding | VLTS benchmarks | -
zeroconf_dl.F.200.1k | 118,608,961 | 273,464,630 | 386 | 16 | 10 | 65,623,364 | n.a. | IPv4 Zeroconf protocol, reset = false, T = 200, N = 1000, K = 6 | PRISM model checker | zeroconf_dl.F.200.1k
zeroconf_dl.F.200.1k h1 | 118,608,961 | 273,464,244 | 0 | 16 | 10 | n.a. | 65,623,364 | zeroconf_dl.F.200.1k + hiding | PRISM model checker | -


In the table, "n.a." stands for "not applicable", meaning that no corresponding experiment has been performed; for instance, no branching bisimilarity checking has been done for the coin8.3 model. How a state space is obtained depends on the toolset required by the particular model. Assuming that CADP, mCRL2, and PRISM have all been installed in standard paths, each state space can be generated from the corresponding model by running the script called generate_statespace, which is included in all the archives. For the PRISM models, the resulting files also need to be converted to the .aut format; for this, the GPUreduce tool package includes a Python script called mdp2aut. Applying it to the state space files converts them to the .aut format.
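
For reference, the .aut (Aldebaran) format used throughout this page is plain text: a header line 'des (<initial state>, <number of transitions>, <number of states>)', followed by one line '(<source>, "<label>", <target>)' per transition. As a minimal sketch (the three-transition LTS below is made up purely for illustration), such a file can be written as follows:

  # Write a toy LTS in the .aut (Aldebaran) format.
  transitions = [(0, "send", 1), (1, "tau", 0), (1, "receive", 2)]
  num_states = 3
  with open("toy.aut", "w") as f:
      f.write("des (0, %d, %d)\n" % (len(transitions), num_states))
      for src, label, dst in transitions:
          f.write('(%d, "%s", %d)\n' % (src, label, dst))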

For the VLTS cases, i.e. the ones prefixed 'vasy_' and 'cwi_', direct links to the state spaces themselves are given. These state spaces are, however, in the '.bcg' format. To convert them to the required '.aut' format, use the CADP toolbox: for a .bcg file named A, run 'bcg_io A.bcg A.aut'.
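
Assuming the CADP tools are on the PATH, the conversion can be scripted, for instance in Python; the file locations are hypothetical, and only the 'bcg_io A.bcg A.aut' invocation is taken from above:

  import glob, subprocess

  # Convert every .bcg file in the current directory to .aut using CADP's bcg_io.
  for bcg in glob.glob("*.bcg"):
      subprocess.run(["bcg_io", bcg, bcg[:-len(".bcg")] + ".aut"], check=True)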

On most of the state spaces, we also applied hiding, to analyse how branching bisimilarity checking performs for varying numbers of tau-transitions. In the tables, the resulting state spaces are marked h1 or h2. The GPUreduce tool package contains a tool called produce_hidden_auts, which deterministically selects fixed percentages of the labels in a state space and hides them, i.e. renames them to tau. After that, tau-compression is performed to ensure that the result contains no tau-cycles.
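
The following Python sketch illustrates the hiding step only; it deterministically hides the first half of the sorted label set. The 50% fraction, the selection order, and the assumption that labels are quoted are our own illustration choices, and the tau-compression performed by produce_hidden_auts is not shown:

  import re

  TRANSITION = re.compile(r'\((\d+)\s*,\s*"(.*)"\s*,\s*(\d+)\)')

  def hide_labels(path_in, path_out, fraction=0.5):
      # Rename the first `fraction` of the (sorted) visible labels to tau.
      with open(path_in) as f:
          header = f.readline()
          rows = [TRANSITION.match(line).groups() for line in f if line.strip()]
      labels = sorted({a for _, a, _ in rows if a != "tau"})
      hidden = set(labels[:int(len(labels) * fraction)])
      with open(path_out, "w") as f:
          f.write(header)
          for s, a, t in rows:
              f.write('(%s, "%s", %s)\n' % (s, "tau" if a in hidden else a, t))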

To prepare a .aut file for the GPU, autrelabel should first be applied to it. This tool renames all action labels to integers and sorts the outgoing transitions of each state by label; the latter is a precondition of the GPU bisimilarity checking tool.
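
A minimal Python sketch of this preprocessing is given below; mapping tau to index 0 and writing the integer labels back in quoted form are assumptions, as the exact encoding used by autrelabel is not documented here:

  import re
  from collections import defaultdict

  TRANSITION = re.compile(r'\((\d+)\s*,\s*"(.*)"\s*,\s*(\d+)\)')

  def relabel_and_sort(path_in, path_out):
      # Map labels to integers and sort each state's outgoing transitions by label.
      with open(path_in) as f:
          header = f.readline()
          rows = [TRANSITION.match(line).groups() for line in f if line.strip()]
      ids = {"tau": 0}  # assumption: tau is mapped to 0
      for _, a, _ in rows:
          ids.setdefault(a, len(ids))
      succ = defaultdict(list)
      for s, a, t in rows:
          succ[int(s)].append((ids[a], int(t)))
      with open(path_out, "w") as f:
          f.write(header)
          for s in sorted(succ):
              for a, t in sorted(succ[s]):  # outgoing transitions sorted by label
                  f.write('(%d, "%d", %d)\n' % (s, a, t))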

Finally, the bisimilarity checking tool gpureduce can be run as follows:

gpureduce <relabelled .aut file> [-br]

The -br argument is optional, and selects branching bisimilarity, as opposed to the (default) strong bisimilarity.
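
For example, to check branching bisimilarity on a relabelled state space stored in a (hypothetically named) file szymanski5_relabelled.aut:

gpureduce szymanski5_relabelled.aut -br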

Experimental Results

The table below presents our runtime measurements (in seconds) for the strong bisimilarity checking experiments. The setups used are:
  • GPUreduce: Our strong bisimilarity checking tool, with multi-way splitting (illustrated in the sketch after this list);
  • GPUreduce ss: The same tool, but without multi-way splitting;
  • GPU LR: A CUDA-implementation of the algorithm by Lee and Rajasekaran (see A Parallel Algorithm for Relational Coarsest Partition Problem and Its Implementation in the proceedings of CAV'94);
  • LTSmin: The distributed bisimilarity checking tool ltsmin-reduce-dist running on a single machine with 4 cores, 16 cores, and 32 cores.
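
To illustrate what multi-way splitting means here: in signature-based partition refinement, the signature of a state is the set of (label, target block) pairs of its outgoing transitions, and multi-way splitting splits every block into as many sub-blocks as it contains distinct signatures in a single pass, rather than splitting with respect to one splitter block at a time. The following sequential Python sketch of signature-based strong bisimilarity minimisation is our own illustration of the principle, not the GPU algorithm:

  from collections import defaultdict

  def strong_bisim_blocks(num_states, transitions):
      # transitions: list of (source, label, target) triples.
      succ = defaultdict(list)
      for s, a, t in transitions:
          succ[s].append((a, t))
      block = [0] * num_states  # initial partition: one block containing all states
      while True:
          ids, new_block = {}, []
          for s in range(num_states):
              # A signature pairs the state's current block with its outgoing behaviour.
              sig = (block[s], frozenset((a, block[t]) for a, t in succ[s]))
              # Multi-way splitting: every distinct signature becomes its own block.
              new_block.append(ids.setdefault(sig, len(ids)))
          if len(ids) == len(set(block)):  # no block was split: partition is stable
              return new_block
          block = new_block

  # States 1 and 2 are strongly bisimilar; prints [0, 1, 1, 2].
  print(strong_bisim_blocks(4, [(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "b", 3)]))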


The hardware setups for the experiments were as follows:

For the GPU experiments, we used a machine with an Intel Xeon E5-2620 2.0 GHz CPU, running CentOS Linux, equipped with an NVIDIA K20m GPU. The latter contains 13 Streaming Multiprocessors, each having 192 Streaming Processors, and 5 GB of global memory.

For the CPU experiments, we used a machine with an AMD Opteron 6172 CPU with 48 cores, 192 GB RAM, running Debian 6.0.7.

In the two tables below, '-' entries indicate that the experiment ran out of memory, either in the global memory of the GPU or in the RAM of the CPU, depending on the type of experiment. Entries 'o.o.t.' indicate that the experiment ran out of time; as a time-bound, we selected 14 hours.

Models | GPUreduce | GPUreduce ss | GPU LR | LTSmin 4 cores | LTSmin 16 cores | LTSmin 32 cores
---|---|---|---|---|---|---
BRP250 | 45930.24 | o.o.t. | - | 91838.97 | 48466.23 | 36434.32
coin8.3 | 850.42 | 2539.22 | 17504.45 | 2890.93 | 1167.79 | 204.39
cwi_33949 | 612.13 | 3388.60 | - | 195.58 | 77.47 | 44.32
cwi_7838 | 3853.55 | 6593.64 | - | 601.85 | 321.37 | 305.11
diningcrypt14 | 2.48 | 3.45 | - | 119.31 | 26.11 | 12.31
firewire_dl_800.36 | 5493.23 | 20533.25 | - | 10313.89 | 6189.53 | 3515.35
mutualex7.13 | 230.44 | 1034.24 | - | 1215.67 | 243.38 | 141.68
SCSI_C_6 | 329.44 | 2395.15 | - | 897.45 | 348.52 | 204.19
Szymanski5 | 940.36 | 5302.11 | - | 2600.08 | 1041.34 | 600.98
vasy_11026 | 84.83 | 10644.68 | - | 64.75 | 34.21 | 30.20
vasy_12323 | 114.75 | 13449.89 | - | 51.88 | 28.17 | 23.46
vasy_6020 | 7.07 | 379.71 | 325.05 | 3.11 | 1.28 | 1.25
vasy_6120 | 117.88 | 323.21 | 349.76 | - | - | -
vasy_8082 | 9.10 | 12.01 | 46.90 | 6.94 | 3.23 | 2.71
zeroconf_dl.F.200.1k | 3506.22 | o.o.t. | - | 7372.29 | 3367.13 | 3250.99


Finally, the table below presents our measured runtimes (in seconds) for the branching bisimilarity experiments. GPUreduce ss and GPU LR are no longer included: the former because the strong bisimilarity results already demonstrate how multi-way splitting affects the runtimes, the latter because the LR algorithm is limited to checking strong bisimilarity.

Models | GPUreduce | LTSmin 4 cores | LTSmin 16 cores | LTSmin 32 cores
---|---|---|---|---
BRP250 | 24506.33 | 78371.31 | 28129.97 | 15858.16
BRP250 h2 | 18543.24 | 57508.73 | 21262.87 | 14811.01
cwi_33949 h1 | 1179.72 | 331.01 | 150.31 | -
cwi_33949 h2 | 5483.33 | 9100.98 | 6514.36 | -
cwi_7838 | 8618.21 | 823.71 | 422.30 | 366.89
diningcrypt14 h2 | 21.99 | 11.78 | 8.43 | 4.26
firewire_dl_800.36 h2 | 5483.23 | 16429.85 | 6330.45 | 3337.10
mutualex7.13 h1 | 1302.44 | 5275.08 | 1606.99 | 731.22
mutualex7.13 h2 | 1740.33 | 5487.90 | 2062.03 | 1030.60
SCSI_C_6 h1 | 940.22 | 2733.17 | 1011.12 | 557.64
SCSI_C_6 h2 | 1024.52 | 3349.61 | 1320.08 | 772.86
Szymanski5 h1 | 2496.78 | 5726.88 | 2516.11 | 1602.57
Szymanski5 h2 | 2506.33 | 6278.68 | 2777.10 | 1709.69
vasy_11026 | 31.56 | 107.18 | 45.10 | 35.48
vasy_12323 | 67.27 | 87.27 | 37.70 | 28.23
vasy_6020 | 5.43 | 4.46 | 1.52 | -
vasy_6120 | 146.00 | - | - | -
vasy_8082 | 2.15 | 14.31 | 4.98 | 3.34
vasy_8082 h2 | 8.61 | 27.12 | 10.26 | 6.27
zeroconf_dl.F.200.1k h1 | 4305.22 | 7182.13 | 3617.22 | 4560.49


Some nice observations can be made. Firstly, the GPU bisimilarity checking tool requires an a priori known amount of memory, namely 2·|S| + 2·|T|, with S the set of states and T the set of transitions of the LTS. For LTSmin, which stores complete signatures in memory, this is not the case. In the case of vasy_6120, this means that the GPU approach can perform the entire computation in 5 GB, whereas LTSmin does not manage it with 192 GB. This is possible because repeatedly reconstructing the signatures is not too harmful in a many-core setting, in which many thousands of threads are run, whereas with up to 32 threads it would result in a serious runtime penalty.
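
As a back-of-the-envelope illustration of the formula (assuming, for the sake of the example only, that each of the 2·|S| + 2·|T| entries is a single 32-bit integer; the actual entry size used by GPUreduce is not documented here):

  S = 6_120_718   # states of vasy_6120
  T = 11_031_292  # transitions of vasy_6120
  entries = 2 * S + 2 * T
  print("%.0f MB" % (entries * 4 / 1e6))  # roughly 137 MB, comfortably within 5 GB
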

Secondly, in some cases, the GPU tool is the slowest of all. Intuitively, this depends on the number of tau-transitions and the maximum branching factor of an LTS, since those influence how many inert paths there are potentially, and therefore how much checking needs to be done in each partition refinement step to detect them. However, a clear relation between those LTS characteristics and the measured runtimes remains to be identified.