GPU strong and branching bisimilarity checking

This page is dedicated to checking strong and branching bisimilarity of Labelled Transition Systems (LTSs) entirely on an NVIDIA GPU with CUDA compute capability 2.0 or higher. Bisimilarity checking is a useful computation for offline model checking, in which the state space is fully known and stored on disk, and temporal properties must be checked against it.

Download links for both the tools and a set of models for experimentation are provided on this page.
To generate the state spaces of the models involved directly on a CPU, one of the model checking toolsets mCRL2, CADP, or PRISM is required, depending on the particular model.

In the paper, strong and branching bisimilarity checking on the CPU is done with the LTSmin toolset, in particular the ltsmin-reduce-dist tool.

REQUIRED PACKAGES: CUDA driver + SDK. To produce the state spaces of the models used on a CPU: CADP, mCRL2, and PRISM.

SOFTWARE: The tools required for GPU bisimilarity checking can be downloaded here.

Models and State Spaces

Below, we present the characteristics of the analysed state spaces, together with a brief description of each model and how to obtain it.

Models | # States | # Transitions | # Tau-transitions | # Labels | Max. branching factor | # Blocks after strong bisim. checking | # Blocks after branching bisim. checking | Model description | Source | Download link
---|---|---|---|---|---|---|---|---|---|---
BRP250 | 219,560,265 | 266,598,503 | 250,622,012 | 751 | 249 | 101,251,152 | 62,506 | Bounded Retransmission Protocol with packet length 250, 250 retransmissions | CADP toolbox | BRP250
BRP250 h2 | 219,311,020 | 266,248,888 | 261,265,415 | 377 | 62,498 | n.a. | 19,347 | BRP250 + hiding | CADP toolbox | -
coin8.3 | 87,887,872 | 583,015,424 | 231,413,760 | 34 | 16 | 20,170,219 | n.a. | Shared coin protocol from the randomised consensus algorithm of Aspnes and Herlihy | PRISM model checker | coin8.3
cwi_33949 | 33,949,609 | 165,318,222 | 74,133,306 | 31 | 17 | 122,035 | n.a. | Unknown | VLTS benchmarks | cwi_33949
cwi_33949 h1 | 33,946,699 | 165,312,102 | 741,127,186 | 315 | 256 | n.a. | 12,463 | cwi_33949 + hiding | VLTS benchmarks | -
cwi_33949 h2 | 32,941,081 | 158,945,770 | 113,385,586 | 17 | 303,973 | n.a. | 3,181 | cwi_33949 + hiding | VLTS benchmarks | -
cwi_7838 | 7,838,608 | 59,101,007 | 22,842,122 | 20 | 13 | 966,470 | 62,031 | Unknown | VLTS benchmarks | cwi_7838
diningcrypt14 | 18,378,370 | 164,329,284 | 0 | 71 | 14 | 18,378,370 | n.a. | Dining Cryptographers (14 cryptographers) | mCRL2 toolset | diningcrypt14
diningcrypt14 h2 | 1,761,930 | 15,721,478 | 4,186,011 | 37 | 17 | n.a. | 497,493 | diningcrypt14 + hiding | mCRL2 toolset | -
firewire_dl_800.36 | 129,267,079 | 293,634,127 | 0 | 16 | 5 | 33,862,719 | n.a. | Full firewire protocol with integer semantics, deadline = 800, delay = 36 | PRISM model checker | firewire_dl_800.36
firewire_dl_800.36 h2 | 129,267,079 | 293,634,127 | 80,538,644 | 10 | 5 | n.a. | 25,566,321 | firewire_dl_800.36 + hiding | PRISM model checker | -
mutualex7.13 | 76,217,344 | 653,708,608 | 0 | 148 | 14 | 76,217,344 | n.a. | Mutual Exclusion protocol, 7 processes, 13 states per process | PRISM model checker | mutualex7.13
mutualex7.13 h1 | 76,217,344 | 612,570,560 | 119,111,552 | 112 | 14 | n.a. | 41,140,224 | mutualex7.13 + hiding | PRISM model checker | -
mutualex7.13 h2 | 76,217,344 | 561,982,176 | 227,433,696 | 76 | 14 | n.a. | 31,761,120 | mutualex7.13 + hiding | PRISM model checker | -
SCSI_C_6 | 73,570,112 | 403,891,758 | 0 | 38 | 13 | 73,570,112 | n.a. | SCSI controller with 6 disks | CADP toolbox | SCSI_C_6
SCSI_C_6 h1 | 73,570,112 | 403,357,405 | 164,184,383 | 29 | 12 | n.a. | 41,140,224 | SCSI_C_6 + hiding | CADP toolbox | -
SCSI_C_6 h2 | 73,570,112 | 403,357,405 | 183,745,076 | 20 | 12 | n.a. | 31,761,120 | SCSI_C_6 + hiding | CADP toolbox | -
Szymanski5 | 79,518,740 | 375,297,913 | 113,335,720 | 90 | 5 | 31,271,358 | n.a. | Szymanski Mutual Exclusion protocol, instance 5 | BEEM benchmarks (translated to mCRL2) | Szymanski5
Szymanski5 h1 | 79,518,740 | 375,297,913 | 113,335,720 | 90 | 5 | n.a. | 11,611,589 | Szymanski5 + hiding | BEEM benchmarks (translated to mCRL2) | -
Szymanski5 h2 | 79,518,740 | 375,297,913 | 274,382,458 | 46 | 5 | n.a. | 1,270,702 | Szymanski5 + hiding | BEEM benchmarks (translated to mCRL2) | -
vasy_11026 | 11,026,932 | 24,660,513 | 2,748,559 | 119 | 13 | 882,341 | 775,618 | Unknown | VLTS benchmarks | vasy_11026
vasy_12323 | 12,323,703 | 27,667,803 | 3,153,502 | 119 | 13 | 996,774 | 876,944 | Unknown | VLTS benchmarks | vasy_12323
vasy_6020 | 6,020,550 | 19,353,474 | 17,526,144 | 51 | 12 | 607,168 | 256 | Unknown | VLTS benchmarks | vasy_6020
vasy_6120 | 6,120,718 | 11,031,292 | 3,152,976 | 125 | 16 | 6,492 | 349 | Unknown | VLTS benchmarks | vasy_6120
vasy_8082 | 8,082,905 | 42,933,110 | 2,535,944 | 211 | 48 | 408 | 290 | Unknown | VLTS benchmarks | vasy_8082
vasy_8082 h2 | 8,082,905 | 42,933,110 | 22,487,495 | 107 | 48 | n.a. | 152 | vasy_8082 + hiding | VLTS benchmarks | -
zeroconf_dl.F.200.1k | 118,608,961 | 273,464,630 | 386 | 16 | 10 | 65,623,364 | n.a. | IPv4 Zeroconf protocol, reset = false, T = 200, N = 1000, K = 6 | PRISM model checker | zeroconf_dl.F.200.1k
zeroconf_dl.F.200.1k h1 | 118,608,961 | 273,464,244 | 0 | 16 | 10 | n.a. | 65,623,364 | zeroconf_dl.F.200.1k + hiding | PRISM model checker | -


In the table, "n.a." stands for "not applicable", meaning that no corresponding experiment has been performed; for instance, no branching bisimilarity checking has been done for the coin8.3 model. How a state space is obtained depends on the toolset required by the particular model. Assuming that CADP, mCRL2, and PRISM have all been installed in standard paths, each state space can be generated from the corresponding model by running the script called generate_statespace, which is included in all the archives. For the PRISM models, the resulting files also need to be converted to the .aut format; for this, the GPUreduce tool package includes a Python script called mdp2aut. Applying it to the state space files converts them to the .aut format.
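
For reference, the .aut (Aldebaran) format used throughout this page is plain text: a header line 'des (<initial state>, <number of transitions>, <number of states>)', followed by one line '(<source>, "<label>", <target>)' per transition. As a minimal sketch (the three-transition LTS below is made up purely for illustration), such a file can be written as follows:

  # Write a toy LTS in the .aut (Aldebaran) format.
  transitions = [(0, "send", 1), (1, "tau", 0), (1, "receive", 2)]
  num_states = 3
  with open("toy.aut", "w") as f:
      f.write("des (0, %d, %d)\n" % (len(transitions), num_states))
      for src, label, dst in transitions:
          f.write('(%d, "%s", %d)\n' % (src, label, dst))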

For the VLTS cases, i.e. the ones prefixed 'vasy_' and 'cwi_', direct links to the state spaces themselves are given. These state spaces are, however, in the '.bcg' format. To convert them to the required '.aut' format, use the CADP toolbox: for a .bcg file named A, run 'bcg_io A.bcg A.aut'.
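
Assuming the CADP tools are on the PATH, the conversion can be scripted, for instance in Python; the file locations are hypothetical, and only the 'bcg_io A.bcg A.aut' invocation is taken from above:

  import glob, subprocess

  # Convert every .bcg file in the current directory to .aut using CADP's bcg_io.
  for bcg in glob.glob("*.bcg"):
      subprocess.run(["bcg_io", bcg, bcg[:-len(".bcg")] + ".aut"], check=True)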

On most of the state spaces, we also applied hiding, to analyse how branching bisimilarity checking performs for varying numbers of tau-transitions. In the tables, the resulting state spaces are marked h1 or h2. The GPUreduce tool package contains a tool called produce_hidden_auts, which deterministically selects fixed percentages of the labels in a state space and hides them, i.e. renames them to tau. After that, tau-compression is performed to ensure that the result contains no tau-cycles.
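
The following Python sketch illustrates the hiding step only; it deterministically hides the first half of the sorted label set. The 50% fraction, the selection order, and the assumption that labels are quoted are our own illustration choices, and the tau-compression performed by produce_hidden_auts is not shown:

  import re

  TRANSITION = re.compile(r'\((\d+)\s*,\s*"(.*)"\s*,\s*(\d+)\)')

  def hide_labels(path_in, path_out, fraction=0.5):
      # Rename the first `fraction` of the (sorted) visible labels to tau.
      with open(path_in) as f:
          header = f.readline()
          rows = [TRANSITION.match(line).groups() for line in f if line.strip()]
      labels = sorted({a for _, a, _ in rows if a != "tau"})
      hidden = set(labels[:int(len(labels) * fraction)])
      with open(path_out, "w") as f:
          f.write(header)
          for s, a, t in rows:
              f.write('(%s, "%s", %s)\n' % (s, "tau" if a in hidden else a, t))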

To prepare a .aut file for the GPU, autrelabel should first be applied to it. This tool renames all action labels to integers and sorts the outgoing transitions of each state by label; the latter is a precondition of the GPU bisimilarity checking tool.
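
A minimal Python sketch of this preprocessing is given below; mapping tau to index 0 and writing the integer labels back in quoted form are assumptions, as the exact encoding used by autrelabel is not documented here:

  import re
  from collections import defaultdict

  TRANSITION = re.compile(r'\((\d+)\s*,\s*"(.*)"\s*,\s*(\d+)\)')

  def relabel_and_sort(path_in, path_out):
      # Map labels to integers and sort each state's outgoing transitions by label.
      with open(path_in) as f:
          header = f.readline()
          rows = [TRANSITION.match(line).groups() for line in f if line.strip()]
      ids = {"tau": 0}  # assumption: tau is mapped to 0
      for _, a, _ in rows:
          ids.setdefault(a, len(ids))
      succ = defaultdict(list)
      for s, a, t in rows:
          succ[int(s)].append((ids[a], int(t)))
      with open(path_out, "w") as f:
          f.write(header)
          for s in sorted(succ):
              for a, t in sorted(succ[s]):  # outgoing transitions sorted by label
                  f.write('(%d, "%d", %d)\n' % (s, a, t))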

Finally, the bisimilarity checking tool gpureduce can be run as follows:

gpureduce <relabelled .aut file> [-br]

The -br argument is optional, and selects branching bisimilarity, as opposed to the (default) strong bisimilarity.
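
For example, to check branching bisimilarity on a relabelled state space stored in a (hypothetically named) file szymanski5_relabelled.aut:

gpureduce szymanski5_relabelled.aut -br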

Experimental Results

The table below presents our runtime measurements (in seconds) for the strong bisimilarity checking experiments. The setups used are:
  • GPUreduce: Our strong bisimilarity checking tool, with multi-way splitting (illustrated in the sketch after this list);
  • GPUreduce ss: The same tool, but without multi-way splitting;
  • GPU LR: A CUDA-implementation of the algorithm by Lee and Rajasekaran (see A Parallel Algorithm for Relational Coarsest Partition Problem and Its Implementation in the proceedings of CAV'94);
  • LTSmin: The distributed bisimilarity checking tool ltsmin-reduce-dist running on a single machine with 4 cores, 16 cores, and 32 cores.
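
To illustrate what multi-way splitting means here: in signature-based partition refinement, the signature of a state is the set of (label, target block) pairs of its outgoing transitions, and multi-way splitting splits every block into as many sub-blocks as it contains distinct signatures in a single pass, rather than splitting with respect to one splitter block at a time. The following sequential Python sketch of signature-based strong bisimilarity minimisation is our own illustration of the principle, not the GPU algorithm:

  from collections import defaultdict

  def strong_bisim_blocks(num_states, transitions):
      # transitions: list of (source, label, target) triples.
      succ = defaultdict(list)
      for s, a, t in transitions:
          succ[s].append((a, t))
      block = [0] * num_states  # initial partition: one block containing all states
      while True:
          ids, new_block = {}, []
          for s in range(num_states):
              # A signature pairs the state's current block with its outgoing behaviour.
              sig = (block[s], frozenset((a, block[t]) for a, t in succ[s]))
              # Multi-way splitting: every distinct signature becomes its own block.
              new_block.append(ids.setdefault(sig, len(ids)))
          if len(ids) == len(set(block)):  # no block was split: partition is stable
              return new_block
          block = new_block

  # States 1 and 2 are strongly bisimilar; prints [0, 1, 1, 2].
  print(strong_bisim_blocks(4, [(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "b", 3)]))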


The hardware setups for the experiments were as follows:

For the GPU experiments, we used a machine with an Intel Xeon E5-2620 2.0 GHz CPU, running CentOS Linux, equipped with an NVIDIA K20m GPU. The latter contains 13 Streaming Multiprocessors, each having 192 Streaming Processors, and 5 GB of global memory.

For the CPU experiments, we used a machine with an AMD Opteron 6172 CPU with 48 cores, 192 GB RAM, running Debian 6.0.7.

In the two tables below, '-' entries indicate that the experiment ran out of memory, either in the global memory of the GPU or in the RAM of the CPU, depending on the type of experiment. Entries 'o.o.t.' indicate that the experiment ran out of time; as a time-bound, we selected 14 hours.

Models | GPUreduce | GPUreduce ss | GPU LR | LTSmin 4 cores | LTSmin 16 cores | LTSmin 32 cores
---|---|---|---|---|---|---
BRP250 | 45930.24 | o.o.t. | - | 91838.97 | 48466.23 | 36434.32
coin8.3 | 850.42 | 2539.22 | 17504.45 | 2890.93 | 1167.79 | 204.39
cwi_33949 | 612.13 | 3388.60 | - | 195.58 | 77.47 | 44.32
cwi_7838 | 3853.55 | 6593.64 | - | 601.85 | 321.37 | 305.11
diningcrypt14 | 2.48 | 3.45 | - | 119.31 | 26.11 | 12.31
firewire_dl_800.36 | 5493.23 | 20533.25 | - | 10313.89 | 6189.53 | 3515.35
mutualex7.13 | 230.44 | 1034.24 | - | 1215.67 | 243.38 | 141.68
SCSI_C_6 | 329.44 | 2395.15 | - | 897.45 | 348.52 | 204.19
Szymanski5 | 940.36 | 5302.11 | - | 2600.08 | 1041.34 | 600.98
vasy_11026 | 84.83 | 10644.68 | - | 64.75 | 34.21 | 30.20
vasy_12323 | 114.75 | 13449.89 | - | 51.88 | 28.17 | 23.46
vasy_6020 | 7.07 | 379.71 | 325.05 | 3.11 | 1.28 | 1.25
vasy_6120 | 117.88 | 323.21 | 349.76 | - | - | -
vasy_8082 | 9.10 | 12.01 | 46.90 | 6.94 | 3.23 | 2.71
zeroconf_dl.F.200.1k | 3506.22 | o.o.t. | - | 7372.29 | 3367.13 | 3250.99


Finally, the table below presents our measured runtimes (in seconds) for the branching bisimilarity experiments. GPUreduce ss and GPU LR are no longer included: the former because the strong bisimilarity results already demonstrate how multi-way splitting affects the runtimes, the latter because the LR algorithm is limited to checking strong bisimilarity.

Models | GPUreduce | LTSmin 4 cores | LTSmin 16 cores | LTSmin 32 cores
---|---|---|---|---
BRP250 | 24506.33 | 78371.31 | 28129.97 | 15858.16
BRP250 h2 | 18543.24 | 57508.73 | 21262.87 | 14811.01
cwi_33949 h1 | 1179.72 | 331.01 | 150.31 | -
cwi_33949 h2 | 5483.33 | 9100.98 | 6514.36 | -
cwi_7838 | 8618.21 | 823.71 | 422.30 | 366.89
diningcrypt14 h2 | 21.99 | 11.78 | 8.43 | 4.26
firewire_dl_800.36 h2 | 5483.23 | 16429.85 | 6330.45 | 3337.10
mutualex7.13 h1 | 1302.44 | 5275.08 | 1606.99 | 731.22
mutualex7.13 h2 | 1740.33 | 5487.90 | 2062.03 | 1030.60
SCSI_C_6 h1 | 940.22 | 2733.17 | 1011.12 | 557.64
SCSI_C_6 h2 | 1024.52 | 3349.61 | 1320.08 | 772.86
Szymanski5 h1 | 2496.78 | 5726.88 | 2516.11 | 1602.57
Szymanski5 h2 | 2506.33 | 6278.68 | 2777.10 | 1709.69
vasy_11026 | 31.56 | 107.18 | 45.10 | 35.48
vasy_12323 | 67.27 | 87.27 | 37.70 | 28.23
vasy_6020 | 5.43 | 4.46 | 1.52 | -
vasy_6120 | 146.00 | - | - | -
vasy_8082 | 2.15 | 14.31 | 4.98 | 3.34
vasy_8082 h2 | 8.61 | 27.12 | 10.26 | 6.27
zeroconf_dl.F.200.1k h1 | 4305.22 | 7182.13 | 3617.22 | 4560.49


Some nice observations can be made. Firstly, the GPU bisimilarity checking tool requires an a priori known amount of memory, namely 2·|S| + 2·|T|, with S the set of states and T the set of transitions of the LTS. For LTSmin, which stores complete signatures in memory, this is not the case. In the case of vasy_6120, this means that the GPU approach can perform the entire computation in 5 GB, whereas LTSmin does not manage it with 192 GB. This is possible because repeatedly reconstructing the signatures is not too harmful in a many-core setting, in which many thousands of threads are run, whereas with up to 32 threads it would result in a serious runtime penalty.
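
As a back-of-the-envelope illustration of the formula (assuming, for the sake of the example only, that each of the 2·|S| + 2·|T| entries is a single 32-bit integer; the actual entry size used by GPUreduce is not documented here):

  S = 6_120_718   # states of vasy_6120
  T = 11_031_292  # transitions of vasy_6120
  entries = 2 * S + 2 * T
  print("%.0f MB" % (entries * 4 / 1e6))  # roughly 137 MB, comfortably within 5 GB
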

Secondly, in some cases, the GPU tool is the slowest of all. Intuitively, this depends on the number of tau-transitions and the maximum branching factor of an LTS, since those influence how many inert paths there are potentially, and therefore how much checking needs to be done in each partition refinement step to detect them. However, a clear relation between those LTS characteristics and the measured runtimes remains to be identified.