Q: Fitness preservation of fall-throughs in IM-D

Sebastian · April 2021

Hello experts on inductive mining,

here's a conundrum about fitness preservation in the Inductive Miner framework based on directly-follows graphs (IM-D), as presented by Sander Leemans in his Ph. D. thesis.

In section 6.1.1 of his thesis, dealing with the basic log-based IM, Leemans presents tHe following log L69:

<b, c, d, e>

<d, b, e>

<e, b>

<c, b>

<b, d, e, c>

To L69 he applies a parallel cut, and splits the log so that the non-atomic part of the cut is represented by log L71:

<c, d, e>

<d, e>

<e>

<c>

<d, e, c>

That log has a directly-follows graph (DFG71) in which activity d, although it does not occur at the end of a trace, is classified as an end node after the split, on account of having an outgoing connection to activity b in the parent graph.

Leemans states that no cut can be found for DFG71, and using IM he applies a fall-through, namely activityConcurrent. That fall-through takes out d and places it concurrently to the other activities. Moreover, when applying a parallel cut to the remaining part of the log containing c and e, empty traces are introduced into the log (to fill up traces to equal length, I guess).

With IM-D, which does not use the log at all, but only the DFG, we can do neither of these things.

the information for the introduction of empty traces is just not present in the DFG. So even if we could adapt the activityConcurrent fall-through, we'd finally end up with a simple parallelism over c, d, e. That would not preserve fitness.
as Leemans discusses later, the activityConcurrent fall-through is too expensive to adapt to IM-D. The next more specific fall-through that is part of IM-D is strictTauLoop. This fall-through will remove every edge from an end activity to a start activity. In our case, as every activity is both an end and a start activity, it will remove ALL edges between c, d and e, leaving us with three isolated activities, which in turn will be translated to a process tree with an XOR over these activities. This is not fitness preserving either.

I must be getting something wrong, because I trust Leemans' claim that fall-throughs preserve fitness. His thesis has been reviewed many times. But still I do not see how to apply the rules of the IM-D framework in a fitness preserving way to this example. As fitness preservation is said to be one of the main features of inductive mining, I feel bad about this. Can someone please enlighten me?

Sebastian

Sebastian · April 2021

Update regarding ProM 6.10: I'm still more confused by the behavior of ProM. When I import a log L69 from a csv-file, converting it to a directly follows graph and then run plugin "Mine process tree with Inductive Miner - directly follows" (written by Leemans) on that DFG, then contrary to what he says in his thesis, the miner actually finds a loop cut over the DFG that remains after making the first parallel cut. (This loop cut also strangely involves two empty redo parts.)

Are there perhaps alternative specifications of IM-D that differ from the thesis?

hverbeek · April 2021

Hello Sebastian,

I have asked Sander Leemans to have a look at your questions.

All I can say at this point in time is that the Inductive miners has been a 'work in progress' for the recent years, and that the current version of the Inductive miner may behave differently from the version at the time when Sander wrote his thesis. Sander's thesis dates from 2017, the then-current version of ProM was version 6.6. Perhaps you could try the Inductive miner from ProM 6.6 on the example logs. For this, you will have to install ProM 6.6 next to ProM 6.10, but that should not be a problem. Different versions of ProM can co-exist.

Kind regards,

Eric.

Sebastian · April 2021

OK, I discovered one of my mistakes: Of course using the strictTauLoop fall-through will introduce the loop node into the process tree. But I still don't see how IM-D can discover the structure (XOR, c, SEQ(XOR(d, tau), e)) after strictTauLoop has stripped everything down to three isolated vertices. It does discover this in ProM 6.10, but in theory shouldn't it be just XOR(c,d,e)?

Sebastian · April 2021

Next step: strictTauLoop will not strip away the edge between d and e, because d will not be counted an end node after the loop split, because it has no connection to the outside: The connection to b will have been removed when splitting the DFG after the parallel cut.

This explains why there is a sequence from d to e.

So my only remaining question would be: what explains the optionality of d in that sequence?

As I understand the thesis, an empty trace (epsilon) is only added to a sequence cluster if there is an arc skipping that cluster. That would seem to require three clusters (see the example in Figure 6.22). But here we have a sequence consisting of only two clusters.

I guess that the special cases for start and end nodes in the function SequenceDfgSplit (page 262) somehow deal with this, in a way that still eludes me.

I am sorry that knowledge comes so slowly and piecemeal, but Sander's thesis is not bedside reading ;-)

Sebastian · April 2021

My guess is that the following is true:

If a later cluster of a sequence contains a start node, then an epsilon (empty trace) is added to all previous clusters.

If a previous cluster of a sequence contains an exit node, then an epsilon is added to all later clusters.

Can someone confirm or correct my reading?

sjjleemans · April 2021

Dear Sebastian,

- DFG71 in the thesis is not correct: d should not be an end node. Notice that this DFG is derived only from the split log 71, in which d does not appear as an end activity. Still, no cut can be found in DFG71;

- Log71 is split into log 71 and 73: each trace is split into a trace with d's (which can be empty) and a trace with the other activities. I believe these are correct in the thesis.

Continuing for IM - directly follows:

1) first of all, activityConcurrent is not part of IMd (see page 265). I didn't explore this further, but it could very well be that if you want to preserve fitness, you have to go for and(loop(tau, d), ...).

2) [please note the mistake in DFG71]. For the DFG you describe, I believe that the strictDfgTauLoop would result in loop(xor(c, d, e), tau), which is a fitting model.

Please note that not all fallthroughs need to preserve fitness; see Definition 4.1 for a formal characterisation. An example of a non-fitness preserving fallthrough is emptyTracesFiltering.

Finally, in Section 6.6.6: "Specifically, a fitness-guaranteeing algorithm would need to return a model that allows for any possible path through the directly follows graph. Therefore, such an algorithm has to seriously underfit/generalise, and we chose the basic algorithm IMd to not guarantee fitness...".

[this was an answer to your original question]

sjjleemans · April 2021

Sebastian said:

My guess is that the following is true:

If a later cluster of a sequence contains a start node, then an epsilon (empty trace) is added to all previous clusters.

If a previous cluster of a sequence contains an exit node, then an epsilon is added to all later clusters.

Can someone confirm or correct my reading?

That is correct: add the empty trace if there is an edge from somewhere before (including start) the cluster to somewhere after (including end) the cluster (last line before return in sequenceDfgSplit).

Sebastian · April 2021

Dear Sander,
thank you for so patiently and thoroughly answering my question. This has been most helpful!
Best Regards,
Sebastian

sjjleemans · April 2021

Dear Sebastian,

No worries; cool that you're diving so deep in my thesis.

Sander

Sebastian · April 2021

I may have found another little mistake in your thesis: On page 194 you present log L81
L81 = {(a, b, c, d), (d, a, b), (a, d, c), (b, c, d)} and claim that it has no cut. But d is both a start and an end node, a is a start node and c is an end node, and d is bidirectionally connected to a and c. Further b (also a start and end node) is connected to a and c. So there is a parallel cut partitioning the vertices of the DFG into <{d},{a,b,c}>, isn't there? Or am I missing something?

I am trying to adapt L81 so that really no cut applies.

Sebastian · April 2021

Dear Sander,

it probably was just a typo. For example, when one eliminates the edge from a to d, there really is no cut. One simply has to delete the a from the third trace.

Cheers,
Sebastian

Sebastian · April 2021

Dear Sander,

Update: I did get something wrong. In a parallel cut, all parts should be fully interconnected, i. e. any two nodes a from set_i and b from set_j with i \= j are connected by edges (a,b) and (b, a). This is indeed not the case in the example.

But isn't that overly restrictive? It certainly looks reasonable to discover a parallel cut here, instead of relying on fall-through activityConcurrent, which would be expensive o implement in IM-D.

My idea is the following: We require the DFG to be connected. Then we only look at the edge set of the DFG. For any edge from a to b, if there also is an edge from b to a (the activities are connected in both directions) then the activities go into different components, otherwise their components are merged.

If any component remains that has both a start and an end activity, and none of those are connected bidirectionally to some node of another component that has both a start and end activity, then we have not found a cut. [This clause will distinguish between b being in the same component as a and c in our example, and b being in a component by itself, disconnected from d.]

Then we merge the components that don't contain a start and end event with an arbitrary component that does.

If there is only one component we also have not found a cut. Otherwise we have.

Did you ever think about something like that? Do you think the idea holds water?

Sebastian

sjjleemans · April 2021

Dear Sebastian,

The implementation of the concurrent cut detection considers all pairs of two activities and if they are not connected both ways, then their components are merged. (it does the check for start/complete afterwards, which is non-deterministic).

The tricky part with detecting cuts from DFGs is that loop and concurrency are very much alike, and difficult to distinguish. Also, cut detection needs to detect only the highest-level cut, and leave all the lower-level cuts alone. For instance, if you have and(a, xor(b, and(c, d))), your method would not find the correct top-level cut, which is {a} {b, c, d}, even though c and d are in the end parallel. I think the method you proposed would keep c and d separated in this example. (and would not be defined as the rules would say that c and d are separated, but also that c, b and d would be in the same component).

Sander

Sebastian · April 2021

Dear Sander,

thanks, I see. Indeed my method would only find {a} {b} {c} {d} in your example. I guess there's a trade-off here, but I do understand now why the more restrictive approach would be considered better.

Thanks again for the private tutorial :-)

Sebastian

sjjleemans · April 2021

Dear Sebastian,

No worries

Good to be challenged.

The thing with discovery techniques is that there's always some example on which it doesn't work, or another model would be better. For the basic IM, I've shown that the defined cut detections allow for "rediscoverability" (i.e. getting the original process's language back) under some assumptions. For IMd, this is much shakier (=these assumptions are stronger) as there's a lot less information to work with.

Kind regards,

Sander

Q: Fitness preservation of fall-throughs in IM-D

Comments

Howdy, Stranger!

Categories

In this Discussion