Clustering traces based on edit distance

nirmalpatel · April 2017

Hi everyone,

Is there a ProM plugin that can cluster traces based on edit distance (also called Levenshtein distance)? I see that there are some papers [1] mentioning the approach, but I'm unable to find a plugin for the same.

Please tell me if you know any other plugin for clustering that is uses similar approach, that will also work.

I'm working with data that come from place where there is no underlying normative process model. Basically, its kids using educational software, and so things differ from teacher to teacher, and also from student to student so event log has quite a lot of variety. And so process models I get are always complicated. I'm hoping that clustering traces beforehand might help.

Thanks,
Nirmal

[1] Context Aware Trace Clustering: Towards Improving Process Mining Results

nirmalpatel · April 2017

Hi Folks,

Right now I've done a little arrangement to do clustering outside ProM. I'm trying out hierarchical clustering to cluster traces. I use R, and R's hclust() function that does hierarchical clustering requires user to provide a distance matrix that contains pairwise distances between observations. So I just plug in edit distance matrix of all traces and I am getting very interesting results. I haven't optimized code, so it takes a bit of time, but results are looking good to me.

One problem with using centroid based algorithm (like K-Means) for trace clustering is deciding up on centroid, which is a bit difficult. Earlier I tried to make centroid as trace of length that occurs most in the cluster and events in trace were most common events in each position. But I think hierarchical clustering seems more sound to use, without any kind of tricks involved.

- Nirmal

[Deleted User] · April 2017

Hi Nirmal,

Sorry, I do not know the answer. I am contacting one of my colleague to help you. Please be patient.

Cong

bhompes · April 2017

Dear Nirmal,

If I understand correctly, your question is twofold: you want to provide a certain similarity/distance measure yourself (e.g. Levenshtein distance), and would like to obtain the number of clusters automatically (i.e. not having to select the k for k-Means).

At the moment, the easiest thing to do is to run the plug-in called "Cluster traces using Markov clustering". The default implementation uses the cosine similarity between profile vectors created from the traces using a selected set of 'perspectives'. The perspectives are the properties used to cluster upon, such as attribute values or control-flow constructs. Then, the Markov cluster algorithm (MCL) is used to cluster the traces. MCL is a clustering algorithm that does not take the number of clusters as input.

The MCL clustering technique uses two parameters, expansion and inflation, which both affect clustering granularity. When a low level (fine-grained) clustering is of interest, expansion can be decreased and inflation increased, and vice-versa for when only high-level clustering is required.

The technique used in the plug-in is described in our paper titled "Discovering Deviating Cases and Process Variants Using Trace Clustering".

There's also some other topics on this forum talking about this technique here, here, and here.

Would you prefer to use another similarity measure, you'd need to run the plug-in (and ProM) from code and implement the similarity measure. The code is prepared for this (interfaced).

Let me know whether this plug-in suits your needs and if you need more help with the custom implementation.

For the record, the TraceClustering package also contains the code for the plug-in called "Cluster traces over time using Markov clustering", which takes into account the changes over time in the similarity matrix, in order to let the user detect concept drift (see our paper titled "Detecting Changes in Process Behavior Using Comparative Case Clustering").

Kind regards,

Bart

nirmalpatel · April 2017

Thanks a lot Bart! I read first 4 sections of your paper, and changed parameters of clustering algorithm (I decreased expansion to 1 for start) and I'm seeing all the different variants now!

I'll try and make more use of the MCL plugin, and see how results turn out. Hopefully, I'll be able to post some updates here later.

- Nirmal

Clustering traces based on edit distance

Comments

Howdy, Stranger!

Categories

In this Discussion