
Mismatch between Application event log in ProM-Lite and its own exported CSV

I'm confused by a mismatch I see when inspecting the Application event log in ProM-Lite and in R. I first loaded the Application event log into ProM and then selected "Export to disc" to save the log as a CSV file, because R might be useful for some of the challenge questions. The number of observations (events) is the same in ProM and in the CSV (1,202,267). So is the number of cases (31,509). The number of distinct event classes matches as well (26). However, the distribution of events across those classes is completely different in ProM and in the exported CSV (see the attached HTML and PDF files). When looking at a specific case (Application_828200680), the sequence of events differs between ProM and the CSV, even though the number of events is 21 in both (see attachments).

Is there a problem with the CSV export in ProM-Lite, or am I misinterpreting the structure of the log?

If the CSV-structure is incorrect, then mining the data outside of ProM probably doesn’t make sense. Also, the CSV-export in ProM does not contain the resource attribute.

---
Attachments:
1. Analysis of the mismatch: http://rpubs.com/frkbr/255659

2. ProM Log summary for comparison: https://www.dropbox.com/s/hyh3dv7glzg1b0g/ProM_LogSummary Application log.html?dl=0

3. Specific case-ID: https://www.dropbox.com/s/8b4y96fqmfr1vyp/Application_828200680 Flow.pdf?dl=0

4. Same case-ID in Inspector-view: https://www.dropbox.com/s/xo1nqrpagwbq4yo/Application_828200680 Inspector.pdf?dl=0
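
The checks described in the question (event count, case count, number of event classes, and the per-class distribution) can be reproduced outside ProM with a few lines of Python. This is only a sketch: the column names `case` and `event`, the function names, and the sample rows are assumptions made for illustration, so adjust them to whatever header the exported CSV actually has.

```python
import csv
import io
from collections import Counter


def summarize_log(rows, case_col="case", activity_col="event"):
    """Count events, distinct cases, and events per event class.

    The default column names are assumptions, not the exporter's real header.
    """
    n_events = 0
    cases = set()
    per_class = Counter()
    for row in rows:
        n_events += 1
        cases.add(row[case_col])
        per_class[row[activity_col]] += 1
    return {
        "events": n_events,
        "cases": len(cases),
        "classes": len(per_class),
        "distribution": per_class,
    }


def diff_distributions(expected, actual):
    """Return {event class: (expected count, actual count)} where they differ."""
    names = sorted(set(expected) | set(actual))
    return {
        name: (expected.get(name, 0), actual.get(name, 0))
        for name in names
        if expected.get(name, 0) != actual.get(name, 0)
    }


# Made-up rows for illustration only; a real file would be read with
# open(path, newline="") instead of io.StringIO.
SAMPLE_CSV = """case,event
Application_A,A_Create Application
Application_A,A_Submitted
Application_B,A_Create Application
Application_B,A_Concept
"""

summary = summarize_log(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary["events"], summary["cases"], summary["classes"])

# Hypothetical reference counts, e.g. typed in from a ProM log summary.
reference = {"A_Create Application": 2, "A_Submitted": 2}
print(diff_distributions(reference, summary["distribution"]))
```

If the three totals agree but `diff_distributions` returns a non-empty result, the two sources disagree in the way the question describes: same number of events, different event classes assigned to them.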

Comments

  • Frank Posts: 29
    Any answer? Did I post in the wrong thread?
  • JBuijs Posts: 910
    Hi Frank. I could confirm your issue and have contacted a colleague who might be able to assist.
    Joos Buijs

    Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
    Previously Assistant Professor in Process Mining at Eindhoven University of Technology
  • Frank Posts: 29
    Thanks Joos for referring the problem to your colleague!

    I just dug a little deeper. It seems that the CSV export in ProM-Lite works better (but still not correctly) for "simple" event logs in which every activity has two lifecycle events (e.g. a start and a complete event). In this case, the ProM export creates one unnecessary timestamp column or one unnecessary row, depending on the chosen output format. The CSV export also drops the lifecycle transition names and the resource names.

    However, in more complex logs, where some activities have only one possible lifecycle transition and others have several, the ProM export even gets some activity names and timestamps wrong, in addition to the problems above. This is the case with the A_/O_/W_ activities in the BPI 2017 log.
  • hverbeek Posts: 319
    edited March 2017
    Hi Frank, Joos,

    It seems that the CSV exporter does not handle events other than start and complete very well. As an example, take the first six events from the Application_652823628 case:
    1. A_Create Application, complete, 01.01.2016 10:51:15.304
    2. A_Submitted, complete, 01.01.2016 10:51:15.352
    3. W_Handle leads, schedule, 01.01.2016 10:51:15.774
    4. W_Handle leads, withdraw, 01.01.2016 10:52:36.392
    5. W_Complete application, schedule, 01.01.2016 10:52:36.403
    6. A_Concept, complete, 01.01.2016 10:52:36.413
    Note that events 3, 4, and 5 use schedule and withdraw instead of start or complete.
    Now observe what is exported into the CSV file:
    1. A_Create Application, complete, 01.01.2016 10:51:15.304
    2. A_Submitted, complete, 01.01.2016 10:51:15.352
    3. A_Submitted, complete, 01.01.2016 10:51:15.352
    4. A_Submitted, complete, 01.01.2016 10:51:15.352
    5. A_Submitted, complete, 01.01.2016 10:51:15.352
    6. A_Concept, complete, 01.01.2016 10:52:36.413
    It seems that, instead of exporting the schedule or withdraw events, the exporter repeats the last start or complete event it wrote.

    Of course, this needs to be fixed. But for now, the exporter cannot be used on logs that contain such events. Frank, perhaps you can filter these events out first.

    Kind regards,

    Eric.
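
The symptom described in the post above, where the exporter repeats the last start or complete event, can be spotted mechanically by looking for rows that are identical to the row before them. The sketch below uses the six events listed in that post. The function names are hypothetical, and the check is only a heuristic, because a log may legitimately contain identical consecutive events. It also shows the suggested workaround of keeping only start and complete events.

```python
# The first six events of case Application_652823628 as listed in the post:
# (activity, lifecycle transition, timestamp)
ORIGINAL = [
    ("A_Create Application", "complete", "01.01.2016 10:51:15.304"),
    ("A_Submitted", "complete", "01.01.2016 10:51:15.352"),
    ("W_Handle leads", "schedule", "01.01.2016 10:51:15.774"),
    ("W_Handle leads", "withdraw", "01.01.2016 10:52:36.392"),
    ("W_Complete application", "schedule", "01.01.2016 10:52:36.403"),
    ("A_Concept", "complete", "01.01.2016 10:52:36.413"),
]

# The same six rows as they appear in the exported CSV.
EXPORTED = [
    ("A_Create Application", "complete", "01.01.2016 10:51:15.304"),
    ("A_Submitted", "complete", "01.01.2016 10:51:15.352"),
    ("A_Submitted", "complete", "01.01.2016 10:51:15.352"),
    ("A_Submitted", "complete", "01.01.2016 10:51:15.352"),
    ("A_Submitted", "complete", "01.01.2016 10:51:15.352"),
    ("A_Concept", "complete", "01.01.2016 10:52:36.413"),
]


def find_repeated_rows(rows):
    """Return the 0-based indices of rows identical to the row just before them.

    Intended to be run on the rows of a single case, in log order.
    """
    repeated = []
    previous = None
    for index, row in enumerate(rows):
        if row == previous:
            repeated.append(index)
        previous = row
    return repeated


def keep_start_complete(rows):
    """Workaround: drop every event whose lifecycle is not start or complete."""
    return [row for row in rows if row[1] in {"start", "complete"}]


print(find_repeated_rows(EXPORTED))
print(find_repeated_rows(ORIGINAL))
print([row[0] for row in keep_start_complete(ORIGINAL)])
```

Applied to the rows above, the first call flags the three rows that replaced the schedule and withdraw events, and the filter leaves only the three A_ events.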
  • Frank Posts: 29
    Thanks Eric. Yes, I had the same observation. However, filtering out the trouble-making events would mean that all the W_* activities have to be filtered out, leaving only the A_* and O_* activities. Even then, the CSV-exporter forgets the resource-attribute and creates redundant rows or columns.

    Here's another idea until the bug is fixed: would it be possible to provide the event-logs as flat CSV-files too, in addition to XES? That would allow more flexibility in using generic data analysis tools in addition to process mining software.
  • Hi, I agree with Frank. Please upload the dataset in CSV format.
  • So do I.
  • If someone can point me to a location that takes a file that size, I'm happy to oblige. I've written a custom program to extract to a database, which I could extract
  • Hi theMarlzy,
    That's great news. Isn't Dropbox enough?
    Or could you share the code and say which language you wrote it in?

    I'm trying to do something similar in Java, but this requires a good look at the XES framework, which is a problem since the file does not open.


  • JBuijs Posts: 910
    Dear all,

    ProM should be able to handle this log file, especially if you select the 'lightweight sequential' option (instead of 'naive', which requires the most memory).

    If the problem is with the CSV file then Microsoft Access or other (freely available) database software, or BI tools from Microsoft or Tableau might work.
  • nhanitvn Posts: 2
    Hi Frank, 

    I ran into the same issue and decided to write a Python script that reads the XES file and writes applications and events to CSV files, instead of using ProM (XES is just an XML format).

    Feel free to download my script and generate the CSV files yourself: https://github.com/nhanitvn/BPI_Challenge_2017/blob/master/application_xes_to_csv.py

    The data I read out for Application_828200680 is the same as what ProM shows.

    However, because the XES file is large, the script takes quite a while to finish (20 min on my macOS machine with a 3.1 GHz Intel Core i7 and 16 GB RAM). I could make it faster, but I already have the CSV files I need :).
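
Because XES is XML, the conversion can be sketched with Python's standard library alone. The block below is not the linked script. It is a minimal illustration that streams the file trace by trace with `xml.etree.ElementTree.iterparse` and clears each trace once it has been written out. It assumes the standard XES attribute keys `concept:name`, `lifecycle:transition`, `time:timestamp`, and `org:resource`; the CSV column names, the function names, and the tiny sample log (including its resource value) are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# CSV column -> XES attribute key (the column names are illustrative).
COLUMNS = {
    "activity": "concept:name",
    "lifecycle": "lifecycle:transition",
    "timestamp": "time:timestamp",
    "resource": "org:resource",
}


def local_name(tag):
    """Strip an XML namespace such as '{http://www.xes-standard.org/}trace'."""
    return tag.rsplit("}", 1)[-1]


def iter_event_rows(source):
    """Yield one dict per event; `source` is a file name or binary file object."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "trace":
            continue
        case_id = ""
        events = []
        for child in elem:
            if local_name(child.tag) == "event":
                # Only top-level event attributes are read; nested ones are ignored.
                events.append({a.get("key"): a.get("value") for a in child})
            elif child.get("key") == "concept:name":
                case_id = child.get("value")
        for attrs in events:
            row = {"case": case_id}
            for column, key in COLUMNS.items():
                row[column] = attrs.get(key, "")
            yield row
        elem.clear()  # free the finished trace before parsing the next one


def write_csv(rows, out):
    """Write the rows to an open text file, one line per event."""
    writer = csv.DictWriter(out, fieldnames=["case", *COLUMNS])
    writer.writeheader()
    writer.writerows(rows)


# Tiny hand-written sample; the resource value is invented.
SAMPLE_XES = b"""<log xmlns="http://www.xes-standard.org/">
  <trace>
    <string key="concept:name" value="Application_652823628"/>
    <event>
      <string key="concept:name" value="A_Create Application"/>
      <string key="lifecycle:transition" value="complete"/>
      <date key="time:timestamp" value="2016-01-01T10:51:15.304"/>
      <string key="org:resource" value="User_X"/>
    </event>
    <event>
      <string key="concept:name" value="W_Handle leads"/>
      <string key="lifecycle:transition" value="schedule"/>
      <date key="time:timestamp" value="2016-01-01T10:51:15.774"/>
    </event>
  </trace>
</log>
"""

out = io.StringIO()
write_csv(iter_event_rows(io.BytesIO(SAMPLE_XES)), out)
print(out.getvalue())
```

For a real log, pass the path of the XES file to `iter_event_rows` and an output file opened with `open(path, "w", newline="")` to `write_csv`. Each event becomes its own row with its lifecycle transition and resource kept, which are the two attributes the thread reports as missing from the ProM export.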
  • Frank Posts: 29
    Thanks a lot for sharing your Python script. In the meantime I loaded the XES file into Celonis and exported it as a CSV so that I could look at it in R. I don't know if the CSV export in ProM-Lite is still broken, but in Celonis it worked.

    I don't have any Python experience, so if you know of any R code that translates an XES file into CSV, I'd be glad to hear about it. That could be useful in the future.
  • nhanitvn Posts: 2
    I found this package: https://cran.r-project.org/web/packages/edeaR/index.html. However, its loading function hangs forever for me. You may want to try it. I could help write some R code when I have time.