This graph shows the accuracy of each category of models, averaged over three runs.
This graph shows the F-score over three runs, without and with routing tasks. For example, without routing tasks the F-score can reach 100%, but adding routing tasks drops it to 0% in many cases.