Filters

Gert Janssenswillen

2023-04-26

library(bupaR)
## Warning: package 'bupaR' was built under R version 4.2.3
library(edeaR)
library(eventdataR)

The filters for subsetting event data can mostly be divided into two types: event filters and case filters. Event filters subset parts of cases based on criteria applied to the events (e.g. the resource which performed them), while case filters subset complete cases based on criteria applied to the cases (e.g. the trace length).

Each filter has a reverse argument, which makes it easy to invert the selection. Furthermore, each filter has an interface alternative, which can be called by adding an i before the function name.
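As a minimal sketch of the reverse argument, the snippet below applies filter_activity (described in the next section) to the patients log, once as is and once with reverse = TRUE. The variable names kept and removed are illustrative only.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Keep only the events of two activities.
kept <- patients %>%
    filter_activity(c("X-Ray", "Blood test"))

# Same filter with reverse = TRUE: drop those events and keep all others.
removed <- patients %>%
    filter_activity(c("X-Ray", "Blood test"), reverse = TRUE)

# Together, the two selections should account for every event in the log.
n_events(kept) + n_events(removed) == n_events(patients)
```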

Event filters

Filter activities

The filter_activity function can be used to filter activities by name. It has three arguments: the event log, a vector of activity names, and the reverse argument.

patients %>%
    filter_activity(c("X-Ray", "Blood test")) %>%
    summary
## Number of events:  996
## Number of cases:  498
## Number of traces:  2
## Number of distinct activities:  2
## Average trace length:  2
## 
## Start eventlog:  2017-01-05 08:59:04
## End eventlog:  2018-05-05 01:34:30
##                   handling     patient          employee handling_id       
##  Blood test           :474   Length:996         r1:  0   Length:996        
##  Check-out            :  0   Class :character   r2:  0   Class :character  
##  Discuss Results      :  0   Mode  :character   r3:474   Mode  :character  
##  MRI SCAN             :  0                      r4:  0                     
##  Registration         :  0                      r5:522                     
##  Triage and Assessment:  0                      r6:  0                     
##  X-Ray                :522                      r7:  0                     
##  registration_type      time                            .order     
##  complete:498      Min.   :2017-01-05 08:59:04.00   Min.   :  1.0  
##  start   :498      1st Qu.:2017-05-06 12:31:43.00   1st Qu.:249.8  
##                    Median :2017-09-08 00:10:11.00   Median :498.5  
##                    Mean   :2017-09-03 07:11:55.96   Mean   :498.5  
##                    3rd Qu.:2017-12-23 02:06:20.50   3rd Qu.:747.2  
##                    Max.   :2018-05-05 01:34:30.00   Max.   :996.0  
## 

As one can see, there are only 2 distinct activities left in the event log.

Filter on activity frequency

It is also possible to filter on activity frequency. This filter uses a percentile cut off, and will look at those activities which are most frequent until the required percentage of events has been reached. Thus, a percentile cut off of 80% will look at the activities needed to represent 80% of the events. In the example below, the least frequent activities covering 50% of the event log are selected, since the reverse argument is true.

patients %>%
    filter_activity_frequency(percentage = 0.5, reverse = T) %>%
    activity_frequency("activity")
## # A tibble: 4 × 3
##   handling   absolute relative
##   <fct>         <int>    <dbl>
## 1 Check-out       492    0.401
## 2 X-Ray           261    0.213
## 3 Blood test      237    0.193
## 4 MRI SCAN        236    0.192

Filter on attributes

The filter_attributes function is a very generic function and can be supplied with conditions on the data set, in the same way as the dplyr::filter function. As such, it allows you to filter on event or case attributes. Multiple conditions can be listed, separated by a comma, in which case the comma is treated as "AND". You can use the |-symbol to state "OR". Since the patients dataset does not have many additional attributes, the example below uses the resource and activity. With a comma, this filter would be the same as the combination of filter_activity and filter_resource, where both conditions are required. However, it has the advantage that both conditions can also be combined with OR, as is done here.

patients %>% 
    filter_attributes(employee == "r1" | handling == "X-Ray") 
## Warning: `filter_attributes()` was deprecated in bupaR 0.5.0.
## ℹ Please use `filter()` instead.
## # Log of 1522 events consisting of:
## 2 traces 
## 500 cases 
## 761 instances of 2 activities 
## 2 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 01:34:30 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 1,522 × 7
##    handling     patient employee handling_id regist…¹ time                .order
##    <fct>        <chr>   <fct>    <chr>       <fct>    <dttm>               <int>
##  1 Registration 1       r1       1           start    2017-01-02 11:41:53      1
##  2 Registration 2       r1       2           start    2017-01-02 11:41:53      2
##  3 Registration 3       r1       3           start    2017-01-04 01:34:05      3
##  4 Registration 4       r1       4           start    2017-01-04 01:34:04      4
##  5 Registration 5       r1       5           start    2017-01-04 16:07:47      5
##  6 Registration 6       r1       6           start    2017-01-04 16:07:47      6
##  7 Registration 7       r1       7           start    2017-01-05 04:56:11      7
##  8 Registration 8       r1       8           start    2017-01-05 04:56:11      8
##  9 Registration 9       r1       9           start    2017-01-06 05:58:54      9
## 10 Registration 10      r1       10          start    2017-01-06 05:58:54     10
## # … with 1,512 more rows, and abbreviated variable name ¹​registration_type
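The deprecation warning in the output above points to filter() as the replacement. A minimal sketch of the same selection, assuming that dplyr::filter dispatches on bupaR event logs as the warning suggests (the name selected is illustrative):

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Events performed by r1 OR belonging to the X-Ray activity.
selected <- patients %>%
    dplyr::filter(employee == "r1" | handling == "X-Ray")

# Every remaining event satisfies at least one of the two conditions.
all(selected$employee == "r1" | selected$handling == "X-Ray")
```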

Filter resources

Similar to the activity filter, the resource filter can be used to filter events by listing one or more resources.

patients %>%
    filter_resource(c("r1","r4")) %>%
    resource_frequency("resource")
## # A tibble: 2 × 3
##   employee absolute relative
##   <fct>       <int>    <dbl>
## 1 r1            500    0.679
## 2 r4            236    0.321

Trim cases

The trim filter is a special event filter, as it also takes into account the notion of cases. It trims cases such that they start with a certain activity and end with a certain activity. It requires two vectors: one with possible start activities and one with possible end activities. The cases will be trimmed from the first appearance of a start activity until the last appearance of an end activity. When reversed, these slices of the event log will be removed instead of preserved.

patients %>%
    filter_trim(start_activities = "Registration", end_activities =  c("MRI SCAN","X-Ray")) %>%
    traces()
## # A tibble: 2 × 3
##   trace                                                  absolute_freq…¹ relat…²
##   <chr>                                                            <int>   <dbl>
## 1 Registration,Triage and Assessment,X-Ray                           261   0.525
## 2 Registration,Triage and Assessment,Blood test,MRI SCAN             236   0.475
## # … with abbreviated variable names ¹​absolute_frequency, ²​relative_frequency

Case filters

Filter activity presence

This function filters cases that contain certain activities. It requires as input a vector containing one or more activity labels, and it has a method argument. The latter can have the values all, none or one_of. When set to all, all the specified activity labels must be present for a case to be selected; none means that none of them may be present; and one_of means that at least one of them must be present.
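The three method values can be compared side by side. The sketch below is illustrative: the activity labels are taken from the patients log, and the variable names are made up for the example.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

labels <- c("Blood test", "MRI SCAN")

# Cases containing both activities.
with_all <- patients %>%
    filter_activity_presence(activities = labels, method = "all")

# Cases containing at least one of the activities.
with_any <- patients %>%
    filter_activity_presence(activities = labels, method = "one_of")

# Cases containing neither activity.
with_none <- patients %>%
    filter_activity_presence(activities = labels, method = "none")

# Number of cases selected by each method.
c(all = n_cases(with_all), one_of = n_cases(with_any), none = n_cases(with_none))
```

Since a case either contains at least one of the labels or none of them, the one_of and none selections should together cover the whole log.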

Filter case

The case filter selects a set of cases by their identifiers. As an argument it only requires a vector of case ids. The selection can also be negated using reverse = T.
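A short sketch of the case filter, assuming the case identifiers "1", "2" and "3" exist in the patients log (they appear as character values in the output shown earlier); the variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

ids <- c("1", "2", "3")

# Keep only the listed cases.
first_three <- patients %>%
    filter_case(cases = ids)

# Keep every case except the listed ones.
all_others <- patients %>%
    filter_case(cases = ids, reverse = TRUE)

n_cases(first_three)
```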

Filter end points

The filter_endpoints method filters cases based on the first and last activity label. It can be used in two ways: by specifying vectors with allowed start activities and/or allowed end activities, or by specifying a percentile. In the latter case, the percentile value will be used as a cut off. For example, when set to 0.9, it will select the most common endpoint pairs which together cover at least 90% of the cases, and filter the event log accordingly. This filter can also be reversed.
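Both ways of using the endpoints filter are sketched below. The chosen start and end activities and the variable names are assumptions made for the sake of the example.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Cases that start with Registration and end with Check-out.
regular <- patients %>%
    filter_endpoints(start_activities = "Registration",
                     end_activities = "Check-out")

# Cases belonging to the most common endpoint pairs, with a 0.9 cut off.
common <- patients %>%
    filter_endpoints(percentage = 0.9)

c(regular = n_cases(regular), common = n_cases(common))
```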

Filter precedence

In order to extract a subset of an event log which conforms to a set of precedence rules, one can use the filter_precedence method. There are two types of precedence relations which can be tested: activities that should directly follow each other, or activities that should eventually follow each other. The type can be set with the precedence_type argument. Further, the filter requires a vector of one or more antecedents (containing activity labels), and one or more consequents. Finally, a filter_method argument can be set. This argument is relevant when there is more than one antecedent or consequent. In such a case, you can specify that all possible precedence combinations must be present (all), or at least one of them (one_of).
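The following sketch illustrates both precedence types. The argument values "directly_follows" and "eventually_follows" are those documented for edeaR; the chosen activity pairs and variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Cases in which Triage and Assessment is directly followed by
# Blood test or X-Ray (at least one of the two combinations).
direct <- patients %>%
    filter_precedence(antecedents = "Triage and Assessment",
                      consequents = c("Blood test", "X-Ray"),
                      precedence_type = "directly_follows",
                      filter_method = "one_of")

# Cases in which Registration is eventually followed by Check-out.
eventual <- patients %>%
    filter_precedence(antecedents = "Registration",
                      consequents = "Check-out",
                      precedence_type = "eventually_follows")

c(direct = n_cases(direct), eventual = n_cases(eventual))
```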

Filter processing time, throughput time and trace length

There are three different filters which take into account the length of a case: filter_processing_time, filter_throughput_time and filter_trace_length.

Each of these filters can work in two ways, similar to the endpoints filter: either by using an interval or by using a percentile cut off. The percentile cut off starts with the shortest cases and stops including cases when the specified percentile is reached. The processing and throughput time filters also have a units argument to specify the time unit used when defining an interval. All the methods can be reversed by setting reverse = T.
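A sketch of the interval and percentile variants follows. The thresholds are arbitrary, the variable names are illustrative, and the half-open interval written with NA follows the edeaR documentation.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Interval: cases with at least 5 activity instances (NA leaves the upper bound open).
long_traces <- patients %>%
    filter_trace_length(interval = c(5, NA))

# Interval with units: cases with a throughput time of at most 7 days.
within_week <- patients %>%
    filter_throughput_time(interval = c(0, 7), units = "days")

# Percentile cut off: the cases with the shortest processing time, up to 80%.
fast <- patients %>%
    filter_processing_time(percentage = 0.8)

# Reversed: the remaining cases, i.e. those with the longest processing time.
slow <- patients %>%
    filter_processing_time(percentage = 0.8, reverse = TRUE)

c(fast = n_cases(fast), slow = n_cases(slow))
```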

Filter time period

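Cases can also be filtered by supplying a time window to the method filter_time_period. There are four different filter methods, one of which can be passed as the filter_method argument: contained (cases whose events all fall within the time window), start (cases that started within the time window), complete (cases that completed within the time window) and intersecting (cases with at least one event within the time window).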

The selection can also be reversed. Note that there is a fifth filter method, trim, but this is actually an event filter, since it keeps only the events that fall within the time window rather than complete cases.
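A minimal sketch of the time period filter is given below. It assumes that the interval is passed as a POSIXct vector of length two and that the filter_method values "contained" and "intersecting" are available, as listed in the edeaR documentation. The dates and variable names are arbitrary.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Time window covering the second half of 2017 (illustrative dates).
window <- as.POSIXct(c("2017-07-01", "2017-12-31"), tz = "UTC")

# Cases that lie entirely within the window.
contained <- patients %>%
    filter_time_period(interval = window, filter_method = "contained")

# Cases with at least some activity within the window.
intersecting <- patients %>%
    filter_time_period(interval = window, filter_method = "intersecting")

c(contained = n_cases(contained), intersecting = n_cases(intersecting))
```

Every case that is contained in the window also intersects it, so the first selection should never be larger than the second.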

Filter trace frequency

The last case filter can be used to filter cases based on the frequency of the corresponding trace. A trace is a sequence of activity labels. There are again two ways to select cases based on trace frequency: by interval or by percentile cut off. The percentile cut off will start with the most frequent traces. This filter also has a reverse argument.
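The percentile variant and its reverse can be sketched as follows. The 0.8 cut off and the variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Cases following the most frequent traces, with a 0.8 cut off.
frequent <- patients %>%
    filter_trace_frequency(percentage = 0.8)

# Reversed: cases following the remaining, less frequent traces.
rare <- patients %>%
    filter_trace_frequency(percentage = 0.8, reverse = TRUE)

c(frequent_traces = n_traces(frequent), rare_traces = n_traces(rare))
```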