1 Ping data and trip aggregation

1.1 Ping data demonstration

pacman::p_load(data.table, dplyr, ggplot2, lubridate, 
               kableExtra, psych, corrplot, scales, corrgram)

d = fread("data/sample_ping.csv") %>% 
  .[,ping_time := ymd_hms(ping_time)]

knitr::kable(d[1:10,], align = "c", caption = "Fake ping data")

Fake ping data
driver	ping_time	speed	lat	long
a fake driver	2014-09-30 23:00:00	0	35.4	-77.9
a fake driver	2014-09-30 23:15:00	0	35.4	-77.9
a fake driver	2014-09-30 23:16:00	0	35.4	-77.9
a fake driver	2014-09-30 23:18:00	0	35.4	-77.9
a fake driver	2014-09-30 23:21:00	12	35.4	-77.9
a fake driver	2014-09-30 23:24:00	54	35.4	-77.9
a fake driver	2014-09-30 23:30:00	56	35.4	-78.0
a fake driver	2014-09-30 23:30:00	58	35.4	-78.0
a fake driver	2014-09-30 23:30:00	1	35.4	-78.0
a fake driver	2014-09-30 23:37:00	23	35.4	-78.0

The table above shows a fake copy of our ping data. It resembles the real data but is not the real data, for confidentiality reasons, and is provided to give readers a sense of what our ping data look like. The data set includes the date and time of each record (year, month, day, hour, and minute; the seconds are rounded to 0), latitude and longitude (specific to five decimal places in the real data), speed, and the driver's anonymized unique ID.

1.2 Aggregating ping data into trips

The user-defined function segment_0() below separates the fake ping data into trips according to a threshold value for stopping time.

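segment_0 = function(speed, threshold, time_diff) {
  # Keep the original speeds; `speed` is modified below
  speed1 = speed
  # Treat a ping that follows a gap of at least `threshold` as stopped
  speed[time_diff >= threshold] <- 0
  # Number the consecutive runs of moving / stopped pings as 1, 2, ...
  r1 = rle(speed != 0)
  r1$values <- replicate(length(r1$values), 1)
  r1$values <- cumsum(r1$values)
  order_tmp <- inverse.rle(r1)
  # Total elapsed time within each run (same units as `threshold`)
  dat_tmp1 <- data.table::data.table(speed, order_tmp, time_diff)
  dat_tmp2 <- dat_tmp1[,.(sumdiff = sum(time_diff)), by = order_tmp]
  # Relabel stopped runs shorter than `threshold` as moving, except the first run
  r2 = rle(speed != 0); first_rle = r2$values[1]
  r2$values[r2$values == 0 & dat_tmp2$sumdiff < threshold] <- TRUE
  r2$values[1] = first_rle
  # Merge adjacent moving runs and number them 1, 2, ...; stopped runs stay 0
  r2 <- rle(inverse.rle(r2))
  r2$values[r2$values] = cumsum(r2$values[r2$values])
  id = inverse.rle(r2)
  # Pings zeroed only because of a long gap take the trip ID of the next ping
  jump_speed = which(id == 0 & speed1 != 0)
  id[jump_speed] = id[jump_speed + 1]
  return(id)
}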
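The run-length idea behind this kind of segmentation can be illustrated with a simplified sketch in base R. This is not the authors' segment_0(): the function name label_trips(), its arguments, and the toy vectors are hypothetical, and it deliberately omits the handling of long gaps between consecutive pings. It only shows how rle() and inverse.rle() can absorb short stops into the surrounding trip and number the remaining moving runs.

```r
# Simplified, hypothetical sketch: a stop splits two trips only if the
# stopped pings accumulate at least `threshold` minutes.
label_trips <- function(speed, minutes_since_prev, threshold = 30) {
  runs   <- rle(speed != 0)                              # runs of moving / stopped pings
  run_id <- rep(seq_along(runs$lengths), runs$lengths)   # run index for every ping
  run_minutes <- as.numeric(tapply(minutes_since_prev, run_id, sum))

  # Stopped runs shorter than the threshold are treated as part of a trip
  short_stop <- !runs$values & run_minutes < threshold
  short_stop[1] <- FALSE                                 # a leading stop stays a stop
  runs$values[short_stop] <- TRUE

  # Merge adjacent moving runs, then number them 1, 2, ...; stops get 0
  merged <- rle(inverse.rle(runs))
  merged$values <- ifelse(merged$values, cumsum(merged$values), 0L)
  inverse.rle(merged)
}

speed <- c(0, 0, 12, 54, 0, 30, 0, 0, 20)   # toy speeds (MPH)
gap   <- c(0, 15, 3, 3, 5, 4, 20, 25, 6)    # toy minutes since previous ping
trip_id <- label_trips(speed, gap)
trip_id
# 0 0 1 1 1 1 0 0 2
```

In this toy input, the 5-minute stop is absorbed into trip 1, while the 45-minute stop separates trip 1 from trip 2.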

The code block below separates the fake ping data into trips by adding a trip_id column. Here we set the threshold to 30 minutes, which means that the ping data are split into different trips whenever the vehicle is not moving (the ping speed equals zero) for at least 30 minutes. Pings with a trip_id of 0 are stopped pings.

d_id = d %>% 
  .[,diff := as.integer(difftime(ping_time, shift(ping_time, type = "lag", fill = 0), 
                                 units = "mins")), driver] %>%
  .[,diff := {diff[1] = 0L; diff}, driver] %>%
  .[,trip_id := segment_0(speed = speed, threshold = 30, time_diff = diff), driver]

knitr::kable(d_id[1:10,], align = "c", caption = "Sample ping data with time difference and trip ID")

Sample ping data with time difference and trip ID
driver	ping_time	speed	lat	long	diff	trip_id
a fake driver	2014-09-30 23:00:00	0	35.4	-77.9	0	0
a fake driver	2014-09-30 23:15:00	0	35.4	-77.9	15	0
a fake driver	2014-09-30 23:16:00	0	35.4	-77.9	1	0
a fake driver	2014-09-30 23:18:00	0	35.4	-77.9	2	0
a fake driver	2014-09-30 23:21:00	12	35.4	-77.9	3	1
a fake driver	2014-09-30 23:24:00	54	35.4	-77.9	3	1
a fake driver	2014-09-30 23:30:00	56	35.4	-78.0	6	1
a fake driver	2014-09-30 23:30:00	58	35.4	-78.0	0	1
a fake driver	2014-09-30 23:30:00	1	35.4	-78.0	0	1
a fake driver	2014-09-30 23:37:00	23	35.4	-78.0	7	1

The table below shows the aggregated trip data, including the start time, end time, trip length in minutes (trip_length), and the midpoint time of each trip (trip_median). The ping table and the trip table can be merged into one table using the driver ID and trip ID.

d_trip = d %>%
  .[trip_id != 0,] %>% 
  .[,.(start_time = ping_time[1], end_time = ping_time[.N]), .(driver, trip_id)] %>% 
  .[,trip_length := round(as.numeric(difftime(end_time, start_time, units = "mins")), 2)] %>% 
  .[,trip_median := start_time + 60*trip_length/2]

knitr::kable(d_trip, align = "c", caption = "Aggregated trips from sample ping")

Aggregated trips from sample ping
driver	trip_id	start_time	end_time	trip_length	trip_median
a fake driver	1	2014-09-30 23:21:00	2014-10-01 05:45:00	384	2014-10-01 02:33:00
a fake driver	2	2014-10-01 06:26:00	2014-10-01 07:45:00	79	2014-10-01 07:05:30
a fake driver	3	2014-10-01 08:54:00	2014-10-01 12:26:00	212	2014-10-01 10:40:00
a fake driver	4	2014-10-01 23:06:00	2014-10-02 00:04:00	58	2014-10-01 23:35:00
a fake driver	5	2014-10-02 01:15:00	2014-10-02 02:00:00	45	2014-10-02 01:37:30
a fake driver	6	2014-10-02 04:21:00	2014-10-02 05:32:00	71	2014-10-02 04:56:30
a fake driver	7	2014-10-02 08:13:00	2014-10-02 12:45:00	272	2014-10-02 10:29:00
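As a rough illustration of the aggregation and merge described above, the following base R sketch builds a trip table from a few pings and joins it back to the pings by driver and trip ID. The toy data frame pings, its values, and the column name trip_midpoint are hypothetical and chosen only for this example.

```r
# Hypothetical toy pings with trip IDs already assigned (0 = stopped ping)
pings <- data.frame(
  driver    = "a fake driver",
  ping_time = as.POSIXct("2014-09-30 23:00:00", tz = "UTC") +
    60 * c(0, 21, 24, 30, 37, 90, 95, 105),
  trip_id   = c(0, 1, 1, 1, 1, 0, 2, 2)
)

# Drop stopped pings, then summarise each driver-trip group
moving <- pings[pings$trip_id != 0, ]
groups <- split(moving, list(moving$driver, moving$trip_id), drop = TRUE)

trips <- do.call(rbind, lapply(groups, function(g) {
  start <- min(g$ping_time)
  end   <- max(g$ping_time)
  len   <- as.numeric(difftime(end, start, units = "mins"))
  data.frame(driver = g$driver[1], trip_id = g$trip_id[1],
             start_time = start, end_time = end,
             trip_length = len,                    # minutes
             trip_midpoint = start + 60 * len / 2) # halfway between start and end
}))
rownames(trips) <- NULL

# Attach trip-level columns to every ping; stopped pings get NA
pings_with_trips <- merge(pings, trips, by = c("driver", "trip_id"), all.x = TRUE)
```

Using all.x = TRUE keeps the stopped pings in the merged table, with missing values in the trip-level columns.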

1.3 Ping and trip data visualization

p2 = d %>% 
  .[,p_color := case_when(speed >= 50 ~ ">=50 MPH", 
                             speed >= 25 & speed < 50 ~ "[25, 50) MPH", 
                             speed >   0 & speed < 25 ~ "(0, 25) MPH", 
                             speed == 0 ~ "0 MPH")] %>% 
  .[,p_color := factor(p_color, levels = c("0 MPH", "(0, 25) MPH", "[25, 50) MPH", ">=50 MPH"))] %>% 
  ggplot(aes(ping_time, speed)) + 
  geom_point(aes(color = p_color)) + 
  scale_colour_manual(name = "speed category", values = c("#636363", "#31a354", "#fb6a4a", "#a50f15")) + 
  geom_line() + theme_bw() + 
  geom_segment(data = d_trip, aes(x = start_time, xend = end_time, y = -3, yend = -3),
               arrow = arrow(length = unit(.2, "cm")), lineend = 'butt', size = .8, color = "#7b3294") + 
  geom_text(data = d_trip, aes(x = trip_median, y = rep(-4.8, nrow(d_trip)),
                             label = paste(rep("Trip", nrow(d_trip)), 1:nrow(d_trip), " ")),
            color = "#7b3294", size = 3) + 
  labs(x = "date and time of ping", y = "ping speed (MPH)") + 
  scale_y_continuous(breaks = c(0, 25, 50), limits = c(-5, 70)) + 
  theme(legend.justification = c(1, 1), legend.position = c(0.97, 1),
        legend.background = element_rect(fill = alpha('white', 0.1)),
        legend.direction = "horizontal", 
        panel.grid.major.x = element_blank(), panel.grid.minor = element_blank())
p2

The figure above shows the data aggregation process. The x-axis shows the date and time of the pings, and the y-axis shows the ping speed in miles per hour (MPH). Each point represents a ping at that date and time, with colors indicating the speed category. Whenever the truck stopped (the grey points) for at least 30 minutes, the pings were separated into different trips, indicated by the purple arrows at the bottom (Trip 1, Trip 2, \(\ldots\), Trip 7). The trip time was then calculated as the difference between the trip end time and start time.

2 Statistical modeling

2.1 Bayesian NB regression using `rstanarm`

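The code below fits the Bayesian negative binomial (NB) regression model. As written, it runs 4 Markov chains with 4,000 iterations each; by default, rstanarm uses the first half of each chain as warm-up iterations. For a quick demonstration, a subset of the data (for example, the first 1,000 observations), a single chain, and fewer iterations (for example, 1,000 iterations with the first 500 as warm-up) can be used instead.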

pacman::p_load(rstanarm, broom)

fit <-
  stan_glm(
    crash ~ SCE + speed + age + gender + bus_unit + d_type,
    offset = log(distance / 1000),
    data = data,
    family = neg_binomial_2,
    prior = normal(0, 10),
    prior_intercept = normal(0, 10),
    QR = TRUE,
    iter = 4000,
    chains = 4,
    cores = 4,
    seed = 123
  )

broom::tidy(fit, intervals = TRUE, prob = 0.95) %>% 
  mutate(estimate = exp(estimate),
         lower = exp(lower),
         upper = exp(upper)) %>% 
  select(term, IRR = estimate, `95% CI left` = lower, `95% CI right` = upper) %>% 
  knitr::kable(align = "c", 
               caption = "Posterior estimates of Bayesian NB model.")

The table above shows the incidence rate ratios (IRRs) from the Bayesian NB regression, together with their 95% credible intervals. The Bayesian interpretation is that, given the model and the data, there is a 95% posterior probability that each IRR lies within its interval.
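To make the step from a log-scale coefficient to an IRR concrete, here is a minimal sketch in base R. The vector beta_draws is simulated and purely hypothetical; in practice the draws would come from the fitted model. The sketch exponentiates each draw and then summarises the draws on the IRR scale.

```r
set.seed(1)

# Hypothetical posterior draws of one coefficient on the log scale
beta_draws <- rnorm(4000, mean = 0.08, sd = 0.02)

# Move each draw to the rate-ratio scale, then summarise
irr_draws <- exp(beta_draws)
irr_summary <- c(
  IRR   = median(irr_draws),
  lower = unname(quantile(irr_draws, 0.025)),
  upper = unname(quantile(irr_draws, 0.975))
)
round(irr_summary, 3)
```

An IRR of about 1.08 would mean that a one-unit increase in the predictor multiplies the expected crash rate per unit of exposure by roughly 1.08, holding the other predictors fixed.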

2.2 Model comparison and diagnostics using `loo`

prop_zero <- function(y) mean(y == 0)
pp_check(fit, plotfun = "stat", stat = "prop_zero")

The code above produces a posterior predictive check, which assesses how well the fitted model reproduces a chosen feature of the observed data. It compares the observed data with 100 replicated datasets generated from the posterior distributions of the parameters. For each simulated dataset, the proportion of zero crashes is computed, and the blue histogram shows the simulated distribution of these proportions. The black solid vertical line is the proportion of zero crashes in the observed data. When the observed proportion (black solid line) is near the center of the histogram, the model reproduces the share of zeros well, which indicates good fit in this respect.
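The logic of this check can be sketched without a fitted model. In the base R example below, the observed counts and the posterior draws (mu_draws, size_draws) are simulated and hypothetical, and a single mean is used for all observations, unlike a regression model in which the mean varies by observation. The sketch only shows the mechanics: simulate one replicated dataset per draw, compute prop_zero() for each, and locate the observed statistic within that distribution.

```r
set.seed(42)
prop_zero <- function(y) mean(y == 0)

# Hypothetical "observed" crash counts and hypothetical posterior draws
n_obs      <- 500
y_obs      <- rnbinom(n_obs, mu = 0.3, size = 0.8)
mu_draws   <- rlnorm(100, meanlog = log(0.3), sdlog = 0.05)
size_draws <- rlnorm(100, meanlog = log(0.8), sdlog = 0.10)

# One replicated dataset per posterior draw, summarised by its share of zeros
rep_stat <- mapply(
  function(mu, size) prop_zero(rnbinom(n_obs, mu = mu, size = size)),
  mu_draws, size_draws
)

obs_stat <- prop_zero(y_obs)
# Share of replicated statistics at or above the observed one;
# values near 0 or 1 would suggest the model misfits the proportion of zeros
ppc_tail <- mean(rep_stat >= obs_stat)
```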

fit_loo = loo(fit)
fit_loo

The block above reports the expected log predictive density (elpd_loo), the estimated effective number of parameters (p_loo), and the LOO information criterion (looic) for a new dataset, all obtained from Pareto smoothed importance-sampling leave-one-out (PSIS-LOO) cross-validation (CV). They can be used to check goodness of fit and to compare different models (Vehtari, Gelman, and Gabry 2017, 2015). In a well-specified model, p_loo should be close to the total number of parameters. Similar to the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), the looic in the output can be used to compare models, with lower values indicating better expected predictive performance.
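For reference, the three quantities are related as follows, using the definitions in Vehtari, Gelman, and Gabry (2017). Here \(y_{-i}\) denotes the data with observation \(i\) left out, and \(\widehat{\mathrm{lpd}}\) is the log pointwise predictive density computed on the full data; the notation is a summary and not output from the code above.

```latex
\begin{aligned}
\mathrm{elpd}_{\mathrm{loo}} &= \sum_{i=1}^{n} \log p(y_i \mid y_{-i}), \\
\widehat{\mathrm{lpd}} &= \sum_{i=1}^{n} \log p(y_i \mid y), \\
p_{\mathrm{loo}} &= \widehat{\mathrm{lpd}} - \widehat{\mathrm{elpd}}_{\mathrm{loo}}, \\
\mathrm{looic} &= -2\, \widehat{\mathrm{elpd}}_{\mathrm{loo}}.
\end{aligned}
```

Because looic is just elpd_loo multiplied by -2, a higher elpd_loo and a lower looic carry the same information.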

fit_new <- stan_glm(
  crash ~ SCE + speed + age + gender,
  offset = log(distance / 1000),
  data = data,
  family = neg_binomial_2,
  prior = normal(0, 10),
  prior_intercept = normal(0, 10),
  QR = TRUE,
  iter = 4000,
  chains = 4,
  cores = 4,
  seed = 123
)

fit_new_loo = loo(fit_new)

loo::compare(fit_loo, fit_new_loo)

With the two fitted models, fit and fit_new, researchers can compare expected predictive accuracy using compare() from the loo package, as shown above. It reports the difference in elpd_loo (elpd_diff): a positive elpd_diff favors the second model, while a negative elpd_diff favors the first model.
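The arithmetic behind such a comparison can be sketched from pointwise contributions. In the base R example below, the vectors elpd_i_first and elpd_i_second are simulated and hypothetical; with real fits, the pointwise values would come from the PSIS-LOO computation. The standard error follows the paired-difference form described by Vehtari, Gelman, and Gabry (2017).

```r
set.seed(7)
n <- 200

# Hypothetical pointwise elpd contributions for two models on the same data
elpd_i_first  <- rnorm(n, mean = -0.60, sd = 0.30)
elpd_i_second <- elpd_i_first + rnorm(n, mean = 0.02, sd = 0.05)

# Second model minus first model, observation by observation
diff_i    <- elpd_i_second - elpd_i_first
elpd_diff <- sum(diff_i)              # positive values favor the second model
se_diff   <- sqrt(n) * sd(diff_i)     # standard error of the summed difference

c(elpd_diff = elpd_diff, se_diff = se_diff)
```

Reading elpd_diff alongside se_diff helps judge whether the difference is large relative to its uncertainty.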

Session information

The R session information when building this website is shown below:

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## system code page: 936
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrgram_1.13     scales_1.1.0      corrplot_0.84     psych_1.9.12.31  
## [5] kableExtra_1.1.0  lubridate_1.7.8   ggplot2_3.3.0     dplyr_0.8.5      
## [9] data.table_1.12.8
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6       lattice_0.20-41    gtools_3.8.2       assertthat_0.2.1  
##  [5] digest_0.6.25      foreach_1.5.0      R6_2.4.1           evaluate_0.14     
##  [9] highr_0.8          httr_1.4.1         pillar_1.4.3       gplots_3.0.3      
## [13] rlang_0.4.5        rstudioapi_0.11    gdata_2.18.0       rmarkdown_2.1     
## [17] webshot_0.5.2      readr_1.3.1        stringr_1.4.0      munsell_0.5.0     
## [21] compiler_4.0.0     xfun_0.13          pkgconfig_2.0.3    mnormt_1.5-7      
## [25] htmltools_0.4.0    tidyselect_1.0.0   gridExtra_2.3      tibble_3.0.1      
## [29] seriation_1.2-8    codetools_0.2-16   dendextend_1.13.4  viridisLite_0.3.0 
## [33] crayon_1.3.4       withr_2.2.0        MASS_7.3-51.5      bitops_1.0-6      
## [37] grid_4.0.0         nlme_3.1-147       gtable_0.3.0       lifecycle_0.2.0   
## [41] registry_0.5-1     pacman_0.5.1       magrittr_1.5       KernSmooth_2.23-16
## [45] stringi_1.4.6      farver_2.0.3       viridis_0.5.1      xml2_1.3.2        
## [49] ellipsis_0.3.0     generics_0.0.2     vctrs_0.2.4        iterators_1.0.12  
## [53] tools_4.0.0        glue_1.4.0         purrr_0.3.4        gclus_1.3.2       
## [57] hms_0.5.3          parallel_4.0.0     yaml_2.2.1         colorspace_1.4-1  
## [61] cluster_2.1.0      caTools_1.18.0     TSP_1.1-10         rvest_0.3.5       
## [65] knitr_1.28

Reference

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2015. “Pareto Smoothed Importance Sampling.” arXiv Preprint arXiv:1507.02646.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.

Supplemental Materials

The association between crashes and safety-critical events:
synthesized evidence from crash reports and naturalistic driving data among commercial truck drivers

2020-05-12
