The snippet below lists the R packages and functions used in this research. For convenience, we use the pacman package, which installs (if needed) and loads the required packages in one step. Make sure that pacman is installed on your system by running install.packages("pacman") before running this code chunk.
rm(list = ls()) # clear global environment
graphics.off() # close all graphics
library(pacman) # needs to be installed first
# p_load is equivalent to combining both install.packages() and library()
p_load(dataPreparation, DataExplorer, DT, tidyverse, MVN)
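For readers who prefer not to depend on pacman, the install-then-load behaviour described above can be approximated in base R. This is a minimal sketch, not pacman's actual implementation; the helper name load_or_install is hypothetical, and it assumes a CRAN mirror is configured whenever a package actually has to be installed.

```r
# Hypothetical helper (not part of pacman): install a package only if it is
# missing, then attach it. Assumes a CRAN mirror is set if installation is needed.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  # Report which of the requested packages are now loaded
  invisible(pkgs %in% loadedNamespaces())
}

# Demonstrated on packages that ship with R, so nothing is downloaded
loaded <- load_or_install(c("stats", "utils"))
```

In the analysis itself, the same call would take the package names passed to p_load() as a character vector.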
As an illustration of the statistical (process control) opportunities in improving the utilization of human-generated data for cyber security applications, let us examine the dataset from Mohamed and Saxena (2016).
source("functions/functions.R") # to load data.types()
df <- read.csv("data/dcg_fetears_all.csv")
cat(paste("We have read the full dataset into a data.frame titled df.", "The df consists of", nrow(df), "rows and",
ncol(df), "columns.", "Additionally, R initially assigns the columns to different types. We summarize these in the table below."))
We have read the full dataset into a data.frame titled df. The df consists of 8651 rows and 67 columns. Additionally, R initially assigns the columns to different types. We summarize these in the table below.
# First tab, where we summarize the column types
cat(paste("###","Column Types","{-}","\n"))
types <- data.types(df) # see functions.R file
types
cat("\n") # Printing a line break
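The data.types() helper is defined in functions/functions.R, which is not reproduced here. As a rough illustration of the kind of summary it returns, the sketch below counts columns by their primary class in base R. The function name summarize_types and the toy data frame are assumptions made for this example only; the real helper may report different or additional information.

```r
# Hypothetical stand-in for data.types(); the actual helper lives in functions.R
summarize_types <- function(data) {
  # Primary class of each column (e.g. "integer", "numeric", "character")
  col_classes <- vapply(data, function(x) class(x)[1], character(1))
  counts <- table(col_classes)
  data.frame(type = names(counts),
             n_columns = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy data frame (illustrative only, not the study data)
toy_types <- data.frame(id = 1:3,
                        speed = c(0.5, 1.2, 0.9),
                        label = c("a", "b", "c"),
                        stringsAsFactors = FALSE)
type_summary <- summarize_types(toy_types)
type_summary
```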
## Second Tab (Missing Data)
cat(paste("###","Missing Data","{-}","\n"))
cat("In the plot below, we sample 40 columns at random from the dataset to show the actual percentage of the data that is missing for each variable. The colors are used to denote the data quality for that column using a traffic light scheme (where green is good and red is bad).")
In the plot below, we sample 40 columns at random from the dataset to show the actual percentage of the data that is missing for each variable. The colors are used to denote the data quality for that column using a traffic light scheme (where green is good and red is bad).
df.na.plot <- df[,sample(colnames(df),40)] %>% plot_missing()
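The percentages shown by plot_missing() can also be computed directly, which is useful when exact numbers are needed rather than a plot. The sketch below uses a toy data frame (an assumption, not the study data) and an arbitrary seed so that the random column sample is reproducible across runs.

```r
# Toy data frame with some missing values (illustrative only)
toy_na <- data.frame(a = c(1, NA, 3, 4),
                     b = c(NA, NA, "x", "y"),
                     c = 1:4,
                     stringsAsFactors = FALSE)

# Percentage of missing values per column
pct_missing <- colMeans(is.na(toy_na)) * 100

# Fix the seed (arbitrary value) so the same columns are sampled each run
set.seed(2020)
sampled_cols <- sample(colnames(toy_na), 2)
pct_missing[sampled_cols]
```

Calling set.seed() before the sample() call in the chunk above would likewise make the 40 sampled columns stable between renders.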
## Third Tab (Clean Data)
cat(paste("###","Clean Data","{-}","\n"))
cat("Using fastFilterVariables() from the dataPreparation R package, we can remove: (a) constant columns, which take the same value in every row; (b) double columns, which have an exact copy in the dataset; and (c) bijection columns, for which another column contains exactly the same information (possibly coded differently), for example col1: Men/Women and col2: M/W. The results of this analysis are saved into a data frame titled df.cleaned.")
Using fastFilterVariables() from the dataPreparation R package, we can remove: (a) constant columns, which take the same value in every row; (b) double columns, which have an exact copy in the dataset; and (c) bijection columns, for which another column contains exactly the same information (possibly coded differently), for example col1: Men/Women and col2: M/W. The results of this analysis are saved into a data frame titled df.cleaned.
df.cleaned <- fastFilterVariables(df)
[1] "fastFilterVariables: I check for constant columns."
[1] "fastFilterVariables: I delete 1 constant column(s) in dataSet."
[1] "fastFilterVariables: I check for columns in double."
[1] "fastFilterVariables: I check for columns that are bijections of another column."
[1] "fastFilterVariables: I delete 15 column(s) that are bijections of another column in dataSet."
cat(paste0("The data frame df.cleaned consists of ", ncol(df.cleaned), " columns. Note that the original data frame df had ", ncol(df), " columns."))
The data frame df.cleaned consists of 51 columns. Note that the original data frame df had 67 columns.
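To make the three redundancy checks concrete, the following base R sketch applies them to a small toy data frame. It is a simplified illustration under stated assumptions, not the algorithm used inside dataPreparation: the helper is_bijection and the toy columns are hypothetical, and exact copies are treated here as a special case of bijections.

```r
# Toy data frame (illustrative only, not the study data)
toy <- data.frame(
  id         = 1:6,
  const      = rep(1, 6),                                   # constant column
  sex        = c("Men", "Women", "Men", "Women", "Women", "Men"),
  sex_short  = c("M", "W", "M", "W", "W", "M"),             # bijection of sex
  score      = c(2.5, 3.1, 2.5, 4.0, 3.3, 1.9),
  score_copy = c(2.5, 3.1, 2.5, 4.0, 3.3, 1.9),             # exact copy of score
  stringsAsFactors = FALSE
)

# (a) Constant columns: a single distinct value
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))

# (b)/(c) Two columns carry the same information when their values map
# one-to-one: the number of distinct pairs equals the number of distinct
# values in each column. An exact copy satisfies this trivially.
is_bijection <- function(x, y) {
  n_pairs <- nrow(unique(data.frame(x = x, y = y)))
  n_pairs == length(unique(x)) && n_pairs == length(unique(y))
}

candidates <- names(toy)[!is_constant]
redundant <- character(0)
for (j in seq_along(candidates)[-1]) {
  for (i in seq_len(j - 1)) {
    if (candidates[i] %in% redundant) next  # skip columns already dropped
    if (is_bijection(toy[[candidates[i]]], toy[[candidates[j]]])) {
      redundant <- c(redundant, candidates[j])  # keep the earlier column
      break
    }
  }
}

dropped <- c(names(toy)[is_constant], redundant)
toy_cleaned <- toy[, setdiff(names(toy), dropped), drop = FALSE]
names(toy_cleaned)
```

The pairwise comparison is quadratic in the number of columns, which is acceptable for a sketch but is one reason to prefer the packaged function on wide datasets.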
saveRDS(df.cleaned, "results/sec_cleaned.RDS")
df.cleaned <- readRDS("results/sec_cleaned.RDS") %>%
  subset(select = c(ID, speed_touch_mean,
                    pause_and_drop_mean))
df.cleaned.num <- select_if(df.cleaned, is.numeric) %>%
  slice(1:5000)
mvn(df.cleaned.num, mvnTest = "hz", univariatePlot = "qqplot",
    multivariatePlot = "contour")$multivariateNormality
Mohamed, Manar, and Nitesh Saxena. 2016. “Gametrics: Towards Attack-Resilient Behavioral Authentication with Simple Cognitive Games.” In Proceedings of the 32nd Annual Conference on Computer Security Applications, 277–88. ACM.
Email: fmegahed@miamioh.edu | Phone: +1-513-529-4185 | Website: Miami University Official
Email: farmerl2@miamioh.edu | Phone: +1-513-529-4823 | Website: Miami University Official
Email: miao.cai@slu.edu | Phone: +1-314-326-8418 | Website: Saint Louis University
Email: steve.rigdon@slu.edu | Phone: +1-314-977-8127 | Website: Saint Louis University Official
Email: mohamem@miamioh.edu | Phone: +1-513-529-0346 | Website: Miami University Official