Artificial Intelligence in Cybersecurity:
Concentration on the Effectiveness of Machine
Learning
By: Preston Pham
December 7, 2021
Submitted in Partial Fulfillment of the Requirements for the Doctor of Education degree.
St. Thomas University
Miami Gardens, Florida
Copyright© 2021 by Preston Pham
All Rights Reserved
Copyright Acknowledgement Form
St. Thomas University
I, the writer’s full name, understand that I am solely responsible for the content of this dissertation and its use of copyrighted materials. All copyright infringements and issues are solely my responsibility as the author of this dissertation and not St. Thomas University, its programs, or libraries.
________________________ _____________________
Signature of Author Date
________________________ _____________________
Witness (Martin Nguyen) Date
St. Thomas University Library Release Form
Artificial Intelligence in Cybersecurity – Concentration on the Effectiveness of Machine
Learning
Preston Pham
I understand that US Copyright Law protects this dissertation against unauthorized use. By
my signature below, I am giving permission to St. Thomas University Library to place this
dissertation in its collections in both print and digital forms for open access to the wider
academic community. I am also allowing the Library to photocopy and provide a copy of
this dissertation for the purpose of interlibrary loans for scholarly purposes and to migrate
it to other forms of media for archival purposes.
________________________ _____________________
Signature of Author Date
________________________ _____________________
Witness (Martin Nguyen) Date
St. Thomas University Dissertation Manual Acknowledgement Form
Artificial Intelligence in Cybersecurity – Concentration on the Effectiveness of Machine
Learning
Preston Pham
By my signature below, I, Preston Pham, assert that I have read the dissertation publication
manual, that my dissertation complies with the University’s published dissertation
standards and guidelines, and that I am solely responsible for any discrepancies between
my dissertation and the publication manual that may result in my dissertation being
returned by the library for failure to adhere to the published standards and guidelines
within the dissertation manual. The Dissertation Publication Manual may be found:
________________________ _____________________
Signature of Author Date
________________________ _____________________
Signature of Chair Date
Abstract
Modern networks drive for ubiquitous connectivity and digitalization in support
of globalization but also, inadvertently and unavoidably, create a fertile ground for the
rise in scale and volume of cyberattacks. Countermeasures to these advanced attacks have
never been more crucial than in our present time; hence Artificial Intelligence (AI), a technological breakthrough, can help augment protective techniques for the defensive side of cybersecurity. AI improves its knowledge by detecting the patterns and
relationships among data and learns through the data to build self-learning algorithms. It
analyzes relationships between threats like malicious network traffic, suspicious internet
protocol (IP) addresses, or malware files within minutes or even seconds to provide the
intelligence to the organization for quicker response to a threat event than traditional
labor-intensive methods. This paper is intended to explore the phenomenon of AI in
cybersecurity and determine whether the present stage of AI technology and in particular
Machine Learning can help improve cybersecurity. The paper has two main objectives of
testing AI’s threat classification ability against a human cybersecurity analyst and AI’s
prediction ability of future threat events against a renowned time-series data-forecasting
model, the autoregressive integrated moving average (ARIMA) statistical model.
Keywords: cybersecurity, artificial intelligence, classification, prediction, ARIMA
Acknowledgments
It is with genuine pleasure that I would like to express my deep sense of gratitude
and give my warmest thanks to my former professors and committee members, who include Dr. Lisa J. Knowles, Dr. Joseph M. Pogodzinski, and Dr. Jose G. Rocha. Their dedication, meticulous scrutiny, and scholarly advice have helped me to
accomplish my dissertation paper.
I would like to profoundly acknowledge Dr. Knowles, my dissertation chair, for
her kindness, enthusiasm, positivity, and dynamism. Dr. Knowles relentlessly helped me
manage every step along the way to ensure I completed all my chapters and the work
overall during the time of a global pandemic in 2020-2021. I also thank my former
manager Mr. Raymond Lee, Director of Information Security, for suggesting necessary
technological advice during my research pursuit in writing about the topic of
cybersecurity.
Dedication
The study serves as time-capsule literature to show how a doctorate student at St.
Thomas University performed a study on Artificial Intelligence within the domain of
cybersecurity using the current technologies of the era. The study seeks to be a reference
guide for both academia and industry leaders to further extend the research and
applications of Artificial Intelligence in cybersecurity. The paper is also dedicated to the
men and women in the cybersecurity industry around the globe who are actively fighting
against cybercriminals to protect their organizations, institutions, or government
agencies.
Table of Contents
Copyright Acknowledgement Form ……………………………………………………………………. iii
St. Thomas University Library Release Form………………………………………………………… iv
St. Thomas University Dissertation Manual Acknowledgement Form ………………………… v
List of Tables…………………………………………………………………………………………………..xii
List of Figures ………………………………………………………………………………………………. xiii
List of Formulas …………………………………………………………………………………………….. xiv
CHAPTER ONE. INTRODUCTION……………………………………………………………………. 1
Introduction to the Problem ……………………………………………………………………….1
Background, Context, and Theoretical Framework ………………………………………..2
Statement of the Problem ………………………………………………………………………….5
Purpose of the Study ………………………………………………………………………………..5
Research Question …………………………………………………………………………………..6
Rationale, Relevance, and Significance of the Study ……………………………………..6
Nature of the Study ………………………………………………………………………………….7
Definition of Terms ………………………………………………………………………………….7
Assumptions, Limitations, and Delimitations ……………………………………………….8
Organization of the Remainder of the Study …………………………………………………9
Chapter One Summary …………………………………………………………………………… 10
CHAPTER TWO. LITERATURE REVIEW ……………………………………………………….. 12
Introduction to the Literature Review ……………………………………………………….. 12
Review of Research Literature ………………………………………………………………… 14
Chapter Two Summary ………………………………………………………………………….. 22
CHAPTER THREE METHODOLOGY ……………………………………………………………… 25
Introduction to Methodology …………………………………………………………………… 25
Purpose of Study …………………………………………………………………………………… 27
Research Questions ……………………………………………………………………………….. 27
Research Design …………………………………………………………………………………… 27
Data Collection and Data Analysis Procedures …………………………………………… 30
Target Population, Sampling Method, and Related Procedures ……………………… 33
Instrumentation …………………………………………………………………………………….. 34
Limitations of the Research Design ………………………………………………………….. 37
Data Validity Test …………………………………………………………………………………. 38
Expected Findings …………………………………………………………………………………. 41
Ethical Issues ……………………………………………………………………………………….. 42
Conflict of Interest Assessment ……………………………………………………………….. 42
Chapter Three Summary ………………………………………………………………………… 42
CHAPTER FOUR DATA ANALYSIS AND RESULTS ……………………………………….. 44
Introduction to Data Analysis and Results …………………………………………………. 44
AI vs. Human Analysis in Classification of Threat Events ……………………………. 44
Detailed Analysis (AI vs. Human Analysis in Classification of Threat Events) … 47
AI vs. ARIMA Statistical Computation in Prediction of Threat Events …………… 48
Detailed Analysis (AI vs. ARIMA Statistical Computation to Predict Threat
Events) ………………………………………………………………………………………………… 52
Chapter Four Summary ………………………………………………………………………….. 53
CHAPTER FIVE. CONCLUSIONS AND DISCUSSION ……………………………………… 56
Introduction to Conclusions and Discussion ………………………………………………. 56
Summary of the Results …………………………………………………………………………. 57
Discussion of the Results ……………………………………………………………………….. 58
Discussion of the Results in Relation to the Literature …………………………………. 60
Limitations …………………………………………………………………………………………… 61
Implication of the Results for Practice ………………………………………………………. 62
Recommendations for Further Research ……………………………………………………. 64
Conclusion …………………………………………………………………………………………… 65
APPENDIX A. INSTITUTIONAL REVIEW BOARD (IRB) …………………………………. 67
REFERENCES ……………………………………………………………………………………………….. 68
List of Tables
Table 1. 10 Intrusion Categories with Depiction of Training and Testing Samples …. 30
Table 2. Results of Precision, Recall, and F1-Score for Classifiers ……………………….. 40
Table 3. Results of P-Values Stationary Test…………………………………………………….. 41
Table 4. Ordinary Least Squares Regression: AI vs. Cybersecurity Analyst Results … 46
Table 5. Number of Intrusion Events Detected and Average Time of AI vs.
Cybersecurity Analyst …………………………………………………………………………………… 47
Table 6. Spearman’s Rank Correlation Estimation Results ………………………………….. 50
Table 7. Trend to Month Translog Estimation Results………………………………………… 52
List of Figures
Figure 1. Branches of Artificial Intelligence ……………………………………………………….. 14
Figure 2. Methodological Literature Used …………………………………………………………… 22
Figure 3. Honeypot Network Architecture Diagram ……………………………………………… 28
Figure 4. Data Collection and Analysis Workflow Diagram …………………………………… 29
Figure 5. Configuration of Log and Netflow Forwarding ………………………………………. 30
Figure 6. Example of a Firewall Log ………………………………………………………………….. 31
Figure 7. Example of Raw Netflow Data ……………………………………………………………. 31
Figure 8. Display of Log Highlights when Unstructured in Corelight ………………………. 32
Figure 9. Display of Log Parsing when Structured in Elastic ………………………………….. 33
Figure 10. AI vs. Cybersecurity Analyst Intrusion Detection Regression Graph ………… 45
Figure 11. AI vs. Cybersecurity Analyst Intrusion Prediction Regression Graph ……….. 49
List of Formulas
Formula 1. Precision, Recall, and F1-Score Calculation Model ……………………………. 38
Formula 2. Ordinary Least Squares Regression for Severity Score ………………………… 46
Formula 3. Spearman’s Rank Correlation Estimation Model ………………………………… 50
Formula 4. Trend to Month Translog Estimation Model …………………………………….. 51
CHAPTER ONE. INTRODUCTION
Introduction to the Problem
Today’s world is highly network interconnected with the pervasiveness of small
personal devices (e.g., smartphones) as well as large computing devices or services (e.g.,
cloud computing). Each passing day, millions of data bytes are being generated,
processed, exchanged, and consumed by various applications within the cyberspace.
Thus, securing the data and users’ privacy on the world wide web has become an utmost
concern for individuals, business organizations, and national governments (Benavente-
Peces & Bartolini, 2019). With the massive amount of data that travels over the Internet,
it is also a great opportunity for cyber criminals to take advantage of this phenomenon to attack various organizations’ networks. An ever-growing percentage of cyberattacks is explicitly targeted at specific organizations to steal intellectual property or sensitive data; perform espionage; and execute industrial sabotage or denial of service (Apruzzese, et al., 2018). Although organizations can employ human analysts to detect threat agents on their networks, the amount of time for a human analyst to triage the
malicious activities could take hours, days, or even months of correlating between
multiple data points to identify true positive threat events (Benavente-Peces & Bartolini,
2019). Thus, organizations are now looking at a new prodigy of technological discipline:
Artificial Intelligence (AI) which can gather knowledge by detecting the patterns and
relationships among data, then learn through data architectures to build self-learning
algorithms (Virmani et al., 2020). AI can analyze relationships between threats like
malicious network traffic, suspicious IP addresses, or malware files in seconds or minutes
and provide the intelligence to organizations for quicker response to threat events
(Apruzzese, et al., 2018).
Background, Context, and Theoretical Framework
With the rapid expansion in support of globalization, modern networks drive for
ubiquitous connectivity and digitalization, but also, simultaneously and unavoidably,
create a fertile ground for the rise in scale and volume of cyberattacks. Escalating their threats with diversified and sophisticated tactics, cyber criminals and nation-state attackers target the systems that run our day-to-day lives and easily exposed targets (Al
Qahtani, 2020). Countermeasures to these advanced attacks have never been more crucial
than in our present time; thus, with AI, learning new cyberattack vectors can help
augment protective techniques for the defensive side of cybersecurity. Defense in
cybersecurity can be a set of technologies and processes designed to protect systems,
networks, applications, and data from unauthorized access, alteration, or destruction
(Tyugu, 2011). A cybersecurity defense system consists of a network-based security
system and a host (computer-based) security system. Each of these systems includes
firewalls, antivirus software, intrusion detection and prevention systems (Al Qahtani,
2020). These systems are intended to block certain unwanted traffic; determine and
identify unauthorized system or user behaviors; analyze and distinguish everyday
baseline versus an anomalous event; then lastly eradicate or contain the malicious agent
from further executions.
Calderon and Floridi (2019) believe that AI can improve cybersecurity and
defense measures, allowing for greater system robustness, resilience, and recognition.
First, AI can improve systems’ robustness with the ability of a system to maintain its
stable configuration and settings even when it has processed erroneous inputs. Secondly,
AI can strengthen systems’ resilience, that is, the ability of a system to resist and tolerate
an attack without fatal failure or shutdown. Third, AI can be used to enhance system
recognition or detection, in terms of the capacity for a system to discover autonomous
intrusion behaviors and self-identification of vulnerabilities (Calderon & Floridi, 2019).
According to Banoth, et al. (2017), the driving forces boosting the use of AI in cybersecurity comprise: (1) speed of impact: In some of the major attacks, the
time of impact on an organization is unpredictable. Today’s attacks are not just targeting
one specific system or certain vulnerability; the attackers can maneuver and change their
targets once they have penetrated the network. These types of attacks occur incredibly
quickly and not many human interactions can counteract the velocity of impact. (2)
Operational complexity is another concern, given the proliferation of cloud computing
and the fact that those platforms and services are operationalized and delivered very
quickly in the millisecond range. This level of complexity overwhelms human operators; therefore, these actions can only be performed by machines matching another machine’s prowess. (3) Skills gaps in the cybersecurity workforce remain an
ongoing challenge: There is a global shortage of cybersecurity experts. The level of
scarcity has pushed industry to automate processes at a faster pace (Banoth, et al., 2017).
Realizing the crucial impact of AI today, AI (and in particular, Machine Learning in
cybersecurity) became the focus for this research.
AI is the science that enables computers and machines to learn, judge, and predict
based on its own logic (Virmani, et al., 2020). As technology becomes more sophisticated,
the demand for AI is growing because of its ability to solve complex problems within a
limited amount of time. AI equips a machine with the technical expertise to learn and deploy new theories, methods, and techniques that aim to simulate and extend human intelligence (Conner-Simons, 2016). There has been a big
breakthrough in the field of AI due to advances in big data and graphic processing units
(GPUs) which have helped AI to grow exponentially in the last two decades (Sarker,
Organizations can now benefit from AI’s cognitive ability to become a subject-matter expert in a relatively short time through self-training. Through repeated
use, the system will provide increasingly accurate responses, eventually eclipsing the
accuracy of human expertise (Mittal, et al., 2019). As the intelligence of machines and
the use of digital sensor data improve, various fields of science can use AI to understand
a wide range of collective information (Hussain, et al., 2020). AI is now being applied in
a variety of business industries, with underlying technological subsets such as natural
language processing, robotics, and computer vision. Hence, in particular regard to
cybersecurity, Truve (2017) considers AI techniques most useful for their ability to classify and predict entities and events. Automated classification of
events will help analysts prioritize where they should focus their attention. Instead of
spending significant amounts of time deciding what topics to focus on, cybersecurity
analysts can improve their forensic work with already categorized and sorted threats. In
addition, Truve believes cyber defenders today are almost always one step behind, trying
to defend or patch systems where attacks and threats already exist. With predictive
information, defenders might instead start being proactive and protect their systems
against future threats. Therefore, predictive threat intelligence is important with AI’s
capability to predict future events from historical and current data. Prediction generation
is an example of a task that is hard or even impossible for a human analyst to carry out,
due to the complexity and large volume of data needed. AI algorithms, operating at machine scale, generate predictive models that can be used to forecast events and solve such problems (Williams & McGregor, 2020).
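As a minimal illustration of the automated event classification described above (this is not the study’s Azure-hosted engine; the two features, the event data, and the thresholds are invented for illustration), a nearest-centroid classifier can already sort toy events into benign and malicious classes:

```python
# Hypothetical sketch of automated threat-event classification using a
# nearest-centroid classifier over two toy features:
# (connections per minute, distinct ports probed).

def centroid(points):
    """Mean point of a list of 2-D feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def classify(event, centroids):
    """Assign an event to the label of the nearest class centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(event, centroids[label]))

# Invented labeled training events: (connections/min, distinct ports)
benign = [(4, 1), (6, 2), (5, 1)]
malicious = [(120, 40), (200, 65), (150, 55)]  # e.g., port-scan-like traffic
centroids = {"benign": centroid(benign), "malicious": centroid(malicious)}

print(classify((180, 50), centroids))  # port-scan-like event -> "malicious"
```

A production Machine Learning engine would use far richer features and models, but the principle is the same: pre-sorted events let the analyst start forensic work immediately rather than triaging raw alerts.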
In this paper, the first task is to determine which branch of AI best applies to
cybersecurity. The overall objective is to apply the most popular branch of AI—Machine
Learning—to classify cybersecurity intrusion events against a human cybersecurity
analyst. Next, AI is tested to predict future cybersecurity intrusion events with time-series
datasets to determine its effectiveness in comparison to a popular time-series data-
prediction model, the autoregressive integrated moving average (ARIMA) statistical
model.
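To illustrate the kind of baseline an ARIMA model provides (this sketch is not the study’s actual statistical model, and the daily intrusion counts are invented), the following hand-rolled ARIMA(1,1,0)-style forecast differences the series once, fits an AR(1) coefficient by least squares, and then undoes the differencing:

```python
# Illustrative one-step-ahead forecast in the spirit of ARIMA(1,1,0):
# d=1 differencing removes trend, then an AR(1) term models the differences.
# Real analyses would use a full ARIMA implementation (e.g., statsmodels).

def arima_110_forecast(series):
    """One-step-ahead forecast from a minimal ARIMA(1,1,0)-style model."""
    # d=1: first-difference the series to remove trend
    diffs = [b - a for a, b in zip(series, series[1:])]
    # AR(1) on the differences, fit by least squares:
    # phi = sum(x_t * x_{t-1}) / sum(x_{t-1}^2)
    num = sum(x * y for x, y in zip(diffs, diffs[1:]))
    den = sum(x * x for x in diffs[:-1])
    phi = num / den if den else 0.0
    # Forecast the next difference, then undo the differencing
    return series[-1] + phi * diffs[-1]

counts = [12, 15, 14, 18, 21, 20, 24, 27]  # invented daily intrusion counts
print(round(arima_110_forecast(counts), 2))  # -> 27.58
```

The point of the comparison in this study is whether a trained Machine Learning model can beat this style of purely statistical extrapolation on real intrusion time series.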
Statement of the Problem
AI is adopted in a wide range of domains where it shows its superiority over traditional rule-based algorithms and manual human knowledge analysis (Benavente-Peces & Bartolini, 2019), and the complete automation of detecting and analyzing cybersecurity threats and predicting future attacks is an enticing goal. Yet, the efficacy and accuracy of AI in cybersecurity must be evaluated with due diligence based on real-life data.
Purpose of the Study
With the cognitive data-processing capability of Machine Learning, AI is a great complement to defensive cybersecurity systems, which can then better detect and defend against modern cyberwarfare (Truve, 2017). The purpose of the study is to examine the
phenomenon of AI in cybersecurity, research its implications in the business world, and
determine whether the present stage of AI technology—in particular, Machine
Learning—can help improve cybersecurity.
Research Question
This research paper focuses on the basic question: What branch of AI is most
applicable to cybersecurity? From this main question emerges the following sub-
questions: How accurate is Machine Learning currently, and can it be beneficial to
cybersecurity? What is the accuracy rate for AI to classify intrusion events versus a
human cybersecurity analyst? When AI is used to predict future intrusion events, what is
the accuracy rate when compared to a time-series prediction statistical ARIMA model?
Rationale, Relevance, and Significance of the Study
In a computing context, the world of information technology has undergone
massive shifts in technology in recent years; the power of high-performance computers and big data analytics has been the driving factor behind these changes (Sarker, 2020). With cyberattacks trending upward on the frontiers of many organizations, cybersecurity arguably is the discipline that could benefit most from the
introduction of AI (Calderon & Floridi, 2019). Hence, this research is significant and
relevant to the cybersecurity community. The research is a technical paper that uniquely
designed a small Machine Learning engine with threat-detection algorithms based on
collected data from a dedicated honeypot network environment. Additionally, the
Machine Learning engine is trained with a large amount of data and has an integration
with threat intelligence feeds where the machine self-learns then provides analysis and
predictive results from the data.
Nature of the Study
AI has become a hot topic and keyword in recent years; it is being adopted and
widely used in various fields of science (Parrend, et al., 2018). Since AI itself has many
subsets of technology, literature was reviewed on multiple historical AI-related study
cases to determine what AI branch was the most widely used and applicable to
cybersecurity. From there, AI’s ability to classify network threats was compared against a
human cybersecurity analyst, then AI’s ability to forecast future threats was compared
against a well-known ARIMA statistical formula. In order to accomplish this, a dedicated
honeypot network environment was set up to collect firewall logs and netflow data in
order to train and use real-life examples to test the Machine Learning engine hosted on
Microsoft’s Azure Artificial Intelligence Web Service.
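As a hedged illustration of the preprocessing such a pipeline requires (the log format, field names, and addresses below are invented; the study’s actual parsing relied on Corelight and Elastic), a raw firewall log line can be turned into a structured record suitable for training:

```python
# Hypothetical sketch: parse a raw firewall log line into a feature record.
# The key=value format shown here is invented for illustration.
import re

LOG_RE = re.compile(
    r"(?P<ts>\S+) action=(?P<action>\w+) src=(?P<src>[\d.]+) "
    r"dst=(?P<dst>[\d.]+) dport=(?P<dport>\d+)"
)

def parse_line(line):
    """Return a dict of fields from one log line, or None if malformed."""
    m = LOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["dport"] = int(rec["dport"])  # numeric feature for the model
    return rec

line = "2021-11-02T10:15:04Z action=deny src=203.0.113.7 dst=10.0.0.5 dport=3389"
rec = parse_line(line)
print(rec["action"], rec["dport"])  # -> deny 3389
```

Records like these, aggregated across the honeypot’s firewall and netflow feeds, are what a Machine Learning service consumes as labeled training and testing samples.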
Definition of Terms
Anomaly: An activity that deviates from what is standard, normal, or expected from
the normal behaviors of systems, network traffic, and system resources (National
Initiative for Cybersecurity Careers and Studies [NICCS], 2018).
Big Data: Extremely large data sets or data points that may be analyzed
computationally to gain insights on patterns, trends, behaviors and interactions (NICCS,
2018).
Cloud: On-demand availability of computer system resources accessible over the
internet, especially networking or computing power, without direct or physical
management by the user (NICCS, 2018).
Cloud Services: Software or program services that are accessible over the
internet (NICCS, 2018).
Cyberattack: An act of assault by which an entity intends to evade security
services in order to damage or destroy a computer network or system (NICCS, 2018).
Intelligence Source: A reliable information source where cybersecurity defensive
systems can absorb information about the latest malware algorithms, attack patterns, etc.
(NICCS, 2018).
Intrusion: A security incident in which an entity attempts to circumvent security
services in order to gain access to a system or system resource without having proper
authorization (NICCS, 2018).
Netflow: Data about network protocols and IP traffic collected by a network device as packets enter or exit an interface (NICCS, 2018).
Network Traffic: Data transmissions in the form of packets sent over the
network from a sender host to a recipient host (NICCS, 2018).
System Log: Data of informational, error, or warning events related to the
behaviors of a computer system and its resources (NICCS, 2018).
Threat: A potential entity that has the intention to cause adverse effects through
unauthorized access to disrupt, damage, or steal from a network, computer system, or
resources (NICCS, 2018).
Assumptions, Limitations, and Delimitations
The study of AI is complicated due to the lack of data available, firstly, as the
phenomenon is very new and, secondly, companies that utilize this technology are often
privately held and are not obliged to disclose the results to the public. Even if some
companies that employ AI are willing to publish their results, due to the small number of
those companies worldwide, the quantitative data on this topic would be limited (Patel,
2017). Existing public datasets that may be found on the Internet also face the problems
of uneven data and outdated content, from the technology used to the data collection methods. In addition, most of the data have already been aggregated. Hence, it is imperative to collect
sufficient and proper raw security data in order to build an AI engine (Bhuyan, et al.,
2014). Furthermore, the study’s budget constraint meant that it was not possible to
implement the best-of-breed AI system on the market.
For these reasons, this study is limited to investigation of the performance of
Azure Artificial Intelligence Web Service. It is important to note that the results of this
research may not be extrapolated to represent AI widely, as AI types differ from one another. The research is also limited to the linear methods used, such as the ARIMA
statistical model compared to AI. Thus, the conclusion can only be made for one branch of AI, since this study concentrates on Machine Learning and its classification and prediction abilities. The Machine Learning model is a
subset within AI while this study rejects other models such as: natural language
processing, vision, robotics, pattern recognition, and convolutional neural networks,
which pose questions for further research on how other subsets of AI can be applicable to
cybersecurity. Lastly, this study is limited to the United States market, in particular to
cybersecurity, hence the results might differ if a similar study were conducted in a
different country or region.
Organization of the Remainder of the Study
The paper is structured with the following format: Chapter One serves as a prelude
to introduce the readers to the contextual background of AI and cybersecurity along with
the theoretical framework for this study. Chapter Two takes the readers through a
comprehensive understanding of AI in various fields of science and highlights the AI
branch that seems to be most commonly used in previous literature. With the AI branch
identified in Chapter Two, the study uses Machine Learning, a subset of AI, as the focal
point on how it can be applied to cybersecurity. Chapter Three provides the readers with
a concrete guideline on the methodologies of data collection and analysis that are used to
conduct the study. Chapter Four depicts the results and detailed analysis from the study
based on the methodology as discussed in Chapter Three. Chapter Five summarizes the
study’s results and discusses implications for further research.
Chapter One Summary
Cybersecurity in recent years has been a fast-growing field demanding a great
deal of attention from individuals, business organizations, and national governments as
cyber criminals are on the rise to steal intellectual properties and sensitive data (Virmani,
et al., 2020). The remarkable progress in web technologies such as cloud computation
and big data analytics has also fostered the growth of AI. AI techniques have been
applied to many areas of science due to their distinctive properties of adaptability,
scalability, and potential to rapidly adjust to new and unknown challenges (Rajbanshi, et
al., 2017). It has been hypothesized that the Machine Learning engine from AI could be
deployed to cybersecurity to address such wide-ranging problems (Truve, 2017).
Therefore, this paper attempts to amalgamate AI and cybersecurity, discussing and
highlighting AI’s applicability in cybersecurity. The objective of this paper is twofold:
first, to identify which branch of AI is most pertinent to cybersecurity through an
extensive review of past research; secondly, to assess the current maturity of AI, in
particular to Machine Learning, for cyber detection schemes. To achieve the second
objective, AI’s ability to classify network threats was compared against a human
cybersecurity analyst; then AI’s ability to forecast future threats was juxtaposed against a
prominent ARIMA statistical model.
CHAPTER TWO. LITERATURE REVIEW
Introduction to the Literature Review
AI in today’s world is progressing rapidly, with new advanced innovations day in and day out. The development of AI is accelerating and has started to change many business landscapes (Mittal, et al., 2019). Companies in industries ranging from healthcare and finance to manufacturing are focused on applying AI with automation
processes to gain new heights of efficiency and quality (Benavente-Peces & Bartolini,
2019). According to Frank (2014), one crucial aspect that might have been overlooked is
cybersecurity, which is an important factor for many businesses. Cybersecurity essentially
protects businesses’ sensitive and proprietary data from being breached. Thus,
cybersecurity can be enhanced with AI to make a superb combination (Khisimova, et al.,
2019). When used in conjunction with cybersecurity, AI can be a powerful tool for protecting against cyberattacks. Furthermore, in our Internet Age, with hackers able to commit theft or cause harm remotely while shielding or masquerading their own operations, defending against those deceivers has become more difficult than ever (Al Qahtani, 2020).
With threats ever-increasing and the number of breaches staggering, many
organizations need help on their cybersecurity frontiers. AI may be the solution to help
organizations solve this problem and heighten organizations’ cybersecurity defense
postures (Demertzis & Iliadis, 2015).
As organizations are looking towards automation to reduce manual processes, AI
can help make cybersecurity more manageable, efficient, and effective, and ultimately lower their cyber threat risk (Parrend, et al., 2018). Today, typical AI capabilities include:
speech, image and video recognition, autonomous objects, conversational agents,
prescriptive modeling, augmented creativity, smart automation, advanced simulation, as
well as complex analytics and predictions (Conner-Simons, 2016). One of the driving
forces of AI is Machine Learning, the science of getting computers to act without being explicitly programmed (Schuurmans, 1995). It takes algorithms inspired by the structure and function of the human brain to create artificial neural networks (Becue, et al., 2021). Hence, AI and cybersecurity, when combined, can create
security systems with a set of capabilities that allow organizations to detect, predict and
respond to cyber threats in real-time (Collins, 2019).
The literature review first evaluates prior studies conducted on AI in the domain
of cybersecurity. Yet, to avoid a parochial or narrow scope, the literature review also
surveys other studies of AI when used across the different business sectors of
manufacturing, product design, healthcare, customer communication, environmental
science, higher education, and finance. The literature review poses the initial question
whether Machine Learning is the right AI, then analyzes branches within AI—e.g.,
machine learning, natural language processing, vision, and robotics (see Figure 1)—to
find out which technological branch is mostly commonly used in the industries and can
be applicable to cybersecurity.
Figure 1
Branches of Artificial Intelligence
Note: Model re-designed from Kabbas, A., & Munshi, A. (2020), p. 120.
Review of Research Literature
AI in Cybersecurity
There have been previous academic works done by various researchers to
understand AI's relationship to computer security. Ghosh, et al. (1998) proposed training
a support vector machine model to detect anomalous Windows registry accesses,
using the Knowledge Discovery and Data Mining (KDD99) benchmark dataset to
evaluate the performance of their model. In other research, Kozik, et al. (2014) used the
CSIC 2010 Hypertext Transfer Protocol (HTTP) Dataset, published by the Spanish
National Research Council, to assess the classification of internet traffic. Their study
specifically focuses on traffic that uses the HTTP protocol for communication between
clients and servers. The techniques described therein look for well-known port numbers
of IP flows that are statistically abnormal but do not characterize the traffic itself. Bhuyan, et al. (2014)
introduce a new approach to create unbiased real-life network intrusion datasets in order
to compensate for the lack of available datasets. They create a significant amount of an
intrusion dataset in the development of a detection system, launching Tezpur University
Intrusion Detection System (TUIDS) distributed denial of service (DDoS) traffic to test against an older DDoS
Center for Applied Internet Data Analysis (CAIDA) dataset. Bhuyan, et al. propose an
empirical study using the K-Nearest Neighbors model in order to handle important
security metrics such as detection of both low-rate and high-rate DDoS attacks. They
conduct several experiments using significant entropy measures to distinguish DDoS attacks
from normal traffic. This methodology, known as Feature Score, consists of three
features of the network traffic: the source IPs, the variation of source IPs, and the
packet rate flow. The experimental results show that the proposed model yields 65%
detection accuracy on the normalized CAIDA dataset. Yet, the paper focuses primarily
on DDoS detection in wired networks, leaving out wireless networks,
which are another noteworthy DDoS vector. Together, the three research articles
above investigated the impact of Machine Learning-based computer solutions (early
stages of AI), although that research had limited impact. In
their conclusions and footnotes, the authors aspire for future research on Machine
Learning methods that can detect anomalous traffic, possible attack, and misuse by
analyzing the data on its own. In “A Survey of Data Mining and Machine Learning
Methods for Cyber Security Intrusion Detection,” Buczak and Guven (2016) survey
numerous articles that relate to AI in cybersecurity. Their results indicate that using AI
for cybersecurity purposes in the three main areas of intrusion detection, malware
analysis, and spam detection could be very useful. Altogether, a noble pursuit in the study
of AI in cybersecurity is largely encouraged.
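As an illustration of the entropy-based Feature Score idea in Bhuyan, et al. (2014), the sketch below (a simplified, hypothetical reconstruction, not the authors' code) computes the Shannon entropy of source IPs observed in a traffic window; a sudden drop in entropy at a high packet rate is the kind of signal such a detector exploits:

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy of the observed value distribution (e.g., source IPs).
    DDoS floods tend to concentrate traffic on few sources, driving entropy down."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical one-interval traffic windows (lists of source IPs)
normal_window = ["10.0.0.%d" % i for i in range(1, 51)]        # 50 distinct sources
ddos_window = ["203.0.113.7"] * 45 + ["10.0.0.1", "10.0.0.2"]  # one dominant source

print(shannon_entropy(normal_window) > shannon_entropy(ddos_window))  # → True
```

A real detector would track this entropy per time window alongside the packet rate and compare both against learned thresholds.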
AI in Smart Manufacturing
AI applications and robotics in smart production generate new industrial
paradigms. Cioffi, et al. (2019) provide a systematic literature review of research from
1999 to 2019 in the areas of AI and Machine Learning techniques. A mix of bibliometric,
content analysis, and social network techniques is used. Through a research and
classification process, the paper reviews, classifies, and analyzes 82 articles from the
Web of Science and SCOPUS database. Greater innovation, process optimization,
resources optimization and improved quality are the most significant benefits of using AI
that leverages Machine Learning and Robotics in the industrial sectors. The results also
emphasize emerging trends of AI with Machine Learning and Robotics in sustainable
manufacturing through the intelligent utilization of materials and energy consumption;
inventory and supply chain management; predictive maintenance; and production.
Moreover, AI with Machine Learning and Robotics also improves quality control
optimization in manufacturing systems. In addition, with the consumption of test and
manufacturing data from various systems, AI can help factories optimize their
manufacturing processes to be more efficient, systematic, and smart.
AI in Product Design
AI improves the ways companies innovate and develop new products in order to
speed up development and maintenance processes and consolidate companies’ support
functions according to Min (2015). Min lists several standards that address products’ life
cycle, management and quality control such as those from the International Organization
for Standardization (ISO) and the International Electrotechnical Commission (IEC) and
compares them to manufacturing data, such as that from the Institute for Supply
Management. The paper lists 30 processes that can benefit from AI compared to the
technical processes that data engineers are struggling with. The research indicates that AI
can assist engineers with Machine Learning and AI’s vision capability in creating more
accurate product specifications and at the same time helping them in the design-defining
phases. An example of an AI-assisted awareness learning cycle consists of engineers
developing product quality systems with certain quality assurance criteria in mind. An
anomaly detector analyzes usage of the system to find abnormal events and/or patterns
from various quality assurance tests. AI absorbs those data and computes a result table of
test flaws and product quality degradations where they occurred. Hence, AI helps
engineers validate which quality assurance test is most applicable to the product and
builds a risk matrix of malfunctions in products. In addition, AI assists engineers in
assessing the flaw probabilities in their design tests to create more accurate product
designs based on their test outcomes.
AI in Healthcare
Data and information play an important role for decision-making and provision of
healthcare. A tremendous volume of data from patients, doctors, hospitals and healthcare
providers, medical insurance, medical equipment, and medical research could be
consumed and utilized for improving the delivery of healthcare. The objective of
Davenport and Kalakota’s (2019) research is to identify the evidence based on big data
analytics, machine learning and AI in healthcare. The authors use 2,421 articles between
2013 and 2019 to evaluate and analyze prior research on big data analytics and AI within
the healthcare sector to conduct a systematic literature mapping study. Research type
facet, contribution facet, and publication year are used to focus on previous studies’
research dimensions and topic-specific schemes. Different perspectives on research with
big data analytics and AI in healthcare are shown by five facets. A summary of existing
research in the field of big data analytics and AI in healthcare is also provided by this
systematic mapping and review paper. The study discusses barriers to rapid
implementation of AI in healthcare and the potentials in AI offering to automate aspects
of care. The development of AI in healthcare can improve diagnosis and treatment
recommendations as well as transform many aspects of patient care and administrative
process. The authors show that these AI tasks can be accomplished by neural networks
and deep learning, natural language processing, rule-based expert systems, physical
robots, and robotic process automation. The paper also uses Electronic Health Record data
as supporting evidence for the applicability of AI to successful progress in diagnosis and
treatment; patient engagement; adherence applications; administrative applications; and
various implications for the healthcare workforce. The paper overall highlights the
important role of AI in healthcare. At the end, the research recommends that future
software incorporate healthcare systems, biosensors, watches, smartphones,
conversational interfaces and other instrumental data to interconnect with the patient’s
diagnosed data to identify effective treatment pathways. The recommendations then can
be used by healthcare providers, frontline staff such as nurses, call-center agents or care
delivery coordinators.
AI in Customer Communication
Online chatbots create a new and more efficient support platform for customers.
They leverage AI's natural language processing and speech recognition capabilities to
mimic human conversations and provide more realistic customer-support
experiences. Recent technological advances in AI allow chatbots to assist with
increasingly complicated and complex tasks. The objective of Pantano and Pizzi's
(2020) article is to provide a comprehensive understanding of actual progress in AI,
focusing on online chatbots. To provide a good overview of patent development, the
paper uses 668 patents that include the word “chatbot” in the title and/or abstract from
1998 to 2018. Through the analysis of occurrences and the extraction of topics
and phrases by the Cogito software, hierarchical cluster analysis and multidimensional
scaling showed that the adoption of new conversational agents based on natural language
has increased tremendously in recent years. Their findings highlight that chatbot systems are
characterized by a wider range of abilities through the incorporation of AI. The paper
emphasizes the strong connection between the digital assistants' analytical skills and their
ability to automatically interact with users. Lastly, the study draws inferences about
consumers from different data points to automate and improve chatbot abilities and
provide more customized chatbot solutions by using consumer knowledge.
AI in Higher Education
Self-exploration education and the self-determined learning of heutagogical techniques
are examples of AI in higher education. These systems interact with, assist, and guide students
through semi-automatic learning methods via AI's natural language processing. In order to
stay competitive and fulfill their stakeholders' needs, higher-education providers in the
Malaysian school system are forced to adapt to technological innovations in education.
Fazil, et al. (2019) aim to examine the relationship between AI technology and the educational
industry through the examination of self-determined learning platforms. The paper uses
multiple case studies designed to showcase the heutagogical theme for the research. The
research is both quantitative and qualitative; it utilizes expert opinions from consultants,
educational suppliers, and educational providers on the importance of Self-
Determined Learning platforms and leverages massive open online course (MOOC) data
within the educational industry of Malaysia to examine the validity of AI in
education. The framework is applied to determine precisely how AI leverages Machine
Learning and natural language processing to attract and promote the interest of self-
determined learning students. The results depict a positive outlook in support of AI-
enabled technology in education, affecting the value proposition for promoting
education services, with value-centric propositions fostered by continuous
interactions between AI and students using self-exploration education.
AI in Environmental Science
Rainfall prediction, although widely examined in prior studies, is extremely
challenging because of stochastic meteorological parameters such as temperature,
humidity, and wind, in addition to time and space. In order to provide a prompt estimating method,
Prakash, et al. (2020) attempt to introduce a newly-developed model (namely, an Adaptive
Network based Fuzzy Inference System optimized with Particle Swarm Optimization and
Machine Learning, an advanced AI model) as among the most effective
methodologies for predicting daily rainfall. The study leverages 3,653 data samples
collected in Hoa Binh province, Vietnam from January, 2004 to December, 2013. In each
model, rainfall is used as an output parameter, while input parameters include: maximum
temperature, minimum temperature, wind speed, relative humidity and solar radiation.
The research highlights evidence that AI models are validated; it also proves by
correlation coefficient and mean absolute error, skill score, probability of detection,
critical success index, and false alarm ratio that there is a plausible range for forecasting
daily rainfall, even when utilizing a Monte Carlo approach. The results show that Machine
Learning appears to be the best performer. The paper demonstrates the contribution of AI-based
study to the existing literature on rainfall prediction. More broadly, AI can also support
corporate decision-making by helping businesses identify stakeholders' wants and needs.
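The multi-parameter prediction task can be illustrated with a deliberately simple stand-in model (a one-nearest-neighbor lookup on synthetic values — far simpler than the ANFIS model the study actually develops; all records below are invented):

```python
def nearest_neighbor_rainfall(history, query):
    """Predict rainfall for a day by copying the outcome of the most similar
    past day; a toy stand-in for the study's far more sophisticated ANFIS model."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, rainfall = min(history, key=lambda rec: sq_dist(rec[0], query))
    return rainfall

# Synthetic records: ((max_temp, min_temp, wind, humidity, solar), rainfall_mm)
history = [
    ((32.0, 24.0, 2.1, 85.0, 14.0), 12.5),
    ((35.0, 26.0, 1.4, 60.0, 22.0), 0.0),
    ((30.0, 23.0, 3.0, 92.0, 10.0), 28.0),
]
print(nearest_neighbor_rainfall(history, (31.0, 23.5, 2.8, 90.0, 11.0)))  # → 28.0
```

The query day resembles the humid, windy third record, so its rainfall is returned; the fuzzy-inference approach instead blends overlapping rules rather than copying a single neighbor.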
AI in Finance
Zavadskaya (2017) focuses on stock-market prediction to ask whether AI could
offer an investor more accurate forecasting results. The study uses two datasets: monthly
returns of S&P 500 index returns over the period 1968-2016, and daily S&P 500 returns
over the period of 2007-2017. Both datasets undergo a test for univariate and multivariate
with 12 explanatory forecasting variables. The test uses recurrent dynamic Machine
Learning techniques and compares performance with ARIMA and vector autoregression
(VAR) models, using both statistical measures of forecast accuracy (such as mean square
of predicted error [MSPE] and mean absolute predicted error [MAPE]), as well as
economic Success Ratio and Direction prediction measures. Further, given that AI may
produce different results during each iteration, the study also performs a sensitivity
analysis, checking for the robustness of the results given different network configuration,
such as training algorithms and numbers of lags. Even though some networks outperform
certain linear models, the overall result is mixed. ARIMA models appear to be the
best at minimizing forecast errors, while Machine Learning often displays better
accuracy in sign or direction predictions. After the forecast-accuracy MSPE and MAPE
tests were applied, Machine Learning seemed to outperform the respective ARIMA
models in many parameters, but the difference was not statistically significant.
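The forecast-accuracy measures named above can be sketched as follows (a hedged illustration with invented return values; the study's exact definitions may differ):

```python
def mspe(actual, predicted):
    """Mean squared prediction error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute prediction error (absolute-error mean, per the study's acronym)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def success_ratio(actual, predicted):
    """Share of periods where the forecast gets the direction (sign) right."""
    hits = sum(1 for a, p in zip(actual, predicted) if (a >= 0) == (p >= 0))
    return hits / len(actual)

# Hypothetical monthly returns (actual) versus one model's forecasts
actual = [0.012, -0.030, 0.008, 0.021, -0.004]
predicted = [0.010, 0.005, 0.015, -0.002, -0.009]
print(success_ratio(actual, predicted))  # → 0.6 (three of five directions correct)
```

A model can have a small MSPE yet a poor success ratio (or vice versa), which is why the study reports both statistical and directional measures.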
In the end, all models produced in the study achieve roughly 60-65% accuracy
in predicting the direction of stock index movements and changes in the S&P 500 returns. Figure 2 lists
the studies reviewed.
Figure 2
Methodological Literature Used
Authors | Research Area | AI Sub-Type Used
Davenport & Kalakota (2019) | AI in Healthcare | Machine Learning
Pantano & Pizzi (2020) | AI in Customer Communication | Machine Learning and Natural Language Processing
Cioffi, et al. (2019) | AI in Smart Manufacturing | Machine Learning and Robotics
Hockey (2015) | AI in Product Design | Machine Learning and Computer Vision
Fazil, et al. (2019) | AI in Higher Education | Machine Learning and Natural Language Processing
Prakash, et al. (2020) | AI in Environmental Science | Machine Learning (Subset: Adaptive Fuzzy Inference System)
Zavadskaya (2017) | AI in Finance | Machine Learning (Subset: Artificial Neural Networks)
Chapter Two Summary
In summary, previous research shows Machine Learning to be frequently
superior and widely adopted as the backend processing component of AI. Cioffi, et al.
(2019) performed an in-depth literature review of 82 articles from the Web of
Science and SCOPUS databases in the areas of AI; in that research the authors discover
that Machine Learning and Robotics can help improve quality controls in manufacturing
systems. With the utilization of test and manufacturing data from various systems, AI can
help factories optimize their manufacturing processes to be smarter and more efficient.
Hockey's (2015) paper lists the top technical processes data engineers are struggling with, and
through the author's quantitative evaluation it reflects the areas where AI can provide the
most benefit. The research indicates that AI could assist engineers with Machine Learning
and AI's vision capability in creating more accurate product specifications as well as
help them during the architecture and design phases. Davenport and Kalakota (2019)
study how AI could benefit the healthcare industry. Their study draws on
Electronic Health Record data to argue that the applications of AI, and of
Machine Learning in particular, have made tremendous progress in diagnosis and
treatment applications, patient engagement and adherence applications, and
administrative applications. Pantano and Pizzi (2020) emphasize that online chatbots
have created a new and more efficient communication alternative for supporting
customers by using AI's natural language processing and speech capabilities to mimic
human conversations. The objective of Pantano and Pizzi (2020) is to provide a
comprehensive understanding of the actual mechanisms in AI that are behind online
chatbots. To provide a comprehensive study, the paper uses 668 patents that include the
word “chatbot” in the title and/or abstract from 1998 to 2018. Through the analysis of
occurrences, the extraction of topics and phrases by software, and
hierarchical cluster analysis and multidimensional scaling, the research shows that the
adoption of new conversational agents is based on the Machine Learning and Natural
Language Processing components behind AI. Rainfall prediction is extremely challenging
because of stochastic meteorological parameters such as temperature, humidity, wind
besides time and space. In order to provide a prompt estimating method, Prakash, et al. (2020)
attempt to introduce a newly-developed model, namely the Adaptive Network based Fuzzy
Inference System, an advanced model within the Machine Learning branch of AI. The study
shows the model to be highly effective in predicting daily rainfall based on the computation of
3,653 collected data samples. The paper contributes to the existing literature on rainfall
prediction by demonstrating the helpfulness of AI-based study. Zavadskaya (2017) focuses on stock-
market prediction, asking whether AI could offer an investor more accurate forecasting results.
The study uses two datasets: monthly returns of the Standard and Poor (S&P) 500 index
returns over the period 1968-2016, and daily S&P 500 returns over the period of 2007-
2017. Both datasets undergo a test for univariate and multivariate with 12 explanatory
forecasting variables. The test compares dynamic Machine Learning techniques with the
performance of ARIMA and VAR models. The study shows Machine Learning
outperforming the respective ARIMA models in many parameters. Moreover, all models show
results at a 60-65% accuracy rate for stock index direction
predictions and changes in S&P 500 returns. The study suggests AI could drastically
influence business decision-making in the financial sector.
All in all, previous literature suggests that the majority of AI's computational capability
relies on Machine Learning, an algorithmic approach that becomes smarter through the
absorption of data. Machine Learning can learn on its own to produce more accurate
predictive results. The use of AI in customer communication, smart manufacturing,
product design, finance, healthcare, higher education, and environmental science was also
studied and suggests the value of choosing Machine Learning, an ingrained branch of AI
in many industries (Becue, et al., 2021), to find its applicability to cybersecurity.
CHAPTER THREE METHODOLOGY
Introduction to Methodology
In this research, the main goal is to show the accuracy of machine learning in
classifying cybersecurity intrusions and predicting future attacks based on a time-series
dataset. For this, the research compares the effectiveness of intrusion classification
techniques of Machine Learning to a human cybersecurity analyst, and the intrusion
event prediction ability of Machine Learning as compared to a popular ARIMA statistical
model. Note that in the remaining chapters, the terms AI and Machine Learning will be
used interchangeably as Machine Learning is a subset or branch of AI, whereas the
research purely relies on Machine Learning for the technical computation and analysis.
Machine-learning-based approaches rely on identifying anomalies, an approach that can
produce false positive results. So-called analytical solutions are based on rules created by
Machine Learning design experts to detect outlying events that do not match the
established rules (Schuurmans, 1995). Machine Learning is a branch of AI that is closely
related to (and often overlaps with) computational statistics, which focuses on making
analyses through the use of computers. It has strong ties to mathematical optimization, which
delivers methods, theory and application domains to the field of research (Butler &
Kazakov, 2010). Machine learning leverages the use of data mining and exploratory data
analysis, and its techniques have been applied in many areas of science due to
adaptability, scalability, and potential to rapidly adjust to new and unknown challenges
(Palmer, 2017). Machine learning techniques offer potential solutions that can be
employed for resolving such challenging and complex situations due to their ability to
adapt quickly to new and unknown circumstances (Kabbas & Munshi, 2020). Machine
Learning can also be unsupervised and used to learn and establish baseline behavioral
profiles for various entities and then used to find meaningful anomalies. The pioneer of
Machine Learning, Arthur Samuel, defined it as a ‘‘field of study that gives computers
the ability to learn without being explicitly programmed’’ (Dasgupta, et al., 2020, p. 8). It
primarily focuses on classification and regression based on known features previously
learned from training data. It also mimics the human brain’s function to interpret data
inflow and learn from it. Its motivation lies in the establishment of a neural network that
simulates the human brain for analytical learning (Bhatele, et al., 2019).
In cybersecurity, security breaches include external intrusions and internal
intrusions. There are three main types of network analysis for threat detection: misuse-
based (also known as signature-based), anomaly-based, and hybrid. Misuse-based
detection techniques aim to detect known attacks by using the signatures of these attacks.
They are used for known types of attacks without generating a large number of false
alarms. However, administrators often must manually update the database rules and
signatures. Anomaly-based techniques study the normal network and system behavior
and identify anomalies as deviations from normal behavior. New (zero-day) attacks
cannot be detected by signature- or misuse-based algorithms; detecting them requires
anomaly-based or hybrid techniques. Zero-day attacks are appealing to attackers because
of their capacity to go undetected in their early
stages (Goyal & Sharma, 2019). The data on which anomaly-based techniques alert (novel
attacks) can be used to define the signatures for misuse detectors. The main disadvantage
of anomaly-based techniques is the potential for high false alarm rates because previously
unseen system behaviors can be categorized as anomalies (Cylance, 2020). Hybrid
detection combines misuse and anomaly detection. It is used to increase the detection rate
of known intrusions and to reduce the false positive rate of unknown attacks. Hence, with
Machine Learning’s capability to utilize both potentials of detecting threat events by both
signature-based and anomaly-based techniques, it can discover a wide range of issues,
such as: malware attack, ransomware, denial of service (DoS), phishing or social
engineering, Structured Query Language (SQL) injection attack, man-in-the-middle,
vulnerability discovery, deception, or insider threats (Devakunchari & Sourabh, 2019).
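The complementary misuse/anomaly logic described above can be sketched as follows (a hypothetical, drastically simplified illustration — the signatures, baseline figures, and threshold are invented, and real hybrid detectors are far more elaborate):

```python
# Hypothetical signature set: byte patterns mapped to known attack labels
SIGNATURES = {
    b"' OR 1=1": "sql-injection",
    b"<script>": "cross-site-scripting",
}

def misuse_check(payload):
    """Signature-based pass: matches known patterns, so false alarms are rare."""
    for pattern, label in SIGNATURES.items():
        if pattern in payload:
            return label
    return None

def anomaly_check(req_per_min, baseline_mean, baseline_std, k=3.0):
    """Anomaly-based pass: flag rates beyond k standard deviations from a
    learned baseline; may catch zero-days but risks false alarms."""
    return abs(req_per_min - baseline_mean) > k * baseline_std

def hybrid_detect(payload, req_per_min, baseline=(120.0, 15.0)):
    label = misuse_check(payload)
    if label:
        return ("known-attack", label)
    if anomaly_check(req_per_min, *baseline):
        return ("anomaly", "possible zero-day")
    return ("benign", None)

print(hybrid_detect(b"GET /?q=' OR 1=1", 110))  # → ('known-attack', 'sql-injection')
print(hybrid_detect(b"GET /index.html", 900))   # → ('anomaly', 'possible zero-day')
```

The two passes cover each other's weaknesses: signatures keep the false-alarm rate low for known attacks, while the anomaly threshold offers a chance at catching traffic no signature yet describes.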
Purpose of Study
This study aims to investigate the application of AI in cybersecurity and to research its
effectiveness at the present stage of AI technology, asking in particular
how Machine Learning can help improve cybersecurity.
Research Questions
The research paper starts at the highest level with the general question of: What
branch of AI is most applicable to cybersecurity? From the main question, it asks the sub-
questions: Is Machine Learning accurate enough at the present time to be beneficial to
cybersecurity? What is the accuracy rate of AI in classifying intrusion events versus a
human cybersecurity analyst? When AI is used to predict future intrusion events, what is
the accuracy rate when compared to a time-series prediction statistical ARIMA model?
Research Design
The research utilizes a honeypot hosted on Microsoft Azure Cloud (Figure 3); it
has a weak internal IP address of 192.168.1.1, widely known as the out-of-the-box default
IP address for networking devices. The public IP, provided by MS Azure, is
static and resolves to the domain Goooogle.com, purchased
specifically for this project. “Goooogle.com” was chosen because it attracts global
internet users who visit the site through the common typo of adding an extra “o”
to the popular search engine site “Google.com.” The site also receives heavy net traffic
from botnets and cybersecurity hackers because the workstation and server
connected within the network actively visit malicious websites to
download malware content and click on spiteful phishing emails (see Figure 3).
Figure 3
Honeypot Network Architecture Diagram
The site was live in production for 274 calendar days (nine months). Logs and
netflow data captured from the devices within the honeypot are unambiguous
representations of instants in time; they contain timestamps that bind uniquely to each
event and are considered time-oriented data. The dataset contains a combination of
110,516 total traffic flows, with 42,871 benign traffic flows and 67,645 malicious traffic
flows reflected in 10 categories of cybersecurity attacks. Raw security data are
used to analyze the various patterns of security incidents or malicious behavior, to build a
data-driven security model and achieve the study's objective. The raw security data
(netflows and logs) are captured at the firewall level in the instance hosted on Azure and
then loaded into Corelight, which converts the unstructured data (netflows and logs) into
structured data so that Elastic can interpret and sort the data into their respective schema
and data structures. The data are then analyzed and loaded into an Azure SQL Database, where
the Azure Artificial Intelligence Web Service picks up the data and develops a
classification model for differentiating the relationships among the various intrusion
event types (refer to Figure 4 for visualization).
Figure 4
Data Collection and Analysis Workflow Diagram
Table 1 shows the result attacks classified in 10 attributes with training samples
and testing samples used. From the partitioned datasets 80% is used to train the Machine
Learning algorithm within Azure Artificial Intelligence Web Service and the remaining
20% is used for testing as depicted in Table 1. Existing datasets were not leveraged because
they suffer from the defects of old data; the latest public AI testing data on
cybersecurity has not been released since 2012. Those data are susceptible to outdated and
unbalanced information because legacy technologies and attack vectors have changed by the
present day. Legacy data are prone to aggregation across information technology
components such as mainframe computers, software or programs, and communication
equipment that are clustered together. There is also a problem of insufficient data volume
and raw data integrity for building the AI (Machine Learning) engine. Therefore, establishing
network intrusion detection datasets with large amounts of data, wide type coverage, and
balanced sample numbers of attack categories for analysis of intrusion detection was a
top priority in this research.
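The 80/20 partitioning can be sketched as a per-category (stratified) split so that both sets preserve the attack-category balance of Table 1 (an illustrative reconstruction using a miniature synthetic dataset, not the study's actual pipeline):

```python
import random

def stratified_split(records_by_category, train_fraction=0.8, seed=42):
    """Split each category 80/20 so training and testing sets preserve
    the category balance, mirroring the partitioning in Table 1."""
    rng = random.Random(seed)
    train, test = [], []
    for category, records in records_by_category.items():
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train += [(category, r) for r in shuffled[:cut]]
        test += [(category, r) for r in shuffled[cut:]]
    return train, test

# Miniature synthetic dataset standing in for the 110,516 captured flows
flows = {
    "benign": ["flow-%d" % i for i in range(100)],
    "port-scanning": ["scan-%d" % i for i in range(30)],
}
train, test = stratified_split(flows)
print(len(train), len(test))  # → 104 26
```

Splitting within each category, rather than over the whole pool, prevents a rare category such as Shellcode Execution from landing almost entirely in one partition by chance.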
Data Collection and Data Analysis Procedures
To expand on Figure 4, the processes within the Data Collection and Analysis
Workflow are described in Figure 5 in the order below: Step 1) From the Azure Firewall,
the management console's netflow and log data are forwarded to a centralized log
management console, Corelight, acting as a syslog server. Syslogs are sent via
dedicated port 5985 and netflows on port 2055, both with SSL encryption.
Figure 5
Configuration of Log and Netflow Forwarding
Note: Screenshot taken from Azure Firewall console
Figure 6 displays a raw log (unstructured data format) that requires Corelight to
parse the data.
Figure 6
Example of a Firewall Log
Note: Screenshot taken from Azure Log Analytics console
Figure 7 provides an example of raw netflow data.
Figure 7
Example of Raw Netflow Data
Note: Screenshot taken from Wireshark console
Step 2) Corelight parses the given data, identifying the key words based on
various events within the netflow and log data. It also converts the unstructured data into
structured data formats, labeling them with the correct header information for each data
column (Figure 8), which then get loaded onto Elastic for further analysis.
Figure 8
Display of Log Highlights when Unstructured in Corelight
Note: Screenshot taken from Corelight console
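The unstructured-to-structured conversion in Step 2 can be illustrated with a minimal parser (the log layout and field names below are hypothetical and far simpler than Corelight's actual formats):

```python
import re

# Hypothetical firewall log line; real Corelight parsing handles many formats
RAW_LOG = "2021-06-14T08:31:02Z DENY TCP 203.0.113.7:51234 -> 10.0.0.5:443 len=60"

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<action>\w+)\s+(?P<proto>\w+)\s+"
    r"(?P<src_ip>[\d.]+):(?P<src_port>\d+)\s+->\s+"
    r"(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)\s+len=(?P<length>\d+)"
)

def parse_log(line):
    """Convert one unstructured log line into a structured record with
    named columns, ready to be loaded into an analytics store."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    record = m.groupdict()
    for key in ("src_port", "dst_port", "length"):
        record[key] = int(record[key])
    return record

rec = parse_log(RAW_LOG)
print(rec["action"], rec["src_ip"], rec["dst_port"])  # → DENY 203.0.113.7 443
```

Once every line carries the same named columns, a downstream store such as Elastic can index, sort, and correlate the events.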
Step 3) Once the structured data are loaded onto Elastic, the system can read the
columns with their respective header information based on: time, TCP/UDP description,
error code, event type, hostname, etc. (Figure 9). The data are then mapped to an open-
source library that can correlate them to specific malicious events using Cisco Talos
Threat Intelligence Rules.
Figure 9
Display of Log Parsing when Structured in Elastic
Note: Screenshot taken from Elastic console
Target Population, Sampling Method, and Related Procedures
The dataset contains roughly 110,516 total traffic flows, with 42,871 benign traffic
flows and 67,645 malicious traffic flows that reflect 10 categories of cybersecurity
attacks. As mentioned above, 80% of the dataset is used to train the Machine Learning
engine. Once the data are loaded into the Elastic software, that 80% of the dataset is
mapped to an open-source threat-intelligence library that can correlate the data points to
specific threat events based on Cisco Talos Threat Intelligence Rules. In this process,
Elastic adds another column to tag that event to a specific threat labeled by Talos. Once
the dataset is ready, it is fed to the Machine Learning engine hosted on Azure Artificial
Intelligence Web Service which inspects all the data. This process took around 16 hours
for the engine to sort through and analyze all the data to build Machine Learning
algorithms for detecting cybersecurity threats. The remaining 20% of the data, unknown
to the Machine Learning engine, are used for the testing purposes of
this study. The dataset and its respective threat categories are depicted in Table 1.
Table 1
Ten Intrusion Categories with Depiction of Training and Testing Samples
Category | Training Samples (N) | Testing Samples (N) | Total (N)
Benign Traffic 34,296 8,575 42,871
Cross-Site Scripting 7,014 1,754 8,768
SQL Injection 6,075 1,519 7,594
Email Spam 9,254 2,313 11,567
Password Brute-Force 4,434 1,109 5,543
Port Scanning 10,288 2,572 12,860
Registry Takeover 2,968 743 3,711
Denial-of-Service (DoS) 6,138 1,535 7,673
Shellcode Execution 1,671 418 2,089
Malware Exploit 6,992 1,748 8,740
Note: Table 1 shows result attacks classified in 10 attributes with training samples and testing samples
used. From the partitioned datasets 80% is used to train the Machine Learning algorithm within Azure
Artificial Intelligence Web Service and the remaining 20% is used for testing.
Instrumentation
Parrend, et al. (2018) quote Bernard Marr’s definitions for AI as: “Artificial
Intelligence is the broader concept of machines being able to carry out tasks in a way that
we would consider ‘smart’” (p. 85) and, “Machine Learning is a current application of AI
based around the idea that we should really just be able to give machines access to data
and let them learn for themselves” (p. 85). Therefore, the Machine Learning engine in
this project is a self-learning model that can learn and solve problems, especially in
environments where algorithms or rules must evolve in order to solve dynamic problems.
Machine Learning achieves this by learning from and classifying past network activities
to detect and predict attacks as they transpire (Banoth, et al., 2017). As mentioned,
patterns that describe normal and
abnormal network activities are traditionally defined manually by security professionals
based on their expert knowledge while Machine Learning can be trained to identify such
patterns automatically. AI improves its knowledge to understand cybersecurity threats by
consuming a large number of data artifacts (Mittal, et al., 2019). Therefore, in order to
build the AI Machine Learning engine, data needs to be fed to the Machine Learning
model hosted on Azure Artificial Intelligence Web Service where it can process, analyze,
and match events to rules in order to build its intrusion detection and prediction
algorithms.
The Machine Learning model for this project has two main principles: (a) a
signature-based detection approach, which identifies malicious activities by pre-defined
patterns of abnormal network and/or system behavior; and (b) an anomaly-detection
approach, which evaluates deviations from normal network and/or system behavior.
Machine Learning needs to be fed a large
knowledge base, which stores expert knowledge, and an inference engine, which is used
for reasoning about predefined knowledge as well as finding answers to given problems.
Depending on the form of reasoning, Machine Learning will apply to different problem
classes (Hossein, et al., 2020). A case-based reasoning approach allows solving problems
by recalling previous similar cases, assuming the solution of a past case can be adapted
and applied to a new problem case. Subsequently, newly proposed solutions are evaluated
and, if necessary, revised, thus leading to continual improvements of accuracy and ability
to learn new problems over time. In addition, rule-based reasoning solves problems using
rules defined by experts. Rules consist of two parts: a condition and an action. Problems
are analyzed stepwise: first, the condition is evaluated and then the action that should be
taken next is determined. It is crucial to recall that expert systems so far solely assist
decision makers (Al Qahtani, 2020). Ultimately, Machine Learning can define both
patterns, mainly based on their experiences plus their prior knowledge of cyber threats.
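Rule-based reasoning of the condition/action kind described above can be sketched directly: each rule pairs a condition (a predicate over an event) with an action (here, a threat label). The rules, field names, and thresholds below are invented for illustration; they are not actual Cisco Talos rules.

```python
# Each rule pairs a condition with an action. These are illustrative only.
RULES = [
    (lambda e: e.get("failed_logins", 0) > 100, "Password Brute-Force"),
    (lambda e: e.get("distinct_ports", 0) > 500, "Port Scanning"),
    (lambda e: "<script>" in e.get("payload", ""), "Cross-Site Scripting"),
]

def classify(event):
    """Evaluate conditions in order; the first matching rule decides the action."""
    for condition, label in RULES:
        if condition(event):
            return label
    return "Benign Traffic"

print(classify({"failed_logins": 250}))           # Password Brute-Force
print(classify({"payload": "GET /?q=<script>"}))  # Cross-Site Scripting
print(classify({}))                               # Benign Traffic
```

A real rule engine would also attach severities and revise rules over time (the case-based loop described above); this sketch shows only the stepwise condition-then-action evaluation.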
The cybersecurity analyst who volunteered in the study, to compare AI’s intrusion
event classification ability against a human analyst’s, has five years of experience. The
analyst holds a Bachelor’s degree in Computer Science from a university in California,
with expertise in software development and networking. He also
possesses two cybersecurity certifications of the Computing Technology Industry
Association Security+ and Certified Information Systems Security Professional.
Previously, the analyst worked for two U.S.-based Fortune 500 companies as part of
their Security Operations Center teams, focusing daily on detecting a wide range of
malware attacks, denial of service, phishing emails, web application attacks, man-in-
the-middle attacks, vulnerability discovery, and masquerade or deception insider
threats.
An ARIMA model was used in this study to compare its ability to forecast future
trends to AI. ARIMA is a statistical analysis model that uses time series data to forecast
or predict future outcomes based on a historical time series. The model explains a given
time series based on its own past values, that is, of its own lags and the lagged forecast
errors, so that equation can be used to forecast future values (Prakash, et al., 2020). It is a
popular and widely used statistical method for time series prediction. For example, an
ARIMA model might seek to predict a stock’s future prices based on its past performance
or forecast a company’s earnings based on past periods; such models are widely used in
technical analysis to forecast future security prices. ARIMA is based on the statistical
concept of serial correlation, where past data points influence future data points
(Zavadskaya, 2017).
ARIMA forecasting is achieved by plugging in time series data for the variable of
interest. Here, the time series data were processed in Python, with an ARIMA
computation model pre-configured through integration with pandas, an open-source
Python data analysis library. The software then identifies the
appropriate number of lags or amount of differencing that can be applied to the data, and
then outputs a computed data table with multiple linear regression values.
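The lag-based forecasting idea can be sketched in plain Python. The following is a simplified, stdlib-only illustration of the autoregressive core of ARIMA (an AR(1) fit by least squares, i.e. ARIMA(1,0,0) on already-stationary data), not the statsmodels/pandas pipeline the study used; the data and names are invented.

```python
import random

def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_{t-1}: the AR core of ARIMA."""
    x, y = series[:-1], series[1:]          # lagged values and current values
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    c = my - phi * mx
    return c, phi

def forecast(c, phi, last, steps):
    """Iterate the fitted equation forward for multi-step forecasts."""
    out = []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

rng = random.Random(0)
y = [0.0]
for _ in range(2000):                        # simulate y_t = 0.6*y_{t-1} + noise
    y.append(0.6 * y[-1] + rng.gauss(0, 1))

c, phi = fit_ar1(y)
print(round(phi, 2))                         # recovered coefficient, near the true 0.6
```

A full ARIMA(p, d, q) additionally differences the series d times and models q lagged forecast errors; statsmodels handles all of that automatically.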
Limitations of the Research Design
Data constitutes the essential foundation of cybersecurity network research.
Hence, due to the lack of disaggregated and up-to-date data, it was necessary to collect
fresh network data from a honeypot, parse the data, and structure the data in a format the
Machine Learning engine could learn and understand in order to build its threat
detection algorithm. A well-known limitation of building a Machine Learning engine on
static detection rules is the need for frequent and continuous updates (e.g., daily updates
of malware definitions). In this case, the research could only retrieve the Cisco Talos
Threat Intelligence Rules available at the time the research was conducted in order to
map the threat categories of the 80% of the data used for
training of the Machine Learning engine. Furthermore, the study design was limited to
relatively low-end technology such as Corelight, Elastic, Python, STATA statistical
software, and the Azure Artificial Intelligence Web Service; thus, it was not possible to
implement a best-of-breed AI system. It is important to note that the results of this
research could have differed if more sophisticated technology had been introduced.
Data Validity Test
A supervised learning algorithm analyzes existing training data with labeled results
to map to new entries. Unsupervised learning is a machine-learning algorithm that
deduces the description of hidden structures from unlabeled data (Soni & Bhushan,
2019). This study leveraged semi-supervised learning, which combines supervised
learning with unsupervised learning. Semi-supervised learning uses a small amount of
labeled data together with a large amount of unlabeled data for pattern recognition.
Using semi-supervised learning can reduce labeling effort while achieving high accuracy.
However, as suggested by Selden (2016), the quality of each classifier must be measured
for accuracy through common performance assessment metrics, namely Precision, Recall,
F-score, as computed below:
Formula 1. Precision, Recall, and F1-Score Calculation Model

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
TP denotes true positives, FP false positives, and FN false negatives. For
coherency, the research considers a true positive to be a correct detection of a malicious
sample, while a false negative is a malicious sample that goes undetected. Precision
indicates how much a given approach is likely to provide a correct result. Recall is used
to measure the detection rate. The F-score combines Precision and Recall into a single
value to capture the true effectiveness of a classifier. Finally, to reduce the possibility of
biased results, each evaluation metric is computed after performing 10-fold cross-
validation. Higher precision and recall scores are better, but the two metrics are in some
cases contradictory and can only be balanced against each other; the F-score therefore
serves as the harmonic mean of precision and recall. The intuition behind the F-score is
that it captures the balance of good precision and good recall in a single measure.
Thus, in general, the higher the F-score, the better the model will perform. Selden (2016)
indicates that according to the International Statistical Institute, an F1 score of 1 is
considered perfect, while a score of 0 marks a total failure. Table 2 shows an average
F-score of 0.89 across the 10 classifiers (rounded to the nearest hundredth), depicting
confidence that the semi-supervised learning algorithm has an 89% learning accuracy
rate, a positive-looking score.
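The metric computations in Formula 1 can be expressed directly in Python. The counts below are hypothetical, chosen only to illustrate the arithmetic; they are not figures from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1 from raw classification counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for one classifier (not figures from the study)
p, r, f = precision_recall_f1(tp=90, fp=10, fn=20)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.9 0.82 0.86
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two scores, which is exactly why it penalizes a classifier that trades all of one metric for the other.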
Table 2
Results of Precision, Recall, and F1-Score for Classifiers
Classifier F-Score Precision Recall
Benign Traffic 0.83 0.87 0.80
Cross-Site Scripting 0.93 0.89 0.97
SQL Injection 0.89 0.92 0.87
Email Spam 0.93 0.93 0.94
Password Brute-Force 0.88 0.91 0.85
Port Scanning 0.89 0.88 0.90
Registry Takeover 0.83 0.86 0.81
Denial-of-Service (DoS) 0.89 0.85 0.94
Shellcode Execution 0.86 0.85 0.88
Malware Exploit 0.92 0.90 0.96
In the field of data analysis, time series data are special due to their sequential and
partly random nature. Stationarity is a desired property in time series analysis, as it has
a large influence on how the data are perceived and predicted when processed through a
statistical model. Two main factors cause time series data to become non-stationary: (a)
a trend, a long-term tendency extending across the timeline of a dataset; and (b)
seasonality, a long-term recurring pattern at a fixed and known frequency, whether a
time of the year, week, or day (Yavanoglu & Aydos, 2017). Thus, before any further
work is done to build
out predictive models, it is necessary to determine whether the data collected are
stationary. An Augmented Dickey-Fuller (ADF) test is the most widely used technique
to confirm whether a series is stationary. In addition, a Kwiatkowski–Phillips–Schmidt–
Shin (KPSS) test can confirm the validity of the results. The KPSS
test is similar to ADF; it evaluates the null hypothesis that an observable time series is
stationary around a deterministic trend. Both tests were run in Python using the adfuller
and kpss functions from the statsmodels library. Table 3 depicts the resulting values. To
interpret the scores, Ogunc and Hill (2008) suggest the ADF statistic should be negative,
where the more negative it is, the stronger the rejection of the hypothesis that a unit root
is present. The KPSS p-value needs to be greater than the 5% significance level, in
which case the null hypothesis of stationarity cannot be rejected. Both tests performed
fall within the tolerated intervals: both admit the stationarity of the data or reject the
hypothesis of non-stationarity.
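As a rough illustration of what the ADF test computes, the following stdlib-only sketch implements the basic (non-augmented) Dickey-Fuller regression on synthetic data. It is a teaching sketch under simplifying assumptions (no augmentation lags, fabricated series), not a substitute for statsmodels' adfuller.

```python
import math
import random

def dickey_fuller_t(series):
    """t-statistic of beta in: diff(y_t) = alpha + beta * y_{t-1} + e_t.
    A strongly negative t (roughly below -2.86 at the 5% level, constant-only
    case) rejects the unit-root hypothesis, i.e. supports stationarity.
    Simplified: no augmentation lags, unlike the full ADF test."""
    x = series[:-1]                                       # lagged level y_{t-1}
    d = [b - a for a, b in zip(series[:-1], series[1:])]  # first differences
    n = len(x)
    mx, md = sum(x) / n, sum(d) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - md) for a, b in zip(x, d)) / sxx
    alpha = md - beta * mx
    ssr = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, d))
    se = math.sqrt(ssr / (n - 2) / sxx)                   # standard error of beta
    return beta / se

rng = random.Random(1)
stationary = [0.0]
for _ in range(500):                  # mean-reverting AR(1): stationary
    stationary.append(0.5 * stationary[-1] + rng.gauss(0, 1))
walk = [0.0]
for _ in range(500):                  # random walk: unit root, non-stationary
    walk.append(walk[-1] + rng.gauss(0, 1))

print(round(dickey_fuller_t(stationary), 1))  # strongly negative: unit root rejected
print(round(dickey_fuller_t(walk), 1))        # much closer to zero for the walk
```

The full ADF test adds lagged difference terms to absorb serial correlation and uses Dickey-Fuller critical values rather than the usual t-distribution.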
Table 3
Results of P-Values Stationary Test
Test Type                                         p values
Augmented Dickey Fuller (ADF) test                -0.08
Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test      0.10
Expected Findings
Based on the data collected and the rigorous tests performed on the time series
data to ensure the data were stationary, it was assumed that these stationary data could be
used in time-series forecasting models through both AI and ARIMA. It was expected that
80% of the data could be mapped to the threat-intelligence source (Cisco Talos) that
could identify and inject a label to the data with a relevant threat category which then
could be fed to train the Machine Learning engine to build its threat detection algorithm.
Since the remaining 20% of the data are unlabeled and unknown to the Machine Learning
engine, these data can be used to test the threat classification ability of AI compared to a
human cybersecurity analyst.
Ethical Issues
The protection of human subjects through the application of appropriate ethical
principles is imperative in all research. This concern does not arise here, as the study
was performed not on human beings but on machine data. The research used purely
machine-produced data; it did not constitute a study of human beings, nor did it involve
unethical practices harmful to human beings, animals, plants, or the natural environment.
Conflict of Interest Assessment
Awareness of potential conflicts of interest is very important in research to
maintain the integrity of an unbiased professional view in a research publication. There
were no conflicts of interest regarding the publication of this paper, nor any affiliations
with or involvement in any organization or entity with any financial interest (such as
honoraria; educational grants; membership, stock ownership, or other equity interests;
and expert testimony or patent-licensing arrangements), or non-financial interest (such as
personal or professional relationships, affiliations, marriages or beliefs) in all materials or
technologies discussed in this manuscript.
Chapter Three Summary
Chapter Three described how the data were collected and analyzed. The research
project utilized a honeypot hosted on Microsoft Azure Cloud that was set up with a weak
IP address and resolved back to a simulated domain similar to “Google.com.” The site
allured botnets and cybersecurity hackers, as the workstations connected within the
honeypot network actively visited malicious websites to download malware content and
also clicked on malicious phishing emails. The site was hosted live in production for a duration
of 274 calendar days (9 months). Logs and netflow data from the devices within the
honeypot are captured for data analysis. In their raw forms, the logs and netflows are still
unstructured with various critical values that need to be parsed by a solution called
Corelight. Once parsed and re-formatted to structured data with the proper data points
aligned to their corresponding columns, the data are then fed to Elastic where 80% of the
data is mapped to a threat-intelligence source (Cisco Talos). Cisco Talos interprets the
data to identify and label the data to their respective threat categories. These data can then
be consumed by the Machine Learning engine in Azure Artificial Intelligence Web
Service to analyze and build out the threat detection algorithms. From there, the Machine
Learning engine’s ability to classify cybersecurity intrusions was compared to that of a
human cybersecurity analyst using the remaining 20% of the data, which are
unlabeled and unknown to the Machine Learning engine. Afterwards, the Machine
Learning engine also faces another test designed to compare its threat-prediction ability
to a popular ARIMA statistical model.
CHAPTER FOUR. DATA ANALYSIS AND RESULTS
Introduction to Data Analysis and Results
This research paper has two goals: first, to find out the accuracy of Machine
Learning in classifying cybersecurity intrusions compared to a cybersecurity analyst,
and second, to compare its ability to predict future attacks based on a time-series dataset
when tested against a popular ARIMA statistical model. In this chapter, the two tasks
are split into two sections explaining the data analysis and methods, making the context
easier to follow.
AI vs. Human Analysis in Classification of Threat Events
As depicted in earlier chapters, this project’s first task is to compare Machine
Learning’s ability to classify cybersecurity intrusions against that of a human
cybersecurity analyst. As mentioned, a total of 110,516 traffic logs and netflow data were
captured from the honeypot in the data-collection phase. From the data 80% is used to
train and build out the threat detection algorithm for the Machine Learning engine, while
the remaining 20% of the data are unknown to the Machine Learning engine and are used
for testing purposes. Hence, the remaining 20% of the data (equivalent to 22,286
intrusion events) was selected for testing the Cybersecurity Analyst’s (denoted as
CAnalyst) intrusion classification skills against the Machine Learning engine. Figure 10
displays the outcome of 200 intrusion classification results from AI and the CAnalyst
selected at random; yellow shows intrusion events detected by AI while blue shows
intrusion events detected by the CAnalyst. The x-axis is the time it takes for the two
entities to detect the threat while the y-axis displays the threat severity level. Figure 10 below is
graphed by STATA 14 to give the reader a visualization of the time duration of how long
it takes for the CAnalyst to detect the threat events compared to AI. The graph would get
saturated and fuzzy if 22,286 intrusion event results are graphed; therefore, 200 events
were chosen at random to represent on the graph. The y-axis of the regression graph is
unable to accept too many string variables, so the 10 categories of intrusion types were
converted into numeric severity levels ranked from 1-10.
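The string-to-severity conversion described above might look like the following sketch. The exact 1-10 ordering used in the study is not given, so this mapping is hypothetical.

```python
# Hypothetical severity encoding; the study's exact 1-10 mapping is not given.
SEVERITY = {
    "Benign Traffic": 1, "Email Spam": 2, "Port Scanning": 3,
    "Cross-Site Scripting": 4, "SQL Injection": 5, "Password Brute-Force": 6,
    "Denial-of-Service (DoS)": 7, "Registry Takeover": 8,
    "Shellcode Execution": 9, "Malware Exploit": 10,
}

def encode(events):
    """Replace string intrusion labels with numeric severity ranks."""
    return [SEVERITY[e] for e in events]

print(encode(["Benign Traffic", "SQL Injection", "Malware Exploit"]))  # [1, 5, 10]
```

Encoding categories as ordered integers is what lets the regression graph place them on a numeric y-axis.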
Figure 10
AI vs. Cybersecurity Analyst Intrusion Detection Regression Graph
Note: Yellow colors show intrusion events detected by AI while blue colors show intrusion events detected
by CAnalyst. The x-axis shows the time it takes for the two entities to detect the threat while the y-axis
displays the threat severity-level. On the y axis, the 10 categories of intrusion types were converted into
numeric severity levels ranked from 1-10.
In order to obtain unbiased results, Formula 2 was introduced, in which the variables
underwent an Ordinary Least Squares regression against their respective intrusion
detection times. The longer AI and the CAnalyst take to detect an intrusion, the higher
the coefficient (delineating the intrusion category) is ranked.
Formula 2. Ordinary Least Squares Regression for Severity Score

SeverityScore = α(IntrusionCategory / Time_AI) + β(IntrusionCategory / Time_CAnalyst) + ε

Formula 2 shows the Ordinary Least Squares formula used to calculate the regression
model, where SeverityScore equals alpha times the change in IntrusionCategory over the
amount of time for AI to classify the threat, plus beta times IntrusionCategory over the
amount of time for the CAnalyst to classify the threat. Table 4 displays results from STATA 14.
Table 4
Ordinary Least Squares Regression AI vs. Cybersecurity Analyst Results
Variables (N = 10)    AI b* (Robust SE)    CAnalyst b* (Robust SE)
Time 0.312 0.486
(0.430) (0.569)
BenignTraffic 0.001 0.002
(0.070) (0.069)
CrossSiteScripting 0.037 0.029
(0.0167) (0.0946)
SQLInjection 0.043 0.036
(0.0375) (0.122)
EmailSpam 0.086 0.091
(122.6) (0.0765)
PWBrute-Force 0.056 0.047
(0.0701) (0.069)
PortScanning 0.073 0.069
(0.0167) (0.0946)
RegistryTakeover 0.103 0.087
(0.0375) (0.122)
Denial-of-Service 0.097 0.086
(122.6) (0.0765)
ShellcodeExecution 0.052 0.045
(0.0946) (0.0788)
MalwareExploit 0.122 0.116
(0.205) (0.0913)
Constant 7.937 6.562
(5,274) (3,731)
p-value .01*** .01***
Observations 2,228 2,228
F-statistic 11.81 12.46
R-squared .61 .47
Note: Robust standard errors in parentheses. *** p < .01, ** p < .05, * p < .10

The values in Table 5 are the data output extracted from Azure Artificial Intelligence Web Service after the AI classified the intrusion events into their categories. The times the CAnalyst took to analyze and classify the intrusion events were recorded manually. The average times AI and the CAnalyst take to classify the intrusion events are captured and rounded to the nearest hundredth. Once visualized, it is clear that AI detects and classifies the intrusion events into their confined categories much faster than the CAnalyst.

Table 5
Number of Intrusion Events Detected and Average Time of AI vs. Cybersecurity Analyst

Category                  AI Detected Intrusions (N)   Average Time for Detection   Analyst Detected Intrusions (N)   Average Time for Detection
Benign Traffic            858                          0.05                         824                               1.80
Cross-Site Scripting      176                          7.20                         165                               10.50
SQL Injection             152                          5.10                         140                               15.30
Email Phishing            231                          3.80                         230                               10.10
Brute-Force Attack        110                          8.60                         110                               45.20
Port Scanning             257                          10.20                        240                               30.40
Registry Takeover         74                           27.40                        65                                56.30
Denial-of-Service (DoS)   153                          5.70                         153                               37.50
Shellcode Execution       42                           18.30                        42                                32.70
Malware Exploit           175                          10.90                        167                               15.20

Detailed Analysis (AI vs. Human Analysis in Classification of Threat Events)
Although many cybersecurity analysts possess great depth of cybersecurity knowledge, the cybersecurity analyst in this case, who had five years’ experience of threat detection and response in a Tier 2 Security Operations Center, could at times be slower in threat detection than AI. AI can parse through, analyze, and correlate log events in the thousands faster and more efficiently. Thus, when it comes to reducing errors in operational tasks and finding anomalies in threat classification, the result supports the hypothesis that AI is ahead of human ability and competence.
AI is instrumental in establishing baselines and can detect anomalies and outlier events more quickly and over a wider range than humans can (Virmani, et al., 2020). As a cybersecurity solution, AI can help protect organizations from Internet threats, identify types of malware, ensure practical security standards, and help create better prevention and recovery strategies, as it can correlate relationships between threats such as malicious network traffic, suspicious IP addresses, or files in seconds or minutes (Apruzzese, et al., 2018).

AI vs. ARIMA Statistical Computation in Prediction of Threat Events
The second goal of this paper is to compare AI’s threat forecasting ability to the popular ARIMA statistical model on a time-series dataset. To give a visualization of the result, Figure 11 displays 10% of the 22,286 intrusion events selected for testing AI’s forecasting ability against the ARIMA model, which is 2,228 dots on the chart below. The red dots are the intrusion events predicted by AI, while the blue dots show the intrusion events predicted by the ARIMA model. The predictive data values from AI and ARIMA are entered into STATA 14 and run with the regression commands XTREG and REG with the cluster options. The regressions are performed twice to display two regression lines for accuracy (Figure 11). When results display concave and convex humps, one can ask which threat events may be the underlying causes. Sokol and Gajdos (2018) propose that it does not make sense for one particular threat category to cause a significant impact on the trend of a single month; therefore, the 10 categories needed to undergo a Spearman’s Rank Correlation test; their intra-relationships helped consolidate the data into the categories of Network Detected, Endpoint Detected, or Email Detected threat agents.

Figure 11
AI vs. ARIMA Intrusion Prediction Regression Graph

Note: Figure 11 displays 10% of the 22,286 intrusion events selected for testing AI’s forecasting ability against the ARIMA model. The red dots are the intrusion events predicted by AI, while the blue dots show the intrusion events predicted by the ARIMA model.

A Spearman’s Rank Correlation test was performed to see the correlation among intrusion types, which were then consolidated into three categories: Endpoint Detection (all intrusion threats detected at the endpoint level), Email Detection (all intrusion threats detected as spam emails), and Network Detection (intrusion threats detected at the network layer). It is important to check for correlations between variables before running the models in order to identify important explanatory variables and check for possible multicollinearity. Hence, based on the data, the values fall within a good percentage of their confidence intervals.

Formula 3. Spearman’s Rank Correlation Estimation Model

ρ = 1 − α(IntrusionCategory_AI / μTime_AI) + β(IntrusionCategory_ARIMA / μTime_ARIMA)

The formula shown above is the Spearman’s Rank Correlation Estimation formula used to calculate the correlation ratios of the variables within the model. In the model, ρ represents the resulting ratio: 1 minus alpha times the average of IntrusionCategory_AI over μ, the mean time in which the AI model is able to forecast the threat events, plus beta times IntrusionCategory_ARIMA over μ, the mean time in which the ARIMA model is able to forecast the threat events. From there the values are input to STATA 14; Table 6 shows the ρ ratios for the correlations between the two variables.
Table 6
Spearman’s Rank Correlation Estimation Results

Category                 BT    CSS   SI    EP    BFA   PS    RT    DoS   SE    ME
Benign Traffic           .08   .02   .01   -.07  .01   .02   .01   .02   -.01  .03
Cross-Site Scripting     -.01  .03   .08   -.02  .02   .01   .03   .01   .02   .02
SQL Injection            .04   .07   .02   .02   .03   -.05  .02   .03   .04   .01
Email Phishing           .02   .01   -.02  .01   .02   -.03  .01   .02   -.03  .08
Brute-Force Attack       .03   -.04  .01   .03   .01   .01   -.02  .07   .01   .03
Port Scanning            .01   .02   .02   -.01  .07   .02   .04   .01   .04   .03
Registry Takeover        -.02  .02   .01   .04   .03   .08   .01   .02   .02   .01
Denial-of-Service        .03   .01   -.03  .02   .04   .03   .02   .02   .07   .02
Shellcode Execution      .01   -.02  .01   .01   .01   -.01  .09   .01   .01   .02
Malware Exploit          -.01  .03   .02   .08   -.06  .01   .04   .03   .04   .01

Note: The rows display the names of the threat types; those names are abbreviated in the columns.

The Translog Estimation Model below is a derivative of the renowned Cobb-Douglas production function used to calculate the relationship between production output and production inputs. The model measures the ratios of inputs to one another for efficient production and estimates technological change in production methods. The Cobb-Douglas production model has a substantial limitation in that it imposes an arbitrary level for substitution possibilities between inputs. The Translog Estimation Model was chosen instead because it permits greater flexibility and more realistic estimation scores.

Formula 4. Trend to Month Translog Estimation Model

ΔTTM = EndpointDetection + EmailDetection + NetworkDetection + α(Time_AI / ΣIntrusions) + β(Time_ARIMA / ΣIntrusions)

The formula above is the Translog Estimation Model, where the delta in TTM stands for Trend to Month, which is affected by the EndpointDetection, EmailDetection, and NetworkDetection values. The average time of AI over the total number of intrusion scores, plus beta times the time of ARIMA to forecast the threat events over the total number of intrusion scores, are added to the formula as fixed variables to produce a stable R².
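The Spearman rank correlation used to relate the threat types can be computed with the classic rank-difference formula. This stdlib-only sketch assumes no tied values (the tie-free textbook formula), and the monthly counts are invented for illustration, not study data.

```python
def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)).
    Assumes no tied values (the tie-free textbook formula)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical monthly counts for two intrusion types (not study data)
port_scans = [12, 30, 25, 44, 51]
brute_force = [3, 9, 7, 15, 20]
print(spearman(port_scans, brute_force))   # 1.0: perfectly monotonic relationship
```

Because the statistic depends only on ranks, it captures monotonic co-movement between threat types even when their raw counts differ in scale.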
The results are shown below after the values have been calculated in STATA 14. Table 7 shows that the most prominent threat agents are Network Detected, where the malicious intents come from external entities derived largely from the network at first.

Table 7
Trend to Month Translog Estimation Results

Variable          Coefficient   SE         t       P>|t|   [95% Conf. Interval]
TrendtoMonth      100.023       66.09165   25.45   0.004   48.19462   122.3276
ARIMATimeAvg.     85.9153       61.49118   24.77   0.006   45.16691   117.6643
AI TimeAvg.       87.9483       40.79165   9.01    0.001   36.65492   77.69821
EndpointDetect    51.2390       32.80269   8.14    0.003   27.26822   69.12977
EmailDetect       43.8801       19.07419   1.68    0.018   27.04683   59.57094
NetworkDetect     77.9483       40.79165   9.01    0.001   36.65492   77.69821
_cons             27116.82      13921.92   4.52    0.607   18804.27   36516.69

R2        0.5932
sigma_u   .09724095
sigma_e   .2734401
Detailed Analysis (AI vs. ARIMA Statistical Computation to Predict Threat Events)
When visualized, AI’s prediction of future intrusion events is surprisingly
close to that of the ARIMA statistical model. This test suggests that AI’s forecasts
closely track ARIMA’s when both are based on historical data. However, the results
have their limitations, as they rest solely on a predictive model of past data with an
average trend computation. Still, there are some large spikes and dips within the months
of November and December, as the results mainly compute the average trend of each
month reflecting the traffic entering the honeypot in the historical data. A possible
explanation for the spikes in those two months, based on Cyberlytic’s (2018) survey, is
that they are holiday-season months, when cyber criminals are most active. During the
holiday season,
Cyberlytic (2018) believes cyber criminals prey on docile users with holiday season
phishing email advertisements and attempts to attack web applications due to high traffic
holiday shopping. They are also aware that the holiday season is the time when
cybersecurity analysts usually take time-off to spend time with family. Based on this
depiction, companies may need to explore contingency plans to ensure dedicated staff
have visibility on the network and endpoints during this time of the year when certain
members of the Cybersecurity team are off for the holiday. When results show various
concave and convex nodes, one can ask which threat agents are the underlying causes.
According to Sokol and Gajdos (2018), it is unreasonable in the realm of cybersecurity
for one single threat agent to have caused a significant change in the threat trend result
for a particular month. For instance, the entire spike that resulted for December is not
the outcome of an SQL injection cause alone. Therefore, the 10 disseminated threat
types must undergo a Spearman’s Rank Correlation test in order to group them into the
three categories of Network Detected, Endpoint Detected, or Email Detected threat
agents. Based on the results in Table 7, the reader can see that the most prominent threat
agents are Network Detected, where the malicious intents come from external entities
derived largely from the network at first.
Chapter Four Summary
In summary, Chapter Four has taken the reader through the two sections of “AI vs. Human
Analysis in Classifying Cybersecurity Intrusion Events” and “AI vs. ARIMA Statistical
Computation in Predicting Future Cybersecurity Intrusion Events.” In the first section,
data visualization was provided for 200 results of AI and Cybersecurity Analyst
classifying intrusion events chosen at random. The graph can get saturated and fuzzy if
22,286 intrusion event results are graphed; therefore, 200 results were graphed for a high-
level visualization for the reader. In addition, the y-axis of the graph is unable to accept
string variables; therefore, the 10 categories of intrusion types were converted to numeric
severity levels ranked from 1-10. An Ordinary Least Squares model was run between the
data and their respective intrusion detection times. The longer AI and the CAnalyst take
to detect an intrusion, the higher the coefficient (intrusion category) is ranked. At the
end of the section, Table 5
displays an overall data output extracted from Azure Artificial Intelligence Web Service
after the AI has classified the intrusion events based on their categories. The time the
CAnalyst takes to analyze and classify the intrusion events are recorded manually. In
closing, the results clearly indicate that AI has a much faster speed in classifying network
events to their intrusion categories than a human cybersecurity analyst. In the second
section, AI’s threat forecasting ability was compared to an ARIMA statistical model. The
predictive data values from AI and ARIMA are entered in to STATA 14 and a trend
graph is displayed. When results showed various concave and convex protuberances, the
study asked which threat agents were the underlying causes of the strange occurrence.
Based on Sokol and Gajdos (2018), it seemed unlikely that one standalone threat agent
could have a significant impact on the threat trend result for the month. Hence, the spike
in the month of December is not solely the outcome of an SQL injection. Therefore, the
10 disseminated threat types underwent a Spearman’s Rank Correlation test in order to
group them into the three categories of Network Detected, Endpoint Detected, or Email
Detected threat agents. Based on the results, the most outstanding threat agents were
Network Detected threats. Similar to Sokol and Gajdos (2018), the authors
55
believe most threats start out at the network layer then make their way into an
organization’s internal network in order to cause further damage to the endpoints,
databases, and systems that host mission-critical applications to run the business.
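As an illustration of the grouping step described above, a Spearman's Rank Correlation can be sketched in Python with SciPy (the study itself used STATA, and the threat names and monthly counts below are hypothetical):

```python
# Hypothetical monthly counts for three threat types; spearmanr measures
# whether two series rise and fall in the same rank order.
from scipy.stats import spearmanr

sql_injection = [12, 18, 25, 31, 40, 55]   # network-detected agent
port_scanning = [10, 15, 27, 30, 42, 58]   # network-detected agent
phishing      = [50, 44, 39, 33, 20, 12]   # email-detected agent

rho_net, _ = spearmanr(sql_injection, port_scanning)
rho_mix, _ = spearmanr(sql_injection, phishing)

# Strongly co-moving series (rho near +1) can be grouped into the same
# detection category; here the two network agents correlate perfectly,
# while the email agent moves in the opposite rank order.
print(round(rho_net, 2), round(rho_mix, 2))   # 1.0 -1.0
```

Pairs with a correlation coefficient near +1 would be grouped into the same detection category.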
CHAPTER FIVE. CONCLUSIONS AND DISCUSSION
Introduction to Conclusions and Discussion
AI is considered one of the most promising developments of the information age; with the
availability of high computing power and abundant data, AI has been on the rise in the
recent decade. It has been entering every business and industry, and cybersecurity cannot
be left out (Hossein et al., 2020). New algorithms, data volume, and technological
enhancements have let AI grow concurrently with the emerging global security industry.
Compared to conventional cybersecurity solutions, AI is more flexible, adaptable, and
robust, thus helping to improve security performance and better protect systems from an
increasing number of sophisticated cyber threats (Selden, 2016). Currently, AI techniques
are widely adopted as powerful detection, prediction, and response tools in the realm of
cybersecurity (Collins et al., 2019). The data within a network environment are enormous,
whether firewall logs or network packets, user activities or application logs, and it is
difficult for human analysts to triage and analyze them rapidly enough for early
detection; this is where AI can come into play with its Machine Learning algorithms
(Rajbanshi et al., 2017). A machine learning engine arms cybersecurity analysts with
rapid and trustworthy analysis that can be used to make informed decisions. This study's
examination of AI's ability to classify intrusion events compared to a human
cybersecurity analyst proved that AI is much faster than humans. Additionally, the paper
also tested the prediction capability of AI to forecast cybersecurity intrusion events
with time-series datasets in comparison to a popular time-series prediction model, the
ARIMA statistical model. Yet this test showed that AI's forecasting ability only
replicates what is closest to ARIMA based on historical data. However, the results have
limitations, as they are based on a predictive model of past data with an average trend
computation. Thus, AI's predictive model is not heteroskedastic: it does not include
threat risks due to social unrest, political changes, wars, natural disasters, or
recently discovered product vulnerabilities, all of which can cause tremendous periodic
changes in the threat vectors of future threat trends.
Summary of the Results
This paper explores the topic of AI in its applicability to cybersecurity. AI's ability
to classify intrusion events was compared to a human Cybersecurity Analyst's, which
proved that AI is much faster than humans. Additionally, the paper also tested the
prediction capability of AI to forecast cybersecurity intrusion events with time-series
datasets compared to a popular time-series prediction model, the ARIMA model. Yet this
test showed that AI's forecasting ability mimics what is closest to ARIMA's past data
values. As shown in the first section of Chapter Four, Cybersecurity Analysts are slower
in threat detection than AI. Threats with severity scores over five take the
Cybersecurity Analyst more than 20 minutes to detect and classify, while AI can identify
the threat within minutes. The results support the theory that AI can parse, analyze, and
correlate log events to identify threats faster and more efficiently. Thus, when it comes
to reducing operational time in finding anomalies, AI may prove to be ahead of human
ability and competence. In the second section, AI was compared to an ARIMA statistical
model in predicting intrusion events. The two models are not far off from each other,
with some dips and spikes within the months. However, the results are based on the
historical data of the traffic and threats entering the honeypot from previous points in
time, and the forecast is an average trend of the previous months' data. It is
understandable that there are underlying limitations to AI's prediction ability, as it is
based only on past data with an average trend computation. Thus, AI's predictive model
was not heteroskedastic; it did not take into consideration the multiple externalities
that can impact the threat trend, including but not limited to social unrest, policy
changes, global news, and natural disasters.
Discussion of the Results
The nature of this study type is exceptionally intricate and required leveraging several
technology components in order to achieve the results. Those technology components have
been explained in previous chapters; the three quintessential components that contributed
most to the study's results were the Machine Learning engine, the Cybersecurity Analyst,
and the ARIMA model. The study specifically focuses on Machine Learning to extract
insights from security data, as the research design was conceived to build its own
data-driven intelligent security solution. Therefore, a dedicated honeypot environment
was hosted on Azure Cloud Services to collect the necessary data, such as firewall logs
and NetFlow, with which to build the Machine Learning engine. Those data were then
handled by Corelight and Elastic technology to structure them into a format ingestible by
the Azure Artificial Intelligence Web Service. From there, the Machine Learning engine
built its intelligence from the 80% of the data that had been mapped using Cisco Talos
Threat Intelligence Rules, while the remaining 20% were left unknown to the AI to be
tested later. AI improves its knowledge of cybersecurity threats by consuming a large
number of data artifacts. To build the AI Machine Learning engine, data are fed to the
Machine Learning model hosted on Azure Artificial Intelligence Web Service, where it can
process, analyze, and match events to rules in order to build its intrusion detection and
prediction algorithms. AI's ability to classify intrusion events was first compared to a
human cybersecurity analyst's, which proved that AI is much faster than humans. However,
the cybersecurity analyst used in the study to benchmark AI's intrusion event
classification ability has five years of experience as an analyst. The results could have
differed if the cybersecurity analyst were more experienced or a novice in the
cybersecurity space, analyzing and classifying intrusion events considerably faster or
slower than AI.
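The 80/20 partition described above can be sketched as follows; the feature vectors and labels are placeholders standing in for the Talos-mapped honeypot events, and scikit-learn's train_test_split is used purely for illustration:

```python
# Split labeled events into 80% for training and 20% held back for testing.
from sklearn.model_selection import train_test_split

events = [[i, i % 3] for i in range(100)]   # hypothetical log-event features
labels = [i % 10 for i in range(100)]       # 10 intrusion categories

X_train, X_test, y_train, y_test = train_test_split(
    events, labels, test_size=0.20, random_state=42)

print(len(X_train), len(X_test))   # 80 20
```

The model never sees the held-out 20% during training, so its classifications of those events measure how well the learned rules generalize.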
Next, the study asked about the prediction capability of AI to forecast
cybersecurity intrusion events when compared against a well-known time-series
prediction model, the ARIMA statistical model. The ARIMA model, against which AI's
forecasting ability was compared, is a statistical analysis model that uses time-series
data to forecast or predict future outcomes based on a historical time series. The model
explains a given time series based on its own past values, that is, its own lags and the
lagged forecast errors, so that the equation can be used to forecast future values
(Zavadskaya, 2017).
Replacing the ARIMA model with a Bayesian Structural Time Series [BSTS]
model could have produced a different outcome. BSTS is also designed to work with
time-series data. However, it takes a different approach from ARIMA models when
performing time-series forecasting, nowcasting, and inferring causal impact. More
importantly, it deals with uncertainty in a different manner. The model does not rely on
differencing, lags, and moving averages; instead, it quantifies the posterior uncertainty
of the individual components, controls the variance of the components, and incorporates
exogenous variables and multi-seasonal components more easily (Ogunc & Hill, 2018).
Discussion of the Results in Relation to the Literature
Prior academic works performed by various researchers to understand AI's
relationship to cybersecurity are restricted by certain boundaries. Some are limited by
the lack of publicly available data on AI in cybersecurity; some face a narrow scope due
to the technological constraints of their time. In Ghosh et al. (1998), the authors apply
the support vector machine model to detect anomalous Windows registry accesses using the
KDD99 dataset; however, anomalous Windows registry access has since become an obsolete
approach. Anomalous behaviors on users' computers nowadays comprise a wide range of
activities, from BIOS modification, malicious process execution, and account privilege
escalation to kernel corruption. In other research by Kozik et al. (2014), the authors
used the CSIC 2010 HTTP Dataset to assess the classification of Internet traffic. Their
study focuses specifically on traffic using HTTP protocols by detecting abnormality in
well-known port numbers of IP flows, yet they did not characterize the traffic itself.
The weakness of this study is that, at the time, the majority of Internet traffic used
HTTP on logical port 80 and only a few flows used other ports, such as port 23 for
Telnet, port 21 for FTP, and port 139 for NetBIOS. Over time, modern network protocols
have expanded drastically, with contemporary applications leveraging a multitude of port
ranges across the 65,535 TCP/IP application ports. Bhuyan et al. (2014) attempt a
different approach, creating unbiased real-life network intrusion datasets to compensate
for the lack of available datasets. The authors create an entirely new dataset in the
development of a detection system, TUIDS Distributed Denial of Service (DDoS), tested
against an older CAIDA DDoS dataset. Although the Bhuyan et al. study is very thorough,
experimental results show the proposed model yields 65% detection accuracy on the
normalized CAIDA dataset, and the paper primarily emphasizes DDoS attacks while leaving a
large number of attack vectors in cyberspace unmentioned. By contrast, the works of this
research are fully applicable to intrusion and misuse detection problems and cover a wide
range of intrusion vectors at the network, endpoint, and email layers. The study also
concentrates on Machine Learning (a branch of AI) to extract insights from security data,
with the concept of building its own data-driven intelligent security solution. Moreover,
the paper has two unique objectives: testing AI's threat classification ability against a
human cybersecurity analyst, and testing AI's ability to predict future threat events
against the renowned ARIMA time-series forecasting model, in order to understand the
effectiveness of AI in cybersecurity.
Limitations
This dissertation had certain limitations, as the study of AI is complicated by the lack
of data. Data about AI in cybersecurity are extremely limited; even where some legacy
data can be found, there are problems with existing public datasets, such as uneven data
and outdated content, from the technology used to the data collection methods. In
addition, most of the data have already been aggregated. The dataset used to build a
Machine Learning engine is imperative, as sufficient and proper data are needed to train
and test the system. In order to collect raw security data, it was necessary to create a
dedicated honeypot environment hosted on Azure Cloud Services. Obtaining such a dataset
is difficult and time-consuming, as it took nine months to collect enough data.
Furthermore, the study's limited budget meant that it was not possible to implement a
best-of-breed AI system. For these reasons, this study is limited to the investigation of
the performance of Azure Artificial Intelligence Web Service. It is important to note
that the results of this research cannot represent AI models at large, as each AI type
differs from the others. The research is also limited to the linear methods used, such as
the ARIMA statistical model compared to Machine Learning. Thus, the comparison of the
classification and prediction of threats covers only the Machine Learning branch within
AI, while the research excludes other subsets of AI such as natural language processing,
vision, robotics, pattern recognition, and convolutional neural networks. The study
design is also constrained by its use of low-end and open-source technology such as
Corelight, Elastic, Python, STATA, and Azure Artificial Intelligence Web Service. Thus,
the results of this research could vary tremendously if more sophisticated technology
were introduced.
Implication of the Results for Practice
AI, and in particular Machine Learning, has taken huge strides, impacting all
aspects of industry and society. This development has been fueled by decades of
exponential improvement in hardware computing power, combined with progress in algorithms
and, perhaps most importantly, a huge increase in the volume of training and testing data
ready for AI to ingest (Xin et al., 2018). AI is now ready to improve the efficiency of
our workplaces and can augment the work humans do as it is gradually integrated into the
fabric of business and applicable fields of science. However, not all sectors are equally
advanced by AI (Tyugu, 2011). Therefore, this paper explores the effectiveness of AI in
cybersecurity. The paper showcases AI's ability to classify intrusion events compared to
a human cybersecurity analyst and AI's ability to forecast future cybersecurity events
compared to a time-series prediction model, the ARIMA model. The study is uniquely
conducted within the domain of cybersecurity data science, collecting raw security data
from a honeypot and using cloud services such as Azure Artificial Intelligence Web
Service to build a Machine Learning engine with semi-supervised machine learning
techniques, in which the data are analyzed and referenced against threat intelligence
sources to train the Machine Learning algorithm. For practitioners, this study opens a
spectrum of ideas on how cybersecurity data science and relevant learning methods can be
used to design data-driven, intelligent, decision-making cybersecurity systems and
services for organizations from a machine learning perspective. This study on AI, Machine
Learning, and cybersecurity data science opens a promising path and can be used as a
reference guide for both academia and industry leaders for later research and
applications in the area of cybersecurity. Due to the fast-growing nature of AI and its
promising benefits for the future, certain ethical issues and adverse effects of AI still
need to be addressed, and it is necessary to resolve these risks and concerns as early as
possible. But given that sustainable solutions to these concerns are not yet in sight,
socially responsible use of AI within cybersecurity is highly recommended. Defining the
boundaries of ethical access to data is a complex problem that affects various
stakeholders: citizens, the state, corporations, public institutions, etc. This study
displays a lightweight AI engine that has the capability to classify and forecast
intrusion events. Although the AI engine built in this study is very basic, government
officials and world leaders must ensure that organizations protect proprietary AI
supercomputers from ending up in the hands of malicious entities who would exploit AI's
abilities to carry out malicious intents.
Recommendations for Further Research
In this paper, the primary focal points of AI's applicability to cybersecurity are its
abilities to classify and predict attacks. Yet, according to Collins et al. (2019), a
prolific component of AI is its autonomous ability to respond to threat events; this
paper's conclusion builds a theoretical hypothesis that poses questions for further
research. In the future, other researchers may focus on AI's effectiveness in responding
to cybersecurity threats on behalf of human cybersecurity analysts. Furthermore, the data
collected by the honeypots in this research are interesting because they are raw
cybersecurity data. Much of this paper focuses on examining raw security data for
data-driven decision making in intelligent security solutions, yet it could also be
related to big data analytics in terms of data processing and decision making, as big
data deals with data sets that are too large or have complex characteristics. Overall,
this paper's aim is not only to discuss cybersecurity data science and relevant methods,
but also to discuss their applicability toward data-driven intelligent decision making in
cybersecurity systems and services from a machine learning perspective. Although the
outcomes for the predictive model of the Machine Learning engine were not impressive when
compared to the ARIMA statistical model, since AI's forecasting ability only replicates
what is closest to ARIMA based on historical data by averaging the trends, future
research may explore multivariate time-series data that are heteroskedastic, where the
predictive models can take into consideration multiple externalities that may impact the
threat trend, including, but not limited to, social unrest, political changes, natural
disasters, and pandemics. Future work could also assess other areas through empirical
evaluation of the suggested data-driven model and comparative analysis of other subsets
within AI, such as natural language processing, vision, robotics, pattern recognition,
and convolutional neural networks, prompting further research on how these can be applied
to AI in cybersecurity.
Conclusion
Despite the positive future for AI, it can also introduce potential global risks for
human civilization; there could be ethical issues such as the missing moral code of AI's
autonomous decision-making ability and concerns about the lack of data privacy (Patel,
2017). AI is a kind of intelligent system capable of making decisions on its own. This
system represents the direction of the development of computer functions related to human
intelligence, such as reasoning, training, and problem solving. In other words, AI is the
transfer of the human capabilities of mental activity to the plane of computer and
information technologies, but without inherent human vices (Zhang et al., 2018). To that
extent, if AI is given the authority to act on its own without human intervention, it
could wreak havoc on enterprises in cases where the actions AI has taken on its own are
erroneous. For example, if AI were to respond to a threat event incorrectly and the
cybersecurity department relied on that response, it could ignore true positive threat
events that can cause major damage to the company. Secondly, AI raises major privacy
concerns, as it can ingest large amounts of data in milliseconds. With this speed and
volume of data ingestion, AI may potentially capture employees' usernames, personally
identifiable information, salaries, etc. across the network; once absorbed, it would
become hard to separate the wheat from the chaff of those data in the long run (Williams
& McGregor, 2020).
In many of the most recent marketing articles written about AI in cybersecurity
products, much is said about the applicability of AI in cybersecurity; some even
emphasize that AI has the ability to replace human analysts. Although AI has undergone
major advancements in recent years, the truth is that we might not be at that stage yet.
AI can certainly be a complementary product for human cybersecurity analysts, but at the
current stage it cannot be a substitute. As mentioned at the beginning of the research
paper, this study suggests that AI can automatically classify certain threat events to
help cybersecurity analysts prioritize where they should focus their attention, and can
also predict threat trends to provide insight into the future. Thus, AI should only act
as a source of intelligence for human decision making and should not take autonomous
actions on its own. At this technological stage, a strong interdependence between AI
systems and human factors is necessary for augmenting cybersecurity's maturity. Moreover,
a holistic view of the cybersecurity landscape within the enterprise IT environment is
needed, as cybersecurity is not only a technological paradigm; it is also an art of how
security risks are dealt with through human logic and experience. It is necessary to
integrate technical solutions and relevant processes to achieve optimal security
performance; in the end, however, it is still the human factor that matters, not just the
tools themselves. Therefore, a combined effort of humans and AI will surely be more
effective in fighting off cybercriminals.
APPENDIX A. INSTITUTIONAL REVIEW BOARD (IRB)
Institutional Review Board | Academic Affairs
Phone. 786.417.9300 | Email. [email protected]
Memorandum
Date: 1.22.21
To: Dr. Knowles and Mr. Pham
From: Tony Andenoro, Chair, Institutional Review Board
Subject: IRB Application 1.2021.3 – Artificial Intelligence in Cybersecurity –
Concentration on the Effectiveness of Machine Learning
St. Thomas University Institutional Review Board (IRB) is pleased to inform you that
your research protocol submitted for review on 1.20.21 has been approved after a
formal review for exempt status aligning with the Common Rule, Code of Federal
Regulations – Human Subjects Research provision. Please note that approval for this
study will lapse on 1.22.22 and any changes to the provided protocol will require
notification and may require additional approval. More specifically, any changes to
any portion of the research project, including but not limited to instrumentation
protocol(s) or informed consent must be reviewed and approved by the IRB prior to
implementation. In addition, if there are any unanticipated adverse reactions or
unanticipated events associated with the conduct of this research, you should
immediately suspend the project and contact the IRB Chair for consultation.
Should you have any questions, feel free to contact me at 786.417.9300 between 8AM
and 5PM EST Monday through Friday, or via e-mail at [email protected]. Thank you and
have a great day.
Sincerely,
Anthony C. Andenoro, Ph.D.
Chair | Institutional Review Board
Executive Director | Institute for Ethical Leadership
Office of the Provost | St. Thomas University
16401 NW 37 Avenue • Miami Gardens, FL 33054 • 305.625.6000
stu.edu
REFERENCES
Al Qahtani, H., Sarker, I. H., Kalim, A., & Hossain, S. (2020). Cyber intrusion detection
using machine learning classification techniques. In N. Chaubey, S. Parikh, & K.
Amin (Eds.), Computing Science, Communication, and Security.
Communications in Computer and Information Science, 1235, 121-131. Springer.
https://doi.org/10.1007/978-981-15-6648-6_10
Apruzzese, G., Colajanni, M., Ferretti, L., Guido, A., & Marchetti, M. (2018). On the
effectiveness of machine and deep learning for cyber security. 2018 10th
International Conference on Cyber Conflict (CyCon), 2018, 371-390. Tallinn,
Estonia, May 29-June 1, 2018. doi: 10.23919/CYCON.2018.8405026
Banoth, L., Teja, M. S., Saicharan, M., & Chandra, N. J. (2016). A survey of data mining
and machine learning methods for cyber security intrusion detection.
International Journal of Research, 4, 406-412. DOI: 10.23883/ijrter.2017.3117.9nwqv
Benavente-Peces, C., & Bartolini, D. (2019). Insights in machine learning for cyber
security assessment. In C. Benavente-Peces, S. Slama, & B. Zafar (Eds)
Proceedings of the 1st International Conference on Smart Innovation, Ergonomics
and Applied Human Factors (SEAHF), Madrid, January 2019. Smart
Innovation, Systems and Technologies, 150. Springer.
https://doi.org/10.1007/978-3-030-22964-1_33
Becue, A., Praça, I., & Gama, J. (2021). Artificial intelligence, cyber-threats and industry
4.0: Challenges and opportunities. Artificial Intelligence Review, 54, 3849-3886.
DOI: 10.1007/s10462-020-09942-2
Bhatele, K. R., Shrivastava, H., & Kumari, N. (2019). The role of artificial intelligence in
cyber security. In S. Geetha, & A. V. Phamila (Eds.), Countering Cyber Attacks
and Preserving the Integrity and Availability of Critical Systems (pp.170-192).
IGI Global. Doi: 10.4018/978-1-5225-8241-0.ch009.
Bhuyan, M. H., Kashyap, H. J., Bhattacharyya, D. K., & Kalita, J. K. (2014). Detecting
distributed DoS attacks: Methods, tools and future directions. The Computer
Journal, 57(4), 537-556. https://doi.org/10.1093/comjnl/bxt031
Buczak, A. L., & Guven, E. (2016). A survey of data mining and machine learning
methods for cyber security intrusion detection. IEEE Communications Surveys &
Tutorials, 18, 1153-1176. Doi 10.1109/COMST.2015.2494502
Butler M., & Kazakov, D. (2010). The effects of variable stationarity in a financial time
series on artificial neural networks. 2011 IEEE Symposium on Computational
Intelligence for Financial Engineering and Economics (CIFEr), 2011, 1-8, doi:
10.1109/CIFER.2011.5953557.
Calderon, R., & Floridi, L. (2019). The benefits of artificial intelligence in cybersecurity.
Nature machine intelligence, Business Computer Science Journal, 1065(3), 1-4.
Cioffi, R., Travaglioni, M., Piscitelli, G., Petrillo, A., & Felice, M. (2019). Artificial
intelligence and machine learning applications in smart production: Progress,
trends, and directions. Sustainability, 12(2), 492. DOI:10.3390/su12020492
Collins, C., Dennehy D., Kieran C., & Mikalef, P. (2019). Artificial intelligence in
information systems research. International Journal of Information Management,
60(2), 56-70. https://doi.org/10.1016/j.ijinfomgt.2021.102383.
Conner-Simons, A. (2016, April 18). System predicts 85 percent of cyber-attacks using
input from human experts. https://phys.org/news/2016-04-percent-cyber-attacks-
human-experts.html
Cyberlytic. (2018). AI for web security technical data sheet.
https://www.cyberlytic.com/uploads/resources/Technical-Data-Sheet-Final.pdf
Cylance. (2020). Continuous threat prevention powered by artificial intelligence.
https://www.cylance.com/content/dam/cylance-web/en-us/resources/knowledge-center/resource-library/datasheets/CylancePROTECT.pdf
Darktrace. (2018). Detects and classifies cyber-threats across your enterprise.
https://www.darktrace.com/en/products
Dasgupta, D., Akhtar, Z., & Sen, S. (2020). Machine learning in cybersecurity: A
comprehensive survey. The Journal of Defense Modeling and Simulation, 45, 90-
109. DOI: 10.1177/1548512920951275
Davenport T., & Kalakota, R. (2019). The potential for artificial intelligence in
healthcare. Future Healthcare Journal, 6(2), 94-98. DOI: 10.7861/futurehosp.6-2-
94
Demertzis, K., & Iliadis, L.S. (2015). A bio-inspired hybrid artificial intelligence
framework for cyber security. In N. J. Daras, & M. T. Rassias (Eds.),
Computation, Cryptography, and Network Security (pp.161-193). Springer.
Devakunchari, R., & Sourabh, M. (2019). A study of cyber security using machine
learning techniques. International Journal of Innovative Technology and
Exploring Engineering, 8(7C2), 178-255. https://www.ijitee.org/
Fazil, M., Rohila, A., & Ghaparb, W. (2019). The era of artificial intelligence in
Malaysian higher education: Impact and challenges in tangible mixed-reality
learning system towards self-exploration education. Procedia Computer Science,
163(2), 2-10. https://doi.org/10.1016/j.procs.2019.12.079
FinancesOnline (2018, July 5). FinancesOnline IBM MaaS360 review.
https://reviews.financesonline.com/p/ibm-maas360
Frank, J. (2014). Artificial intelligence and intrusion detection: Current and future
directions. Proceedings of the 17th National Computer Security Conference,
Baltimore, October 1994. https://www.cerias.purdue.edu/apps/reports_and_papers/view/894
Goyal, Y., & Sharma, A. (2019). A semantic machine learning approach for cyber
security monitoring. 2019 3rd International Conference on Computing
Methodologies and Communication, 2019, 439-442. March 2019. doi:
10.1109/ICCMC.2019.8819796.
Hossein, M. R., Karimipour, H., Rahimnejad, A., Dehghantanha A., & Srivastava, G.
(2020). Anomaly detection in cyber-physical systems using machine learning. In
K. R. Choo, & A. Dehghantanha (Eds.), Handbook of Big Data Privacy (pp.
219-236). Springer Nature. https://doi.org/10.1007/978-3-030-38557-6_10
Hussain, F., Hussain, R., Hassan, S. A., & Hossain, E. (2020). Machine learning in IoT
security: Current solutions and future challenges. IEEE Communications Surveys
& Tutorials, 22(3), 1686-1721. doi: 10.1109/COMST.2020.2986444.
Kabbas, A., & Munshi, A. (2020). Artificial intelligence applications in cybersecurity.
International Journal of Computer Science and Network Security, 20(2), 1-22.
http://paper.ijcsns.org/07_book/202002/20200216.pdf
Khisimova, Z. I., Begishev, I., R., & Sidorenko, E. L. (2019). Artificial intelligence and
problems of ensuring cyber security. International Journal of Cyber Criminology,
13(2), 564–577. DOI: 10.5281/zenodo.3709267
Kozik, R., Choraś, M., Renk, R., & Holubowicz, W. (2014). A proposal of algorithm for
web applications cyber attack detection security. In K. Saeed, & V. Snášel (Eds),
Computer Information Systems and Industrial Management. CISIM 2015.
Lecture Notes in Computer Science, 8838. Springer.
https://doi.org/10.1007/978-3-662-45237-0_61
Li, J. (2019). Cyber security meets artificial intelligence: A survey. Frontiers of
Information Technology & Electronic Engineering Journal, 19, 1462-1474.
https://doi.org/10.1631/FITEE.1800573
Min, H. (2015). Artificial intelligence in design and quality assurance management.
International Journal of Logistics Management, 1855(4), 12-17.
https://www.emeraldgrouppublishing.com/journal/ijlm
Mittal, S., Joshi, A., & Finin, W. (2019). Cyber-all-intel: An AI for security related threat
intelligence. arXiv preprint arXiv:1905.02895. https://arxiv.org/abs/1905.02895
National Initiative for Cybersecurity Careers and Studies [NICCS]. (2018). Glossary.
https://niccs.cisa.gov/about-niccs/cybersecurity-glossary
Ogunc, A., & Hill, C. (2008). Using Excel: Companion to Principles of Econometrics (3rd
ed.).
https://econweb.tamu.edu/hwang/CLASS/Ecmt463/Lecture%20Notes/Excel/Excel_Lessons.pdf
Palmer, T. (2017). Vectra cognito—Automating security operations with AI. ESG lab
review. https://info.vectra.ai/hs-fs/hub/388196/file-1918923738.pdf
Pantano, E., & Pizzi, G. (2020). Forecasting artificial intelligence on online customer
assistance: Evidence from chatbot patents analysis. Journal of Retailing and
Consumer Services, 55, 10-39. https://doi.org/10.1016/j.jretconser.2020.102096
Parrend, P., Navarro, J., Guigou, F., Deruyver, A., & Collet, P. (2018). Foundations and
applications of artificial intelligence for zero-day and multi-step attack detection.
EURASIP Journal on Information Security, 4(2018), 1-85.
https://doi.org/10.1186/s13635-018-0074-y
Patel, M. (2017). QRadar UBA app adds machine learning and peer group analyses to
detect anomalies in user’s activities. https://securityintelligence.com/qradar-uba-
app-adds-machine-learning-and-peer-group-analyses-to-detect-anomalies-in-
users-activities
Pham, B. T., Le, L. M., Le, T., Bui, K. T., Le, V. M., Hai-Bang, L., & Prakash, I. (2020).
Development of advanced artificial intelligence models for daily rainfall
prediction. Atmospheric Research, 237, 104845.
https://doi.org/10.1016/j.atmosres.2020.104845
74
Rajbanshi, A., Bhimrajka, S., & Raina, C. K. (2017). Artificial intelligence in
cybersecurity. International Journal for Research in Applied Science and
Engineering Technology, 2(3), 132-137.
https://ijsrcseit.com/paper/CSEIT1722265.pdf
Sarker, I. H., Abushark, Y. B., Alsolami, F., & Khan, A. I. (2020). IntruDTree: A
machine learning based cyber security intrusion detection model. Symmetry,
12(5), 754. https://doi.org/10.3390/sym12050754
Sarker, I. H., Kayes, A. S., Al Qahtani, H., & Watters, P. A. (2020). Cybersecurity data
science: An overview from machine learning perspective. Journal of Big Data,
7(41). Doi: 10.1186/s40537-020-00318-5
Schuurmans, D. (1995). Convex training algorithms: Explaining machine learning. IEEE
Transactions on Pattern Analysis and Machine Intelligence Journal, 86, 218-323.
Selden, H. (2016). Deep instinct: A new way to prevent malware, with deep learning.
Tom’s Hardware. https://www.tomshardware.com/news/deep-instinct-deep-
learning-malware-detection,31079.html
Sokol, P., & Gajdos, A. (2018). Prediction of attacks against honeynet based on time
series modeling. Advances in Intelligent Systems and Computing, 662, 360-371.
DOI:10.1007/ 978-3-319-67621-0_33
Soni, S. & Bhushan, B. (2019). Use of machine learning algorithms for designing
efficient cyber security solutions. 2019 2nd International Conference on
Intelligent Computing, Instrumentation and Control Technologies, 2019, 1496-
75
1501. Kannur, Kerala, India, July 2019. doi:10.1109/ICICICT46008.2019.
8993253.
SparkCognition. (2018). A cognitive approach to system protection.
https://www.sparkcognition. com/deep-armor-cognitive-anti-malware
Stonefly. (2018). Amazon Macie: Artificial intelligence for efficient data security. https://
stonefly.com/blog/amazon-macie-artificialintelligence-efficient-data-security
Truve, S. (2017). Machine learning in cyber security: Age of the centaurs. http://www.
brookcourtsolutions.com/wp-content/uploads/2017/07/Machine-Learning-in-
Cyber-Security-White-Paper-Brookcourt.pdf
Tyugu, E. (2011). Artificial intelligence in cyber defense. In C. Czosseck, E. Tyugu, & T.
Wingfield (Eds.), 3rd International Conference on Cyber Conflict, 3, 1-11.
Tallinn, Estonia, June 2011. CCD COE Publications.
Virmani, C., Choudhary, T., Pillai, A., & Rani, M. (2020). Applications of machine
learning in cyber security. In G. Padmavathi, & D. Shanmugapriya (Eds.),
Handbook of research on machine and deep learning applications for cyber
security (pp. 83-103). ICI Global.
Williams J., & McGregor, S. (2020). What can artificial intelligence do for security
analysis? IBM QRadar Advisor with Watson. https://www.ibm.com/us-
en/marketplace/cognitive-security-analytics
Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M., Hou, H., & Wang, C.
(2018). Machine learning and deep learning methods for cybersecurity. IEEE
Access, 6, 3335365-35381 DOI: 10.1109/ACCESS.2018.2836950
76
Yavanoglu, O., & Aydos, M. (2017). A review on cybersecurity datasets for machine
learning algorithms. 2017 IEEE International Conference on Big Data, 2017,
2186-2193. Honolulu, June 2017. doi: 10.1109/BigData.2017.8258167
Zavadskaya, A. (2017). Artificial intelligence in finance: Forecasting stock market
returns using artificial neural networks. The Alan Turing Institute Journal,
N510129, 1-177. https:// www. turing.ac.uk/research/research-programmes/data-
centric-engineering/journal
Zhang, Z., Yu, Y., Zhang, H., Newberry, E., Mastorakis, S., Li, Y., Afanasyev, A., &
Zhang, L. (2018). An overview of security support in named data networking.
NDN, Technical Report NDN-0057. http://named-data.net/techreports.htm
A Framework for Artificial Intelligence Applications in
the Healthcare Revenue Management Cycle
by
Leonard J. Pounds
A dissertation submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in
Information Systems
College of Computing and Engineering
Nova Southeastern University
2021
An Abstract of a Dissertation Submitted to Nova Southeastern University
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A Framework for Artificial Intelligence Applications in the
Healthcare Revenue Management Cycle
by
Leonard J. Pounds
September 2021
There is a lack of understanding of specific risks and benefits associated with AI/RPA
implementations in healthcare revenue cycle settings. Healthcare companies are
confronted with stricter regulations and billing requirements, underpayments, and more
significant delays in receiving payments. Despite the continued interest of practitioners,
revenue cycle management has not received much attention in research. Revenue cycle
management is defined as the process of identifying, collecting, and managing the
practice’s revenue from payers based on the services provided.
This dissertation provided contributions to both of the areas mentioned above. To
accomplish this, semi-structured interviews were conducted with healthcare executives. The
semi-structured interview data obtained from each participant underwent a triangulation
process to determine the validity of responses aligned with the extant literature. Data
triangulation ensured further that significant themes found in the interview data answered
the central research questions. The study focused on how the broader issues related to
AI/RPA integration into revenue cycle management will affect individual organizations.
These findings also presented multiple views of the technology’s potential benefits,
limitations, and risk management strategies to address its associative threats. The
triangulation of the responses and current literature helped develop a theoretical
framework that may be applied to a healthcare organization in an effort to migrate from
their current revenue management technique to one that includes the use of AI/ML/RPA
as a means of future cost control and revenue boost.
Acknowledgments
This dissertation would not have been possible without the support of many people.
Many thanks to my chair, Dr. Gregory Simco, who read my numerous revisions and
helped make some sense of the confusion. Also, thanks to my committee members, Dr.
Ling Wang and Dr. Mary Harward, who offered tremendous guidance and support.
I’d also like to extend my gratitude to the numerous staff and administrators at NSU that
assisted me with completing this dissertation. Special thanks to Deans Meline Kevorkian
and Kimberly Durham, who may not know their impact on my completing this program.
As well as my mentor, someone I am so grateful to have had the chance to work with,
Mr. Tom West, for the tireless support and for sharing your knowledge and wisdom
during our frequent conversations. I hope to inspire others as you have inspired me.
And my greatest thanks to my family for all the support you have shown me through this
research, the culmination of the years of the Ph.D. program. For my in-laws, Carmen and
Amy Shick, your entrepreneurial skills, perseverance, and integrity are just a few of your
qualities that continue to inspire me to be a better person every day. And for my wife
Isabelle, who deserves an honorary degree for proofreading every one of my papers, for
her boundless love, support, and for always believing in me, without which I would have
stopped these studies a long time ago.
Finally, I cannot begin to express my thanks to my late parents, Franklin and Suzzann
Pounds. They will not get to share in the joy of this accomplishment, although the
example they set with their work ethic is something I strive to match each day. Thank
you for instilling values in me that I will carry throughout the rest of my life.
Table of Contents
Abstract iii
Acknowledgments iv
List of Tables viii
List of Figures ix
Chapters
1. Introduction 1
Background 1
Problem Statement 3
Problem Broader Context 5
Justification of the Study 6
Dissertation Goals 7
Research Questions 7
Relevance and Significance 8
Barriers and Issues 9
Assumptions, Limitations, and Delimitations 11
Assumptions 11
Limitations 12
Delimitations 12
Definition of Terms 13
Summary 14
2. Review of the Literature 16
Literature Review 16
Justification for Inclusion and Exclusion 17
Inclusion 17
Exclusion 18
Previous Work and Strengths and Weaknesses 18
Gaps in the Literature 20
Analysis of Research Methods Used 20
Concept of Artificial Intelligence, Machine Learning, and Robotic Process
Automation 21
Process Definitions of Revenue Cycle Management 23
Potential of Machine Learning for these Processes 24
Application of Machine Learning/Artificial Intelligence to Healthcare Issues 27
Summary and Thematic Analysis 29
3. Methodology 33
Approach 33
Justification for the Methodology 34
Theoretical Framework and Development 35
Interview Structure and Design 36
Data Collection and Research Questions 39
Sampling 40
Data Analysis Procedures 41
Trustworthiness and Reliability 42
Summary 44
4. Results 45
Data Analysis 45
Thematic Analysis Approach 45
Details of Interviews 47
Word Frequency 47
Analysis Process in NVivo 48
Findings 48
Benefits of AI in HRCMP 48
Negative Impact of Risk Factors in HRCMP 52
Risk Management and Problem-Solving Strategies 53
Triangulation of Data 55
Framework Design 56
Development 56
Framework Example 57
Summary 61
5. Conclusion 62
Research Questions 62
Research Question 1: What prospective benefits can be generated by using
AI revenue cycle applications for healthcare organizations? 62
Research Question 2: What are the risk factors associated with AI
implementation in healthcare? 62
Research Question 3: What outcomes are derived by using a Lean Six
Sigma (LSS) designed framework for healthcare executives deciding to
implement AI/RPA in the healthcare revenue cycle? 63
Limitations 63
Implications 64
Recommendations 65
Summary 66
Appendices 70
A. Interview Questions 71
B. IRB Exempt Initial Memo 75
C. Email Invitation 77
D. Informed Consent 78
E. NVIVO Codes 82
F. Triangulation of Data 84
References 87
List of Tables
Tables
1. Research Questions with Relevant Themes Hierarchy 46
2. Interview Overview with Duration of Interviews 47
3. Risk Viability Framework 58
4. Benefits Framework 59
List of Figures
Figures
1. Word Frequency 48
2. Theme Hierarchy of Benefits of AI in HRCMP 50
3. Percentage Coverage of Benefits of AI in HRCMP 51
4. Theme Hierarchy of Negative Impact of Risk Factors in HRCMP 52
5. Percentage Coverage of Negative Impact of Risk Factors in HRCMP 53
6. Theme Hierarchy of Risk Management and Problem-Solving Strategies 54
7. Percentage Coverage of Risk Management and Problem-Solving Strategies 55
8. Framework Scatterplot 61
Chapter 1
Introduction
Background
As healthcare institutions navigate the industry’s varied challenges, they
increasingly rely on healthcare information technology (HIT) as a means of cultivating
solutions to recurrent problems (Bohr & Memarzadeh, 2020, p. 25-60; Stanfill & Marc,
2019). HIT includes diverse systems, hardware, and software that provide varied benefits
to selected users across organizational contexts. One specific HIT application includes
artificial intelligence (AI): A collection of systems and programs that rely on computer-
generated thinking to perform a range of prescriptive and predictive tasks. Current AI
applications in healthcare contexts include machine learning platforms that contribute to
decision-making in both clinical and administrative areas of operation (Lin et al., 2017).
Analysts predict that as the technology advances and becomes increasingly manageable from
risk-mitigation and cost-effectiveness standpoints, planners across the healthcare sector
will likely adopt AI and machine learning (ML) platforms to assist with multiple
functions (Hut, 2019). However, other reports indicate that planners will also need to
address a complex set of potential barriers to HIT implementation and use in order to
implement technology-oriented solutions within their organizations (Christodoulakis et al., 2020).
Artificial intelligence (AI) and related technologies are increasingly common in business
and are beginning to be applied to healthcare. While these technologies have the ability to
transform many aspects of patient care, as well as administrative processes within the
provider and payer organizations, researcher recommendations conflict regarding the
extent of benefits versus risks offered and posed by using AI in such settings
(Christodoulakis et al., 2020; Davenport & Kalakota, 2019). There are already several
research studies suggesting that AI can perform as accurately as or more accurately than
humans when it comes to essential healthcare tasks, such as diagnosing diseases
(Davenport & Kalakota, 2019).
Nonetheless, Christodoulakis et al. (2020) highlight the challenges
and risks introduced alongside such benefits. Therefore, a need exists for practitioners
and researchers in healthcare to understand more comprehensively the advantages as well as
the barriers and challenges associated with AI implementation in clinical
settings, especially as such factors relate to organizational financial solvency, so that AI
technologies can be implemented and used in the most appropriate and beneficial ways.
Another contextual factor complicating the consideration and implementation of AI
systems and their associated challenges and benefits is the current changing healthcare
regulation environment (Forcier et al., 2019; Gerke et al., 2020). Changing healthcare
regulations and evolving revenue cycle management lead to immense transformation in
the healthcare industry. Along with staying current on updates to the Affordable Care Act
(ACA), Medicaid, and other healthcare programs, healthcare providers need efficient
billing and tracking procedures in place. Staying up to date with changing regulations is
one area in which AI can assist organizations. Utilizing AI in the revenue cycle will assist
with some of the essential aspects, such as:
1. Billing and Collections Mistakes – If healthcare establishments do not
have an effective billing process, they risk losing money. With
complicated insurance plans becoming more common, a billing and
collections department’s need to continuously review payor receipts is
paramount.
2. Untrained Staff – Inaccurate data can cause billing issues in various ways,
such as improper medical coding, billing, and insurance claim filing.
These errors can add up to a significant amount of bad debt per year.
Unpaid bills can easily get lost in the shuffle, even after only 60 to 90 days
(DECO, 2019).
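The billing and collections failure modes listed above are the kind of routine checks that automation can take over before a claim is ever submitted. As a purely illustrative sketch (the field names, codes, and the 90-day window are hypothetical examples, not rules drawn from this dissertation), a pre-submission validation pass might look like:

```python
# Illustrative pre-submission claim checks. All field names and
# thresholds are hypothetical examples, not the dissertation's framework.

def validate_claim(claim: dict) -> list[str]:
    """Return a list of problems found in a single claim record."""
    problems = []
    # Missing or malformed medical codes are a common source of denials.
    if not claim.get("cpt_code"):
        problems.append("missing CPT code")
    if not claim.get("diagnosis_code"):
        problems.append("missing diagnosis code")
    # Claims unfiled past a payer's timely-filing window risk write-off.
    if claim.get("days_since_service", 0) > 90:
        problems.append("exceeds 90-day timely-filing window")
    # A non-positive billed amount usually indicates a data-entry error.
    if claim.get("billed_amount", 0) <= 0:
        problems.append("non-positive billed amount")
    return problems

flags = validate_claim({"cpt_code": "99213", "days_since_service": 120,
                        "billed_amount": 145.00})
```

In practice such rules would be payer-specific and maintained alongside the billing system; the point is that each rule codifies a mistake that the untrained-staff scenario above otherwise leaves to chance.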
Based on the need to better understand the risks, challenges, and benefits of AI
systems in healthcare (Shaw et al., 2019), this research was able
to describe both the potential that AI offers to automate aspects of the revenue cycle and
some of the barriers to the rapid implementation of AI in healthcare. This research also
helped fill a gap in understanding by offering a concise theoretical framework for healthcare
executives to use during implementation.
contributed to building a framework that administrators may use to leverage the benefits
of AI while minimizing the risks to improve organizational operability, productivity, and
financial solvency more successfully and appropriately.
Problem Statement
The problem addressed by this study was a lack of understanding regarding the
specific risks and benefits associated with AI implementation in healthcare settings.
Many administrative tasks in healthcare are currently completed manually, which incurs
high labor costs and increases the potential for human error. However, it is
unknown to what extent AI may improve these administrative tasks and address these
challenges (CAQH, 2018). To better understand this, the research analyzed the
issues affecting healthcare industry revenue cycles. Despite some automation of claim
submission and other transactions, many administrative transactions are still primarily
driven by inefficient manual processes (CAQH, 2017). According to the 2017 CAQH
Index, an annual report on the adoption of electronic business transactions, the lack of
automation for these transactions costs the healthcare industry more than $11 billion per
year. In order to process a patient claim, the patient financial services department is
required to employ experts with advanced healthcare knowledge. Experienced
professionals are necessary for auditing the claims. The current manual claims auditing
methods involve extensive human efforts, time, and money and often result in claims
denial. One of the obvious solutions is to adopt automation, which, despite advantages, is
accompanied by many uncertainties and consideration of countless variables. Thus, this
dissertation analyzed the issues affecting the revenue cycle within the healthcare industry
to better understand the financial risks and benefits associated with AI implementation in
the healthcare setting, and constructed a theoretical framework weighing the financial
benefits of artificial intelligence (AI) against the risks of utilizing
this technology. The main goal of the study was to estimate the outcome of implementing
AI in the revenue cycle.
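One way AI could reduce the manual auditing burden described above is to score incoming claims by denial risk, so that experienced auditors review only the riskiest ones first. A minimal sketch, using historical denial frequencies as a stand-in for a trained model (the payer names, procedure codes, and outcomes are invented for illustration):

```python
from collections import defaultdict

# Toy historical outcomes: (payer, procedure_code, was_denied).
# All values below are invented for illustration only.
history = [
    ("PayerA", "99213", False), ("PayerA", "99213", False),
    ("PayerA", "71045", True),  ("PayerA", "71045", True),
    ("PayerB", "99213", True),  ("PayerB", "99213", False),
]

def build_denial_model(records):
    """Estimate denial probability per (payer, procedure) pair."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [denied, total]
    for payer, code, denied in records:
        counts[(payer, code)][0] += int(denied)
        counts[(payer, code)][1] += 1
    return {pair: d / t for pair, (d, t) in counts.items()}

def denial_risk(model, payer, code, default=0.5):
    """Score a new claim; unseen pairs fall back to a neutral prior."""
    return model.get((payer, code), default)

model = build_denial_model(history)
# High-risk claims get routed to a human auditor before submission.
risky = denial_risk(model, "PayerA", "71045")
```

A production system would replace the frequency table with a model trained on many claim attributes, but the triage idea — spend expert time where the predicted denial risk is highest — is the same.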
Furthermore, this study first examined theoretical trends in healthcare revenue
cycle processes by researching literature related to the topic. Existing literature was
analyzed to identify and address current gaps in understanding. By doing so, a broader set
of observations was generated that was applied to the report’s follow-up section of
designing a lean process theoretical framework for the use of AI within the healthcare
revenue cycle process.
Problem Broader Context
Revenue cycle management in modern health systems can be viewed in three
ways (Becker & Ellison, 2019). First, the processes represent critical areas of fiscal
management and administrative oversight. In brief, a health systems approach to revenue
generation requires a systemized and efficient model. The literature defines an efficient
model as combining the separate billing, collections, reimbursement, and accounting
activities within the same framework (Becker & Ellison, 2019). Second, revenue cycle
management ensures a health system’s effective ability to operate in immediate and future
terms. The revenue cycle must combine billing, collections, reimbursement, and
accounting in the immediate present and the future. Third, for the system to continue to
be efficient, it must anticipate how these domains will change in the future.
Administrative and medical employees spend almost half of their time addressing
revenue-oriented issues (Hillman, 2020). The same author additionally noted that
healthcare systems spend approximately $266 billion annually on revenue cycle
management operations. These same costs can also be compounded as systems seek to
reconcile problems generated through human error. Current regulations allow healthcare
entities to lower administrative costs and increase the rate of collections. However,
applying AI and ML should theoretically increase revenue by reducing the number of
timely filing errors and reducing the financial services team’s administrative burden.
Considering these issues, this dissertation addressed the problem associated with
developing a framework for more successfully implementing AI and ML into healthcare
organizations’ revenue cycle. The literature on this topic was evaluated by examining key
performance indicators (KPIs) that provide insight on reimbursements, denials, the price
per accession, price per unit, paid units, throughput, and write-offs (XIFIN, 2020).
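Several of the KPIs named above can be computed directly from claims data. A hedged sketch, with a hypothetical claim-record layout (the fields are illustrative, not taken from XIFIN):

```python
# Sketch of two revenue-cycle KPIs mentioned above: denial rate and
# average time to payment. The record layout is a hypothetical example.

claims = [
    {"billed": 200.0, "paid": 180.0, "denied": False, "days_to_pay": 30},
    {"billed": 150.0, "paid": 0.0,   "denied": True,  "days_to_pay": None},
    {"billed": 400.0, "paid": 400.0, "denied": False, "days_to_pay": 45},
]

def denial_rate(claims):
    """Share of submitted claims denied by the payer."""
    return sum(c["denied"] for c in claims) / len(claims)

def avg_days_to_payment(claims):
    """Average time to payment over claims that were actually paid."""
    paid = [c["days_to_pay"] for c in claims if c["days_to_pay"] is not None]
    return sum(paid) / len(paid)
```

Tracking such metrics before and after an AI/RPA rollout is one concrete way an organization could evaluate the kind of framework this dissertation proposes.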
Justification of the Study
The literature clearly demonstrates a role for AI and ML in the context of the
revenue cycle. For example, Blass and Porr (2019) argued that AI and ML could decrease
the risk of error within compliance and risk management, ultimately streamlining the
revenue cycle. However, this research was general and did not provide a specific
framework for integrating AI and ML into a system. Instead, the research stated that it
could be helpful. This trend is present throughout the research on this topic.
Accordingly, there is a significant gap in the literature concerning helping organizations
develop the appropriate frameworks and protocols to integrate AI and ML into their
revenue cycle systems successfully, thereby justifying the need for this study and its
development of a framework to follow. Current literature findings are not helpful for healthcare system
administrators who seek to integrate technology-based solutions within their existing
fiscal cycle management operations. According to Hamet and Tremblay (2017), to
incorporate AI into the revenue cycle, it is first necessary to identify the barriers to
implementation and then develop a framework to implement that addresses those barriers.
Hence, this dissertation aided in clarifying how AI and ML can provide tangible solutions
for healthcare systems by utilizing the theoretical framework. Understanding the
development of such a tangible solution requires research that presents solutions that can
universally apply to diverse healthcare operations.
Dissertation Goals
The dissertation’s primary research goals that were addressed and detailed in
future chapters are summarized as follows:
1) To expand on the current literature surrounding the use of AI in the health care
revenue cycle and provide a framework to allow health care executives to quickly
visualize the benefits or drawbacks of such a technology in their specific
healthcare revenue cycle departments.
2) To create a framework that may be applied to a healthcare organization in an
effort to migrate from their current revenue management technique to one that
includes the use of AI/ML/RPA as a means of future cost control and revenue
boost.
Research Questions
This dissertation explored an increasingly critical issue affecting healthcare
organizations related to the use of AI software systems as a means to improve financial
operability and solvency. This study used a mixed-methods approach involving a meta-
analysis of the literature and semi-structured interviews to inform the following research
questions:
R1. What prospective benefits can be generated by using AI revenue cycle
applications for healthcare organizations?
R2. What are the risk factors associated with AI implementation in healthcare?
R3. What outcomes are derived by using a Lean Six Sigma (LSS) designed
framework for healthcare executives deciding to implement AI/RPA in the
healthcare revenue cycle?
Relevance and Significance
The research questions posed in this study have high significance for the
field of healthcare. Discussions regarding the need for better fiscal management have
grown as the healthcare industry has matured. Before the 1950s, hospitals were mainly
non-profit, and financing was handled largely through charitable campaigns (Cleverley &
Cleverley, 2018). When Medicare financing of many services delivered by hospitals
caused a significant growth in hospital revenues, this opened the door for a heightened
interest in healthcare accounting and finances. Hospitals started making the shift from
charities to big business. Both cost accounting and management control became essential
tools for managing finances in hospitals.
The most recent seismic shock to the system came in the 1980s when the federal
government started feeling pressure from hospital billings that seemed to be spiraling out
of control (Cleverly & Cleverley, 2018). At this point, the push began to have more
patients treated on an outpatient basis to control costs. With this, the federal government
created the Prospective Payment System, which created an opportunity for the creation of
other types of medical providers other than hospitals, such as ambulatory surgery centers
and other providers.
With more recent developments, such as the passage of the Patient Protection and
Affordable Care Act [ACA] (2010), healthcare providers have been put under increasing
pressure to find ways to achieve the “triple aim” of healthcare. The triple aim calls on
healthcare organizations to (1) improve patient care experiences, (2) improve the health
of populations, and (3) reduce the cost of healthcare per capita (J. Evans, 2017). The
latter component of the triple aim, the thrust to reduce healthcare costs, is at the heart of
financial management. Healthcare organizations must be run professionally and
efficiently to be able to deliver high-quality healthcare for diminishing payments. This
has required those in the healthcare industry to seriously rethink their business structures
and find ways within those structures to maximize the payments that they already receive
so that they can benefit the organization to the most significant degree possible.
That is why the consideration of using AI to improve fiscal management is so
relevant and significant for the healthcare industry today. Successful financial
management of modern healthcare organizations, which are becoming increasingly
complex, requires timely, relevant information to make better business decisions
(Cleverley & Cleverley, 2018). Because the existing systems are still overly dependent on
humans to do the processing, they are inefficient. This leads to delays in the
reimbursement for services delivered and delays in delivering an up-to-date look at the
healthcare organization’s financial situation. As a result, healthcare executives often find
themselves in a position to make critical business decisions based on information that is
out of date and often of questionable accuracy. If the use of AI can improve that situation,
then healthcare managers could move to a position where they have information that is
timely and accurate, enabling them to make better business decisions that will enable
them to improve the profitability and feasibility of the services provided to the public.
Barriers and Issues
Several barriers and issues were encountered in conducting this type of research. The field
of medicine has been primarily dominated by research that follows the scientific method.
As a result, literature reviews that are conducted regarding many healthcare topics
include discussions about levels of evidence used to support the study’s assertions,
foundation, and findings. As Fineout-Overholt and Melnyk (2015) outline, levels of
evidence can be categorized in seven levels, with systematic reviews of randomized
controlled trials (RCTs) being the “best” evidence, or categorized as level I, and evidence
from opinions expressed by either authorities or expert committees as being the “lowest”
form of evidence, which is categorized as level VII. The distinction between levels of
evidence is important because the level of evidence that is used to support an argument or
assertion is often used as a basis for determining if the research applies to healthcare
decision-making.
The Agency for Healthcare Research and Quality (AHRQ; as cited in Fineout-Overholt
& Melnyk, 2015) defines levels of evidence by three criteria: quality, quantity, and
consistency. In this context, quality speaks to how the study was designed and if
approaches were used that ensured that the findings were accurately measured and that
measurement, selection, and confounding biases were avoided. This is in part why
systematic analyses of RCTs are generally considered the highest level of evidence.
Within the AHRQ definition, quantity refers to the number of studies, the participants
involved, the magnitude of the treatment effect, and the strength of causality assessments
on the outcomes, such as odds ratios or relative risk. Consistency refers to whether or not
multiple researchers are reporting similar findings using the same basic study criteria.
High-level evidence has a lower risk of bias in addition to greater generalizability. The
latter refers to whether the findings can be generalized to a more significant population
(Fineout-Overholt & Melnyk, 2015).
As Frączek (2016) discusses, the financial field has started to put more emphasis
on using evidence-informed practices (EIP), which are analogous to evidence-based
practices (EBP), which are used heavily in healthcare delivery practices. Applying EIP to
financial questions permits the practitioner to analyze the information that they are
receiving against the levels of evidence to determine the strength of the recommendations
and the applicability of the information to a wide range of financial situations. It is
precisely in this context where performing studies regarding healthcare finance becomes
somewhat difficult. The majority of the literature considered in this study's literature
review constitutes level VII evidence, that is, opinions expressed by either authorities or
expert committees. As such, it is difficult to assign weight to such sources, given that
they are primarily informed by the opinions of experts in the field. These opinions are not
necessarily backed by any evidence that would be considered empirical, at least not from
a scientific standpoint.
The fact is, due to the newness of AI, ML, and RPA, the key concepts under
discussion in this study, there is a lack of actual research studies of any type applying
these topics to the field of healthcare finance. A cursory look at Google Scholar with the
search terms +”artificial intelligence” +finance AND +”randomized controlled trial”
revealed zero studies regarding the combined topics in 5 pages of searches (the top 50
results). This search range was delimited to the past ten years (only articles since 2012).
As expected, removing the time limitation did not reveal any new articles on the topics.
Assumptions, Limitations, and Delimitations
Assumptions
This study assumed that the findings presented in the literature are accurate and a
true reflection of the current state of affairs regarding using AI, ML, and RPA in
healthcare finance situations. This has to be an assumption because the “evidence”
presented and reviewed in the literature review is Level VII evidence. There are no
empirical means to identify the presence of biases or the accuracy of statements in the
articles. It is also assumed that the information collected during the semi-structured
interviews with select subject matter experts accurately represents the current state of
affairs in the healthcare industry, similar to the assumptions made regarding articles in
the literature review.
Limitations
The design of this mixed methods research study presents certain limitations. For
example, the selection of participants for the semi-structured interviews is a non-
randomized convenience sample, which may reflect circumstances or attitudes specific
to certain healthcare organizations or to a particular region of the country. Due to this
limitation, the findings may or may not be
generalizable to the population of healthcare finance professionals in the United States,
let alone the approaches used in other countries that use an entirely different approach to
healthcare financing and funding.
Delimitations
Certain delimitations have been selected that may also impact the generalizability
of this study. In order to keep the study manageable, an arbitrary number of 10
participants was selected for the semi-structured interviews. As Creswell and Creswell
(2018) noted, a phenomenology study generally involves a range of 3-10 participants, so
the number selected for this component of the study is not inappropriately small. Another
aspect that could impact the study is historical contamination. Unfortunately, this study
was conducted during the coronavirus pandemic. While participants’ responses in this
study are expected to be as accurate as possible, there is a possibility that internal validity
could be compromised due to the impact of the coronavirus and related financial strains
that would not be present during other periods when a pandemic is not in process.
Definition of Terms
Analytical-oriented approaches – Analytical-oriented approaches utilize the
ability of a machine to perform sentiment analysis at the document, sentence, and aspect
levels. Through such approaches, insights that would ordinarily not
be extracted are identified and converted into decisions that can be acted upon (Gandomi
& Haider, 2015).
Artificial intelligence (AI) – Artificial intelligence is defined as a theory and
creation of computerized systems designed to perform actions that typically would be
done using human intelligence and senses such as hearing, vision, language translation,
and decision-making (McGrow, 2019).
Data mining – is a process that utilizes algorithms to comb through large data sets
(big data) to extract usable activity patterns or outcomes (Bautista et al., 2016).
Healthcare information technology (HIT) – is a blanket term used to delineate the
diverse systems, programs, and mechanisms of technology that collect, store, process,
and manipulate the information contained within them for various healthcare-related
purposes (Wager et al., 2017).
Lean Six Sigma (LSS) – This is a fact-based, data-driven philosophy of
improvement that values defect prevention over defect detection. It drives customer
satisfaction and bottom-line results by reducing variation, waste, and cycle time while
promoting the use of work standardization and flow, thereby creating a competitive
advantage (ASQ, 2020).
Machine learning (ML) – This is the process of a computerized system advancing
“knowledge” of a selected phenomenon through testing and adaptation, using observed
patterns and trends to improve decision-making capabilities (McGrow, 2019).
Predictive modeling – This occurs when an analysis of past patterns of activity
can be used to accurately predict future events, such as analyzing past payments based on
a particular CPT code to predict when a current claim will be paid (Nilsson, 2019).
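To illustrate the definition above (all figures and CPT codes here are hypothetical), a minimal sketch of such a prediction might estimate payment lag from the historical average for a given CPT code:

```python
# Minimal sketch of predictive modeling for payment timing.
# All data are invented for illustration only.
from statistics import mean
from collections import defaultdict

# Hypothetical historical observations: (CPT code, days until the claim was paid)
history = [("99213", 28), ("99213", 32), ("99213", 30), ("70450", 45), ("70450", 41)]

by_code = defaultdict(list)
for cpt, days in history:
    by_code[cpt].append(days)

def predict_days_to_payment(cpt_code):
    """Naive predictive model: mean of past payment lags for this CPT code."""
    return mean(by_code[cpt_code])

print(predict_days_to_payment("99213"))  # → 30
```

A production system would of course use far richer features (payer, claim amount, submission date), but the principle of predicting future payment behavior from past patterns is the same.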
Revenue cycle management – This refers to the process of streamlining and
optimizing processes throughout the revenue cycle to achieve the best possible cash flow
outcome for the organization (LaPointe, 2020).
Robotic process automation (RPA) – This is a process whereby tasks previously
engaged in by humans are automated to be performed by computers. In the context of this
study, an analog would be a case where a human used to collect information from a
variety of inputs such as email, spreadsheets, and other sources, interpret and collate the
data, then transfer it to a business system like an enterprise resource planning (ERP) or
customer relationship management (CRM) system (Lacity & Willcocks, 2016).
Summary
As this review has considered, healthcare organizations deal with tremendous
amounts of information that must be processed and handled. Current approaches to
financial management are primarily manual, and this requires a significant investment in
human resources at a considerable cost. Research suggests that many of the manual
processes that are currently being used in financial management could be replaced by a
combination of AI, ML, and RPA. With a switch to these technologies, the speed of
submitting claims, the accuracy of those claims, and predictions of when claims will be
paid can improve substantially. This benefits healthcare organizations because the faster
claims are submitted and paid, the less strain this exerts on cash flow demands.
The following section will consider the current state of knowledge in the areas of
AI, ML, and RPA. These topics will be considered with a particular interest in how they
are currently utilized in connection with revenue cycle management. The following
literature review will also discuss gaps in the literature and areas where more information
is needed.
Chapter 2
Review of the Literature
Literature Review
The themes explored in academic and healthcare industry journals center on
technology and its applications, the benefits delivered through analytical-oriented
approaches to revenue cycle management, and the barriers to these same
innovations. A final set of discussions entailed assessments of likely risk variables and
viable risk management approaches to address these challenges. Analyses that explored
background themes related to the dissertation’s topic focused on three areas of discussion:
the concept of artificial intelligence (AI) and machine learning (ML), process definitions
of revenue cycle management, and the broader assessment of ML’s potential for
managing these same processes.
A large body of research in the areas of AI and ML specific to healthcare
finance revolves around the processing of claim requests and payments from third-party
payers. The research has indicated that a significant amount of money is lost due to the
complexity of claims and inaccurate claim completion. When a claim is inaccurately
completed, it must be returned to the filing institution to be rectified. This adds rework
time and extends the time
between claim submission and payment, which negatively reflects on the organization’s
financial health. Numerous studies have used novel approaches combining AI and ML to
automatically detect such errors and annotate them with reasons why they are being
flagged. Some of these systems boast a 25% improvement over any current claim
analysis software or methods.
This literature review identified several specific aspects of machine learning and
artificial intelligence related to the healthcare revenue cycle. Importantly, the revenue
cycle and its associated processes often involve very repetitive tasks performed by
humans, and many of these tasks would benefit from machine learning or artificial
intelligence to automate them. In implementing these strategies, healthcare
organizations could likely reduce costs and improve the accuracy of payments and
other similar factors, thus increasing revenue from existing claims by reducing denials.
Justification for Inclusion and Exclusion
Inclusion
This literature review pursued studies that covered issues related to AI, ML, and
RPA, articles regarding process definitions for revenue cycle management, and articles
applying AI, ML, and RPA to financial processes in healthcare. Articles that focused on
AI, ML, and RPA, especially how they applied to the healthcare field, were selected.
Articles published within the past ten years, available in full text, and written in
English were considered for inclusion. The full text was
required because, while abstracts provide a general overview, they do not provide details
that were needed for this report. Articles featuring expert opinion were included because
of a lack of research in this field, although research studies were preferred.
Exclusion
Studies were excluded from this literature review if they were published more
than ten years ago. Studies that were not published in English were not considered. The
reason for these exclusions is that studies older than ten years would likely not reflect
current practice or thinking about the use of technology in financial issues.
Previous Work and Strengths and Weaknesses
As noted in the introduction to this study, most of the articles regarding AI, ML,
and RPA were based on an accumulation of research (secondary research) and expert
opinion (primarily interviews). For example, several studies were explorations of the
knowledge about AI and ML, with an attempt to explain how these could be applied to
various aspects of healthcare, but mainly focusing on the clinical side of things (Clancy,
2020; Davenport & Kalakota, 2019; McGrow, 2019; Shaw et al., 2019). Some articles
researched the application of AI and ML from entirely different applications and
industries, such as using them for automation in the supply chain (Dash et al., 2019), for
making general business decisions (J. R. Evans, 2015), generic applications of AI and
ML (Kühl et al., 2019), order processing in the telecommunications industry (Lacity &
Willcocks, 2016), the manufacturing and construction industry (Lee et al., 2019), and
financial management in the hospitality industry (Millauer & Vellekoop, 2019).
The literature review contained many articles from Healthcare Financial
Management (HFM), a respected peer-reviewed journal. Unfortunately, many of the
articles were interview pieces that relied upon experts in the field recounting different
ways that they were already using, or planning soon to use, AI and ML in their financial
operations (Baxter et al., 2019; Hegwer, 2018; Hut, 2019). Other HFM articles were secondary
research articles, using other research to quantify the use and intents of AI and ML in the
healthcare finance industry (Hillman, 2020; Navigant Consulting, 2019; Nilsson, 2019;
Schouten, 2013).
The use of secondary research was not limited to HFM. Several other articles
from peer-reviewed journals were mainly, if not entirely, secondary research, compiling
information about AI and ML from other sources (Blass & Porr, 2019; Cheatham et al.,
2019; Christodoulakis et al., 2020). In an effort to create a sufficient research foundation
to work from, some non-peer-reviewed sources, including interviews and quotes from
industry professionals, were included in the literature review (Becker & Ellison, 2019;
LaPointe, 2020).
There were a few research papers that looked to apply AI and ML to specific
healthcare financial issues. One paper resulted from the authors analyzing a healthcare
financial situation, using attributional tools to predict future discrepancies to reduce
billing rejections, then testing them on a group of claims to evaluate whether the method
would be successful (Wojtusiak et al., 2011). This study tested only a small group of
claims, which raises questions about the generalizability of the research to other
real-world situations with far greater claim diversity. Several research studies addressed
using an AI/ML approach to identifying and rectifying medical claim errors as a
component of risk prevention (Chimmad et al., 2017; Kim et al., 2020; Wojtusiak et al.,
2011). A few research studies focused on ways to use AI and ML to promote deep
learning in several areas of medicine, including finance (Kumar et al., 2010; Rajkomar et
al., 2018; Wojtusiak, 2014). Other studies focused on using various techniques associated
with AI and ML to “scrub” medical claims or improve medical claims prediction
(Abdullah et al., 2009; Che & Janusz, 2013).
The lack of high-quality research studies in this area presents a challenge: it is
difficult to make a compelling case for or against a particular AI, ML, or RPA
practice absent quantifiable evidence to support it. While several research
studies were found, almost all were oriented toward creating and testing means to
improve aspects of finance that have proven difficult, such as claim denial by third-
party payers. No studies were identified that documented specific performance
improvements resulting from applying AI principles. Therefore, there is no empirical
foundation for quantifying the benefits of AI, ML, and RPA in the healthcare industry
other than “expert” reports and secondary research.
Gaps in the Literature
The most glaring gap identified in this literature review is the lack of rigorous
research studies in this area. While several authors created algorithms and approaches
to common problems experienced by healthcare finance professionals, backing up their
effectiveness through scientific testing, the broader picture appears not to be addressed
in the literature. It would be most helpful if one of the many organizations that have put
AI/ML/RPA into practice would perform a retrospective review quantifying the
differences between this approach and the previous state of affairs.
Analysis of Research Methods Used
Several research methods were used in the articles reviewed. Some were
“expert opinion” pieces focused on interviews and reports from healthcare finance
professionals (Becker & Ellison, 2019; LaPointe, 2020). The large majority of the other
articles were secondary research, with data accumulated mainly through reviews of the
current literature, although not systematic ones (Blass & Porr, 2019; Cheatham et al.,
2019; Christodoulakis et al., 2020).
The proper “research studies” in this literature review used many approaches to
generate their findings. For example, in the article on deep learning for medical
predictions, Rajkomar et al. (2018) used predictive modeling. They reported the accuracy
of such predictions using the area under the receiver operating characteristic curve
(AUROC) across sites. In the study by Kim et al. (2020), the authors studied the accuracy of a new Deep
Claim system to identify potential payment rejections and found that using the new
system resulted in a 22.21% relative recall gain at 95% precision. Wojtusiak et al. (2011)
was the only study that measured the performance of its model for rule-based
prediction of medical claims payment in a before-and-after fashion, providing actual
numbers on the increase in effectiveness of the new approach over previous
performance.
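As a point of reference for the metrics discussed above, the AUROC reported by Rajkomar et al. (2018) can be computed with the rank-based (Mann-Whitney) identity; the sketch below uses toy data and is not drawn from any of the reviewed studies:

```python
# Computing AUROC via the Mann-Whitney identity: the probability that a
# randomly chosen positive example is scored higher than a randomly chosen
# negative example (ties count as half). Toy data for illustration only.
def auroc(labels, scores):
    """labels: 1 = positive (e.g., claim denied), 0 = negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical denial predictions: higher score = model expects denial
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.8, 0.4]
print(auroc(labels, scores))  # → 1.0 (every positive outranks every negative)
```

An AUROC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is why the cross-site AUROC values in such studies are read as a measure of discriminative accuracy.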
Concept of Artificial Intelligence, Machine Learning, and Robotic Process
Automation
Kühl et al.’s (2019) analysis provided an in-depth discussion of both AI and ML.
Their work specifically noted that while AI can be defined as an overarching conceptual
category that references a diverse set of computer intelligence-driven technologies,
machine learning represents a particular application. The authors noted that machine
learning could be understood as a program’s ability to perform routine tasks, become
increasingly proficient in completing these same tasks, and utilize and apply known
information towards advanced forms of problem-solving. Kühl et al. (2019) also
contended that optimal approaches to machine learning involve base-level operations in
which programs perform repetitive tasks that gradually increase in their complexity.
An anonymous report from the publication Healthcare Financial Management
noted that the revenue cycle management process in healthcare operations often provides
unique opportunities for machine learning applications because its tasks exhibit these
same traits (Baxter et al., 2019). In the same publication, a follow-up report also noted that
healthcare systems increasingly rely on automated and analytics-driven revenue cycle
management approaches even as they outsource these processes to third-party specialist
firms (Navigant Consulting, 2019). Dash et al. (2019) demonstrated how increasing
complexity is helpful in the context of supply chain management. Much of their analysis
can be applied to the context of an automatic revenue cycle, specifically, to help provide
a framework for how artificial intelligence can adapt to increasingly complex tasks.
Robotic process automation (RPA) is an industrial response to the vast amount of manual
work that individuals perform daily, weekly, or monthly to support a broad array of high-
volume business processing (Lacity & Willcocks, 2016).
RPA is mainly associated with the task level. The application areas include
finance and accounting, IT infrastructure maintenance, and front-office processing. The
so-called robots are software programs that interact with enterprise resource planning and
customer relationship management systems. The robots can gather data from systems and
update them by imitating manual screen-based manipulations. RPA solutions are
appealing from a business perspective because they automate repetitive tasks while
remaining minimally invasive to the overall processes they support.
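A minimal sketch can make this task-level character concrete. All system names, fields, and data below are hypothetical, and the ERP call is a stand-in rather than a real API:

```python
# Illustrative RPA-style robot: collect rows from a spreadsheet export and
# push them into a business system, replacing manual re-keying.
# Everything here is invented for illustration.
import csv
import io

SPREADSHEET = """claim_id,amount,payer
C-1001,250.00,Acme Health
C-1002,480.50,Beta Mutual
"""

def push_to_erp(record):
    """Stand-in for an ERP/CRM API call; a real robot would post here."""
    return f"posted {record['claim_id']} for ${record['amount']}"

results = [push_to_erp(row) for row in csv.DictReader(io.StringIO(SPREADSHEET))]
print(results[0])  # → posted C-1001 for $250.00
```

The business appeal described above follows from this shape: the robot sits on top of existing systems, imitating the manual data-entry steps, so no changes to the underlying ERP or CRM are required.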
Process Definitions of Revenue Cycle Management
Literature analyses describe the current strategies of the healthcare sector and
other industries for implementing automated revenue cycle management approaches.
Millauer and Vellekoop’s (2019) healthcare industry discussion noted that firms
frequently utilize these approaches for several main reasons. These models streamline the repetitive nature of
fiscal cycle operations by applying machine learning models and algorithms to these
tasks. This same approach additionally serves to mitigate the risks stemming from human
error within these processes. McGrow (2019) highlighted the importance of removing
human error from processes when it is possible to do so. However, currently, these
processes are still being completed by humans because there is not currently a
sufficiently sophisticated machine learning system to replace the human element with an
automated system completely. Analysis conducted by Becker and Ellison (2019) noted
that the current healthcare industry applications of these same models include a
multilayered set of strategies. These entail using machine learning-based models to
structure routine billing operations efficiently, complete complex coding tasks, and
generate predictive data that can be used for risk assessment and management purposes.
Blass and Porr (2019) similarly noted that automated approaches to revenue management
typically include the ability of programs and their applied algorithms to gradually identify
patterns associated with payers and contracting groups (Blass & Porr, 2019). Over time, these
applications can detect risk variables that might indicate the client’s inability to deliver
payment on time (Blass & Porr, 2019).
Evans cited automated forms of revenue cycle management as a valuable
instrument in helping firms in diverse industries achieve higher efficiency and
optimization levels in their internal areas of financial operations (J. R. Evans, 2015).
Similarly, Davenport and Kalakota (2019) identified revenue management as one
specific benefit derived from AI and ML applications across health systems.
Based on all of these studies, if machine learning were implemented correctly in
the future, it would be possible to replace most, if not all, of the billing processes with
machines rather than humans.
Potential of Machine Learning for these Processes
Discussions of the benefits generated through ML-driven revenue cycle
management processes in healthcare include assessments of both current and predicted
benefits. Simultaneously, these collective assessments emphasize technology’s role as a
driver in achieving current and future benefits. Hut’s discussions noted that
current generation AI and ML platforms could process and structure complex and
recurrent tasks within health systems (Hut, 2019). Revenue management represents one
specific example in this same context. Hegwer similarly noted that current ML
applications help firms achieve excellence in fiscal cycle management processes. The
author provided several discussions of cases of large systems that applied these
technologies and yielded notable improvements in their ability to process patient data,
predict reimbursement patterns, and predictively assess the likelihood of nonpayment
among specific groups or clients (Hegwer, 2018).
Nilsson (2019) also noted that the same applications can predict payer behaviors
and indicate the times in which they will likely remit payment and if they are at risk for
nonpayment or default. Schouten (2013) contributed to these same discussions by
assessing machine learning platforms’ capabilities to examine recurrent payment loss
patterns by investigating multiple channels of revenue and reimbursement. While the
author’s discussions referenced the technologies currently utilized by health systems,
his analysis also identified these technologies’ ability to complete increasingly complex
and predictive assessments. In implementing these more complex processes, errors
would likely continue to be reduced, and organizations would have a more efficient
billing process. Schouten’s commentary parallels Rosenfield’s discussions of future
machine learning applications and roles. The latter author noted that advanced machine
learning benefits could include these platforms’ ability to conduct complex operations
that subdivide payment systems according to billed procedures and specialized coding
(Schouten, 2013).
In both cases, the analyses cited ML-based algorithms’ ability to conduct
increasingly complex operations as they engage in many of the same procedures over the
longer term. On a more global scale, Lee et al. (2019) found that artificial intelligence has
significant potential for automation in many industries and automation of non-robotic
tasks. This is important in understanding how organizations can implement artificial
intelligence to automate revenue cycle management. While Nilsson’s (2019) analysis
identified the prospective benefits generated through machine learning applications, his
report also implicitly identified one of these systems’ critical risks. The author
specifically noted the necessity of cultivating a strategic plan to implement and
incorporate ML-based technologies in a healthcare firm’s operations.
This approach represents a vital aspect of technology management as it will better
ensure that an implemented strategy will achieve positive returns from a cost/benefit
perspective. The issue of cost represents another critical factor frequently identified by
related literature. Fundamentally, these applications represent a methodology for
achieving savings-based returns (Hillman, 2020). Accordingly, healthcare organizations
typically integrate and apply these innovations to avoid waste, identify redundant
expenses, and locate ways of optimizing fiscal cycle operations. However, achieving
these outcomes often requires an organization’s ability to mitigate short-term risks that
accompany implementation strategies. Accordingly, the costs associated with purchasing
technology and integrating it into existing networks can present firms with a combined
set of fiscal and technical challenges that they will have to address as they develop
change management plans. For example, Clancy (2020) made a point of describing the
importance of organizations using artificial intelligence to automate certain aspects of the
revenue cycle, particularly those aspects that are repetitive and are an inappropriate use of
human resources.
The risks encountered during these initial stages can additionally affect
organizations in the longer term in cases where firms do not explicitly identify the
specific functions that integrated ML platforms will achieve in the context of a firm’s
fiscal cycle management processes. For example, issues related to a platform’s immediate
use, its prospective future term value, and the role that human agents will have in
monitoring the applications represent core issues that decision-makers need to address
during planning sessions (LaPointe, 2020). Similarly, analysts identify the need for
carefully value-mapping a proposed model before its implementation: a methodology
that can evaluate the specific departments and stakeholders that will benefit from the
applications. In cases where departments or individual employees exhibit reluctance to
accept the proposal, the same strategies can be used to identify the role these
stakeholders will play in managing the applications (Christodoulakis et al., 2020).
A final set of recommendations includes the need for cultivating a set of
measurable objectives that clearly define the role, purpose, and strategy of the integrated
systems across the technology’s prospective lifecycle. Cheatham et al. (2019) conducted
an in-depth analysis explaining some of the risks associated with artificial intelligence.
Organizations, including hospitals, that implement artificial
intelligence, must also use specific protocols to mitigate its risks. Specifically, they must
have a clear structure that pinpoints the specific risks associated with AI. The structures
must also have institution-wide controls rather than limited controls.
Lastly, risk analysis must be nuanced according to the nature of each risk.
This is important because organizations must understand the risks and how to mitigate
them before implementing new protocols when they plan to adopt artificial
intelligence.
Application of Machine Learning/Artificial Intelligence to Healthcare Issues
The fact that vast amounts of money are lost due to inaccurate claim processing is
well established in the literature (Kim et al., 2020; Wojtusiak et al., 2011). One of the
problems that frequently occur that causes claims to be rejected is the inclusion of an
incorrect ICD code for diagnosis or CPT code for diagnostic tests (Abdullah et al., 2009;
Chimmad et al., 2017). This is the reason that multiple researchers have looked for ways
to use AI and ML to automatically analyze massive bodies of medical claims to detect
and, in some cases, repair information that was incorrectly entered (Abdullah et al., 2009;
Chimmad et al., 2017; Kim et al., 2020; Zhong et al., 2019). Improving the information
on medical claims can reduce lag time between claim submission and payment, which is
a critical financial measurement indicating the financial health of a healthcare
organization (Cleverly & Cleverley, 2018).
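One frequent source of such rejections is a malformed diagnosis code. As a purely illustrative sketch (not taken from any of the cited studies), a pre-submission screen for diagnosis codes that fail a simplified ICD-10 shape could look like the following; the pattern is deliberately loose, and real validation would check against the official code set:

```python
import re

# Simplified ICD-10 shape: one letter, two digits, optional dotted
# alphanumeric extension. Real validation requires the official code set.
ICD10_SHAPE = re.compile(r"^[A-Z]\d{2}(\.[A-Z0-9]{1,4})?$")

def flag_bad_diagnosis_codes(claims):
    """Return the IDs of claims whose diagnosis code fails the shape check."""
    return [c["id"] for c in claims if not ICD10_SHAPE.match(c["dx"])]

claims = [
    {"id": 1, "dx": "E11.9"},    # well-formed
    {"id": 2, "dx": "11E.9"},    # transposed characters
    {"id": 3, "dx": "J45.909"},  # well-formed, longer extension
]
print(flag_bad_diagnosis_codes(claims))  # → [2]
```

A check of this kind catches only structural errors; the studies cited above go further by learning which structurally valid codes are nonetheless inconsistent with the rest of the claim.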
Kim et al. (2020) proposed a novel implementation of AI/ML that they call Deep
Claim. The Deep Claim approach uses a three-step process to improve the accuracy of
predicting the exact amount that third-party payers will pay. The first step is the
development of clinical contextual interrelations at the high level of claims, which uses
ML against raw claims data, avoiding the need for expert knowledge or extensive
preparation of the data before processing. The next stage is deploying Deep Claim in real
deployment scenarios. The third step is where Deep Claim flags questionable fields in the
claim based on what it learned in the ML process. This gives it high prediction
interpretability, along with data presented that explains why the fields were flagged so
they can be double-checked and rectified. The authors assert that this novel approach can identify 22.21% more denials than the best system currently in place.
In research by Wojtusiak et al. (2011), the researchers developed an ML
application that would permit the AI system to combine rules that were already known
for claim rejection and combine these with new rules that were detected by the AI
algorithms. The ability of the AI to generate new rules was particularly important because
healthcare is continually changing, and this architecture permits the system to adapt as
new changes in the healthcare system occur. The system effectively identified new errors
that had slipped through the system, with 60% of Medicaid, 50% of DRG 371, 55% of
DRG 372, and 44% of DRG 373 abnormalities detected. The false-positive rates were
relatively low, ranging from 5% to 30% for the same groups.
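The hybrid design described here (hand-authored rejection rules operating alongside rules mined from historical data) can be pictured with a toy sketch. This is an illustration under invented field names and a deliberately crude mining heuristic, not a reconstruction of Wojtusiak et al.’s actual system:

```python
# Toy illustration of combining hand-authored rejection rules with rules
# mined from historical claims. Field names ("payer", "dx", "denied") and
# the mining heuristic are invented; this is not the cited system.

def known_rule_missing_dx(claim):
    """Hand-authored rule: reject claims with no diagnosis code."""
    return "dx" not in claim or not claim["dx"]

def mine_rules(history):
    """Derive one learned rule: flag payer/diagnosis pairs that were
    always denied in the historical data."""
    denied = {(c["payer"], c["dx"]) for c in history if c["denied"]}
    paid = {(c["payer"], c["dx"]) for c in history if not c["denied"]}
    always_denied = denied - paid
    return [lambda claim, bad=always_denied: (claim["payer"], claim.get("dx")) in bad]

def audit(claim, rules):
    """A claim is flagged if any rule, known or mined, fires."""
    return any(rule(claim) for rule in rules)

history = [
    {"payer": "P1", "dx": "E11.9", "denied": True},
    {"payer": "P1", "dx": "E11.9", "denied": True},
    {"payer": "P2", "dx": "E11.9", "denied": False},
]
rules = [known_rule_missing_dx] + mine_rules(history)
print(audit({"payer": "P1", "dx": "E11.9"}, rules))  # → True (mined rule fires)
print(audit({"payer": "P2", "dx": "E11.9"}, rules))  # → False
```

As healthcare rules change, re-running the mining step over fresh history regenerates the learned portion while the hand-authored rules stay fixed, which is the adaptive property the study highlights.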
In the study by Kumar et al. (2010), the researchers used data mining, which
could then be used for an ML process to improve the prediction of claims that need
reworking. They noted that 30% of the administrative staff at health insurers are dedicated to reworking incorrect claims, a burden that could be reduced using AI. The researchers developed an ML-based detection method, then deployed that model at one of the nation’s largest health insurers. Because the new system was much more precise, it generated a substantial increase in hit rates when identifying faulty claims. The improved
accuracy provided by this novel application of AI and ML could potentially generate cost
savings of between $15-25 million for each standard insurer using the system.
Other researchers explored ways to label data or develop concept representations
from existing data sets using combinations of AI and ML (Bai et al., 2019; Che & Janusz,
2013; Lu et al., 2020; Zhong et al., 2019). Being able to generate rules and label data or
categorize it in a way that ML systems can easily interpret is a crucial stepping stone to
practically using such data to apply to numerous healthcare applications, such as financial
management (Che & Janusz, 2013; Wojtusiak, 2014).
Summary and Thematic Analysis
This literature review has considered numerous aspects of how the healthcare
finance industry has considered and implemented ML, AI, and RPA into their business
frameworks. The financial process, especially that of submitting claims, is complex and
regularly involves touching tens of thousands of documents. The potential for errors is
high, and as the research has indicated, some organizations have up to 30% of their staff
dedicated to “reworking” claims that were incorrectly submitted. With the constant
pressure on healthcare organizations to decrease costs, finding ways to use AI, ML, and
RPA to streamline finance department processes and make them more efficient is highly
attractive. Even more so, the potential cost savings, which are projected into the tens of
millions, are sufficient to get the attention of healthcare finance professionals. This
review has explained how AI, ML, and RPA can be applied to the healthcare finance
field. It has also demonstrated that systems currently in use generate considerable savings
for many healthcare organizations.
These findings are especially significant to healthcare finance professionals who
are under considerable pressure to find ways to reduce costs in their departments and
improve cash flow through efficient claims processing and payment turnaround. The
findings are also relevant to the healthcare finance field because the application of certain
aspects of the research has been demonstrated to generate considerable cost savings.
Healthcare finance departments are also responsible for maintaining the organization’s
financial health. The potential of AI and ML technologies to improve payment
turnarounds is highly relevant, as this is a key financial performance metric for all
hospitals.
The literature review presents valuable information. Specifically, these analyses
identify analytics-based approaches to revenue cycle management as an increasingly
utilized strategy among diverse healthcare systems. The findings indicate that sector
decision-makers identify vital benefits that can be derived from these applications as
fundamental savings and risk-management tools. While discussions of the model’s
current and prospective capabilities differ in terms of their understanding of the benefits
derived from their application, they are interconnected by their mutual contention that
these outcomes are directly correlated with the platform’s existing and emerging
technological features and capabilities. In brief, these views suggest that as AI and ML
systems advance, healthcare organizations will be able to apply them in increasingly
sophisticated ways. Assessments of risk identify the challenges related to initial and start-
up processes as being the most significant. Recommended risk management approaches
include applying detailed and precise technology strategies that identify an implemented
model’s specific role within the organization and outline the specific objectives that the
platform will help the company achieve.
The implications derived from the preliminary literature review relate to the
following themes: 1) the current state of the use of AI/ML/RPA and 2) the continuing
gaps in the healthcare revenue cycle areas that could benefit from this technology. As the
review indicated, the current field provides prospective decision-makers with an in-depth
set of data that explains the concept of analytics-driven approaches to revenue
management. It generally outlines the types of benefits derived from its application.
Analyses that reference individual health systems as case studies provide contextual
information that identifies how single organizations apply these innovations. While
informative and descriptive, this information lacks a level of specificity that could
otherwise help planners make targeted decisions. Accordingly, these gaps require
follow-up studies that assess common thematic issues from an organizational perspective.
Evaluating the variables of benefits, drawbacks, risks, and appropriate risk management
strategies from a healthcare organization’s strategic perspective can aid in balancing the
current tendency for associative literature to focus on macro-level themes related to these
same issues.
Chapter 3
Methodology
Approach
The purpose of this study was to identify the potential risks and benefits of using
AI-based applications in the revenue cycles of large healthcare organizations.
Specifically, the study closed the research gaps by identifying and analyzing the
perspectives of key stakeholders responsible for managing revenue cycles. As mentioned
in Chapter 1, the study followed a qualitative methodological approach in which the researcher conducted semi-structured interviews with recruited participants. Semi-
structured interviews provide researchers with opportunities to identify significant themes
across participants and establish an appropriate context for developing theoretical
explanations of emerging themes found in coded data. In many ways, the selected
approach provided a solid basis for identifying broader issues noted in earlier published
studies. Conducting semi-structured interviews with participants also provided the researcher opportunities to discuss how systems thinking concepts and risk management
strategies apply in different healthcare settings (Alam, 2016; Anderson, 2016). By
including interview data in this study, its overarching goal was to determine how
researchers may perform similar investigations using qualitative, quantitative, or mixed
methods approaches.
Justification for the Methodology
Following Hissong et al. (2015), a qualitative methodological framework applies to this study because it guides how researchers understand the meaning derived
from lived experience. Accordingly, qualitative research involves studying individuals as
they behave in different social, organizational, or institutional contexts. Given that few
currently published studies address how key stakeholders manage revenue cycles in
healthcare, the decision to apply a qualitative framework involved accounting for
differences between individual experiences, meanings placed on individual experiences,
how individuals respond to different environments, and developing models whereby
researchers performing future investigations may design empirically verifiable
instruments.
For example, qualitative researchers may apply a phenomenological design when
investigating the relationship between professional development and barriers to accessing
consistent healthcare (Creswell & Creswell, 2018; Hissong et al., 2015).
Phenomenological study designs typically involve researchers performing semi-
structured interviews with a sample size of no more than ten (n = 10) participants. While
quantitative researchers may perceive that such a small number of participants cannot produce generalizable results, qualitative researchers may provide an in-depth analysis of how significant themes coded in the interview data can inform future investigations. The lack of
generalizability will constitute a significant limitation that influences how researchers
performing future investigations may attempt to replicate the study design across other
settings (Creswell & Creswell, 2018). Instead, the selected methodology refers to an
inductive process from which data inform theory development.
Theoretical Framework and Development
The theoretical framework developed for this study included three key areas of
analysis. First, the proposed framework will draw from key concepts impacting
healthcare administration. Examples of concepts included in this framework are systems
thinking and risk management strategies. Whereas systems thinking spans multiple disciplines and applies to various organizational contexts, its applications to the healthcare industry are such that researchers often explain why relationships between some components are more complex than others (Anderson, 2016). Given that the healthcare industry is complex, its relationship to systems thinking further indicates where researchers can detect patterns and failure probabilities. Relatedly, risk management
strategies involve healthcare professionals emphasizing financial and business viability
from an organizational perspective.
Healthcare professionals must follow specific steps when addressing risks, which
entail identifying the context, explaining known risks, analyzing these risks, evaluating
the risks, and managing the risks properly (Alam, 2016). By combining elements of
systems thinking and risk management strategies, the researcher applied specific concepts
to technology management practice in artificial intelligence (AI) applications. The
emerging theoretical framework then aligned with significant themes identified in the
current planning and risk management literature. More specifically, the theoretical
framework developed from an analysis of themes coded from the interview data aligned
with how researchers previously utilized AI models for revenue management practices in
healthcare. In the interview portion of the study, the researcher applied systems thinking concepts and risk management strategies to indicate the presence
of significant themes.
Next, concepts found in the research on risk management strategies were applied
to identify themes like AI risk and risk mitigation. An informed view of principles
guiding risk management strategies will likely improve how the researcher interprets the
interview data and accordingly builds a theoretical framework. From there, the ADDIE model, used to evaluate practice among e-learning designers and developers, will guide theory development. Researchers note further that the ADDIE model supports a process of analyzing, designing, developing, implementing, and evaluating AI designs in complex healthcare environments, where their implications for revenue cycle generation are vast (Anderson, 2016; Gawlik-Kobylinska, 2018). While the ADDIE model also
supports improvements to healthcare decision-making, it corresponds more closely to risk
identification. Subsequently, the theoretical framework developed from the findings will
inform problem-solving approaches significant stakeholders in healthcare may use when
measuring the risks and benefits associated with integrating AI technologies into revenue
cycle management.
Interview Structure and Design
The researcher used a set of 12 interview questions (see Appendix A) that each
participant received. All participants received both a standard email invitation and
NSU’s standard informed consent (see Appendices C and D). Responses from the participants supplied keywords that the researcher could use to ask follow-up questions.
Following this design provided a rich context for analyzing the interview data as they
coincide with this study’s three central research questions. As detailed below, a set of four
interview questions addressed themes related to how AI-based technologies will benefit
the healthcare revenue cycle management processes. Three interview questions involved
asking the participants to address risk factors that negatively impact the healthcare
revenue cycle management processes. A final set of five interview questions invited the
participants to discuss risk management and problem-solving strategies that guide
decision-making processes in the organizational context.
First, the interview questions addressing themes related to how AI-based
technologies will benefit healthcare revenue cycle management processes are as follows:
1. What types of patterns or processes have you seen as a healthcare administrator,
accounting/financial management officer, or information technology (IT) staff
member who influenced your healthcare revenue cycle management perceptions?
2. Which of these patterns or processes left the most significant impact on
organizational performance? Why do you believe these patterns or processes play
such an essential role?
3. How do you believe that AI-based technologies can inform the triple aim of
healthcare when financial and related schemes appear challenging to manage?
4. How do you believe AI-based technologies may contribute to improvements in
administrative, financial, or other forms of professional decision-making?
Responses to these questions will guide how the researcher steers each interview
and links major themes to those found in the extant research literature. The exact process
will apply to the two remaining sets of interview questions on risk factors that negatively
impact healthcare revenue cycle management processes and risk management strategies
that also guide decision-making processes in the organizational context.
The second set of questions will involve the researcher asking participants the
following:
1. Which risks specific to your organization left the most significant impacts on
decision-making after integrating AI-based or other types of technologies into
healthcare revenue cycle management?
2. How have these risks driven past performance and shaped decision-making?
Which risks still demand critical attention in the present?
3. What do you believe will mitigate future risks when some issues impacting
healthcare revenue cycle management remain?
While this set of questions will initially lead the participants to provide closed-ended
responses, its relation to the specific findings discussed in Chapter 2 will provide an
appropriate context for developing theory from the interview data and answering the
three central research questions.
Lastly, the third set of questions will involve the researcher asking participants the
following:
1. Which risk management strategies, if any, did you apply in the past to ensure that
your organization could integrate AI-based technologies effectively?
2. How did you apply these strategies to mitigate current and future risks?
3. Which strategies do you believe are still effective and why?
4. Which strategies do you wish the organization would eliminate? What examples
can you provide to support any changes in administrative or other types of
decision-making in the organizational environment?
5. How have the past and current strategies impacted your ability to develop yourself
professionally while gaining knowledge of how revenue cycle management
functions?
As with the first two sets of semi-structured interview questions, this third set will
ensure that each participant will provide rich insights into how AI-based technologies
inform a lived experience among staff members at different levels within the same
organization.
Data Collection and Research Questions
The study followed a qualitative design to generate three distinct datasets. Next,
the datasets produced in this study were aligned with the responses to interview
questions. From there, the artifacts produced contributed to the discussion and answered
all three research questions (Hissong et al., 2015; Regnault et al., 2018). After the
researcher combined all three datasets, a triangulation process followed to ensure that all responses provided by each participant established an appropriate context allowing for comparisons against previous research findings (Creswell & Creswell, 2018). To reiterate
from Chapter 1, three central research questions guide this study as follows:
R1: What prospective benefits are possible from using AI revenue cycle
applications in the healthcare industry?
R2: What are the risk factors associated with implementing AI-based technologies
in the healthcare industry?
R3: What outcomes are derived by using a Lean Six Sigma (LSS) designed
framework for healthcare executives deciding to implement AI/RPA in the
healthcare revenue cycle?
All three research questions contributed to theoretical development in the following
ways. First, the research questions encouraged the researcher to focus on specific sub-
topics related to the broader issue. Second, the research questions provided a set of
discussion points applicable to all three datasets emerging from the study design. Third,
the research questions provided the basis for exploring AI-based technology in health
systems.
Sampling
The study included semi-structured interview data provided by at least five (n = 5)
participants with experience using AI-based technologies in managing healthcare revenues. For this study, a quota sampling strategy worked best to ensure that all recruited
participants meet specific characteristics (Creswell & Creswell, 2018; Hissong et al.,
2015). The selected strategy allows qualitative researchers to focus their attention on
which recruited individuals will demonstrate the highest possible degree of knowledge or
expertise in their field. From there, the researcher will use recruitment strategies
appropriate to location, culture, and population until achieving a specific quota.
Quota sampling is a nonprobability strategy that resembles purposive sampling in
selecting participants according to criteria relevant to answering one or more research
questions. While the sample size depends mainly on time, resources, and study objectives, it may vary with how many participants an investigator expects to provide interview data (Creswell & Creswell, 2018). However, the quota sampling
strategy applies when researchers evaluate populations with characteristics that
correspond to set properties.
The researcher selected participants using the quota sampling strategy, including
healthcare administrators, accounting/financial management officers, and information
technology (IT) staff members employed within the same organization. The selected
group of participants represents a collective of internal stakeholders capable of providing
in-depth feedback to semi-structured interview questions regarding their experience using
AI-based technologies in revenue cycle management. Specific inclusion criteria that
apply here include 1) employment in one area of healthcare administration that reflects
professional responsibilities and experience in revenue cycle management; 2)
employment within the same organization; and 3) two to three years of experience with
the same organization. Each of the recruited participants will receive an informed consent
letter explaining this study. All responses to semi-structured interview questions will
remain confidential to maintain the anonymity of each participant.
Data Analysis Procedures
As previously mentioned, the semi-structured interview data obtained from each
participant underwent a triangulation process to determine the validity of responses in
alignment with the extant literature. Following Creswell and Creswell (2018),
triangulation entails a process of comparing outcomes and evaluating whether lived
experiences described by participants match those observed in previous studies. Since the
study involved performing semi-structured interviews with recruited participants, data
triangulation is an optimal strategy for comparing how individuals responded to each
question. Data triangulation ensured further that significant themes found in the interview
data answered the central research questions. While investigator triangulation, which involves the use of different evaluators, would have ensured the interview data could answer the central research questions, time and resource constraints limited opportunities for closer comparisons between observations. Although
participants may view their experiences of AI-based technologies and healthcare revenue
cycle management differently, their responses to each interview question produced
outcomes that researchers should consider when performing future investigations.
Further, the data analysis procedure required the researcher to examine how AI-based technologies impacted healthcare revenue cycle management by addressing potential outcomes like associated benefits, weaknesses, risk mitigation, and
effective strategies. Each outcome reflected how the recruited participants described their
lived experience of using technology to improve revenue cycle management. NVivo was used here because the interview data initially arrived unstructured, in raw format. Using NVivo, the researcher coded and segmented the data
according to patterns recorded in the interview data. The coded data then informed the
theoretical framework development to indicate where participants offered similar and
different perspectives on their experience using AI-based technologies.
Trustworthiness and Reliability
Ensuring trustworthiness in this qualitative study required attention to factors like
credibility, transferability, dependability, and confirmability. As Nowell et al. (2017)
explained, credibility occurs whenever qualitative researchers account for participants’
lived experiences and align them with the extant literature. Researchers may use
procedures like data triangulation as an external check to increase credibility (Creswell &
Creswell, 2018; Nowell et al., 2017). However, most cases involve researchers checking
preliminary findings and comparing them to raw interview data obtained from
participants. Member checking and the use of external auditors may inform this process.
However, time and resource constraints limit how many individuals will participate in the
data analysis process.
Second, transferability establishes that findings analyzed from the interview data
should generalize across populations in some way. While smaller samples rarely make
the findings of qualitative studies generalizable, researchers must still provide thick
descriptions to account for where gaps in theory development remain (Creswell &
Creswell, 2018; Hissong et al., 2015; Nowell et al., 2017). Along these lines,
dependability allows qualitative researchers to trace and document the sources of
interview data logically. Researchers may better judge the dependability of investigations by examining how data collection and analysis procedures inform the accuracy with which they interpret the findings (Nowell et al., 2017). Here, researchers may achieve confirmability
by explaining how the findings answer specific questions and inform theoretical
development. Confirmability entails further that researchers performing future
investigations may replicate study designs and understand how some decisions were
made.
Considering that this study aimed to include three datasets, ensuring the validity of each required applying reliability-oriented approaches reflecting potential outcomes in future investigations. Accordingly, the researcher double-checked that each
dataset corresponded to significant themes found in the interview data by performing an
audit trail (Nowell et al., 2017). Documenting an audit trail will also inform the
confirmability of study findings when the lived experiences described by each participant
match a defined research context. However, increasing familiarity with the data will
remain necessary to explain similarities and differences in perceptions regarding how AI-
based technologies produce benefits or risks within a specific organizational context.
Especially as qualitative studies increase in popularity, researchers will need to
familiarize themselves with various tools for ensuring the data collected from participants
have more extensive applications. Aligned with the purpose of this study, a qualitative
methodology will support theory development when each data source provides evidence
of which strategies work and where decision-making can improve.
Summary
The goal of this chapter was to outline the research method used to answer the
research questions. A discussion of the procedure, study participants, data collection, and
interview questions outlined the specifics of how the study was conducted and who
participated in the study. The methodology overview detailed the steps for creating the
interview structure and how that data would flow into a triangulation to ensure the
validity of the responses. This chapter also listed the resources required to support data collection, analysis, and the suggested sampling. The instrument
development and validation process provided insight into combining interview data and
performing the triangulation to form the questions for the theoretical framework. The
goal of Chapter 4 is to provide the study results and demonstrate that the methodology
described in Chapter 3 was followed and supported.
Chapter 4
Results
Data Analysis
Twelve interview questions were posed to a sample of ten participants that included healthcare administrators, accounting/financial management officers, and information technology (IT) staff members employed within the same organization.
Thematic Analysis Approach
The researcher recorded the interviews via a Microsoft Forms engine and transcribed them into text, then arranged and sorted them in NVivo 12 computer-assisted
qualitative data analysis software (CAQDAS) (see Appendix B).
The six-phase process of thematic analysis by Braun and Clarke (2006) (i.e.,
familiarizing yourself with your data, generating initial codes, searching for themes,
reviewing themes, defining and naming themes, and producing the report) was followed
to analyze the transcribed information. Appendix E details the coding process that was
used.
Table 1
Research Questions with Relevant Themes Hierarchy
Research Question Themes
R1: What prospective benefits are possible
from using AI revenue cycle applications in
the healthcare industry?
1. Benefits of AI in HRCMP
1.1 Cost reduction and revenue growth
1.2 Improved data quality
1.3 Organizational benefits
1.3.1 Decrease or reduce workforce
1.3.2 Enhances teamwork
1.3.3 Make better and quick decisions
1.4 Patients benefits
1.4.1 Help in early diagnosis
1.4.2 Improved patients’ experience
1.4.3 Reduces patients’ denial rate
R2: What are the risk factors associated
with implementing AI-based technologies
in the healthcare industry?
2. Negative impact of risk factors on
HRCMP
2.1 Impact of the human component
2.2 Increase in cost
2.3 Need to retrain employees
2.4 Security and privacy concerns
2.5 Technological complexity
R3: What outcomes are derived by using a
Lean Six Sigma (LSS) designed framework
for healthcare executives deciding to
implement AI/RPA in the healthcare
revenue cycle?
3. Risk management and problem-solving
strategies
3.1 Data security
3.2 Identification of risks
3.3 Implementing NLP
3.4 Properly trained staff
3.5 Review and audit of processes
3.6 Transparency of processes
Details of Interviews
A total of ten interviews were conducted with recruited participants. Table 2
exhibits an overview of each participant (interviewee) with the interview duration.
Table 2
Interview Overview with the Duration of Interviews
Interviewee Duration
Participant 1 2 hrs 23 min
Participant 2 14 min
Participant 3 3 hrs 22 min
Participant 4 43 min
Participant 5 45 min
Participant 6 1 hr 48 min
Participant 7 1 hr 12 min
Participant 8 10 min
Participant 9 59 min
Participant 10 1 hr
Word Frequency
A word frequency query of the fifty most frequently occurring words with a minimum length of four characters was run to gain initial familiarity with the data. The following word cloud exhibits the most frequently occurring words and concepts in the responses of the ten participants.
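The query described above amounts to a case-insensitive token count. The minimal Python sketch below (not NVivo’s implementation) mirrors the query settings of a fifty-word cap and a four-character minimum length:

```python
import re
from collections import Counter

def top_words(text, n=50, min_len=4):
    """Case-folded word count, keeping words of at least min_len letters
    and returning the n most frequent, mirroring the query settings."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_len).most_common(n)

sample = "Data quality drives revenue; better data means better claims."
print(top_words(sample, n=3))  # 'data' and 'better' each appear twice
```

Word clouds such as Figure 1 simply scale each returned word by its count.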
Figure 1
Word Frequency
Analysis Process in NVivo
A total of fifteen axial codes were grouped into three broad categories labeled according to the research objective and research questions; the categories were defined based on the questions already developed for the semi-structured interviews (see Appendix E). The data in each category were further mined, and various concepts (themes) and sub-concepts (sub-themes) were identified and interpreted.
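The grouping step can be pictured as a mapping from axial codes to the three broad categories. The sketch below is hypothetical: the category labels are taken from Table 1, but the code names and their assignments are illustrative only:

```python
# Hypothetical sketch of the grouping step. The three category labels come
# from Table 1; the axial-code names and their assignments are illustrative.
CATEGORY_OF_CODE = {
    "Cost reduction and revenue growth": "Benefits of AI in HRCMP",
    "Improved data quality": "Benefits of AI in HRCMP",
    "Security and privacy concerns": "Negative impact of risk factors on HRCMP",
    "Data security": "Risk management and problem-solving strategies",
}

def group_codes(coded_segments):
    """coded_segments: (code, text) pairs; returns category -> list of texts."""
    grouped = {}
    for code, text in coded_segments:
        grouped.setdefault(CATEGORY_OF_CODE[code], []).append(text)
    return grouped

segments = [
    ("Improved data quality", "better data creation and reporting"),
    ("Data security", "review and audit of processes"),
]
print(sorted(group_codes(segments)))
```

Each resulting category’s segments were then mined further for themes and sub-themes, as described above.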
Findings
Benefits of AI in HRCMP (Healthcare Revenue Cycle Management Process)
HRCMP stands for Healthcare Revenue Cycle Management Process.
The first research question, “What prospective benefits are possible from using AI
revenue cycle applications in the healthcare industry?” was answered by formulating a
level-3 theme of “Benefits of AI in HRCMP.” This theme was made up of four sub-themes: 1) Patient benefits, 2) Organizational benefits, 3) Improved data quality, and 4) Cost reduction and revenue growth. The theme of patient benefits was further categorized as 1) Reduces patients’ denial rates, 2) Improved patients’ experience, and 3) Help in early diagnosis. The theme of organizational benefits was further categorized as enhancing teamwork, decreasing or reducing the workforce, and making better decisions.
Figure 2
Theme Hierarchy of Benefits of AI in HRCMP
Figure 3 below represents the top beneficial themes from the collected interview data. The data are presented as the percentage of text coded at each theme (node), which NVivo calculates from the number of references and the amount of text coded at each node in the source document. The top benefit identified by the
51
participants is the organizational benefits of using automation within the healthcare
revenue cycle, as cited below:
“AI would allow for more efficient processes in every department throughout the
revenue cycle, which would produce better data creation, which would create
better data reporting, which will allow the management team to better pinpoint
and address issues affecting both patient health and organizational health.”
“As AI is based on data, via data warehouses, data lakes, and data marts, essentially data stored from all facets of technology systems and integrations, the possibilities of data aggregation combined with logic yield timely and accurate reports or dashboards.”
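NVivo’s percentage coverage metric is simply the share of a source document’s text that is coded at a given node. A minimal sketch of that arithmetic, using hypothetical character counts rather than the study’s actual data, might look like this:

```python
def percentage_coverage(coded_chars, total_chars):
    """Percentage coverage per theme: characters coded at each node
    divided by total characters in the source document."""
    return {theme: round(100 * chars / total_chars, 2)
            for theme, chars in coded_chars.items()}

# Hypothetical character counts in a 10,000-character transcript
coverage = percentage_coverage(
    {"Organizational benefits": 1318, "Patients benefits": 887}, 10_000)
print(coverage)  # {'Organizational benefits': 13.18, 'Patients benefits': 8.87}
```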
Figure 3
Percentage Coverage of Benefits of AI in HRCMP
Organizational benefits: 13.18%
Patients benefits: 8.87%
Improved data quality: 5.42%
Cost reduction and revenue growth: 2.86%
Negative Impact of Risk Factors on HRCMP
The second research question, “What are the risk factors associated with
implementing AI-based technologies in the healthcare industry?” was answered by
formulating a level-3 theme of “Negative impact of risk factors on HRCMP.” This theme was made up of five sub-themes: need to retrain employees, security and privacy concerns, increase in cost, technological complexity, and impact of the human component.
Figure 4
Theme Hierarchy of Negative Impact of Risk Factors on HRCMP
The chart below displays the percentage of respondent text coded to each theme related to the negative impact of risk factors on HRCMP. The highest is the “impact of the human component,” meaning that the risk of human error will still exist even after software implementation.
“If the data that we are generating through staff creation is poor, even the best
reporting will still be inaccurate, leading to inaccurate, and possibly incorrect
decisions by management.”
Figure 5
Percentage Coverage of the Negative Impact of Risk Factors on HRCMP
Risk Management and Problem-Solving Strategies
The third research question, “What outcomes are derived by using a Lean Six
Sigma (LSS) designed framework for healthcare executives deciding to implement
AI/RPA in the healthcare revenue cycle?” was answered by formulating a level-3 theme
of “Risk management and problem-solving strategies.” This theme comprised six sub-themes: implementing NLP, transparency of processes, data security, identification of risks, properly trained staff, and review and audit of processes.
Impact of human component: 3.15%
Need to retrain employees: 2.12%
Technological complexity: 2.06%
Security and privacy concerns: 1.78%
Increase in cost: 1.14%
Figure 6
Theme Hierarchy of Risk Management and Problem-Solving Strategies
The review and audit of the processes is the highest coded theme among the codes
in risk management and problem-solving strategies.
“Having proper dedicated reviewers of all information and processes is
incredibly important.”
“Six Sigma strategies, standardization, and process improvement will pave the
way for AI.”
Figure 7
Percentage Coverage of Risk Management and Problem-Solving Strategies
Triangulation of Data
The themes obtained from semi-structured interviews went through a triangulation process to determine whether the responses aligned with the extant literature. A total of four research articles were selected using the keywords “AI optimizing hospital revenue cycle management.” The triangulation tested validity by converging information from the interview questions and the research articles to ensure that the data gathered were consistent. The results of the triangulation process are presented in table format in Appendix F.
Review and audit of processes: 6.74%
Transparency of processes: 4.15%
Data security: 2.62%
Identification of risks: 2.38%
Properly trained staff: 1.96%
Implementing NLP: 0.83%
Framework Design
Development
The development of the framework presented below was in conjunction with one
of the goals of this research:
To create a framework that may be applied to a healthcare organization in an
effort to migrate from their current revenue management technique to one that
includes the use of AI/ML/RPA as a means of future cost control and revenue
boost.
The researcher was able to construct a framework to provide guidance to
healthcare executives on selecting appropriate tasks for artificial intelligence (AI)/robotic
process automation (RPA) by using the data gathered from the literature review as well as
the responses from the interview data. The framework represents the main areas of
opportunities and concerns that were expressed in the various interviews. The themes and
various subthemes of the benefits of AI in HRCMP and the negative impact of risk
factors on HRCMP were used to create the questions and rankings within the framework.
These themes were aligned with previous research, such as Deloitte’s study “Smart use of artificial intelligence in health care: Seizing opportunities in patient care and business activities” (Chebrolu et al., 2020). In that article, the main areas of benefit were increasing efficiencies and minimizing risks, and the largest area of concern was ensuring that the technology complied with regulations. By using articles such as these, the researcher was able to complete a data triangulation to validate the interview responses. The validation was done by utilizing NVivo to mine the interview data and align the responses to the sub-themes and existing literature. The researcher constructed the framework below using the results.
Framework Example
Step 1:
Answer the question below for each task the company is considering automating. If the
answer is “no,” then the task is not appropriate for automation, and you do not need to
continue with steps 2 and 3.
1. Can automation be used for the task under consideration?
a. Bots may not be allowed because of government regulation or company
policies.
Step 2:
Fill out the below tables for each revenue cycle task the company is considering
automating. Once finished answering the questions, create an overall final evaluation
score by considering the responses to the individual questions. Additional information for
each category or question is included after each table. If unable to answer a specific
question, leave it blank.
Table 3
Risk Viability Framework
Each category is scored on a scale of 1 to 5, where 1 indicates lower viability and 5 indicates higher viability.

Risk Viability                          1 (Lower Viability)    5 (Higher Viability)
Activity Type                           Judgment Based
Process Structure and Risk              Low                    High
Data Risk                               Unstructured           Structured
Custom Development Required             High                   Low
Automation as Preferred Solution        No                     Definitely
Final Risk Viability Evaluation         Low Viability          High Viability
Additional information about “Risk Viability” categories:
• Activity Type refers to the extent to which the activity requires human judgment or learning.
• Process Structure and Risk refer to the frequency with which the underlying
process changes. Some processes remain the same over time, whereas other
processes are constantly fluctuating. Frequent changes to the underlying process
will require constant updates to the bot or advanced programming.
• Data Risk refers to the extent to which bot technology will be processing data
that has a high-risk category.
• Custom Development Required refers to the amount of time, money, and
expertise needed to create the bot. Development requirements tend to increase
with the complexity of the process.
• Automation as Preferred Solution refers to the fact that not every process should be automated.
Table 4
Benefits Framework
Each category is scored on a scale of 1 to 5, where 1 indicates a less beneficial task and 5 a more beneficial one.

Benefits of Automation                  1 (Less Beneficial)    5 (More Beneficial)
Effort Required for Manual Function     Low                    High
Frequency of the Function               Low                    High
Staffing Concerns (Turnover/Overtime)   High                   Low
Data Accuracy Concerns                  High                   Low
Compliance Concerns                     High                   Low
Final Evaluation of Benefits            Low Benefit            High Benefit
Additional information about “Benefits of Automation” categories:
• Effort Required for Manual function refers to the amount of time and mental
energy needed to perform the Revenue Cycle activity (and not the creation of the
automation).
• Frequency of the Function refers to the number of times the activity occurs
within a given time period. Activities that occur more often are more beneficial to
automate.
• Staffing Concerns (Turnover/Overtime) refers to the extent of staffing issues your department faces. By creating automation, companies should be able to redeploy staff to perform more complex tasks, increasing employee engagement and creating a 24×7 workforce.
• Data Accuracy Concerns refers to the amount of human error and variations
from the standard.
• Compliance Concerns refer to the amount of concern an organization has with
possible data errors resulting in a reportable offense.
Step 3:
Plot the scores for each potential bot activity on the matrix below. Automation activities in Quadrant 2 should be prioritized for immediate development. Once all Quadrant 2 activities are developed, Quadrant 1 activities should be developed; activities in Quadrants 3 and 4 should not be developed.
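Steps 2 and 3 can be expressed as a small scoring routine. The framework leaves the final evaluation score to the evaluator’s judgment, so the simple averaging used here, the midpoint of 3 between the low and high anchors, and the mapping of matrix corners to quadrant numbers are all assumptions introduced for illustration, not part of the framework itself.

```python
from statistics import mean

def final_score(ratings):
    """Combine the 1-5 ratings from Table 3 or Table 4 into an overall
    evaluation. Averaging is an assumed rule; unanswered questions
    (None, i.e. left blank per Step 2) are skipped."""
    return mean(r for r in ratings if r is not None)

def quadrant(viability, benefit, midpoint=3):
    """Place a task on the Step 3 matrix. The quadrant numbering is an
    assumption: 2 = high viability and high benefit (develop first),
    1 = high benefit only, 3 and 4 = not developed."""
    if benefit >= midpoint:
        return 2 if viability >= midpoint else 1
    return 3 if viability >= midpoint else 4

task_viability = final_score([4, 5, 3, 4, 5])   # Table 3 responses
task_benefit = final_score([5, 4, None, 5, 4])  # Table 4, one blank
print(quadrant(task_viability, task_benefit))   # -> 2: develop immediately
```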
Figure 8
Framework Scatterplot
Summary
This chapter revealed the study findings. First, the subject matter experts’ results were outlined. Second, the custom coding performed in NVivo was described along with its results, which drew on the ten qualified participants. Third, data triangulation was performed to validate the interview responses against the current literature. Utilizing these data, the researcher was able to answer the three research questions and construct a theoretical framework for healthcare executives to use when deciding whether to implement AI/RPA within their organizations.
Chapter 5
Conclusion
The conclusion begins by exploring the results of the three research questions.
Limitations of the study are then described, noting how they may have had an impact on
the results. Next, implications and recommendations are offered to provide context for the further evolution of AI/RPA use in the healthcare revenue management cycle. A summary of the research study concludes the chapter.
Research Questions
Research Question 1: What prospective benefits can be generated by using AI revenue
cycle applications for healthcare organizations?
As noted in the literature review, many studies have been conducted on the use of AI/RPA in other industries. This study attempted to incorporate those studies as well as the interview data to answer the first research question. Similar to the other studies, the researcher noted a core group of proposed benefits: 1) patient benefits, 2) organizational benefits, 3) improved data quality, and 4) cost reduction and revenue growth.
Research Question 2: What are the risk factors associated with AI implementation in
healthcare?
Most healthcare organizations are required to assess risk levels prior to implementing any innovative technologies. As in other research studies targeted at healthcare implementations, this study identified the strategies needed to address the potentially high risks of using AI/RPA in the revenue cycle, including external, physical, and digital risks, as well as maintaining a governance framework to assure patient privacy and other HIPAA requirements. Utilizing the framework developed for research question 3, healthcare companies should assess whether the potential benefits sufficiently outweigh the associated risks.
Research Question 3: What outcomes are derived by using a Lean Six Sigma (LSS)
designed framework for healthcare executives deciding to implement AI/RPA in the
healthcare revenue cycle?
This research study provides a theoretical framework for using AI/RPA in the healthcare revenue cycle, which allows for waste reduction, elimination of non-value-added activities, and variability reduction. Lean tools reduce waste and non-value-adding activity and enhance the effectiveness of equipment, tools, and machines. For this research question, a theoretical framework was constructed; the Lean Six Sigma framework should be implemented to reduce defects occurring during the revenue cycle. The theoretical framework combines the interview data and literature with Lean Six Sigma methods to mitigate errors and defects and to increase patient satisfaction while reducing overhead costs.
Limitations
Upon retrospective review, multiple study limitations were identified. Firstly, a
relatively small sample size was utilized to accomplish this study. Due to this limitation,
the findings may or may not be generalizable to the population of healthcare executives
in the United States or healthcare revenue cycle strategies used in other countries.
Secondly, a limitation arose from conducting this research during the COVID-19 pandemic, which required the researcher to adapt from a typical face-to-face interview. Face-to-face interviews have been considered the gold standard in qualitative interviewing, especially for their potential to elicit honest views on sensitive topics by building trust with research participants. At the time of the study, this was not feasible, and remote methods were required. As a result, the researcher may have missed valuable data and insights due to the lack of unstructured conversation; such conversations might have led to other topic areas that could have influenced the outcomes of the study.
Implications
The theoretical framework constructed in this research is subject to several
limitations that suggest several opportunities for additional research. First, the framework
focuses on the prioritization of the development of new automation tasks. Healthcare
organizations would benefit from research regarding the maintenance of these
technologies, including governing the tasks with respect to ever-changing government
and insurance regulations.
Secondly, the theoretical framework developed in this research study was not tested or validated. The researcher believes, but has not empirically validated, that this framework will help healthcare executives in their decision-making process with regard to AI/RPA automation. Thus, future research or case studies should be conducted to validate the framework for its effectiveness and efficiency.
Recommendations
AI/RPA technologies are at the forefront of disruptive technologies and have
tremendous potential to transform the healthcare revenue cycle. However, there is much
to be explored about the implications of this emerging technology on the healthcare
revenue cycle before it can be fully implemented. Additional testing of RPA and actual
implementation on sections of the revenue cycle is necessary to obtain a better
understanding of its benefits and challenges. This study focused on developing a
theoretical framework to assist healthcare executives in determining if AI/RPA
implementations aligned with their organizational needs.
In the meantime, it seems that AI/RPA can be used to automate segments of the
RCM process. However, caution and due diligence are needed in its development,
implementation, and monitoring due to the unknowns with the payor and federal
regulatory issues. The higher level of monitoring may cancel out the organizational
benefits until more knowledge is gained around the risks of these technologies.
Although initial assessments of the value-add of AI/RPA indicate that it can lead
to improved patient satisfaction and better financial and organizational benefits, it would
be interesting to measure its usefulness in real-time with a large RCM department.
However, this type of implementation may not be easy to do until prototypes are ready
for deployment. As more about the cost and benefits of AI/RPA is revealed over time, it
will be necessary for healthcare executives to become familiar with its potential
application in their organization.
This study recommends that future studies continue examining other factors that
may influence the cost/benefit analysis of AI/RPA implementations. For example, this
study suggests adding other risk constructs to the model, such as new government
regulations. These areas are changing so quickly that such items would need to be built into the model. Additionally, utilizing major payor rules and regulations is highly
recommended to examine other factors that influence the expected outcomes of AI/RPA
within the healthcare RCM processes.
Summary
The problem addressed by this study is a lack of understanding regarding the specific risks and benefits associated with AI implementation in healthcare settings. Many administrative tasks in healthcare are currently completed manually, which incurs high labor costs and increases the potential for human error. However, it is unknown to what extent AI may improve these administrative tasks and address these challenges (CAQH, 2018).
There is a lack of understanding regarding the risks and benefits associated with
AI/RPA implementations in healthcare revenue cycle settings. Healthcare companies are
confronted with stricter regulations and billing requirements, underpayments, and more
significant delays in receiving payments. Despite the continued interest of practitioners,
revenue cycle management has not received much attention in research.
In order to expand the knowledge of the use of AI/RPA in the healthcare revenue
cycle, the researcher conducted a thorough analysis of the existing literature and
combined that with conducting interviews of key individuals. Using this data, the
researcher conducted a triangulation of the responses and current literature to help
develop a theoretical framework that may be applied to a healthcare organization in an
effort to migrate from their current revenue management technique to one that includes
the use of AI/ML/RPA as a means of future cost control and revenue boost.
The goals of this research study were:
1. To expand on the current literature surrounding the use of AI in the health care
revenue cycle and provide a framework to allow health care executives to quickly
visualize the benefits or drawbacks of such a technology in their specific
healthcare revenue cycle departments.
2. To create a framework that may be applied to a healthcare organization in an
effort to migrate from their current revenue management technique to one that
includes the use of AI/ML/RPA as a means of future cost control and revenue
boost.
To achieve the stated goals of the research, the main research questions were:
R1. What prospective benefits can be generated by using AI revenue cycle
applications for healthcare organizations?
R2. What are the risk factors associated with AI implementation in healthcare?
R3. What outcomes are derived by using a Lean Six Sigma (LSS) designed
framework for healthcare executives deciding to implement AI/RPA in the
healthcare revenue cycle?
In order to answer these research questions, a qualitative semi-structured
interview was conducted with ten key stakeholders responsible for managing or
developing revenue cycles, including healthcare administrators, accounting/financial
management officers, and information technology (IT) staff members.
The semi-structured interview consisted of 12 questions in three thematic areas:
1. How AI-based technologies will benefit the healthcare revenue cycle management
processes
2. How to address risk factors that negatively impact the healthcare revenue cycle
management processes
3. Inviting the participants to discuss risk management and problem-solving
strategies that guide decision-making processes in the organizational context
Finally, the interview responses underwent a triangulation process against
multiple existing studies to determine the validity of responses aligned with the extant
literature. Following Creswell and Creswell (2018), the triangulation process ensured
that the research participants’ responses aligned with those found in previous AI/RPA studies. An audit trail was developed by
transcribing the semi-structured interview responses that were recorded via a Microsoft
Forms engine. These responses were stored without any personally identifiable
information to ensure the confidentiality of the interview participants.
The research findings suggest that AI/RPA implementations can improve the healthcare revenue cycle’s effectiveness and efficiency. Healthcare organizations should be cautious about which workflows they implement AI/RPA into, given governmental regulations and payor complexities. These findings are consistent with recent literature and the interview data collected, which suggest that for some tasks the benefits do not justify the risks.
The limitations of this research study included factors such as sample size and sampling technique. The sample size was small, which affected the accuracy of the results, and the sampling technique was convenience-based, which limits generalizability. Additionally, this study collected information about AI/RPA from key individuals through semi-structured interviews, which could affect the truthfulness of participants’ answers and, consequently, the accuracy of the study results.
This research study contributed to prior healthcare literature in three main ways. First, it expanded on the current literature surrounding the use of AI in the health care revenue cycle. Second, it drew similarities to past research by interviewing prominent information technology professionals as well as healthcare executives. Finally, this research constructed a theoretical framework, allowing health care executives to quickly visualize the benefits or drawbacks of such a technology in their specific healthcare revenue cycle departments.
This study recommended opportunities for future research to examine other
AI/RPA implementations in different organizations while modifying the developed
theoretical model to fit the organization’s terminology. Future research is needed to test
the theoretical model to ensure that it has the intended outcomes and displays the benefits
as expected. Moreover, major payor and government regulations could be added to the
theoretical model for further investigation. Another recommendation is to recruit a large
and diverse sample using experimental research design to ensure the generalizability of
results.
Appendices
Appendix A: Interview Questions
Appendix B: IRB Exempt Initial Approval Memo
Appendix C: Email Invitation
Dear (Participant),
I am conducting interviews as part of a research study at Nova Southeastern University.
This study is in fulfillment of my dissertation requirements. The study aims to increase
the understanding of the specific risks and benefits associated with Artificial Intelligence
and/or Robotic Process Automation (AI/RPA) implementations in healthcare revenue
cycle settings.
As an experienced healthcare administrator, accounting/financial management officer,
and/or information technology (IT) staff member, you are in an ideal position to give us
valuable first-hand information from your perspective.
The interview takes around 30 minutes and is very informal. We are simply trying to
capture your thoughts and perspectives on the use of AI/RPA within the revenue cycle.
Your responses to the questions will be kept confidential. Each interview will be assigned
a number code to help ensure that personal identifiers are not revealed during the analysis
and write-up of findings.
There is no compensation for participating in this study. However, your participation will
be a valuable addition to my research, and findings could lead to a greater understanding
of the use of AI/RPA in the healthcare setting.
If you are willing to participate, please suggest a day and time that suits you, and I will do
my best to be available. If you have any questions, please do not hesitate to ask.
Thank you for your time and consideration.
Leonard Pounds
[email protected]
954-661-2794
Appendix D: Informed Consent Form
Appendix E: NVivo Codes

Benefits of AI in HRCMP: How AI-based technologies will benefit the healthcare revenue cycle management processes.
  Cost reduction and revenue growth: AI is a massive enabler in improving funds flow and reducing billing mistakes, resulting in reduced capital cost.
  Improved data quality: AI will eliminate numerous manual mistakes, timing issues with manual inputting of data by providers and front desk staff, and delays in submission of claims.
  Organizational benefits:
    Decrease or reduce workforce: A decrease of the workforce means the quantity of work needed from staff becomes less; a reduction means bringing down its size (fewer people are required to perform a task).
    Enhances teamwork: A thorough examination of all systems and processes of AI-based projects has brought organizations together.
    Make better, quicker decisions: Using AI technologies, data will be better and more easily accessible, resulting in more time for analysis, which leads to quicker, better decisions.
  Patient benefits:
    Help in early diagnosis: The patient population’s needs are anticipated by AI systems, and early intervention results in preventing more severe conditions. AI systems can alert patients of visits and medication refills and can also monitor the progress of improved health outcomes.
    Improved patients’ experience: AI would be able to enhance the patient experience by streamlining the admission and care process and giving doctors and medical staff more time to focus on the patients and not the process.
    Reduces patients’ denial rate: Claim denials are one of the most common barriers to effective revenue cycle management. Using AI systems, denials can be anticipated, edits can be put in place, and new claims will be paid on initial submission.

Negative impact of risk factors on HRCMP: Risk factors that negatively impact the healthcare revenue cycle management processes.
  Impact of the human component: The risks of human error will still exist and can adversely affect even the most finely implemented software.
  Increase in cost: The cost of implementing and maintaining the entire web of processes results in extra costs to healthcare.
  Need to retrain employees: Once staff no longer perform the processes manually, they will lose knowledge that is vital for the system’s upkeep and adjustment.
  Security and privacy concerns: Data integrity and security are significant concerns when implementing any new technology, as data go out to a third party.
  Technological complexity: Understanding how new technologies work and how they can be practically implemented in cycle management is quite complex and is an ever-changing paradigm.

Risk management and problem-solving strategies:
  Data security: Data security should be paramount. Evaluation must be carried out to keep the patient’s data and the company’s financial information safe.
  Identification of risks: Any potential risk should be identified and monitored to minimize its impact.
  Implementing NLP: Implementing natural language processing (NLP) to translate the clinical notes automatically.
  Properly trained staff: Training of the staff to use the AI processes.
  Review and audit of processes: The data should be analyzed and assessed at regular intervals.
  Transparency of processes: Regular reports should be generated to identify issues in the system, and all processes and workflow designs should be documented for easy understanding.
Appendix F: Triangulation of Data
References
Abdullah, U., Sawar, M. J., & Ahmed, A. (2009). Comparative study of medical claim
scrubber and a rule-based system. Proceedings – 2009 International Conference on
Information Engineering and Computer Science, ICIECS 2009.
https://doi.org/10.1109/ICIECS.2009.5363668
Adams, W. C. (2015). Conducting semi-structured interviews. In Handbook of Practical
Program Evaluation (4th ed., pp. 492–505). John Wiley & Sons.
Alam, A. Y. (2016). Steps in the process of risk management in healthcare. Journal of
Epidemiology and Preventative Medicine, 51, 1-8.
Anderson, B. R. (2016). Improving health care by embracing systems theory. The
Journal of Thoracic and Cardiovascular Surgery, 152(2), 593-594.
ASQ. (2020). What is Lean Six Sigma. https://asq.org/quality-resources/six-sigma
Bai, T., Egleston, B. L., Bleicher, R., & Vucetic, S. (2019). Medical concept
representation learning from multi-source data. 28th International Joint Conference
on Artificial Intelligence (IJCAI-19), 4897–4903.
https://doi.org/10.24963/ijcai.2019/680
Bautista, R. M., Dumlao, M., & Ballera, M. A. (2016). Recommendation system for
Engineering students’ specialization selection using predictive modeling.
Proceedings of the Third International Conference on Computer Science, Computer
Engineering, and Social Media (CSCESM2016), 34–40.
Baxter, K., Pechanich, C., Laur, T., Sevenikar, G., & Malloy, D. (2019). Pursuing
innovation in the revenue cycle to transform operations. Healthcare Financial
Management, 73(11), S1–S4.
https://www.hfma.org/topics/hfm/2019/december/pursuing-innovation-in-the-
revenue-cycle-to-transform-operations.html
Becker, S., & Ellison, A. (2019, August 12). How AI can transform hospital revenue
cycle management — 5 thoughts. Becker’s Hospital CFO Report.
https://www.beckershospitalreview.com/finance/how-ai-can-transform-hospital-
revenue-cycle-management-5-thoughts.html
Blass, G., & Porr, R. F. (2019). More than clinical: How AI supports compliance and risk
management. Journal of Healthcare Compliance, 21, 35–39.
Bohr, A., & Memarzadeh, K. (2020). The rise of artificial intelligence in healthcare
applications. In Artificial Intelligence in Healthcare (pp. 25–60).
https://doi.org/10.1016/b978-0-12-818438-7.00002-2
Bryman, A., & Bell, E. (2011). Business research methods. Oxford University Press.
CAQH. (2018). 2017 CAQH INDEX®: A report of healthcare industry adoption of
electronic business transactions and cost savings. In CAQH Explorations.
www.caqh.org
Che, N., & Janusz, W. (2013). Unsupervised labeling of data for supervised learning and
its application to medical claims prediction. Computer Science, 14(3), 191.
https://doi.org/10.7494/csci.2013.14.2.191
Cheatham, B., Javanmardian, K., & Samandari, H. (2019). Confronting the risks of
artificial intelligence. Mckinsey Quarterly, 1–9.
Chebrolu, K., Ressler, D., & Varia, H. (2020, October 16). Deloitte introduces trustworthy
AI framework – press release. Deloitte United States.
https://www2.deloitte.com/us/en/pages/about-deloitte/articles/press-
releases/deloitte-introduces-trustworthy-ai-framework.html?nc=1.
Chenail, R. J. (2011). Interviewing the investigator: Strategies for addressing
instrumentation and researcher bias concerns in qualitative research. Qualitative
Report, 16(1), 255–262.
Chimmad, A., Saripalli, P., & Tirumala, V. (2017). Assessment of healthcare claims
rejection risk using machine learning. 2017 IEEE 19th International Conference on
E-Health Networking, Applications, and Services, Healthcom 2017, 1–6.
https://doi.org/10.1109/HealthCom.2017.8210758
Christodoulakis, C., Asgarian, A., & Easterbrook, S. (2020). Barriers to adoption of
information technology in healthcare. Proceedings of the 27th Annual International
Conference on Computer Science and Software Engineering, CASCON 2017, 66–75.
Clancy, T. R. (2020). Artificial intelligence and nursing: The future is now. JONA: The
Journal of Nursing Administration, 50(3).
https://doi.org/10.1097/NNA.0000000000000855
Cleverley, W. O., & Cleverley, J. O. (2018). Essentials of health care finance (8th ed.).
Jones & Bartlett Learning.
Creswell, J., & Plano Clark, V. L. (2007). Designing and conducting mixed methods
research. Sage.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and
mixed methods approaches (5th ed.). Sage Publications.
Culatta, R. (2021). ADDIE model. Instructional Design.
89
del%20is%20the,training%20and%20performance%20support%20tools.
Dash, R., Rebman, C., & Kar, U. K. (2019). Application of artificial intelligence in
automation of supply chain management. Journal of Strategic Innovation and
Sustainability, 14(3), 43–53. https://doi.org/10.33423/jsis.v14i3.2105
Davenport, T., & Kalakota, R. (2019). The potential for artificial intelligence in
healthcare. Future Healthcare Journal, 6(2), 94–98.
https://doi.org/10.7861/futurehosp.6-2-94
DECO. (2019, September 26). Five hospital revenue cycle management challenges
healthcare providers face. DECO. https://www.decorm.com/five-hospital-revenue-
cycle-management-challenges-healthcare-providers-face/
Evans, J. (2017). Rethinking health care’s triple aim. HFM (Healthcare Finance
Management), October, 1–8. search.ebscohost.com/
Evans, J. R. (2015). Modern analytics and the future of quality and performance
excellence. The Quality Management Journal, 22(4), 6–17.
https://doi.org/10.1080/10686967.2015.11918447
Fineout-Overholt, E., & Melnyk, B. M. (2015). Evidence-based practice in nursing &
healthcare. A guide to best practice (3rd ed.). Wolters Kluwer.
Forcier, M. B., Gallois, H., Mullan, S., & Joly, Y. (2019). Integrating artificial
intelligence into health care through data access: Can the GDPR act as a beacon for
policymakers? Journal of Law and the Biosciences, 6(1), 317–335.
https://doi.org/10.1093/jlb/lsz013
Frączek, B. (2016). Characteristics of evidence in evidence-informed practice in financial
education. Acta Oeconomica Cassoviensia, 9(1), 52–67.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and
analytics. International Journal of Information Management, 35(2), 137–144.
https://doi.org/10.1016/j.ijinfomgt.2014.10.007
Gawlik-Kobylinska, M. (2018). Reconciling ADDIE and Agile instructional design
models—
Case study. New Trends and Issues: Proceedings on Humanities and Social
Sciences, 5(3) 14-21. https://doi.org/100.18844/prosoc.v5i3.3906
Gerke, S., Minssen, T., & Cohen, G. (2020). Ethical and legal challenges of artificial
intelligence-driven healthcare. In Artificial Intelligence in Healthcare (pp. 295–
336). https://doi.org/10.1016/b978-0-12-818438-7.00012-5
90
Hamet, P., & Tremblay, J. (2017). Artificial intelligence in medicine. Metabolism,
69(Supplement), S36–S40. https://doi.org/10.1016/j.metabol.2017.01.011
Hegwer, L. R. (2018). Technology and a strong patient focus help providers excel in
revenue cycle performance. Healthcare Financial Management, 72(9), 58–65.
Hillman, D. (2020). The role of intelligent automation in reducing waste and improving
efficiency in the revenue cycle. Healthcare Financial Management, 75, 36–39.
Hissong, A. N., Lape, J. E., & Bailey, D. M. (2015). Bailey’s research for the health
professional (3rd ed.). F.A. Davis Company.
Hut, N. (2019). Effective use of analytics helps healthcare organizations solve critical
challenges. Healthcare Financial Management, 44–47.
Kim, B. H., Sridharan, S., Atwal, A., & Ganapathi, V. (2020). Deep claim: Payer
response prediction from claims data with deep learning. ArXiv, 2007.06229.
Kühl, N., Goutier, M., Hirt, R., & Satzger, G. (2019). Machine learning in artificial
intelligence: Towards a common understanding. Hawaii International Conference
on Systems Science Proceedings 2019, 2–12.
https://doi.org/10.24251/hicss.2019.630
Kumar, M., Ghani, R., & Mei, Z. S. (2010). Data mining to predict and prevent errors in
health insurance claims processing. Proceedings of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 65–73.
https://doi.org/10.1145/1835804.1835816
Lacity, M. C., & Willcocks, L. P. (2016). Robotic process automation at telefónica O2.
MIS Quarterly Executive, 15(1), 21–35.
LaPointe, J. (2020). How artificial intelligence is optimizing revenue cycle management.
Recycle Intelligence. https://revcycleintelligence.com/features/how-artificial-
intelligence-is-optimizing-revenue-cycle-management
Lee, J., Suh, T., Roy, D., & Baucus, M. (2019). Emerging technology and business model
innovation: The case of artificial intelligence. Journal of Open Innovation:
Technology, Market, and Complexity, 5(3). https://doi.org/10.3390/joitmc5030044
Lin, Y., Chen, H., Brown, R. A., Li, S., & Yang, H. (2017). Healthcare predictive
analytics for risk profiling in chronic care: A bayesian multitask learning approach.
MIS Quarterly, 41(2), 473–495.
Lu, J., Fung, B. C. M., & Cheung, W. K. (2020). Embedding for anomaly detection on
health insurance claims. Proceedings – 2020 IEEE 7th International Conference on
Data Science and Advanced Analytics, DSAA 2020, 459–468.
91
https://doi.org/10.1109/DSAA49011.2020.00060
McGrow, K. (2019). Artificial intelligence: Essentials for nursing. Nursing, 49(9), 46–49.
https://doi.org/10.1097/01.NURSE.0000577716.57052.8d
Millauer, T., & Vellekoop, M. (2019). Artificial intelligence in today’s hotel revenue
management: opportunities and risks. Research in Hospitality Management, 9(2),
121–124. https://doi.org/10.1080/22243534.2019.1689702
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification
strategies for establishing reliability and validity in qualitative research.
International Journal of Qualitative Methods, 1(2), 13–22.
https://doi.org/10.1177/160940690200100202
Mortier, P., Cuijpers, P., Kiekens, G., Auerbach, R. P., Demyttenaere, K., Green, J. G.,
Kessler, R. C., Nock, M. K., & Bruffaerts, R. (2018). The prevalence of suicidal
thoughts and behaviours among college students: A meta-analysis. Psychological
Medicine, 48(4), 554–565. https://doi.org/10.1017/S0033291717002215
Navigant Consulting. (2019). Top revenue cycle challenges and opportunities.
Healthcare Financial Management.
https://www.hfma.org/topics/hfm/2019/november/top-revenue-cycle-challenges-
and-opportunities.html
Nilsson, E. (2019). 4 strategies for an AI-driven approach to improve revenue cycle
performance. Healthcare Financial Management, 73(9), 42–44.
Nowell, L. S., Norris, J. M., White, D. E., & Moules, N. J. (2019). Thematic analysis:
Striving to meet the trustworthiness criteria. International Journal of Qualitative
Methods, 16, 1- 13. https://doi.org/10.1177/1609406917733847
Onwuegbuzie, A., & Teddlie, C. (2003). A framework for analyzing data in mixed
methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed
methods in social and behavioral research (pp. 351–383). Sage.
Patient Protection and Affordable Care Act, Pub. L. No. 42 U.S.C. § 18001 et seq.
(2010).
Rajkomar, A., Oren, E., Chen, K., Dai, A. M., Hajaj, N., Hardt, M., Liu, P. J., Liu, X.,
Marcus, J., Sun, M., Sundberg, P., Yee, H., Zhang, K., Zhang, Y., Flores, G.,
Duggan, G. E., Irvine, J., Le, Q., Litsch, K., … Dean, J. (2018). Scalable and
accurate deep learning with electronic health records. NPJ Digital Medicine, March,
1–10. https://doi.org/10.1038/s41746-018-0029-1
Regnault, A., Willgoss, T., & Barbic, S. (2018). Towards the use of mixed methods
inquiry as best practice in health outcomes research. Journal of Patient-Reported
92
Outcomes, 2, 2–5. https://doi.org/10.1186/s41687-018-0043-8
Schouten, P. (2013). Big data in health care solving provider revenue leakage with
advanced analytics. Healthcare Financial Management, 67(2), 40–42.
Shaw, J., Rudzicz, F., Jamieson, T., & Goldfarb, A. (2019). Artificial intelligence and the
implementation challenge. Journal of Medical Internet Research, 21(7).
https://doi.org/10.2196/13659
Stanfill, M. H., & Marc, D. T. (2019). Health information management: Implications of
artificial intelligence on healthcare data and information management. Yearbook of
Medical Informatics, 28(1), 56–64. https://doi.org/10.1055/s-0039-1677913
Wager, K. A., Lee, F. W., & Glaser, J. P. (2017). Health care information systems: A
practical approach for health care management (4th ed.). Jossey-Bass.
Wojtusiak, J. (2014). Rule learning in healthcare and health services research. Intelligent
Systems Reference Library, 56, 131–145. https://doi.org/10.1007/978-3-642-40017-
9_7
Wojtusiak, J., Ngufor, C., Shiver, J., & Ewald, R. (2011). Rule-based prediction of
medical claims’ payments: A method and initial application to Medicaid data.
Proceedings – 10th International Conference on Machine Learning and
Applications, ICMLA 2011, 2, 162–167. https://doi.org/10.1109/ICMLA.2011.126
XIFIN. (2020, March 9). Gain new insights with analytics, AI to accelerate RCM
workflow. Revcycle Intelligence. https://revcycleintelligence.com/news/gain-new-
insights-with-analytics-ai-to-accelerate-rcm-workflow
Zhong, Q. Y., McCammon, J. M., Fairless, A. H., & Rahmanian, F. (2019). Medical
concept representation learning from claims data and application to health plan
payment risk adjustment. ArXiv, 1907.06600, 2–5.
ProQuest Number: 28773927

INFORMATION TO ALL USERS
The quality and completeness of this reproduction is dependent on the quality and completeness of the copy made available to ProQuest.

Distributed by ProQuest LLC (2021). Copyright of the Dissertation is held by the Author unless otherwise noted.

This work may be used in accordance with the terms of the Creative Commons license or other rights statement, as indicated in the copyright statement or in the metadata associated with this work. Unless otherwise specified in the copyright statement or the metadata, all rights are reserved by the copyright holder.

This work is protected against unauthorized copying under Title 17, United States Code and other applicable copyright laws.

Microform Edition where available © ProQuest LLC. No reproduction or digitization of the Microform Edition is authorized without permission of ProQuest LLC.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346 USA
BIG DATA, DATA SCIENCE, AND THE U.S. DEPARTMENT OF DEFENSE (DOD)
by
Roy Lancaster
GAYLE GRANT, DM, Faculty Mentor and Chair
MICHELLE PREIKSAITIS, JD, PhD, Committee Member
BRUCE WINSTON, PhD, Committee Member
Tonia Teasley, JD, Interim Dean
School of Business and Technology
A Dissertation Presented in Partial Fulfillment
Of the Requirements for the Degree
Doctor of Business Administration
Capella University
January 2019
ProQuest Number: 13805367

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2019). Copyright of the Dissertation is held by the Author.
All rights reserved.

This work is protected against unauthorized copying under Title 17, United States Code.
Microform Edition © ProQuest LLC.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
© Roy Lancaster, 2019
Abstract
This qualitative case study of a de-identified DOD organization, the Bravo Zulu Center (BZC) (a pseudonym), explored how U.S. Department of Defense (DOD) personnel glean actionable
information from big data sets. This research sought to help analyze and define the skills used by
DOD analysts, in order to better understand the application of data science to the DOD. While
the technology for producing data has grown tremendously, DOD personnel lack the required
data analysis skills and tools. Eleven DOD analysts answered individual interview questions,
eight managers participated in a focus group, and the DOD provided documents to assist with
investigating two research questions: How does the Bravo Zulu Center glean actionable
information from big data sets? How mature are the data science analytical skills, processes, and
software tools used by Bravo Zulu Center analysts? Qualitative analysis of the interviews, focus group, and documents using NVivo-11® Pro software showed that overarching themes of access to quality data, training, data science skills, domain understanding, management, infrastructure and legacy systems, organization structure and culture, and competition for analytical talent emerged as concerns for improving big data analysis in the DOD.
The Bravo Zulu Center is experiencing the same large data growth as other organizations described in scholarly research and is struggling to create actionable information from large data sets to meet mission requirements, a struggle compounded by immature data science skills.
Dedication
The study is dedicated to my wife of thirty years, Laurie Lancaster. Your love, continued encouragement, and desire for life-long learning have always given me the strength to continue; I love you and thank you!
our grandbabies Nora and Jameson! A special thank you to my mom Kathryn for “grounding”
me in the early years and teaching me the value of education and for your foundational love and
support! Special thank you to my sisters Shari and Amy and to all my extended family and
friends, I love you all!
Acknowledgments
I wholeheartedly thank my mentor and chair, Dr. Gayle Grant, for her expert guidance throughout this project and for getting me to the finish line; thank you! I extend gratitude to my
committee, Dr. Michelle Preiksaitis and Dr. Bruce Winston for their expert reviews and
guidance. A special thank you to Dr. Linda Haynes for her outstanding reviews and most
importantly her love and inspiration; thanks, Aunt Linda! Thank you to the Bravo Zulu Center (pseudonym) for opening its doors to me; this study would not have been possible without
your generosity. Thank you to the men and women who wear the uniform of the United States
military!
Table of Contents
Dedication ……………………………………………………………………………………………….. iii
Acknowledgments…………………………………………………………………………………….. iv
List of Tables …………………………………………………………………………………………. viii
List of Figures …………………………………………………………………………………………….x
CHAPTER 1. INTRODUCTION …………………………………………………………………………….1
Introduction ………………………………………………………………………………………………..1
Background ………………………………………………………………………………………………..2
Business Problem ………………………………………………………………………………………..4
Research Purpose ………………………………………………………………………………………..5
Research Questions ……………………………………………………………………………………..6
Rationale ……………………………………………………………………………………………………7
Conceptual Framework ………………………………………………………………………………..8
Significance………………………………………………………………………………………………..9
Definition of Terms……………………………………………………………………………………10
Assumptions and Limitations ……………………………………………………………………..10
Organization for Remainder of Study …………………………………………………………..11
CHAPTER 2. LITERATURE REVIEW …………………………………………………………………13
Conceptual Framework and Research Design ……………………………………………….14
Big Data Defined ………………………………………………………………………………………19
DOD and Big Data …………………………………………………………………………………….25
Data Sciences ……………………………………………………………………………………………31
Data Sciences Skills …………………………………………………………………………………..34
Federal Job Series and DOD Data Scientists …………………………………………………45
Management Implications …………………………………………………………………………..48
Summary ………………………………………………………………………………………………….52
CHAPTER 3. METHODOLOGY ………………………………………………………………………….53
Introduction ………………………………………………………………………………………………53
Research Questions ……………………………………………………………………………………53
Design and Methodology ……………………………………………………………………………54
Participants ……………………………………………………………………………………………….56
Setting. …………………………………………………………………………………………………….60
Analysis of Research Questions…………………………………………………………………..61
Credibility and Dependability ……………………………………………………………………..65
Data Collection …………………………………………………………………………………………67
Data Analysis ……………………………………………………………………………………………69
Ethical Considerations ……………………………………………………………………………….75
CHAPTER 4. RESULTS ………………………………………………………………………………………76
Introduction ………………………………………………………………………………………………76
Data Collection Results………………………………………………………………………………78
Data Analysis and Results ………………………………………………………………………….84
Summary ………………………………………………………………………………………………..141
CHAPTER 5. DISCUSSION, IMPLICATIONS, RECOMMENDATIONS ………………143
Introduction …………………………………………………………………………………………….143
Evaluation of Research Questions ……………………………………………………………..147
Fulfillment of Research Purpose ………………………………………………………………..149
Contribution to Business Problem ……………………………………………………………..152
Recommendations for Further Research ……………………………………………………..153
Conclusions …………………………………………………………………………………………….155
REFERENCES ………………………………………………………………………………………………….157
Statement of Original Work and Signature …………………………………………………………….169
APPENDIX A. INTERVIEW GUIDE ………………………………………………………………….170
List of Tables
Table 1. Seven traits/small to big data comparison ………………………………………………….23
Table 2. Harris and Mehrotra’s analysts and data scientists comparison …………………….37
Table 3. Harris, Murphy and Vaisman list of data science skills ……………………………….38
Table 4. Federal 1500 job series occupations…………………………………………… 48
Table 5. BZC participant criteria…………………………………………………………60
Table 6. Instruments and data collection methods ………………………………………62
Table 7. Initial codes …………………………………………………………………….71
Table 8. Interviewee experience levels…..……………………………………………….80
Table 9. Management focus group experience….…..……………………………………81
Table 10. BZC collected documents…………………….………………………………..83
Table 11. Initial codes (restated)…………………………………………………………..85
Table 12. Analysts’ responses to questions about big data…………………………………………89
Table 13. Analysts’ responses to data usage questions……………………………………90
Table 14. Analysts’ responses to questions regarding data analysis challenges………….92
Table 15. Analysts’ responses further exploring access to quality data…………………..93
Table 16. Analysts’ responses to data usage and data analysis questions…………………95
Table 17. Additional responses to analysis challenges questions…………………………97
Table 18. Additional analysts’ responses to challenges questions………………………..99
Table 19. Analysts’ responses to data science skills questions…………………………..101
Table 20. Analysts’ responses to data science skills and analysis software questions……104
Table 21. Analysts’ responses to training related questions………………………………106
Table 22. Analysts’ responses to data scientists scarcity questions……………………..108
Table 23. Analysts’ responses to data scientist skills and roles questions………………110
Table 24. Managers’ responses to questions about big data…………………………….114
Table 25. Managers’ responses to data usage questions…………………………………116
Table 26. Managers’ responses to questions regarding data analysis challenges……….117
Table 27. Managers’ additional responses to data analysis challenges…………….……119
Table 28. Managers’ responses to data usage and data analysis questions………………120
Table 29. Managers’ responses to analysis challenges………………………………….122
Table 30. Managers’ responses to data science skills questions…………………………124
Table 31. Managers’ responses to data science skills and analysis software questions….125
Table 32. Managers’ responses to training related questions……………………….……127
Table 33. Managers’ responses to data scientists scarcity questions………………………128
Table 34. Managers’ responses to data scientists’ skills and roles questions…………….130
Table 35. Data scientist and BZC Supply Analyst skills comparison……………….…..134
Table 36. Data scientist and BZC Program Management Analyst skills comparison……136
Table 37. Data scientist and BZC Operations Research Analyst skills comparison……..138
Table 38. Data scientist and BZC Computer Scientist skills comparison…………………140
List of Figures
Figure 1. Analysis of big data scholarship ……………………………………………………………….16
Figure 2. Cleveland’s data science taxonomy…………………………………………………………..32
Figure 3. Adaptation of Cleveland’s data science taxonomy ……………………………………..63
Figure 4. BZC case study triangulation …………………………………………………………………..67
Figure 5. BZC case study data analysis process ……………………………………………………….72
Figure 6. BZC potential analyst participants………………………………..……………79
Figure 7. Final hierarchical coding structure……..………………………………………86
Figure 8. Initial analysts interviews word frequency diagram……………………………87
Figure 9. Refined analyst interviews word frequency diagram…………………………..88
Figure 10. Initial management focus group interview word frequency diagram………..112
Figure 11. Refined management focus group interview word frequency diagram.……..113
Figure 12. BZC strategic document word frequency diagram……………………..…….131
Figure 13. BZC job announcements word frequency diagram……………………….….133
Figure 14. Cleveland’s data science taxonomy (restated)……………….……….………..144
Figure 15. Final hierarchical coding structure (restated)……………………….…..……146
Figure 16. Domain and data science assessment model………………………….……….151
CHAPTER 1. INTRODUCTION
Introduction
A seemingly infinite amount of data (big data) has emerged, and its effects are profound
on modern-day corporations and the United States military as they continue to progress through
the information technology age (Ransbotham, Kiron, & Prentice, 2015). The ability to connect
and analyze continuously growing digital data is now essential to competitiveness in most
sectors of the United States economy (Iansiti & Lakhani, 2014). George, Haas, and Pentland (2014) suggested that although there is evidence demonstrating the significant growth in data and its importance for sustainability, there is a gap in published management scholarship providing theory and practices for management. Additionally, growing evidence supports the notion that
the skills required to manage and analyze the exponentially growing size of data are inadequate
and in short supply, with bleak predictions for the future (Harris & Mehrotra, 2014). If a new occupation (data scientist) is truly emerging in the commercial sector because of this exceptional data growth, then understanding how United States Department of Defense (DOD) organizations currently analyze large data sets will help determine whether data scientists are warranted in those organizations. Chapter 1 of this study demonstrates a business problem for both
commercial organizations and the DOD. The general business problem is the lack of effective
analysis in organizations operating in the modern-day big data environment (Harris & Mehrotra,
2014). The specific business problem is that DOD organizations may be struggling with gleaning
actionable information from large data sets compounded by immature data science skills of DOD
analysts (Harris, Murphy, & Vaisman, 2013). This chapter describes the conceptual framework
that supports this study and the rationale, purpose, and significance of the study. The overall significance of this study is to help address the gap in DOD-related scholarly research on big data and data science and to contribute value to scholars and practitioners working on this important business problem.
Gang-Hoon, Trimi, and Ji-Hyong (2014) expressed skepticism about the United States military's ability to adopt the new technologies and philosophies required to leverage meaningful information from large data sets. This research explored big data and data science in the context of the challenges brought on by the enormous data growth being observed in nearly all organizations. The DOD is an extremely large organization, well beyond the ability of one dissertation to effect massive change. This research was supported by a comprehensive literature review of big data and data science application in corporate America as well as in the DOD, and it seeks to provide actionable insights into the requirements of analysts in modern-day organizations and to serve as a catalyst for additional research.
Background
Managing data represents both problems and opportunities, with distinct advantages to organizations that can manage and analyze data (McAfee & Brynjolfsson, 2012). This research investigated how organizational leaders and analysts manage and probe data to make better-informed decisions, offer new insights, and automate business processes, thereby adding value throughout the value chain and creating sustainable competitive advantages (Berner, Graupner,
& Maedche, 2014). Watson and Marjanovic (2013) advocated that although executives are aware
of big data and know of some specific uses, they are often unsure how big data can be used in
their organizations and what is required to be successful. Additionally, Edwards (2014) found that the DOD is experiencing similar data growth, which presents similar problems and opportunities for DOD leaders.
Watson and Marjanovic (2013) suggested big data and data science may not represent
3
something new but are simply the next stage of business analysis as organizations continue to
progress through the information technology age. The fields of business intelligence (BI) and business analytics (BA) are not new, having existed in business for decades, and they were examined in this research. Scholarly researchers agree it is important to understand the
desired connection between raw data and actionable information through the evolution of
business intelligence (BI) and business analytics (BA) (Chen, Chiang, & Storey, 2012). The term intelligence has been used in scientific research since the early 1950s. In the 1970s,
computing technology began providing actionable information to the business world and
companies began utilizing systems to generate information from raw data for management
(Ortiz, 2010). In her seminal book, In the Age of the Smart Machine: The Future of Work and
Power, Zuboff (1988) predicted that information systems would not only automate business processes but also produce valuable information in a unique manner. The field of business
intelligence became popular in the business and information technology (IT) communities, and the idea of business analytics became popular in the 2000s as the key analytics component of business intelligence (Chen et al., 2012). The unquestioned benefit of business intelligence and
business analytics is the ability to capture trends, gain insights, and draw conclusions from the
data generated in support of the business or to gain advantages over the competition and create
sustainable growth (Rouhani, Ashrafi, Zare Ravasan, & Afshari, 2016). Berner et al. (2014)
suggested that with data generation on a sharp incline there are significant gaps in the abilities of
modern-day organizations to leverage big data, and without mitigation, this gap will continue to
grow. The concept of business intelligence means organizations understand their business and the environment in which they operate, thus creating the ability to make smarter decisions. Big data stands to
be a key enabler for business intelligence success (Swain, 2016).
Business Problem
Organizations face rapid data growth, requiring deliberate and strategic action by
leadership to remain competitive and ensure sustainability (Gabel & Tokarski, 2014). For
example, the data-rich, highly competitive airline industry gives a clear advantage to airline
corporations that use big data to drive their strategies and decisions, while punishing those that
do not (Akerkar, 2014). Additionally, corporations such as Amazon are leading the way utilizing
high-powered big data analytics to alter the retail industry (Watson & Marjanovic, 2013). The
airline and retail industries are just two examples of industries that are being reshaped due to
their ability or inability to analyze large data sets and may provide actionable insights for the
DOD.
Ransbotham, Kiron, and Prentice (2015) published a significant research study in the MIT Sloan Management Review that surveyed 2,719 participants in 2014. The participants reported that combining high-level analytical skills with existing business knowledge creates competitive advantages. Phillips-Wren and Hoskisson (2015) suggested big data is stimulating innovation and altering foundational aspects of many business models. Additionally, both of these sources indicate the analysis of big data is proving difficult, as companies struggle to create actionable analytical products and to integrate new analysis into existing decision venues. Ransbotham et al. (2015) proposed that a key constraint preventing analysts from producing actionable information from large data sets is the lack of analytical skills.
The general business problem is the lack of effective analysis in organizations operating
in the modern-day big data environment (Harris & Mehrotra, 2014). The specific business
problem is that DOD organizations may be struggling with gleaning actionable information from
large data sets compounded by immature data science skills of DOD analysts (Harris, Murphy, &
Vaisman, 2013). Symon and Tarapore (2015) proposed the fast-paced evolution of analysis
capabilities in commercial organizations represents great opportunity to address this business
problem for the DOD. Hamilton and Kreuzer (2018) suggested that the amount of data collected by DOD organizations continues to outpace the ability to process and interpret it, and that the ability to glean actionable information from large data sets is crucial to DOD mission success.
Research Purpose
The purpose of this qualitative case study was to explore how DOD employees conduct
data analysis with the influx of big data. An unidentified U.S. Air Force command was selected
by the researcher as the case study organization to support this study. The Bravo Zulu Center
(BZC) pseudonym was applied throughout this research to conceal the identity of the case study
organization. This research explored the emerging commercial data scientist occupation and the skills required of data scientists to help determine whether data science is applicable to the DOD, and it sought to further define those skills to help enable data scientists' effectiveness in modern organizations, with specific emphasis on the DOD. The targeted
population consisted of analysts, managers, or executives working within the Bravo Zulu Center
(BZC). The implication for positive social change includes the potential to identify needed
adaptations in the skills and abilities of analysts and managers working within DOD
organizations that are required to glean actionable information from big data sets. This research
explored data science and the implications associated with the big data phenomenon by
conducting qualitative research with a representative case study organization. This dissertation
explored important skill sets, attitudes, and perceptions of the analysts working big data issues
for the BZC, along with the skill sets, attitudes, and perceptions of management within the same organization. Big data innovations are happening throughout commercial industries; they are transforming foundational aspects of many business models and placing greater demands for
fast-paced innovation (Parmar, Cohn, & Marshall, 2014). This fast-paced evolution of analysis
capabilities in commercial organizations represents great opportunity for the DOD. This research
builds upon several big data and data science constructs documented in contemporary scholarly
literature (Symon & Tarapore, 2015). First, big data represents both potential and liability, with the ability to manage and analyze big data sets likely required for business sustainability (Gobble, 2013). Second, harvesting actionable information from big data sets requires deliberate change in many aspects of organizational design and the management of human resources (Gabel & Tokarski, 2014).
A qualitative research methodology is appropriate for understanding human behavior and
is common in social and behavioral sciences and by scholar practitioners who seek to understand
a phenomenon (Cooper & Schindler, 2013). This type of research involves collecting data
typically in the participants’ settings and inductively analyzing the collected information looking
for themes to provide insight and understanding (Cooper & Schindler, 2013). This research is an
exploration of how big data analysis is accomplished within the DOD and why the rise of large
data sets may generate the need to increase the analytical skills of DOD employees, making a qualitative research methodology most appropriate.
Research Questions
The objective of this research was to develop an understanding of how DOD analysts
respond to, probe and assimilate data in big data environments to help determine if a data science
occupation is justified and warranted in the DOD. The following research questions guided the
study:
Primary Research Question 1: How does the Bravo Zulu Center glean actionable
information from big data sets?
Primary Research Question 2: How mature are the data science analytical skills,
processes, and software tools used by Bravo Zulu Center analysts?
Rationale
The principal rationale for furthering knowledge of the big data phenomenon and data science through a qualitative case study is the need to view big data analysis through a humanist lens instead of an information systems technology lens (McAfee & Brynjolfsson, 2012). Managing big data requires senior decision makers to embrace data-driven
decisions and this will require a cultural change in many organizations (Gabel & Tokarski,
2014). Even though there are researchers who stress the importance of big data capability, there is
no consensus on how best to re-align and organize modern-day organizational models to support
big data efforts (Grossman & Siegel, 2014). Additionally, Brynjolfsson and McAfee (2012)
suggested there is a lack of understanding by all levels of management regarding the value of big
data and the changes required to harness the power of big data. Management may need to invest
in data scientists who can manage and manipulate large data sets and turn this raw data into
meaningful information. Unfortunately, organizations and academia may be struggling to define the skill sets of these so-called data scientists (Harris et al. 2013). Gabel and Tokarski (2014) advocated that data capture is increasing sharply and that businesses and organizations would like to realize the competitive advantages contained in the use of this tremendous amount of
data. Digital data is driving foundational changes in personal lives, business, academia, and
functions of government. The analysis of big data promises to reshape everything from
government, international development, and even how we conduct basic science (Gobble, 2013).
DOD organizations are generating massive amounts of information from activities along their
value chains. There has been a dramatic increase of embedded sensors into modern-day weapon
systems that is compounding the data growth (Hamilton & Kreuzer, 2018).
Moorthy et al. (2015) suggested there is potential in nearly all industries regarding the
impact of turning vast amounts of raw data into meaningful information. Additionally, turning
large raw data sets into meaningful information will require deliberate and strategic action
(Galbraith, 2014). Warehousing data is problematic, expensive, and time consuming and creates
alignment difficulties in modern organizations (Gabel & Tokarski, 2014). Davenport and Patil
(2012) submitted that the skills required to turn large amounts of raw data into meaningful information are in high demand and in short supply. The technology for producing data has
evolved greatly but the skills and software tools required to analyze large data sets have been
lagging (Gobble, 2013). Additionally, the DOD has declared they have a scarcity of data
scientists. According to the Deputy Assistant Secretary for Defense Research, data scientists are
in short supply and are becoming the most in-demand job for the U.S. military (Hoffman, 2013).
There are experts suggesting there is a data analysis skills shortfall, especially for analysts who
have the talent to create predictive analytical products utilizing statistics, artificial intelligence,
and machine learning (Davenport & Patil, 2012).
Conceptual Framework
The conceptual framework serves as the foundational knowledge to support the research
study. This framework serves to guide the research by relying on formal theory, which supports
the researcher’s thinking on how to understand and plan to research the topic (Grant & Osanloo,
2014). William S. Cleveland (2001) coined the term data science in the context of enlarging the
major areas of technical work in the field of statistics. Cleveland’s seminal work described the
requirement of an “action plan to enlarge the technical areas of statistics focuses of the data
analyst” (Cleveland, 2001, p. 1). Cleveland described a major altering of the analyst occupation
to the point a new field shall emerge and will be called “data science” (Cleveland, 2001, p. 1).
The plan of six technical areas that encompass the field of data science described by Cleveland
include multidisciplinary investigations, models and methods for data, computing with data,
pedagogy, tool evaluation, and theory. The primary catalyst for Cleveland’s declaration of the six
technical areas was to act as a guideline for the percentage of overall effort a university or governing organization should apply to each technical area to begin to define curriculum for the development of future data scientists; this plan was adapted to support this research (Cleveland, 2001).
Significance
DISA (2015) suggested the capability to leverage meaningful information from big data
is important to the DOD. However, there are also researchers who suggest there are significant
shortfalls in the abilities of complex organizations to fully employ business intelligence
techniques on extremely large data sets (Harris & Mehrotra, 2014). In June 2014, the Office of
Naval Research published a request to commercial and DOD industries for white papers and full
proposals on how to use big data for real insight (McCaney, 2014). The overall objective was to
achieve unprecedented access to data with deeper insights by examining the data in new and
innovative ways (McCaney, 2014). Additionally, in March of 2015 the Defense Information
Systems Agency (DISA) published a request for information regarding infrastructure
development to support potential big data and governance solutions. This request specifically sought examples of commercially developed solutions that were more efficient than current DOD
solutions (DISA, 2015). The desired significance of this research was to develop an understanding of the skills required of modern-day analysts and to help determine whether a data scientist occupation is justified and warranted in the DOD.
Definition of Terms
Big Data is characterized as “datasets that are too large for traditional data processing
systems and that therefore require new technologies” (Provost & Fawcett, 2013, p. 54).
Big Data is characterized by “extremely high volume, velocity, and variety (commonly
referred to as the “3 Vs”). It also exceeds the capabilities of most relational database
management systems and has spawned a host of new technologies, platforms, and approaches”
(Watson & Marjanovic, 2013, p. 5).
Big Data Analytics: “Analytical techniques in applications that are so large (from
terabytes to exabytes) and complex (from sensor to social media data) that they require advanced
and unique data storage, management, analysis, and visualization technologies” (Chen et al.
2012, p. 1165).
Data Scientist Definition #1 is a seasoned professional with the training, skills, and
curiosity to discover new insights in the era of big data (Davenport & Patil, 2012).
Data Scientist Definition #2 is someone who is better at statistics than a computer scientist and better at programming than a statistician (Baskarada & Koronios, 2017).
Assumptions and Limitations
The goal of this qualitative case study was to explore how DOD employees conduct data
analysis with the influx of big data. This research explored the emerging commercial data
scientist occupation and the skills required of data scientists to help determine if data science is
applicable to the DOD. The ability to generalize conclusions to a larger population is a potential
limitation of qualitative research (Cooper & Schindler, 2013). A potential limitation of this study
is the ability to draw conclusions on an organization as large and complex as the DOD. The
following were the assumptions and limitations within this study.
Assumptions
The sample in this study was limited to a small number of DOD analysts and managers
within one organization. The research findings are not meant to be representative of the entire
population of DOD analysts and managers but are meant to be a catalyst for additional
quantitative research and analysis. Responses from the analysts and the managers were based
upon their own experiences and perceptions and are not meant to be representative of the entire DOD population.
Limitations
There were some limitations to qualitative data collection, primarily because of the
subjectivity and biases inherent to each participant and the researcher (Cooper & Schindler,
2013). The researcher purposively selected an organization within the DOD that is responsible for large data sets and is experiencing the big data phenomenon to provide supporting documents, research literature, and the case study. A potential limitation was the researcher’s bias due to his long DOD
career. The researcher is a career U.S. Navy employee and purposively avoided U.S. Navy
organizations to prevent bias. All the data collected in support of this research will be retained
for seven years and then destroyed personally by the researcher via a crosscut shredder for
documents and via an approved data destruction program for digital recordings.
Organization for Remainder of Study
This study is organized into five chapters and the basis of Chapter 1 was to identify the
purpose, reasoning, and intent of this doctoral research. The research in support of Chapter 1
demonstrated a clear business problem regarding the challenges associated with the big data
phenomenon and lack of defining skills for DOD analysts and proposed that the DOD is suffering from this business problem (Gobble, 2013). Chapter 2 contains a literature review with
explanations on how this study differs from previous research. Chapter 3 describes the
methodology and research design employed in this study. Additionally, the data collection
method(s) are described to include the data analysis, credibility, dependability, and ethical
considerations (Moustakas, 1994). Chapter 4 presents the data analysis and findings and Chapter
5 presents a discussion of the results, conclusions, and recommendations for further research.
CHAPTER 2. LITERATURE REVIEW
The evidence is clear: forward-acting leaders manage and harness insights from data to gain sustainable competitive advantages (Iansiti & Lakhani, 2014). Additionally, there is clear
evidence that there are big data problems emerging due to the disproportionate growth between
collected data and the abilities of most organizations to analyze the data (Géczy, 2015). The
general business problem is the lack of effective analysis in organizations operating in the
modern-day big data environment (Harris & Mehrotra, 2014). The specific business problem is
that DOD organizations may be struggling with gleaning actionable information from large data
sets compounded by immature data science skills of DOD analysts (Harris et al. 2013).
Additionally, the amount of data being collected and requiring analysis is on a sharp increase for
the DOD. Porche III, Wilson, Johnson, Erin-Elizabeth, and Tierney (2014) commented that as little as 5% of all data collected in the U.S. Navy and Air Force’s intelligence, surveillance, and reconnaissance missions received analytical interpretation; U.S. military data analysts are
overwhelmed. Additionally, substantial research is underway to determine how big data volumes
can create value for individuals, community organizations and governments (Gobble, 2013). In
response to concern regarding extreme data growth and its impact on modern day businesses and
society, several scholarly journals have been created just in the past few years which are bringing
scholars and practitioners together to research and report on the growing big data business
problem and data sciences (Frizzo-Barker, Chow-White, Mozafari & Dung, 2016). For example,
the Big Data Analytics, Big Data & Society, and the EPJ Data Science Journals have all been
founded since 2012.
The objective of this research was to develop an understanding of how DOD analysts
respond to, probe and assimilate data in big data environments to help determine if a data science
occupation is justified and warranted in the DOD. The following research questions guided the
study:
Primary Research Question 1: How does the Bravo Zulu Center glean actionable
information from big data sets?
Primary Research Question 2: How mature are the data science analytical skills,
processes, and software tools used by Bravo Zulu Center analysts?
This chapter describes the processes used to explore big data and data sciences and
identifies and describes research studies that have been completed regarding this important
business problem in commercial business as well as the DOD. This chapter is the result of a
comprehensive review of the pertinent scholarly and practitioner literature surrounding big data
and data sciences and is foundational for a qualitative methodology and case study research
design.
Conceptual Framework and Research Design
The conceptual framework that serves as the foundational knowledge to support this
research study is the work of William S. Cleveland (2001). This seminal research introduced the
term data science in the context of “expanding the technical areas of the field of statistics.” This
seminal work described the requirement of an “action plan to enlarge the technical areas of
statistics focuses of the data analyst” (Cleveland, 2001, p. 1). Cleveland described a major
altering of the analyst occupation to the point that a new field shall emerge called “data science”
(Cleveland, 2001, p. 1). Cleveland’s data science taxonomy directed universities to develop six
technical areas, allocate resources appropriately to research, and develop curriculum within these
technical areas. Additionally, Cleveland recommended a data science action plan that could be
adapted for research by government and corporate organizations. Since Cleveland (2001) there
have been many researchers advancing the field of data science through theories and methods.
However, a widely accepted academic definition of data science, including the skills required of data scientists and how best to employ them in modern big data environments, has yet to be provided (Viaene, 2013). Conversely, there are scholars conducting scientific research
further defining the data science occupation and there are universities that have developed
curriculum to educate data scientists (Cotter, 2014). The lack of a definition regarding data
science and the potential shortage of these professionals coupled with the rapid data growth in
DOD data systems presents a key issue for the DOD.
As described by Moustakas (1994), qualitative research is an approach to explore how
groups or individuals perceive a specific phenomenon or problem. This type of research involves
collecting data typically in the participants’ settings and inductively conducting analysis of the
collected information looking for themes to provide insight and understanding (Moustakas,
1994). A qualitative research design utilizing a single embedded case study organization is
appropriate for this research and the Bravo Zulu Center agreed to participate as the case study
organization.
Gap in Literature
Although there is a tremendous amount of literature with researchers investigating the
implications with big data sets and data science, there is a gap in published scholarly literature
regarding big data and data sciences related specifically to the DOD. Frizzo-Barker et al. (2016)
conducted a systematic review of the big data business scholarship published between the years
2009-2014. These researchers analyzed 219 papers from 152 relevant academic journals and
concluded big data research and theory is fragmented and in “early state of domain of research in
terms of theoretical grounding, methodological diversity, and empirical evidence” (Frizzo-
Barker et al. 2016, p. 1). Frizzo-Barker et al. (2016) examined key elements as to the types and
sheer volume of published big data research as well as to the aspects of big data problems and
opportunities examined in contemporary big data research. Frizzo-Barker et al. (2016) examined
the types of industries and organizations being analyzed through big data research and concluded
most research can be categorized as either business in general or financial and management.
These researchers categorized any research regarding big data and the DOD into the law and
governance category, which made up 17% of the total big data research reviewed, suggesting a significant gap exists in big data research associated with the DOD, as seen in Figure 1.
Figure 1. Analysis of Big Data Scholarship. Adapted from “An Empirical Study of the Rise of
Big Data in Business Scholarship,” by J. Frizzo-Barker, P. Chow-White, M. Mozafari, and H. Dung, 2016, International Journal of Information Management, 36(3), p. 410. Copyright 2016 by
Elsevier. Reprinted with permission.
Additionally, there is an abundance of contemporary big data research regarding the
technological advances enabling the big data phenomenon and much less surrounding the human
and data science implications associated with big data. In fact, there appears to be a gap in
published scholarly literature that tackles the human implications associated with big data and
data sciences and this gap is the focus of this research. There appears to be many opportunities to
explore new theories and practices that may evolve regarding the management of big data and
the evolution and application of data science (George et al. 2014).
The Big Data and Data Science Buzz
Without question, the term big data and its associated literature experienced a sharp increase over the past decade. In Young’s (2014) dissertation regarding big data and healthcare, Young cited a 2013 Google search on the term big data that yielded 9.1 million hits. I executed the same Google search in December 2017, and it returned 343 million hits; I executed the search again in August 2018, and it returned 824 million hits.
Additionally, there is a plethora of both scholarly and secondary literature surrounding big data
and data science and this literature review was the product of the examination of hundreds of
writings regarding these topics. This literature review focused on the perceived benefits and
liabilities of big data and the implications for analysts in modern organizations responsible for
capturing meaningful information from the data. Specifically, what actions and emerging requirements face the people responsible for analyzing data because of the arrival of large amounts of data, and, secondly, is the notion of a data scientist warranted? Additionally, this literature
review focused on supported evidence of successful big data application by commercial
organizations to aid the DOD regarding their initiatives to harness big data.
A continually growing interest from mainstream media and research firms is contributing to the message regarding data sciences. The research firm Glassdoor ranks occupations based upon current job openings, salaries, career opportunities, and job satisfaction. This organization ranked data scientist as the top job in the
United States for 2016, 2017, and 2018 and indicated a data scientist could expect to earn an
annual salary of $110,000 (Columbus, 2018). In this example, a major research firm on job
occupations in the United States declared data scientist the top profession, and yet, as this literature review highlights, the DOD has not determined how and whether data scientists are needed. Additionally, an often-cited report by Manyika et al. (2011) suggested a shortfall of analytical and managerial talent in the United States in the range of 140,000 to 190,000 people by 2018.
The well-published big data researchers Thomas Davenport and D. J. Patil not only agreed with the shortfall projection but also labeled data scientist the “sexiest” job of the 21st century (Davenport & Patil,
2012, p. 1). Conversely, Fox and Do (2013) advocated there may be too much hype regarding
big data and its potential impacts. These researchers indicated the term big data is too vague and
this vagueness is causing prioritization problems for organizations. These researchers suggest
that increasing data both in size and complexity has been on-going since the mid-1990s and it
does not represent a new problem (Fox & Do, 2013). Comparing literature between researchers
such as Davenport and Patil (2012) who claimed big data and data science is having profound
effects on most industries and researchers such as Fox and Do (2013) who proposed that big data
is not new demonstrates this is an on-going debate that requires further research.
The term data scientist gained significant notoriety and momentum in 2008, when D. J.
Patil and Jeff Hammerbacher were leading the analytical efforts at Facebook and LinkedIn
(Davenport & Patil, 2012). Data scientists are professionals at gleaning actionable information
from large amounts of data. Data scientists use traditional math, science, and statistical techniques
along with modern analysis software to glean actionable information from large data sets
(Davenport & Patil, 2012). Furthermore, the term data scientist received a great amount of
popular press when D. J. Patil went on to be appointed by President Obama as the first Chief
Data Scientist at the White House (Smith, 2015). D.J. Patil served in this capacity under
President Obama from 2015-2017. The following comprehensive review of the existing scholarly
and practitioner literature explores the potential and effects of big data and seeks to document the
implications and requirements of today’s business leaders and understand the growing
importance of data science.
Big Data Defined
There is clear evidence demonstrating that a big data phenomenon is underway, but the full ramifications of big data, how prepared the human element is, and the full significance of the phenomenon are less clear. There are scholarly researchers suggesting the
arrival of big data includes cultural, technological, and scholarly impacts (George, Haas, &
Pentland, 2014). Conversely, there are some influential researchers, such as Watson and
Marjanovic (2013), that indicate big data may not represent something new but is simply the next
phase of digitization as societies continue to progress through the information age. Beer’s (2016)
theoretical framework suggested there is very little understanding of the concept of big data, such as where the term came from, how it is used, and how it lends authority, thereby further conceptualizing the big data phenomenon and allowing for actionable research and theory.
Schneider, Lyle, and Murphy (2015) indicated the growing conversation of big data is a very
relevant conversation to the DOD due to the extreme data growth and data capture by DOD
activities coupled with indications the data growth trends will continue for the near future.
Big data has become a ubiquitous term with no single unified definition. A commonly
cited explanation describes big data “as the collection of data sets so large and complex that it
becomes difficult to process using traditional relational database tools and traditional data
processing applications” (Moorthy et al. 2015, p. 76). The origin of the term big data is
debatable; however, this term has been around since at least the 1990s. Several authors give
some credit to John Mashey, who in the 1990s was a chief scientist working at Silicon Graphics
Inc., responsible for developing methods for the management of large amounts of computer
graphics. Mashey gave hundreds of presentations to small groups in the 1990s to explain the
concept of an extremely large amount of data capture coming quickly with profound impacts
(Lohr, 2013).
Several researchers, such as Watson and Marjanovic (2013), placed big data on an evolutionary scale and depict the big data phenomenon as the fourth generation of the information age. Decision support systems (DSS), born in the early 1970s, constituted the first generation. The 1990s brought the era of enterprise data warehousing, in which businesses aggregated their data from many disparate data sources and field locations into a single warehouse or warehouses. The third generation arrived in the early 2000s, in which
senior leaders and managers gained near- and real-time access into these data warehouses and invested heavily in the business intelligence layers built on top of these data sets to gain powerful and competitively attractive insights into their value chains. Finally, the big data era is
creating a fourth generation that promises to be a catalyst for major change and innovation in
nearly all industries (Watson & Marjanovic, 2013).
The Size of Big Data
The amount of data collection globally is growing rapidly and modern organizations are
capturing massive amounts of data on activities up and down their value chains. Additionally,
millions of networked sensors are being embedded into machines creating a hugely data rich
environment. This exponential growth in data is underway in nearly all sectors of the U.S.
economy and businesses are simply collecting more data than they can manage (McAfee &
Brynjolfsson, 2012). There are several researchers and organizations studying the amount of data
generated and providing predictions of massive growth in the decade ahead. One common
resource cited in modern literature surrounding big data is the Digital Universe research project
sponsored by the EMC Corporation (Turner, Reinsel, Gantz & Minton, 2014). This project seeks
to define how big the big data expansion is today and provides predictions of data growth into
the next decade. According to the Digital Universe, data generation and collection will double
every two years and by 2020, the size of stored digital data will reach 44 trillion gigabytes. To help put this into context, if this amount of data were stored in a stack of tablet computers, such as an iPad™, there would be 6.6 stacks of tablets equal to the distance from the Earth to the Moon
(Turner et al. 2014).
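These growth figures can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions not stated in this dissertation: the Digital Universe report's 2013 baseline of 4.4 trillion gigabytes, a 128 GB tablet roughly 7.5 mm thick, and an Earth-to-Moon distance of 384,400 km.

```python
import math

# Assumed Digital Universe figures: 4.4 trillion GB stored in 2013,
# 44 trillion GB projected for 2020 (tenfold growth over seven years).
start_gb, end_gb, years = 4.4e12, 44e12, 2020 - 2013

# Implied doubling time from the annual growth factor.
annual_factor = (end_gb / start_gb) ** (1 / years)
doubling_years = math.log(2) / math.log(annual_factor)
print(f"Implied doubling time: {doubling_years:.1f} years")  # ≈ 2.1 years

# Tablet-stack comparison, assuming a 128 GB tablet about 7.5 mm thick.
tablets = end_gb / 128                   # number of tablets needed
stack_m = tablets * 0.0075               # total stack height in meters
moon_stacks = stack_m / 384_400_000      # Earth-to-Moon distances covered
print(f"Stacks to the Moon: {moon_stacks:.1f}")
```

Under these assumptions the implied doubling time comes out near two years and the stack count lands close to the report's 6.6 figure, suggesting the cited numbers are internally consistent.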
The Three V’s Revised
There are many assumptions and perplexities regarding big data definitions. If all
organizations generate data, what constitutes big data? Additionally, because big data is a term
with different meanings it creates difficulties when determining solution paths regarding big data
efforts (Watson & Marjanovic, 2013). Attempting to define a taxonomy on which to conduct big
data research is a common theme in contemporary big data literature (Beer, 2016). In 2001,
Douglas Laney of META Group authored what is now considered a foundational white paper
regarding data management and provided a context upon which the big data phenomenon could
be described. Even though there is no consensus on the amount of data that constitutes big data,
the impact of big data could be described through the constructs of volume, velocity, and variety
(Phillips-Wren & Hoskisson, 2015). Although an exact and wide-spread definition of big data
has not been commonly agreed to, examining the data growth through Laney’s definition is very
commonly cited in the literature. Laney described the three V’s in the context of the amount and size of the data (volume), the rate at which data is produced (velocity), and the range of different formats in which data is being generated and delivered (variety) (Phillips-Wren & Hoskisson, 2015).
Kitchin and McArdle (2016) suggested Laney’s traditional view of big data using the
three V’s lacks ontological clarity. Ontological clarity would define the concepts, categories and
properties of big data and the relationships between them (Kitchin & McArdle, 2016). The use of
the three V’s to describe big data is a useful entry point but only describes a broad set of issues
associated with big data, vice providing further definition and practicality of big data (Kitchin &
McArdle, 2016). Additionally, Kitchin and McArdle (2016) aggregated and submitted several
important and new qualities and attributes of big data, suggested by several contemporary big
data researchers, to include the following:
“Exhaustivity. The entire system is captured, n=all, rather than being sampled.
Fine-grained. Resolution and uniquely indexical (in identification).
Relationality. Data contains common fields that enable the conjoining of different
datasets.
Extensionality. Data is added and changed easily.
Scaleability. The ability for data to expand in size rapidly.
Veracity. Data can be messy, noisy and contain uncertainty and error.
Value. Many insights can be extracted and the data repurposed.
Variability. Data can be constantly shifting in relation to the context in which they are
generated” (Kitchin & McArdle, 2016, p. 1).
Kitchin and McArdle (2016) explored ontological characteristics of 26 datasets to
provide a more actionable definition of big data. These researchers developed a taxonomy of
seven big data traits and then applied these traits against 26 data sets that were considered to
meet current definitions of big data. Kitchin and McArdle (2016) significantly added to Laney’s
foundational definition of big data and demonstrated big data is qualitatively different to
traditionally small data sets along seven axes as seen in Table 1.
Table 1

Kitchin and McArdle’s Seven Traits and Small to Big Data Comparison

Trait | Small Data | Big Data
Volume | Small or limited to large | Very large
Velocity | Slow, freeze-framed, or bundled | Fast, continuous
Variety | Limited in scope to wide ranging | Wide
Exhaustivity | Samples | Entire populations
Resolution and indexicality | Coarse and weak to strong and tight | Tight and strong
Relationality | Weak to strong | Strong
Extensionality and scalability | Low to middling | High

Note. Adapted from “What makes big data, big data? Exploring the ontological characteristics of 26 datasets,” by R. Kitchin and G. McArdle, 2016, Big Data & Society, 3(1). CC 2016 by Sage Publishing.
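Kitchin and McArdle's seven-trait comparison can also be expressed as a simple lookup structure. The sketch below is purely illustrative: the trait names and descriptions come from Table 1, while the encoding and the `describe` helper are my own.

```python
# Kitchin and McArdle's (2016) seven traits, encoded as
# (small data, big data) description pairs.
TRAITS = {
    "volume": ("small or limited to large", "very large"),
    "velocity": ("slow, freeze-framed, or bundled", "fast, continuous"),
    "variety": ("limited in scope to wide ranging", "wide"),
    "exhaustivity": ("samples", "entire populations"),
    "resolution and indexicality": ("coarse and weak to strong and tight", "tight and strong"),
    "relationality": ("weak to strong", "strong"),
    "extensionality and scalability": ("low to middling", "high"),
}

def describe(kind: str) -> dict:
    """Return the full trait profile for 'small' or 'big' data."""
    idx = {"small": 0, "big": 1}[kind]
    return {trait: values[idx] for trait, values in TRAITS.items()}
```

For example, `describe("big")["exhaustivity"]` yields "entire populations", capturing the n=all quality that distinguishes big data from sampled small-data sets.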
Big Data Benefits
The traditional analytics environment that exists in most organizations today includes transactional systems that generate data and data warehouses that store the data; data warehouses are thus collections of federated data marts. A set of business intelligence and analytics tools aids decision-making through queries, data mining, and dashboards. Typical dashboards drill from top-level key performance indicators down through a wide range of supporting metrics and detailed data (Davenport, Barth, & Bean, 2012).
Almeida (2017) suggested the primary purpose of big data analysis is to improve
business processes through greater insights and better decision making. Understanding how to
leverage increasing amounts of data is crucial for business success in the modern environment.
This researcher conducted an in-depth literature review of published works between the years
2012-2017 and determined that big data analysis is a growing theme of importance in big data
research (Almeida, 2017). Additionally, research published in the Harvard Business Review by
McAfee and Brynjolfsson (2012) was a study encompassing 330 large North American
companies and consisted of structured interviews with executives spread across these
organizations. The researchers gathered information in interviews about the companies’
organizational management and technology strategies and collected information from annual
reports and independent sources. The primary purpose of McAfee and Brynjolfsson's study was to investigate whether exploiting vast new flows of information in the era of big data could radically improve performance. The researchers suggested the era of big data is a revolution because companies can measure, and therefore manage, activities up and down their value streams more precisely than at any time in the past. McAfee and Brynjolfsson concluded that top performing companies using data-driven decision-making supported by analytical software were on average "5% more productive and 6% more profitable," suggesting companies can and do build competitive advantages through big data analysis (McAfee & Brynjolfsson, 2012, p. 64).
Additionally, according to Davenport and Dyché (2013) the analysis of data to provide insight
into the organizations’ value chain is not a new concept. However, most businesses are just now
starting to strategize the potential benefits of big data analysis and how best to implement big
data analysis into their traditional business intelligence architectures. Corporations such as
Yahoo, Google, Wal-Mart, and Amazon are clearly leading the way regarding big data
management and analysis. However, for most companies the ability to manage large data sets to
the extent of these leading corporations requires strategic planning and action (Watson &
Marjanovic, 2013). Prominent researchers such as Davenport and McAfee clearly demonstrate
there is value to companies that can analyze big data sets and may provide actionable theory for
the DOD. Hoffman (2013) suggested that although the DOD has been warehousing and
analyzing data for several decades, they, too, require strategic change to leverage information in
the era of big data. Leveraging big data through analysis is a high priority for the U.S. military; however, some researchers suggest the DOD's ability to analyze its data is not keeping pace with the amount of data being collected (Hoffman, 2013). Much of the expectation surrounding big data analysis lies in the continued desire of companies and the DOD to move from reactionary metrics based on historical data to the predictive and prescriptive metrics that may be possible with big data analysis. Research on big data and data science suggests that hidden facts, indicators, and relationships immersed in big data sets remain to be uncovered (Chen et al., 2012).
DOD and Big Data
The amount of data collection across the DOD has been increasing at a fast pace and the
demands from the warfighters to make well-informed decisions from massive amounts of data
are critical (Hamilton & Kreuzer, 2018). Edwards (2014) suggested big data insights are now an
essential requirement for modern warfare and military organizations need to use advanced
analytics to take advantage of their massive amounts of data and avoid over saturation from the
data. The notion that the DOD is aware of its growing data challenge is well documented. However, it is less clear just how large the data growth in DOD information systems is and how prepared the DOD is to handle big data. The purpose of this exploratory qualitative case study
was to explore how DOD employees conduct data analysis with the influx of big data. This
research will explore the emerging commercial data scientist occupation and the skills required
26
of data scientists to help determine if data science is applicable to the DOD. By conducting a
comprehensive literature review as to the perceptions of big data and data science there are
potential benefits to the DOD.
DOD Big Data Initiatives
Although Frizzo-Barker et al. (2016) suggested there is a gap in the big data literature for U.S. government organizations, the U.S. defense industry appears energized by the potential of big data and big data analysis. The DOD is reaching out to commercial industries for assistance
and advice (Konkel, 2015). Cyber defense and situation awareness initiatives appear to be in the
forefront of the department's initiatives. Many of the big data projects underway within the DOD are aimed at advancing military intelligence, surveillance, and reconnaissance (ISR) systems (Costlow, 2014). Porche et al. (2014) compiled several formal research projects requested by the U.S.
Navy to investigate the huge data growth and provide any potential ways forward. The amount of
ISR data collected by the U.S. Navy has become overwhelming with no end in sight. These
researchers explained the U.S. Navy is only able to analyze approximately five percent of the
data it collects from its ISR platforms (Porche et al. 2014). Additionally, several researchers
from the U.S. Navy’s postgraduate school collaborated on Big Data and Deep Learning for
Understanding DOD Data (2015) further expounding on the big data problem for the DOD with
specific research to help determine if big data and data science are really something new or just
the next progression in information technology analysis. These researchers explained that
applications including traditional numerical analysis, statistics, machine learning, data mining,
business intelligence, and artificial intelligence are migrating into a common term called big data
analytics (Zhao, MacKinnon, & Gallup, 2015).
The U.S. Air Force (USAF) is also struggling with the demands for ISR data collection
and analysis as the requirement for these types of missions continues to increase. In Data Science and the USAF ISR Enterprise (2016), the USAF Deputy Chief of Staff for Intelligence, Surveillance and Reconnaissance released a publicly available white paper that emphasized the U.S. Air Force's big data growth and the opportunities for data science. The U.S. Air Force is experiencing exponential data growth and increasing demands on analysts, and data science is a key element in unlocking big data for the U.S. Air Force ISR community (USAF, 2016). This white paper described three specific conditions that exist today
that indicate a lack of big data analysis. First, even though there is exponential growth in data, only a limited set of data is analyzed due to the lack of integration and connectedness. Second, there is an inability to dynamically correlate and cross-reference data vertically through organizations and horizontally across mission areas. Third, there is a shortage of streamlined processes to coordinate, combine, and disseminate data to other participating organizations (USAF, 2016). In this writing, the U.S. Air Force clearly acknowledged a big data
and data science problem and is requesting additional research to understand the impacts of
leveraging data scientists. This research suggested big data specialists should take the lead in researching and understanding the data science methods and approaches that would be instrumental in advancing the field of data science across the U.S. Air Force (USAF, 2016).
Another recent big data and data science initiative suggests the DOD is strategically
making efforts to analyze big data streams aimed at improving personnel readiness.
Strengthening Data Science Methods for Department of Defense Personnel and Readiness
Missions (2017) is a publicly available and comprehensive report sponsored by the DOD. The
report requests the National Academies of Science, Engineering, and Medicine to collaborate on
and provide recommendations on how the Office of the Under Secretary of Defense (Personnel
& Readiness) could use the field of data science to improve the effectiveness and efficiency of
their critical mission. Specifically, the request was to develop an implementation plan for the
integration of data analytics into the DOD decision-making processes. A major theme in this report is to further the development of advanced analytics and strengthen data science
education. A skilled workforce that can apply contemporary advances in data science
methodologies is critical. Furthermore, this research study concluded that based upon similar
research conducted in other mature organizations this portion of the DOD’s depth, skills, and
overall resources in data analytics is insufficient. Having small pockets of data science expertise
is not sufficient and the DOD should seek to raise the overall general level of awareness and
skills to become more effective. Simply stated, new data science skills are critically needed in
the DOD workforce (National Academies Press, 2017). The U.S. Army also has several big data initiatives underway, declaring that big data analysis has arrived and is here to stay. The Commander's Risk Reduction Dashboard (CRRD) is an initiative that integrates a variety of personnel data from several data sources. The CRRD relies on big data analysis to inform local commanders and higher echelon commands of personnel who might be at higher risk of suicide (Schneider et al., 2015). By examining current and publicly available literature from the U.S.
Navy, U.S. Air Force, and the U.S. Army there are distinct big data and data science projects on-
going. Many of the projects are championed by senior officers who have expressed concern
regarding the abilities of DOD organizations to analyze big data sets. Additionally, it is also clear
the DOD is interested in examining the big data and data science practices of commercial
organizations and to leverage these advances across DOD organizations to support national
defense strategies.
Big Data Challenges
According to Watson and Marjanovic (2013) the challenge with harnessing the power of
big data includes identifying which sectors of data to exploit, getting data into an appropriate
platform and integrating across several platforms, providing governance, and getting the people
with the correct skill sets to make sense of the data. There is evidence this fundamental problem
resides within the DOD as well. Analyzing big data within the DOD requires feeding data from hundreds of organizations, which in turn requires defining the legal, policy, oversight, and compliance standards for data sharing to make it happen (Edwards, 2014). To make
effective use of big data within the DOD requires an investment of time and money as well as
finding the correct talent to do the analysis. Locating the people within DOD as well as bringing
in analysts from outside the DOD to successfully conduct big data analysis is a major challenge
(Edwards, 2014). Schneider, Lyle, and Murphy (2015) categorized the primary challenges
associated with big data specifically for the DOD and listed the ability to analyze and interpret
the data as a primary concern. Furthermore, these researchers suggested that incentivizing analysts to remain loyal to the DOD may be one of the biggest challenges the DOD will face with big data analysis.
White House Big Data Strategy
Another example that the U.S. Government is acting on big data and data science is the
White House’s big data strategy. In March 2012, the Obama administration published the Big
Data Research and Development Initiative with specific implications for six federal departments
or agencies, including the DOD. The intent of the initiative is to build an innovation ecosystem that enhances the ability to analyze, extract, and make decisions from large and diverse data sets, enabling Federal agencies to better support the entire nation based upon data (White House,
2012). One of the specific initiatives was to expand the workforce needed across federal agencies
to develop and use big data technologies. The DOD portion of the big data initiative focuses on
three areas: data to decisions, autonomy, and human systems. The data-to-decisions aspect of this initiative is to develop computational techniques and software tools for analyzing large amounts of data (White House, 2012). Stemming from the White House big data initiative, the Federal Big
Data Research and Development Strategic Plan (2016) was promulgated. The Big Data Steering
Group reports to the Subcommittee on Networking and Information Technology Research and
Development (NITRD) and published their report through the direction of the Executive Office
of the President, National Science and Technology Council. The plan promulgates seven detailed strategies, with strategy number six directly related to the business problem and research questions that chartered this research with the BZC.
Strategy 1: “Create next generation capabilities by leveraging emerging Big Data foundations,
techniques, and technologies” (White House, 2016, p. 6).
Strategy 2: Support R & D to explore and understand…
Strategy 3: Build and enhance research cyber infrastructure…
Strategy 4: Increase the value of data through policies that promote sharing…
Strategy 5: Understand big data collection, sharing, regarding …
Strategy 6: “Improve the national landscape for big data education and training to fulfill
increasing demand for both deep analytical talent and analytical capacity for the broader
workforce” (White House, 2016, p. 29).
Continue growing the cadre of data scientists
Expand the community of data-empowered domain experts
Broaden the data-capable workforce
Improve the public’s data literacy
Strategy 7: “Create and enhance connections in the national big data innovation ecosystem”
(White House, 2016, p. 34).
The NITRD’s supplement to the fiscal year 2018 President’s budget indicates the Federal Big
Data Research and Development Strategic Plan (2016) is still an active plan under President
Trump (White House, 2018).
Data Sciences
Similar to searching the term big data, a review of both scholarly and gray literature regarding data science and data scientists returns a plethora of literature. There is
evidence suggesting the term data science has been around for decades. However, many scholars
credit William S. Cleveland (2001) with introducing the term data science in the context of
enlarging the major areas of technical work in the field of statistics. This seminal work described the requirement of an "action plan to enlarge the technical areas of statistics focuses on the data analyst" (Cleveland, 2001, p. 1). Cleveland described how, due to increasing collections of data, the analysis occupation would be altered to the point that a new field would emerge, to be called "data science" (Cleveland, 2001, p. 1). The plan's six technical areas encompassing the field of data science include multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory (Figure 2). The primary catalyst for Cleveland's
declaration of the six technical areas was to act as a guideline for the percentage of the overall
effort a university or governing organization should apply to each technical area to begin to
define curriculum for the development of future data scientists (Cleveland, 2001). The focal point of this research is to understand and document the current environment surrounding the required skills for big data analysis and, additionally, to explore the call for data science as described by Cleveland and further the body of knowledge regarding the progression of the data science occupation, with specific emphasis on the DOD.
Figure 2. Cleveland's Data Science Taxonomy. Adapted from "Data science: An action plan for expanding the technical areas of the field of statistics," by W. Cleveland, 2001, International Statistical Review, 69(1), 21-26.
Scholarly Views of the Data Scientist Role
Zhu and Xiong (2015) explained there is a new discipline emerging called data science
and there are distinct differences between the established sciences, data technologies, and big
data. The formation and the further development of data science extends much further than
computer science. Although data scientists use similar methods and techniques there are
profound differences and data science requires fundamental theories and new techniques (Zhu &
Xiong, 2015). In an attempt to further define data science and the data scientist, Harris, Murphy, and Vaisman (2013) provided the results of a survey they conducted in mid-2012 of working analysts across multiple industries. These researchers surveyed analysts to understand their
experiences and perceptions of their skills. This research provided a quantitative methodology
that researchers and DOD organizations could leverage to understand how to evolve their
existing analysts into data scientists. Harris, Murphy, and Vaisman (2013) furthered the notion of the T-shaped data analyst: an analyst with broad expertise (the top of the T) coupled with in-depth knowledge of a particular skill or business domain (the stem of the T). The vertical stem of the T represents deep, foundational business domain understanding, and the horizontal bar represents the wide range of skills necessary across the organization (Harris et al.,
2013). Additionally, scholars such as Vincent Granville have published detailed descriptions of data scientists with specific skill requirements. In his foundational book Developing Analytic Talent: Becoming a Data Scientist (2014), Granville explained that the data scientist is a new role emerging across industries and government organizations, distinct from the traditional roles of statistician, business analyst, and data engineer. Data science combines business engineering and business domain expertise, data mining, statistics, and computer science, along with advanced predictive capabilities such as machine learning, bringing processes, techniques, and methodologies together with a business vision to drive actionable insights (Granville, 2014).
Business Intelligence and Business Analytics
Although scholars such as Zhu and Xiong (2015) and Harris, Murphy, and Vaisman (2013) proposed that data science is an emerging occupation with distinct skill requirements beyond those of traditional data analysts, other researchers suggest data science is the next logical progression of business intelligence (BI) and business analytics (BA), generating ongoing debate. Provost and Fawcett (2013) suggested companies have realized the
benefits of hiring data scientists, academic institutions are creating data science curriculums, and contemporary literature documents advocacy for a new data science occupation. However, there is disagreement about what constitutes data science, and without further definition the concept may diffuse into a meaningless term. These researchers argue data science
has been difficult to define because it is intermingled with other data driven decision making
concepts such as business analytics, business intelligence, and big data. The relationships
between these concepts and data science required further exploration and the underlying
principles of data science need to emerge to fully understand the potential of data science
(Provost & Fawcett, 2013).
The research conducted by Chen, Chiang, and Storey (2012) described a clear evolution
of business intelligence and business analytics starting in the 1990s and determined big data
analytics is a similar field offering new opportunities. They described big data and big data
analytics as terms used to describe the “data sets and analytical techniques that have become
large and complex and typically require unique and advanced storage” (p. 1165). Additionally,
big data sets may require specialized management, analysis and visualization technologies, and
techniques. The big data era has quietly moved into many public, private, and corporate
organizations and these researchers explained significant improvements in market intelligence,
government, politics, science and technology, healthcare, security, and public safety through big
data analysis. These researchers expressed that the analysis of big data is a field related to, but separate from, business intelligence and business analytics (Chen et al., 2012).
Data Sciences Skills
The literature suggests that before modern-day organizations, including the DOD, can benefit from rapid data growth and access to real-time information, data scientists will be required and will need to be embedded into decision processes (Galbraith, 2014). Research published in the Harvard Business Review by Shah, Horne, and Capellá (2012) suggested that even
though companies are investing heavily in deriving insights from data streaming from their
customers and suppliers there are still significant gaps in skills and abilities of individuals and
organizations to conduct the analysis. In 2012, these researchers surveyed 5,000 employees from
22 global companies and determined that fewer than 40% of employees have sufficiently mature skills
to succeed in a big data environment (Shah, Horne, & Capellá, 2012). Fundamentally, most organizations can analyze only a small subset of their collected data, constrained by the analytics and algorithms of desktop software solutions with modest capability (Shah et al., 2012).
Investigating whether a data scientist differs from a traditional quantitative analyst requires examining the current abilities of data scientists in relation to the information they are required to generate and their ability to use modern tool sets (Harris & Mehrotra, 2014). Many questions still exist, such as: What is the level
of education needed? Do data scientists need to have a terminal degree or is data science an
applied role? Do all data scientists need to be experts in machine learning and unstructured data
analysis? Additionally, there is evidence of a rise in mistaken assumptions regarding the meaningfulness of correlations in the era of big data. For example, big data sets often produce statistically significant findings even when the results are false and potentially based on inappropriate analytical methods, suggesting that analytical skills must be modified (Shah et al., 2012). The arrival of big data suggests the typical statistical approach of relying on p values to establish significance and correlation is unlikely to be sufficient in a world of immense data in which almost everything is significant. Simply put, applying traditional statistical tools to big data commonly yields false correlations (George et al., 2014).
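The false-correlation problem described above can be illustrated with a small simulation (a sketch of my own, not drawn from the dissertation): when many pure-noise variables are screened against a target, a predictable share of them clear a conventional significance threshold by chance alone, which is why naive p-value screening breaks down as the number of variables grows.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

random.seed(42)
n = 500            # observations per variable
features = 1000    # candidate variables, all independent noise
target = [random.gauss(0, 1) for _ in range(n)]

# Under the null hypothesis, |r| > 1.96/sqrt(n) corresponds to roughly
# p < .05 (normal approximation), so ~5% of noise variables will pass.
threshold = 1.96 / math.sqrt(n)
false_hits = sum(
    1
    for _ in range(features)
    if abs(pearson_r([random.gauss(0, 1) for _ in range(n)], target)) > threshold
)
print(f"{false_hits} of {features} pure-noise features look 'significant'")
```

With 1,000 noise features, roughly 50 "significant" correlations appear even though no real relationship exists, which is exactly the modification of analytical habits the literature calls for: controlling for multiple comparisons and effect sizes rather than relying on p values alone.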
Harris and Mehrotra (2014) expressed that in their research the organizations that create
the most value from data science are the ones that allow their data scientists to discover insights
from “open-ended questions that matter the most to the business” (p. 16). These researchers also
suggested there are distinguishable differences between data scientists when compared to
traditional quantitative analysts and there are many implications on how to define the roles of
data scientists as well as how to attract and train these experts and how to get the most value
from this emerging discipline. In 2014, these researchers surveyed more than 300 analytical
professionals from many different companies and from several industries to learn how these
analysts perceived their work and role in the organization. In their research they concluded about
one-third of the analysts describe themselves as data scientists with the remaining identifying
themselves as analysts with distinguishable characteristics. For example, more data scientists
than analysts consider their work more critical to favorable business outcomes. Additionally,
94% of the data scientists’ surveyed indicated analytical abilities are a key element of their
companies’ strategies and business model as compared to 65% of the traditional analysts who
believe their work is tied directly to business models and strategies (Harris & Mehrotra, 2014).
According to Harris and Mehrotra (2014), data scientist skills differ from traditional analyst and
the most typical distinctions are provided in Table 2.
Table 2
Harris and Mehrotra's Comparison of Traditional Analysts and Data Scientists

                         Traditional Analysts                  Data Scientists
Types of data            Structured or semi-structured,        All types, including unstructured,
                         relational, and typically             numeric, and non-numeric data
                         numeric data                          (such as images, sound, text)
Preferred tools          Statistical and modeling tools,       Mathematical languages (such as R
                         usually contained in a data           and Python), machine learning,
                         repository                            natural language processing, and
                                                               open-source tools
Nature of work           Report, predict, prescribe,           Explore, discover, investigate,
                         and optimize                          and visualize
Typical educational      Operations research, statistics,      Computer science, data science,
background               applied mathematics, predictive       symbolic systems, cognitive
                         analytics                             science
Mind-set                 Percentage who say they:              Percentage who say they:
                         Are entrepreneurial: 69%              Are entrepreneurial: 96%
                         Explore new ideas: 58%                Explore new ideas: 85%
                         Gain insights outside of              Gain insights outside of
                         formal projects: 54%                  formal projects: 89%
Note. Adapted from “Getting value from your data scientists,” by J. Harris and V. Mehrotra, (2014). MIT
Sloan Management Review, 56(1), 15-18. Copyright 2014 by Massachusetts Institute of Technology.
Adapted with permission.
The research concluded data scientists are highly skilled specialists who tackle the most
significant and complex business challenges (Harris & Mehrotra, 2014). Common themes regarding the skills required of data scientists include advanced, and in many cases open-source, statistical software such as R and Python. These applications lend themselves to another common characteristic of the perceived data scientist: they serve the organization best when they can explore open-ended questions (Davenport & Dyché, 2013).
Harris, Murphy, and Vaisman (2013) conducted quantitative research in 2012 that surveyed analysts across several industries to further the knowledge of data science skills and the role of data scientists. The researchers developed a list of 22 generic data science skills and then asked the respondents of their survey to categorize the skills and to self-identify their perceived roles against the list. The list of perceived data science skills described by these researchers was adapted to analyze the perceived skills and roles of the analysts at the Bravo Zulu Center, as seen in Table 3.
Table 3
Harris, Murphy, and Vaisman's Data Science Skills

Perceived Category           Data Science Skills
Business                     Product development; business
Machine learning/big data    Unstructured data; structured data; machine learning;
                             big and distributed data
Math & operations research   Optimization; math; graphical models; Bayesian/Monte
                             Carlo statistics; algorithms; simulation
Programming                  System administration; back-end programming; front-end
                             programming
Statistics                   Visualization; temporal statistics; surveys and
                             marketing; spatial statistics; science; data
                             manipulation; classical statistics
Note. Adapted from “Analyzing the Analyzers: An introspective survey of data scientists and their work,”
by H. Harris, D. Murphy, and M. Vaisman, (2013). Sebastopol, CA: O’Reilly Media. Copyright 2013
by the authors. Adapted with permission.
Defining the occupation of the data scientist is an evolutionary process currently underway. Viaene (2013) explained that data science is not yet a defined academic discipline or established profession. A group of occupations, including scientists, analysts, technologists, engineers, and statisticians, appears to be working together to carve out the role of the data scientist. This researcher also agrees with other ongoing data science research that big data analysis requires a multi-skilled team in which the data scientist is a member. Big data sets combined with advanced analytical capability are creating a breed of analysts able to uncover hidden patterns and unknown correlations (Santaferraro, 2013).
Data Science and Business Domain Connection
A common theme in data science research suggests that for data scientists to generate
business value they will need to work closely with domain experts in the organization (Viaene,
2013). To create business value and prevent runaway data projects, this researcher proposed a benefits realization process consisting of a circular series of steps. This process creates collaboration between business domain experts and data scientists and should be a foundational requirement before starting a data science project. Viaene's benefits realization process steps are briefly described below:
Modeling the business - modeling represents using data to create improvements in the business.
Discovering data - discovery takes place in the model domain.
Operationalizing insights - insights are transferred from the model domain to the business domain, or operationalized.
Cultivating knowledge - promotes the best practices for the use of data and data science to maximize the investment.
Three Types of Analysts
Viaene (2013) described the roles of traditional analysts as falling into three categories: data analysts, business intelligence analysts, and business analysts. First, data analysts are professionals who understand where data comes from and how to make data available for business decisions. These analysts typically focus on the extraction, cleansing, and transformation of raw data into actionable information, and most have computer science training and solid backgrounds in math and statistics. Second, business intelligence (BI) analysts are effective once the data have been moved into data marts and data warehouses; they perform the next level of data preparation. Third, business analysts are the group within the organization that can transform the information collected into actionable insights on where to influence the business. Their abilities in moving, handling, and analyzing data make these traditional analysts ideal data scientist candidates.
Evolving these traditional analysts into data scientists will require proficiencies in parallel computing, petabyte-scale unstructured analysis with NoSQL databases, machine learning, and advanced statistics (Santaferraro, 2013). To develop these data scientists, Santaferraro (2013) suggested creating internal programs that provide the opportunity for existing data analysts, BI analysts, and business analysts to acquire the skills they need to become big data scientists, and recommended building such a program around five primary tasks. Santaferraro's five-point plan is summarized below:
Task 1 – Canvas existing analysts and identify those with the background, talent and
desire to increase their skills and create education opportunities for these individuals.
Task 2 – Provide incentives for participants and reward them for reaching milestones.
Incentivizing data scientists’ loyalty will be important due to the shortage of data scientists.
Task 3 – Organize analysis structure to support big data success. Avoid tying data
scientists only to business units or only creating an enterprise pool of data scientists. A hybrid of
these two approaches is warranted.
Task 4 – Deploy the infrastructure to support big data analytics. Create an infrastructure
to support unconstrained analytics. These systems should contain embedded analytics, agile
extensions, rapid iterations, real-time access, and extreme flexibility.
Task 5 – Foster a culture of analytics that supports data driven decisions. Big data
analysis can eliminate emotions, gut feelings, and egos from decision-making.
Training and Certification of Data Scientists
Henry and Venkatraman (2015) claimed that the average American university and its degree programs are unprepared to provide the analytical skills required by corporations in the modern big data environment. Conversely, the literature suggests there are many colleges, universities,
trade-schools, research organizations, software providers and government organizations that are
modifying their curriculums to include advanced analytics and data science (Miller, 2014). The
literature regarding data science suggests there are no widely agreed-upon standards and certification requirements for data science and data scientists; essentially, anyone can label themselves a data scientist. Considerations such as the educational level and the core skill requirements are still in debate, making it difficult to define data science skills and curriculums. However, many educational institutions now provide their own interpretations (Cotter, 2014).
In Cotter’s (2014) dissertation, Analytics by Degree: The Dilemmas of Big Data Analytics in Lasting University/Corporate Partnerships, the researcher conducted an in-depth investigation into how corporations and universities should partner to ensure the readiness of graduates to fill key analysis roles in the era of big data. Cotter conducted a phenomenological study and interviewed four business analytical groups (business leaders, faculty, recent graduates, and supervisors of recent graduates) to determine the readiness of the recent graduates and the perceived overall effectiveness of the university education. This research concluded that most business analytics graduates are initially lacking in real-world preparation. Additionally, Cotter concluded the ever-changing business world is creating a need for analytical capability that may previously have been satisfied by T-shaped analysts (Cotter, 2014). Cotter’s research amplifies the research questions posed in this dissertation regarding how prepared the analysts within the DOD are to glean actionable information from big data sets. Fundamentally, determining how the curriculums offered today at universities and DOD learning institutions may need to change to provide data scientists to the workforce is of high interest to DOD leaders (Edwards, 2014).
Defining data scientists’ skills, training, and certification requirements is problematic because of the broad implications and overlapping language with business intelligence, data analysis, and business analytics. Cotter (2014) also conducted a comprehensive review of the degrees and certifications currently offered at the undergraduate and graduate levels in the United States and abroad and concluded there are several learning institutions with many undergraduate degrees and certifications available. Fundamental to the investigation of whether data scientists are different from traditional quantitative analysts is an examination of the current abilities of data scientists in relation to their requirements to generate information and their ability to use the modern tool sets. There is evidence suggesting not only a skills gap but also that the analysis tools are outpacing the abilities of the analysts, indicating a gap in human talent to harness big data (Halper, 2016). Watson and Marjanovic (2013) suggested that already-embedded business analysts can upgrade their skills through university courses, which should include Java, R, SAS Enterprise Miner, IBM SPSS Modeler, Hadoop, and MapReduce.
Commercial Certification
Another option available to the DOD for examining its data science abilities is the use of certification from agencies outside of the DOD and academia. A modest search of the certification options available for data scientists today suggests there are several companies and trade organizations providing training and certification. The Institute for Operations Research and the Management Sciences (INFORMS) is an international organization comprising over 12,500 members supporting the fields of operations research and analytics. INFORMS describes in its charter a desire to promote practices that create advances in operations research and analytics for the betterment of decision-making and the optimization of business processes (INFORMS, 2017). This organization claims to be a leading organization in the formalization of a certification process for analytics focused on moving organizations from descriptive to predictive and prescriptive analytics (Sharda, Asamoah, & Ponna, 2013). INFORMS sets eligibility requirements for experience and skills and then, through a set of high standards and rigorous examinations, certifies analytical professionals with the Certified Analytics Professional (CAP) credential (INFORMS, 2017).
Halper (2016) provided the results of a snapshot survey of an audience at The Data Warehouse Institute Chicago 2016. This research aimed to further the understanding of the confidence in software providers’ ability to automate the analysis of big data sets and to address the skills gap. There is a push by software and hardware technology providers to ease the skills required of data scientists by advancing analytical software that can continually move through large data sets while also providing high-level and effective statistical analysis and training. Halper’s modest research supports the notion that organizations are still trying to determine what skills are required of their analysts and where the analysts will come from, and remain uncertain as to the overall effectiveness of software solutions (Halper, 2016).
Vendor Training and Certification
There are several major corporations, such as Microsoft, IBM, Teradata, and SAS, that are quickly developing professional analytical and data science programs. Microsoft recognizes the growing need for professional expertise in data science through its professional development program, which focuses on data science theory, hands-on training, and an online course curriculum coupled with a final project prior to certification (Davis, 2016). The SAS Institute is another organization offering a data science certification. The company was founded in 1976 and has grown consistently ever since. SAS suggests that companies successfully harnessing information from big data are augmenting their existing analytical staffs with data scientists, who possess higher levels of IT capability and specialized training and skills, with emphasis on big data technologies (SAS, 2017). SAS has developed an Academy for Data Science that offers a blend of classroom and online courses and uses a case study approach to provide hands-on experience. Additionally, the SAS training curriculum offers training in several of the sought-after big data and data science applications such as R, Python, Pig, Hive, and Hadoop (SAS, 2017). This research study explored the commercial availability of data science training and how analysts are trained at the BZC to help determine if further exploration of commercial data science training is appropriate for DOD organizations.
Shortfall Preparation
The literature suggests there is a significant shortfall of analytical professionals within the commercial sector and the DOD, and this shortfall is expected to grow (Géczy, 2015). As this literature review demonstrates, researchers are calling for action. Miller (2014) suggested that big data and data science present such a significant problem that a national consortium is warranted. Academia, industry, and the U.S. Government should work together to continue the growth of a big data and data science national consortium to address the big data analytical skills gap (Miller, 2014). This consortium would do the following:
Create formal definitions for occupations, to include data scientist
Establish curriculums and standards for accreditation for data and analytics
Engage industries, government, and academia through shared communities of interest
Partner with industry consortiums and organizations to establish strong internship programs and increase the collaboration between academia and business
Stimulate the creation of courseware for skills and literacy at all levels of education
Establish working groups to govern data policy issues
Federal Job Series and DOD Data Scientists
George, Haas, and Pentland (2014) suggested that the methodologies used to analyze data are equally as important as the methods used to collect it. Finding and maintaining analysts who are capable of gleaning actionable information and significance from big data intelligence is a challenge confronting our military, and these experts are in short supply (Edwards, 2014). The development and continuous maintenance of data analysis skills in the era of big data typically require large investments in time and dollars. Additionally, each class of DOD worker (enlisted, officer, civilian, contractor) may benefit uniquely from big data analysis but may also bring unique challenges (Schneider et al., 2015). Attempting to analyze the current state of skills and potential shortfalls across the entire class of workers in the DOD is beyond the scope of this dissertation. However, this research focused on the primary analysts responsible for conducting big data analysis at the Bravo Zulu Center, the DOD civilians. Additionally, because the definitions, skill requirements, and occupational roles of data scientists are still emerging in commercial industries and academia, exploring this problem for the DOD is fundamentally important. Several researchers suggest the most likely avenue for organizations to develop analytical talent will be to innovate new talent from existing analytical groups (Davenport & Dyché, 2013). To gain insights into the DOD’s current talent for conducting big data analysis, this research investigated the current occupational roles of the persons assigned within the federal civilian workforce and the analysts assigned to the case study organization responsible for conducting data analysis.
Office of Personnel Management
The United States Office of Personnel Management (OPM) is an independent agency of
the U.S. Federal Government that manages the civil service labor force. According to OPM,
“their mission is to recruit and hire the best talent; to train and motivate employees to achieve
their greatest potential; and to constantly promote an inclusive workforce defined by diverse
perspectives” (OPM, 2014, p. 1). OPM maintains a detailed classification and qualifications
section of their website and publicly available manual that promulgates the federal position
classifications, job grading, and qualifications information that is used to determine the
classifications and qualifications requirements for most work within the Federal Government
(OPM, 2014).
Classification and Qualification Standards
OPM classification standards are assigned to all federal positions and provide uniformity and equity in the classification of positions by providing a common reference across federal organizations, locations, and agencies. An OPM classification usually includes a description of the duties, criteria, official titles, and grades. Simply put, by classifying federal jobs, OPM determines the appropriate occupational series title, pay grade, and pay system. Qualifications are the specific knowledge, skills, and abilities required of each position (OPM, 2009). OPM categorizes all federal positions as either white-collar jobs or trades and labor occupations. Examining the federal positions classified for data analysis and the qualifications required of these positions provided insights into the DOD’s current labor force associated with conducting big data analysis. Research into the current federal job classifications suggests there is no current job classification series for data scientists, and the terms business intelligence and business analytics are not requirements listed in OPM’s classification and qualifications guidance. However, within the OPM 1500 job series there are several job classifications that encompass analysis, mathematics, statistics, operations research, and computer science. The 1500 job series appears to be the federal job classification most closely related to the emerging field of data science (OPM, 2005).
A description of the 1500 job series is paraphrased below:
Federal 1500 Job Series – This group includes all classes of positions whose duties are to advise on, administer, supervise, or perform research or other professional and scientific work, or related clerical work, in basic mathematical principles, methods, procedures, or relationships, including the development and application of mathematical methods for the investigation and solution of problems; the development and application of statistical theory in the selection, collection, classification, adjustment, analysis, and interpretation of data; and the development and application of mathematical, statistical, and financial principles to programs or problems involving life and property risks (OPM, 2005, pp. 14-16).
Further examination of the 1500 federal job classification guidance shows there are several occupational series that encompass, at least in part, many qualification requirements of traditional analysis, as seen in Table 4. This research explored the 1500 series federal occupations and other federal analyst occupations within the DOD workforce to determine if they provide the necessary skills for useful big data analysis and how aligned these federal occupations are with that of the perceived data scientist.
Table 4
Federal 1500 Job Series Occupations
1501 – General Mathematics & Statistics
1510 – Actuarial Science
1515 – Operations Research
1520 – Mathematics
1529 – Mathematical Statistics
1530 – Statistics
Note. Adapted from “Professional Work in the Mathematical Sciences Group 1500,” by U.S. Office of
Personnel Management.
According to research published by the U.S. Air Force, a distinctive data science career field does not currently exist, and the operations research analyst series (1515) is the federal occupation that most closely relates to the perceived data scientist occupation (USAF, 2016). The employment of the 1500 job series analysts and other active analyst occupations was explored within the BZC case study.
Management Implications
The arrival of a vast amount of data along with the continuing evolution of information
systems presents a paradigm that requires a change in the management of the organization.
Combining big data with advanced analytics will allow managers to gain deep insights about
their business and translate data analysis into improved performance (Brynjolfsson & McAfee,
2012). The Manyika et al. (2011) research that indicated a large shortfall of data scientists by 2018 also forecasted a significant shortfall of managers with the expertise to leverage big data analysis to make effective decisions. In a big data era where one comment from a trusted social media source can result in losses or profits of billions of dollars and chain reactions in the news media, there is no remaining argument about the impact on the management of modern-day business (George et al., 2014). Additionally, there is little doubt businesses are prioritizing the inclusion of big data in their strategic plans; in a recent survey of six hundred global business leaders, respondents identified their organizations as data driven, and ninety percent of those organizations recognized information as a key resource for success (Gobble, 2013). However, there is evidence that suggests many organizations do not fully trust the technologies, the data, and ultimately the data scientists, and “neither the data scientists nor managers are effective at speaking each other’s language” (Harris & Mehrotra, 2014, p. 16). Harris and Mehrotra (2014) proposed five key management challenges to address in the era of big data:
Talent Management
Leadership
Decision Making
Technology
Company Culture
Although a comprehensive investigation of the management implications associated with all five key management challenges is beyond the scope of this dissertation, researching key implications for managers and their perceptions of big data and data science is warranted. Additionally, the approach of investigating the perceptions of the analysts as well as conducting a focus group interview with executives or managers within the Bravo Zulu Center helped ensure a deep investigation. The investigation with the management team at the Bravo Zulu Center explored their perceptions regarding the differences between data scientists and traditional analysts, along with several other important questions. Harris and Mehrotra’s (2014) research included a survey of more than 300 analysts and suggested that because management involved data scientists far more directly than traditional data analysts in the most critical projects, management understands how effective creative data scientists can be when it comes to solving complex problems. Additionally, as part of their research, Harris and Mehrotra conducted a focus group interview session with a group of managers and executives to gain their perspectives on big data and data science. This approach was repeated in this case study research with the Bravo Zulu Center.
According to Brynjolfsson and McAfee (2012), the managerial challenges associated with building data-driven organizations from big data are even greater than the technological challenges. In general, the technologies are outpacing adoption, and there is work to be done to construct the policies that ensure the leveraging of big data. In previous decades, data and metrics were limited and essentially rolled into aggregated key performance indicators presented to executives. Many of the decisions and much of the direction of the firm were placed in the hands of executives, who relied heavily on their experiences and intuition. The ability to analyze big data stands to completely change this business model but requires a significant investment in the culture of the organization (Brynjolfsson & McAfee, 2012). Additionally, even though big data is now accepted as a common business term, there is very little published scholarly management literature that tackles the management challenges associated with big data, which provides great promise and opportunity for new theories and practices (George et al., 2014). Companies may need to train incumbent managers to be more numerate and data literate as well as hire new managers who already possess the skills to lead in the era of big data (Harris & Mehrotra, 2014).
Kiron (2013) of the MIT Sloan Management Review provided an analysis of a 2012 survey of 50 senior executives from the financial and insurance industries that investigated their perceptions of big data. Several key themes emerged from this analysis:
These leaders believed in the promise of better-informed decisions through the analysis of big data sets. Eighty-five percent of the surveyed leaders indicated they have big data initiatives either planned or in work.
These leaders were more concerned about the variety of data and less concerned about the volume. Most of the firms had initiatives for managing the volume of data but were not satisfied with the integration of the dispersed data sources.
Very few leaders (only 3%) were concerned about the analysis of social media information.
Organizational alignment is a critical factor in ensuring success. The alignment of big data initiatives across the business and information technology units is crucial.
The leaders recognized the lack of available analytical talent.
Harris and Mehrotra (2014) suggested senior management will need to learn how to best employ and manage data scientists. Many large organizations are now creating a core hub of data scientists to foster an environment of sharing information and technology. Additionally, because data scientists are a scarce commodity, many organizations are embedding data scientists within existing data analysis groups. Creating teams that combine business analysts, visualization experts, modeling experts, and data scientists from different disciplines and functional areas may provide the most effective employment strategy (Harris & Mehrotra, 2014).
Summary
This literature review provided evidence that U.S. companies are experiencing massive data growth and that companies that can harness information from big data create competitive advantages. Similarly, the DOD is experiencing big data growth, and the ability of the U.S. military to analyze large data sets is becoming a crucial element of mission accomplishment (Hamilton & Kreuzer, 2018). The terms big data and data science have rapidly grown in their relative importance in business and DOD scholarship; however, there remains opportunity to further advance theory for practical application. The desired ability to conduct meaningful analysis of big data sets is a strong theme in contemporary scholarly literature, and the emerging data science occupation is quickly gaining merit. Given the evidence suggesting there will continue to be a shortage of data scientists for the near future, the DOD is faced with a significant challenge.
CHAPTER 3. METHODOLOGY
Introduction
The purpose of this qualitative case study was to explore how DOD employees conduct
data analysis with the influx of big data. This research explored the emerging data scientist
occupation and the skills required of data scientists to help determine if data science is applicable
to the DOD. This research aimed to discover if there are fundamental differences between DOD
analysts and data scientists by exploring the professional experiences of analysts and managers
from a key organization within the DOD. Géczy (2015) proposed that a common big data problem exists in organizations because most organizations are unable to manage and analyze big data sets. Berner et al. (2014) suggested organizations are capturing more data than at any time in history, with clear advantages to organizations that glean insight from the data. Although there is a tremendous amount of literature investigating the implications of big data sets and data science, there appears to be a gap in published scholarly literature regarding big data and data science related specifically to the DOD (Frizzo-Barker et al., 2016). The general business problem is the lack of effective analysis in organizations operating in the modern-day big data environment (Harris & Mehrotra, 2014). The specific business problem is that DOD organizations may be struggling to glean actionable information from large data sets, compounded by the immature data science skills of DOD analysts (Harris et al., 2013). This chapter is organized into sections to explain the methodology, design, setting, and proposed participants. Additionally, this chapter explains how the data were collected and analyzed in support of the two research questions and how ethical considerations were handled.
Research Questions
The objective of this research was to develop an understanding of how DOD analysts respond to, probe, and assimilate data in big data environments to help determine if a data science
occupation is justified and warranted in the DOD. The following research questions guided the
study:
Primary Research Question 1: How does the Bravo Zulu Center glean actionable
information from big data sets?
Primary Research Question 2: How mature are the data science analytical skills,
processes, and software tools used by Bravo Zulu Center analysts?
These research questions framed the research and were used to generate data through semi-structured personal interviews and a single focus group interview with professionals living the big data phenomenon within the DOD. Additionally, analysis of documents from the sponsoring case study organization served as a third data source.
The remainder of Chapter 3 provides details on the research design and methodology, the sponsoring organization and participants, and the questions of inquiry, including how the data were collected and analyzed. Additionally, this chapter discusses the credibility and dependability of the research and ethical considerations.
Design and Methodology
A research design provides the logic that connects the collected data to the overall
questions posed in the study (Yin, 2009). Creswell (2009) described three components of
research: the researcher’s philosophical assumptions, the methodology, and the strategy of
inquiry. The researcher used an exploratory research design to gather the perceptions of the
participants through personal interviews and employed a qualitative strategy to explore and
analyze the collected data from a single embedded case study organization.
Methodological Approach
Qualitative research stems from a variety of disciplines such as “anthropology, sociology,
psychology, linguistics, communication, economics, and semiotics” (Cooper & Schindler, 2013,
p. 145). Qualitative research is an approach for exploring and understanding the meaning
individuals or groups may ascribe to a specific problem or phenomenon. This type of research
involves collecting data typically in the participants’ settings and inductively conducting analysis
of the collected information looking for themes to provide insight and understanding (Cooper &
Schindler, 2013). Additionally, Creswell (2009) explained, although there may still be
deliberation on the fine elements of qualitative research, generally there is common agreement
on several core and defining characteristics as seen below:
Qualitative researchers collect data where the participants are experiencing the phenomenon or problem under investigation.
The researcher serves as the key instrument and is the means by which the data are collected. Qualitative researchers may collect the data through interviewing participants, observing behavior, or examining documents.
Qualitative researchers gather multiple forms of data rather than relying on a single source.
Qualitative researchers build patterns, categories, and themes from the data from the bottom up, utilizing inductive and deductive data analysis techniques.
Qualitative researchers maintain a focus on learning the meaning that the participants of the study uphold regarding the problem or issue under investigation.
Qualitative researchers are open to emergent designs and understand that questions may change and data collection methods may shift as the researcher learns about the problem or issue to be studied.
Qualitative researchers understand their role in the study and how their personal backgrounds have the potential to shape interpretations.
Qualitative researchers strive to develop a complete account of the research problem.
A qualitative research methodology is appropriate for understanding human behavior and
is common in social and behavioral sciences and by scholar practitioners who seek to understand
a phenomenon (Cooper & Schindler, 2013). In this case, the research was furthering the body of
knowledge as it relates to big data and data science and how or if DOD analysts should be
behaving differently due to the growth of information into big data.
Research Design
A case study is a qualitative research design to obtain multiple perspectives from a single
organization and is appropriate when questions are being posed to understand a contemporary
phenomenon (Yin, 2009). Case study research is an inquiry about a contemporary phenomenon
that is set within the real-world context when there is a desire to provide an up-close and in-
depth understanding from a single or small number of cases (Yin, 2012). This effective approach
is the rationale for selecting one organization within the DOD with the intent to help determine if
data scientists are warranted in DOD organizations. Triangulation is a method used to improve
the overall accuracy of research by combing data collection methods and differing types of data
(Gronhaug & Ghauri, 2010). Triangulation for this research was executed by collecting data
through semi-structured personal interviews, a single focus group interview and document
analysis. Triangulation was accomplished by analyzing the data from the three data sources with
the assistance of the NVivo-11® software.
Participants
Yin (2009) suggested that a single case study is appropriate under several circumstances.
First, a case study is appropriate when a single case meets all the conditions for testing the theory and can confirm, challenge, or extend the theory; second, when a single case represents an extreme or unique case; and last, when a single case is representative of a typical case. The Bravo Zulu Center represents a typical case as described by Yin (2009). By examining this representative case study organization within the DOD, directly responsible for large data sets, this research can provide actionable knowledge and serve as a road map for the DOD and similar large, complex organizations to execute further research. There are several means of data collection available to the qualitative researcher (Creswell, 2009). The researcher collected data through semi-structured interviews, document analysis, and a single focus group interview; these are discussed further in the data collection section of this chapter.
The researcher contacted senior officials from the DOD working in the Pentagon to help
identify organizations that are responsible for analyzing large data sets thus making them
candidate organizations to participate in this research. Additionally, the researcher’s extensive
experience in the DOD helped to guide the selection of the Bravo Zulu Center (BZC) as the case
study organization to support this research. The BZC is a large complex organization with big
data and data science challenges and is representative of many DOD organizations facing very
similar challenges. Because the DOD is an extremely large organization with understandably
tight controls on releasing information, creating actionable research is difficult, but not
impossible. A letter for sponsorship was provided by the Office of the Secretary of Defense
Prepublication and Security Review that granted approval of this research within any DOD
organization with two conditions. First, DOD-specific literature supporting the literature review portion of this study would need to be literature regarding the DOD that had already been released. In other words, the researcher was not permitted to use his DOD computer and network access to extract DOD-related information that had not yet been released for public dissemination. Second, the organizations and the individuals who participated in the research would do so on a volunteer basis, and the participants could end their involvement with the researcher at any time without repercussion. Additionally, the sponsoring organization and the participants would not be compensated.
Selecting the participants in qualitative research requires deliberate planning and an
effective sampling strategy. Participants of the research study are generally not chosen because
their opinions represent the dominant opinion but because their experiences and attitudes will
reflect the entire scope of the research problem (Gronhaug & Ghauri, 2010). The basic premise
for sampling in scientific research is “by selecting some of the elements in the population,
conclusions can be drawn regarding the entire population” (Cooper & Schindler, 2013, p. 338).
The population for this study represents thousands of managers and analysts from the DOD.
Additionally, the initial review of available literature regarding the BZC and its mission
supported its selection as the representative organization to support this study.
Harris and Mehrotra (2014) conducted a research project that, in 2012, surveyed more
than three hundred analysts and held a focus group interview with managers and executives to
investigate how organizations can get value from data scientists. Their research findings
suggested hiring data scientists alone is not enough and that managers in modern organizations
must learn how to employ data scientists effectively. Their strategy of soliciting participants
from two distinct groups, analysts and managers, served as a foundation for this research and
was repeated in this case study of the BZC.
To gain understanding within specific functional groups in the DOD, a purposive
sampling method was used. Purposive sampling is a type of nonprobability sampling where the
researcher arbitrarily selects participants for their “unique characteristics or their experiences,
attitudes, or perceptions” and is most effective when one needs to study a certain cultural domain
with knowledgeable experts within the organization (Cooper & Schindler, 2013, p. 663). The
ideal target population was determined to be senior managers or executives from the BZC
directly responsible for or influenced by large data sets as well as the analysts, or perceived data
scientists supporting management within the BZC. Each of the participants of this study met the
initial inclusion criteria because they are employed by the BZC working as either an analyst or
manager/executive within the organization. Additionally, the purposive sampling strategy
allowed the researcher to exercise his expert judgment on additional inclusion and exclusion of
participants that ultimately increased the precision and accuracy of the research. The researcher
applied a minimum seniority and experience level to both participant groups and excluded DOD
contractors.
Although there is no specific requirement on the number of participants to include in a
qualitative research study, qualitative case study research typically ranges from 3 to 10
participants (Creswell, 2009). Additionally, saturation in qualitative research suggests the
researcher should keep sampling if the breadth and depth of knowledge is expanding and stop
collecting data when redundancy appears or no new insights occur from the collected data
(Walker, 2012). To ensure saturation was met, the researcher predetermined that a minimum of 10
analysts would participate in the personal interviews and a minimum of 6 managers or executives
would participate in the focus group interview. Additionally, participation of all the analysts and
managers in this research was voluntary, no compensation was provided, and the participants
were informed they could leave at any time without repercussion. The details of the BZC
participant criteria are summarized in Table 5.
Table 5

BZC Participant Criteria

                         Managers or Executives      Analysts
Pay Grade or Rank        Civilian GS-14 or above;    Civilian GS-07 or above;
                         military O-5 or above       military E-5 or above
Overall DOD Experience   10 years                    5 years
BZC Experience           2 years                     2 years
Data Collection          Focus group                 Interviews
Participants             6-8                         10 minimum
Setting
Several factors were used to determine the DOD organization to participate in this
research and a potential conflict of interest was addressed. A conflict of interest is any condition
in which the researcher has an existing relationship with a participant or the sponsoring
organization that could compromise the validity and the findings of the research (Seidman,
2013). Naval aviation related DOD organizations were omitted as possibilities to avoid any
potential conflict of interest due to the researcher’s active employment with Naval Air Systems
Command (NAVAIR) and the potential of his 32-year naval career creating bias in the research.
Secondly, using secondary information, such as DOD organizational charts as well as
consultations with current senior civil service members at the Office of the Secretary of Defense,
several organizations were targeted for possible inclusion. Lastly, any DOD organization
selected would need to be experiencing large growth in data and be required to provide
actionable information from its big data sets.
The Bravo Zulu Center (BZC) was selected by the researcher as the single case study
organization. The BZC is a large complex organization with big data and analysis requirements
to support its mission and is representative of many DOD organizations facing very similar
challenges. The BZC’s big data and analysis requirement supports the selection of the BZC as
the representative organization to support this case study research. Due to the geographical
distance between the researcher and the BZC and the scheduling complexities created by the
number of participants, the data was not collected in person; it was collected via telephone, as
addressed further in the data collection section of this dissertation.
The BZC published and made publicly available a strategic document that provided
insights into data and analysis challenges within their organization. According to this report, the
U.S. Air Force has only started to realize the full potential of an integrated logistics and
sustainment enterprise and the ability to access and analyze data will play a key role. This
strategic plan for the BZC categorizes the actions to achieve the vision into nine distinct
attributes. Attribute #1 sets a vision for the BZC to build and analyze their data more effectively.
This strategic vision along with other BZC documents were explored as part of this research and
further detail is provided in Chapter 4. This research provided value to the DOD practitioners
working within the BZC and similar DOD organizations required to analyze big data sets. To
ensure confidentiality of the case study organization, the title and citation of the BZC strategic
document is not provided in this research.
Analysis of Research Questions
In qualitative research findings result from a process of data collection, interpretative or
analytical processing, and reporting (Cooper & Schindler, 2013). Organizations are made up of
human beings with different skills, attitudes, beliefs, values, motivations, prejudices, hopes,
worries, political beliefs, and other characteristics that affect the performance of the organization
(Swanson & Holton, 2005). In support of the two research questions chartering this study, the
role of the researcher was to explore how the BZC gleans actionable information from large data
sets to help determine if the data scientist occupation is warranted in DOD organizations. By
posing questions to professionals working within the BZC, their responses yielded patterns
regarding big data and data sciences and generated themes for actionable conclusions and the
support of further research. Three instruments and three data collection methods were used in
this study as seen in Table 6.
Table 6
Instruments and Data Collection Methods
Instrument                  Data Collection Method(s)
The researcher              Interviews, focus group, document analysis
Audio recorder/telephone    Interviews
Audio recorder/telephone    Focus group
William S. Cleveland (2001) introduced the term data science in the context of enlarging
the major areas of technical work in the field of statistics and provides the conceptual framework
that supports this study. Cleveland's seminal work proposed an "action plan to enlarge the
technical areas of statistics" that "focuses on the data analyst" (Cleveland, 2001, p. 1).
Cleveland predicted that, due to ever-increasing collections of data, the analysis occupation
would be altered so substantially that a new field would emerge, to be called "data science"
(Cleveland, 2001, p. 1). Cleveland's proposal of six technical areas that encompass the field of data science
includes multidisciplinary investigations, models and methods for data, computing with data,
pedagogy, tool evaluation, and theory. This taxonomy was adapted with permission from a
senior executive within the BZC to collect and analyze the data as seen in Figure 3.
Figure 3. Cleveland’s Data Science Taxonomy. Adapted from “Data Science: An action plan for
expanding the technical areas of the field of statistics.” by W. Cleveland (2001) International
statistical review, 69(1), 21-26.
1. Multidisciplinary Investigation – Investigate BZC data analysis collaborations.
2. Models and Methods – Investigate the analysis capabilities and the statistical models
and methods used by the BZC analysts.
3. Computing with Data – Investigate BZC hardware and software capability available to
conduct big data analysis.
4. Pedagogy – Investigate the skills of the BZC analysts and the educational and training
requirements and opportunities available to BZC analysts.
5. Tool evaluation – Investigate the BZC software tools used in big data analysis.
Semi-Structured Interviews and Focus Group Interview Questions
The interview questions should seek to describe the essence of the experience and be
unquestionably linked to the research problem under investigation (Creswell, 2009). In support
of the two primary research questions chartering this research, the researcher prepared several
interview questions to gain specific insights regarding big data and data sciences experiences at
the BZC. The interview questions were limited to between five and eight and were carefully
prepared to provide insights into the research problem without limiting the views of the
participants. A template was developed to ensure a clear understanding of the questions and
to ensure identical initial questions were posed to the managers or executives and the analysts
within the BZC. Additionally, the participants were given the questions at least one week prior to
the scheduled interviews to ensure adequate time to develop in-depth responses.
Interview Questions
1. How is data used in your organization to meet mission requirements? What are some
areas in your organization that are dependent on data?
2. How do you define big data? What increases of digital data (big data) have you
witnessed and how has it impacted the business of the BZC?
3. What are some knowledge, skills, and abilities needed to be an effective data
scientist?
4. What are some of the significant challenges associated with conducting data analysis
in your organization?
5. What are the data science skills that are used by the BZC analysts?
6. What additional skills are needed by analysts to be effective in the modern big data
environment?
7. What else can you tell me regarding big data and data science?
Semi-Structured Interview Protocol
A semi-structured interview protocol was selected as the best means to collect data from
the analysts who participated in the research. Semi-structured interviews are individual depth
interviews that generally start with a few broader questions, to put the respondents at ease and to
gain general insight into the business problem, and then migrate into increasingly more specific
questions to draw out detail (Cooper & Schindler, 2013). Interviews used in qualitative research
can vary depending on the “number of people involved, the level of structure, the proximity of
the interviewer to the participants, and the number of interviews conducted” (Cooper &
Schindler, 2013, p. 152). Effective use of semi-structured interviews relies on developing a
dialog between the interviewers and the respondents and requires more interviewer creativity.
Additionally, the interviewer’s experience and skills should be used to achieve a greater clarity
and elaboration of the answers (Cooper & Schindler, 2013). With 32 years of DOD experience
as both an active-duty sailor and a federal civilian, the researcher drew heavily on his experience
managing information technology projects and data analysis initiatives for the DOD. The
telephone was used as the data collection instrument to
conduct the interviews with the analysts.
Focus Group
A focus group is a panel that typically consists of 6 to 8 participants and is led by a
trained moderator. Focus group interviews typically last between ninety minutes and two hours
(Cooper & Schindler, 2013). The researcher moderated a focus group interview that consisted of
8 managers or executives from the BZC to gain insights, ideas, feelings, and experiences about
big data and data sciences in their organization. A recorded telephone conference was used as the
data collection instrument to conduct the focus group interview after which the recorded audio
was transcribed and analyzed by the researcher to determine patterns and themes.
Credibility and Dependability
Internal validity or credibility addresses how the research findings match reality.
Qualitative researchers need to address the extent to which the findings will make sense and be
considered credible (Swanson & Holton, 2005). To ensure consistency of the findings and
dependability in the research, the researcher used a field-testing technique. The interview questions that were
developed by the researcher were field tested with five doctoral-level business professors who
possessed the experience and skills to participate in this study and who helped determine whether the
questions posed by the researcher were interpreted as intended. These field tests were conducted
by telephone to simulate the conditions of the actual interviews and modifications were made to
the interview template based upon the feedback received. The field test confirmed the credibility
and dependability of the semi-structured interview guide and the focus group interview guide
used for this study. Creswell (2009) suggested member checking is a process used by researchers
to ensure the accuracy of qualitative findings. Through ongoing dialogue with the participants,
the researcher continually described his interpretation of the dialogue to ensure it aligned with
the participants' perceptions. Additionally, the researcher submitted
a copy of the transcripts to each participant for their review to ensure the researcher accurately
transcribed the dialogue.
Triangulation is a method to improve the accuracy of qualitative research by combining
data collection methods and different types of data to support the research. Triangulation in
research assists in the production of a more complete, holistic, and contextual portrait of the
research problem and is particularly important in case study research (Gronhaug & Ghauri,
2010). Triangulation for this research was achieved by utilizing three data collection methods
appropriate for qualitative research as seen in Figure 4.
Figure 4. BZC case study triangulation.
In conjunction with the analysts' interviews and the management focus group interview, the
researcher collected documents to support the research questions posed in this study. Documents
included job descriptions of analysts working at the BZC and a strategy document regarding data
analysis at the BZC.
Data Collection
Qualitative research combines explorative and intuitive analysis and relies on the
experience and the skills of the researcher to conduct analysis of the collected data (Gronhaug &
Ghauri, 2010). As with many scientific studies, business research studies generally require the
collection of primary data to answer their research questions (Gronhaug & Ghauri, 2010). The
data collection decisions in this research set the boundaries for the study on how the data would
be collected and documented for later analysis (Creswell, 2009). Creswell (2009) suggested
when conducting qualitative inquiry, the researcher has several forms of data collection means
available:
- A qualitative observation seeks to obtain information through the use of field notes on
the behaviors and activities of the individuals at the research sites.
- Qualitative interviews are direct interaction events in which the researcher meets with
the participants and, through the use of semi-structured interviews, elicits views and
opinions.
- Qualitative documents are public documents (e.g., newspapers, meeting minutes,
official reports).
- Qualitative audio and visual materials include audio recordings, photographs, video,
website main pages, and e-mail.
Upon approval from the Capella University Institutional Review Board (IRB), the
researcher began to collect data. The recruitment strategy was to email a description and purpose
of the study, along with the interview questions that illustrated the nature of the study, to the list
of proposed analysts and managers who met the researcher's selection criteria. After receiving
responses from several potential participants, the researcher began to formalize a relationship
with each participant. The participants confirmed they read the informed consent form provided
by the researcher and acknowledged they were willing to disclose information during the
interview process and agreed to allow the interviews to be recorded by the researcher. The
researcher allotted himself six weeks to conduct the individual interviews and the single focus
group interview. BZC documents were collected throughout the entire data collection period. To
minimize fatigue the semi-structured interviews of the analysts were limited to sixty minutes and
the focus group interview was limited to ninety minutes. Because of the geographical distance
between the researcher and the analysts participating in the study, the interviews of the analysts
were conducted via telephone. Additionally, the single focus group interview was conducted via
a telephone conference that allowed participants to dial in from different locations.
Before conducting any of the interviews each of the participants provided the researcher
with a verbal consent that met the standards of the Capella University IRB and the researcher
confirmed each participant understood their rights. Anonymity was provided by assigning a
numerical value to each participant in the study and no participant names were disclosed at any
point in the research. The data in support of this research was collected solely by the researcher
and the digital recordings and transcripts have been locked in a cabinet in the researcher’s home
and will be destroyed by the researcher after seven years via the use of a cross cut shredder for
documents and via an approved data destruction program for the digital recordings.
Document analysis is a process for systematically reviewing and evaluating documents in
support of qualitative research. Similar to other analytical methods, document analysis requires
the researcher to deeply explore the collected data to elicit meaning and develop a deeper
understanding in support of the research problem. Documents may include “both printed and
electronic material” and include items such as advertisements, agendas, meeting minutes,
manuals, white papers, books, letters, diaries and journals (Bowen, 2009, p. 27). In support of this
research, the BZC provided the researcher releasable documents regarding the job descriptions of
analysts working at the BZC and strategic documents regarding data and analysis at the BZC.
Additionally, to ensure the relevancy of the documents provided by the BZC the researcher only
collected documents published by the BZC between January 1, 2012 and July 31, 2018. These
documents were fully reviewed and the synthesized information was categorized into major
themes for analysis in support of the two research questions posed in this study.
Data Analysis
The process of qualitative data analysis is making sense out of the data and ultimately
discovering themes from seemingly random information (Swanson & Holton, 2005). The
premise for studying two distinct groups was to learn about the lived experiences of
people responsible for setting goals and policies (managers) as well as the lived
experiences of people responsible for gleaning information from large data sets (analysts).
Specifically, the researcher sought to locate themes from managers and analysts currently
working within the big data phenomenon to create an accurate understanding of the two research
questions proposed in this study.
Coding Structure
The process of coding “involves the assignment of numbers or symbols to responses
generated from the interviews so the information can be grouped into a limited number of
categories” (Cooper & Schindler, 2013, p. 652). Creating a coding structure gives the researcher
the ability to take large amounts of raw information acquired from the interviews and categorize
the collected responses into a more manageable scheme for processing and analysis (Cooper &
Schindler, 2013). In qualitative research coding happens as a function in both the preparation of
the data collection process and after the data are collected as a means to efficiently analyze the
data (Cooper & Schindler, 2013). Additionally, it is common in qualitative research for the initial
categorizations and codes to change and evolve during the research process (Gronhaug &
Ghauri, 2010). A coding structure was developed and served as guidance to the researcher to
ensure linkages between the conceptual framework, the research questions, and the data
collection process. In preparation for the semi-structured interviews with the analysts and the
focus group interview with the managers or executives the following initial coding structure was
used as seen in Table 7. This coding structure was modified as the researcher progressed through
the data collection and data analysis phases.
Table 7
Initial Codes
Code   Theme                             Description
MI     Multidisciplinary investigation   BZC data analysis collaborations
MM     Models and methods                BZC analysis capabilities and the statistical models or methods
CD     Computing with data               BZC hardware and software capability
P      Pedagogy                          BZC analyst skills, training, and education
TE     Tool evaluation                   BZC software tools used
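As a hypothetical illustration only (the study itself coded transcripts interpretively in NVivo-11®, not with custom software), the initial coding structure above can be thought of as a lookup from codes to themes against which transcript segments are tagged; the keyword lists below are invented for the sketch:

```python
# Hypothetical sketch: the study coded transcripts interpretively in NVivo-11;
# this naive keyword lookup merely illustrates how the initial coding structure
# maps transcript segments to codes. Keywords are assumptions, not the study's.
INITIAL_CODES = {
    "MI": ("Multidisciplinary investigation", ["collaborat"]),
    "MM": ("Models and methods", ["model", "statistic"]),
    "CD": ("Computing with data", ["hardware", "software", "comput"]),
    "P":  ("Pedagogy", ["train", "educat", "skill"]),
    "TE": ("Tool evaluation", ["tool"]),
}

def tag_segment(segment):
    """Return the initial codes whose keywords appear in a transcript segment."""
    text = segment.lower()
    return [code for code, (_theme, keywords) in INITIAL_CODES.items()
            if any(k in text for k in keywords)]

print(tag_segment("Analysts need more training on statistical models."))
# -> ['MM', 'P']
```

Real qualitative coding is interpretive rather than mechanical, which is why the researcher, not software rules, assigned the codes; the sketch only shows the structure of the code-to-theme mapping.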
Cooper and Schindler (2013) suggested qualitative researchers use an array of
interpretive techniques to describe the phenomena, decode and translate the information drawn
from personal experiences to achieve an in-depth understanding that tells the researcher how and
why things happen. Swanson and Holton (2005) described four levels required for qualitative
data analysis as the following:
- Data organization and preparation – getting the collected data into a form that is easy
to work with, which requires the transcription of the collected data.
- Familiarization – the researcher becomes deeply immersed in the collected data.
- Data reduction (coding) – the researcher begins to organize the information
into meaningful categories.
- Generating meaning – the researcher begins to offer his or her own interpretation.
The following process was applied to conduct the analysis of the qualitative data collected from
the BZC as seen in Figure 5
Figure 5. BZC case study data analysis process. The figure depicts the questions of inquiry
feeding the BZC management focus group, the BZC analysts' interviews, and the document
analysis; data organization and preparation (transcribing the data from the audio recordings and
documents, verifying the transcripts, and organizing the data); familiarization (reading all of the
data); data reduction (coding the data in NVivo®, then recoding the data using the identified
themes and sub-themes); and analysis and interpretation (interrelating themes and descriptions
and interpreting the meaning of the themes and descriptions).
Data Organization and Preparation
All audio files generated from the interviews with the analysts and the focus group with
the managers were transcribed by the researcher into a Microsoft Word® document that was
then imported into NVivo-11®.
Additionally, the researcher typed up his field notes and observations recorded during the
interviews and these were also imported into NVivo-11® for qualitative inductive analysis and
thematic identification. All recordings, transcriptions, scans, and outputs from NVivo-11® will
be kept in an unidentified, password-protected location for seven years and subsequently
destroyed.
Familiarization
During the familiarization process the researcher is actively engaged in the data by
asking questions of the data and making comments (Swanson & Holton, 2005). The researcher
immersed himself in the data by listening to the audio several times and reading and rereading
the data while taking notes and synthesizing meaning from the data. The familiarization process
allowed the researcher to gain a general sense of the collected information and then to note and
understand important aspects that later aided in the analysis portion of the research.
Data Reduction
A large share of the work involved in qualitative analysis is driven by the act of
categorizing and coding. The goal is to begin to identify themes of the collected data and use
codes to represent those emergent concepts (Swanson & Holton, 2005). Several steps are
required in the data reduction process. The researcher is looking for tones, impressions, and
credibility of the collected data while always keeping in the forefront how the collected data
might relate to the research questions proposed in the study (Swanson & Holton, 2005).
Secondly, the process of coding gives the researcher the ability to reduce or simplify the data by
creating categories and gives the researcher the ability to start conceptualizing the collected data.
A code is a tag or label for assigning units of meaning to the collected data and data driven codes
are the most fundamental and most widely used method of coding in qualitative research
(Swanson & Holton, 2005). With continual reading and synthesizing of the collected data,
recurring topics and patterns began to emerge from the data that were then categorized and
properly coded. This process was completed separately for every semi-structured interview
transcript and the transcript of the focus group interview. Additionally, these two sets of outputs
were combined and analyzed together. The last step in the data reduction phase is to start the
generation of themes from the analyzed data. By examining and reflecting on the categories and
themes of each interview and the focus group, overall themes began to emerge (Swanson &
Holton, 2005).
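The combining of the per-transcript outputs described above can be sketched as a simple frequency tally. This is an illustrative stand-in, not the study's actual NVivo-11® workflow, and the per-transcript code lists are hypothetical:

```python
# Illustrative sketch only: the study performed this step in NVivo-11.
# Combining hypothetical per-transcript code assignments and tallying their
# frequencies shows which coded categories recur across the combined data,
# which is how recurring topics begin to suggest overall themes.
from collections import Counter

coded_transcripts = [
    ["P", "MM", "TE"],       # codes assigned in analyst interview 1 (hypothetical)
    ["CD", "P", "P", "TE"],  # codes assigned in analyst interview 2 (hypothetical)
    ["MM", "P", "MI"],       # codes assigned in the focus group (hypothetical)
]

# Tally every code occurrence across all transcripts.
theme_counts = Counter(code for transcript in coded_transcripts
                       for code in transcript)

# Codes with the highest counts are candidates for overall themes.
for code, count in theme_counts.most_common():
    print(code, count)
```

In this invented example, pedagogy (P) recurs most often and would surface first as a candidate theme; the researcher's interpretive judgment, not the tally alone, determines whether a recurring code becomes a theme.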
Analysis and Interpretation
The final phase of the process is the analysis and interpretation of the data. In this phase,
the researcher brings all the generated themes together for formal conclusions and presentation
(Cooper & Schindler, 2013). Through the process of coding and analysis of the collected data,
interpretation and understanding began to emerge for the researcher. In this stage the qualitative
researcher attempts to offer their own interpretation of the phenomenon (Swanson & Holton,
2005). This is done by exploring the codes and categories and asking, how do the themes fit
together? What happens with some combining or splitting of the categories? What patterns
emerge across the themes? What contrasts, paradoxes, or irregularities surface? The themes
that resulted from the data collection and analysis are described in Chapter 4 of this
dissertation.
Ethical Considerations
The researcher obtained approval from the Capella University Institutional Review Board
(IRB) prior to collecting research data from any of the participants. Additionally, the researcher
successfully completed the Collaborative Institutional Training Initiative (CITI) training, which
provided the generally accepted ethical standards for academic human research. After completing this
training, the researcher determined the core ethical principles to address in this research included
informed consent, privacy, confidentiality, and researcher bias. Additionally, the researcher
obtained approvals from the U.S. Air Force Survey Office, U.S. Air Force Human Rights
Protection Office, and the union that represents a portion of the workforce at the BZC.
DOD information security considerations were mitigated by working closely with the
Secretary of Defense Prepublication and Security Review office that is responsible for providing
security reviews of publications regarding DOD information. Additionally, an ethical
consideration of conflict of interest was examined. A conflict of interest is any condition in
which the researcher has an existing relationship with a participant or the sponsoring
organization that could compromise the validity and the findings of the research (Seidman,
2013). The researcher in this study has a long history with the DOD due to his employment with
the Naval Air Systems Command. This conflict was mitigated by not including any naval
aviation organizations in the research. The researcher alone collected the data in support of this
research. The digital recordings and transcripts will be locked in a cabinet in the researcher’s
home and will be destroyed by the researcher after seven years via the use of a crosscut shredder
for documents and via an approved data destruction program for the digital recordings.
CHAPTER 4. RESULTS
Introduction
The purpose of this qualitative case study was to explore how DOD employees conduct
data analysis with the influx of big data. The general business problem is the lack of effective
analysis in organizations operating in the modern-day big data environment (Harris & Mehrotra,
2014). The specific business problem is that DOD organizations may be struggling with gleaning
actionable information from large data sets compounded by immature data science skills of DOD
analysts (Harris, Murphy, & Vaisman, 2013). This research explored the emerging data scientist
occupation and the skills required of data scientists to help determine if data science is applicable
to the DOD. This research aimed to discover if there are fundamental differences between DOD
analysts and data scientists by exploring the professional experiences of analysts and managers
from a critical organization within the DOD. Géczy (2015) suggested big data is a typical
problem in organizations because most organizations are unable to manage and analyze big data
sets. This chapter is organized into sections to explain the data collection results, data analysis
results, summary, and how the collected and analyzed data supported the two research questions
in this study.
The following research questions guided the study:
Primary Research Question 1: How does the Bravo Zulu Center glean actionable
information from big data sets?
Primary Research Question 2: How mature are the data science analytical skills,
processes, and software tools used by Bravo Zulu Center analysts?
The remainder of Chapter 4 is organized to provide details of the participants in the research,
documents that were collected and analyzed, and the themes and patterns that resulted from the
qualitative data analysis of the collected data.
Evaluation of Design and Methodology
Qualitative research stems from a variety of disciplines such as “anthropology, sociology,
psychology, linguistics, communication, economics, and semiotics” (Cooper & Schindler, 2013,
p. 145). As described by Moustakas (1994), qualitative research is an approach to explore how
groups or individuals perceive a specific phenomenon or problem. This type of research involves
collecting data typically in the participants’ settings and inductively conducting analysis of the
collected information looking for themes to provide insight and understanding (Moustakas,
1994). A case study is a qualitative research design to obtain multiple perspectives from a single
organization and is appropriate when questions are being posed to understand a contemporary
phenomenon (Yin, 2009). Case study research is an inquiry about a contemporary phenomenon
that is set within the real-world context when there is a desire to provide an up-close and in-
depth understanding from a single or small number of cases (Yin, 2012). This effective approach
was the rationale for selecting the BZC to help determine if data scientists are warranted in DOD
organizations. The data collected and analyzed from the management focus group, the analysts’
interviews, and the BZC documents supported an exploratory case study approach for this
research. Additionally, the BZC is a complex organization that collects large amounts of data and
is struggling with the analysis of this data to support their mission requirements, making them an
ideal representative case study organization for this research. The data was collected by three
means to support this research. First, semi-structured interviews were conducted with analysts
working within the BZC. Second, a single focus group interview was conducted with managers
within the BZC. Third, job announcements used to hire BZC analysts were collected and
analyzed and a recent BZC strategic planning document was collected and analyzed. The
research design and methodology, participant criteria, setting, data collection and analysis
methods were executed as proposed in Chapter 3. One more analyst than the proposed minimum
was interviewed to ensure saturation.
Data Collection Results
Participants of the research study are generally not chosen because their opinions
represent the dominant opinion but because their experiences and attitudes will reflect the entire
scope of the research problem (Gronhaug & Ghauri, 2010). The researcher used the purposive
sampling method and defined participant criteria based upon minimum seniority and experience
level to include senior managers or executives from the BZC directly responsible or influenced
by large data sets as well as analysts supporting management within the BZC. The research
complied with the policies of the Institutional Review Board (IRB) at Capella University, the
U.S. Air Force Survey Office, and the U.S. Air Force Human Rights Protection Office, and all
the participants met the inclusion criteria. Triangulation is a method used to improve the overall
accuracy of research by combining data collection methods and different types of data (Gronhaug
& Ghauri, 2010). Triangulation for this research was executed by collecting data through semi-
structured personal interviews, a single focus group interview, and document analysis.
Triangulation was accomplished by analyzing the data from the three data sources using the
NVivo-11® software that aided in the identification of patterns and themes.
Interviews
A list of the email addresses of potential participants that met the participant criteria was
provided to the researcher by the BZC personnel office. The researcher then solicited participants
via email that included a description of the research, the adult informed consent form, and the
interview questions. Potential participants consisted of personnel working at any of the BZC
locations with a job title of analyst, and they met the minimum seniority and experience criteria.
Demographic analysis was conducted on the initial composition of potential participants as seen
in Figure 6.
Figure 6. BZC potential analyst participants.
Unexpectedly, the demographic analysis of the potential participant data revealed far more
program management analysts assigned to analyst positions than any other OPM occupation at
the BZC. Eleven semi-structured interviews with analysts were conducted, one more than
originally planned, to ensure saturation. The analysts
that agreed to participate spanned three different OPM job occupations and ranged significantly
in overall DOD and BZC experience. The most senior analyst that participated had forty-five
years of DOD experience and the most junior analyst had nine years of DOD experience. The
participant with the most BZC center experience had fourteen years of experience and two
participants had just completed two years working at the BZC. The analyst participants were
assigned a numeric value to ensure their anonymity as seen in Table 8.
Table 8
Interviewee Experience Levels
Pseudonym OPM Code/Occupation DOD Experience BZC Experience
Analyst 1 2003/Supply Analyst 17 Years 8 Years
Analyst 2 1515/Ops Research Analyst 18 Years 2+ Years
Analyst 3 2003/Supply Analyst 35 Years 8 Years
Analyst 4 1515/Ops Research Analyst 17 Years 2+ Years
Analyst 5 1515/Ops Research Analyst 33 Years 6 Years
Analyst 6 0343/Program Analyst 16 Years 6 Years
Analyst 7 0343/Program Analyst 45 Years 13 Years
Analyst 8 1515/Ops Research Analyst 13 Years 5 Years
Analyst 9 0343/Program Analyst 19 Years 14 Years
Analyst 10 1515/Ops Research Analyst 9 Years 9 Years
Analyst 11 0343/Program Analyst 41 Years 6 Years
The researcher shared the purpose of the exploratory research with each participant, and
the researcher read the adult informed consent form out loud and received verbal consent from
each participant before conducting the interviews. The open-ended interview questions ensured
alignment with the conceptual framework and were grouped within the initial coding structure
and supported the two research questions. The analysts’ interviews were recorded using a
smartphone application. The interviews were then downloaded onto the researcher’s personal
computer and the audio recording files were imported into the NVivo-11® software. Each audio
interview was transcribed by the researcher, and the document files were imported into
NVivo-11®, which aided in the thematic analysis.
Focus Group
A list of the email addresses of potential focus group participants that met the participant
criteria was provided to the researcher by the BZC personnel office. The researcher then solicited
participants via email that included a description of the research, the adult informed consent
form, and the interview questions. Potential participants consisted of managers or executives
working at any of the BZC locations that met the minimum seniority and experience criteria.
Seven managers and one executive participated in the focus group, and each participant was
assigned a generic manager title and a numeric value to ensure their anonymity (see Table 9).
Table 9
Management Focus Group Experience
Pseudonym DOD Experience BZC Experience
Manager 1 35 2
Manager 2 32 24
Manager 3 30 10
Manager 4 19 3
Manager 5 16 14
Manager 6 20 15
Manager 7 34 24
Manager 8 17 12
The researcher shared the purpose of the exploratory research with each participant, and the
researcher read the adult informed consent form out loud and received verbal consent from each
participant prior to conducting the focus group interview. The researcher confirmed with each
participant that they met the seniority and minimum experience participant criteria. The
researcher asked the same initial open-ended questions to the management focus group that were
asked to the analysts, and the interview questions ensured alignment with the conceptual framework
and were grouped together within the initial coding structure and supported the two research
questions. The focus group interview was eighty-six minutes in duration and was recorded using
a smartphone application. The interview was then downloaded onto the researcher’s personal
computer, and the audio recording was then transcribed by the researcher and imported into
NVivo-11®, which aided in the thematic analysis.
Document Analysis
Two different types of documents were collected and analyzed in support of this research.
Job announcements were collected to explore the skills required of newly hired analysts to help
determine if the BZC is hiring data science skills into their organization. Additionally, a strategic
planning document that encompasses a vision of data and analysis for the BZC to achieve was
collected and analyzed. The documents that were collected to support this study are seen in Table
10. To ensure the confidentiality of the case study organization, the title and citation of the
BZC’s job announcements and strategic document are not disclosed in this research.
Table 10
BZC Collected Documents
Document Type Document
Job Announcement Program Management Analyst
Job Announcement Operations Research Analyst
Job Announcement Computer Scientist
Job Announcement Supply Systems Analysts
Strategic BZC Strategic Planning Document
This research explored whether the federal occupations within the BZC workforce provide the
necessary skills for big data analysis and how aligned these federal occupations are with those of
the perceived data scientist. The data collection and analysis supported the two research
questions in this study by exploring how the BZC gleans actionable information from big data
sets and how mature the data science skills of analysts, processes, and software tools used within
the BZC are. Analyzing BZC job announcements for analysts and computer scientists and coding the
job and skills requirements from these job announcements into NVivo-11® aligned with the
initial coding structure and conceptual framework provided insights on the BZC’s requirements
of analytical talent. The BZC personnel office provided job announcements for analysts and
computer science occupations. These announcements were imported into NVivo-11® and the
duties and skills requirements were coded using the initial coding structure aligned with the
conceptual framework, and the results are provided later in this chapter.
To explore how the BZC uses data and gleans actionable information from big data sets, a BZC
strategic planning document was collected and analyzed.
This publicly available BZC strategic document suggests the U.S. Air Force has only started to
realize the full potential of an integrated logistics and sustainment enterprise and the ability to
access and analyze data will play a key role. This strategic plan for the BZC categorizes the
actions to achieve the vision into nine distinct attributes. Attribute #1 sets a vision for the BZC to
build and analyze their data more effectively. This document was imported into NVivo-11®, and
the content of attribute #1 was coded aligned with the initial coding structure and conceptual
framework and the results are provided later in this chapter.
Data Analysis and Results
In qualitative research, findings result from a process of data collection, interpretive or
analytical processing, and reporting (Cooper & Schindler, 2013). In support of the two research
questions guiding this study, the role of the researcher was to explore how the BZC gleans
actionable information from large data sets and how mature the data science skills of analysts,
processes, and software tools are at the BZC to help determine if the data scientist occupation is
warranted in DOD organizations. By posing questions to professionals working within the BZC,
their responses yielded patterns regarding big data and data sciences and themes have been
generated for actionable conclusions and the support of further research.
The process of coding “involves the assignment of numbers or symbols to responses
generated from the interviews so the information can be grouped into a limited number of
categories” (Cooper & Schindler, 2013, p. 652). Creating a coding structure gives the researcher
the ability to take large amounts of raw information acquired from the interviews and categorize
the collected responses into a more manageable scheme for processing and analysis (Cooper &
Schindler, 2013). The researcher developed the research questions and ensured alignment with
the initial coding structure and conceptual framework. The interview questions were open-ended
which enabled semi-structured conversations about how the BZC gleans actionable information
from big data sets and how evolved the data science skills, processes, and software tools are at
the BZC. The coding and analysis of the interviews with the analysts served as the baseline for
the enhanced coding structure and were then used in the coding and analysis of the focus group
interview and the BZC documents. The initial coding structure is restated in Table 11 for
convenience.
Table 11
Initial Codes
Code Theme Description
MI Multidisciplinary investigation BZC data analysis collaborations
MM Models and methods BZC analysis capabilities and the statistical
models or methods
CD Computing with data BZC hardware and software capability
P Pedagogy BZC analysts’ skills, training, education
TE Tool evaluation BZC software tools used
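The initial coding structure above is, in effect, a small mapping from parent codes to themes. The following sketch is illustrative only: the study used NVivo-11® for all coding, not custom code, and the coded interview segments shown are hypothetical. It simply shows how such a structure could be represented and how segments assigned to each parent code might be tallied.

```python
# Illustrative only: the study used NVivo-11(R) for coding, not custom code.
# The initial parent codes from Table 11, represented as a mapping.
INITIAL_CODES = {
    "MI": "Multidisciplinary investigation",  # BZC data analysis collaborations
    "MM": "Models and methods",               # analysis capabilities, statistical models
    "CD": "Computing with data",              # hardware and software capability
    "P":  "Pedagogy",                         # analysts' skills, training, education
    "TE": "Tool evaluation",                  # software tools used
}

def tally_segments(coded_segments):
    """Count how many interview segments were assigned to each parent code."""
    counts = {code: 0 for code in INITIAL_CODES}
    for code, _text in coded_segments:
        if code in counts:
            counts[code] += 1
    return counts

# Hypothetical coded segments for illustration:
segments = [("MI", "we pull data from several sources"),
            ("CD", "legacy systems constrain access"),
            ("MI", "analysts collaborate with engineers")]
print(tally_segments(segments))
```

Such tallies correspond roughly to the per-code coverage a qualitative tool reports while the researcher refines the coding structure.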
Several iterations of reading and coding were required in the data reduction process, and the
researcher looked for tones, impressions, and credibility in the collected data while keeping at
the forefront how the collected data related to the research questions in this study. With
continual reading and synthesizing of the collected data, recurring topics and patterns emerged.
The coding structure was refined as the transcripts of the analysts and focus group interviews
were coded and analyzed and resulted in the final coding structure (see Figure 7).
Figure 7. Final hierarchical coding structure. Shaded codes represent the initial coding structure.
Semi-Structured Interviews Analysis and Results
The transcriptions of the 11 analysts’ interviews were loaded into NVivo-11® and each
interview was coded to the initial parent codes aligned with the conceptual framework. After the
initial coding and analysis of the transcribed interviews of the 11 analysts, a word frequency
query was used in NVivo-11® to generate Figure 8. The word “data” was removed from all word
frequency queries because it was used so overwhelmingly.
Figure 8. Initial analyst interviews word frequency diagram.
The initial analysis of the semi-structured interviews with the analysts suggests early themes of
analysts’ skills, analysis, training, organizations, and information systems as seen in Figure 8.
The word frequency query was then modified to display only the fifteen most used words by the
analysts to further identify the early themes. This additional query still demonstrated early
themes of analysts’ skills, analysis, training, organizations, and information systems but
additional themes of programs, scientist, engineers, research, problem, pull, and management
emerged as seen in Figure 9.
Figure 9. Refined analyst interviews word frequency diagram.
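A word frequency query of this kind can be sketched in a few lines. This is illustrative only: the study ran the query inside NVivo-11®, not in custom code, and the transcript snippets below are hypothetical. The sketch counts words across transcripts, drops the stop word “data,” and keeps the most frequent terms, mirroring the refinement to the fifteen most used words.

```python
# A minimal sketch of a word frequency query like the one run in NVivo-11(R):
# count words across interview transcripts, drop the stop word "data", and
# return the n most frequent terms. Transcript text here is hypothetical.
import re
from collections import Counter

def top_words(transcripts, n=15, stop_words=("data",)):
    """Return the n most frequent words, excluding the given stop words."""
    counts = Counter()
    for text in transcripts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)

transcripts = ["We pull data and build metrics for analysis",
               "Analysis of data requires training and skills"]
print(top_words(transcripts, n=5))
```

Restricting `n` narrows the display to the dominant terms, which is how the additional early themes in Figure 9 were surfaced.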
Several open-ended interview questions were posed to the eleven analysts that
participated in the research to further explore the research questions on how the BZC gleans
actionable information from big data sets and how mature the data science skills, processes,
and software used by BZC analysts are. The interview questions were designed to gain a deeper
understanding of how BZC analysts conduct analysis, their perceptions of big data, challenges
associated with conducting data analysis, the software tools used to conduct data analysis,
training options for analysts, and their perceptions of data science. Several themes emerged from
the analysis of the collected data which helped to answer the research questions posed in this
study.
Research Question #1: How does the Bravo Zulu Center glean actionable information
from big data sets?
The analysts were asked initial open-ended questions investigating whether the BZC is
experiencing the big data phenomenon, the perceived benefits and liabilities of big data, and their
conceptions about the term big data. The responses provided insights about the concept of big
data, data growth and the ability of the BZC to analyze large data sets. The participants’
responses are provided in Table 12.
Interview questions posed regarding big data:
How do you define big data? What increases of digital data (big data) have you witnessed and
how has it impacted the business of the BZC?
The complete list of initial interview questions is provided in Appendix A.
Table 12
Analysts’ Responses to Questions about Big Data
Participant Comment
Analyst 2 I think at least the fundamental concept of big data is integrating multiple data
sources so that you’ve got a better picture of your overall output or just trends.
This is something where we should be working toward. There is very little that
we are doing with big data.
Analyst 3 It’s so big you haven’t figured out either the way to do it or the time to do it, to tie
things together in a meaningful way; that is what I think our situation is.
Analyst 4 Yes, it has grown exponentially from the 80s. However, many of our systems for
data collection rely on the compliance of human beings.
Analyst 5 There are vast amounts of sensor data on new weapon systems that are available.
I believe big data is anything bigger than a standard desktop application can
handle, it is going to involve data formats above and beyond structured tables and
lists. It is going to include things like scanned images, and we’ve got information
systems that involve scanned images, it could be audio, it could be video, it could
be free form text, we’ve got lots of forms with check boxes and then free form
boxes for somebody to write something in there. Big data is going to be a huge
volume and it may be coming at you at a very rapid rate.
Analyst 6 I haven’t noticed an increase in the data; I have noticed a trend to try and
modernize how the data is being gathered, maintained and shared.
Analyst 8 I think to someone who comes from a statistics background, who has been in the
field of statistics for a long time, their version of what constitutes big data is
totally different than someone who is a computer scientist or programmer. I
would say, big data in today’s day and age is millions of records if not
billions and trillions. I don’t know that we are capturing more data, per se, in the
BZC, although I think there is a push to want to capture more than what we
already are. I think big data and big data analytics is a trend, but it is a trend that
is here to stay and I think the Air Force needs to jump on the bandwagon.
Analyst 11 We have so much data and you’re right it is growing exponentially and it’s really
kind of overwhelming for the average employee.
Big data theme. By coding and analyzing the transcripts from the analysts’ interviews
through the (MI) initial code regarding big data, thematic elements common in the literature
review were revealed. The BZC is a complex organization with many disparate data systems
generating large data sets and is struggling with gleaning actionable information from the data
sets. The BZC’s situation supports Moorthy et al.’s (2015) definition of big data “as the collection
of data sets so large and complex that it becomes difficult to process using traditional relational
database tools and traditional data processing applications” (p. 76).
The analysts were posed questions that further explored how the BZC gleans actionable
information from big data sets. The participants were asked to explain how data is used within
the BZC to meet mission requirements. The participants were also posed an open-ended question
that explored any dependencies on data. The participants’ responses are provided in Table 13.
Interview questions posed regarding data usage:
How is data used in your organization to meet mission requirements? What are some areas in
your organization that are dependent on data?
The complete list of initial interview questions is provided in Appendix A.
Table 13
Analysts’ Responses to Data Usage Questions
Participant Comment
Analyst 2 But they haven’t been able to tell people how they perform historically and we’ve
had to go back and develop all that for them as far as metrics and other things like
that and a Pareto chart. Then we did some follow up DSCM work after that and
developed metrics and goals.
Analyst 3 So, we are big on metrics, number one, so we pull down a lot of data just to
satisfy populating metrics, but for the majority of the metrics there’s not a lot of
analytical things that go along with it; it’s just we pull down the data and you
populate a metric and then you’re done. There’s others that we do where we pull
the data, populate a metric, and maybe being in or out of tolerance warrants
doing some analysis, and any time you do analysis you then have to start pulling
down the raw data that facilitates doing that.
Analyst 5 So, we do the planning and we generate metrics from all sorts of data to assess
how well the supply chain is performing. One of the big things that we look at is
metrics, how well are we doing and there are different definitions of the metrics
depending on which organization you are talking to. But even within the BZC
there is going to be different definitions of the metrics.
Analyst 7 One of the things they have is a measurement for output per man-day. So a
man-day would be, let’s say, people on an 80-hour pay period over a two-week period.
Analyst 9 I go and evaluate an organization and they’ve really never tracked it before, like
on a spreadsheet or database or anything, because it was never really evaluated as
something that was important; there are other metrics that they are looking at.
Analyst 11 People do a lot of the gathering of the data and metrics, reporting and that sort of
thing.
Metrics theme. In previous decades, data and metrics were limited and essentially rolled
into aggregated key performance indicators and presented to executives. Many of the decisions
and much of the direction of the firm were placed in the hands of executives who relied heavily on their
experiences and intuition. The ability to analyze big data stands to completely change this
business model but requires a significant investment in the culture of the organization
(Brynjolfsson & McAfee, 2012). The responses to the interview questions posed to the analysts
regarding how data is used within the BZC were coded using the (MI) initial code aligned with
the conceptual framework. The analysis of the collected data suggests a theme of metrics and the
BZC places emphasis on managing their business through the analysis of metrics. The analysts
reported that they spend a significant amount of time pulling data together and creating metrics for
their leadership.
The analysts were asked initial open-ended questions that continued to explore how the
BZC gleans actionable information from big data sets and associated challenges. The participants
were asked to explain the challenges in gleaning actionable information from big data sets. The
participants’ responses are provided in Table 14.
Interview question posed regarding big data analysis challenges:
What are some of the significant challenges associated with conducting data analysis in your
organization?
The complete list of initial interview questions is provided in Appendix A.
Table 14
Analysts’ Responses to Questions Regarding Data Analysis Challenges
Participant Comment
Analyst 1 We definitely have problems with data quality and I think as the data increases
the challenges increase.
Analyst 2 They have so little actionable big data; we lack the infrastructure and the
knowledge to really bring it all together.
Analyst 3 We do have plenty of data. The data warehouse that I use mostly, there hasn’t
been an increase in the data that’s been collected, however there’s been a change
of what’s been exposed to us.
Analyst 4 The reliability of our data is poor.
Analyst 5 You made an allusion to a data pool or data warehouse. It’s not out there, there is
an immense amount of time and effort that has to be applied to knowing where
the data is at and then going out to fetch it.
Analyst 6 The biggest challenge is getting appropriate access to those systems to extract the
information. It seems that we are still very protective of letting other Air Force
employees get into systems and pull what needs to be pulled. That’s a challenge
that I experience on a daily basis. Who owns the data, people allowing you to see
their data, you could have better decision support if you have access to certain
data, but getting that access is often difficult from the person who controls it so
that is a challenge.
Analyst 7 I don’t believe that there is a problem with collecting data, and really even in
some cases the way they report. I think it is probably just not as accurate as it should
be.
Analyst 10 So IT alone aside from software is another issue, but sometimes the lack of data
or missing information.
Access to quality data theme. By coding and analyzing the transcripts from the analysts’
interviews through the (MI) initial code, access to quality data emerged as a theme. The analysts
indicated infrastructure and policies are constraining access to data. Additionally, the data that is
accessible lacks accuracy and completeness. Watson and Marjanovic (2013) suggested a
challenge with harnessing the power of big data includes accessing data through appropriate
platforms and providing data governance. A BZC data governance strategy that includes how
analysts get access to quality data to support mission requirements is warranted.
As the dialog continued between the researcher and the analysts regarding the challenges
associated with conducting data analysis at the BZC, additional sub-questions were posed to
each participant to further explore the factors constraining access to quality data within the BZC.
The participants’ responses are provided in Table 15.
Interview questions posed regarding big data analysis challenges:
What are some of the significant challenges associated with conducting data analysis in your
organization? What are some factors limiting access to quality data?
The complete list of initial interview questions is provided in Appendix A.
Table 15
Analysts’ Responses Further Exploring Access to Quality Data
Participant Comment
Analyst 2 We have so little actionable big data; we lack the infrastructure and the
knowledge to really bring it all together. As far as advanced analytics, the
infrastructure hasn’t been established, a couple of people have tinkered with it.
We desperately need the infrastructure and the hardware and the software to get
started, management needs to understand that when they set up big data, it’s a lot
like owning a boat, you are going to pour in a lot of money and we may not see a
real viable return on investment for 3-5 years.
Analyst 4 So you can imagine if the Air Force or DOD decided to go to a cloud-based
system, the millions upon millions of records that we have would have to be
scrubbed. Most of them could be done automatically.
Analyst 5 We’re still running on many dozens of legacy data systems that have their roots
decades ago and we are still using those legacy systems to do our planning.
Analyst 6 How do we transition this information from legacy systems that are piecemealed
into a larger common database that we can actually do things with and
make informed decisions and connect the dots where we know we haven’t been
able to in the past? How do we merge everything together to where we can really
start tackling some of these big problems instead of just wringing our hands over
it?
Analyst 10 But they are now doing a lot of data mining, getting all of this information from
these program offices and putting it into web-based databases where anybody can
go in and get this information, and I think it’s very important. And now they are
talking about going to the cloud and having a lot of the information available in
the cloud, although the Air Force is behind in that.
Analyst 11 There are a lot of things we don’t know; we’ve got the data out there but it is in so
many disparate forms and so many disparate systems that it is virtually
impossible for us to know what we truly have and what we can do. So I have been
trying to get us pushed in that direction.
Infrastructure: Legacy and disparate systems theme. Edward (2014) suggested the
essence of analyzing big data within the DOD requires the aggregation of many data sources
from hundreds of organizations, which requires defining the legal, policy, oversight, and
compliance standards for data sharing to make it happen. According to Watson and Marjanovic (2013), the
challenge with harnessing the power of big data includes identifying which sectors of data to
exploit, getting data into an appropriate platform and integrating across several platforms,
providing governance, and getting the people with the correct skill sets to make sense of the data.
Interview questions were posed to the participants regarding what challenges and opportunities
they faced to conduct big data analysis and the responses that were related to information
systems were coded using the (CD) initial code aligned with the conceptual framework. The
analysis of the collected data suggests the BZC has sections of their business with modern
computer infrastructure and analysis capabilities, but their business is also constrained in the
ability to conduct enterprise big data analysis partially due to outdated or legacy information
systems, infrastructure, and many disparate systems.
To further explore the research question of how the BZC gleans actionable information
from big data sets, the analysts were posed questions further exploring how data is used within
the BZC to meet mission requirements and how BZC employees conduct data analysis.
Additionally, sub-questions were posed to the participants to determine how evolved the BZC is
in their ability to build predictive and prescriptive metrics and models. The participants’
responses are provided in Table 16.
Interview questions posed:
How is data used in your organization to meet mission requirements? How do BZC analysts
glean actionable information from big data sets?
The complete list of initial interview questions is provided in Appendix A.
Table 16
Analysts’ Responses to Data Usage and Data Analysis Questions
Participant Comment
Analyst 1 We spend a lot of time now, just pulling from different sources and then putting it
all together then trying to analyze it.
Analyst 2 Really most of the things they are doing are elementary data pulls where they
compile the data and it’s just count data. Very simple elementary computations;
they for the most part make sure that the data is valid and they compile it and it’s
like, here are your top 10. We populate it and run a couple of queries and stuff
numbers into PowerPoints. Most of the data is not aggregated. Some of the guys
have taken a class dealing with neural networks, but we haven’t really played
with that very much. As far as predictive modeling, that’s not really how the BZC
people view it; they only look at count data. I mean, if you’re talking predictive
capability, the stuff that we have would be less than 1%.
Analyst 3 The majority is pulling raw data; there are a few pre-defined. But most of it is
pulling down raw data that you kind of either manipulate inside of the system,
with calculations or things like that, to produce the answer you are looking for,
or your other option is to export into Excel.
Analyst 4 We actually built a simulation model using Arena; we are still in the very
beginning of text analysis.
Analyst 5 You made an allusion to a data pool or data warehouse. It’s not out there, there is
an immense amount of time and effort that has to be applied to knowing where
the data is at and then going to fetch it.
Analyst 10 We try to get enough data where we can find trends to try to mitigate any issues.
We have access to a system and we pull it up and we see what has failed and what
hasn’t and it’s a very old system and we export it into excel, unfortunately it
duplicates some things and so we have to literally go through, take out duplicates
and then make charts, pivot tables and what not to analyze the data. I mean, we
have useful things that we have predicted for certain aircraft parts or even for an
aircraft itself and most of them are well past their useful life.
Analyst 11 So, right I don’t have access to most systems, I have very few systems that I
actually access, I typically will contact other people if I need a data pull from a
system. For example, I have gone to DP, to personnel and told them I want a list
of every mechanic by skill and by shop, where they work so I can try and do
some analysis on how many sheet metal mechanics it takes for different weapon
systems, so that when I have a new weapon system come on board maybe I can
be better informed on how many mechanics I will need for that.
Data analysis processes theme. Data science brings many processes, techniques,
and methodologies together with a business vision to drive actionable insights (Granville, 2014).
Much of the expectation involved in big data analysis is the continued desire by company and
DOD leaders to move from reactionary metrics based on historical data to the predictive and
prescriptive metrics that may be possible with big data analysis. Research on big data and data
science suggests the ability to locate hidden facts, indicators, and relationships immersed in big
data sets not yet explored (Chen et al., 2012). The interviews were coded and analyzed using
the (MM) initial code aligned with the conceptual framework. The analysis of the collected data
suggests the BZC is mostly building and analyzing reactive metrics on historical data, with small
pockets of predictive analytical capability. Additionally, many of the data analysis processes are
manual, reliant upon pulling data from many disparate data warehouses and analyzing
the data in basic analysis software.
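The manual workflow the analysts describe, exporting to Excel, stripping duplicates by hand, and building pivot tables, is readily scripted. As a hedged sketch only (the column names and records below are invented for illustration, not BZC data), Python’s standard library alone covers the de-duplication and the pivot-style counting:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical export from a maintenance system. Column names and
# values are invented for illustration; real exports would differ.
raw = """part_id,aircraft,status
P-100,C-130,failed
P-100,C-130,failed
P-205,C-130,passed
P-310,KC-135,failed
"""

rows = list(csv.DictReader(StringIO(raw)))

# Step 1: drop exact duplicate rows (the step done by hand today).
seen, unique = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Step 2: a pivot-table-style count of failed records per aircraft type.
failures = Counter(r["aircraft"] for r in unique if r["status"] == "failed")
print(sorted(failures.items()))  # prints [('C-130', 1), ('KC-135', 1)]
```

Even this small step would remove the hand de-duplication described in the interviews; tools such as pandas or R would shorten it further.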
To further explore how the BZC gleans actionable information from big data sets and the
challenges associated with conducting big data analysis, the participants provided input regarding
organizational structure and the culture within the BZC. The participants’ responses are provided
in Table 17.
Interview questions posed:
What are some of the significant challenges associated with conducting data analysis in your
organizations? How are analysts employed and aligned in your organization?
The complete list of initial interview questions is provided in Appendix A.
Table 17
Additional Responses to Analysis Challenges Questions
Participant Comment
Analyst 2 It’s kind of a mixed model, if you will. They’ve got it centralized in some of it,
where we’ve got an entire flight, I think of about a dozen analysts, including
interns. Real world problems are not going to be exactly like the book. They lack
creativity, we have people that are so used to the military model, where everything
is provided in some kind of Reg or SOP, or TOP or something like that.
Analyst 3 Right and because of that, I think that’s why at least in the BZC, that’s why we
have the volume of analytics being done by contractors. So, they are not actually
government employees, it’s just a contractor that’s doing it.
Analyst 4 We don’t cross talk well. There is still a lot of protectionism about data and about
systems. We don’t have enough data scientist, people to go collect the data. We
need more data scientist folks to go out and collect the data and feed it to us.
Analyst 5 I think as an organization we’re going to have to have a deliberate plan to mature
the analysis capabilities and the ability of the organization to consume those
products.
Analyst 6 I do think there is a problem within our command air force materiel command
that I’m aligned to we are trying to address it even within the center through CSF,
they’re called center senior functionals. Recognizing it is going to take several
years but they are trying to bring in 1515s. We have very few 1515s in the center,
very few.
Analyst 10 There’s never been a data analysts ever that have worked for the quality
department before so this is brand new. Sometime I also find that the willingness
of people to work with you and communicate. There is a lot of people that don’t
like to communicate. I don’t know about the other branches but the air force is so
far behind and I fear that it is making it difficult and I think it is deterring a lot of
analysts away. We work overtime every week and have a huge back log of things.
I think sometimes, people don’t understand what we are doing and why we are
doing it.
Analyst 11 People are shorthanded so they don’t have the time to do the analysis. So we
don’t have very many people with that skill set I think that if we grew that skill
set so that, for example I don’t think we have, we don’t have a 1515 in LGX or I
believe in LGA. I’m not sure how far down in the organization maybe each
division would have to have at least one data scientist and maybe make it at the
13 level or even the 12 target 13, something like that.
So I think the answer to that is you bring in some data scientists to train the
functional specialists on how to do the thing.
Organizational structure and culture theme. Gabel and Tokarski (2014) suggested that for
organizations to harvest actionable information from big data sets, deliberate changes are
required in many facets of organizational design and the management of human resources.
Harris and Mehrotra (2014) proposed that senior management will need to learn how best to
employ and manage data scientists. Many large organizations are now creating a core hub of
data scientists to foster an environment of sharing information and technology. Additionally,
because data scientists are a scarce commodity, many organizations are embedding data
scientists within existing data analysis groups. Creating teams that combine business analysts,
visualization experts, modeling experts, and data scientists from different disciplines and
functional areas may provide the most effective employment strategy (Harris & Mehrotra,
2014). When discussing the challenges associated with conducting big data analysis within the
BZC, a theme of organizational structure and culture was apparent; determining how best to
employ data scientists and how to create a culture that shares data and information is warranted
at the BZC.
Further investigating how the BZC gleans actionable information from big data sets and
the challenges associated with conducting data analysis, the participants provided additional
insights. The participants’ responses are provided in Table 18.
Interview question posed:
What are some of the significant challenges associated with conducting data analysis in your
organizations?
The complete list of initial interview questions is provided in Appendix A.
Table 18
Additional Analysts’ Responses to Challenges Questions
Participant Comment
Analyst 2 We are just getting there with leadership, they continue to do the same thing, yet
expect different results.
Analyst 3 We are big on metrics, number 1 so we pull down a lot of data just to satisfy
populating metrics but there’s not, the majority of the metrics there’s not a lot of
analytical things that go along with it, it’s just we pull down the data and you
populate a metric and then you’re done.
Analyst 4 There is a disconnect sometimes with leadership on how long it takes to actually
build both the models whether it’s simulation models on some other type of data
model.
Analyst 5 Within BZC there is going to be different definitions of the metrics. So you use a
different method or a different data set to calculate something you are going to
get a different result and they are never going to agree. We’ve got to bring our
managers along and as we rotate senior managers we’ve got to make sure they’ve
got that capability to consume those products.
Analyst 7 My biggest issue on rotating the leaders is that from an organizational
development perspective if you look at team development principles, you keep
your team in a constant storming stage versus getting to the norming and
performing stages.
Analyst 8 Management is wrapped up in taskers and the bureaucracy of how things are and
what their leadership wants them to do that we never get to do anything advanced
here.
Analyst 9 I see the newer leadership coming up, moving into the leadership positions, and
they do not know what to do with data. We are not educating our senior leaders to
think methodically and to really use data, and when I say use it, it is, ok,
understand it, that’s a piece; can you interpret it, because that’s the other piece.
You have to understand it and be able to interpret it so that way you can speak to
it.
Analyst 10 Another one that you kind of had touched on that I made a note to is relevance.
When I came into this office there were people putting information down and
trying to put stuff together that really didn’t make any sense.
Management theme. Harris and Mehrotra (2014) proposed that leadership is a top
management challenge in the era of big data. Companies may need to train incumbent managers
to be more numerate and data literate, as well as hire new managers who already possess the
skills to lead in the era of big data. Participants provided statements regarding how leadership
consumes analysis information and the difficulty of determining what metrics to use to measure
the success of the BZC. The BZC is a military organization that rotates its military leaders often,
and the participants suggested this creates challenges for BZC analysis.
Research Question 2: How mature are the data science analytical skills, processes, and
software tools used by Bravo Zulu Center analysts?
The analysts that participated were posed open-ended questions investigating the maturity
level of analytical skills, processes, and software that are used within the BZC. The initial open-
ended questions were designed by the researcher to explore the skills required to be an effective
analyst within the BZC as perceived by the participants. The initial open-ended questions
investigated if there are perceived data science skills being used by BZC analysts and the
maturity of those skills. The participants’ responses are provided in Table 19.
Interview question posed:
What are some knowledge, skills, and abilities needed to be an effective data scientist? What are
the data science skills that are used by BZC analysts? How evolved are the data science skills
within the BZC?
The complete list of initial interview questions is provided in Appendix A.
Table 19
Analysts’ Responses to Data Science Skills Questions
Participant Comment
Analyst 1 You definitely need to know how to manipulate data in excel and even to
manipulate data in; we have a system called LIMS-EV BOB J, business objects.
Being able to write scripts to pull data, different types of data that you need to do
your analysis. So you definitely need some computer skills. Yes, some basic
programming skills, because that is exactly what you are doing when we are
using LIMS. You don’t have to be a math scientist to do it, but you do have to be
able to count. I think you definitely have to be able to do critical thinking,
thinking out of the box, to be a good analyst and a lot of it comes with time, the
more experience you get the more things you know you need to look for, you
know the right questions to ask. You have to be inquisitive. It’s hard to find
people that have all of those skills, and it takes a long time to get skills on both of
those domains. You have to have people that are self-motivated.
Analyst 2 We are just barely scratching the surface. Very little; most of the data is not
aggregated, some of the guys have taken a class dealing with neural networks, but
we haven’t really played with that very much. I’ve built some elementary
predictive models looking at the relationship between different variables and how
it affects asset availability and I mapped most of those so people can understand
the interactions between those. As far as predictive modeling, that’s not really how
the BZC people view it, they only look at count data. We are in supply and this is
a virgin canvas, nobody has touched it, they haven’t sprinkled science on any of
this stuff. I mean we look like freaking rock stars helping these people and we are
not even getting into the really cool or interesting tools yet. I’m thinking this
is a fantastic field and warranted.
Analyst 3 We use the word analyst quite often but there really are no true analysts in our
organization. There’s probably I think six others that fill, quote, unquote, analyst
role and none of us are true analysts, we just are people that kind of know the
supply chain, know how to pull down data, know how to make heads or tails of it,
know how to spin it, know how to write a few internal, in the system, internal
calculations or variables, things like that, and so we pull down the data and we
kind of come up with some, ya know, basic results; that’s why we have the
volume of analytics being done by contractors. In my eyes there could be so
much more achieved if the knowledge base or the skill set were to grow.
Analyst 4 Data science, I believe there is a specific need at least on an interim basis as we
transition from all the siloed data systems to data lakes, cloud based. Getting the
skills to be able to do that and to actually do it is time consuming. We are still in
the so very beginning of text analysis.
Analyst 5 Now we do have one analyst that was able to add a simulation package.
So we will be able to build some simulation models there and use those. So in bits
and pieces we lurch forward. Visualization of findings, right now we are slapping
together slide decks, sometimes with 200 slides in them.
Analyst 10 I don’t really know the difference between what they would consider a data
scientist or an operational research analyst. To me they are doing the same thing,
you are diving for data, you are looking for data, you are trying to analyze it or
use it to analyze in order to make impact decisions for problems or for systems. I
don’t think that the title really makes much of a difference other than the fact that
operations research analysts are predominately in financial.
Analyst 11 We have the 1515 job series, operations research analysts, so those people are
very valuable they are really, when you talk about a data scientist that kind of
what I think of that person would be so we don’t have very many people with that
skill set I think that if we grew that skill set so that, for example I don’t think we
have, we don’t have a 1515 in LGX or I believe in LGA. Yes, and I also believe
that we have a lot of analysts that could easily be trained with those additional
skill sets, and I would argue that they would make the better one because they’ve
got the experience in that area, whatever that area is. I would say we do need
some more analysts and probably a data scientist at a low enough level that they
can train others to increase their skills sets would be really good.
Data science skills theme. Davenport and Patil (2012) proposed that data scientists are
experts at gleaning actionable information from massive amounts of data. Data scientists use
traditional science, math, and statistics coupled with modern software and analysis techniques to
turn raw data into actionable information. Data science is a combination of business engineering
and business domain expertise, data mining, statistics, and computer science, along with
advanced predictive capabilities such as machine learning (Granville, 2014). The participants
agreed with scholarly views of the perceived data science skills and unanimously agreed that
those skills are immature within the BZC. Six analysts agreed that data science is a unique role
beyond that of a traditional analyst, two analysts suggested the role of the data scientist does not
have to be unique, and three analysts were unsure. Additionally, the participants acknowledged
that there is no data science occupation within the Federal OPM job structure and expressed that
very few analysts within the BZC possess the complete range of the perceived data science
skills. Several analysts indicated that the operations research analyst is the occupation most
closely related to a data scientist, and several participants submitted that growing data scientists
from the existing analytical workforce would be the most effective approach. Additionally, four
sub-themes emerged from the data collection and analysis: access to software, access to training,
competition for talent, and domains.
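To make the reactive-versus-predictive distinction concrete, the step from count data to an elementary predictive model can be as small as an ordinary least-squares line. The sketch below is illustrative only; the flight-hour and failure figures are invented, not BZC data:

```python
# A minimal move from count data to prediction: ordinary least-squares
# fit of part failure counts against flight hours (all numbers invented).
hours = [100.0, 200.0, 300.0, 400.0]   # monthly flight hours
fails = [2.0, 4.0, 6.0, 8.0]           # observed failure counts

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(fails) / n

# Slope and intercept of the least-squares line y = a + b*x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, fails)) \
        / sum((x - mean_x) ** 2 for x in hours)
intercept = mean_y - slope * mean_x

def predict(flight_hours):
    """Predicted failure count at a given number of flight hours."""
    return intercept + slope * flight_hours

print(predict(500.0))  # extrapolate beyond the observed counts: prints 10.0
```

Nothing here requires licensed software or a data scientist; the gap the participants describe is less about tooling than about skills and organizational support for this kind of modeling.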
Open-ended interview questions were posed to the BZC analysts that continued to
explore the maturity of data science skills and the utilization of software tools to support data
analysis within the organization. Additional questions were posed that explored the use of
common data science software tools to gain insights into the accessibility and utilization of these
tools by BZC analysts. The participants’ responses are provided in Table 20.
Interview question posed:
What are the data science skills that are used by BZC analysts? How evolved are the data science
skills within the BZC? Are BZC analysts able to access and use mathematical languages and open
source tools such as R and Python®?
The complete list of initial interview questions is provided in Appendix A.
Table 20
Analysts’ Responses to Data Science Skills and Analysis Software Questions
Participant Comment
Analyst 1 We have some tools that are out there, for example a thing called LIMS-EV.
Analyst 2 We use Access and Excel. I’m also using Minitab and it’s only because that’s what
we have licenses for, for something that’s a real stats program and has a lot of
these built in functions. At this point we have R installed, we don’t have RStudio,
I’m not much of a programmer and everything that I’m looking at seems to be
nothing that’s GUI based.
Analyst 4 We have access to R, but not the most current version. Either the licenses aren’t
renewed in the case of Arena or there is something else better that comes along.
So we end up losing our skills.
Analyst 8 We have base R but we are not allowed to install any of the packages that people
create for it. Access to software is one of the biggest things.
Analyst 10 Now we’ve been trying to also get Tableau, because right now all we have is
excel and we don’t even have the Analysis ToolPak, so everything is hand done.
They took it out and I called and ask them to put it back in because we were
trying to run regression on something and they said no. It was no longer allowed,
it caused a security issue and we couldn’t have it, and that’s all we were told. So
that has probably been one of our biggest issues for the air force all together is IT
constraints and we did a huge study on IT constraints and how much that impacts
our day to day. IT is definitely our biggest issue and it’s not just the software but
IT alone. We can purchase a software license but by the time things go through
contracting the one that we are trying to purchase will be outdated and then we
have to go through and it’s so challenging to get it through and we’ve tried to go
through different avenues to get a quicker process but it’s been an ongoing issue.
I’m having to do a cost comparison for my own position, to contract it out to
MERC for them to do analysis because the air force will not provide me with the
software to do it myself. It becomes concerning, because then where am I going
to go, what am I going to do; I know, though, for a fact the marine corps and the
army are in dire need of analysts.
Analyst 11 I will use excel and do my analysis based on that and using my 41 years of
experience with maintenance and most of it has been in maintenance although I
have worked supply chain and program offices as well. The fact that there is a ton
more data available and other tools that they could use to do better analysis, they
are either not trained in it, they don’t know how to do it, their bosses don’t
request that or require it so we lose out on a lot of opportunity.
Access to software theme. Common themes regarding the skills required of data
scientists include advanced, and in many cases open source, statistical software such as R and
Python®. These applications lend themselves to another common characteristic of the perceived
data scientist: they will serve the organization best if they can explore open-ended questions
(Davenport & Dyché, 2013). Fundamentally, personnel in most organizations are able to analyze
only a small subset of their collected data, constrained by the analytics and algorithms of desktop
software solutions with modest capability (Shah et al., 2012). The analysts’ responses to the
interview questions were coded using the (TE) initial code aligned with the conceptual
framework. The analysis of the collected data suggests there are some sections of the BZC
leveraging advanced analytical software. However, the collected data suggest the BZC has
limited advanced analytical software available to most analysts. Information technology policies
appeared as a significant constraint preventing access to modern analytical software.
Several interview questions were posed to the participants to explore the role of data
science at the BZC, the data science skills that are used by the BZC, and the data science training
available to BZC analysts, in order to answer the research question on how evolved the data
science skills, processes, and software tools are at the BZC. Questions were posed to explore
how participants receive training and the maturity of this training as compared to the perceived
data science skill requirements. The participants’ responses are provided in Table 21.
Interview question posed:
How evolved are the data science skills within the BZC? Do analysts receive data science
training? How do analysts get trained within the BZC?
The complete list of initial interview questions is provided in Appendix A.
Table 21
Analysts’ Responses to Training Related Questions
Participant Comment
Analyst 2 There’s no formalized training, they’ve been having people go through the Army
ORSAMAC School, but that’s just an introduction. They have occasional classes
that, most are AFIT classes, which is what the Air Force calls it. Most of those
require you to be a resident to do that, they have occasional training classes that
we’ve seen with the local colleges or something else. A lot of things that we do
are self-study.
Analyst 3 There is no training to do any kind of analytics. A lot of it is just assume, because
we do a lot of promotion within and so we just assume they are capable of doing
what the job is asking for. No, No. Now don’t get me wrong I think if we wanted
that, if somebody, if I wanted to pursue that, I think my organization would be in
support of it and they would concur with that and approve it, but it’s just not
something we sought to do.
Analyst 4 We sort of feed on each other, it’s not a formalized training program.
Analyst 10 So there aren’t just a lot of training opportunities that are given to us, I’m not on
an APDP coded position anymore.
Analyst 11 The fact that there is a ton more data available and other tools that they could use
to do better analysis, they are either not trained in it, they don’t know how to do
it, their bosses don’t request that or require it so we lose out on a lot of
opportunity. The truthful answer is, we don’t get any.
Access to training theme. The responses were coded using the (P) initial code aligned
with the conceptual framework. The analysis of the collected data suggests the data science
skills of civilian analysts are immature at the BZC. The participants expressed that there are
very few analyst training opportunities and even fewer training opportunities related to the
perceived data science skills. Some of the participants explained that they are fully qualified and
meet their OPM job series requirements but acknowledged those occupational requirements do
not include data science skills training. Additionally, several analysts indicated they have been
able to complete modest levels of data science training through web-based instruction. One
analyst stationed at Wright-Patterson Air Force Base indicated that analysts stationed at this
location have access to the Air Force Institute of Technology (AFIT) and could acquire data
science-related training without tuition cost to the individual. The participants submitted that the
BZC has successfully sent analysts to other services to receive data science-related training and
that a significant amount of self-study takes place using common websites such as YouTube and
Google.
A thematic element in the scholarly literature that supported this research suggests the
DOD will have to compete for scarce data science talent (Géczy, 2015). BZC participants were
posed questions to further investigate the maturity of data science and the perceived shortfall and
competition for analytical talent. The participants’ responses are provided in Table 22.
Interview question posed:
How evolved are the data science skills within the BZC? Do you have to compete for data science
talent? Do you have enough data scientists?
The complete list of initial interview questions is provided in Appendix A.
Table 22
Analysts’ Responses to Data Scientists Scarcity Questions
Participant Comment
Analyst 1 It’s hard to find people that have all of those skills.
Analyst 5 Our interns are getting emails from headhunters looking for analysts and the
starting salaries are twice or better than what we are paying them, those double
salary packages are going to be very attractive as soon as their obligation periods
are over.
Analyst 6 We can’t hire people fast enough.
Analyst 7 The whole issues of getting people hired into the government is typically slow
and all those other things that compounds this whole problem.
Analyst 10 I don’t know about the other branches but the air force is so far behind and I fear
that it is making it difficult and I think it is deterring a lot of analysts away. It is
impossible for us to do the work, so they are like giving us busy work and we’re
not able to actually do what were trained to do, what went to school to do, and
what we want to do. I mean honestly I’ve really considered going out into
industry and see what’s out there, only because we are so constrained it makes it
almost impossible to do our jobs and to support how much we should be
supporting and its unfortunate we can’t get the air force to see that. It’s a huge
growing industry and we need a lot more people with the experience, I think that
is one of the problems that we’ve had here is finding people that meet the criteria
and have the right education and experience to fill the positions to help us with
these problems that we are having but I think training and trying to get out the
message that analysts and ops research analysts are a way to go forward to help
with our DOD.
Analyst 11 So I think if you try to bring them in from the outside with those skills, yes it’s
hard to keep them, I think that if we, I think we try to develop these particular
skills in the people that we currently have, maybe, I can think of people in my
different organizations that were really good at analyzing with the simple tools
that they had and if they were given some additional training and classes how
awesome they could be. I think we need more analysts.
Competing for talent theme. Géczy (2015) suggested there is a significant shortfall of
analytical professionals within the commercial sector and the DOD, and this shortfall is expected
to grow. Finding and retaining analysts who are capable of gleaning actionable information
from big data intelligence is a challenge confronting our military, and these experts are in short
supply (Edwards, 2014). Schneider, Lyle, and Murphy (2015) advocated that incentivizing
analysts to remain loyal to the DOD may be one of the most significant challenges the DOD will
face with big data analysis. Davenport and Dyché (2013) suggested the most likely avenue for
organizations to develop analytical talent will come from innovating new talent from existing
analytical groups. The analysts’ responses to the interview questions were coded using the (P)
initial code aligned with the conceptual framework. The results of the exploration suggest the
BZC has experienced some success in attracting analysts in some locations but is also
experiencing difficulties in attracting this talent. The participants expressed concern that their
people are being sought after by competing industries and that the process to bring new hires
into the organization is too slow.
BZC participants were posed questions to further investigate the maturity of data science,
the perceived skills required, and the roles of a data scientist. The researcher explained scholarly
definitions of data scientists and solicited responses from the analysts. The participants’
responses are provided in Table 23.
Interview question posed:
How evolved are the data science skills within the BZC? What skills are required of BZC analysts?
Are data scientists people with distinct skill requirements beyond traditional analysts?
The complete list of initial interview questions is provided in Appendix A.
Table 23
Analysts’ Responses to Data Scientists Skills and Roles
Participant Comment
Analyst 1 You have to be able to check the data that you are pulling and that comes from
experience as well if something doesn’t look right it’s probably not right so you
have to be able to do the math, is the program actually giving you the correct
numbers, sometimes you have to do that. The ideal candidate has that experience
in the supply chain and also has critical thinking and analysis skills.
Analyst 2 A lot of the guys they are right out of school, they don’t know how to apply a
theoretical model, they don’t realize that real world the data is not as clear cut. I
think mentoring would be something. We need a lot of people who are trained as
just analysts, I’m mean you can learn the rest of the stuff, you can find someone
to program or something, but you need someone who can go an solve problems
and track it to ground and get some actual viable movement so they can see that
there is a change.
Analyst 3 We’ve got the one person in our organization, he’s kinda like the most dangerous
guy, because not only does he understand the data, he understands how it all
works and he knows how to program and he has a degree in statistics. There’s
probably I think six others that fill, quote, unquote, analyst role and none of us
are true analysts, we just are people that kind of know the supply chain, know
how to pull down data, know how to make heads or tails of it, know how to spin
it, know how write a few internal, in the system, internal calculations or variables,
things like that, and so we pull down the data and we kind of come up with basic
results.
Analyst 5 The long term vision is they’ll extract the data and they will hand it over to an
operations research analyst that is specially trained in analysis techniques as
opposed to data science techniques. We need more data scientist folks to go out
and collect that data and feed it to us.
Analyst 11 I would think as LG we should definitely have like one per division and we are
supposed to be integrating everything for the entire BZC and yet we don’t have
some 1515s to help us with our analysis because what will happen I’ll spend, I
might spend 5 days analyzing data to come up with some results or whatever that
because I don’t have the skills that a 1515 has they might be able to do the same
thing in four or five hours that’s taking me four or five days and so we lose a lot
in that and could just even be that maybe we just have some small training
sessions, here is how you do pivot tables. Yes, and I also believe that we have a
lot of analysts that could easily be trained with those additional skill sets, and I
would argue that they would make the better one because they’ve got the
experience in that area, whatever that area. So I would say we do need some more
analysts and probably a data scientist at a low enough level that they can train
others to increase their skill sets would be really good.
Domains theme. A common theme in data science research suggests that for data
scientists to generate business value, they will need to work closely with domain experts in the
organization. Creating collaboration between the business domain experts and the data scientists
should be a foundational requirement before starting a data science project (Viaene, 2013).
Granville (2014) suggested data science is a combination of business engineering and business
domain expertise, data mining, statistics, and computer science, and advanced predictive
capabilities such as machine learning. Data science is bringing many processes, techniques, and
methodologies together with a business vision to drive actionable insights (Granville, 2014). The
responses to the interview questions were coded using the (P) initial code aligned with the
conceptual framework. The participants offered their perceptions regarding the data science role
within DOD organizations and the importance of data science and business domain connections.
Some participants proposed that data scientists should be proficient in the business domain,
while other participants suggested data scientists could serve the business best by conducting
the advanced analysis and then providing the results to a business domain analyst.
Focus Group Interview Analysis and Results
The transcribed focus group interview was loaded into NVivo-11® and was coded to the
initial parent nodes aligned with the conceptual framework. After the initial coding and analysis
of the transcribed focus group interview, a word frequency query was used in NVivo-11® to
generate Figure 10. The word data was removed from all word frequency queries because it was
overwhelmingly used.
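The word frequency query described above can be approximated outside NVivo-11® with a few lines of standard Python. The sketch below is an illustration only; the sample text and stop-word handling are invented for the example and are not the procedure used in the study.

```python
from collections import Counter
import re

def word_frequency(text, stopwords=frozenset({"data"}), top_n=15):
    """Tokenize text, drop stop words and very short words, and
    return the top_n most frequent words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords and len(w) > 2)
    return counts.most_common(top_n)

# Hypothetical snippet standing in for a focus group transcript.
sample = "Metrics drive planning; analysts use metrics and tools on data."
print(word_frequency(sample, top_n=3))
# → [('metrics', 2), ('drive', 1), ('planning', 1)]
```

As in the study's queries, the word data is excluded so that it does not dominate the counts.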
Figure 10. Initial management focus group interview word frequency diagram.
The initial analysis of the focus group interview suggests early themes of metrics,
analysts’ skills, tools, information systems, and performing as seen in Figure 10. The word
frequency query was then modified to display only the fifteen words most used by the managers
to identify early themes. This additional query still demonstrated the early themes of metrics,
analysts’ skills, tools, information systems, and performing, but additional early themes of
predictive, analysts, processes, computers, and business emerged as seen in Figure 11.
Figure 11. Refined management focus group interview word frequency diagram.
The same initial open-ended interview questions that were posed to the analysts were
posed to the focus group participants to further explore the research questions on how the BZC
gleans actionable information from big data sets and how mature the data science skills,
processes, and software tools used by BZC analysts are. The interview questions were designed to
gain a deeper understanding of how BZC analysts conduct analysis, the participants’ perceptions
of big data, challenges associated with conducting data analysis, the software tools used to
conduct data analysis, training options for analysts, and their perceptions of data science. All of
the themes that were generated from interviews with the BZC analysts were also supported by
the focus group participants with the exception of the management theme. The collected data
from the management focus group did not present a theme of management as a constraining
factor to big data analysis.
Research Question #1: How does the Bravo Zulu Center glean actionable information
from big data sets?
The management focus group participants were asked initial open-ended questions
investigating if the BZC is experiencing the big data phenomenon, the perceived benefits and
liabilities of big data, and their conceptions about the term big data. The responses provided
insights about the concept of big data, data growth and the ability of the BZC to analyze large
data sets. The participants’ responses are provided in Table 24.
Interview questions posed regarding big data:
How do you define big data? What increases of digital data (big data) have you witnessed and
how has it impacted the business of the BZC?
The complete list of initial interview questions is provided in Appendix A.
Table 24
Managers’ Responses to Questions about Big Data
Focus Group Comment
Participant The term big data by itself I think has a lot of different meanings depending on
who you talk to, if you connect it with something it takes on a new meaning like
big data analytics, but big data my understanding of it, it’s these large data sets of
structured data or unstructured data but again back to the volume of it, it’s so big
maybe traditional tools that you have don’t allow you to take advantage of all that
information that is there, available to you.
Participant So while we recognize that we’ve had big data it has always been from a different
aperture or different perspective and which we have applied the analytics. I think
that we are maturing our conceptualization of big data and with at least the
logistic space we are recognizing that is an enterprise asset and we are moving the
kind of corporation in that direction at least from a logistics perspective.
Participant There is a realm of methods used for the predictive we are sitting on a significant
volume of data that I would call big data in the sense it is from different sources,
different types, structured, unstructured, etc., that we could use to do relational
analysis and form the basis for predictive and potentially prescriptive.
Participant So we actually collect that data, I would love to say it is in big data warehouses
but that implies a much more elegant solution than I think we currently have in the
BZC. We are looking at upgrading many of those systems but to date many of them
are old systems written in COBOL, that sort of language, but they collect the
data, they are standard ways to analyze it, standard ways it is presented to
material managers and shop planners.
Participant So I think big data, to be blunt and honest, is kind of a buzzword right now, that we
have been doing some of that for years, we just haven’t given it this fancy title,
but we have been predicting what we are going to need years in advance for as
long as I have been in the air force.
Big data theme. As expected, the interviews with the BZC managers provided insights
about data growth and the ability of the BZC to collect and analyze large data sets. The
open-ended interview questions were designed to explore if the BZC is experiencing a big data
phenomenon, the perceived benefits and liabilities of big data, and their conceptions about big
data. By coding and analyzing the transcripts from the focus group interview through the (MI)
initial code regarding big data, thematic elements common in the literature review were revealed.
The BZC is a complex organization with many disparate data systems generating large data sets.
The managers recognized benefits and challenges with analyzing their big data sets and one
participant described big data as a buzzword.
The managers that participated in the focus group were posed questions that further
explored how the BZC gleans actionable information from big data sets. The participants were
asked to explain how data is used within the BZC to meet mission requirements. The participants
were also posed an open-ended question that explored any dependencies on data. The
participants’ responses are provided in Table 25.
Interview question posed regarding big data analysis challenges:
How is data used in your organization to meet mission requirements? What are some areas in
your organization that are dependent on data?
The complete list of initial interview questions is provided in Appendix A.
Table 25
Managers’ Responses to Data Usage Questions
Focus Group Comment
Participant His division is really the keeper in the BZC for performance metrics and how we
apply standard metrics. We use those metrics to assess performance and then we
use them in planning as well.
Participant I’ll say corporate business processes we have these metrics as well, so they are
throughout the complex.
Participant Let me just add, so we also use the warfighter metrics too, we use operational
performance of how our systems are performing. We have a whole series of
readiness metrics just like you guys use in the navy, which are outcome metrics
but those drive our planning processes too, so it’s operational metrics, it’s our
supply chain performance metrics, it’s our operations and production
management metrics, there is a whole series, training metrics, you name it, we
use that data to measure our performance and understand where problems are,
that’s what metrics do, they tell you story and help you reveal where you have
gaps and shortfalls that you need to address.
Participant We are talking requirement type metrics, our systems actually track it through the
base supply system, we track how often it is ordered, we compare that to the
flying hour program and then we determine how often that item is used per flying
hour and then how many flying hours we are projected to fly.
Participant One of things , we have a whole host of data solutions to kind of piggy back on
what Mr… is saying, we have one that is kind of business intelligence and an
enterprise data warehouse the pulls raw data and then applies business rules the
cleanse that data and do a presentation layer so that people can have standard
performance metrics in near real time or as the data projects but in the case of
Mr…. operation you get large data sets that are pulled from legacy systems and
then analyzed to present the metrics on performance.
Participant We are about to get started with looking at some commercial platforms that are
available, for example looking at some of our outcome metrics and even some of
the, all of the outcome metrics are lagging, some are less lagging than others and
looking for patterns within that to enable us to have some of the leading health
indicator constructs, that’s going to be a couple of six month projects that are
going to kick off in the next month.
Participant That can be something fundamental in understanding data, there is a tendency to
reach out for a single metric and when in fact it’s typically a sequence of events.
Metrics theme. The responses to the interview questions posed to the managers
regarding how data is used within the BZC were coded using the (MI) initial code aligned with
the conceptual framework. The managers that participated in the focus group interview
expressed the importance of gleaning actionable information from large data sets. The managers
provided several examples of how BZC managers use data and metrics throughout the
organization to make crucial business decisions. The managers expressed that metrics are a key
output from the data analysts within the BZC and an important aspect of managing the business
of the BZC.
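One participant described forecasting parts demand from a usage rate per flying hour. The arithmetic behind that kind of requirement metric can be sketched as follows; the figures are hypothetical illustrations and do not come from BZC data.

```python
def projected_demand(orders, hours_flown, projected_hours):
    """Scale the historical usage rate (orders per flying hour)
    by the projected flying-hour program."""
    rate = orders / hours_flown      # items used per flying hour
    return rate * projected_hours    # projected items needed

# Hypothetical figures: 100 orders over 400 flying hours,
# with 1,000 flying hours projected for the next period.
print(projected_demand(100, 400, 1_000))  # → 250.0
```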
The management participants were asked initial open-ended questions that continued to
explore how the BZC gleans actionable information from big data sets and associated challenges.
The participants were asked to explain the challenges in gleaning actionable information from
big data sets. The participants’ responses are provided in Table 26.
Interview question posed regarding big data analysis challenges:
What are some of the significant challenges associated with conducting data analysis in your
organization?
The complete list of initial interview questions is provided in Appendix A.
Table 26
Managers’ Responses to Questions Regarding Data Analysis Challenges
Focus Group Comment
Participant One of the major challenges Roy will be perhaps as we move into big data, right
now we have had a lot of segmented data that we mentioned before and so how
do we integrate that and how do we keep the integrity of that data so that when
we start to do the big data analytics we’re doing it from a clear and concise
enterprise perspective that has data integrity from inception all the way through
the analysis phase. I think that is one of the big challenges that we are going to
have, because we have such segmented data, because we have so many legacy
systems that produce that data.
Participant We also have a challenge just in data creation, a lot of our systems are relying on
that airman typically a mechanic out in the field who has to put in what he did to
fix the part so we can create our models. The integrity piece is a continuous
challenge and will be no matter what analytical tool you apply.
Participant Access is another one ran into has a problem, getting access to the data, you know
it goes back to what Mr. xxx said about who owns the data, people allowing you
to see their data, you could have better decision support if you have access to
certain data, but getting that access is often difficult from the person who controls
it so that is a challenge.
Access to quality data theme. Watson and Marjanovic (2013) suggested a challenge
with capitalizing on big data includes accessing data through appropriate platforms and providing
data governance. By coding and analyzing the transcripts from the focus group interview through
the (MI) initial code and asking open-ended questions regarding how the BZC gleans actionable
information from big data sets, access to quality data emerged as a theme. The management
participants shared common concerns expressed by the analyst participants regarding access to
quality data as a theme that is currently constraining big data analytics at the BZC.
To further explore the challenges associated with conducting big data analysis within
the BZC, the researcher asked the focus group participants to further expound on constraints to
big data analysis. The management focus group participants provided additional responses as seen in
Table 27.
Interview questions posed:
What are some of the significant challenges associated with conducting data analysis in your
organization? The complete list of initial interview questions is provided in Appendix A.
Table 27
Managers’ Additional Responses to Data Analysis Challenges
Focus Group Comment
Participant I would love to say it is in big data warehouses but that implies a much more
elegant solution that I think we currently have in the BZC. We are looking
upgrading many of those systems but to date many of them are old systems
written in COBOL, that sort of language, but they collect the data, we don’t have
big enterprise data warehouse for logistics, I think that we are moving into that
space as some of the previous comments stated for the most part it is
de-centralized and it’s kind of ad hoc based on the mission needs of the organization
that is applying those systems.
Participant We have had a lot of segmented data that we mentioned before and so how do we
integrate that and how to we keep the integrity of that data so that we when we
start to do the big data analytics we’re doing it from a clear and concise enterprise
perspective that has data integrity from inception all the way through the analysis
phase.
Participant I think that is one of the big challenges that we are going to have, because we
have such segmented data, because we have so many legacy systems that produce
that data.
Participant If we can truly take advantage of the capacity and processing that potentially exist
in a cloud environment I think that would be huge and it might allow us to
actually use some of the tools that maybe are better fit in that environment then
the single site license for an individual computer, we have an air force license that
allows us to truly do analysis in the cloud.
Participant We don’t have a big enterprise data warehouse for logistics. For the most part it is
de-centralized and it’s kind of ad hoc based on the mission needs of the
organization that is applying those systems.
Participant Warehousing data and we keep hearing like migration to a cloud environment and
so in my little world here from our perspective if we ever get to a true cloud
environment where all the data is available to everyone.
Infrastructure: Legacy and disparate systems theme. Edward (2014) suggested the
essence of analyzing big data within the DOD requires the aggregation of many data sources
from hundreds of organizations, which in turn requires defining the legal, policy, oversight, and
compliance standards for data sharing to make it happen. The focus group responses were coded
using the (CD) initial code aligned with the conceptual framework. The participants of the management focus
initial code aligned with the conceptual framework. The participants of the management focus
group expressed opinions very similar to those of the analysts. The BZC has sections of their
business with modern computer infrastructure and analysis capabilities, but their business is also
constrained in the ability to conduct enterprise big data analysis due to the limited availability of
information systems, infrastructure, and many disparate systems.
To further explore the research question of how the BZC gleans actionable information
from big data sets, the management participants were posed questions further exploring how
data is used within the BZC to meet mission requirements and how BZC center employees
conduct data analysis. Additionally, sub-questions were posed to the participants to determine how
evolved the BZC is in their ability to build predictive and prescriptive metrics and models. The
participants’ responses are provided in Table 28.
Interview questions posed:
How is data used in your organization to meet mission requirements? How do BZC analysts
glean actionable information from big data sets?
The complete list of initial interview questions is provided in Appendix A.
Table 28
Managers’ Responses to Data Usage and Data Analysis Questions
Focus Group Comment
Participant I‘ll start with the stubby pencil because we still have some of the manual
calculations where we are pulling data from requirements from a simple data call
all the way into systems that we are trying to implement tools that are available
now that can do some of what you are getting at, the big data analytics to actually
automatically set some business intelligence rules up so that we take the human
out of the loop. We really need AI to help us probe that in a faster manner to find
those patterns so that we can do more exception based management, train the
software to really speed up our decision process. I’ve seen that continuum as part
of the data science maturity getting, like you said, from reactive to predictive to
prescriptive effectivity, I think we are probably pretty good at the reactive piece.
Participant There is a realm of methods used for the predictive we are sitting on a significant
volume of data that I would call big data in the sense it is from different sources,
different types, structured, unstructured, etc., that we could use to do relational
analysis and form the basis for predictive and potentially prescriptive.
Participant Vendors are out there who are putting together some views for us that will allow
us to be, write algorithms that will help us to be more predictive but we are really
just tipping our toe in that space right now, as you know if you have been
researching there is a variety of companies who have different levels of maturity
and abilities to do these and make these relationships to tell you and actually
allow you to be predictive and prescriptive.
Participant The Air Force in the past year has embraced the strategy of predictive
maintenance even though we have had policy for a number of years where we are
taking our data from our authoritative maintenance sources, we are using the data,
performance data that we are pulling off aircraft or other weapon systems and we
are using both sets to help us understand performance and manage the health of
the systems so that we can get more predictive and understanding failure and be
able to have parts available ahead of time.
Participant I wouldn’t say they are particularly predictive in really takes humans
understanding and interpreting the data and trying to make decisions, we haven’t
gotten into the machine learning stages yet, where those patterns build and then
we can program certain views and certain I’ll call them vignettes that allow us to
try and get ahead of trends that we believe are going to happen.
Participant I think to some extent we sale ourselves short as an air force, big data they always
tell me they can predict something, I would tell you or D200 system has looked at
the past history of our usage and we predict two years out what they air force is
going to need and prepare our depot shops to repair that, whether it’s a great
prediction or not it’s probably about as good as any you will find in industry
Data analysis processes theme. Much of the expectation involved in big data analysis is
the continued desire by company and DOD leaders to move from reactionary metrics based on
historical data to predictive and prescriptive metrics that may be possible with big data analysis.
Research on big data and data science suggests the ability to locate hidden facts, indicators, and
relationships immersed in big data sets not yet explored (Chen et al., 2012). Interview questions
were posed to the management participants regarding what processes and methods are used by
BZC analysts to glean actionable information from big data sets. The questions explored how
mature and effective the analytical processes are in their organization and the maturity of their
predictive analytical capabilities. The responses were coded and analyzed through the (MM)
initial code aligned with the conceptual framework. The analysis of the collected data suggests
the BZC is mostly building and analyzing reactive metrics on historical data with small pockets
of predictive analytical capability. Additionally, many of the data analysis processes are manual
processes reliant upon pulling data from many disparate data warehouses and analyzing the data
in basic analysis software.
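The distinction drawn above between reactive metrics on historical data and predictive metrics can be illustrated with a minimal sketch: a reactive metric reports the historical average, while a simple predictive metric fits a trend and extrapolates one period ahead. The data and the least-squares approach here are hypothetical illustrations, not the BZC's methods.

```python
# Hypothetical failures per quarter.
failures = [10, 12, 11, 14, 15]
n = len(failures)

# Reactive metric: the historical average.
reactive = sum(failures) / n

# Predictive metric: least-squares trend, extrapolated one quarter ahead.
xs = range(n)
x_bar = sum(xs) / n
slope = (sum((x - x_bar) * (y - reactive) for x, y in zip(xs, failures))
         / sum((x - x_bar) ** 2 for x in xs))
predictive = reactive + slope * (n - x_bar)  # line passes through (x_bar, mean)

print(round(reactive, 1), round(predictive, 1))  # → 12.4 16.0
```

The reactive figure only describes the past; the trend-based figure anticipates the next quarter, which is the shift from reactionary to predictive metrics the participants described.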
Further exploring how the BZC gleans actionable information from big data sets and the
challenges associated with conducting big data analysis, the management participants provided
input regarding organizational structure and the culture within the BZC. The participants’
responses are provided in Table 29.
Interview questions posed:
What are some of the significant challenges associated with conducting data analysis in your
organizations? How are analysts employed and aligned in your organization?
The complete list of initial interview questions are provided in Appendix A.
Table 29
Managers’ Responses to Analysis Challenges
Focus Group Comment
Participant If we had data scientists and they could do these big Uber computations on big
data and we had kind of the infrastructure I guess the fundamental question is
where would they reside to give the most value to the enterprise whatever that
enterprise is defined as, and what is the hierarchal structure, the relationships with
all the corresponding analysis that goes down all the way to, kind of the squadron
level, so I think fundamentally we have to organize ourselves to effectively utilize
data not just have the capacity to analyze and collect data.
Organizational structure and culture theme. Similar to the responses provided by the
analysts who participated in the research, a theme of BZC organization and culture was apparent
within the focus group responses. Gabel and Tokarski (2014) suggested that for organizations to
harvest actionable information from big data sets requires deliberate alteration of many facets
of organization design and management of human resources. Harris and Mehrotra (2014)
advocated that senior management will need to learn how best to employ and manage data scientists.
Research Question #2: How mature are the data science analytical skills, processes, and
software tools used by Bravo Zulu Center analysts?
The managers that participated were posed open-ended questions investigating the
maturity level of analytical skills, processes, and software that are used within the BZC. The
initial open-ended questions were designed by the researcher to explore the skills required to be
an effective analyst within the BZC as perceived by the participants. The initial open-ended
questions investigated whether there are perceived data science skills being used by BZC analysts
the maturity of those skills. The participants’ responses are provided in Table 30.
Interview question posed:
What are some knowledge, skills, and abilities needed to be an effective data scientist? What are
the data science skills that are used by BZC analysts? How evolved are the data science skills
within the BZC?
The complete list of initial interview questions are provided in Appendix A.
Table 30
Managers’ Responses to Data Science Skills Questions
Focus Group Comment
Participant This is Mr… I guess if you use the definition that you used where the person is
skilled in all those areas as well as knowledgeable in the data they are handling
that’s a hard thing to groom or to grow if you are talking the technical aspect of
it, I think you are almost back to the computer scientist, the 1550 type folks, so I
don’t know if you use the definition that you put to us earlier, that would be a
hard one, even if you had it I don’t know if you would even find qualified
candidates to fill it. That broad of a skill set that they need.
Participant Our data scientist if you will, we found him from the software group here, but I
agree with your definition the data scientist also has to understand the data and
we are probably in the same boat as every other organization where we rely on
SMEs but we have found some online tools like Pluralsight and DataCamp and
where it is almost like a YouTube type training so we can get real time training or
honestly people google things, I want to write script to do this and we google it
and we find an example of code like that and then we incorporate that code so a
lot of ours is truly learning on the fly or as a need presents itself figuring out who
else has done it and just kind of borrow from them, out.
Participant I have a group of analysts, operations research analysts that work for me, they are
very skilled in the model and very skilled in the math and to be honest they are
very well book learned but they have no idea what the data is presenting to
them unless we have a senior logistician or someone who has been out on a flight
line or in a depot shop tell them what it means, they are good people and they
will learn it over time, but my particular shop is quite young they have all of
those skills but they don’t have any background on how to interpret the results.
Data science skills theme. Data scientists use traditional science, math, and statistics
coupled with modern software and analysis techniques to turn raw data into actionable
information. Data science is a combination of business engineering and business domain
expertise, data mining, statistics, and computer science along with advanced predictive
capabilities such as machine learning (Granville, 2014). The focus group participants
acknowledged the growing data science occupation in the commercial sector and the importance
of maturing the data science skills within the BZC. The participants agreed with the scholarly
definitions of a data scientist and that data science is a unique role beyond that of a
traditional BZC analyst. One focus group participant stressed that data science includes business
domain understanding.
Open-ended interview questions were posed to the BZC managers that continued to
explore the maturity of data science skills and the utilization of software tools to support data
analysis within the organization. Additional questions were posed that explored the use of
common data science software tools to gain insights into the accessibility and utilization of these
tools by BZC analysts. The participants’ responses are provided in Table 31.
Interview question posed:
What are the data science skills that are used by BZC analysts? How evolved are the data science
skills within the BZC? Are BZC analysts able to access and use mathematical languages and open
source tools such as R and Python®?
The complete list of initial interview questions is provided in Appendix A.
Table 31
Managers’ Responses to Data Science Skills and Analysis Software Questions
Focus Group Comment
Participant There is a spectrum here, there is dashboards, there’s tools that we have that have
an automated presentation layer that I can go and pull up certain metrics and it
will tell me status, particular readiness status, parts status, we are trying to get
into the space. We are finding a lot of those tools that they are being
taught on are not usable within the DOD environment because we can’t get them
inside the fence.
Participant So we are using older versions of the tools or we are not even able to access those
tools so we are still doing things, these students basically have to go learn how to
use Access, because Access is not being taught anymore in school, we are past
that point and Access has such a limited space constraint to it that we have to do
iterative type analysis to actually compile the data and make it usable.
Participant Those are the things that I was alluding to where we have R but it is five versions
removed or Python we are still trying to crack the code on how to get it and the
libraries that are needed to actually make it usable. How do I get some of this
software loaded and behind the firewall without taking 24 months?
Access to software theme. Common themes regarding the skills required of data
scientists include advanced and in many cases, open source statistical software such as R and
Python®. These applications lend themselves to another common characteristic of the perceived
data scientist, and that is they will serve the organization best if they can explore open-ended
questions (Davenport & Dyché, 2013). The responses provided by the management focus group
regarding data science skills and analysis software were coded using the (TE) initial code aligned
with the conceptual framework. The analysis of the collected data suggests there are some sections
of the BZC leveraging advanced analytical software. However, the collected data suggest most
BZC analysts have limited access to advanced analytical software. Information
technology policies appeared as a significant constraint preventing access to modern analytical
software.
Several interview questions were posed to the managers to explore the role of data
science at the BZC, the data science skills that are used by the BZC, and the data science training
available to BZC analysts, to answer the research question on how evolved the data science skills,
processes, and software tools are at the BZC. Questions were posed to explore how participants
receive training and the maturity of this training as compared to the perceived data science skill
requirements. The participants’ responses are provided in Table 32.
Interview question posed:
How evolved are the data science skills with the BZC? Do analysts receive data science
training? How do analysts get trained with the BZC?
The complete list of initial interview questions is provided in Appendix A.
Table 32
Managers’ Responses to Training Related Questions
Focus Group Comment
Participant Our data scientist if you will we found him from the software group here, but I
agree with your definition the data scientist also has to understand the data and
we are probably in the same boat as every other organization where we rely on
SMEs but we have found some online tools like Pluralsight and DataCamp and
where it is almost like a youtube type training so we can get real time training or
honestly people google things, I want to write script to do this and we google it
and we find an example of code like that and then we incorporate that code so a
lot of ours is truly learning on the fly or as a need presents itself figuring out who
else has done it and just kind of borrow from them.
Participant So this is Mr….again and Mr… you can correct me 100% but so some of the
workforce series employees, I mean a 1515 I believe is the series for an analyst
but again if I was to want a 346 who is a logistician and I need them to
understand because they are doing supply chain work what the data is telling
them I don’t as part of their development we don’t deliberately train them that
way, again there are courses out there that we, if you are dealing with that in your
day to day job that you can take, we are also looking at DAU, but this is the
challenge for career field development that we need to start moving towards
changing the competencies that we expect our SMEs to have so that it would
include these skills.
Access to training theme. The management participants supported the theme expressed
by the analysts: the BZC has limited access to data science-related training. There are very few
formal analyst training opportunities and even fewer training opportunities related to the
perceived data science skills. However, the BZC has pursued making some online training
venues available to analysts.
A thematic element in the scholarly literature that supported this research suggests the
DOD will have to compete for scarce data science talent (Géczy, 2015). BZC managers were
posed questions to further investigate the maturity of data science and the perceived shortfall and
competition for analytical talent. The participants’ responses are provided in Table 33.
Interview question posed:
How evolved are the data science skills with the BZC? Do you have to compete for data science
talent? Do you have enough data scientists?
The complete list of initial interview questions is provided in Appendix A.
Table 33
Managers’ Responses to Data Scientists Scarcity Questions
Focus Group Comment
Participant It is location specific in industry at right, I know the challenges we had when we
were trying to stand up that office it was the oil industry. I am never validated this
with any research but we could generally look at the price of a barrel of oil, if it
steadily stayed below $55 a barrel then the length of the cert got better but that is
purely my observation I didn’t write everything down, when was oil was high the
certs and the qualified applicants that I would receive to evaluate I would say was
slim pickings, over.
Participant I would agree with that in fact it’s probably harder I even have folks that have
already figured out that they can make more money even within the Department
of Defense if they go to either coast, so getting analysts to move here to BZC is a
challenge in itself, my fear is that we are going to groom these folks here and then
they are going to see they can go and become a GS14 analysts and make $20,000
dollars more, now granted there is a cost of living side to that as well but just
from a true numbers perspective the higher salaries are on the coasts they are not
out here in the middle of the country, or they are competing with the oil industry
who is paying a higher salary for those types of people.
Participant We just hired two ops research analysts and we had to go outside to do it and use
to DHA because it is a hard to fill occupation but we were able to find them here
maybe because we don’t have the oil industry and people don’t want to live on
the east coast but we were able to do it so I don’t think the pinch is quite so hard
here if you can find skill sets you can hire them but it is finding the skill sets that
is more of the problem. I would say one reason that we try to grab the interns and
bring them on and our EN office has done a really good job of that, let the folks
come in and get a flavor of it, we have several, I will say at least one that I know
that I brought in that helps with retention, they get experience out of it they get a
taste and it helps. The challenge is using them so that they have meaningful work,
there is a tendency at times for folks to say well that’s an intern let me give them
the grunt work, but if I really want that skill set it is giving them value added
work and the hard stuff so one they can know they are contributing and two it
gives them a taste of what is to come, over.
Competing for talent theme. Géczy (2015) suggested there is a significant shortfall of
analytical professionals within the commercial sector and the DOD, and this shortfall is expected
to grow. Finding and retaining analysts who are capable of gleaning actionable information
from big data intelligence is a challenge confronting our military, and these experts are in short
supply (Edwards, 2014). Several interview questions were posed to the focus group participants
to gain their perspectives on the anticipated shortfall of analytical talent, and the responses were
coded using the (P) initial code aligned with the conceptual framework. The results of the
exploration suggest the BZC has experienced some success in attracting analysts in some
locations but is experiencing difficulty in others. The participants expressed
concern that their people are being sought after by competing industries and that the process to
bring new hires into the organization is too slow.
BZC managers were posed questions to further investigate the maturity of data science,
the perceived skills required, and the roles of a data scientist. The researcher explained scholarly
definitions of data scientists and solicited responses from the managers. The participants’
responses are provided in Table 34.
Interview question posed:
How evolved are the data science skills with the BZC? What skills are required of BZC analysts?
Are data scientists people with distinct skill requirements beyond traditional analysts?
The complete list of initial interview questions is provided in Appendix A.
Table 34
Managers’ Responses to Data Scientists Skills and Roles Questions
Focus Group Comment
Participant Gone are those days where we had air force level institutions that kind of fostered
domain centric analysis capabilities and it seems now to be pushed down to the
organizational level that needs and consumes that data and makes the business
decisions for their particular business process. It is interesting to looking at the air
force in terms of, ya if we had data scientists and they could do these big Uber
computations on big data and we had kind of the infrastructure I guess the
fundamental question is where would they reside to give the most value to the
enterprise whatever that enterprise is defined as, and what is the hierarchal
structure, the relationships with all the corresponding analysis that goes down all
the way to, kind of the squadron level, so I think fundamentally we have to
organize ourselves to effectively utilize data not just have the capacity to analyze
and collect data.
Participant Have a SME who is able to do what I think eventually we want to get is where the
SME has those competencies that will make them good analysts but that is really
the future state so how do we bridge that, perhaps with data scientist and
computer scientist who are working with our SMEs using the tools that are
available.
Domains theme. A common theme in data science research suggests that for data
scientists to generate business value, they will need to work closely with domain experts in the
organization (Granville, 2014). Creating collaboration between the business domain experts and
the data scientists should be a foundational requirement before starting a data science project
(Viaene, 2013). The management participants offered their perceptions regarding the data
science role within DOD organizations and the importance of connections between data science
and the business domain. The management focus group submitted opinions similar to the
analysts’ regarding the distinction between data scientist and business domain knowledge,
supporting the domains theme. All of the responses were coded using the (P) initial code aligned
with the conceptual framework.
Bravo Zulu Center Document Analysis and Results
The BZC strategic planning document that was collected by the researcher was imported
into NVivo-11® for analysis. The content of the BZC’s strategic plan attribute #1, regarding data
accessibility, was coded in alignment with the initial coding structure and conceptual framework. A
word frequency query was generated to gain a general sense of the information provided in the
BZC’s strategic plan as seen in Figure 12.
Figure 12. BZC strategic document word frequency diagram.
The analysis of the BZC’s strategic plan suggests the BZC has placed emphasis on digital, time,
agility, integration, tools, and analysis. The coding and further analysis of the BZC’s strategic
document revealed there is a BZC strategic objective to enable complete data integration and
data availability across the BZC. Within the data availability attribute of this strategic plan, there
are specific goals to make data 100% accessible and accurate by providing all required data at
the point of entry via a single entry point and by dynamically linking and integrating systems.
The strategic plan also describes the employment of the necessary tools, models, and predictive
analysis capabilities to turn raw data into useful information. Triangulation analysis of the data
collected from analyst interviews, the focus group interview, and this BZC strategic document
suggests the organization is suffering from significant data accessibility and data quality issues.
However, the review and analysis of the BZC’s strategic document suggests the organization is
aware of these shortfalls and is actively engaged in mitigating them.
BZC Job Announcements Document Analysis and Results
Harris and Mehrotra (2014) proclaimed there are distinguishable differences between
data scientists and traditional quantitative analysts, with many implications for how to define the
roles of data scientists, how to attract and train these experts, and how to get the most value from
this emerging discipline. To explore the maturity of data science skills at the BZC, several recent
job announcements were collected and analyzed. The BZC personnel center provided recent
supply analyst, program management analyst, operations research analyst, and computer
scientist job announcements. These job announcements were imported into NVivo-11® and the
skills and duties required of these positions were coded to the (P) initial code and aligned with
the conceptual framework. A word frequency query was executed combining the data from all
four job announcements. Words that are generic to all job descriptions were omitted from the
query. The result indicates the presence of data science skills such as mathematics, statistics, and
computer science, as seen in Figure 13.
Figure 13. BZC job announcements word frequency diagram.
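The word frequency procedure described above, counting terms across the combined announcements after omitting generic words, can be approximated outside NVivo-11®. The following is a minimal sketch only: the excerpt strings and the generic-word list are hypothetical stand-ins, not the study’s actual documents or omission list.

```python
from collections import Counter
import re

# Illustrative excerpts standing in for the four BZC job announcements
# (hypothetical text; the study's actual documents are not reproduced here).
announcements = [
    "Applies mathematics and statistics to analyze supply chain data.",
    "Plans and coordinates program management activities and evaluates data.",
    "Designs mathematical, statistical, and econometric methods for analysis.",
    "Applies computer science theories, mathematics, and statistics to software.",
]

# Words generic to all job descriptions, omitted as in the study's query.
generic = {"and", "to", "for", "the", "applies", "data", "activities"}

counts = Counter()
for text in announcements:
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in generic:
            counts[word] += 1

# The most frequent remaining terms approximate a word frequency diagram.
for word, n in counts.most_common(5):
    print(word, n)
```

With these illustrative excerpts, terms such as mathematics and statistics surface at the top, mirroring the skills the study observed in Figure 13.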
To further explore the maturity of data science skills of newly hired BZC personnel, the
skills and duties sections of the recent job announcements were compared to scholarly views of
data science skills. Comparisons of the data science skills proposed by Harris, Murphy, and
Vasinman (2013) along with the specific data science software suggested by Harris and Mehrotra
(2014) to the skills required of BZC analysts and computer scientists described in the recent BZC
job announcements are provided in Tables 35 through 38.
The comparison of the skills required in the supply analyst job announcement to
scholarly views of data science skills is provided in Table 35. According to the recent supply
analyst job announcement, BZC supply analysts require the basic abilities to analyze statistical
data and apply arithmetical computations with graphical representation. There are no specific
analysis software, computer science, or programming requirements. A supply analyst hired into
the BZC requires specific supply chain domain knowledge but few specific data science-related
skills.
Table 35
Data Scientist and BZC Supply Analyst Required Skills Comparison

Types of Data (CD). Data scientist: big data, all types, including unstructured, numeric, and non-numeric data (Harris, Murphy & Vasinman, 2013). Supply analyst: current statistical data.

Preferred Tools (TE). Data scientist: mathematical languages (such as R and Python®), machine learning, natural language processing, and open-source tools (Harris & Mehrotra, 2014). Supply analyst: no specific software or tools.

Nature of Work (MI). Data scientist: explore, discover, investigate, and visualize (Harris, Murphy & Vasinman, 2013). Supply analyst: analyze, develop, and evaluate using statistical data.

Methods (MM). Data scientist: optimization/visualization; graphical models; classical, Bayesian, temporal, and spatial statistics; Monte Carlo simulation; data manipulation (Harris, Murphy & Vasinman, 2013). Supply analyst: arithmetical computations on meaningful statistical data for graphic representation.

Computer Science Skills (P). Data scientist: programming, system administration, back-end programming, and front-end programming (Harris & Mehrotra, 2014). Supply analyst: no specific computer science requirements.

Typical Degree. Data scientist: computer science, data science, symbolic systems, or cognitive science. Supply analyst: degree not required.
The program management analyst job announcement collected and analyzed in support of
this research requires candidate employees to have a specific understanding of command
operations, products, services, and knowledge of the goals of the command. This occupation at
the BZC requires knowledge and skills in applying analytical and evaluation techniques to
identify and apply analytical processes to resolve problems. The program management occupation
at the BZC serves as a broad announcement with few specific analytical requirements. Two
analysts who participated in this research explained that BZC job descriptions are not sufficiently
detailed to support the hiring of candidates with data science skills, and this is apparent in the
program management analyst job announcement collected and analyzed in support of this
research.
The comparison of the skills required in the program management analyst job
announcement to scholarly views of data science skills is provided in Table 36. According to
the recent program management analyst job announcement, BZC program management analysts
require basic skills in program management, planning, and coordinating. There are no specific
data analysis, mathematics, statistics, computer science, or programming requirements. There are
no specific analysis software requirements, and a college degree is not required. According to the
list of analysts currently assigned to the BZC that was provided in support of this research,
program management analysts make up 54% of the total analyst workforce, indicating that the
majority of BZC analysts have no specific data science skill requirements.
Table 36
Data Scientist and BZC Program Management Analyst Required Skills Comparison

Types of Data (CD). Data scientist: big data, all types, including unstructured, numeric, and non-numeric data (Harris, Murphy & Vasinman, 2013). Program management analyst: no specific data analysis requirements.

Preferred Tools (TE). Data scientist: mathematical languages (such as R and Python®), machine learning, natural language processing, and open-source tools (Harris & Mehrotra, 2014). Program management analyst: familiar with total quality management tools.

Nature of Work (MI). Data scientist: explore, discover, investigate, and visualize (Harris, Murphy & Vasinman, 2013). Program management analyst: develops plans and coordinates.

Methods (MM). Data scientist: optimization/visualization; graphical models; classical, Bayesian, temporal, and spatial statistics; Monte Carlo simulation; data manipulation (Harris, Murphy & Vasinman, 2013). Program management analyst: no specific math or statistics requirements.

Computer Science Skills (P). Data scientist: programming, system administration, back-end programming, and front-end programming (Harris & Mehrotra, 2014). Program management analyst: no specific computer science requirements.

Typical Degree. Data scientist: computer science, data science, symbolic systems, or cognitive science. Program management analyst: degree not required.
The operations research analyst job announcement collected and analyzed in support of
this research requires candidate employees to possess the ability to conduct scientific work. BZC
analysts are required to possess the ability to design, develop, and adapt mathematical, statistical,
econometric, and other methods to recommend courses of action for complex problems. This
occupation at the BZC requires knowledge and skills in applying analytical and evaluation
techniques to identify and apply analytical processes to resolve problems. According to the job
announcements, operations research analysts working at the BZC are required to work
independently on small projects and to work with other analysts on large, complex projects. The
operations research analyst occupation requires a 4-year degree from an accredited college or
university in operations research or a similar course of study with at least three to twenty-four
semester hours in calculus. The operations research analyst position announcement analyzed in
support of this research described that operations research analysts will be paired with subject
matter experts in the organization. This distinction supports the notion that the BZC is stressing
the importance of creating teams comprised of domain experts and advanced analysts.
The comparison of the skills required in the operations research analyst job
announcement to scholarly views of data science skills is provided in Table 37. According to
the recent operations research analyst job announcement, BZC operations research analysts are
required to have skills in data collection, a wide range of methods to conduct data analysis, and
applied mathematics. There are no specific analysis software, computer science, or programming
requirements. Several participants in this research expressed that the operations research analyst
occupation possesses the skills most closely related to those of a data scientist.
Table 37
Data Scientist and BZC Operations Research Analyst Required Skills Comparison

Types of Data (CD). Data scientist: big data, all types, including unstructured, numeric, and non-numeric data (Harris, Murphy & Vasinman, 2013). Operations research analyst: data collection.

Preferred Tools (TE). Data scientist: mathematical languages (such as R and Python®), machine learning, natural language processing, and open-source tools (Harris & Mehrotra, 2014). Operations research analyst: no specific software or tools.

Nature of Work (MI). Data scientist: explore, discover, investigate, and visualize (Harris, Murphy & Vasinman, 2013). Operations research analyst: wide range of methods and techniques to perform analysis.

Methods (MM). Data scientist: optimization/visualization; graphical models; classical, Bayesian, temporal, and spatial statistics; Monte Carlo simulation; data manipulation (Harris, Murphy & Vasinman, 2013). Operations research analyst: applied mathematics and statistics, no specific statistical methods.

Computer Science Skills (P). Data scientist: programming, system administration, back-end programming, and front-end programming (Harris & Mehrotra, 2014). Operations research analyst: no specific computer science requirements.

Typical Degree. Data scientist: computer science, data science, symbolic systems, or cognitive science. Operations research analyst: operations research or a similar degree with specific math requirements.
The computer scientist job announcement collected and analyzed in support of this
research requires candidate employees to possess expert knowledge of the theories, concepts,
principles, practices, standards, methods, techniques, and materials of professional computer
science. Candidates are required to have knowledge of other technical disciplines to apply
advanced computer software, software systems, hardware architectural theories, and principles
and concepts for new application development and experimental theories.
The comparison of the skills required in the computer scientist job announcement to
scholarly views of data science skills is provided in Table 38. According to the recent computer
scientist job announcement, BZC computer scientists are required to have skills in the theories
and concepts of computer science, including the mathematics requirements encompassed in a
computer science bachelor’s degree. This occupation requires thirty semester hours of combined
mathematics, statistics, and computer science and a minimum of fifteen hours combining
statistics and calculus. There are no specific analysis software or programming requirements.
The job announcement and the collected interview data from the BZC indicate that computer
scientists are employed in many different capacities throughout the organization.
Table 38
Data Scientist and BZC Computer Scientist Required Skills Comparison

Types of Data (CD). Data scientist: big data, all types, including unstructured, numeric, and non-numeric data (Harris, Murphy & Vasinman, 2013). Computer scientist: no specific data analysis requirements.

Preferred Tools (TE). Data scientist: mathematical languages (such as R and Python®), machine learning, natural language processing, and open-source tools (Harris & Mehrotra, 2014). Computer scientist: no specific software or tools.

Nature of Work (MI). Data scientist: explore, discover, investigate, and visualize (Harris, Murphy & Vasinman, 2013). Computer scientist: apply theories and concepts of computer science.

Methods (MM). Data scientist: optimization/visualization; graphical models; classical, Bayesian, temporal, and spatial statistics; Monte Carlo simulation; data manipulation (Harris, Murphy & Vasinman, 2013). Computer scientist: no specific math or statistics requirements.

Computer Science Skills (P). Data scientist: programming, system administration, back-end programming, and front-end programming (Harris & Mehrotra, 2014). Computer scientist: apply theories and concepts of computer science.

Typical Degree. Data scientist: computer science, data science, symbolic systems, or cognitive science. Computer scientist: computer science or a similar degree with specific math requirements.
The comparative analysis of the BZC job announcements against the scholarly views of
data science suggests the BZC can hire analysts with significant math, statistics, operations
research, and computer science skills through a combination of OPM occupations. This suggests
there is no single OPM occupation that encompasses data science and that a teaming approach
for data science enablement is appropriate. None of the analyst occupations, nor the computer
science occupation, required any specific software knowledge.
Summary
The BZC is an organization that is generating big data sets and has varying levels of
analysis capability throughout its business units. The results of the research were triangulated
from semi-structured interviews with analysts, a focus group interview with management, and
document analysis of a BZC strategic document and recent BZC job announcements. Several
themes emerged as limitations in the BZC’s ability to analyze large data sets and were shown
throughout this research. Access to quality data, metrics, management, organization structure,
culture, infrastructure, data analysis processes, data science skills, and training emerged from the
research as themes important to big data analysis within the BZC.
All of the participants in the research recognized the benefits of developing data science
skills within BZC. Six of the eleven analysts agreed that data science is a role beyond that of a
traditional analyst, two analysts suggested existing analysts could evolve their skills to the level
of a data scientist, and three analysts were unsure. The focus group participants agreed with the
scholarly definitions of a data scientist and that data science is a unique role beyond that of a
BZC traditional analyst. The focus group stressed that if data science includes business domain
understanding it is going to be difficult for their organization to attract, train, and retain this level
of talent. There were common themes on the limitations of the skills of current analysts due to
occupational standards, access to training, access to software, and competition for talent. There
was a significant theme of how to train, certify, and employ data scientists within the BZC.
CHAPTER 5. DISCUSSION, IMPLICATIONS, RECOMMENDATIONS
Introduction
Rapid data growth is having profound effects on modern-day corporations and the United
States military as they continue to progress through the information technology age
(Ransbotham, Kiron, & Prentice, 2015). Harris and Mehrotra (2014) suggested the skills required
to manage and analyze the exponentially growing size of data are inadequate and in short supply
with bleak predictions for the future. This research explored the emerging commercial data
scientist occupation and the skills required of data scientists to help determine if data science
applies to the DOD. This research sought to define further the skills required of data scientists to
help enable their effectiveness in modern organizations with specific emphasis aimed at the
DOD. The targeted population consisted of analysts, managers, or executives working within the
Bravo Zulu Center (BZC). This research explored data science and the implications associated
with the big data phenomenon by conducting qualitative research with a representative case
study organization. This research explored essential skill sets, attitudes, and perceptions of the
analysts working big data issues for the BZC, along with the skills sets, attitudes, and perceptions
of management within the same organization. A BZC strategic planning document and recent
BZC job announcements were collected and analyzed, ensuring triangulation from three
collection methods to improve the overall accuracy of the research (Gronhaug & Ghauri, 2010).
This chapter discusses the findings of the research compared to the research questions
and the supporting literature review to ensure fulfillment of the research purpose. The chapter
evaluates how the research contributed knowledge toward understanding and resolving the
business problem posed in this study and provides multiple recommendations for further
research.
Conceptual Framework Final Implications
The conceptual framework served as the foundational knowledge to support this research
study. This framework guided the research by relying on formal theory, which supported the
researcher’s thinking on how to understand and plan to research the topic (Grant & Osanloo,
2014). William S. Cleveland (2001) coined the term data science in the context of enlarging the
major areas of technical work in the field of statistics. Cleveland’s seminal work described the
requirement of an “action plan to enlarge the technical areas of statistics focuses of the data
analyst” (Cleveland, 2001, p. 1). Cleveland described a major altering of the analysis occupation
to the point that a new field would emerge, called “data science” (Cleveland, 2001, p. 1).
Cleveland’s proposal of six technical areas that encompass the field of data science
includes multidisciplinary investigations, models and methods for data, computing with data,
pedagogy, tool evaluation, and theory as seen in Figure 14. This taxonomy was adapted and used
by the researcher to conceptualize the business problem, formulate a plan to collect and analyze
data and provide actionable conclusions.
Figure 14. Cleveland’s Data Science Taxonomy. Adapted from “Data science: An action plan
for expanding the technical areas of the field of statistics,” by W. S. Cleveland, 2001,
International Statistical Review, 69(1), 21-26.
The coding and analysis of the data that was collected from interviews with BZC analysts
served as the baseline for the enhanced coding structure and were then used in the coding and
analysis of the focus group interview and the collected BZC documents. After continual reading
and synthesizing of the triangulated collected data, recurring topics and patterns emerged and
resulted in the final coding structure (see Figure 15). The themes that emerged from the
analysis of the information collected from the BZC formed the conclusions of this
research. The adaptation of Cleveland’s data science taxonomy was effective in this research
study and could be used to support future data science research.
Figure 15. Final hierarchical coding structure.
Evaluation of Research Questions
Two primary research questions guided this study. How does the Bravo Zulu Center
glean actionable information from big data sets? How mature are the data science analytical
skills, processes, and software tools used by Bravo Zulu Center analysts?
Research Question 1
The purpose of exploring how the BZC gleans actionable information from large data
sets was to understand if their organization is experiencing exponential data growth and how
effective their organization is at analyzing large data sets to help determine if the data science
occupation is warranted in DOD organizations. Findings from the case study revealed the BZC is
a large organization collecting an overwhelming amount of data from a large number of disparate
systems and the organization is not taking full advantage of the data that is available. The BZC
has different methods of gleaning actionable information from data sets, ranging from manual processes
of collecting and analyzing data to a mature level of analysis through effective business
intelligence systems. The most common method for analyzing data within the BZC is to pull raw
data from many different data warehouses, compile the data on local computers and then analyze
the data in Microsoft Excel or Access and provide the results in PowerPoint. When asked about
their knowledge of the term big data the participants indicated the BZC is operating in a big data
environment and most often equated their definition of big data to when an organization reaches
a data saturation point and is not able to effectively analyze their collected data. The analysis of
the collected data from the BZC was initially analyzed through the use of word frequency
queries and early themes of analysis skills, training, and information systems were identified.
Continual coding and analysis of the collected data revealed access to quality data, organization
structure, culture, infrastructure, and disparate systems as areas that are constraining the BZC’s
ability to glean actionable information from large data sets.
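The most common workflow described above, pulling raw extracts from disparate warehouses, compiling them locally, and summarizing the result, can be illustrated with the kind of short script the interviewed analysts described writing. The part numbers, unit counts, and fill-rate metric below are hypothetical illustrations, not BZC data:

```python
# Hypothetical extracts standing in for pulls from two disparate
# data warehouses (the BZC's actual systems are not public).
readiness = {"A1": 40, "B2": 55, "C3": 10}   # units ready per part
demand = {"A1": 50, "B2": 50, "C3": 40}      # units required per part

# Compile the disparate pulls into one table, as analysts described
# doing manually in Excel or Access.
report = {part: readiness[part] / demand[part] for part in demand}

# A simple summary metric of the kind briefed in PowerPoint slides.
for part, fill_rate in sorted(report.items()):
    print(f"{part}: {fill_rate:.0%} fill rate")
```

A script of this kind replaces the manual compile step the participants described, which is one reason access to tools such as Python was a recurring theme in the collected data.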
Research Question 2
The purpose of exploring how mature the data science skills of analysts, processes, and
software tools are at the BZC was to understand if the current BZC and DOD occupational job
series and the skills required of those job series encompass the scholarly views of data science
skills to ultimately help determine if the data science occupation is warranted in DOD
organizations. Six of the analysts and the focus group participants agreed that data science skills
are skills beyond that of traditional analysts, two analysts suggested the role of the data scientist
does not have to be unique, and three analysts were unsure. All the participants agreed that data
science skills are lacking at the BZC. All the participants indicated there are very few data
scientists within the organization and a large portion of their advanced analytical work is
contracted to outside companies. When asked how evolved their analytical processes and products were relative to perceived data scientist abilities, the participants indicated they are in the beginning stages of building advanced analytical products with limited predictive analytical capability. Additionally, comparing the skills and duties required of analysts and computer scientists, as described in recent BZC job announcements, with scholarly views of data science skills revealed components of data science skills spread across several analyst occupations and the computer science occupation. Harris and Mehrotra (2014)
proposed creating teams that combine business analysts, visualization experts, modeling experts,
and data scientists from different disciplines and functional areas may provide the most effective
strategy for employment. Currently, the BZC cannot hire a government data scientist under a single occupation, so creating teams that collectively encompass data science skills is warranted.
Harris and Mehrotra (2014) suggested common desktop applications limit the analysis capabilities of many organizations. Data scientists are well versed in advanced statistical software and use open source libraries to conduct advanced analysis. The results of the research revealed that BZC analysts are constrained in their ability to conduct data science by limited access to modern data science tools such as R and Python® and to modern visualization software such as Tableau. The research revealed a mix of statistical and business intelligence software is available, but there is no standardized plan for analysis software across the BZC. The participants expressed frustration with information technology policy constraints that prevent access to modern analytical software and inhibit the BZC's data science evolution.
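To illustrate the kind of scripted, repeatable analysis that tools such as Python enable beyond a spreadsheet workflow, consider a simple least-squares trend fit and forecast. The monthly workload figures below are invented for illustration, not BZC data, and the helper function is the writer's own sketch in plain Python with no libraries.

```python
# Hypothetical monthly workload counts, standing in for data an analyst
# would otherwise compile and chart by hand in a spreadsheet.
months = [1, 2, 3, 4, 5, 6]
workload = [120, 132, 128, 141, 150, 158]

def least_squares_fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares (pure Python, no libraries)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

a, b = least_squares_fit(months, workload)
forecast_month_7 = a + b * 7  # simple one-step-ahead projection
print(f"Projected month-7 workload: {forecast_month_7:.1f}")
```

Once such a script exists it can be rerun against each month's data pull, which is the repeatability advantage the participants associated with modern analytical tooling.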
There is evidence the BZC is actively engaged in advancing data science skills in their
organization. The BZC has recently created a small data science team that is focused on utilizing
data science to bring actionable insights into one specific business unit within their command.
Additionally, the BZC’s strategic document that was collected and analyzed revealed there is a
strategic objective to enable complete data integration and data availability across the BZC with
a goal to increase their analytical capability. The BZC analysts' data science skills, processes, and analysis software are immature. The BZC analysts and managers understand their data science limitations and are actively engaged in bringing these skills into their business.
Fulfillment of Research Purpose
The Chapter 2 literature review provided a foundation of scholarly research expressing the critical importance of big data analysis in both the commercial and DOD sectors. The literature review described the emergence of the data science occupation, established that this occupation is critical for big data analysis in modern environments, and noted that these skills are in short supply (Edwards, 2014). The research sought to further define data science skills and to determine how, and whether, these skills could be employed in DOD organizations by examining the skills and abilities of federal civilians working as analysts within the BZC. The research revealed that the scholarly views of data science skills are inherent to several federal OPM occupations of personnel working within the BZC. Chapter 4 revealed the BZC is experiencing extreme data growth and has immature data science skills and processes, and it provided several implications on how best to employ data scientists within their organization. These
findings directly related to the specific business problem that the DOD may be struggling with
gleaning actionable information from large data sets compounded by immature data science
skills. The research provided evidence that there are skills differences between data scientists and
the traditional analyst that are available to DOD organizations through the current Federal OPM
occupations. This research identified access to quality data, organization structure and culture, infrastructure and legacy systems, access to training, competition for talent, and access to software as themes preventing the BZC from fully leveraging data science capabilities, and these limitations may be affecting other DOD organizations. Additional themes of big data,
metrics, management, data analysis processes, data science skills, and domains resulted from this
research and supported the conclusion that data science skills, processes, and software are
immature at the BZC.
The results of this research suggest DOD organizations will accelerate their ability to
glean actionable information from large data sets by maturing data science skills within their
workforce. The results of this research propose there are several limitations that are inhibiting the
development of a DOD data science workforce. Harris and Mehrotra (2014) suggested creating
teams that combine business analysts, visualization experts, modeling experts, and data scientists
as an effective strategy. Because there is no formalized data science occupation within the DOD
151
workforce and because the DOD is competing for scarce data science talent creating data
analysis teams that comprise the breadth of data science and domain understanding is a
reasonable approach. DOD organizations should evaluate the abilities of their existing analysts in
domain understanding and data science skills to support an action plan to further mature data
science within their organizations. Additionally, by creating a visualization that plots assessments of their analysts on domain knowledge and data science skills, DOD organizations can explore the maturity of their overall analysis capability, as seen in Figure 16. Finally, DOD organizations should influence the skill requirements sections of job announcements for incoming analysts to bring in more data science skills and should evaluate all policies and infrastructure limitations prohibiting the use of modern data science analytical software.
Figure 16. Domain and data science assessment model.
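The logic behind such an assessment plot can be approximated programmatically. In this sketch the analyst names, the 1-10 scores, the threshold, and the quadrant labels are all hypothetical, invented to show how an organization might classify analysts along the model's two axes; they are not drawn from the study.

```python
# Hypothetical assessment scores (1-10) on the two axes of the model:
# business-domain knowledge and data science skill.
analysts = {
    "Analyst A": (9, 2),   # deep domain knowledge, little data science
    "Analyst B": (3, 8),   # strong data science, limited domain knowledge
    "Analyst C": (8, 8),   # mature on both axes
}

def quadrant(domain, data_science, threshold=5):
    """Place an analyst in one of four maturity quadrants."""
    if domain >= threshold and data_science >= threshold:
        return "mature analytical capability"
    if domain >= threshold:
        return "domain expert, needs data science development"
    if data_science >= threshold:
        return "data scientist, needs domain immersion"
    return "developmental on both axes"

for name, (dom, ds) in analysts.items():
    print(f"{name}: {quadrant(dom, ds)}")
```

Plotting the same pairs as a scatter chart with the threshold lines drawn in would reproduce the quadrant view suggested by Figure 16.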
Contribution to Business Problem
Gabel and Tokarski (2014) suggested organizations face rapid data growth and require
deliberate action by leadership to ensure sustainability. The DOD is generating massive amounts
of data and is facing similar challenges (Hamilton & Kreuzer, 2018). The general business
problem is the lack of effective analysis in organizations operating in the modern-day big data
environment (Harris & Mehrotra, 2014). The specific business problem is that DOD
organizations may be struggling with gleaning actionable information from large data sets
compounded by immature data science skills of DOD analysts (Harris et al. 2013).
This qualitative case study analyzed the perceptions and experiences of analysts working on big data analysis issues in a representative organization, along with the perceptions and
experiences of management within the same organization. The research provided actionable
information on how DOD organizations are currently analyzing large data sets. This research
provided insights regarding the current skills of analysts within the case study organization and
how evolved these skills are when compared to the scholarly views of data science skills. This
research uncovered vital limitations regarding the data science skills of existing DOD analysts
and new analysts coming into the federal OPM occupations when compared to scholarly views
of data science skills. The findings are that the personnel assigned as analysts within the case
study organization have detailed business domain understanding but do not have data science
specific skill requirements and training. The relatively small number of analysts that do have
partial requirements for data science-related skills are spread across several OPM occupations
and the job announcements used to hire analysts only partially include the breadth of data
science-related skills. Additionally, DOD analysts are constrained in their ability to leverage
modern analytical software. The BZC analysts that participated in this research are providing
valuable products to management throughout the organization. However, before BZC analysts can build advanced analytical products on their large data sets, the organization will need to further assess the skills of existing analysts and the policies constraining data science maturity and subsequent analytical innovation.
Recommended Actions for DOD Organizations
This research investigated how the BZC gleans actionable information from big data sets
and identified access to quality data, organization structure and culture, infrastructure and legacy
systems, access to training, competition for talent, and access to software as constraints to data
science adoption. The research concluded the data science skills and processes of analysts
working at the BZC are immature and all of the participants in this research agreed that
advancing data science is critical to BZC’s mission effectiveness. The research suggests DOD
organizations should develop an action plan to mature data science to include:
- Evaluate existing analysts on business and data science knowledge.
- Create data science teams by combining data science related federal occupations.
- Influence job announcements to include data science skills.
- Remove policies constraining access to modern analytical software.
- Remove policies constraining access to data science training.
- Develop strategies to integrate and share quality data.
Recommendations for Further Research
Further research recommendations were derived from the limitations posed in Chapter 1
as well as the findings and themes from the analysis of the collected data in Chapter 4. Cooper
and Schindler (2013) suggested a limitation of qualitative research is the ability to generalize conclusions to a larger population. The findings of this research suggest DOD organizations are experiencing big data growth and are struggling to glean actionable information from large data sets, compounded by immature data science skills. The following recommendations for
further research may help quantify the shortage of DOD data scientists, provide further detail on data science software and training barriers, and explore the organizational and cultural implications of data science adoption:
A quantitative study including a large population of DOD analysts that statistically compares the skills used by DOD analysts with data science skills and quantifies the shortage of analytical talent. This research would help further define the gaps of current DOD analysts and support any restructuring of Federal OPM occupational standards and of how DOD organizations acquire and employ data scientists.
A quantitative study that examines the constraints limiting DOD analysts' access to software tools required for data science analysis. A researcher could survey DOD
operational units and information technology policy organizations regarding the
accessibility of software and the potential barriers that need addressing. Access to
modern software was a significant theme in this research, and access to analytical
software may be a common DOD problem.
An exploratory qualitative study or a quantitative study that examines access to data science or advanced analytical training within the DOD workforce. The participants
in this study presented a theme regarding the lack of data science training and
certification. Several options for data science training and certification are available
from commercial vendors, academia, and within the DOD. Additional research that
explores or quantifies the significance of access to data science training and
certification may help DOD organizations internally grow data scientists and is
warranted.
An exploratory case study that examines the organizational and cultural changes required in commercial or DOD organizations because of massive data growth and the need for better analytics. Gabel and Tokarski (2014) suggested large data sets are complicated, time-consuming, and expensive, and create strategic alignment problems in modern organizations. How to align the organization
and how and where to insert data scientists was a theme from this research with the
BZC and further research is warranted.
A qualitative case study that explores the management implications associated with
the arrival of big data and data science in modern organizations.
Conclusions
This study was intended to further define big data and data sciences and explore their
applicability to DOD organizations and expand the body of knowledge regarding big data and
data science. The primary findings of this study suggest the BZC is experiencing large data growth, and they concur with scholarly definitions of big data and data science and with the importance of further developing a data science workforce to meet mission requirements. The study
revealed that the BZC is a large complex organization generating large amounts of data and has
varying levels of ability to glean actionable information from large data sets, subject to several limitations.
The study revealed that data science skills and processes are immature within the BZC. The
personnel assigned as analysts within the case study organization have detailed business domain
understanding but do not have data science specific skill requirements and training. The
relatively small number of analysts that do have partial requirements for data science-related
skills are encompassed in several OPM occupations and the job announcements used to hire
analysts only partially include the breadth of data science-related skills. Several themes emerged
as constraints to data science expansion within the BZC. This research identified access to quality data, organization structure and culture, infrastructure and legacy systems, access to training, competition for talent, and access to software as themes preventing the BZC from fully leveraging data science capabilities, and these limitations may be affecting other DOD
organizations. Additional themes of big data, metrics, management, data analysis processes, data
science skills, and domains resulted from this research and supported the conclusion that data
science skills and processes are immature at the BZC. Finally, the study revealed the BZC has strategic actions underway to manage and integrate data for better accessibility, recognizes the importance of modern analytical software for its analysts, and continues to develop its analysts' skills in order to glean actionable information from big data sets that will directly contribute to mission effectiveness.
REFERENCES
Akerkar, R. (2014). Analytics on big aviation data: Turning data into insights. International
Journal of Computer Science and Applications, 11(3), 116-127. Retrieved from
https://pdfs.semanticscholar.org/820f/a4268e73d6de5beed8486dfa8b8d8ecc42de.pdf
Almeida, F. (2017). Benefits, challenges and tools of big data management. Journal of Systems
Integration, 8(4), 12-20. doi:10.20470/jsi.v8i4.311
Baskarada, S., & Koronios, A. (2017). Unicorn data scientist: The rarest of breeds. Program,
51(1), 65-74. doi:10.1108/PROG-07-2016-0053
Beer, D. (2016). How should we do the history of Big Data? Big Data & Society, 3(1), 1-10.
doi:10.1177/2053951716646135
Berner, M., Graupner, E., & Maedche, A. (2014). The information panopticon in the big data era.
Journal of Organization Design, 3(1), 14-19. doi:10.7146/jod.3.1.9736
Bowen, G. A. (2009). Document analysis as a qualitative research method. Qualitative Research,
9(2), 27-40. doi:10.3316/QRJ0902027
Brynjolfsson, E., & McAfee, A., (2012). Big data: The management revolution. Harvard
Business Review, 90(10), 60-68. Retrieved from http://tarjomefa.com/wp-
content/uploads/2017/04/6539-English-TarjomeFa-1.pdf
Chen, H., Chiang, R., & Storey, V. (2012). Business intelligence and analytics: From big data to
big impact. MIS Quarterly (0276-7783), 36(4), 1165. Retrieved from
https://www.jstor.org/stable/41703503
Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of the
field of statistics. International Statistical Review, 69(1), 21-26. doi:10.1111/j.1751-5823.2001.tb00477.x
Columbus, L. (2018, January 29). Data scientist is the best job in America according to
Glassdoor’s 2018 rankings. Forbes. Retrieved from
https://www.forbes.com/sites/louiscolumbus/2018/01/29/data-scientist-is-the-best-job-in-
america-according-glassdoors-2018-rankings/#33eaef7c5535
Cooper, D., & Schindler, P. (2013). Business research methods, 12th Edition. McGraw-Hill
Learning Solutions, 2013-03-05. VitalBook file.
Costlow, T. (2014). How big data is paying off for DOD. Defense Systems, October 24, 2014.
Retrieved from https://defensesystems.com/articles/2014/10/24/feature-big-data-for-
defense.aspx
Cotter, P. (2014). Analytics by degree: The dilemmas of big data analytics in lasting
university/corporate partnerships (Doctoral dissertation). Retrieved from ProQuest UMI
Dissertation, UMI Number 3635733
Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods
approaches (3 ed.). Thousand Oaks, CA: Sage.
Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan
Management Review, 54(1), 43-46. Retrieved from
https://pdfs.semanticscholar.org/eb3d/ece257cca2e8ce6eaf73fd98c1fdcbdc5522.pdf
Davenport, T., & Dyché, J. (2013). Big data in big companies. SAS Institute. Retrieved from
https://www.sas.com/en_us/whitepapers/bigdata-bigcompanies-106461.html
Davenport, T., & Patil D. (2012). Data scientist: The sexiest job of the 21st Century. Harvard
Business Review 90(10), 70-76. Retrieved from https://hbr.org/
Davis, J. (2016, July 15). Microsoft launches online data science program. InformationWeek.
Retrieved from http://www.informationweek.com/big-data/big-data-analytics/microsoft-
launches-online-data-science-degree-program/d/d-id/1326276
DISA, (2015). Defense Information Systems Agency request for information, Big Data Solution
and Governance Capabilities. March, 2015. Retrieved from
https://govtribe.com/project/defense-information-systems-agency-disa
Edwards, J. (2014). Big data takes strategic turn at DOD. Defense News, November 21, 2014.
Retrieved from https://www.c4isrnet.com/it-networks/2014/11/20/big-data-takes-a-
strategic-turn-at-dod/
Fox, S., & Do, T. (2013). Getting real about big data: Applying critical realism to analyse big
data hype. International Journal of Managing Projects in Business, 6(4), 739-760.
doi:10.1108/IJMPB-08-2012-0049
Frizzo-Barker, J., Chow-White, P., Mozafari, M., & Dung, H. (2016). An empirical study of the
rise of big data in business scholarship. International Journal of Information Management,
36(3), 403-413. doi:10.1016/j.ijinfomgt.2016.01.006
Gabel, T. J., & Tokarski, C. (2014). Big Data and organization design. Journal of Organization
Design, 3(1), 37-45. doi:10.7146/jod.3.1.9753
Galbraith, J. (2014). Organization design challenges resulting from big data. Journal of
Organization Design, 3(1), 2-13. doi:10.7146/jod.3.1.8856
Gang-Hoon, K., Trimi, S., & Ji-Hyong, C. (2014). Big-data applications in the government
sector. Communications of the ACM, 57(3), 78-85. doi:10.1145/2500873
Géczy, P. (2015). Big data management: Relational framework. Review of Business & Finance
Studies, 6 (3), 21-30. Retrieved from
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2656427
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of
Management Journal. 57(2), 321-326. doi:10.5465/amj.2014.4002
Gobble, M. M. (2013). Big data: The next big thing in innovation. Research Technology
Management, 56(1), 64-66. doi:10.5437/08956308X5601005
Grant, C., & Osanloo, A. (2014). Understanding, selecting, and integrating a theoretical
framework in dissertation research: Creating the blueprint for your house. Administrative
Issues Journal: Education, Practice, and Research, 4(2), 12-26. doi:10.5929/2014.4.2.9
Granville, V. (2014). Developing analytical talent: Becoming a data scientist. Indianapolis, IN: John Wiley & Sons.
Gronhaug, P., & Ghauri, K. (2010). Research methods in business studies XML Vitalsource
ebook for Capella, 4th Edition. Pearson Learning Solutions. VitalBook file.
Grossman, R., & Siegel, K. (2014). Organizational models for big data and analytics. Journal of
Organization Design, 3(1), 20-25. doi:10.7146/jod.3.1.979
Halper, F. (2016). The citizen data scientist-coming to your organization? Business Intelligence
Journal, 21, 55-56. Retrieved from https://tdwi.org
Hamilton, S. P., & Kreuzer, M. P. (2018). The big data imperative. Air & Space Power
Journal, 32(1), 4-20. Retrieved from
https://www.airuniversity.af.edu/Portals/10/ASPJ_Spanish/Journals/Volume-30_Issue-
2/2018_2_11_hamilton_s_eng.pdf
Harris, H. D., Murphy, S. P., & Vaisman, M. (2013). Analyzing the analyzers: An introspective
survey of data scientists and their work. Sebastopol, CA: O’Reilly Media.
Harris, J. G., & Mehrotra, V. (2014). Getting value from your data scientists. MIT Sloan
Management Review, 56(1), 15-18. Retrieved from https://sloanreview.mit.edu
Henry, R., & Venkatraman, S. (2015). Big data analytics: The next big learning opportunity.
Journal of Management Information and Decision Sciences, 18(2), 17-29. Retrieved from
https://www.abacademies.org/journals/journal-of-management-information-and-decision-
sciences-home.html
Hoffman, M. (2013). Big data poses big problem for Pentagon. Defense Tech, February 20, 2013. Retrieved
from https://www.military.com/defensetech/2013/02/20/big-data-poses-big-problem-for-
pentagon
INFORMS (2017). Certified Analytics Professional Handbook. Retrieved from
https://www.certifiedanalytics.org/
Kitchin, R., & McArdle, G. (2016). What makes big data, big data? Exploring the ontological
characteristics of 26 datasets. Big Data & Society, 3(1). doi:10.1177/2053951716631130
Kiron, D. (2013). Organizational alignment is key to big data success. MIT Sloan Management
Review, 54(3), 1-n/a. Retrieved from https://sloanreview.mit.edu/
Konkel, F. (2015). Pentagon to Silicon Valley: Teach us big data. NextGov, June 18, 2015.
Retrieved from http://www.nextgov.com/analytics-data/2015/06/pentagon-silicon-valley-
teach-us-big-data/115717/
Iansiti, M., & Lakhani, K. R. (2014). Digital ubiquity: How connections, sensors, and data are
revolutionizing business. Harvard Business Review, 92(11), 90-99. Retrieved from
https://hbr.org/
Lohr, S. (2013, February 04). Searching for origins of the term ‘big data’. The New York Times.
Retrieved from http://www.nytimes.com
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011).
Big Data: The Next Frontier for Innovation, Competition & Productivity, 1-143.
Retrieved from https://www.mckinsey.com/business-functions/digital-mckinsey/our-
insights/big-data-the-next-frontier-for-innovation
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. (cover story).
Harvard Business Review, 90(10), 60-68. Retrieved from https://hbr.org/
McCaney, K. (2014). Navy wants to take big data into battle. Defense News, June 24, 2014.
Retrieved from https://defensesystems.com/articles/2014/06/24/navy-onr-big-data-
ecosystem.aspx
Miller, S. (2014). Collaborative approaches needed to close the big data skills gap. Journal of
Organization Design, 3(1), 26-30. doi:10.7146/jod.3.1.9823
Moorthy, J., Lahiri, R., Biswas, N., Sanyal, D., Ranjan, J., Nanath, K., & Ghosh, P. (2015). Big
data: Prospects and challenges. The Journal for Decision Makers, 1(40), 74-96.
doi:10.1177/0256090915575450
Moustakas, C. (1994). Phenomenological research methods. Thousand Oaks, CA: Sage
Publications.
National Academies Press (2017). Strengthening data science methods for department of defense
personnel and readiness missions. Washington D.C. Retrieved from
https://www.nap.edu/catalog/23670/strengthening-data-science-methods-for-department-
of-defense-personnel-and-readiness-missions
OPM, (2005). U.S. Office of Personnel Management. Professional Work in the Mathematical
Sciences Group, 1500. Retrieved from https://www.opm.gov/policy-data-
oversight/classification-qualifications/classifying-general-schedule-
positions/standards/1500/gs1500p.pdf
OPM, (2009). U.S. Office of Personnel Management. Handbook of Occupational Groups and
Families. Retrieved from https://www.opm.gov/policy-data-oversight/classification-
qualifications/classifying-general-schedule-positions/occupationalhandbook.pdf
OPM, (2014). U. S. Office of Personnel Management strategic plan FY2014-2018. Retrieved
from https://www.opm.gov/about-us/budget-performance/strategic-plans/2014-2018-
strategic-plan.pdf
Ortiz Jr., S. (2010). Taking business intelligence to the masses. IEEE Computer, July 2010, 12-15. Retrieved from
https://www.computer.org/cms/Computer.org/ComputingNow/homepage/news/CO_0710
_FeatStory_BusinessIntelligenceToMasses.pdf
Parmar, R., Cohn, D., & Marshall, A. (2014). Driving Innovation through data. IBM Institute for
Business Value. Accessed December 27, 2015:
www.935.ibm.com/services.us/gbs/thoughleadership/innovation-through-data
Phillips-Wren, G., & Hoskisson, A. (2015). An analytical journey towards big data. Journal of
Decision Systems, 24(1), 87-102. doi:10.1080/12460125.2015.994333
Piatetsky, G. (2017, January 10). Data scientist-best job in America, again. KDnuggets.
Retrieved from http://www.kdnuggets.com/2017/01/glassdoor-data-scientist-best-job-
america.html.
Porche, III, I., Wilson, B., Johnson, E., & Tierney, S. (2014). Data Flood: Helping the Navy
Address the Rising Tide of Sensor Information. RAND Corporation, 2014.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven
decision making. Big Data, 1(1), 51-59. doi:10.1089/big.2013.1508
Ransbotham, S., Kiron, D., & Prentice, P. K. (2015). The talent dividend: Analytics is driving
competitive advantage at data-oriented companies. MIT Sloan Management Review, 56(4),
1-12. Retrieved from https://sloanreview.mit.edu/
Rouhani, S., Ashrafi, A., Zare Ravasan, A., & Afshari, S. (2016). The impact model of business
intelligence on decision support and organizational benefits. Journal of Enterprise
Information Management, 29(1), 19-50. doi:10.1108/JEIM-12-2014-0126
Santaferraro, J. (2013). Filling the demand for data scientists: A five-point plan. Business
Intelligence Journal, 18, 13-18. Retrieved from https://tdwi.org/research/list/tdwi-
business-intelligence-journal.aspx
SAS, (2017). SAS academy for data science. Retrieved from
https://www.sas.com/en_us/learn/academy-data-science.html
Schneider, K. F., Lyle, D. S., & Murphy, F. X. (2015). Framing the big data ethics debate for the
military. Joint Force Quarterly: JFQ, (77), 16-23. Retrieved from
https://pdfs.semanticscholar.org/8cbc/6b28d0e1bca2bcf09cb6c5d389ec086c7748.pdf
Seidman, I. (2013). Interviewing as qualitative research: a guide for researchers in education
and the social sciences. Teachers college press.
Shah, S., Horne, A., & Capellá J. (2012). Good data won’t guarantee good decisions. Harvard
Business Review, 90(4), 23-25. Retrieved from https://hbr.org/
Sharda, R., Adomako Asamoah, D., & Ponna, N. (2013). Research and pedagogy in business
analytics: Opportunities and illustrative examples. Journal of Computing & Information
Technology, 21(3), 171-183. doi:10.2498/cit.1002194
Smith, M. (2015, February, 18). The White House names Dr. DJ Patil as the first chief data
scientist. Retrieved from https://obamawhitehouse.archives.gov/blog/2015/02/18/white-
house-names-dr-dj-patil-first-us-chief-data-scientist
Swain, A. (2016). Big data analytics: An expert interview with Bipin Chadha, data scientist for
united services automobile association (USAA). Journal of Information Technology Case
and Application Research, 18(3), 181-185. doi:10.1080/15228053.2016.1223497
Swanson, R., & Holton, E. (2005). Research in Organizations, Foundations and Methods of
Inquiry. San Francisco, CA: Berrett-Koehler Publishers, Inc.
Symon, P. B., & Tarapore, A. (2015). Defense intelligence analysis in the age of big data. Joint
Force Quarterly: JFQ, (79), 4-11. Retrieved from
http://ndupress.ndu.edu/Media/News/Article/621113/defense-intelligence-analysis-in-the-
age-of-big-data/
Turner, V., Reinsel, D., Gantz, J., & Minton, S. (2014). The digital universe of opportunities:
Rich data and the increasing value of the internet of things. EMC Corporation. Retrieved
from https://www.emc.com/leadership/digital-universe/2014iview/index.htm
U.S. Air Force (2016). Data science and the USAF ISR enterprise. Retrieved from
http://www.defenseinnovationmarketplace.mil/resources/Data_Science_and_the_USAF_IS
R_Enterprise%20_White_Paper.PDF
Viaene, S. (2013). Data scientists aren’t domain experts. IT Professional, 15(6), 12-17. Retrieved
from https://ieeexplore.ieee.org/document/6674007
Walker, J. (2012). The use of saturation in qualitative research. Canadian Journal of
Cardiovascular Nursing, 22(2), 37-41. Retrieved from
https://www.ncbi.nlm.nih.gov/pubmed/22803288
Watson, H. J., & Marjanovic, O. (2013). Big data: The fourth data management generation.
Business Intelligence Journal, 18, 4-8. Retrieved from https://tdwi.org/research/list/tdwi-
business-intelligence-journal.aspx
White House (2012). The big data research and development initiative. Washington D.C.
Retrieved from https://obamawhitehouse.archives.gov/blog/2012/03/29/big-data-big-deal
White House (2016). The federal big data research and development strategic plan. Washington
D.C. Retrieved from https://www.nitrd.gov/PUBS/bigdatardstrategicplan.pdf
White House (2018). The networking and information technology research and development
program supplement to the President’s FY18 budget. Washington D.C. Retrieved from
https://www.nitrd.gov/pubs/2018supplement/FY2018NITRDSupplement.pdf
Yin, R. (2009). Case study research: Design and methods (Applied Social Research Methods
Series, 5, 4th ed.). Thousand Oaks, CA: Sage Publications.
Yin, R. (2012). Applications of Case Study Research. Thousand Oaks, CA: Sage Publications.
Young, J. (2014). An epidemiology of big data (Doctoral dissertation). Retrieved from ProQuest
UMI Dissertation, UMI Number 3620515.
Zhao, Y., MacKinnon, D., & Gallup, G. (2015). Big data and deep learning for understanding
DOD data. CrossTalk. July/August 2015. Retrieved from
http://www.crosstalkonline.org/issues/julyaugust-2015.html
Zhu, Y., & Xiong, Y. (2015). Towards data science. Data Science Journal, 14, 8.
doi:10.5334/dsj-2015-008
Zuboff, S. (1988). In the age of the smart machine: The future of work and power. New York,
NY: Basic Books.
STATEMENT OF ORIGINAL WORK
Academic Honesty Policy
Capella University’s Academic Honesty Policy (3.01.01) holds learners accountable for the
integrity of work they submit, which includes but is not limited to discussion postings,
assignments, comprehensive exams, and the dissertation or capstone project.
Established in the Policy are the expectations for original work, rationale for the policy,
definition of terms that pertain to academic honesty and original work, and disciplinary
consequences of academic dishonesty. Also stated in the Policy is the expectation that learners
will follow APA rules for citing another person’s ideas or works.
The following standards for original work and definition of plagiarism are discussed in the
Policy:
Learners are expected to be the sole authors of their work and to acknowledge the
authorship of others’ work through proper citation and reference. Use of another person’s
ideas, including another learner’s, without proper reference or citation constitutes
plagiarism and academic dishonesty and is prohibited conduct. (p. 1)
Plagiarism is one example of academic dishonesty. Plagiarism is presenting someone
else’s ideas or work as your own. Plagiarism also includes copying verbatim or
rephrasing ideas without properly acknowledging the source by author, date, and
publication medium. (p. 2)
Capella University’s Research Misconduct Policy (3.03.06) holds learners accountable for research
integrity. What constitutes research misconduct is discussed in the Policy:
Research misconduct includes but is not limited to falsification, fabrication, plagiarism,
misappropriation, or other practices that seriously deviate from those that are commonly
accepted within the academic community for proposing, conducting, or reviewing
research, or in reporting research results. (p. 1)
Learners failing to abide by these policies are subject to consequences, including but not limited to
dismissal or revocation of the degree.
http://www.capella.edu/assets/pdf/policies/academic_honesty.pdf
http://www.capella.edu/assets/pdf/policies/research_misconduct.pdf
Statement of Original Work and Signature
I have read, understood, and abided by Capella University’s Academic Honesty Policy (3.01.01)
and Research Misconduct Policy (3.03.06), including Policy Statements, Rationale, and
Definitions.
I attest that this dissertation or capstone project is my own work. Where I have used the ideas or
words of others, I have paraphrased, summarized, or used direct quotes following the guidelines
set forth in the APA Publication Manual.
Learner name and date: Roy Lancaster, 11/11/2018
APPENDIX A. INTERVIEW GUIDE
Interview Guide designed and created by Lancaster, 2018.
Purpose: The interviews with analysts and the focus group with managers are being
conducted to help senior leadership in the Bravo Zulu Center (BZC) understand how the analysis
of big data impacts the organization’s mission effectiveness. We would like your opinion and
perception of what you consider important knowledge, skills, and abilities necessary for both the
analysts and the management team working big data issues for the BZC. Your feedback on big
data, data science, and how our organization relies on this data to conduct daily business in the
BZC is valuable in helping us understand how and where we can focus our efforts to improve the
BZC organization. Thank you for taking the time to participate.
Rationale: The principal rationale for furthering knowledge of the big data
phenomenon and a potentially emerging data science occupation is that the ability to
manage and analyze large amounts of data is more a human problem than an
information systems technology problem (McAfee & Brynjolfsson, 2012).
Interview Guide Questions
Code Question
MI- Multidisciplinary investigation How is data used in your organization to meet mission
requirements? What are some areas in your organization that
are dependent on data?
TH- Theory How do you define big data? What increases of digital data
(big data) have you witnessed and how has it impacted the
business of the BZC?
P-Pedagogy What are some knowledge, skills, and abilities needed to be
an effective data scientist?
TE- Tool evaluation
MM- Models and methods
What are some of the significant challenges associated with
conducting data analysis in your organization?
TH- Theory What are the data science skills that are used by the BZC
analysts?
MI- Multidisciplinary investigation What additional skills are needed by analysts to be effective
in the modern big data environment?
MI- Multidisciplinary investigation What else can you tell me regarding big data and data
science?
We cannot provide confidentiality to a participant regarding comments involving criminal activity/behavior, or statements
that pose a threat to yourself or others. Do NOT discuss or comment on classified or operationally sensitive information.