3 What
3.1 What data science is
We have no shortage of definitions of “data science”, because different definitions serve different purposes. I will start with a tautological description that almost amounts to an operational definition.
3.1.1 Data science is what data scientists do
In a 2012 article for the Harvard Business Review, Tom Davenport and DJ Patil (later the US Chief Data Scientist during the Obama Administration) wrote the following:
[W]hat data scientists do is make discoveries while swimming in data. … At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. [D]ata scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. Data scientists realize that they face technical limitations, but they don’t allow that to bog down their search for novel solutions. [T]he dominant trait among data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. (Davenport and Patil 2012)
3.1.2 Learning from data
For pith, we can turn to Donoho (2017): “Data science [is] the science of learning from data, with all that this entails,” to which he adds, “it studies the methods involved in the analysis and processing of data and proposes technology to improve methods in an evidence-based manner.”
For plainness, we can turn to my working definition: “a set of core activities to ask good scientific questions and to line up the tools to answer them rigorously using data.” I developed that formulation in 2016, drawing on The Art of Data Science (Peng and Matsui 2015). My working definition of data science links rather than separates the tasks of stating and solving problems, mediated through data. Peng and Matsui enumerated 5 core activities, to which I add a sixth. I explicitly link each of these 6 core activities to analysis, since data analysis is a central commitment of data science, as I explain further below.
Pose good questions. The set of potential questions is enriched through awareness of the kinds of learning that various analytic methods support.
Prepare data to address those questions. With the purposes of analysis in mind, the practitioner can obtain, manage, and explore data to ensure the data’s fitness to address the analytic purposes.
Probe the data. Conduct rigorous analysis to address questions, which includes developing and critically assessing one or more analytic models. The value of analysis itself comes from the ability to answer the question and to convey what is learned from data.
Place analytic results in context. Interpretation binds the question to the method, binds the method to the result, and puts them all into the context of assumptions about the data, technical assumptions in the analytic models, existing domain knowledge, and alternative analytic approaches that could have been considered. Understanding what specific data can’t tell you, or what phenomena those data rule out, is as valuable as interpreting what the data show.
Present methods and results. Communication shares what has been learned from data and how it was learned, and it also subjects the life cycle to scrutiny and transparency.
Preserve the entire life cycle. Ensure that the life cycle is traceable, accessible, reproducible, and enduring to the extent possible. In addition to communicating methods and results, transparency ensures that the data and analytic code are as available as possible, subject to privileges of access where necessary. This transparency in turn supports fundamental scientific norms.
Peng and Matsui took care to explain that their core activities do not need to, and often do not, occur in sequence. Rather, with the execution of each core activity, careful reflection could lead the practitioner to repeat, jump back, or jump forward. I expand on this idea later in this section.
Blei and Smyth (2017) wrote, “Although each of [statistical, computational, and human perspectives] is a critical component of data science, we argue that the effective combination of all three components is the essence of what data science is about. ... The practice of data science is not just a single step of analyzing a dataset. Rather, it cycles between data preprocessing, exploration, selection, transformation, analysis, interpretation, and communication. One of the main priorities for data science is to develop the tools and methods that facilitate this cycle.”
The Data Science Association focuses on meaning: Data science is “the scientific study of the creation, validation and transformation of data to create meaning”, and a data scientist is “a professional who uses scientific methods to liberate and create meaning from raw data.” (Data Science Association)
The National Institutes of Health Strategic Plan for Data Science defined it as “the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/or complex sets of data.” (National Institutes of Health 2018)
The National Academies of Sciences, Engineering, and Medicine described data science along with its relationship to other fields, its primary tasks, and its primary purposes:
[Data science centers on] multidisciplinary and interdisciplinary approaches to extracting knowledge or insights from data for use in a broad range of applications. It is the field of science that relies on processes and systems (mathematical, computational, and social) to derive information or insights from data. It is about synthesizing the most relevant parts of the foundational disciplines to solve particular classes of problems or applications while also creating novel techniques to address the ‘cracks’ between those disciplines where no approaches may yet exist … because the volume and variety of data available are expanding swiftly, data are more available immediately, and decisions based on data are increasingly automated and in real time. (National Academies of Sciences, Engineering, and Medicine 2018)
Finally, the National Institute of Standards and Technology (NIST) in 2015 defined data science as the “extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing.” Per NIST, a data scientist is “a practitioner with sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the data life cycle … The end-to-end role ensures that everything is performed correctly to explore the data, create and validate hypotheses.” (NIST Big Data Public Working Group 2015) The data life cycle is “a set of processes in an application that transform raw data into actionable knowledge”:
Collection: Gather and store raw data.
Preparation: Convert raw data into cleansed, organized information.
Analysis: Synthesize knowledge from organized information.
Action: Use synthesized knowledge to generate value for the enterprise.
NIST’s definitions appear in the context of discussing “big data” that cannot be accommodated by traditional architectures. Data volume (size), variety (sources, domains, and types), velocity (rate of flow), and variability (changing characteristics) “drive the shift to … architectures for data-intensive applications.” (NIST Big Data Public Working Group 2015) I prefer plain-language versions of these concepts:
Attribute | NIST “V” term | Plain-language term |
---|---|---|
Number of data elements | Volume | Size |
Multiple repositories, domains, or types | Variety | Scope, Sources |
Rate of flow (records per unit time) | Velocity | Speed |
Richness, complexity | Variability | Shape, Structure |
These varied definitions of data science explicitly invoke learning, answering questions, creating meaning, and extracting knowledge—epistemic tasks that take us into the nature, sources, structure, and limits of empirically derived knowledge. Learning defines data science and, in turn, centers it on the role of analysis. These definitions vary in whether and how they appeal to processes (such as a data life cycle and adaptive problem-solving), means (complex data), and methods (quantitative and analytical approaches and technology). These definitions also suggest but do not enumerate the skills that are needed to do data science.
I agree with the NIH and NASEM definitions in that current and evolving complexity drive a need for adapted methods. As data and technologies become more complex, the drive for adaptively learning from data also intensifies. The risk I see is that we might overinvest in narrow (but potentially useful) skills while giving short shrift to broadly applicable but underserved skills: We should respect the fundamentals and avoid an unfortunate tendency to overemphasize the exotic or the complex at the expense of those fundamentals. Indeed, Donoho (2017) makes a similar case. We cannot lose sight of the need for learning from conventional or familiar data, using conventional or familiar methods. In addition to basic skills for managing and analyzing data, we need always to esteem skills for critical reflection and reasoning in creative but disciplined ways. In this sense, complexity and sophistication motivate data science, but they do not define data science, and it is a mistake to overidentify data science with those drivers. It would also be a mistake to focus on the technology rather than the science: Learning from data is a scientific act, enabled by evolving methods and tools. The value of learning from data needs to be judged by scientific rather than technological norms. Have we posed questions of social value and scientific validity? (See also Freedman 1987.) Have we prepared the data to answer those questions? Have we adequately probed the data and placed findings in context? Are proposed conclusions traceable and defensible? Is the reasoning coherent? We will return to this point when we talk about how data science is done.
3.1.3 Core activities and critical reflection
As mentioned above, Peng and Matsui’s schema for the core activities of the art of data science goes deeper than merely listing those activities: Each core activity calls for critical reflection, in which the practitioner reviews each core activity by framing, checking, and possibly revising or revisiting that activity.
First, set your expectations for the core activity. Then collect information and compare your expectations to your information. If they don’t match, take another look and revise either your expectations or your information; perhaps repeat the current core activity or return to a previous one. If they do match, then you’re in a good place, but you should occasionally check again for good measure, lest you fall into a confirmation-bias trap.
In a real sense, critical reflection puts the science in data science. We can associate each core activity with example prompts for critical reflection.
Pose good questions. Is the question of interest? Valid and valuable? Consult with experts or the literature. If needed, revise the question.
Prepare data to address those questions. Are the data suited to the question? Examine data early and often to learn about structure, content, and suitability. If needed, refine the question or obtain more or different data.
Probe the data. Does the analytic model answer the question? Does it use data correctly and take a suitable form for explaining or predicting a phenomenon of interest? Challenge model assumptions and structure. If needed, revise model structure or inputs.
Place analytic results in context. Does the analysis provide a meaningful answer that holds up to scrutiny and contributes to domain knowledge? Assess the totality of analyses—effect sizes, accuracy or bias, variability or uncertainty—in consideration of varying assumptions about the data, the model, and the subject-matter context for the question. If needed, revise the analysis to provide a specific, meaningful answer or conduct diagnostics or sensitivity analyses to assess limitations and assumptions.
Present methods and results. Are the methods and results understood, complete, and meaningful to the audience? Assess content, style, and attitude and gauge audience feedback. If needed, revise the presentation to suit audience needs.
Preserve the entire life cycle. Are the data and analysis as open and transparent as possible? Assess whether data and analytic code can be made available with or without alteration or access controls, for example, through public repositories or under user agreements. If needed, work toward the least restrictive means for sharing.
These critical reflections show why the core activities need not occur in sequence. Indeed, some or all core activities could occur more than once for a given undertaking.
Core activity | Critical reflection: set expectations | Critical reflection: collect information | Critical reflection: resolve mismatch |
---|---|---|---|
Pose: State a good question | Question is of interest, will advance public health | Consult experts, literature | Revise the question |
Prepare: Obtain and explore data | Data are appropriate for the question | Examine data early, often to learn about the data and learn from the data | Refine the question or obtain other data |
Probe: Build a formal model | Model answers question, to describe, explain, or predict | Challenge model assumptions and structure (e.g., sensitivity analysis) | Revise model structure or inputs |
Place in context: Interpret results and implications | Analysis provides specific, meaningful answers | Totality of analyses—effect sizes, accuracy, uncertainty | Revise analysis to provide a meaningful answer |
Present: Communicate methods, results, significance | Content, style, attitude meaningful to audience | Feedback from audience | Revise presentation |
Preserve and post: Make life cycle transparent, enduring | Data, code, and methods available | Assess sensitivities to release and possible restrictions | Make as open as possible; document restrictions |
Core activities and critical reflections are adapted, in part, from Peng and Matsui (2015).
3.1.4 Commitments: life cycle, centered on analysis, subject to norms
I have taken a broad view in surveying a variety of motivations and definitions for data science. In consideration of this breadth and variety, I contend that data science entails 3 main commitments:
We learn from data in the context of an overall life cycle: posing rich questions about the world, amenable to rich methods; guiding how we generate, transmit, obtain, and prepare data; probing data to answer questions about the world; placing answers from data in context, mindful of assumptions and alternatives; presenting data-driven answers to audiences clearly and correctly; and preserving those answers and ensuring that the entire life cycle is transparent, accessible, traceable and, to the extent possible, reproducible.
Analysis centrally connects the life cycle of data. We pose questions, prepare data, place results in context, and present answers informed by the variety of available analytic approaches. If we have methods for analyzing images, then we can ask questions that only images can answer. To interpret and communicate a risk or a rate of change seen in a set of data, we infer meaning from the analytic method. As further discussed below, we have many analytic modes available to us beyond traditional statistical methods, such as causal inference and machine learning.
We judge data science approaches and claims by scientific norms. Since data science is about extracting knowledge through analyzing data, it should be judged by the same criteria that apply to extracting knowledge from observations. This commitment is familiar within statistical practice; it needs to become familiar within other data-analytic modes, including machine learning.
These 3 commitments unpack what it means to learn from data, and they set some boundaries around that practice. They point to the practitioner’s responsibility for respecting context, respecting analytic intent, and respecting quality and rigor. They also point us in the direction of needed investments in resources and learning. These commitments are not, however, meant to imply that each person who practices data science individually carries out each activity. (See also section 5.3.) Let’s examine those boundaries and some disciplines that are related to, but distinct from, data science.
3.2 What data science is not
Data science overlaps other disciplines and scientific practices. Furthermore, as National Academies of Sciences, Engineering, and Medicine (2018) notes, the practice of data science necessarily crosses disciplines. How does data science relate to statistics and other modes of data analysis, to informatics, and to science in general? How should the practice of data science privilege science over technology and focus on meaning and rigor?
3.2.1 Data science is not statistics
Statistics is not the same as data science, though the two fields substantially overlap. Moreover, data science is not merely statistics dressed up with appealing marketing. I argue above that data science takes responsibility for the whole life cycle of data, connected centrally through analytic concerns. In this understanding, a statistician who limits their engagement solely to analysis and perhaps interpretation is not doing data science. A statistician who does analysis and engages the rest of the life cycle of data is doing data science—as is an epidemiologist, a sociologist, a microbiologist, or anyone else.
Donoho (2017) and Jones (2018) show that data science took shape as a discipline, in part, in reaction to the failure of academic statistics to focus sufficiently on pragmatic rather than theoretical concerns. This characterization cuts in 2 directions: While academic statistics might have shunned practical concerns, academic and applied statistics firmly root themselves in traditions of rigor and other scientific norms. On the other hand, machine learning and other analytic disciplines are not as firmly rooted. To be fair, academic machine learning—often located in computer science or information science departments—pays heed to rigorous mathematics, out-of-sample generalizability, and applied issues such as bias and fairness. But the traditions are not as deep, and the norms are not as strong.
Leo Breiman, the late UC Berkeley statistics faculty member and an early bridge-builder between statistical and machine-learning communities, said in 2001 that he might advise a young person, “Don’t go into statistics.” In the end, he would say, “Take statistics, but remember that the great adventure of statistics is in gathering and using data to solve interesting and important real world problems.” (Olshen 2001)
Andrew Gelman, Columbia faculty member and prolific blogger, wrote in 2013, “Statistics is the least important part of data science … Statistics is important—don’t get me wrong … But it’s not the most important part of data science, or even close.” (Gelman 2013)
3.2.2 Data science is not data analysis (not even machine learning)
The field of statistics connects disciplines and practices for constructing and probing models grounded in probability theory and inference. This characterization holds for frequentist, Bayesian, and other approaches, whether the probability model is highly specified (as with parametric models) or loosely specified (as with nonparametric models). Many other approaches to data analysis might have a probability component that is not of primary concern or might have no formal probability component at all.
Machine learning has been described as the answer to the question, “How can computers learn to solve problems without being explicitly programmed?” In practice, computers “learn” to solve problems by looking for mathematically representable patterns in data (such as clusters or topic models) or by constructing mathematical tools to guess an output, given a set of inputs, modeled on examples that associate known inputs with known outputs. In these senses, machine learning is data analysis. As a field and collection of methods, machine learning overlaps substantially with statistics, distinguished by its emphasis on finding patterns and making predictions rather than constructing models that directly represent data—even when a machine learning model is explicitly probability-based or a model’s performance is represented using concepts from probability.
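To make those two patterns concrete, here is a minimal illustrative sketch in Python (assuming the scikit-learn and NumPy libraries; the simulated data and variable names are mine, not drawn from any source cited here): one model searches unlabeled data for clusters, and another learns to guess outputs for new inputs from examples with known outputs.

```python
# Illustrative sketch only: unsupervised pattern-finding and supervised input-output learning.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unsupervised: look for a mathematically representable pattern (here, clusters) in unlabeled data.
unlabeled = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
cluster_labels = KMeans(n_clusters=2, random_state=0).fit_predict(unlabeled)

# Supervised: construct a tool that guesses an output for given inputs,
# modeled on examples that pair known inputs with known outputs.
inputs = rng.normal(size=(200, 3))
known_outputs = (inputs[:, 0] + inputs[:, 1] > 0).astype(int)
classifier = LogisticRegression().fit(inputs, known_outputs)
guessed_outputs = classifier.predict(rng.normal(size=(5, 3)))  # guesses for new, unseen inputs
```

In both cases the computer is doing data analysis: what it “learns” is a mathematical summary of the examples it was given.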
There is no bright line between statistics and machine learning, and many methods inhere in both disciplines. For example, classical statistics has traditions of cluster analysis, dimensionality reduction, and regularization, and machine learning uses Bayes’s theorem and logistic regression for binary classification tasks. Although machine learning is often associated with complex models based on vast amounts of data, statistical models can be complex, with many model parameters, or they can be based on large amounts of data. Conversely, machine learning models can be simple or based on small data. Since machine learning models tend to emphasize predictive performance (outputs given inputs) rather than internal model structure, however, the largest data-analytic models in practice tend to use machine learning methods. Some deep learning models have billions, or possibly trillions, of model parameters. I discuss machine learning, along with artificial intelligence, at greater length below in section 8.
Other modes of data analysis beyond statistics include causal inference, geospatial methods, econometric methods, and compartmental and agent-based modeling. Pearl (2009) explicitly characterizes causal inference as extrastatistical; without additional strong assumptions, no probability model can inherently represent causality. Structural equation models and inverse probability weighting can help to disentangle a causal signal from random noise, subject to those extrastatistical assumptions. Geospatial and econometric methods also often use probability components, for example, to accommodate correlations in space or time, but they wed those components to other concepts. Compartmental and agent-based modeling might or might not use empirical observations, but when they do, any probability components for solving or simulating systems also extend beyond strictly probability-based models.
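As one concrete illustration of how a probability component gets wedded to extrastatistical assumptions, the following hedged sketch applies inverse probability weighting to simulated data (again assuming NumPy and scikit-learn; the variable names and the simulated effect are illustrative only). The weighted contrast is ordinary arithmetic; reading it causally depends on assumptions, such as no unmeasured confounding, that the data alone cannot certify.

```python
# Illustrative sketch of inverse probability weighting (IPW) on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
confounder = rng.normal(size=n)                      # influences both treatment and outcome
p_treat = 1 / (1 + np.exp(-confounder))              # treatment more likely when the confounder is high
treated = rng.binomial(1, p_treat)
outcome = 2.0 * treated + 1.5 * confounder + rng.normal(size=n)  # simulated treatment effect of 2.0

# The naive contrast is confounded because treated units tend to have higher confounder values.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Estimate each unit's propensity score, then weight by the inverse probability
# of the treatment that unit actually received.
ps = LogisticRegression().fit(confounder.reshape(-1, 1), treated).predict_proba(confounder.reshape(-1, 1))[:, 1]
weights = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
ipw = (np.average(outcome[treated == 1], weights=weights[treated == 1])
       - np.average(outcome[treated == 0], weights=weights[treated == 0]))
# ipw should land near the simulated effect of 2.0, while naive will be biased upward;
# the causal interpretation holds only if the confounder captures all confounding.
```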
From the perspective of data science, the life cycle of data can center on any or all of these analytic disciplines, not just statistics. This perspective covers 2 of my 3 commitments of data science: the life cycle and the central concern of data analysis. The third commitment, to scientific norms and rigor, obtains when the practitioner acknowledges and respects the norms inherent to the various modes of analysis. For machine learning, for example, these norms include out-of-sample generalizability and model robustness and stability. Thus, not only is it wrong to characterize data science as an enhanced form of statistics, such a characterization risks failing to hold other analytic modes to similar expectations of rigor and norms.
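To show what one of those machine-learning norms looks like in practice, here is a minimal sketch (assuming scikit-learn; the data are simulated and the model choice is arbitrary) of holding a model to an out-of-sample standard: performance is scored on held-out folds the model never saw during fitting, and the spread across folds offers a crude check on stability.

```python
# Illustrative sketch: out-of-sample evaluation and a crude stability check via cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Five-fold cross-validation: each fold's score comes from observations held out of fitting.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"out-of-sample accuracy: mean={scores.mean():.2f}, spread={scores.std():.2f}")
```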
Much more could be written about whether machine learning or other modes of analysis reveal “meaning” in data, as some data science definitions seek to do. For now, I note that all such modes, including statistics, can be subjected to various methods for interpretation in terms of model structure and the relationship of models to input data. Furthermore, such interpretations and accompanying explanations warrant careful critical evaluation in view of a broad swath of scientific norms, not least because an apparent interpretation or explanation can itself be an illusion regardless of the method of analysis.
3.2.3 Data science is not informatics
Public health informatics is the systematic application of information and computer science and technology to public health practice, research, and learning. Informatics applies technology to obtain, store, and use information. Per Savel and Foldy (2012), it concerns “the how and why of technology and systems versus the common what and where of information technology … the integration and proper application of technology and systems to get data rather than just the technology and systems” (emphasis added) … “frequently the application of standards and structure that help with meaning before data science gets to it.” My colleague Brian Lee has said (personal communication), “Informatics is all the work of understanding and making data available to determine meaning.” Thus, we can see that informatics and data science overlap, especially regarding data wrangling, movement, accessibility, and scale, but the fields take different orientations: Data science seeks meaning from data, empowered by informatics to work with data. The disciplines in each field are important, and one can practice the collection of disciplines across those fields. One ought not, however, to conflate them or to treat one field as a subset of the other.
3.2.4 Data science is not just good science
Much of scientific practice in public health uses, or purports to use, data that come from observations about individual health status and other aspects of the world. When public health science uses data, it should conform to scientific norms and rigor, much as I have claimed data science should. Does it follow, then, that data science is just good science? The answer is no, both because data science inherits some commitments that do not apply broadly to the practice of science and because good science entails commitments that do not apply to data science.
By “good science” I mean, in brief, all those practices for building and organizing knowledge about the world through the methods and values of experiment and observation, neutrality, rigor, transparency, empiricism, reproducibility, minimizing subjective bias, and so on.
On the expectation that theoretical science need not use data at all, we can omit theoretical science and narrow our question to applied science. Even narrowed in this way, we can conclude that not all applied science uses data. While all applied science depends on observation and precedent, those contexts need not entail data in the sense of observations represented in a way that allows us to subject them to further analysis. For example, without implicating data, a scientist can classify an organism through observation or conduct a qualitative review of published literature to structure arguments and conclusions about the state of knowledge in a specified domain. Next, not all applied science that uses data uses it well. While we could argue that applied science that uses data poorly is not “good science”, we should also acknowledge that applied science can consist of good and bad components in which the poor use of data does not undermine the entire project. Finally, and pivotally, not all applied science that uses data well also takes responsibility for the integrity of the life cycle of data and for connecting that data life cycle to how questions are posed, data obtained, analysis performed, results placed in context, methods and results presented, and the whole process preserved. Just as a statistician can conduct a rigorous data analysis without connecting that analysis to the life cycle of data, any other scientist can engage in portions of the scientific method, including portions of the life cycle of data, without having applied data science. Data science entails taking responsibility for the integrity of the life cycle of data—across the core activities of data science—in a way that does not apply to the full breadth of “good science”. Without question, doing data science well overlaps with doing good science, but it is important not to conflate them.
The life cycle of data is consistent with, but not synonymous with, the scientific method. This distinction between “good science” and data science matters because the distinction informs how we do data science, which in turn differs in emphasis and kind from how we do good science. In particular, many technical and nontechnical skills that support the practice of data science, especially the skills for locating data analysis as a central focus in the life cycle of data, do not generalize or scale to the wholesale conduct of good science.