7 I get to do data science

Who gets to do data science? I do!

7.1 How I think about data science

In January 2015, I started in CSELS as CDC’s first, and (for at least 7 years) only, Associate Director for Data Science (ADDS). NIH had an ADDS by that time, and other CDC centers have had informatics or statistics leads, and some now have data science leads. But the title ADDS remained unique within CDC. I stepped into the role about 15 years after becoming a CDC employee and about 30 years after I first started working with data, statistics, and computing. In the role of CSELS ADDS, I tried to make sense and nonsense of the term “data science”, thinking through what data science is and is not, why it matters for CDC, and most importantly how CDC can do public health better by doing data better. Early on, my favorite framing became not the definitional “What is data science?” but the cultural “Who gets to do data science?”

My personal values motivated me to pursue technical excellence and to offer those skills in public service. I entered public health because I wanted to use rigorous methods to help better the human condition. I am a methodologist, trained in math, statistics, some philosophy, and a smattering of other disciplines. I like to figure out how one can measure and count what is observed and quantify uncertainty about what remains beyond direct observation. Our culture perpetuates the notion that, with the inevitable march of dispassionate science, humanity will take the upper hand against what threatens or saddens us. How does a scientist resolve the apparent discordance between values and the cultural myth of dispassionate objectivity? We start by acknowledging the tools and power of scientific inquiry, and we respect their role in how we develop knowledge about the world. Values motivate and shape scientific endeavors, and passion itself can fuel scientific pursuits. None of us should shrink from or apologize for our commitment to the mission of public health. Our stories from data necessarily express perspectives and values; we have to commit to portraying and defending those worldviews.

My mentoring experience is the single greatest influence on how I think and talk about the values that can motivate and shape data science, as well as the skills that undergird data science practice. Since early 2000, I’ve had the pleasure of mentoring dozens of early-career scientists—post-doctoral fellows in the EIS program, Prevention Effectiveness, Public Health Informatics, and the Oak Ridge Institute for Science and Education (ORISE); undergraduate, master’s, and doctoral students; all budding scholars and professionals in public health, medicine, philosophy, physics, mathematics. Into each of these relationships, I have poured a bit of myself and my respect for managing data, for coaxing meaning from data, for delighting in discovery from data, and for sharing stories from data with colleagues. Every one of these mentoring relationships has changed me as a data scientist and reinforced my belief that learners believe.

I came to believe 2 things about the practice and profession of data science in my time at CDC: (1) Everyone who practices data science should have the intellectual support to do so rigorously, whether a statistician, epidemiologist, philosopher, or some other brand of scholar. Rigorous practice entails standing behind your methods and conclusions, which can be an intimidating duty when reaching beyond your expertise. (2) Everyone who commits to doing data science as a profession should accept the intellectual responsibility to contribute their expertise, both collaborating with and leading scientists from other disciplines. We have to commit ourselves to communicating clearly with those who share our specialty and with those who don’t but who respect the ways that our specialty bolsters and advances public health research and practice.

7.2 My personal history with data science

I’ve been practicing or professing data science one way or another for a long time, typically under the title statistician or mathematical statistician or methodologist. When I was 5 years old in first grade, I thought that I might want to be a mathematician (or an artist or a basketball player). In sixth grade, I got to play with an Apple II, with its BASIC programming and 5.25-inch floppy diskettes. As an undergraduate, I helped teach the obscure but powerful programming language APL (literally “A Programming Language”) to high school students. I earned a Bachelor of Arts degree in mathematics in 1991 and a Doctor of Philosophy degree in statistics in 1997.

In late 1997, about 2 months after filing my dissertation, I started with CDC as a contractor in the Division of Reproductive Health (DRH). I became a federal employee in early 2000, continuing to work in DRH, mostly on cohorts and clinical trials, until late 2004. Then I spent about 3 years overseeing CDC’s institutional review boards and thinking about the connection between how we justify research risk and how we learn from data.

I served from mid-2007 through early 2015 in the Division of Tuberculosis Elimination (DTBE), working largely on clinical trials and creative but rigorous ways to get better at finding TB to stop TB. While in DTBE, I became a self-appointed evangelist for the R statistical computing environment. In 2012, I articulated a vision for leadership in mathematical sciences, which included skills specifically in technical excellence and clear communication—the seeds of my belief in leadership in a progressive culture for data. In March 2014, I spoke on “A Scrappy Little Division That Cares a Lot About Data: A Vision for Data Sciences in DTBE”. That presentation included my first use and definition of the phrase “data science[s]”, with particular attention to “data science tasks: end to end”, later called the data life cycle.

In August 2015, shortly after I started as CSELS ADDS, I brainstormed dozens of potential topics for an internal CDC blog to explain and promote data science, called “expression(data, science)”. I wrote the blog title as if it were in a fictitious programming language, monospace font and all. The blog never really happened. This essay revives many of the topics that I had brainstormed.

In November and December 2015, I presented “The Art of Data Science: The Intense Pragmatism of Data in the Service of Public Health” at the EIS fall course. I told the story of data science through real-life experiences of 9 EIS fellows whom I had mentored. Although none of those fellows had had a background specifically focused on data analysis, many of them achieved great things with data, and others made interesting mistakes worth learning from.

In May 2016, 14 months into my tenure as CSELS ADDS, I delivered a CSELS science seminar entitled “Data Science and Data Wisdom: Cultivating a Data Generation” (a pun on “generating data”; maybe I should let up on the puns), in which I laid out the cultural components to support the practice of data science. That presentation has evolved over the past few years to become “Who Gets to do Data Science? A Progressive Culture for Data in Public Health”, placing data science in the context of related but distinct disciplines and emphasizing who does data science more than what data science is.

In my latter days in CSELS, I turned my attention largely to machine learning (ML) and artificial intelligence (AI). We should regard ML as extending the set of data-analytic tools available to us, and we should use those tools where they help us learn things about the world, including potentially better ways to do public health surveillance and to adapt flexible and powerful ways to find disease and improve health. ML, and its scalable applications through AI, should not be mysterious or intimidating, and these tools should not be regarded as any more magical than familiar methods. Moreover, they should be subject to the same critical reflection and norms as other methods for scientific inquiry. Current agency discussion about the potential for ML and AI risks focusing too narrowly on technology and not enough on learning from data in ways that aim for scientific quality and rigor, as discussed at length in sections 3.1 and 3.2 above: from posing questions of social value and scientific validity through ensuring that conclusions are traceable and defensible, that reasoning is coherent, and that the whole process is neutral, subject to minimal subjective bias, rigorous, transparent, reproducible, and so on. I expand further on ML and AI in section 8.

7.3 Why a progressive culture?

By the end of 2016, when I had been the CSELS ADDS for almost 2 years, CDC’s Surveillance Strategy had successfully led to demonstrable improvements in technology for mortality records, case-based surveillance, syndromic surveillance, and laboratory-based surveillance. Early formative efforts for a fledgling Public Health Data Strategy in 2018 tapped dozens of midcareer and senior leaders to shape next-phase modernization. From those early days, I lodged 2 substantive concerns: (1) Regarding data, staff who work directly with public health data should join in co-leading the nascent data strategy because they know first-hand the challenges that they have in getting things done with data. (2) Regarding modernization, early-career staff should also join in co-leading the modernization effort, because they are more likely to have an essentially modern take on data and progress than mid- and late-career leaders alone. In August 2018, I nominated a “data science breakfast club” (an interdisciplinary collection of a dozen early-career data science practitioners) to discuss their experiences doing data science at CDC and how CDC could effectively develop a data science-savvy workforce. By early 2019, neither concern, about data leadership or about modernization leadership, had gained appreciable traction, and the breakfast club never convened. I was told that the “movement” was open to early-career and data-involved staff, but it became clear that the movement would not engage them intentionally on their terms, for their co-leadership.

The strategy also struggled to describe why it was important to do data better. The emerging federal data strategy acknowledged the importance of data as an asset. And both the federal level and agency level connected that asset to informed decision-making and action. But neither the federal level nor the agency level explicitly articulated in what sense data were an asset and in what way data could inform decisions and action. I developed and shared a metaphor that we were conceiving of data as a treasure, and we were coming to acknowledge that we were largely hoarding that treasure, as if in a cave. Like the treasure in a cave, data are an asset because they have value, but we realize that value only when we use or spend rather than store the data. Data have value because we use data to learn things about the world and to do things with what we learn, including but not limited to making decisions and taking action. Data have value because we use them to build things, like artificial intelligence tools, that promise to help us interact with the world more efficiently and effectively.

I became disheartened by what I perceived as regressively narrow thinking and behavior. Nonetheless, I still believed that the intentions of the emerging strategy were largely sound. So I inverted my pessimism and asked, “What would it take for CDC to be progressive?” About 3 months later, I had drafted a manifesto for a progressive culture for data in public health, appended to this essay as Appendix A.

In 2019, the Public Health Data Strategy merged with the concurrently emerging Information Technology Modernization Strategy to become the Public Health Modernization Initiative and eventually the Data Modernization Initiative (DMI). In August 2019, as the Surveillance Data Platform was wrapping up its work, a presentation on the merged modernization initiative enumerated 5 pillars for a modernization strategy. Despite the strategy’s stated intent to develop “world-class analytics”, nothing in the pillars addressed the role of analysis. When I pointed this out, I was told that it was implicit in all 5 pillars. A value that is not explicitly stated risks getting ignored. With DMI came initial political success in the form of $50 million in appropriation to seed the effort, fleeting moments before the Covid-19 pandemic pushed modernization efforts and funding into overdrive. DMI and concurrent investments have prepared the public health sector to advance more rapidly in response to the pandemic than ever expected. Nonetheless, DMI remained slow to engage data practitioners and early-career professionals as leaders in this data revolution. Thus, the manifesto still complements DMI as a vision for realizing the value of data as an asset in a culture centered on the roles of data practitioners and experts, learners and doers.

That manifesto is now an organizing principle for this essay. In the manifesto, I state, “a progressive culture remains rooted in history and continues to learn from old data in new ways.” This collection is itself rooted in a personal history.
