5 Who
5.1 Who gets to do data science?
Everyone who wants to do good things with data should have the intellectual support to do so; in turn, they must proceed with rigor and stand behind their work. I first formulated this credo in 2016. I asked, “Who gets to do data science?” to express both empowerment (who has the ability) and privilege (who has the authority). I was proclaiming that one need not be an expert in statistics or computer science to perform well when working with data. Indeed, nonstatisticians can, and often do, perform great work with data.
I believe this credo because of my own experience mentoring, coaching, and advising learners through a variety of CDC or CDC-adjacent programs, including fellowship and student internship programs, with undergraduates, master's and doctoral students, and postgraduate learners. For the first few years in my CSELS ADDS role, I emphasized the learning-oriented, empowering component of data science: "everyone who wants to do good things with data". In my mentoring experience, a learner's specific analytic or technical background has been a poor predictor of how well they would do, especially in programs that don't recruit specifically for previous analytic or technical education. For example, I have worked with several physicians who had no specific statistics background, who went on to execute superb analyses, some even winning awards. In each case, they proceeded with rigor and stood behind their work. My role was merely that of mentor; they took up the challenges of doing data science. Conversely, some learners with apparent analytic background either shunned rigorous analysis or fumbled badly. CDC programs can select for prior technical or analytic experience, but I don't believe that CDC programs need to do so.
Who gets to do data science? I can restate the question echoing my credo as follows: Who wants to do good things with data, proceed with rigor, and stand behind their work? I can restate the question again, echoing my working definition of data science: Who will line up tools to ask and answer good questions rigorously using data? I have the same answer for all 3 versions of the question. Self-learning problem-solvers get to do data science: people who connect a love of knowledge to self-learning and solving problems, people who ask thoughtful questions, pay close attention to details, honestly acknowledge what they don’t know, probe for deeper meaning, and persist in the face of obstacles. (See also Baehr (2013b).)
In 2016, I felt energized to tout such an empowering message focused on learning rather than specific disciplines. I slowly realized that this message was incomplete. While I situated self-learning problem-solvers in learning-oriented communities with mentors and advocates, I needed to say more about who learners, mentors, and advocates are, where specific technical and nontechnical skills fit in, and how that community operates beyond learning. So from mid-2016 through mid-2019, my primary formulation transmuted from "Who gets to do data science?" to the more expansive and inclusive "Who participates in a progressive culture for data?" I began acknowledging that nonexperts who do good things with data often need guidance from experts to empower those achievements. Furthermore, nonexperts and experts alike need support and other resources from other members of the progressive culture for data.
I also believe that CDC has a substantial, untapped well of potential among existing staff for doing good, and better, things with data. In other words, CDC could achieve, or make great strides toward, a progressive culture for data with the right attention and direction regarding learning, doing, staffing, and leading. I have not seen CDC as a whole make those moves, though pockets here and there show promise.
Who gets to do data science? Echoing NIST, the Office of Personnel Management (Reinhold 2019) says, “Practitioners with sufficient knowledge in the areas of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the data life cycle. Overlapping skills including data analysis, analytical applications, big data engineering, algorithms, domain expertise, statistics and machine learning. [They] use expertise in one or more of these domains to solve complex data problems.” This answer lacks poetry but specifies a few details in terms of knowledge, skills, and work activities. These details are useful for creating staffing strategies, occupational qualifications, and position descriptions.
Finally, who leads in a progressive culture for data? Everyone should get to.
5.2 Learning in a progressive culture for data
In 2018, I added the following to my growing collection of personal mottos: “Less training, more learning.” By this I meant that our culture should explicitly recognize and support personal initiative and self-direction as ways—in my view the most important ways—of gaining knowledge and experience for solving real problems, especially after entering the workforce. Community supports learning about data, and learning how to do data science, by centering on learners. Learning can follow formal curricula and be encouraged in structured programs, but substantial portions of learning occur in informal settings, as through reading, self-guided learning, and interaction with peers. In my experience, a culture that overemphasizes training risks undervaluing the full gamut of the ways that learners learn.
A focus on learning respects the agency and responsibility of the learner, who must take an active role not only in receiving instruction but in practicing and honing what is learned. A focus on learning further opens the way for various models and modes of learning, including self-teaching through reading, crafty on-line searching, independent tutorials, and experimentation. Self-guided and experiential learners need guidance from others, mentoring, and help identifying or filtering through material that can assist in learning. Learning concerns not only individual development but also better serving the shared mission, for example, by asking better questions, working better with data sources or structures, and communicating rigorously and clearly to a variety of audiences. Learning should not focus only on technical skills but also on the means to exercise data acumen, good sense, and judgment.
Beyond developing skills among current staff, we need an agency culture that provides intellectual support to everyone who wants to do good things with data, whether they already have the skills or not. For those who already have skills, it’s a matter of supporting good practice, supporting continuing development, and encouraging that they support others. For those who lack skills, it’s a matter of providing support that is more oriented to learning new skills or to adapting skills from other areas.
5.2.1 Relational learning
I start here with the relationship between learners—typically fellows, students, or early-career scientists—and mentors, since this relationship formed the initial impetus for my entire conception of a progressive culture for data.
Mentors create supportive conditions to guide other scientists to learn about data and to learn from data. A mentor is responsible for guiding the development of technical skills and encouraging a personalized direction for self-learning, sometimes as specific as skills for managing relational databases, creating graphics that smooth binary outcomes, modeling seasonality, or exploring categorical data using mosaic plots.
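To make one of those skills concrete, consider the graphic that smooths a binary outcome. The following is a minimal sketch in Python, with synthetic data and illustrative variable names of my own, of the kind of example a mentor might walk through with a learner:

```python
# A minimal sketch: graph a smoothed binary outcome against a
# continuous predictor using LOWESS. Data are synthetic and the
# variable names are illustrative only.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(42)
age = rng.uniform(20, 80, 500)                 # continuous predictor
risk = 1 / (1 + np.exp(-(age - 50) / 10))      # true underlying risk
outcome = rng.binomial(1, risk)                # observed binary outcome (0/1)

# LOWESS estimates the local mean of the 0/1 outcome, that is, the
# probability of the outcome as a smooth function of age.
smoothed = lowess(outcome, age, frac=0.3)      # returns sorted (x, yhat)

plt.scatter(age, outcome, s=8, alpha=0.3, label="observed (0/1)")
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red", label="LOWESS smooth")
plt.xlabel("age (years)")
plt.ylabel("estimated probability of outcome")
plt.legend()
plt.show()
```

The point of such an exercise is less the particular tool than the habit it models: explore the data visually, and honestly, before committing to a model.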
In my work with learners, experience has been more important for learning than specific subject-matter knowledge. I typically guide the learner through clear thinking and critical reflection more than through particular methods. Mentors model intellectual virtues. They show by example as they openly practice curiosity, courage, humility. Mentors create regular opportunities for the learner to practice intellectual virtues by stimulating curiosity, rewarding courage, and fostering humility. In some ways, these nontechnical skills are more fundamental to scientific practice than technical skills are: learning how to learn and, more than that, how to exercise judgment regarding level of effort, intensity of exploration, extent of experimentation, and making interesting mistakes. Practical wisdom matures in part through guidance and in part through reflecting on one's own lapses in intellectual virtue. A mentor can help the learner with both.
In the context of education, Baehr (2013b) explained that educating for intellectual character growth is personal, because it involves thinking of learners as persons whose basic beliefs, attitudes, and feelings about knowledge and learning also matter critically to the quality of their education; it is necessarily social or relational, because personal change and growth occur most readily in the context of trusting and caring relationships; and, it is reflective, because it involves reflecting on and discussing with learners the value of thinking and learning, regularly pausing to identify or reflect on the significance of what is being learned. The mentoring relationship responds to and nurtures a love of and interest in thinking, learning, answering good questions. “Intellectual virtues flow, not from a desire for praise or approval, but out of a genuine interest in thinking and learning.” (Baehr 2015)
As a mentor, I often explain my thought process to the learner, as messy and nonsensical as that thought process might be. Thinking out loud with a learner, usually when I have no idea whether I’m making sense, is real and honest. It is also an exercise in vulnerability. In turn, the mentor must earn the trust and respect of the learner, so that the learner is able to take risks to make interesting mistakes in their own thought processes.
CDC needs to find and support mentors in informal and formal ways. In the discussion below on staffing and capacity for data science, I propose some approaches through articulating competencies and accounting for performance.
In a robust culture supported by mentoring, the mentoring relationships go in several directions. I started this discussion focusing on the relationship of the mentor to the learner. In communities of peers, learners can mentor each other in the same personal, relational, and reflective ways as mentors do with learners. In addition, mentors could benefit from guidance and wisdom from experienced mentors, or what I call “meta-mentoring”. Finally, advocates influence, guide, and support the learning-oriented community.
Learners ask questions, solve problems, reflect critically on process, and improve their skills. They take responsibility for self-learning, seek mentorship, and practice technical skills and intellectual virtues. Learners take risks and make interesting mistakes, from which they learn how to exercise judgment. Learners use data responsibly to improve the world. Learners participate in community through fellowships and other learning programs, communities of practice, scientific workgroups, interest groups, and user groups. The Statistical and Machine Learning Community of Practice was a community for learning from data, an example of a grass-roots network of learners that brought together members of scientific workgroups, user groups, and other groups.
Advocates influence the practice and profession of data science, promote and reward commitments to data and to self-learning, remove needless barriers, support instructive failures and interesting mistakes, encourage those who practice data science, and uphold those who profess data science. Advocates include managers, decision-makers, associate directors, directors, and others.
5.2.2 Formal curricula and structured programs
A progressive culture for data can promote more systematic or standardized learning through formal curricula for technical skills and for nontechnical skills. Numerous vendors now offer courses on a wide variety of computational and data-analytic skills. CDC makes many of these available through OCIO, CDC University, programs like Advanced Molecular Detection, and sessions organized by CDC workgroups and user groups. In addition to the endless options for courses on technical skills, curricula are also available for nontechnical skills (e.g., Educating for Intellectual Virtues), though they are not as readily available.
CDC has a rich, long, and successful history of structured, experiential learning programs, including some for professionals outside of CDC. Until recently, these programs have not intentionally and explicitly addressed data science. The Epidemic Intelligence Service (EIS) program has addressed analytic skills in limited ways. Informatics and prevention effectiveness programs have specific, narrow technical areas of focus. Participants in these programs regularly benefit from communities of mentors and peers, but those have also not purposefully addressed data science.
Some recent developments demonstrate incremental shifts: The EIS program has piloted and continues to provide mentoring to EIS fellows for advanced analytic projects. The fellowship programs have also piloted efforts to create teams of fellows from different programs for intentionally interdisciplinary work. Recent modernization initiatives have sponsored a few fellows to focus on data science. Finally, the Data Science Upskilling (DSU) program launched in 2019 with about a dozen teams of incumbent federal staff and fellows, each focusing on a primary project and supported by on-demand online courses and cross-team activities on 5 components of data science: statistics, machine learning, computing, visualization, and ethics.
DSU allows federal staff and other learners to set aside time to go deep on data and on methods that are new or unfamiliar to them, as well as time for trial and error. The program is predicated on explicit organizational support, including from supervisors, as well as structured leadership and access to experts, learning resources, technical tools, and fellow learners. Participants refine existing skills or learn new ones in analytic methods and in software such as R, Python, and Power BI; importantly, they are not limited by their prior or primary occupational series or disciplines. They generally learn how to establish clearer boundaries on their motivating data science projects, develop workable (if preliminary) solutions, and establish a community of practice. Most of the methods and tools used in DSU are not new, though they might be unfamiliar or uncommon within CDC's general culture. The program brings both an overarching purpose and specific value to learners and their programs by focusing on specific, mission-oriented problems. Participants thus establish, expand, and apply methods, tools, and technology available for CDC to use rigorously. Furthermore, they enrich their own and their teams' ability to adapt to fast-changing contexts: newer questions, less familiar data sources, and less familiar methods and technology. This adaptability comes close to my vision for the motivation and disposition of a progressive culture for data.
5.3 Doing in a progressive culture for data
A progressive culture for data centers on learning in order to empower the practice and profession of data science. Community supports the practice and profession of data science by ensuring that everyone who wants to do good things with data has the resources to do so. On this account, I see 4 primary roles in that culture: learner-practitioners (which I also call “learners and doers”), expert practitioners, managers, and lay advocates. These roles can change over time and overlap with each other and with the roles that I articulated above as learning-oriented roles (mentor, learner, and advocate). The roles capture the essential distinctions for the primary practical needs of that culture for doing good things with data.
Learner-practitioners with basic or intermediate data skills come from any discipline, not just computer science or statistics, to do good things with data, mindful of the full life cycle of data.
Description: seek to do good things with data; come from any discipline, not just computer science or statistics.
Data-oriented skills (basic or intermediate): literate in data fundamentals, such as the design of data collection methods, data quality assurance, conventional flat (tabular and multidimensional) and relational data, and common analytic methods; interpret, communicate, and memorialize learning from data.
Goals and approach: achieve, or work toward, data proficiency, building on fundamentals to work rigorously with more complex data or methods; learn continuously and show how modern tools and methods solve modern problems; mindful of the full life cycle of data.
Expert practitioners achieve data mastery, go deep on data science methods, and provide the intellectual foundation for good practice.
Description: provide the intellectual foundation for doing good things with data, aiming for scientific quality and analytic rigor; master complex data structures or methods.
Data-oriented skills: literate in advanced, contemporary methods for complex data structures or methods, such as high-volume or high-velocity data, analysis of patterns and predictions as well as inferences, and visual and other methods; interpret, communicate, and memorialize learning from data.
Goals and approach: practice personal proficiency; ensure that everyone who wants to do good things with data, can; set norms for data-oriented practice and for learning from, about, and with data; enable, guide, correct, and empower practitioners to proceed with rigor and stand behind their work; mindful of the full life cycle of data.
Managers supervise learner-practitioners and experts to ensure that they have the resources and direction that they need to achieve good things with data, now and in the future.
Description: give learner-practitioners and experts resources and direction to do good things with data.
Data-oriented skills: data fluency, acumen, or proficiency; how to assess scientific quality and analytic rigor of data-oriented solutions; how to allocate investments in data-oriented learning and technology; how to allocate data-oriented assignments.
Goals and approach: foster and reward curiosity; invest in learning (not just training); encourage creativity and interesting mistakes; hold practitioners and experts to account for producing knowledge learned from, about, and with data; advocate for the means to enable practitioners and experts to continue increasing their capability, efficiency, and effectiveness.
Lay advocates work in community with practitioners, experts, and managers as persons literate in the value of data for helping to learn things about the world.
Description: support doing good things with data, in community with practitioners, experts, and managers.
Data-oriented skills: data fluency or acumen; how to assess basic quality of data-oriented solutions; how to allocate investments in data-oriented learning and technology.
Goals and approach: help ensure supportive resources to enable learning and achievement.
5.4 Staffing in a progressive culture for data
How does a progressive culture build and sustain the capacity to keep up with fast-moving methods, tools, and technology? How are people brought in, organized, and kept around?
Harvard Business Review headlined data scientist as “the sexiest job of the 21st Century” (Davenport and Patil 2012). Fast Company has called it one of the best 25 jobs in America (Dishman 2016).
5.4.1 Data science staff should be cultivated, hired, and outsourced
Amidst the mixture of excitement and marketing hype about data scientists, there’s a recurring question about whether data scientists are recruited and hired from the outside or cultivated from the inside.
Data scientists are hard to find and attract. … Data scientists are rare commodities. … What data scientists do—curate data, ask the right questions, build explanatory analytical models, implement the models into various applications—is simply not scaling at the pace of demand. (Millis 2015)
A prominent data scientist in Silicon Valley ... doesn’t hire on the basis of statistical or analytical capabilities. … [He] seeks both a skill set—a solid foundation in math, statistics, probability, and computer science—and certain habits of mind. He wants people with a feel for business issues and empathy for customers. Then, he says, he builds on all that with on-the-job training and an occasional course in a particular technology. (Davenport and Patil 2012)
I believe you indeed learn data science on the job. It is true that data scientists should know [some specific technical skills] … And self-learners can catch up quickly … But focusing only on people who call themselves data scientists is a mistake. (Van Cauwenberghe 2015)
In these 3 quotations, we sense that data scientists are hard to come by. Furthermore, since they need to keep up with fast-moving methods, tools, and technology, they need a firm foundation in technical and nontechnical skills as well as a disposition and self-sufficiency for continuous learning.
Federal workforce flexibilities afford a rich variety of staffing mechanisms and organizational options for achieving and sustaining an effective mix: career development among federal staff and other learners, recruiting new federal staff and learners, adding collaborators from academia and other partners, and acquiring data science services through contracts. This section provides a brief, opinionated summary narrowly focused on a few considerations. I organize the discussion around 6 broad, mutually exclusive segments:
Federal employees already on staff, including civil service and uniformed staff
To-be-recruited federal employees
Federal and nonfederal staff in learning programs, glossing over some nontrivial nuances distinguishing federal learners (e.g., some fellows hired under Title 42) from nonfederal learners (e.g., student interns)
Collaborators from academia or funded under a grant or cooperative agreement
Research and development contractors from federally funded research and development centers, university-affiliated research centers, and national laboratories
Commercial vendors
For ease of presentation in this section, I will sidestep some details that cannot be ignored in practice. For example, by learning programs, I mean staffing mechanisms such as fellowships, not coursework or programs like Data Science Upskilling. In addition, I include academic collaborators under the Intergovernmental Personnel Act or as Special Government Employees (SGEs) along with grantees, even though IPA funding is executed like a contract (acquisition) rather than a grant (assistance) and SGEs are technically civil federal employees.
Some of the material in this section corresponds to similar, more expansive discussions of the HHS Data Council’s Data-Oriented Workforce Subcommittee (Gehrke et al. 2021; Wagner 2022). The subcommittee’s reports present rich, thoughtful, comprehensive detail on staffing and organizing for data science in the federal workforce. While I provided some critical input to the subcommittee, the views that I present here are my own.
5.4.1.1 Federal employees
The federal government has been expanding options for classifying and developing federal employees to do data science. Historically, occupational series in science, technology, engineering, and mathematics (STEM) have represented narrow but workable disciplines, including engineering (0801), operations research (1515), mathematics (1520), statistics (1529 and 1530), computer science (1550), and to some extent information technology specialist (2210); some of these have been combined into interdisciplinary positions, such as health science and statistics (0601/1530). Other scientific or technical series in social and behavioral sciences (0101), microbiology (0400 group), and health sciences (0600 group) have been used for positions that focus on research or analysis. Around 2016, I wrote CDC’s first position description (in series 1530) that explicitly included machine learning.
In 2018, the Office of Personnel Management issued direct-hiring authority for STEM positions in economics, biology, engineering, physical sciences, and math fields. Then in 2019, OPM released guidance for adding parenthetical “(Data Scientist)” titling to several of these series (Reinhold 2019). Managers in the National Center for Injury Prevention and Control developed a set of standard “(Data Scientist)” position descriptions in several series and grades.
In 2019, CDC hosted a sequence of Future of Work (FoW) workshops to develop data science profiles. I appreciated the focused attention, but I perceived that the approach did not provide much latitude for existing federal staff who are experts in data science to influence the shape and direction of the effort. Data-oriented experts already in the workforce would have the direct experience to inform what is needed for doing good things with data—like existing supports and motivators (such as interesting problems and supervisory support) as well as persistent challenges (like barriers to nimbly using no-cost data science software). FoW's contract support staff could say something about the ways that industry improves its use of data, but they lacked first-hand awareness of CDC's own culture of working with data. I also perceived that the approach risked conflating informatics with data science rather than clarifying the distinctions between them. On the benefit side, FoW fleshed out the concept of data fluency as a minimum competency for much of the federal workforce. In my schema above for doing in a progressive culture, managers and laypersons would best support the culture by achieving at least data fluency.
Finally, in late 2021, OPM issued the new data science occupational series 1560. The accompanying flysheet substantially emphasizes the defining importance of the life cycle of data, but it covers job activities that are diffuse or ill-defined enough that it will take special care to use the series effectively. I would have preferred improving the way that federal agencies use existing series, including flexibility with titling and combining series, but the development deserves to be taken seriously. Thus, CDC has worked to develop qualifications, competencies, position descriptions, and other resources for recruiting and hiring data scientists.
Based on my experience with learners, user groups, and other early-career professionals at CDC, I believe that unrecognized and untapped potential already exists among incumbent federal staff and that CDC has so far failed to see and characterize this potential. To realize this latent capacity, we need to shift our thinking from traditional assessments of existing skills and traditional emphasis on training, to assessments of aptitudes and habits of mind and a radically different take on on-the-job learning that rewards self-learning and nurturing networks with peers and mentors. At least as important, we should be finding out from employees and learners with these experiences or interests what they need and want in order to do good things with data, rather than focusing narrowly, top-down, on what only managers perceive—especially managers unfamiliar with the motivations, commitments, and prospects of data science. It makes little sense to me to talk about recruitment and retention without examining what prospective or practicing data science practitioners want in order to join the workforce and stay in it.
Tapping this potential also calls for a culture shift among staff themselves who do or can do data science. While it can be important, for example, for a statistician to maintain the professional identity of their discipline, statisticians (and computer scientists and others) need to see themselves as part of, rather than separate from, intentional cross-disciplinary engagement.
Turning to hiring, CDC faces well-known challenges competing with other sectors. Given limited flexibility to enhance financial incentives for prospective hires, what nonfinancial incentives can CDC offer? Foremost, CDC's unique mission and public service already draw employees from many disciplines; that is, CDC appeals to many recruits' personal values. Second, if CDC cultivates a truly progressive culture for data—one that rewards a drive to learn as well as a drive to contribute—then CDC becomes that much more attractive to exactly the kind of people who can sustain and enrich that progressive culture. But the culture must be genuine, or else its attractiveness will fade.
Stepping back from the fine details of series and grade, whether federal data science staff are cultivated from within or hired from outside, the most important operational considerations pertain to competencies and performance. CDC needs practitioners who are able to do data science, whether as part or as all of their duties. As a side benefit of CDC efforts to flesh out series 1560, human resources staff have worked to develop a richly varied set of competencies, work activities, and proficiencies. Those supportive resources can and should shape other series beyond the new 1560. A 1530 statistician could adopt the more expansive data-analytic competency or the enriched competency for machine learning and artificial intelligence. And those same human resource concepts can and should be adapted into performance elements and statements, so that everyone who does data science can be accountable and rewarded for doing so. In addition to competencies and performance elements that arise from series 1560, CDC should also develop competencies for skills associated with intellectual virtues. Skills and competencies oriented to learning, practical judgment, and mentoring could also help to differentiate proficiency and grade within series and could (and should) apply to other scientific series.
5.4.1.2 Fellows and other learners
CDC manages or partners on dozens of structured learning programs on dozens of topics, open to persons with a variety of educational backgrounds. Fellows stimulate, and demonstrate CDC’s commitment to, a vital culture of learning. CDC sometimes hires fellows as federal staff, often as an intentional career path. Although CDC fellows contribute to CDC’s product, their primary purpose is to learn, not to augment staff.
Some CDC fellows focus largely on doing good things with data; many more get to do good things with data, whether data are their focus or not.
I believe that CDC should commit to helping CDC programs develop data science capacity through a focus on fellows, with the follow-on intention that other members of a learner's program unit can also develop their own data literacy or competency. Early drafts of the 2018 Public Health Data Strategy called for a ready response unit of data scientists who would work with CDC programs as needed. If such a unit were to be created, I recommended having it focus on working through fellows, such that the requesting CDC program would develop the capacity to address the data-oriented need rather than relying on outside staff to take care of it and move on. (As a side note, the Center for Forecasting and Outbreak Analytics largely goes in the opposite direction from my recommendation, investing substantial data science resources in that center rather than distributing them among other CDC programs.)
All these fellows need support from peers and mentors. To that end, CDC should foster mentoring as a supported competency, with accountability and reward through performance appraisal and other incentives. CDC should not, however, overinstitutionalize mentoring, because the role itself needs latitude and flexibility for fostering both technical and nontechnical skills.
5.4.1.3 Nonfederal collaborators
In addition to federal employees and learners, nonfederal collaborators serve some of CDC’s data science needs, through joint research or other projects with academic or public health partners, through research and development organizations, and through commercial vendors. CDC often engages with academic and public health collaborators through grants and cooperative agreements or through a so-called mobility agreement under the Intergovernmental Personnel Act (which is administered more like a contract than a grant). Research and development organizations include federally funded research and development centers (FFRDCs, such as those operated by the MITRE Corporation or the RAND Corporation), university-affiliated research centers (UARCs, such as the Georgia Tech Research Institute and the Applied Physics Laboratory at Johns Hopkins University), and national laboratories (such as Oak Ridge National Laboratory and Sandia National Laboratories). Finally, commercial vendors include a vast collection of entities that bid to sell proprietary services to CDC under the Federal Acquisition Regulation.
These outside contributors can especially help by filling in gaps in CDC’s own capacity for data science activities varying in discipline, skill, or scale that CDC can’t address on its own. It’s important for CDC, through advocates and managers, to strive toward building capacity among CDC’s federal staff and to avoid assuming that only outside collaborators can do a particular thing (such as some forms of text or image analysis). As I’ve argued elsewhere in this essay, CDC’s federal staff and learners likely have substantial, unrecognized capacity for extending CDC’s data science capabilities into unrealized directions. It would be a mistake to outsource based on a faulty assumption.
Many needs do exceed CDC’s current capacity. When it is necessary to turn to nonfederal collaborators, it becomes especially important to have enough expertise among CDC’s federal staff (or at a bare minimum among trusted nonfederal partners) to ensure that contributions from nonfederal collaborators meet the intended need. How do we know if we’re getting something useful, or what we need, from these collaborators? I have observed more than one project in which a nonfederal collaborator—sometimes academic, sometimes commercial—supplied a deliverable that the home CDC program was unable to evaluate. In those instances, greater data science expertise within the CDC program, or through another service or community within CDC, could have helped to ensure that the proposed deliverables would be worth the investment and that the actual deliverables met the need.
5.4.2 Data science staff should be organized to do data science
5.4.2.1 Organizing data science capacity
As described in previous sections, a progressive culture for data needs data science learner-practitioners (from a variety of disciplines), expert practitioners (specifically data science disciplines), managers, and lay advocates. A discussion that focuses only on experts is incomplete and short-sighted. Not everyone needs to be a data scientist to be empowered to do good or to be held to a high standard. And not everyone in the culture needs to be held to that standard: managers and lay advocates support the practice without necessarily practicing it themselves.
Should analysts or data scientists be integrated with staff from other disciplines or set apart? This question and the reality cut both ways: statisticians and other data staff are often set apart, and they often prefer it that way. In a post-Covid workplace configuration, the organizational question comes down to 2 main characteristics that we can think of as within and between: Should data staff be placed in units that are homogeneous or mixed with collaborating staff of other disciplines? How connected should data staff be across distinct units? I've seen some version of each configuration. The idea of grouping data scientists together seems like a wise way to manage limited resources, but in my experience, it fosters the notion that data scientists ought to be separate. During CDC's Futures Initiative in the early aughts, there was talk of putting all statisticians together in one center. Doing so would have made it harder for other scientists to work with statisticians and would have constrained the professional development of statisticians. I think that the most effective all-around configuration is to mix data scientists with other professionals so that there are other data scientists nearby and all data science practitioners in a division, say, regularly interact with each other to work through problems together and to learn.
5.4.2.2 Assessing data science capacity
CDC’s ability as an agency to do data science depends on all the cultural components that I have listed above: intentional cultivation of learners as well as constructive support and direction for data science practitioners and experts that not only respects but also appeals to their know-how and their drive—both their technical skill and their nontechnical skills. An assessment of data science capacity needs to include, and go beyond, characterizing the aggregate set of those technical and nontechnical skills. It is important also to discern from people who do good things with data what they need and what they want in order both to continue and to improve. Let’s break those ideas down by focusing on people who do data science (practitioners and experts) and people who directly empower, enable, or support them (managers). In the federal system, we need to distinguish federal staff, (nonfederal) learners, and other nonfederal staff. Finally, we want to characterize individual data science competencies as well as unit-level competencies at increasing levels of aggregation, such as teams, branches (collections of teams), and so on.
For staff who do data science, we want to know their proficiency and aptitude with technical skills in data analysis and computation as they apply across the life cycle of data. In my experience, the most effective way to discern technical and nontechnical skills and aptitude is for experts to see the skills in action, either prospectively or retrospectively: How well can the practitioner frame a problem? Work out what kind of data address the problem? Arrange, explore, and analyze the data using suitable tools? Correctly describe and critique the analysis? Demonstrate critical reflection throughout? Keep the activity directed toward the ultimate goal and deal with obstacles by acting on traits such as curiosity, attentiveness, perseverance, open-mindedness, and creativity? In addition to this broad set of competencies, we also want to know about particular strengths, for example, with programming in Python or deep learning or time series, as well as areas that warrant new learning in order to address intended data science tasks. No one staff person needs to master all the technical skills, but they should have sufficient acumen to discern where their skills apply and where they do not.
For staff who supervise data science practitioners or lead projects that apply data science, we need to assess and edify their data fluency, sufficient to guide and empower practitioners and experts. Data fluency includes the ability to understand the components of the life cycle of data, how those components relate to each other, the skills that each core activity calls for, and the intellectual traits and practices that support critical reflection and adaptation throughout the life cycle. Managers could be, but do not need to be, data science practitioners or experts. Where a manager lacks expertise, they will need the humility and wisdom to turn to experts. Furthermore, managers should demonstrate the skills needed to foster both learning and mentoring.
Some staff enable data science but do not practice it or manage those who practice it, such as information technologists or cloud engineers. For these staff, we also need to assess and edify their data fluency and their understanding of the life cycle of data, centered on analysis.
Data science is interdisciplinary. To ensure domain knowledge in addition to computational and data-analytic skills, we need to account for the combined set of skills and knowledge as groups of staff roll up into teams and other aggregated units. And we need to consider additional nontechnical skills for collaboration and negotiation. When considering a collection of staff and their joint mission, what are the specific strengths, weaknesses, and gaps in their collective ability to prepare, conduct, and communicate analysis? Do they have special strength or notable weakness in areas that could affect their ability to meet their mission, such as detailed knowledge of longitudinal claims data or time series methods? Assessment of larger units could especially call for evaluative expertise from outside the unit, as practitioners and managers might not be able to identify their own gaps.
A capacity assessment extends beyond individual and collective skills and traits, however. What do incumbent staff think that they need in order to do data science well and to keep doing it better? What do incumbent staff want in order to do data science well and to keep doing it better? Taking staff interests seriously can nurture morale and foster staff retention, but it also recognizes that staff are often the experts on supporting and bolstering their own capacity. Just as modernization should pay due heed to early-career professionals (the epitome of modern), and world-class analytics should pay due heed to data science practitioners and experts (the epitome of data-savvy), an assessment of data science capacity should pay due heed to the staff who actually do things with data. And yet these staff are often overlooked when they should be intentionally and directly engaged. Enlightened organizations often conduct exit interviews with departing staff, in part as an after-action analysis of the counterfactual: Now that you’re leaving, under what conditions might you have stayed? In a progressive culture for data, practicing staff are continuously seen as partners, or even experts, in knowing how their unit can best function to keep up with fast-moving methods, tools, and technology. The culture, through supervisors and other governance, should continually engage with data-oriented practitioners proactively throughout their tenure, to empower them, facilitate their ongoing achievement, assure forward-looking resources, direct their efforts, and hold them to account.
5.4.2.3 Shaping and developing data science capacity
Assessment lets us know where we are and a little bit about how prepared we are to move in the directions that we want to go. But how do we shape and develop that capacity to do good things with data? This section outlines a way to structure the mission and focus of data science practice using 3 organizing rubrics predicated on concepts presented earlier in this essay. Those rubrics then translate into organizing principles, which lead to specific practices.
The 3 rubrics encompass (1) the core activities of the practice of data science, (2) the prepositional calculus of learning through data, and (3) a primary but fluid commitment to specific topics and services within the unit’s mission.
Rubric 1: Core activities of the practice of data science. Data science intentionally connects all core data science activities across the life cycle of data, as explained above, together with critical reflection at each core activity.
Rubric 2: Modes of learning through data. We use data to learn about the world in at least 3 ways:
Learn about data, to understand the kinds of questions they might be used to answer.
Learn from data, in support of answering questions put to the data.
Learn with data, by using data to develop, explore, or evaluate methods.
This prepositional calculus distinguishes assessing quality and utility from making inferences, which are in turn distinguished from a focus on methods themselves for learning about or from data.
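A minimal sketch may make the distinctions concrete. The following Python fragment, with synthetic data and invented names (my illustration, not a prescribed workflow), touches each mode in turn:

```python
# Illustrative sketch of the 3 modes of learning through data.
# Data and names are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(18, 90, 1000),
    "visits": rng.poisson(2, 1000).astype(float),
})
df.loc[rng.choice(1000, 50, replace=False), "visits"] = np.nan  # missingness

# Learn ABOUT the data: assess quality and utility before asking questions.
print(df.isna().mean())    # completeness of each column
print(df.describe())       # ranges and plausibility of values

# Learn FROM the data: answer a question put to the data,
# e.g., do visit counts differ between older and younger persons?
df["older"] = df["age"] >= 65
print(df.groupby("older")["visits"].mean())

# Learn WITH the data: use the data to evaluate a method itself,
# e.g., how often would a 3-sigma outlier rule fire on resampled
# versions of these data (a rough false-alarm rate)?
values = df["visits"].dropna().to_numpy()
fires = 0
for _ in range(200):
    boot = rng.choice(values, size=values.size, replace=True)
    z = (boot - boot.mean()) / boot.std()
    fires += bool((np.abs(z) > 3).any())
print(fires / 200)
```

The same data appear in all 3 modes; what changes is the question being asked of them.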
Rubric 3: Topical goods and services. The third rubric distinguishes the goods or services delivered as a result of engaging with the life cycle of data. Under this rubric, technical assistance to collaborating partners is an essential service, as are developing methods for ensuring data validity, evaluating case definitions, and collaborative analysis of population health.
We translate these 3 rubrics into organizing principles by linking them to data science activities.
Rubric 1 → Principle 1. Link all data science activities to 1 or more of the core activities of the practice of data science, in the context and awareness of the other core activities. A team’s primary skills and products should be organized around performing these core activities, with explicit notice of the scientific or business question of interest, the source(s) and transformation of data, and so on.
Rubric 2 → Principle 2. Link all data science activities to 1 or more of the learning prepositions. Is the purpose of a given activity to understand the structure and attributes of some data source (learn about), to make claims about the world such as trends in asthma (learn from), or to get better at carrying out one or both of those purposes (learn with)?
Rubric 3 → Principle 3. Identify data science activities as subject-matter inquiry, service, or both. Establish the value and priority of each of these purposes.
Bringing it all together, data science capacity, and the skills to support and expand that capacity, should be linked directly to priority tasks and interests, which in turn are tied to core activities, “prepositional calculus”, and inquiry versus service.
Finally, put these principles in practice, as with the following examples:
Consider a team that focuses primarily on applying data science to the practice of syndromic surveillance. The team carries out tasks primarily related to data engineering (learning about data) and to supporting routine analysis of emergency department data (learning from data) for surveilling opioid overdose, hurricane-related morbidity, heat-related illness, Covid-19, and other conditions of public health interest. Although the team’s work covers all the core activities of data science, they focus primarily on the activities concerning obtaining, exploring, and analyzing data.
Example 1 (about). Among the list of proposed and actual activities directed to assessing or assuring data quality and utility, establish relative priorities and contingencies. These activities include at least the following:
Develop, automate, routinize, and integrate measures of the health of data feeds, which in turn include characteristics such as completeness, timeliness, conformance to standards, and fitness for purpose; products include regular reports, on-demand reports, and summary dashboards. (A minimal sketch of such measures follows this list.)
Assess value and limitations of data content, such as demographic data as received or as imputed.
Assess mechanics and quality of auxiliary data sources, including laboratory and vital records.
Assess mechanics and quality of using more than 1 data source for ecological analysis, such as merging at ZIP Code or county level, then aggregating post-fusion analytic results.
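As a hedged illustration of the feed-health measures named above, the sketch below computes simple completeness and timeliness measures for a hypothetical feed (Python; the field names and thresholds are invented for illustration, not an actual feed schema):

```python
# Minimal sketch: completeness and timeliness measures for a
# hypothetical data feed. Field names and thresholds are invented.
import pandas as pd

feed = pd.DataFrame({
    "visit_date":   pd.to_datetime(["2023-01-02", "2023-01-02", "2023-01-03"]),
    "arrived_date": pd.to_datetime(["2023-01-03", "2023-01-05", "2023-01-04"]),
    "chief_complaint": ["chest pain", None, "fever"],
    "zip_code": ["30301", "30302", None],
})

# Completeness: share of non-missing values in each required field.
required = ["chief_complaint", "zip_code"]
completeness = feed[required].notna().mean()
print(completeness)

# Timeliness: lag in days between the visit and the record's arrival.
lag_days = (feed["arrived_date"] - feed["visit_date"]).dt.days
print(lag_days.describe())

# A crude fitness-for-purpose flag combining both measures.
healthy = bool(completeness.min() >= 0.9 and lag_days.median() <= 2)
print("feed healthy:", healthy)
```

In practice, such measures would be automated, routinized, and integrated into the regular reports and dashboards listed above.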
Example 2 (from). Among the list of proposed and actual activities directed to addressing specific, descriptive public health inquiry, establish relative priorities and contingencies. These activities include at least the following:
Measure coverage and representativeness with methods and results that pass peer review.
Characterize persons included in data sources, by demographic and (inferred) clinical factors.
Develop and evaluate methods for monitoring specific conditions, integrating external knowledge of the epidemiology of those conditions, to detect temporal anomalies in a way that balances the utility of automated signals with the effort to attend to those signals (a minimal sketch of one such method follows this list).
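To make that last activity concrete, here is a minimal sketch of one simple approach, a count threshold over an exponentially weighted baseline, in which the smoothing weight and alert threshold are the knobs that trade signal utility against the effort of attending to signals (Python; the counts and parameter values are illustrative assumptions):

```python
# Minimal sketch: flag days whose counts sit far above an
# exponentially weighted baseline. Counts are synthetic; alpha and
# threshold are the knobs that trade sensitivity against alert burden.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(20, 120).astype(float)  # 120 days of daily counts
counts[90] = 45                              # injected spike to detect

alpha = 0.2       # EWMA weight: larger reacts faster but alerts more
threshold = 3.0   # alert when a count exceeds baseline by 3 SDs

baseline = counts[0]
alerts = []
for day in range(1, counts.size):
    sd = np.sqrt(baseline)                   # Poisson-style variability
    if counts[day] > baseline + threshold * sd:
        alerts.append(day)                   # signal before updating
    baseline = alpha * counts[day] + (1 - alpha) * baseline

print("alert days:", alerts)                 # expected to include day 90
```

Raising the threshold yields fewer, higher-value alerts; lowering it catches subtler aberrations at the cost of more signals to review.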
Example 3 (with). Among the list of proposed and actual activities directed to developing, evaluating, or improving methods for learning about and from data, establish relative priorities and contingencies. These activities include at least the following:
Document methods for processing and learning about data sufficient to motivate independent re-implementation, in the interest of transparency, reproducibility, and intellectual credit.
Advance the ability to process and use data from multiple sources.
Advance the ability to develop data queries with a focus on conditions of interest, going beyond matching substrings by including machine-assisted record retrieval and semantic analysis (a minimal retrieval sketch follows this list).
Advance the ability to incorporate temporal and spatial information for detecting anomalies, focused on specific conditions and jurisdictions of interest.
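As one hedged illustration of going beyond substring matching, the sketch below ranks free-text records against a query using TF-IDF over character n-grams, which tolerates the misspellings that an exact substring test would miss (Python with scikit-learn; the records and query are invented):

```python
# Minimal sketch: rank free-text records against a query with TF-IDF
# over character n-grams, so near-miss spellings still match.
# Records and the query are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "pt c/o chest pian and sob",             # note misspelled "pain"
    "fever and cough x3 days",
    "crushing chest pain radiating to arm",
    "ankle injury after fall",
]
query = ["chest pain"]

# Character n-grams (3-5 characters, word-boundary aware) capture
# partial and misspelled matches that an exact substring test misses.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
record_matrix = vectorizer.fit_transform(records)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, record_matrix).ravel()
for score, record in sorted(zip(scores, records), reverse=True):
    print(f"{score:.2f}  {record}")
```

A production query system could go further, for example with semantic embeddings, but even this small step retrieves the misspelled "chest pian" record that a literal substring search would not.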
This framework for shaping and developing data science capacity does not independently invoke learning, because learning is an essential defining characteristic of a progressive culture for data. Rather, this framework orients learning toward the rubrics, principles, and practices of doing data science, both to prepare to do data science and to do it. In a truly progressive culture, learning that is not specifically oriented to a product or service can still serve an essential good, because it prepares the mind to see possibility and, one hopes, to keep up with it. Louis Pasteur said, “In the field of observation, chance favors only the prepared mind.” (Pasteur and Vallery-Radot)
5.5 Leading in a progressive culture for data
Data science practitioners and experts—learners and doers—should lead CDC and ATSDR into the modern era through learning and advocacy. In a progressive culture for data, leadership is part of the practice of data science, and not separate from it. Leaders include practitioners, experts, managers, and laypersons, regardless of their career stage, job title or series, credential, or location in the hierarchy. Practitioners and experts must play a prominent, visible role in creating and leading a progressive culture for data in public health. As professionals who engage directly with data, they should ensure that the agency adopts, masters, and promotes an appropriately diverse set of tools and mindsets for using data to solve problems by showing how to learn from data and with data and by empowering others to do the same. Leadership continually shapes and sustains the culture of good data practice. As a community, they need to advocate to ensure that their interests and needs are folded into modernization initiatives as the agency becomes better tuned to meeting a modern mission. If learners and doers see leadership as separate from the practice of data science, then they risk leaving themselves out.
They should invest in and take pride in personal technical excellence in doing good things with data—to construct, analyze, and interpret models of public health or administrative outcomes. To lead, though, they need to go further than technical excellence.
They should emphasize learning from data—unlocking meaning through analysis. They need to be as practical and solutions-oriented as public health is. And they need to be rigorous, to ensure that all data-analytic practices hold up to scrutiny, even when there’s honest disagreement about methods or conclusions.
They should be principled pluralists on methodology. They will encounter misapprehensions about imputation methods ("making up data"), Bayesian methods ("too subjective"), and machine learning ("black box", "data dredging"). But all these methods and more can help us learn from data, if those tools are used wisely and well. This is largely what Leo Breiman was saying in 2001 (Breiman 2001).
And they should promote and praise good data-analytic practice, regardless of job title, credentials, or occupational series. Everyone who wants to do good things with data should have the intellectual support to do so, as long as they proceed with rigor and stand behind their work. This is as true for sociologists and microbiologists as it is for epidemiologists and statisticians.
They should provide leadership on how to integrate data science into interdisciplinary efforts and put data science on equal footing with other specialties. They need to be able to serve as an integral part of a team with collaborators from other backgrounds or disciplines, to apply and translate rigorous data science concepts for the benefit of collaborating scientists, and to explore and respect the rigorous application of concepts from other domains as part of collaborative undertakings. They must learn and practice methods for interpreting complex concepts for nonspecialists, without unduly sacrificing rigor.
That said, experts in data science are the foundation for good practice by practitioners, helping them to use data-analytic tools wisely and well. In a progressive culture for data, leadership aims toward and flows from practical wisdom.
They should hold fast to solid norms in how they learn from data as a basis for high-consequence decisions. The Covid-19 pandemic has been a time of high pressure, fast movement, substantial uncertainty, intense collaboration, and rapid turnover. It can be tempting, under these circumstances, to cut corners on rigor—to try to get it done faster but to make concessions on quality. The opposite is needed: during times like these, as with Ebola and other high-consequence events, integrity is as important as ever. Data science practitioners with varied expertise have shown that they can achieve both high speed and high quality.
They should lead from every level. Front-line analysts lead by showing how modern tools and methods help solve modern problems. Team leaders and branch chiefs lead by fostering and rewarding curiosity, investing in learning (and not just training), and encouraging the interesting mistakes that come with innovation. Division and center leaders and associate directors help ensure that our infrastructure—our people, processes, and technology—can support modern and future tools and methods. In all of this, those who specialize in methodology and analysis lead from wherever they are, so that everyone who wants to do good things with data, can.