2 Why
The purpose of data science is broadly to bring together, in a rigorous way, all that goes into doing good things with data: learning from data and building things with data to put those learnings to use, for example, in safeguarding public health. CDC consumes a lot of data to support its public health mission. Traditional sources include surveillance, vital records, surveys, program evaluation, studies of health services, and clinical trials. More recent sources include billing and claims data, electronic health records, social media, and sensor data. Over time, the scope and scale of those data have expanded, from small, structured data to high-volume, unstructured data, and the data have become more complex.
Why should we focus on data science? Because we need to keep up with rapidly changing methods, tools, and technology for extracting meaning from data.
Data science makes available tools for classic problems, such as working with data through the life cycle from problem formulation to collection to data management, through analysis, interpretation, and presentation. Contemporary issues in data science arise from movements to be open and to expand the scale of data and the sophistication of analytic methods. These issues include making data available to wide audiences while ensuring adequate protection of individual privacy; describing practices to make analyses reproducible, or at least traceable; working with high-volume data, such as genomics, and high-velocity, real-time data, as found in syndromic surveillance and claims data feeds; and making sense of unstructured text, images, and other nontraditional data types. Most contemporary problems differ from classic problems in scale rather than kind. For example, whether administrative data come from paper-based registers in resource-constrained settings or from massive stores of insurance claims data, they pose the same problems for inferring causes. Data science promotes principled use of the full breadth of methods, from the familiar to the unfamiliar, along with the norms to ensure that methods and results stand up to scrutiny. Data science crosses disciplines.
There’s another reason to focus on data science: Data science affords a measure of autonomy for practicing and honing data skills, because of ready access to open methods, tools, and technology. Some technical skills require special equipment (like growth media or microscopes in a microbiology laboratory) or access to humans (like clinical medicine or behavioral counseling). In contrast, to learn about data and from data, it is often enough to have data in hand, widely available software, and the persistence to jump into a problem and break it open. Software is now often freely available, with growing contributions by the very active R and Python user communities. So data science can be practiced with a great deal of self-determination. With that autonomy comes the latitude to own and direct one’s learning. Thus, the enterprising scientist can capitalize on that autonomy in order to keep up with fast-moving methods, tools, and technology, in part by continuing to learn how to learn from data. This autonomy presents a paradox that we will try to resolve later (in section 5.4.2): Data science is necessarily interdisciplinary, and not every practitioner needs to cross all the disciplines. So how can one be autonomous and team-oriented at the same time?
In summary, we focus on data science because we want to learn from data, learn about data, and learn with data.
Learning from data: Data have value because data help us learn things about the world. What we learn helps us to make informed choices about how we interact with the world, for example, through public health interventions.
Learning about data: Data come in many structures, sizes, shapes, and speeds, from small, flat data tables to massive, unstructured data streams. Data conform to a variety of standards, or no standards at all. The varied characteristics of data both enrich and constrain the ways that data reveal characteristics of the world.
Learning with data through its full life cycle: Analytic knowledge and skills allow us to pose rich questions about the world, amenable to rich methods; guide how we generate, transmit, obtain, and prepare data; probe data to answer questions about the world; place answers from data in context, mindful of assumptions and alternatives; present data-driven answers to audiences clearly and correctly; and preserve those answers and ensure that the entire life cycle is transparent, accessible, traceable and, to the extent possible, reproducible.
The field of data science addresses a wide variety of problems (what), and the practice of data science straddles autonomous and collaborative styles (how). Thus, we also focus on data science so that we can build and sustain a culture (who) for doing good things with data, for continuously learning things about the world, for empowering choices informed by those learnings, and for being ever ready to learn from and act on data.