Chapter 1 Statistical data
1.1 Observational studies and experiments
Where do data come from? In statistics, it is often important to distinguish two main ways of collecting data: observational studies and experiments.
Experiments occur when the researcher sets the values of certain variables (treatments) and measures the response to each treatment. Experiments make it easier to identify cause-and-effect relationships.
Observational studies occur when the researcher has no control over the variables but simply records reality. In this situation, it is much harder (and without strong additional assumptions, impossible) to establish causal relationships. If certain variables and potential responses are linked, we speak of correlation, association, or statistical dependence.
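The contrast between the two designs can be illustrated with a small simulation. The sketch below uses made-up data, not results from any real study. It assumes a hypothetical confounder called `health` that drives the outcome, while the treatment itself has no effect at all. The function names and parameters are illustrative only.

```python
import random
import statistics

def mean_difference(assign_treatment, n=20_000, seed=1):
    """Mean outcome of treated units minus mean outcome of untreated units.

    The treatment has NO true effect here; the outcome depends only on
    a hypothetical confounder called `health`.
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)             # confounder
        outcome = health + rng.gauss(0, 1)   # outcome ignores the treatment
        if assign_treatment(health, rng):
            treated.append(outcome)
        else:
            untreated.append(outcome)
    return statistics.mean(treated) - statistics.mean(untreated)

# Observational study: only people with above-average health take the treatment.
def self_selected(health, rng):
    return health > 0

# Experiment: the researcher assigns the treatment by a coin flip.
def randomized(health, rng):
    return rng.random() < 0.5

observational_diff = mean_difference(self_selected)
experimental_diff = mean_difference(randomized)
print(f"observational difference: {observational_diff:.2f}")
print(f"experimental difference:  {experimental_diff:.2f}")
```

In the self-selected version the treated group looks clearly better even though the treatment does nothing, because the groups differ in `health`. Under random assignment the difference stays close to zero.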
1.2 Population and sample
A population refers to the entire set of units, objects, or phenomena under study. It may be finite (e.g., all students in a given school) or very large, even theoretically infinite (e.g., all dice rolls, all potential patients with a given disease).
A sample is a subset of the population that is actually studied and on the basis of which we draw conclusions about the population.
Example:
Population: all residents of Poland.
Sample: 1,000 people randomly selected to take part in a survey.
In practice, we usually have a sample, even though the population is what we are theoretically interested in. Statistical inference refers to methods that allow us to draw conclusions about a population based on a sample.
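The relation between a population and a sample can be sketched in a few lines. The population below is simulated (hypothetical heights in cm), so its mean is known and can be compared with the sample mean. In real studies only the sample value is available.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: heights (cm) of 100,000 people.
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample of 1,000 units, drawn without replacement.
sample = rng.sample(population, 1_000)

population_mean = statistics.mean(population)   # usually unknown in practice
sample_mean = statistics.mean(sample)           # what we can actually compute

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means differ slightly. Statistical inference is about quantifying how large such differences are likely to be.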
1.3 Sources of statistical data
There are many ways to collect data, for example:
Survey/questionnaire data – information provided directly by the respondent.
Researcher-collected data – field notes, recordings, lab tests, clinical trials.
Automatically collected data – sensor data (e.g., Internet of Things, satellites, smart bands), transactional data (sales records, banking logs, online activity).
Textual and multimedia data – documents, social media, images, video, audio.
It is also important to note that primary data may later be processed and reused for different purposes. From this perspective, we can distinguish:
Primary data – collected and used for a specific purpose.
Secondary data – previously collected by others, then reused and processed (e.g., official statistics, company reports, administrative registers).
Sometimes tertiary data, i.e., compilations and syntheses (e.g., encyclopedias, meta-analyses), are mentioned as well.
1.4 Quantitative and qualitative variables
Two fundamental types of statistical variables are:
quantitative (numeric, metric) variables and
qualitative (categorical) variables.
Quantitative variables take numerical values. Mathematically, they can be further divided into discrete variables (taking distinct values, e.g., number of children) and continuous variables (taking any value within an interval, e.g., call duration, proportion of people wearing glasses).[1]
Qualitative variables take non-numeric values. They remain qualitative even if recorded with digits (e.g., dog breed is a qualitative variable even if we assign numbers to breeds).
1.5 Measurement scales
1.5.1 Four main scales of measurement
A useful classification of variables is based on measurement scales:
Qualitative variables use the nominal and ordinal scales.
Quantitative variables use the interval and ratio scales.
Nominal variables are categorical variables whose values have no inherent order (e.g., blood type, religion, dog breed).
Ordinal variables are categorical variables with a meaningful order (e.g., education level or Likert scale responses).
Interval variables are numeric variables where differences are meaningful, but ratios are not (e.g., Celsius temperature, calendar year of birth). It is often said that for these variables the zero point is set arbitrarily. An arbitrary zero is indeed a useful way to recognize a variable measured on an interval scale.
Ratio variables are numeric variables where both differences and ratios are meaningful (e.g., number of cars owned, price, height).
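The difference between the interval and ratio scales can be checked with a short calculation. The sketch below uses illustrative values only. It shows that a ratio of two Celsius temperatures changes after converting to Fahrenheit, while a ratio of two heights does not depend on the unit.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def cm_to_inches(cm):
    return cm / 2.54

# Interval scale: the ratio of two temperatures depends on the unit.
ratio_celsius = 20 / 10                                                     # 2.0
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)    # 68 / 50

# Ratio scale: the ratio of two heights is the same in any unit.
ratio_cm = 180 / 90                                                         # 2.0
ratio_inches = cm_to_inches(180) / cm_to_inches(90)                         # 2 up to rounding

print(ratio_celsius, ratio_fahrenheit)
print(ratio_cm, ratio_inches)
```

Saying that 20°C is "twice as warm" as 10°C is therefore not meaningful, whereas saying that 180 cm is twice 90 cm is.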
1.5.2 Why measurement scales matter
Measurement scales determine which statistical tools and measures can be applied. For example:
Mean, standard deviation, skewness → only for quantitative variables.
Coefficient of variation → meaningful only for ratio variables.
Median and other quantiles → for quantitative and ordinal variables.
Mode → can be determined for all types of variables, even nominal.
Pearson correlation → for quantitative variables (with binary variables as a special case).
Spearman rank correlation → for ordinal and quantitative variables.
Histogram → for grouped quantitative variables (though once grouped, the data become ordinal).
Scatterplot → for pairs of quantitative variables.
Regression explanatory variables → quantitative and binary variables.
Ranks → for quantitative and ordinal variables.
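Part of the list above can be turned into a simple lookup. The sketch below is one possible encoding, not a standard API. The `ALLOWED` table and the `summarize` function are hypothetical names, and ordinal values are assumed to be coded as integers that respect the category order.

```python
import statistics

# Which summaries are meaningful on which scale (following the list above).
ALLOWED = {
    "nominal":  {"mode"},
    "ordinal":  {"mode", "median"},
    "interval": {"mode", "median", "mean", "stdev"},
    "ratio":    {"mode", "median", "mean", "stdev", "cv"},
}

def summarize(values, scale):
    """Return only the summaries that are meaningful for the given scale."""
    allowed = ALLOWED[scale]
    result = {}
    if "mode" in allowed:
        result["mode"] = statistics.mode(values)
    if "median" in allowed:
        # For ordinal data, median_low always returns an observed category.
        median = statistics.median_low if scale == "ordinal" else statistics.median
        result["median"] = median(values)
    if "mean" in allowed:
        result["mean"] = statistics.mean(values)
    if "stdev" in allowed:
        result["stdev"] = statistics.stdev(values)
    if "cv" in allowed:
        # Coefficient of variation: standard deviation relative to the mean.
        result["cv"] = statistics.stdev(values) / statistics.mean(values)
    return result

blood_types = ["A", "O", "A", "B", "AB", "A"]   # nominal
education = [1, 2, 2, 3, 2, 3, 1]               # ordinal codes (1 < 2 < 3)
prices = [10.0, 12.0, 9.0, 12.0, 17.0]          # ratio

print(summarize(blood_types, "nominal"))
print(summarize(education, "ordinal"))
print(summarize(prices, "ratio"))
```

The function refuses nothing explicitly. It simply leaves out any summary that the scale does not support, so a mean of blood types is never computed.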
1.5.3 Other scales
Some variables do not fit neatly into the four basic categories. Two common special cases are:
Binary (dichotomous) variables: two possible values, often coded 0/1 (e.g., Yes/No). These are technically qualitative but often treated like quantitative variables.
Cyclic scales: values repeat in cycles (e.g., months of the year, days of the week, hours of the day, compass directions, angles).
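Cyclic scales need special handling because ordinary arithmetic ignores the wrap-around. A minimal sketch, assuming hours on a 24-hour clock: the values are mapped to angles on a unit circle, averaged as vectors, and mapped back. The function name `circular_mean_hours` is illustrative.

```python
import math

def circular_mean_hours(hours, period=24):
    """Mean of values on a cyclic scale, computed via unit-circle angles."""
    angles = [2 * math.pi * h / period for h in hours]
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    mean_angle = math.atan2(mean_sin, mean_cos)   # angle of the mean vector
    return (mean_angle * period / (2 * math.pi)) % period

late_evening = [22, 23, 1, 2]                          # hours around midnight
arithmetic = sum(late_evening) / len(late_evening)     # 12.0, i.e. noon
circular = circular_mean_hours(late_evening)           # close to midnight

print(arithmetic, circular)
```

The arithmetic mean points to noon, which is as far from the data as possible. The circular mean lands at midnight, up to rounding. The circular mean is undefined when the mean vector has zero length (for example, for the hours 0 and 12).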
1.6 Other data classifications
By time dimension:
Cross-sectional data – collected at one point in time (snapshot).
Time series data – repeated measurements at regular time intervals.
Panel/longitudinal data – tracking the same units over time (a mix of cross-sectional and time series).
By structure/format:
Structured data – organized in tables, relational databases.
Unstructured data – texts, images, audio, video, raw logs.
Semi-structured data – partially organized (e.g., JSON, XML, web logs).
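The step from semi-structured to structured data can be sketched as follows. The JSON records are made up for illustration, and the choice of columns is an assumption. Keys that a record lacks become missing values, and keys outside the chosen columns are dropped.

```python
import json

# Semi-structured input: records share some keys but not all (made-up data).
raw = '''
[
  {"id": 1, "name": "Anna", "age": 34},
  {"id": 2, "name": "Piotr", "tags": ["student"]},
  {"id": 3, "age": 51}
]
'''

records = json.loads(raw)

# Structured output: a fixed set of columns, with None for missing values.
columns = ["id", "name", "age"]
table = [[rec.get(col) for col in columns] for rec in records]

for row in table:
    print(row)
```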
By other features:
By granularity – individual-level vs. aggregated data.
By spatial dimension – with or without a spatial dimension (e.g., GIS data).
By domain – e.g., medical, financial, or general-purpose data.
1.7 Numbers
1.7.1 Names: short and long scale
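Be careful when translating the names of large numbers between languages (Polish, English, Ukrainian, etc.). Polish uses the long scale, while English and Ukrainian use the short scale, so similar-sounding names can denote different numbers. Even Google Translate may make mistakes.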
pol. miliard = eng. billion
pol. bilion = eng. trillion
pol. biliard = eng. quadrillion
pol. trylion = eng. quintillion
Number | Polish | English | Ukrainian |
---|---|---|---|
1 000 000 | milion | million | мільйон |
1 000 000 000 | miliard | billion | мільярд |
1 000 000 000 000 | bilion | trillion | трильйон |
1 000 000 000 000 000 | biliard | quadrillion | квадрильйон |
1 000 000 000 000 000 000 | trylion | quintillion | квінтильйон |
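One way to avoid such mistakes in code is to translate through the power of ten rather than through the spelling. The sketch below encodes the table above. The dictionary and function names are illustrative.

```python
# Names of powers of ten in Polish (long scale) and English (short scale),
# keyed by the exponent of 10. Values are taken from the table above.
POLISH = {6: "milion", 9: "miliard", 12: "bilion", 15: "biliard", 18: "trylion"}
ENGLISH = {6: "million", 9: "billion", 12: "trillion",
           15: "quadrillion", 18: "quintillion"}

def polish_to_english(name):
    """Translate a Polish large-number name via its exponent, not its spelling."""
    exponent = {v: k for k, v in POLISH.items()}[name]
    return ENGLISH[exponent]

print(polish_to_english("bilion"))    # trillion
print(polish_to_english("miliard"))   # billion
```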
1.7.2 Decimal symbol and thousands separator
Note that Polish uses a comma as the decimal separator, while English uses a dot. Conversely, English uses a comma to separate thousands, while Polish typically uses a space (less often a dot).
pol. 1 000,23 = eng. 1,000.23
pol. 1.000.000 = eng. 1,000,000
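When numbers arrive as text, the two conventions have to be converted explicitly. A minimal sketch, assuming that the input follows the Polish convention described above (spaces or dots for thousands, a comma for decimals). The function names are hypothetical.

```python
def parse_polish_number(text):
    """Convert a Polish-formatted number string to a float.

    Assumes spaces (including non-breaking ones) or dots separate thousands
    and a comma marks the decimal part.
    """
    cleaned = text
    for separator in (" ", "\u00a0", "\u202f", "."):
        cleaned = cleaned.replace(separator, "")
    return float(cleaned.replace(",", "."))

def format_english(value):
    """Format with a comma as thousands separator and a dot as decimal symbol."""
    return f"{value:,.2f}"

print(format_english(parse_polish_number("1 000,23")))    # 1,000.23
print(format_english(parse_polish_number("1.000.000")))   # 1,000,000.00
```

The parser is deliberately strict about its assumption. A string in the English convention such as "1,000.23" would be misread, so the source format must be known in advance.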
1.7.3 Scientific/engineering notation
1.23e8 means \(1.23 \cdot 10^8 = 123{,}000{,}000\).
1.23e-6 means \(1.23 \cdot 10^{-6} = 0.00000123\).
Such notation (with “e” or “E”) often appears in R or Excel. However, it should not be used in articles or theses. In academic writing, if needed, write the power of 10 explicitly, e.g., \(1.23 \cdot 10^8\).
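Software output in e-notation can be converted to explicit powers of 10 before it goes into a text. The helper below is a sketch. The name `to_power_of_ten` and the default of two decimal digits are assumptions, and the output is a LaTeX fragment.

```python
def to_power_of_ten(value, digits=2):
    """Render a number as a LaTeX product with a power of 10 instead of e-notation."""
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    return rf"{mantissa} \cdot 10^{{{int(exponent)}}}"

print(float("1.23e8"))            # 123000000.0
print(to_power_of_ten(1.23e8))    # 1.23 \cdot 10^{8}
print(to_power_of_ten(1.23e-6))   # 1.23 \cdot 10^{-6}
```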
1.7.4 Percentages
The percent sign (“%”) means “per hundred.” It is used (sometimes interchangeably with fractions) to express:
shares of a whole, frequencies,
comparisons, target completion levels,
relative growth, interest rates, returns, discounts, etc.,
probabilities (though fractions are often preferred here).
When working with percentages, it is crucial to clarify the base. For example: last year sales were $90M; the plan was $100M; actual sales were $96M. Did we achieve 96% of the plan (96/100, with the planned sales level as the base) or 60% (6/10, with the planned increase over last year as the base)?
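The two answers in the sales example come from the same figures divided by different bases. A short calculation, with illustrative variable names, makes this explicit:

```python
last_year = 90.0   # sales in $M
plan = 100.0
actual = 96.0

# Base 1: the planned sales level.
share_of_plan = 100 * actual / plan                                         # 96.0

# Base 2: the planned increase over last year.
share_of_planned_growth = 100 * (actual - last_year) / (plan - last_year)   # 60.0

print(share_of_plan, share_of_planned_growth)
```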
It is also important to distinguish between percent change and percentage points. If interest rates increase from 6% to 9%, that is a 50% increase, but an increase of 3 percentage points.
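The interest-rate example can be verified in the same way. The variable names are illustrative:

```python
old_rate = 6.0   # percent
new_rate = 9.0   # percent

change_in_points = new_rate - old_rate                     # 3.0 percentage points
relative_change = 100 * (new_rate - old_rate) / old_rate   # 50.0 percent

print(change_in_points, relative_change)
```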
Indices (e.g., real estate price index) are often presented as percentages but without the “%” sign.
1.8 Links
Exploration and visualization of categorical data – web app: https://istats.shinyapps.io/EDA_categorical/
Exploration and visualization of quantitative data – web app: https://istats.shinyapps.io/EDA_quantitative/
[1] This division is not always clear-cut. For example, income is technically discrete (values differ by a cent or penny), but it is often modeled as continuous for convenience.