Chapter 4 Env Data Analysis 2017-01-17

4.1 About the Class

This class is about environmental data and what, specifically, we want to do with these kinds of data. We’ll consider water, air, climate, and soils, but we can also think about ecology and other similar domains. People are typically interested in the following types of problems:

  • Summaries: reducing data to measures or tendencies that capture essential features
  • Comparing two or more samples: if you’re an environmental regulator, do the data that have been collected show that someone is in compliance with a given standard? What is the associated uncertainty?
  • Trends: has the water quality in a body of water gotten better or worse over time? What about the standard deviation or the 99th percentile of water quality?
  • Time series properties: what structure in a time sequence improves our ability to predict the next value?
  • Geostatistics: when you want to start mining, the cost of building a mine is several billion dollars. How can we interpret measurements with spatial clustering and correlation?
  • Extremes: using extreme value theory to understand the statistical properties of tsunamis, massive releases of pollution from a mine, floods, or droughts
  • Missing data: accounting for missing data, imputation, data that is not missing at random

We model the underlying process, not the data. But we seek to understand the process by looking at the data.

4.2 Some Review Material

In this class we will go beyond proving statistical theorems to get practice applying useful statistical tools to answer interesting environmental questions.

4.2.1 What’s the Right Distribution?

  • Boxplots of streamflow and rainfall seasonality
  • Histograms: flows and transformations of flow.
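  • Probability: under the frequentist definition, the probability of an event is its relative frequency over many repeated observations. It can be easily estimated from a histogram or directly from data.
  • Bias-variance tradeoff: when choosing a histogram bin width, more (narrower) bins give finer resolution but noisier estimates, while fewer (wider) bins are smoother but blur real features. The same tradeoff arises in line fitting.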
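The relative-frequency idea and the bin-width tradeoff above can be sketched in a few lines of Python. This is only an illustration: the `flows` values are synthetic draws from a lognormal distribution (an assumption, not class data), and the threshold of 40 and the helper names `relative_frequency` and `histogram` are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic "daily flow" values (assumed lognormal shape); not real data.
flows = [random.lognormvariate(3.0, 0.8) for _ in range(500)]

def relative_frequency(data, threshold):
    """Fraction of observations above `threshold`: an estimate of P(X > threshold)."""
    return sum(x > threshold for x in data) / len(data)

def histogram(data, n_bins):
    """Return (bin_edges, relative_frequencies) for equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so the maximum value falls in the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, [c / len(data) for c in counts]

p_exceed = relative_frequency(flows, 40.0)
print(f"Estimated P(flow > 40) = {p_exceed:.3f}")

# Few bins: smooth but coarse. Many bins: detailed but noisy, with empty bins.
for n_bins in (5, 20, 100):
    _, freqs = histogram(flows, n_bins)
    empty = sum(f == 0 for f in freqs)
    print(f"{n_bins:3d} bins: {empty} empty bins, largest bin frequency {max(freqs):.3f}")
```

With many bins, each bin holds only a handful of observations, so the bar heights jump around from sample to sample; with few bins the heights are stable but the shape of the distribution is smeared out.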

How can we express frequencies? PDF vs CDF.

A random variable \(X\) is a variable whose outcomes are governed by the laws of chance.
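To make the CDF side of the PDF-versus-CDF comparison concrete, here is a minimal sketch of an empirical CDF, which estimates \(F(x) = P(X \le x)\) as the fraction of observations at or below \(x\). The sample values and the function name `empirical_cdf` are made up for illustration.

```python
from bisect import bisect_right

def empirical_cdf(data):
    """Return a function F where F(x) is the fraction of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F

sample = [2.1, 0.4, 3.3, 1.7, 0.9, 2.8, 1.2, 4.0]  # made-up values
F = empirical_cdf(sample)

print(F(0.0), F(2.0), F(4.0))
# The probability of an interval is a difference of CDF values.
print(F(3.0) - F(1.0))
```

Unlike a histogram, the empirical CDF needs no bin width, which is one reason it is often a convenient way to express frequencies.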

4.2.2 Moments of Random Variables

Population versus sample:

  • mean
  • expectation
  • variance (the sample variance divides by \(N-1\), not \(N\)!)
  • skewness

Other measures of central tendency and spread:

  • median as an alternative to the mean
  • IQR as an alternative to \(\sigma\) or the variance
  • However, these robust measures downweight large values. If the sample contains genuinely large values that matter, ignoring them can also be non-ideal
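As a rough illustration of the measures above, the sketch below computes the sample moments and their robust alternatives with Python’s standard library, then repeats the calculation after adding one large value. The data are invented, and the skewness formula used here (the \(1/N\) central-moment ratio \(m_3 / m_2^{3/2}\)) is one common convention among several.

```python
import statistics

def sample_skewness(data):
    """Moment-based skewness m3 / m2**1.5, using 1/N central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def iqr(data):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

def summarize(data):
    return {
        "mean": statistics.mean(data),
        "variance_N_minus_1": statistics.variance(data),  # sample variance
        "variance_N": statistics.pvariance(data),         # population formula
        "skewness": sample_skewness(data),
        "median": statistics.median(data),
        "iqr": iqr(data),
    }

base = [3.1, 2.7, 3.4, 2.9, 3.0, 3.3, 2.8, 3.2]  # made-up concentrations
with_extreme = base + [30.0]                      # one large value added

for label, data in (("base", base), ("with extreme", with_extreme)):
    print(label, {k: round(v, 3) for k, v in summarize(data).items()})
```

The mean and variance move substantially when the large value is added, while the median barely changes. Whether that insensitivity is a feature or a problem depends on whether the large value is an error or a real event.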

4.2.3 Types of PDF

  • Uniform
  • Binomial
  • Poisson
  • Geometric
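For reference, the standard forms of these four distributions are written out below. The uniform is continuous and has a density; the other three are discrete, so strictly they have probability mass functions. The parameter symbols \(a, b, n, p, \lambda\) are notational assumptions, and the geometric distribution is given in the "number of trials until the first success" convention (some texts count failures instead, starting at \(k = 0\)).

```latex
\begin{align*}
\text{Uniform}(a, b):      \quad & f(x) = \frac{1}{b - a}, \qquad a \le x \le b \\
\text{Binomial}(n, p):     \quad & P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \qquad k = 0, 1, \dots, n \\
\text{Poisson}(\lambda):   \quad & P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots \\
\text{Geometric}(p):       \quad & P(X = k) = (1 - p)^{k - 1} p, \qquad k = 1, 2, 3, \dots
\end{align*}
```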

4.2.4 Fitting Parameters for Distributions

  • Moments
  • Maximum Likelihood
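The two fitting approaches can agree or disagree depending on the distribution. The sketch below, which is an illustration and not the class’s own worked example, uses two textbook cases: for a Poisson distribution both methods give \(\hat{\lambda}\) equal to the sample mean, while for a Uniform\((0, b)\) distribution the method of moments gives \(\hat{b} = 2\bar{x}\) and maximum likelihood gives \(\hat{b} = \max_i x_i\). The data, `true_b`, and the function names are assumptions made for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def poisson_fit(counts):
    """Return (method-of-moments, maximum-likelihood) estimates of lambda."""
    lam_moments = statistics.mean(counts)  # match E[X] = lambda to the sample mean
    lam_mle = sum(counts) / len(counts)    # maximizer of the Poisson log-likelihood
    return lam_moments, lam_mle

def uniform_fit(data):
    """Return (method-of-moments, maximum-likelihood) estimates of b in Uniform(0, b)."""
    b_moments = 2 * statistics.mean(data)  # match E[X] = b / 2 to the sample mean
    # The likelihood b**(-n) decreases in b, so it is maximized at the
    # smallest b consistent with the data: the sample maximum.
    b_mle = max(data)
    return b_moments, b_mle

counts = [2, 0, 3, 1, 4, 2, 1, 0, 2, 3]  # made-up event counts
true_b = 10.0                             # assumed true upper bound
data = [random.uniform(0, true_b) for _ in range(50)]

print("Poisson lambda (moments, MLE):", poisson_fit(counts))
print("Uniform b (moments, MLE):", uniform_fit(data))
```

In the uniform case the maximum likelihood estimate can never exceed the true \(b\), so it is biased low, whereas the moment estimate can land on either side and can even fall below the largest observation.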

To cover: sufficient statistics and how these estimates are similar or different. Homework: go through the rest of the slides from class.