Chapter 6 Statistical Inference

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.[1] Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.

Introduction

Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the data and (second) deducing propositions from the model.

Konishi & Kitagawa state, "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling".[2] Relatedly, Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".[3]

The conclusion of a statistical inference is a statistical proposition. Common forms of statistical proposition include a point estimate, an interval estimate (such as a confidence interval), and the rejection of a hypothesis.
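As an illustration of two such propositions, a point estimate and an interval estimate, here is a minimal Python sketch. The function name `mean_confidence_interval`, the made-up measurements, and the use of a normal approximation are all assumptions of this example rather than anything prescribed by the text.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, lower, upper) for the population mean.

    Uses a normal approximation, which is an assumption of this sketch
    and is most reasonable for larger samples.
    """
    n = len(sample)
    point_estimate = statistics.mean(sample)
    # Standard error of the mean, based on the sample standard deviation
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate, point_estimate - margin, point_estimate + margin


# Made-up measurements, for illustration only
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9, 5.2, 4.7]
estimate, lower, upper = mean_confidence_interval(sample)
print(f"point estimate: {estimate:.2f}")
print(f"95% interval:   ({lower:.2f}, {upper:.2f})")
```

The point estimate is a single best guess for the population mean, while the interval expresses how uncertain that guess is. With a sample this small, an interval based on the t distribution would usually be somewhat wider.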

Models and assumptions

Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference.[4] Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.[5]

A Primer

Inference is one of many possible goals in data analysis, so it’s worth discussing what exactly the act of making an inference is. Recall that one of the six types of questions you can ask in a data analysis, described previously, is an inferential question. So what is inference?

In general, the goal of inference is to be able to make a statement about something that is not observed, and ideally to be able to characterize any uncertainty you have about that statement. Inference is difficult because of the difference between what you are able to observe and what you ultimately want to know.

Identify the population

The language of inference can change depending on the application, but most commonly we refer to the things we cannot observe (but want to know about) as the population, or as features of the population, and to the data that we do observe as the sample. The goal is to use the sample to make a statement about the population. In order to do this, we need to specify a few things.

Identifying the population is the most important task. If you cannot coherently identify or describe the population, then you cannot make an inference. Just stop. Once you’ve figured out what the population is and what feature of the population you want to make a statement about (e.g. the mean), then you can later translate that into a more specific statement using a formal statistical model (covered later in this book).
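To make the vocabulary concrete, the following Python sketch simulates a hypothetical population, draws one sample from it, and compares the sample mean with the population mean. The population itself (normally distributed values with mean 50 and standard deviation 10), the sample size of 200, and the seed are invented purely for illustration. In practice the population mean would not be available for comparison.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values we pretend we cannot fully observe
population = [rng.gauss(50, 10) for _ in range(100_000)]
population_mean = statistics.mean(population)  # the feature we want to know

# The sample is the only data we actually get to see
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"population mean (normally unknown): {population_mean:.2f}")
print(f"sample mean (our estimate):         {sample_mean:.2f}")

# Different samples give different estimates; that spread is the uncertainty
estimates = [statistics.mean(rng.sample(population, 200)) for _ in range(500)]
print(f"spread of estimates across samples: {statistics.stdev(estimates):.2f}")
```

The last line shows why a statement about the population should come with a measure of uncertainty: each new sample would have produced a slightly different estimate.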

When inference is not needed

Before we delve into hypothesis testing, it’s good to remember that there are cases where you need not perform a rigorous statistical inference.

An important and time-saving skill is to ALWAYS do exploratory data analysis using dplyr and ggplot2 before thinking about running a hypothesis test.

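Consider comparing the air_time of flights to San Francisco with the air_time of flights to Boston. When the two groups are summarized or plotted side by side, there is no overlap at all between them, so we can conclude that the air_time for San Francisco flights is statistically greater (at any level of significance) than the air_time for Boston flights.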

This is a clear example of not needing to do anything more than some simple exploratory data analysis with descriptive statistics and data visualization to get an appropriate inferential conclusion. This is one reason why you should ALWAYS investigate the sample data first using dplyr and ggplot2 via exploratory data analysis.
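The kind of check described here can be sketched in a few lines. The text works with dplyr and ggplot2 in R; the version below uses plain Python instead, and the air_time values are made up for illustration, not taken from any real flights data. The helper names `summarize` and `ranges_overlap` are likewise hypothetical.

```python
import statistics

# Made-up air times in minutes, NOT taken from any real flights data
air_time = {
    "BOS": [36, 38, 39, 41, 42, 44, 47],
    "SFO": [322, 331, 338, 345, 352, 360, 371],
}


def summarize(values):
    """Return the minimum, median and maximum of a list of numbers."""
    return min(values), statistics.median(values), max(values)


def ranges_overlap(a, b):
    """True if the intervals [min(a), max(a)] and [min(b), max(b)] intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


# Descriptive summary for each destination
for destination, values in air_time.items():
    low, mid, high = summarize(values)
    print(f"{destination}: min={low}, median={mid}, max={high}")

no_overlap = not ranges_overlap(air_time["BOS"], air_time["SFO"])
print("no overlap between the groups:", no_overlap)
```

When the largest value in one group is still well below the smallest value in the other, as in this invented data, a formal test adds little to what the summary already shows.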

As you get more and more practice with hypothesis testing, you’ll be better able to determine in many cases whether or not the results will be statistically significant. There are circumstances where it is difficult to tell, but you should always try to make a guess FIRST about significance after you have completed your data exploration and before you actually begin the inferential techniques.
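One way to practice this habit is to write the guess down and then compare it with the result of a test. The sketch below does so with a permutation test for a difference in means, which is only one of several inferential techniques that could be used here. The function name `permutation_p_value`, the two made-up groups, and the 0.05 threshold are assumptions of this example.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the share of random relabelings whose absolute difference
    in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Made-up groups whose values overlap, so a glance does not settle the question
group_a = [12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5, 13.1]
group_b = [13.0, 14.1, 12.7, 14.8, 13.9, 14.4, 13.3, 15.0]

guess_significant = True  # written down BEFORE running the test
p_value = permutation_p_value(group_a, group_b)
print(f"p-value: {p_value:.3f}")
print("guess matched the test at the 0.05 level:",
      guess_significant == (p_value < 0.05))
```

Recording the guess before computing the p-value keeps the comparison honest, and over time it shows how well your intuition from exploratory analysis lines up with the formal results.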