Is this really a causal relationship? Here's an example of questionable statistical language in the news: a Brazilian breast-feeding study.
Another one (definitely no link to the actual study in here):
http://www.mirror.co.uk/news/uk-news/breastfed-children-earn-more-money-5358181
"Breastfeeding generally was found to increase adult intelligence, length of schooling and adult earnings."
From Andrew Gelman, some "important methods and concepts related to statistics that are not as well known as they should be."
I was looking into the use of Stan for Hamiltonian Monte Carlo. On page 23 of the Stan reference (stan-reference-2.9.0.pdf), I found this excellent and brief summary of HMC:

"HMC accelerates both convergence to the stationary distribution and subsequent parameter exploration by using the gradient of the log probability function. The unknown quantity vector θ is interpreted as the position of a fictional particle. Each iteration generates a random momentum and simulates the path of the particle with potential energy determined [by] the (negative) log probability function. Hamilton's decomposition shows that the gradient of this potential determines change in momentum and the momentum determines the change in position. These continuous changes over time are approximated using the leapfrog algorithm, which breaks the time into discrete steps which are easily simulated. A Metropolis reject step is then applied to correct for any simulation error and ensure detailed balance of the resulting Markov chain transitions (Metropolis et al., 1953; Hastings, 1970)."

Immediately after that, the tuning parameters are discussed:

"Basic Euclidean Hamiltonian Monte Carlo involves three 'tuning' parameters to which its behavior is quite sensitive. Stan's samplers allow these parameters to be set by hand or set automatically without user intervention."

http://www.math.uah.edu/stat/dist/Density.html (and related pages)
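To make the leapfrog-plus-Metropolis description above concrete, here is a minimal sketch of a single HMC transition in R. This is not Stan's implementation (Stan actually uses NUTS with adaptive tuning and is quite sensitive to these settings, as the quote says); log_p, grad_log_p, eps, and L are hypothetical names standing in for the target log density, its gradient, the step size, and the number of leapfrog steps.

    # One HMC transition: random momentum, leapfrog simulation of the
    # particle's path, then a Metropolis accept/reject step.
    # Assumes a unit (identity) mass matrix; eps and L are the other
    # two tuning parameters the Stan manual refers to.
    hmc_step <- function(theta, log_p, grad_log_p, eps, L) {
      p <- rnorm(length(theta))          # fresh random momentum each iteration
      theta_new <- theta
      # Leapfrog: half step for momentum, alternating full steps, half step.
      # Potential energy is -log_p, so dp/dt = +grad_log_p(theta).
      p_new <- p + 0.5 * eps * grad_log_p(theta_new)
      for (l in seq_len(L)) {
        theta_new <- theta_new + eps * p_new       # full step for position
        if (l < L) p_new <- p_new + eps * grad_log_p(theta_new)
      }
      p_new <- p_new + 0.5 * eps * grad_log_p(theta_new)
      # Metropolis correction for discretization error:
      # Hamiltonian H = -log_p(theta) + |p|^2 / 2 (potential + kinetic).
      h_old <- -log_p(theta) + sum(p^2) / 2
      h_new <- -log_p(theta_new) + sum(p_new^2) / 2
      if (log(runif(1)) < h_old - h_new) theta_new else theta
    }

    # Toy usage: sample from a standard bivariate normal.
    log_p      <- function(theta) -sum(theta^2) / 2
    grad_log_p <- function(theta) -theta
    draws <- matrix(0, nrow = 1000, ncol = 2)
    for (i in 2:1000) {
      draws[i, ] <- hmc_step(draws[i - 1, ], log_p, grad_log_p,
                             eps = 0.2, L = 10)
    }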
Part 1 and Part 2. Also another view here. Another PPT, from Ingmar Schuster (Universität Leipzig), appears to be very good (attached and viewable below).
Below is from Wikipedia (here), on regressions:
In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals. Given an unobservable function that relates the independent variable to the dependent variable – say, a line – the deviations of the dependent variable observations from this function are the unobservable errors. If one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals.

However, a terminological difference arises in the expression mean squared error (MSE). The mean squared error of a regression is a number computed from the sum of squares of the computed residuals, and not of the unobservable errors. If that sum of squares is divided by n, the number of observations, the result is the mean of the squared residuals. Since this is a biased estimate of the variance of the unobserved errors, the bias is removed by multiplying the mean of the squared residuals by n / df where df is the number of degrees of freedom (n minus the number of parameters being estimated). This latter formula serves as an unbiased estimate of the variance of the unobserved errors, and is called the mean squared error.[1]

However, because of the behavior of the process of regression, the distributions of residuals at different data points (of the input variable) may vary even if the errors themselves are identically distributed. Concretely, in a linear regression where the errors are identically distributed, the variability of residuals of inputs in the middle of the domain will be higher than the variability of residuals at the ends of the domain: linear regressions fit endpoints better than the middle. This is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence.

Thus to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals, which is called studentizing. This is particularly important in the case of detecting outliers: a large residual may be expected in the middle of the domain, but considered an outlier at the end of the domain.
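As a quick illustration of both points (the n / df correction in the MSE, and studentizing residuals before comparing them), here is a small sketch in R using only base functions; the simulated data and variable names are my own.

    # Simulated data where the true errors are known to be i.i.d. normal.
    set.seed(1)
    x <- 1:20
    y <- 2 + 0.5 * x + rnorm(20)
    fit <- lm(y ~ x)

    # MSE: sum of squared residuals divided by n - p (degrees of freedom),
    # not by n. This matches the squared residual standard error.
    n <- length(y); p <- 2
    mse <- sum(residuals(fit)^2) / (n - p)
    all.equal(mse, summary(fit)$sigma^2)   # TRUE

    # Leverage is highest at the endpoints, so raw residuals there have
    # lower variance even though the errors are identically distributed.
    # rstandard() divides each residual by its estimated standard
    # deviation, making residuals comparable across the domain.
    h <- hatvalues(fit)
    cbind(x, leverage = h, raw = residuals(fit), studentized = rstandard(fit))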
About: This blog is mainly for statistics, R, or Duke-related stuff that is not directly related to research activity.