[Online Review] Statistics

Overview of statistical inference. Topics include z-test, t-test, ANOVA & linear regression.


Unit 1 R

R vs. Python

| | R | Python |
| :--- | :--- | :--- |
| Index start | 1 | 0 |
| Number sequence | 0:5 gives 0,1,2,3,4,5 (both ends inclusive) | range(0, 5) gives [0,1,2,3,4] (start inclusive, end exclusive) |
| Negative index a[-1] | all elements except the first one | the last element |

Modes for Objects

Use mode() and class() to inspect an object's mode and class.

same mode

  • vector: c(1,2,3,4) seq(1,4, by=1) length()
  • matrix: matrix(1:4, 2, 2) cbind(), rbind() diag(myVec) dim()
  • array: array(1:18, dim=c(2,3,3)) dim()

different mode

  • dataframe: data.frame(myVecA, myVecB, myVecC) colnames(myDf) str(myDf)
  • list: list(myObjA, myObjB, myObjC) myList$myObjA length()
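
A minimal sketch of checking modes and classes for these object types (the object names are just for illustration):

    myVec  <- c(1, 2, 3, 4)                    # vector
    myMat  <- matrix(1:4, 2, 2)                # matrix
    myDf   <- data.frame(a = myVec, b = myVec) # data frame
    myList <- list(v = myVec, m = myMat)       # list

    mode(myVec);  class(myVec)    # "numeric", "numeric"
    mode(myMat);  class(myMat)    # "numeric", "matrix" (and "array" in R >= 4.0)
    mode(myDf);   class(myDf)     # "list",    "data.frame"
    mode(myList); class(myList)   # "list",    "list"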

Missing values: is.na()

Subset: a <- subset(acu_rct, is.na(pk1) | is.na(pk2) | is.na(pk5))

Read files

| Function | sep | header | stringsAsFactors (TRUE = Factor / FALSE = Char) | na.strings |
| :--- | :--- | :--- | :--- | :--- |
| read.table() | ' ' | FALSE | Factor | 'NA' |
| read.csv() | ',' | TRUE | Factor | 'NA' |
| read.xlsx() | - | colNames=TRUE | Char (no such parameter) | 'NA' |
| read_sas() | - | - | - | num: '.', char: ' ' |
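
For example, a small sketch of overriding these defaults with read.csv() (the file name and the extra empty-string missing-value code are hypothetical):

    MyData <- read.csv("mydata.csv",             # hypothetical file name
                       header = TRUE,
                       stringsAsFactors = FALSE, # keep character columns as characters
                       na.strings = c("NA", "")) # also treat empty strings as missing
    str(MyData)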

Histogram vs. Bar Plot

Histogram — for continuous data

  • the bin width must be chosen by the analyst

Bar Plot — for discrete data

  • each bar is a category

Explore Distributions in R

Distribution

  • Normal distribution: dnorm(x, mean=0, sd=1)
  • Exponential distribution: dexp(x, rate=1)
  • Uniform distribution: dunif(x, min=0, max=1)
  • Binomial distribution: dbinom(x, size, prob)
  • Poisson distribution: dpois(x, lambda)
  • T distribution: dt(x, df)

Prefix

  • Probability density function (PDF): dbinom(x, size, prob) (given a value x, returns its probability)
  • Cumulative distribution function (CDF, probabilities): pbinom(q, size, prob, lower.tail=TRUE) (given a quantile q, returns the cumulative probability)
    • lower.tail=TRUE: $P(X≤x)$
    • lower.tail=FALSE: $P(X>x)$
  • Percentiles from the distribution (quantiles): qbinom(p, size, prob, lower.tail=TRUE) (given a cumulative probability p, returns the corresponding X)
    • lower.tail=TRUE: returns the smallest $q$ such that $P(X≤q)≥p$
    • lower.tail=FALSE: returns the smallest $q$ such that $P(X>q)≤p$
  • Random number generation from the distribution: rbinom(n, size, prob) (generates random draws)
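
A minimal sketch of the four prefixes using the binomial distribution (size = 10 and prob = 0.5 are arbitrary example values):

    dbinom(3, size = 10, prob = 0.5)                      # P(X = 3)
    pbinom(3, size = 10, prob = 0.5)                      # P(X <= 3)
    pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE)  # P(X > 3)
    qbinom(0.95, size = 10, prob = 0.5)                   # smallest q with P(X <= q) >= 0.95
    rbinom(5, size = 10, prob = 0.5)                      # five random draws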

Unit 2 Statistical Inference (Z-test)

Interpretations

Hypothesis Testing

  • Goal: to determine whether a claim about a population parameter is supported by the sample data.
  • Conclusion: Reject $H_0$ / Fail to reject $H_0$

Interval Estimation

  • Goal: to determine a range of plausible values for a population parameter.

  • Conclusion: “We are 95% confident that the interval captures the true value of the population parameter.”

Concepts

Parameter: a population quantity, e.g. the mean or standard deviation

Statistic: a quantity computed from the sample, e.g. the Z-score (normal distribution), T-score (t-test), F-score (ANOVA)

Observation: a single value

Sample: a collection of observations

Hypothesis Testing
| Name | Notation | Example Value | Explanation |
| :--- | :---: | :--- | :--- |
| z-score | z | > +1.96 or < -1.96 | the number of $\sigma$ a value falls above or below the mean: $z = \frac{x - \mu}{\sigma}$. For the sampling distribution: $Z = \frac{\bar{x} - \mu_0}{SE}$ |
| p-value | p | < 0.05 | if the null hypothesis is true, the probability of observing a sample mean at least as extreme as $\bar{x}$ is p (the area beyond the corresponding z) |
| significance level | $\alpha$ | 0.05 | We evaluate the hypotheses by comparing the p-value to the significance level. If the p-value is less than the significance level (e.g. p-value = 0.007 < 0.05 = $\alpha$), we reject $H_0$ in favor of $H_A$; otherwise we fail to reject $H_0$. $\alpha = 0.05$ means we have a 5% chance of observing a sample mean from the sampling distribution that is far enough away from the claimed value in the null hypothesis to lead to it being rejected |

Decision rule: compare $z$ with $z_{thresh}$, or equivalently compare $p$ with $\alpha$.

Computing z amounts to standardizing $\bar{x}$ from $N(\mu_0, SE)$ onto the standard normal $N(0, 1)$; after standardization, the p-value can be read from the Z table.
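
A minimal sketch of this computation in R (the summary numbers are hypothetical placeholders):

    xbar <- 130; mu0 <- 125; sigma <- 15; n <- 36   # hypothetical summaries
    SE <- sigma / sqrt(n)
    z <- (xbar - mu0) / SE
    p_value <- 2 * pnorm(-abs(z))                   # two-sided p-value from N(0, 1)
    z; p_value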

Interval Estimation

  • point estimate of $\mu$: $\bar{x}$
  • standard error: $SE = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \simeq \frac{s}{\sqrt{n}}$
  • $z^*$ — corresponding to the 95% confidence level
  • confidence interval: $(\bar{x} - z^* SE, \ \bar{x} + z^* SE)$

| Name | Notation | Example Value | Explanation |
| :--- | :---: | :--- | :--- |
| confidence level | $(1-\alpha) \cdot 100\%$ | 95% | We are 95% confident that the interval captures the true value of the population parameter |
| multiplier | $C_\alpha$ or $z^*$ | 2 (1.96) | the interval spreads out about $2 \cdot SE$ from the point estimate |
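
A minimal sketch of building this interval, with hypothetical summary values:

    xbar <- 130; sigma <- 15; n <- 36               # hypothetical summaries
    SE <- sigma / sqrt(n)
    z_star <- qnorm(0.975)                          # ~1.96 for a 95% confidence level
    c(lower = xbar - z_star * SE, upper = xbar + z_star * SE)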

Relationships between these statistics

| Z-score | Hypothesis Testing | Interval Estimation |
| :--- | :--- | :--- |
| $z^* = 1.96$ | | confidence level 95% |
| $z = 0.43$ | compare with $z_{thresh}$ | |

Steps of Inference

testing -> estimation

First hypothesis testing (qualitative), then interval estimation (quantitative).

Once a significant signal is detected with a hypothesis test, the next step is often to report a point and interval estimate to understand that signal.
For example, suppose the hypothesis test found evidence that θ is bigger than 10. What is your next question? Ok – how much bigger than 10? The interval estimate would allow you to address that question. As such, estimation and hypothesis testing are often used in tandem when performing inference; they are two parts of the whole.

Statistical vs. Scientific Significance

Increasing the sample size makes even small numerical differences meet the statistical significance threshold when performing a hypothesis test.

Because of this, it is really important to always pair a hypothesis test with a confidence interval so the presence or lack of a signal in the test can be explained in the context of plausible values of the parameter. By examining the corresponding confidence interval, you will be able to determine if signal found in a hypothesis test is scientifically relevant.

statistical significance: relative significance

scientific significance: absolute (practical) significance

Unit 3 T-test, ANOVA

Determining which test to use

Pool: z-test, t-test, two sample t-test, paired t-test, ANOVA, rmANOVA, Linear Regression

| Predictor | Population parameter | Levels | Dependency | Test to be performed |
| :--- | :--- | :--- | :--- | :--- |
| None | $\sigma$ known | 1 | - | z-test |
| None | $\sigma$ unknown, only have $s$ | 1 | - | t-test |
| Discrete | | 2 | independent | Two sample t-test |
| Discrete | | 2 | dependent | Paired t-test |
| Discrete | | >=3 | independent | ANOVA |
| Discrete | | >=3 | dependent | rmANOVA |
| Continuous | | - | independent | Linear Regression |
| Continuous | | - | dependent | Multi-variable Linear Regression |

z-test vs. t-test

$\sigma$: population standard deviation (computed from all $x_i$, $i = 1, \ldots, N$)

$s$: sample standard deviation (computed from the sample $x_i$, $i = 1, \ldots, n$)

SE: standard error, the standard deviation of the sampling distribution of $\bar{x}$

  • $SE = \frac{\sigma}{\sqrt{n}}$ — perform a Z-test
  • $SE = \frac{s}{\sqrt{n}}$ — perform a T-test

Assumptions

z-test

  • Independence of observations: the sample is less than 10% of the population.
  • Large sample size: n ≥ 30 is a good rule of thumb; with a large enough sample it doesn't matter much what distribution the observations come from.
  • The population distribution is not strongly skewed.

t-test

  • Independence of observations. collect a simple random sample from less than 10% of the population.

  • Approximately normal. Observations come from a nearly normal distribution.

Two sample t-test

  • each sample meets the conditions for using the t-test
  • the two samples are independent of each other

Paired t-test

  • each sample meets the conditions for using the t-test
  • the two samples are dependent on each other (paired)

ANOVA

Assumptions

  • Independence. The observations are independent within and across groups.
  • Approximately normal. The data within each group are nearly normal.
  • Constant variance. The variability across the groups is about equal. Usually, the largest sample standard deviation should not be more than twice as large as the smallest across the groups.

rmANOVA

  • The observations within each group are independent.
  • The same subjects are measured within each group. (across group dependent) This assumption can be checked by looking for any missing values of the response variable across the K groups.
  • The response variable must be normally distributed within each group. This assumption can be checked by creating a normal probability plot of the response variable for each group.
  • The variance of the paired differences in the response variable for each pair of the K groups must be the same. The standard deviation needs to be same for all K(K-1)/2 pairs of K groups. This assumption can be checked by computing the sample standard deviation of the paired differences for each pair of the groups. Make sure that the largest standard deviation is not more than twice the smallest.

Testing

T-test: uses how different a sample mean is from a given value $\mu_0$ to decide whether the population mean $\mu$ is really equal to $\mu_0$.

for the sample we got:

  • sample mean: $\bar{x}$
  • sample std: s

for the sample distribution:

  • center: $μ_0$
  • standard error: $SE = \frac{s} {\sqrt{n}}$
  • degrees of freedom: $df = n - 1$

The sampling distribution describes all the sample means we might obtain from the population; our observed $\bar{x}$ falls at some point on this distribution.
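
A minimal sketch of this computation by hand (the sample summaries are hypothetical; the t.test() call in the R section below does the same thing):

    xbar <- 5.2; s <- 1.1; n <- 25; mu0 <- 5   # hypothetical sample summaries
    SE <- s / sqrt(n)
    t_score <- (xbar - mu0) / SE
    df <- n - 1
    p_value <- 2 * pt(-abs(t_score), df)       # two-sided p-value
    t_score; p_value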

Paired T-test: uses how different two dependent sample means are from each other to decide whether the two population means are equal (i.e., to decide whether $\mu_{diff}$ equals 0).

for the sample we got:

  • sample mean: $\bar{x}_{diff}$
  • sample std: $s_{diff} = \sqrt{\frac{\sum_{i}(x_{diff,i} - \bar{x}_{diff})^2}{n - 1}}$

for the sample distribution:

  • center: 0
  • Standard Error: $SE = \frac{s_{diff}}{\sqrt{n}}$
  • degrees of freedom: $df = n_{diff} - 1$

Two sample T-test: how different two independent sample means are from each other, to decide if these two population means are equal.

  • sample mean: $\bar{x}_A, \bar{x}_B$
  • sample std: $s_A, s_B$

for the sample distribution:

  • center: 0
  • Standard Error: $SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$
  • degrees of freedom: the conservative choice is $df = \min(n_A - 1,\ n_B - 1)$; the pooled t-test uses $df = n_A + n_B - 2$ (software usually uses the Welch approximation)
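
A minimal sketch with hypothetical group summaries, using the SE above and the conservative df:

    xbarA <- 10.2; sA <- 2.1; nA <- 30   # hypothetical group A summaries
    xbarB <-  9.1; sB <- 1.8; nB <- 28   # hypothetical group B summaries
    SE <- sqrt(sA^2 / nA + sB^2 / nB)
    t_score <- (xbarA - xbarB) / SE
    df <- min(nA - 1, nB - 1)            # conservative df
    2 * pt(-abs(t_score), df)            # two-sided p-value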

ANOVA: making inference about the population means from three or more independent groups and want to compare the population means across the groups.

$H_0: \mu_1 = \mu_2 = … = \mu_k$
$H_A:$ at least one $\mu_i$ is different

assume all samples have the same sample size n:

  • MSG: Between-Group Variability, the average squared deviation of each sample mean from the grand mean $\bar{x}_G$

  • MSE: Within-Group Variability, the variability of the observations within each sample group

  • F score: $F = \frac{\text{variability between means}}{\text{error}} = \frac{\text{Between-Group Variability}}{\text{Within-Group Variability}} = \frac{MSG}{MSE} = \frac{SSG/df_G}{SSE/df_E} = \frac{n\sum_{i=1}^{k}(\bar{x}_i - \bar{x}_G)^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2/(N-k)}$
  • degree of freedom
    • between group: $df_1 = df_G = k-1$
    • within group: $df_2 = df_E = N - k$

k: number of samples

n: sample size

N: total number of values from all samples. N = n*k

i: notation for each sample group

j: notation for each value in a sample group
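
A hand-computed sketch of MSG, MSE, and F for k = 3 equally sized groups (the data are made up; the aov() call in the R section below automates this):

    groups <- list(g1 = c(4, 5, 6, 5), g2 = c(7, 6, 8, 7), g3 = c(5, 4, 4, 5))  # made-up data
    k <- length(groups); n <- length(groups[[1]]); N <- n * k
    grand_mean <- mean(unlist(groups))
    group_means <- sapply(groups, mean)

    SSG <- n * sum((group_means - grand_mean)^2)                  # between-group
    SSE <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))  # within-group
    F_score <- (SSG / (k - 1)) / (SSE / (N - k))
    p_value <- pf(F_score, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
    F_score; p_value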

rmANOVA

Changes on df

  • $df_1 = df_G = k − 1$
  • $df_2 = df_E = N − k − (r − 1)$

r: number of replicates in each group (that is, the unit on which the repeated measurement is made)

Distributions & Tables

Normal distribution & Z table

row, col: z-score
inside: shaded area (p-value for negative z, 1 - p-value for positive z)


T distribution & T table

row: df=n-1, each df corresponds to a t-distribution

column: shaded area (equal to p-value)

Inside: t-score


F distribution & ANOVA table

Row: ${df}_G= k - 1$

Column: shaded area (p-value)

inside: f-score


After ANOVA, t-test will be used

After a significant ANOVA signal is detected, each possible pair of the K group means should be compared using a two sample t-test and/or confidence interval.

Under the rmANOVA design, the pair-wise group means should be compared using a paired t-test instead of a two sample t-test.

The Bonferroni correction should be used again to account for the multiple comparisons that will be made in the post-hoc analysis. When we run so many tests, the Type 1 Error rate increases. This issue is resolved by using a modified significance level:

$α^* = α/K$

K: the number of comparisons being considered. If there are k groups, then $K = k(k−1)/2$ (all pairwise comparisons).
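
A small sketch for k = 3 groups (this matches the choose(3, 2) call used with pairwiseCI() below):

    k <- 3
    K <- choose(k, 2)         # k(k-1)/2 = 3 pairwise comparisons
    alpha_star <- 0.05 / K    # modified significance level for each post-hoc test
    alpha_star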

Implementations in R

One Sample T-Test

  • test for normality:
    par(mfrow=c(1,1))
    qqPlot(Mydata$X)   # qqPlot() is in the car package

  • p-value (two sided): pt(-tscore, df) + (1 - pt(tscore, df)), i.e. 2 * pt(-abs(tscore), df)

  • t-test

    t.test(Mydata$X,
           alternative="two.sided",
           mu=0,
           conf.level = 0.95)
    • alternative: two.sided, greater, less
      In order to get a confidence interval using the t.test() function, the alternative argument must be set to “two.sided”. If the alternative argument is set to “greater” or “less”, the t.test() function will return a one-sided confidence bound (that is, either a lower or upper bound, respectively, on the population mean) instead of a confidence interval (that is, range of plausible values of the population mean).

Two Sample T-Test

t.test(Mydata1$X, MyData2$X,
       alternative="two.sided",   # two-sided
       mu=0,
       conf.level = 0.95)

Paired T-Test

t.test(Mydata1$X, MyData2$X,
       paired=TRUE,               # paired
       alternative="two.sided",
       mu=0,
       conf.level = 0.95)

ANOVA

  • assess the equal variance assumption: doBy::summaryBy()
    Here, the variable Y will be compared by the levels of variable Group from the MyData data set.
a_fit <- aov(Y ~ Group, data=MyData)
summary(a_fit)

Pair-wise confidence intervals can be constructed using the pairwiseCI() function in the pairwiseCI package.

no.tests <- choose(3, 2)
pairwiseCI(Y ~ Group, data=MyData,
           conf.level = 1 - (0.05/no.tests))

or

pw_tests <- pairwiseTest(Y ~ Group, data=MyData)
summary(pw_tests, p.adjust.method='bonf')
  • Boxplot for each group

boxplot(Y ~ Group, data=MyData)

Repeated Measures ANOVA
Before performing the repeated measures ANOVA analysis, you must check if your subject and group variables are stored as factor variables. If not, use the factor() function to convert the variables to factors.

a_fit <- aov(Y ~ Group + Error(Subject), data=MyData)
summary(a_fit)

The ANOVA table is split into two parts: the error due to variability within subjects, and the error that remains after taking out the within-subjects error.

Unit 4 Linear Regression

Concepts

Linear regression attempts to explain or predict how the mean value of the response variable changes with the value of a predictor variable.

Conditions

  • Linearity. The data should show a linear trend.
  • Nearly normal residuals. Generally the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points.
  • Constant variability. The variability of points around the least squares line remains roughly constant.
  • Independent observations. Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis. There are also other instances where correlations within the data are important.

Variables

X: predictor, explanatory, independent variable

Y: outcome, response, dependent variable

$y$: observed value [data]

$\hat{y}$: expected value (based on the line of best fit) [fit]

e: residual, $e = y - \hat{y}$ [residual]

Data = Fit + Residual

Least Squares Line
choose the line that minimizes the sum of the squared residuals: $\sum{(y_i - \hat{y}_i)^2}$

Coefficients

  • R: correlation coefficient.

Describes the strength of the linear relationship between two variables. We denote the correlation by R. Usually we compute R with software, but here is the definition equation: $R = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$

  • $R^2$: coefficient of determination.

describes the amount of variation in the response that is explained by the least squares line.

e.g. the $R^2$ value was again found to be 0.2486, demonstrating that using a student's family income to estimate their expected financial aid amount reduced the uncertainty in the estimate by explaining approximately 25% of the variability in the response. However, 75% of the variability in the response is left unexplained by the fitted regression model, suggesting that other factors play a role in determining a student's financial aid amount.

  • $b_1$: slope coefficient

Point Estimate
We use $b_0, b_1$ to represent the point estimates of the parameters $\beta_0, \beta_1$

Find LSL by applying two properties of the least squares line:

  • The slope of the least squares line can be estimated by: $b_1 = R\frac{s_y} {s_x} $
  • $(\bar{x}, \bar{y})$ is on the least squares line: $y − \bar{y} = b_1 ( x − \bar{x} )$
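
A minimal sketch of these two properties, using made-up x and y vectors, with a check against lm():

    x <- c(1, 2, 3, 4, 5); y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # made-up data
    R  <- cor(x, y)
    b1 <- R * sd(y) / sd(x)          # slope from the correlation and the two SDs
    b0 <- mean(y) - b1 * mean(x)     # intercept, since (x-bar, y-bar) is on the line
    c(b0, b1)
    coef(lm(y ~ x))                  # should match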

Inference

we use t-test for the population slope $\beta_1$

$H_0$: $\beta_1 = 0$

$H_A$: $\beta_1\not= 0$

table for Linear Regression

row 1: $b_0$
row 2: $b_1$

  • df = N-2
  • SE, t, p are calculated using software

The p-values for testing whether or not the regression coefficients($b_0, b_1$) are different from 0

Equivalence between Linear Regression and T-tests / ANOVA

Input

  • linear regression: continuous
  • t-tests & ANOVA: discrete
  • LR with a single indicator variable for group (1 vs. 2) = a two sample t-test

  • MLR (multivariable linear regression) with K-1 indicator variables for group (1, 2, …, K-1 vs. K) = ANOVA for K groups
  • MLR with an indicator variable for which measurement (e.g. pre vs. post) of the response variable is being considered, plus n-1 indicators for pair membership = a paired t-test.
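
A small sketch of the first equivalence, using made-up data: regressing the response on a single two-level group factor gives the same t and p-value as a pooled two sample t-test:

    set.seed(1)
    Group <- factor(rep(c("A", "B"), each = 20))
    Y <- c(rnorm(20, mean = 5), rnorm(20, mean = 6))   # made-up response

    summary(lm(Y ~ Group))$coefficients   # t and p-value for the Group slope
    t.test(Y ~ Group, var.equal = TRUE)   # same t and p-value (pooled two sample t-test)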

Types of Outliers

(1) one outlier that only slightly influences the line.
(2) one outlier that is quite close to the least squares line and is not very influential.
(3) one point far away from the cloud; this outlier appears to pull the least squares line up on the right.
(4) a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to influence the line somewhat strongly, making the least squares line fit poorly almost everywhere.
(5) no obvious trend in the main cloud of points; the outlier on the right appears to largely control the slope of the least squares line.
(6) one outlier far from the cloud, but it falls quite close to the least squares line and does not appear to be very influential.

There is some trend in the main clouds of (3) and (4). In these cases, the outliers influenced the slope of the least squares lines. In (5), data with no clear trend were assigned a line with a large trend simply due to one outlier (!).

Leverage: Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.
If one of these high leverage points does appear to actually invoke its influence on the slope of the line (as in cases (3), (4), and (5) of Example 7.23), then we call it an influential point.

Implementations in R

  • correlation: cor(x, y, use="complete.obs")
    The use argument indicates how missing values should be handled when computing the correlation (cor() does not have an na.rm argument).
    • compute the p-value given a t-score: pt(t-stat, df) (for a two-sided test, use 2 * pt(-abs(t-stat), df))

Inference
fit <- lm(Y~X, data=MyData)
summary(fit)

  • It is not possible in the lm() function to specify the claimed value for the hypothesis test involving the regression coefficients or to specify a less than or greater than alternative hypothesis.
  • When fitting a regression line to a data set, only observations with non-missing values for both X and Y will be included in the analysis.

Confidence Interval
confint(fit, level=0.95)

Inside the fit list
lm() object is a list that stores all of the output generated by the function call, and we can apply the names() function to the list object to determine what elements are stored in the list.

  • fit$coefficients will print a length 2 vector containing the estimated regression coefficients.
  • fit$fitted.values[1] returns the first fitted value $\hat{y}_1$.
  • fit$residuals[1] returns the first residual $e_1$.
  • plot the regression line: abline(fit, col='blue', lwd = 3)
  • highlight a point: points(X_value, Y_value, col='red', pch = 19)

Check Conditions

  • Linearity: plot(fit$residuals ~ x), no pattern; or plot(fit$residuals ~ fit$fitted.values), a horizontal line at zero

  • Normality: $y-\hat{y}$ (residual) ~ theoretical quantiles of a normal distribution; qqPlot(fit$residuals), should follow the y=x line

    Q-Q plot: first, the set of intervals for the quantiles is chosen. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). Thus the plotted curve is parametric, with the parameter being the index of the quantile interval.

  • Constant Variance: $|y-\hat{y}|$ ~ $\hat{y}$ plot(fit$fitted.values, abs(fit$residuals)), should be no pattern

  • Independence: $y-\hat{y}$ ~ c(1:n): plot(1:dim(Mydata)[1], fit$residuals), should show no pattern
    n is the sample size. When creating this plot, we have to assume that the observations are listed in the order they were collected in the data set, unless there is variable containing the time and date of when each observation was collected.

or simply: plot(fit): will yield 4 plots

  • Plot 1: linearity assumption. The plot is a scatter plot of the residuals (y-axis) against the fitted values (x-axis). R also adds a LOESS curve to the plot. If the linearity assumption is met, the LOESS curve should be a horizontal line at zero.

  • Plot 2: normality assumption. If all of the points fall near the line of identity, the normality assumption is met for this data set.

  • Plot 3: constant variance assumption. The plot is a scatter plot of the square root of the absolute value of the residuals (y-axis) against the fitted values (x-axis). R also adds a LOESS curve to the plot. If the constant variance assumption is met, the LOESS curve should be a flat line.

    Relaxing the condition somewhat: there appears to be less spread for smaller fitted values and more spread for larger fitted values. However, if we ignore the left-most point in this plot, the spread seems approximately constant. Thus, the constant variance assumption may be reasonable for this data set, but it would be good to investigate the potential violation a bit more.

  • R will NOT create the independence diagnostic plot. Instead it creates a plot of the residuals against a measure of leverage. This plot can be used to determine if any of the data points are potential influential points because it includes contour lines for Cook's distance. Be wary of data points with Cook's distance values above 0.5 or 1. If observations in the data set have values near or above these limits, a contour line will appear in the diagnostic plot to alert you to their presence. If no contour lines appear in the plot, you may infer that there is no evidence of influential points in the data set.