Overview of statistical inference. Topics include z-test, t-test, ANOVA & linear regression.
Unit 1 R
R vs. Python
| | R | Python |
|:--|:--|:--|
| Index start | 1 | 0 |
| Number sequences | both inclusive: `0:5` gives 0,1,2,3,4,5 | inclusive, exclusive: `range(0, 5)` gives [0,1,2,3,4] |
| Negative index `a[-1]` | all elements except the first one | the last element |
Modes for Objects
Inspect an object's storage mode with mode() and its class with class().
Objects whose elements share the same mode:
- vector: c(1,2,3,4), seq(1, 4, by=1); size: length()
- matrix: matrix(1:4, 2, 2), cbind(), rbind(), diag(myVec); size: dim()
- array: array(1:18, dim=c(2,3,3)); size: dim()
Objects whose elements can have different modes:
- dataframe: data.frame(myVecA, myVecB, myVecC); colnames(myDf), str(myDf)
- list: list(myObjA, myObjB, myObjC); access: myList$myObjA; size: length()
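A minimal sketch of these object types (the object names here are illustrative):

```r
# Same-mode objects: every element shares one storage mode
myVec <- c(1, 2, 3, 4)                 # numeric vector
myMat <- matrix(1:4, 2, 2)             # 2x2 matrix
myArr <- array(1:18, dim = c(2, 3, 3)) # 3-dimensional array

# Different-mode objects: components may have different types
myDf   <- data.frame(id = 1:2, name = c("a", "b"))
myList <- list(v = myVec, d = myDf)

mode(myVec)     # "numeric"
dim(myMat)      # 2 2
mode(myDf)      # "list" -- a data frame is stored as a list of columns
myList$v        # access a list element by name
length(myList)  # 2
```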
Missing values: is.na()
Subset: a <- subset(acu_rct, is.na(pk1) | is.na(pk2) | is.na(pk5))
Read files

| Function | sep | header | stringsAsFactors | na.strings |
|:--|:--|:--|:--|:--|
| read.table() | ' ' | FALSE | Factor | 'NA' |
| read.csv() | ',' | TRUE | Factor | 'NA' |
| read.xlsx() | | colNames=TRUE | Char (no such parameter) | 'NA' |
| read_sas() | | | | num: '.' char: ' ' |

(stringsAsFactors: TRUE = Factor / FALSE = Char)
Histogram vs. Bar Plot
Histogram — for continuous data
- the bin width must be chosen by the analyst
Bar Plot — for discrete data
- each bar is a category
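A quick sketch contrasting the two plot types (the data are simulated here purely for illustration):

```r
set.seed(1)
cont   <- rnorm(100)                                           # continuous measurements
counts <- table(sample(c("A", "B", "C"), 50, replace = TRUE))  # category counts

par(mfrow = c(1, 2))
hist(cont, breaks = 10)  # continuous: bin width controlled via 'breaks'
barplot(counts)          # discrete: one bar per category
```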
Explore Distributions in R
Distribution
- Normal distribution: dnorm(x, mean=0, sd=1)
- Exponential distribution: dexp(x, rate=1)
- Uniform distribution: dunif(x, min=0, max=1)
- Binomial distribution: dbinom(x, size, prob)
- Poisson distribution: dpois(x, lambda)
- T distribution: dt(x, df)
Prefix
- d — probability mass/density function (PDF): dbinom(x, size, prob). Given a value x, returns its probability.
- p — cumulative distribution function (CDF): pbinom(q, size, prob, lower.tail=TRUE). Given a quantile q, returns the cumulative probability.
  - lower.tail=TRUE: $P(X \le q)$
  - lower.tail=FALSE: $P(X > q)$
- q — percentiles from the distribution (quantiles): qbinom(p, size, prob, lower.tail=TRUE). Given a cumulative probability p, returns the corresponding X.
  - lower.tail=TRUE: returns the smallest $q$ such that $P(X \le q) \ge p$
  - lower.tail=FALSE: returns the largest $q$ such that $P(X > q) \ge p$
- r — random number generation from the distribution: rbinom(n, size, prob). Generates random numbers; there is no lower.tail argument.
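The four prefixes, sketched with the binomial family (note that lower.tail belongs only to the p and q functions):

```r
# d: P(X = x) for Binomial(size = 10, prob = 0.5)
dbinom(2, size = 10, prob = 0.5)                      # 0.04394531
# p: P(X <= q)
pbinom(2, size = 10, prob = 0.5)                      # 0.0546875
# p with lower.tail = FALSE: P(X > q)
pbinom(2, size = 10, prob = 0.5, lower.tail = FALSE)  # 0.9453125
# q: smallest q such that P(X <= q) >= p
qbinom(0.55, size = 10, prob = 0.5)                   # 5
# r: random draws
set.seed(42)
rbinom(3, size = 10, prob = 0.5)
```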
Unit 2 Statistical Inference (Z-test)
Interpretations
Hypothesis Testing
 Goal: to determine whether a claim about a population parameter is supported by the sample data.
 Conclusion: Reject $H_0$ / Fail to reject $H_0$
Interval Estimation
Goal: to determine a range of plausible values for a population parameter.
Conclusion: “We are 95% confident that the interval captures the true value of the population parameter.”
Concepts
Parameter: mean, std
Statistics: Z-score (normal dist), T-score (t-test), F-score (ANOVA)
Observation: a single value
Sample: a collection of observations forms a sample
Hypothesis Testing
| Name | Notation | Example Value | Explanation |
|:--|:--:|:--|:--|
| z-score | $z$ | $> +1.96$ or $< -1.96$ | the number of $\sigma$ it falls above or below the mean: $z = \frac{x - \mu}{\sigma}$; for the sampling distribution: $Z = \frac{\bar{x} - \mu_0}{SE}$ |
| p-value | $p$ | $< 0.05$ | if the null hypothesis is true, the probability of observing a sample mean at least as extreme as $\bar{x}$ (the area beyond the corresponding z) |
| significance level | $\alpha$ | 0.05 | We evaluate the hypotheses by comparing the p-value to the significance level. If the p-value is less than the significance level (e.g. p-value = 0.007 < 0.05 = $\alpha$), we reject $H_0$ in favor of $H_A$; otherwise we fail to reject $H_0$. $\alpha = 0.05$ means we have a 5% chance of observing a sample mean far enough from the value claimed in the null hypothesis to lead to its rejection. |

Compare $z$ with $z_{thresh}$, or equivalently compare $p$ with $\alpha$.
Computing $z$ standardizes $\bar{x}$ from $N(\mu_0, SE)$ onto the standard normal $N(0, 1)$; once standardized, the p-value can be looked up in the Z table.
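A sketch of this standardization in R, with invented numbers:

```r
xbar <- 52; mu0 <- 50; sigma <- 10; n <- 25  # hypothetical sample
SE <- sigma / sqrt(n)     # 2
z  <- (xbar - mu0) / SE   # 1: xbar sits one SE above mu0
p  <- 2 * pnorm(-abs(z))  # two-sided p-value: 0.3173, so fail to reject H0
```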
Interval Estimation
- point estimate: $\hat{\mu} = \bar{x}$
- standard error: $SE = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \simeq \frac{s}{\sqrt{n}}$
- $z^*$ — multiplier corresponding to the 95% confidence level
- confidence interval: $(\bar{x} - z^* SE,\ \bar{x} + z^* SE)$

| Name | Notation | Example Value | Explanation |
|:--|:--:|:--|:--|
| confidence level | $(1-\alpha) \cdot 100\%$ | 95% | We are 95% confident that the interval captures the true value of the population parameter |
| multiplier | $C_\alpha$ or $z^*$ | 2 (1.96) | the interval spreads out about $2 \cdot SE$ from the point estimate |
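The interval can be computed with qnorm() instead of the rounded multiplier 2 (numbers invented for illustration):

```r
xbar <- 52; sigma <- 10; n <- 25
SE     <- sigma / sqrt(n)
z_star <- qnorm(0.975)  # 1.96 for a 95% confidence level
ci <- c(xbar - z_star * SE, xbar + z_star * SE)
round(ci, 2)            # 48.08 55.92
```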
Relationships between these statistics
| Z-score | Hypothesis Testing | Interval Estimation |
|:--|:--|:--|
| $z = 0.43$ | compare with $z_{thresh}$ | |
| $z^* = 1.96$ | | corresponds to the 95% confidence level |
Steps of Inference
testing → estimation
First hypothesis testing (qualitative), then interval estimation (quantitative).
Once a significant signal is detected with a hypothesis test, the next step is often to report a point and interval estimate to understand that signal.
For example, suppose the hypothesis test found evidence that θ is bigger than 10. What is your next question? Ok – how much bigger than 10? The interval estimate would allow you to address that question. As such, estimation and hypothesis testing are often used in tandem when performing inference; they are two parts of the whole.
Statistical vs. Scientific Significance
As the sample size increases, smaller and smaller numerical differences will still meet the statistical significance threshold when performing a hypothesis test.
Because of this, it is really important to always pair a hypothesis test with a confidence interval so the presence or lack of a signal in the test can be explained in the context of plausible values of the parameter. By examining the corresponding confidence interval, you will be able to determine if signal found in a hypothesis test is scientifically relevant.
statistical significance: relative significance
scientific significance: absolute significance
Unit 3 T-test, ANOVA
Determine which test to use
Pool: z-test, t-test, two sample t-test, paired t-test, ANOVA, rmANOVA, Linear Regression

| Predictor | Population parameter | Levels | Dependency | Test to be performed |
|:--|:--|:--|:--|:--|
| None | $\sigma$ known | 1 | | z-test |
| None | $\sigma$ unknown, only have $s$ | 1 | | t-test |
| Discrete | | 2 | independent | Two sample t-test |
| Discrete | | 2 | dependent | Paired t-test |
| Discrete | | >=3 | independent | ANOVA |
| Discrete | | >=3 | dependent | rmANOVA |
| Continuous | | | independent | Linear Regression |
| Continuous | | | dependent | Multivariable Linear Regression |
z-test vs. t-test
$\sigma$: population std (over $x_i$, $i = 1 \ldots N$)
$s$: sample std (over $x_i$, $i = 1 \ldots n$)
$SE$: standard error, the std of the sampling distribution of $\bar{x}$
- $SE = \frac{\sigma}{\sqrt{n}}$ — perform a Z-test
- $SE = \frac{s}{\sqrt{n}}$ — perform a T-test
Assumptions
z-test
- Independence of observations. The sample should be less than 10% of the population.
- Large sample size. n ≥ 30 is a good rule of thumb; with a large sample it doesn't matter what distribution the observations come from.
- The population distribution is not strongly skewed.
t-test
- Independence of observations. Collect a simple random sample from less than 10% of the population.
- Approximately normal. Observations come from a nearly normal distribution.
Two sample t-test
- each sample meets the conditions for using the t-test
- the two samples are independent of each other
Paired t-test
- each sample meets the conditions for using the t-test
- the two samples are dependent on each other
ANOVA
Assumptions
- Independence. The observations are independent within and across groups.
- Approximately normal. The data within each group are nearly normal.
- Constant variance. The variability across the groups is about equal. Usually, the largest sample standard deviation should not be more than twice as large as the smallest across the groups.
rmANOVA
- The observations within each group are independent.
- The same subjects are measured within each group (across-group dependence). This assumption can be checked by looking for any missing values of the response variable across the K groups.
- The response variable must be normally distributed within each group. This assumption can be checked by creating a normal probability plot of the response variable for each group.
- The variance of the paired differences in the response variable for each pair of the K groups must be the same. The standard deviation needs to be the same for all $K(K-1)/2$ pairs of the K groups. This assumption can be checked by computing the sample standard deviation of the paired differences for each pair of the groups. Make sure that the largest standard deviation is not more than twice the smallest.
Testing
T-test: use how different a sample mean is from a given value $\mu_0$ to decide whether the population mean $\mu$ really equals $\mu_0$.
for the sample we got:
- sample mean: $\bar{x}$
- sample std: $s$
for the sample distribution:
- center: $\mu_0$
- standard error: $SE = \frac{s}{\sqrt{n}}$
- degrees of freedom: $df = n - 1$
The distribution of all samples we might get from the population. Our $\bar{x}$ is at some point of this distribution.
Paired T-test: use how different two dependent sample means are from each other, to decide if these two population means are equal (that is, to decide if $\mu_{diff}$ equals 0).
for the sample we got:
- sample mean: $\bar{x}_{diff}$
- sample std: $s_{diff} = \sqrt{\frac{\sum{(x_i - \bar{x}_{diff})^2}}{n - 1}}$
for the sample distribution:
- center: 0
- standard error: $SE = \frac{s_{diff}}{\sqrt{n}}$
- degrees of freedom: $df = n_{diff} - 1$
Two sample T-test: use how different two independent sample means are from each other, to decide if these two population means are equal.
for the sample we got:
- sample means: $\bar{x}_A, \bar{x}_B$
- sample stds: $s_A, s_B$
for the sample distribution:
- center: 0
- standard error: $SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$
- degrees of freedom: a conservative choice is $df = \min(n_A - 1, n_B - 1)$; the pooled-variance version uses $df = n_A + n_B - 2$
ANOVA: making inference about the population means from three or more independent groups and want to compare the population means across the groups.
$H_0: \mu_1 = \mu_2 = … = \mu_k$
$H_A:$ at least one $\mu_i$ is different
assume all samples have the same sample size n:
- MSG: Between-Group Variability, the average squared deviation of each sample mean from the grand mean $\bar{x}_G$
- MSE: Within-Group Variability, the variability within each sample group
- F score: $F = \frac{\text{variability between means}}{\text{error}} = \frac{\text{Between-Group Variability}}{\text{Within-Group Variability}} = \frac{MSG}{MSE} = \frac{SSG/df_G}{SSE/df_E} = \frac{n\sum_{i=1}^{k}{(\bar{x}_i - \bar{x}_G)^2}/(k-1)}{\sum_{i=1}^{k}{\sum_{j=1}^{n}{(x_{ij} - \bar{x}_i)^2}}/(N-k)}$
- degrees of freedom:
  - between group: $df_1 = df_G = k-1$
  - within group: $df_2 = df_E = N - k$
k: number of samples
n: sample size
N: total number of values from all samples. N = n*k
i: notation for each sample group
j: notation for each value in a sample group
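The F score above can be computed by hand and checked against aov(); the data here are invented for illustration:

```r
# k = 3 groups, n = 4 observations each, N = 12
y <- c(5, 6, 7, 6,  8, 9, 10, 9,  4, 5, 4, 5)
g <- factor(rep(c("A", "B", "C"), each = 4))

k <- nlevels(g); n <- 4; N <- length(y)
grand <- mean(y)
means <- tapply(y, g, mean)                  # group means
MSG <- n * sum((means - grand)^2) / (k - 1)  # between-group variability
MSE <- sum((y - means[g])^2) / (N - k)       # within-group variability
F_manual <- MSG / MSE                        # 37.8

F_aov <- summary(aov(y ~ g))[[1]][["F value"]][1]
all.equal(F_manual, F_aov)                   # TRUE
```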
rmANOVA
Changes on df
- $df_1 = df_G = k - 1$
- $df_2 = df_E = N - k - r + 1$
r: number of replicates in each group (that is, the unit on which the repeated measurement is made)
Distributions & Tables
Normal distribution & Z table
row, col: z-score
inside: shaded area to the left of z (the p-value for a negative z; 1 - p-value for a positive z)
T distribution & T table
row: df = n-1; each df corresponds to a t-distribution
column: shaded area (equal to the p-value)
inside: t-score
F distribution & ANOVA table
Row: $df_G = k - 1$
Column: shaded area (p-value)
Inside: F-score
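These table lookups can be reproduced with the p/q functions:

```r
pnorm(1.96)         # Z table: area below z = 1.96, about 0.975
qt(0.975, df = 10)  # T table: t-score leaving 2.5% in the upper tail, about 2.23
pf(6.5, df1 = 2, df2 = 9, lower.tail = FALSE)  # F table: upper-tail p-value
```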
After ANOVA, t-tests will be used
After a significant ANOVA signal is detected, each possible pair of the K group means should be compared using a two sample t-test and/or confidence interval.
Under the rmANOVA design, the pairwise group means should be compared using a paired t-test instead of a two sample t-test.
The Bonferroni correction should be used again to account for the multiple comparisons that will be made in the post-hoc analysis. When we run so many tests, the Type 1 Error rate increases. This issue is resolved by using a modified significance level:
$\alpha^* = \alpha/K$
K: the number of comparisons being considered. If there are k groups, then $K = k(k-1)/2$.
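For example, with k = 4 groups:

```r
alpha <- 0.05
k <- 4
K <- choose(k, 2)        # k*(k-1)/2 = 6 pairwise comparisons
alpha_star <- alpha / K  # 0.00833: modified significance level
```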
Implementations in R
One Sample T-Test
test for normality:
par(mfrow=c(1,1))
qqPlot(Mydata$X)   # qqPlot() is from the car package
p-value (two-sided):
2 * pt(-abs(tscore), df)
t-test:
t.test(Mydata$X,
       alternative="two.sided",
       mu=0,
       conf.level=0.95)
alternative: two.sided, greater, less
In order to get a confidence interval using the t.test() function, the alternative argument must be set to “two.sided”. If the alternative argument is set to “greater” or “less”, the t.test() function will return a onesided confidence bound (that is, either a lower or upper bound, respectively, on the population mean) instead of a confidence interval (that is, range of plausible values of the population mean).
Two Sample TTest
t.test(Mydata1$X, MyData2$X, alternative="two.sided")
Paired TTest
t.test(Mydata1$X, MyData2$X, paired=TRUE)
ANOVA
 assess the equal variance assumption:
doBy::summaryBy()
Here, the variable Y will be compared by the levels of variable Group from the MyData data set.
a_fit <- aov(Y ~ Group, data=MyData)
Pairwise confidence intervals can be constructed using the pairwiseCI()
function in the pairwiseCI package.
no.tests <- choose(3, 2)
or
pw_tests <- pairwiseTest(Y ~ Group, data=MyData)
 Boxplot for each group
boxplot(Y ~ Group)
Repeated Measures ANOVA
Before performing the repeated measures ANOVA analysis, you must check if your subject and group variables are stored as factor variables. If not, use the factor()
function to convert the variables to factors.
a_fit <- aov(Y ~ Group + Error(Subject), data=MyData)
The ANOVA table can be split into two parts: the error due to variability within subjects, and the error that remains after taking out the within-subjects error.
Unit 4 Linear Regression
Concepts
Linear regression attempts to explain or predict how the mean value of the response variable changes with the value of a predictor variable.
Conditions
 Linearity. The data should show a linear trend.
 Nearly normal residuals. Generally the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points.
 Constant variability. The variability of points around the least squares line remains roughly constant.
 Independent observations. Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis. There are also other instances where correlations within the data are important.
Variables
X: predictor, explanatory, independent variable
Y: outcome, response, dependent variable
$y$: observed value [data]
$\hat{y}$: expected value (based on the line of best fit) [fit]
e: residual, $e = y - \hat{y}$ [residual]
Data = Fit + Residual
Least Squares Line
choose the line that minimizes the sum of the squared residuals: $\sum{(y_i - \hat{y}_i)^2}$
Coefficients
- R: correlation coefficient.
Describes the strength of the linear relationship between two variables. We denote the correlation by R. Usually we calculate R with software, but the definition equation is: $R = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
 $R ^2$ : determination coefficient.
describes the amount of variation in the response that is explained by the least squares line.
e.g. an $R^2$ value of 0.2486 demonstrates that using a student's family income to estimate their expected financial aid amount reduces the uncertainty in the estimate by explaining approximately 25% of the variability in the response. However, 75% of the variability in the response is left unexplained by the fitted regression model, suggesting that other factors play a role in determining a student's financial aid amount.
 b1: slope coefficient
Point Estimate
We use $b_0, b_1$ to represent the point estimates of the parameters $\beta_0, \beta_1$
Find LSL by applying two properties of the least squares line:
 The slope of the least squares line can be estimated by: $b_1 = R\frac{s_y} {s_x} $
- $(\bar{x}, \bar{y})$ is on the least squares line: $y - \bar{y} = b_1(x - \bar{x})$
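These two properties can be verified against lm() on a small invented data set:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

R  <- cor(x, y)
b1 <- R * sd(y) / sd(x)       # slope from the correlation
b0 <- mean(y) - b1 * mean(x)  # intercept: the line passes through (xbar, ybar)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE: b0 = 2.2, b1 = 0.6
```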
Inference
we use ttest for the population slope $\beta_1$
$H_0$: $\beta_1 = 0$
$H_A$: $\beta_1\not= 0$
table for Linear Regression
row 1: $b_0$
row 2: $b_1$
- $df = N - 2$
 SE, t, p are calculated using software
The p-values test whether or not the regression coefficients ($b_0$, $b_1$) are different from 0.
Equivalence between Linear Regression and Ttests / ANOVA
Input
 linear regression: continuous
 ttests & ANOVA: discrete
- LR with a single indicator variable for group (1 vs. 2) = a two sample t-test
- MLR (multivariable linear regression) with K-1 indicator variables for group (1, 2, …, K-1 vs. K) = ANOVA for K groups
- MLR with an indicator variable for which measurement (e.g. pre vs. post) of the response variable is being considered, and with n-1 indicators for pair membership = a paired t-test
Types of Outliers
(1) one outlier, though it only slightly influences the line.
(2) one outlier, though it is quite close to the least squares line, wasn’t very influential.
(3) one point far away from the cloud, and this outlier appears to pull the least squares line up on the right.
(4) a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least square line fit poorly almost everywhere.
(5) no obvious trend in the main cloud of points, the outlier on the right appears to largely control the slope of the least squares line.
(6) one outlier far from the cloud, but falls quite close to the least squares line and does not appear to be very influential.
There is some trend in the main clouds of (3) and (4). In these cases, the outliers influenced the slope of the least squares lines. In (5), data with no clear trend were assigned a line with a large trend simply due to one outlier (!).
Leverage: Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.
If one of these high leverage points does appear to actually invoke its influence on the slope of the line – as in cases (3), (4), and (5) of Example 7.23 – then we call it an influential point
Implementations in R
- correlation:
cor(x, y, use="everything")
The use argument controls how missing values are handled when computing the correlation (e.g. use="complete.obs" drops them); cor() takes use, not na.rm.
- compute p-value given a t-score:
pt(tstat, df)
Inference
fit <- lm(Y ~ X, data=MyData)
summary(fit)
- It is not possible in the lm() function to specify the claimed value for the hypothesis test involving the regression coefficients, or to specify a less-than or greater-than alternative hypothesis.
- When fitting a regression line to a data set, only observations with non-missing values for both X and Y will be included in the analysis.
Confidence interval: confint(fit, level=0.95)
Inside the fit list: the lm() object is a list that stores all of the output generated by the function call; apply names() to it to determine what elements are stored in the list.
fit$coefficients   # a length-2 vector containing the estimated regression coefficients
fit$fitted.values[1]
fit$residuals[1]
 plot the regression line:
abline(fit, col='blue', lwd = 3)
 highlight a point:
points(X_value, Y_value, col='red', pch = 19)
Check Conditions
Linearity:
plot(fit$residuals ~ x)
should show no pattern; or
plot(fit$residuals ~ fit$fitted.values)
should be a horizontal band around zero.
Normality: the residuals $y - \hat{y}$ against the theoretical quantiles of a normal distribution:
qqPlot(fit$residuals)
should be close to the y = x line. QQ plot: first, the set of intervals for the quantiles is chosen. A point (x, y) on the plot corresponds to a quantile of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate); the line is thus a parametric curve whose parameter is the index of the quantile interval.
Constant variance: $y - \hat{y}$ against $\hat{y}$:
plot(fit$fitted.values, abs(fit$residuals))
should show no pattern.
Independence: $y - \hat{y}$ against c(1:n):
plot(1:dim(Mydata)[1], fit$residuals)
should show no pattern.
n is the sample size. When creating this plot, we have to assume that the observations are listed in the order they were collected in the data set, unless there is a variable containing the time and date of when each observation was collected.
Or simply plot(fit), which yields 4 diagnostic plots:
Plot 1: linearity assumption. The plot is a scatter plot of the residuals (y-axis) against the fitted values (x-axis). R also adds a LOESS curve to the plot. If the linearity assumption is met, the LOESS curve should be a horizontal line at zero.
Plot 2: normality assumption. If all of the points fall near the line of identity, the normality assumption is met for this data set.
Plot 3: constant variance assumption. The plot is a scatter plot of the square root of the absolute value of the residuals (y-axis) against the fitted values (x-axis). R also adds a LOESS curve to the plot. If the constant variance assumption is met, the LOESS curve should be a flat line.
Relaxing the condition: specifically, there appears to be less spread for smaller fitted values and more spread for larger fitted values. However, if we ignore the leftmost point in this plot, the spread seems approximately constant. Thus, the constant variance assumption may be reasonable for this data set, but it would be good to investigate the potential violation a bit more.
R will NOT create the independence diagnostic plot. Instead it creates a plot of the residuals against a measure of leverage. This plot can be used to determine if any of the data points are potential influential points, because it includes contour lines for Cook's distance. Be wary of data points with Cook's distance values above 0.5 or 1. If observations in the data set have values near or above these limits, a contour line will appear in the diagnostic plot to alert you to their presence. If no contour lines appear in the plot, you may infer that there is no evidence of influential points in the data set.