Distributions
Statistical tests for multiple distributions, and tests for homoscedasticity.
Pingouin.anderson
— Functionanderson(x[, dist])
Anderson-Darling test of distribution.
Arguments
x::Array{<:Number}
: Array of sample data. May be different lengths. Also support multiple arrays (see examples),dist::Union{String, Distribution}
: Distribution ("norm", "expon", "logistic", "gumbel").
Returns
H::Bool
: True if data comes from this distribution,P::Float64
: The significance levels for the corresponding critical values in %.
(See :HypothesisTests.OneSampleADTest
for more details).
Examples
- Test that an array comes from a normal distribution
julia> x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
julia> Pingouin.anderson(x, "norm")
(false, 8.571428568870942e-5)
- Test that an array comes from a custom distribution
julia> using Distributions
julia> x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
julia> Pingouin.anderson(x, Normal(1,5))
(false, 0.04755873570126501)
- Test that 2 arrays come from a custom distribution
julia> using Distributions
julia> x = [[2.3, 5.1, 4.3, 2.6, 7.8], [1, 2.0, 3.5, 5.1]]
julia> Pingouin.anderson(x, Normal(1,5))
((true, true), (0.09387808627138938, 0.3413749361130328))
Pingouin.epsilon
— Methodepsilon(data[, dv, within, subject, correction])
Epsilon adjustement factor for repeated measures.
Arguments
data::Union{Array{<:Number}, DataFrame}
: DataFrame containing the repeated measurements. Only long-format dataframe are supported for this function.dv::Union{Symbol, String, Nothing}
: Name of column containing the dependent variable.within::Union{Array{String}, Array{Symbol}, Symbol, String, Nothing}
: Name of column containing the within factor (only required ifdata
is in long format). Ifwithin
is a list with two strings, this function computes the epsilon factor for the interaction between the two within-subject factor.subject::Union{Symbol, String, Nothing}
: Name of column containing the subject identifier (only required ifdata
is in long format).correction::String
: Specify the epsilon version:"gg"
: Greenhouse-Geisser,"hf"
: Huynh-Feldt,"lb"
: Lower bound.
Returns
eps::Float64: Epsilon adjustement factor.
See Also
sphericity
: Mauchly and JNS test for sphericity.homoscedasticity
: Test equality of variance.
Notes
The lower bound epsilon is:
\[lb = \frac{1}{\text{dof}}\]
,
where the degrees of freedom $\text{dof}$ is the number of groups $k$ minus 1 for one-way design and $(k_1 - 1)(k_2 - 1)$ for two-way design
The Greenhouse-Geisser epsilon is given by:
\[\epsilon_{GG} = \frac{k^2(\overline{\text{diag}(S)} - \overline{S})^2}{(k-1)(\sum_{i=1}^{k}\sum_{j=1}^{k}s_{ij}^2 - 2k\sum_{j=1}^{k}\overline{s_i}^2 + k^2\overline{S}^2)}\]
where $S$ is the covariance matrix, $\overline{S}$ the grandmean of S and $\overline{\text{diag}(S)}$ the mean of all the elements on the diagonal of S (i.e. mean of the variances).
The Huynh-Feldt epsilon is given by:
\[\epsilon_{HF} = \frac{n(k-1)\epsilon_{GG}-2}{(k-1) (n-1-(k-1)\epsilon_{GG})}\]
where $n$ is the number of observations.
Missing values are automatically removed from data (listwise deletion).
Examples
Using a wide-format dataframe
julia> data = DataFrame(A = [2.2, 3.1, 4.3, 4.1, 7.2],
B = [1.1, 2.5, 4.1, 5.2, 6.4],
C = [8.2, 4.5, 3.4, 6.2, 7.2])
julia> Pingouin.epsilon(data, correction="gg")
0.5587754577585018
julia> Pingouin.epsilon(data, correction="hf")
0.6223448311539789
julia> Pingouin.epsilon(data, correction="lb")
0.5
Now using a long-format dataframe
julia> data = Pingouin.read_dataset("rm_anova2")
julia> head(data)
6×4 DataFrame
│ Row │ Subject │ Time │ Metric │ Performance │
│ │ Int64 │ String │ String │ Int64 │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1 │ 1 │ Pre │ Product │ 13 │
│ 2 │ 2 │ Pre │ Product │ 12 │
│ 3 │ 3 │ Pre │ Product │ 17 │
│ 4 │ 4 │ Pre │ Product │ 12 │
│ 5 │ 5 │ Pre │ Product │ 19 │
│ 6 │ 6 │ Pre │ Product │ 6 │
Let's first calculate the epsilon of the Time within-subject factor
julia> Pingouin.epsilon(data, dv="Performance", subject="Subject",
within="Time")
1.0
Since Time has only two levels (Pre and Post), the sphericity assumption is necessarily met, and therefore the epsilon adjustement factor is 1.
The Metric factor, however, has three levels:
julia> Pingouin.epsilon(data, dv=:Performance, subject=:Subject,
within=[:Metric])
0.9691029584899762
The epsilon value is very close to 1, meaning that there is no major violation of sphericity.
Now, let's calculate the epsilon for the interaction between the two repeated measures factor:
julia> Pingouin.epsilon(data, dv=:Performance, subject=:Subject,
within=[:Time, :Metric])
0.727166420214127
Pingouin.gzscore
— Methodgzscore(x)
Geometric standard (Z) score.
Arguments
x::Array{<:Number}
: Array of raw values
Returns
gzscore::Array{<:Number}
: Array of geometric z-scores (same shape as x)
Notes
Geometric Z-scores are better measures of dispersion than arithmetic z-scores when the sample data come from a log-normally distributed population [1].
Given the raw scores $x$, the geometric mean $\mu_g$ and the geometric standard deviation $\sigma_g$, the standard score is given by the formula:
\[z = \frac{log(x) - log(\mu_g)}{log(\sigma_g)}\]
References
[1] https://en.wikipedia.org/wiki/Geometricstandarddeviation
Examples
Standardize a lognormal-distributed vector:
julia> raw = [1,4,5,4,1,2,5,8,6,6,9,8,3]
julia> z = Pingouin.gzscore(raw)
13-element Array{Float64,1}:
-1.8599725059104346
0.03137685347921089
0.3358161014965816
0.03137685347921089
-1.8599725059104346
⋮
0.5845610789821727
0.5845610789821727
1.1377453044851344
0.9770515331740336
-0.3611136007126501
Pingouin.homoscedasticity
— Methodhomoscedasticity(data[, dv, group, method, α])
Test equality of variance.
Arguments
data
: Iterable. Can be either an Array iterables or a wide- or long-format dataframe.dv::Union{Symbol, String, Nothing}
: Dependent variable (only when $data$ is a long-format dataframe).group::Union{Symbol, String, Nothing}
: Grouping variable (only when $data$ is a long-format dataframe).method::String
: Statistical test.'levene'
(default) performs the Levene test and'bartlett'
performs the Bartlett test. The former is more robust to departure from normality.α::Float64
: Significance level.
Returns
stats::DataFrame
:W/T
: Test statistic ('W' for Levene, 'T' for Bartlett),pval
: p-value,equal_var
: true ifdata
has equal variance.
See Also
normality
: Univariate normality test.sphericity
: Mauchly's test for sphericity.
Notes
The Bartlett $T$ statistic [1] is defined as:
\[T = \frac{(N-k) \ln{s^{2}_{p}} - \sum_{i=1}^{k}(N_{i} - 1) \ln{s^{2}_{i}}}{1 + (1/(3(k-1)))((\sum_{i=1}^{k}{1/(N_{i} - 1))} - 1/(N-k))}\]
where $s_i^2$ is the variance of the $i^{th}$ group, $N$ is the total sample size, $N_i$ is the sample size of the $i^{th}$ group, $k$ is the number of groups, and $s_p^2$ is the pooled variance.
The pooled variance is a weighted average of the group variances and is defined as:
\[s^{2}_{p} = \sum_{i=1}^{k}(N_{i} - 1)s^{2}_{i}/(N-k)\]
The p-value is then computed using a chi-square distribution:
\[T \sim \chi^2(k-1)\]
The Levene $W$ statistic [2] is defined as:
\[W = \frac{(N-k)} {(k-1)} \frac{\sum_{i=1}^{k}N_{i}(\overline{Z}_{i.}-\overline{Z})^{2} } {\sum_{i=1}^{k}\sum_{j=1}^{N_i}(Z_{ij}-\overline{Z}_{i.})^{2} }\]
where $Z_{ij} = |Y_{ij} - \text{median}({Y}_{i.})|$, $\overline{Z}_{i.}$ are the group means of $Z_{ij}$ and $\overline{Z}$ is the grand mean of $Z_{ij}$.
The p-value is then computed using a F-distribution:
\[W \sim F(k-1, N-k)\]
WARNING: Missing values are not supported for this function. Make sure to remove them before.
References
[1] Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A, 160(901), 268-282.
[2] Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364-367.
Examples
- Levene test on a wide-format dataframe
julia> data = Pingouin.read_dataset("mediation");
julia> Pingouin.homoscedasticity(data[["X", "Y", "M"]])
1×3 DataFrame
│ Row │ W │ pval │ equal_var │
│ │ Float64 │ Float64 │ Bool │
├─────┼─────────┼──────────┼───────────┤
│ 1 │ 1.17352 │ 0.310707 │ 1 │
- Bartlett test using an array of arrays
julia> data = [[4, 8, 9, 20, 14], [5, 8, 15, 45, 12]];
julia> Pingouin.homoscedasticity(data, method="bartlett", α=.05)
1×3 DataFrame
│ Row │ T │ pval │ equal_var │
│ │ Float64 │ Float64 │ Bool │
├─────┼─────────┼──────────┼───────────┤
│ 1 │ 2.87357 │ 0.090045 │ 1 │
- Long-format dataframe
julia> data = Pingouin.read_dataset("rm_anova2");
julia> Pingouin.homoscedasticity(data, "Performance", "Time")
1×3 DataFrame
│ Row │ W │ pval │ equal_var │
│ │ Float64 │ Float64 │ Bool │
├─────┼─────────┼───────────┼───────────┤
│ 1 │ 3.1922 │ 0.0792169 │ 1 │
Pingouin.normality
— Methodnormality(data[, dv, group, method, α])
Univariate normality test.
Arguments
data
: Iterable. Can be either a single list, 1D array, or a wide- or long-format dataframe.dv::Union{Symbol, String, Nothing}
: Dependent variable (only when $data$ is a long-format dataframe).group::Union{Symbol, String, Nothing}
: Grouping variable (only when $data$ is a long-format dataframe).method::String
: Normality test.'shapiro'
(default) performs the Shapiro-Wilk test using the AS R94 algorithm. If the kurtosis is higher than 3, it performs a Shapiro-Francia test for leptokurtic distributions. Supported values: ["shapiro", "jarque_bera"].α::Float64
: Significance level.
Returns
stats::DataFrame
:W
: Test statistic,pval
: p-value,normal
: true ifdata
is normally distributed.
See Also
homoscedasticity
: Test equality of variance.sphericity
: Mauchly's test for sphericity.
Notes
The Shapiro-Wilk test calculates a :math:W
statistic that tests whether a random sample $x_1, x_2, ..., x_n$ comes from a normal distribution.
The $W$ is normalized ($W = (W - μ) / σ$)
The null-hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes (>5000). Indeed, quoting from Wikipedia:
"Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case."
The Jarque-Bera statistic is to test the null hypothesis that a real-valued vector $y$ is normally distributed. Note that the approximation by the Chi-squared distribution does not work well and the speed of convergence is slow. In small samples, the test tends to be over-sized for nominal levels up to about 3% and under-sized for larger nominal levels (Mantalos, 2010).
Note that missing values are automatically removed (casewise deletion).
References
- Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test
for normality (complete samples). Biometrika, 52(3/4), 591-611.
- Panagiotis Mantalos, 2011, "The three different measures of the sample skewness and
kurtosis and the effects to the Jarque-Bera test for normality", International Journal of Computational Economics and Econometrics, Vol. 2, No. 1, link.
https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
Examples
- Shapiro-Wilk test on a 1D array
julia> dataset = Pingouin.read_dataset("anova")
julia> Pingouin.normality(dataset["Pain threshold"])
1×3 DataFrame
│ Row │ W │ pval │ normal │
│ │ Float64 │ Float64 │ Bool │
├─────┼──────────┼──────────┼────────┤
│ 1 │ 0.971204 │ 0.800257 │ 1 │
- Wide-format dataframe using Jarque-Bera test
julia> dataset = Pingouin.read_dataset("mediation")
julia> Pingouin.normality(dataset, method="jarque_bera")
7×4 DataFrame
│ Row │ dv │ W │ pval │ normal │
│ │ Symbol │ Float64 │ Float64 │ Bool │
├─────┼────────┼──────────┼─────────────┼────────┤
│ 1 │ X │ 1.42418 │ 0.490618 │ 1 │
│ 2 │ M │ 0.645823 │ 0.724038 │ 1 │
│ 3 │ Y │ 0.261805 │ 0.877303 │ 1 │
│ 4 │ Mbin │ 16.6735 │ 0.000239553 │ 0 │
│ 5 │ Ybin │ 16.6675 │ 0.000240265 │ 0 │
│ 6 │ W1 │ 5.40923 │ 0.0668961 │ 1 │
│ 7 │ W2 │ 80.6857 │ 3.01529e-18 │ 0 │
- Long-format dataframe
julia> dataset = Pingouin.read_dataset("rm_anova2")
julia> Pingouin.normality(dataset, :Performance, :Time)
2×4 DataFrame
│ Row │ Time │ W │ pval │ normal │
│ │ String │ Float64 │ Float64 │ Bool │
├─────┼────────┼──────────┼───────────┼────────┤
│ 1 │ Pre │ 0.967718 │ 0.478771 │ 1 │
│ 2 │ Post │ 0.940728 │ 0.0951576 │ 1 │
Pingouin.sphericity
— Methodsphericity(data[, dv, within, subject, method, α])
Mauchly and JNS test for sphericity.
Arguments
data::DataFrame
: DataFrame containing the repeated measurements. Only long-format dataframe are supported for this function.dv::Union{Nothing, String, Symbol}
: Name of column containing the dependent variable.within::Union{Array{String}, Array{Symbol}, Nothing, String, Symbol}
: Name of column containing the within factor. Ifwithin
is a list with two strings, this function computes the epsilon factor for the interaction between the two within-subject factor.subject::Union{Nothing, String, Symbol}
: Name of column containing the subject identifier (only required ifdata
is in long format).method::String
: Method to compute sphericity:"jns"
: John, Nagao and Sugiura test."mauchly"
: Mauchly test (default).
α::Float64
: Significance level
Returns
spher::Bool
: True if data have the sphericity property.W::Float64:
Test statistic.chi2::Float64
: Chi-square statistic.dof::Int64
: Degrees of freedom.pval::Flot64
: P-value.
Throwing
DomainError
When testing for an interaction, if both within-subject factors have more than 2 levels (not yet supported in Pingouin).
See Also
epsilon
: Epsilon adjustement factor for repeated measures.homoscedasticity
: Test equality of variance.normality
: Univariate normality test.
Notes
The Mauchly $W$ statistic [1] is defined by:
\[W = \frac{\prod \lambda_j}{(\frac{1}{k-1} \sum \lambda_j)^{k-1}}\]
where $\lambda_j$ are the eigenvalues of the population covariance matrix (= double-centered sample covariance matrix) and $k$ is the number of conditions.
From then, the $W$ statistic is transformed into a chi-square score using the number of observations per condition $n$
\[f = \frac{2(k-1)^2+k+1}{6(k-1)(n-1)}\]
\[\chi_w^2 = (f-1)(n-1) \text{log}(W)\]
The p-value is then approximated using a chi-square distribution:
\[\chi_w^2 \sim \chi^2(\frac{k(k-1)}{2}-1)\]
The JNS $V$ statistic ([2], [3], [4]) is defined by:
\[V = \frac{(\sum_j^{k-1} \lambda_j)^2}{\sum_j^{k-1} \lambda_j^2}\]
\[\chi_v^2 = \frac{n}{2} (k-1)^2 (V - \frac{1}{k-1})\]
and the p-value approximated using a chi-square distribution
\[\chi_v^2 \sim \chi^2(\frac{k(k-1)}{2}-1)\]
Missing values are automatically removed from data
(listwise deletion).
References
[1] Mauchly, J. W. (1940). Significance test for sphericity of a normal n-variate distribution. The Annals of Mathematical Statistics, 11(2), 204-209.
[2] Nagao, H. (1973). On some test criteria for covariance matrix. The Annals of Statistics, 700-709.
[3] Sugiura, N. (1972). Locally best invariant test for sphericity and the limiting distributions. The Annals of Mathematical Statistics, 1312-1316.
[4] John, S. (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika, 59(1), 169-173.
See also http://www.real-statistics.com/anova-repeated-measures/sphericity/
Examples
Mauchly test for sphericity using a wide-format dataframe
julia> data = DataFrame(A = [2.2, 3.1, 4.3, 4.1, 7.2],
B = [1.1, 2.5, 4.1, 5.2, 6.4],
C = [8.2, 4.5, 3.4, 6.2, 7.2])
julia> Pingouin.sphericity(data)
│ Row │ spher │ W │ chi2 │ dof │ pval │
│ │ Bool │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────┼──────────┼─────────┼─────────┼───────────┤
│ 1 │ 1 │ 0.210372 │ 4.67663 │ 2.0 │ 0.0964902 │
John, Nagao and Sugiura (JNS) test
julia> Pingouin.sphericity(data, method="jns")[:pval] # P-value only
0.045604240307520305
Now using a long-format dataframe
julia> data = Pingouin.read_dataset("rm_anova2")
julia> head(data)
6x4 DataFrame
│ Row │ Subject │ Time │ Metric │ Performance │
│ │ Int64 │ String │ String │ Int64 │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1 │ 1 │ Pre │ Product │ 13 │
│ 2 │ 2 │ Pre │ Product │ 12 │
│ 3 │ 3 │ Pre │ Product │ 17 │
│ 4 │ 4 │ Pre │ Product │ 12 │
│ 5 │ 5 │ Pre │ Product │ 19 │
│ 6 │ 6 │ Pre │ Product │ 6 │
Let's first test sphericity for the Time within-subject factor
julia> Pingouin.sphericity(data, dv=:Performance, subject=:Subject,
within=:Time)
│ Row │ spher │ W │ chi2 │ dof │ pval │
│ │ Bool │ Float64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ NaN │ NaN │ 1 │ 1.0 │
Since Time has only two levels (Pre and Post), the sphericity assumption is necessarily met.
The Metric factor, however, has three levels:
julia> Pingouin.sphericity(data, dv="Performance", subject="Subject",
within=["Metric"])[:pval]
0.8784417991645139
The p-value value is very large, and the test therefore indicates that there is no violation of sphericity.
Now, let's calculate the epsilon for the interaction between the two repeated measures factor. The current implementation in Pingouin only works if at least one of the two within-subject factors has no more than two levels.
julia> Pingouin.sphericity(data, dv="Performance",
subject="Subject",
within=["Time", "Metric"])
│ Row │ spher │ W │ chi2 │ dof │ pval │
│ │ Bool │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────┼──────────┼─────────┼─────────┼──────────┤
│ 1 │ 1 │ 0.624799 │ 3.7626 │ 2.0 │ 0.152392 │
Here again, there is no violation of sphericity acccording to Mauchly's test.