Distributions

Statistical tests for multiple distributions, and tests for homoscedasticity.

Pingouin.anderson — Function
anderson(x[, dist])

Anderson-Darling test of distribution.

Arguments

  • x::Array{<:Number}: Array of sample data. Multiple arrays of possibly different lengths are also supported (see examples),
  • dist::Union{String, Distribution}: Distribution to test against: one of "norm", "expon", "logistic", "gumbel", or any Distributions.jl distribution.

Returns

  • H::Bool: true if the data can be assumed to come from the given distribution (i.e., the null hypothesis is not rejected),
  • P::Float64: p-value of the test.

(See HypothesisTests.OneSampleADTest for more details.)
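
The same test can be run directly through HypothesisTests.jl if the full test object is needed; a minimal sketch reproducing the custom-distribution case:

```julia
using HypothesisTests, Distributions

# Anderson-Darling test against a custom distribution, using the
# HypothesisTests.jl test that Pingouin.anderson builds on.
x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
t = OneSampleADTest(x, Normal(1, 5))
p = pvalue(t)   # compare against the chosen α to accept/reject
```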

Examples

  1. Test that an array comes from a normal distribution
julia> x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
julia> Pingouin.anderson(x, "norm")
(false, 8.571428568870942e-5)
  2. Test that an array comes from a custom distribution
julia> using Distributions
julia> x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
julia> Pingouin.anderson(x, Normal(1,5))
(false, 0.04755873570126501)
  3. Test that two arrays come from a custom distribution
julia> using Distributions
julia> x = [[2.3, 5.1, 4.3, 2.6, 7.8], [1, 2.0, 3.5, 5.1]]
julia> Pingouin.anderson(x, Normal(1,5))
((true, true), (0.09387808627138938, 0.3413749361130328))
source
Pingouin.epsilon — Method
epsilon(data[, dv, within, subject, correction])

Epsilon adjustment factor for repeated measures.

Arguments

  • data::Union{Array{<:Number}, DataFrame}: DataFrame containing the repeated measurements. Both wide- and long-format dataframes are supported (see examples).
  • dv::Union{Symbol, String, Nothing}: Name of column containing the dependent variable.
  • within::Union{Array{String}, Array{Symbol}, Symbol, String, Nothing}: Name of column containing the within factor (only required if data is in long format). If within is a list with two strings, this function computes the epsilon factor for the interaction between the two within-subject factors.
  • subject::Union{Symbol, String, Nothing}: Name of column containing the subject identifier (only required if data is in long format).
  • correction::String: Specify the epsilon version:
    • "gg": Greenhouse-Geisser,
    • "hf": Huynh-Feldt,
    • "lb": Lower bound.

Returns

eps::Float64: Epsilon adjustment factor.

Notes

The lower bound epsilon is:

\[lb = \frac{1}{\text{dof}},\]

where the degrees of freedom $\text{dof}$ equal the number of groups $k$ minus 1 for a one-way design and $(k_1 - 1)(k_2 - 1)$ for a two-way design.

The Greenhouse-Geisser epsilon is given by:

\[\epsilon_{GG} = \frac{k^2(\overline{\text{diag}(S)} - \overline{S})^2}{(k-1)(\sum_{i=1}^{k}\sum_{j=1}^{k}s_{ij}^2 - 2k\sum_{j=1}^{k}\overline{s_i}^2 + k^2\overline{S}^2)}\]

where $S$ is the covariance matrix, $\overline{S}$ the grand mean of $S$, and $\overline{\text{diag}(S)}$ the mean of all the elements on the diagonal of $S$ (i.e., the mean of the variances).

The Huynh-Feldt epsilon is given by:

\[\epsilon_{HF} = \frac{n(k-1)\epsilon_{GG}-2}{(k-1) (n-1-(k-1)\epsilon_{GG})}\]

where $n$ is the number of observations.

Missing values are automatically removed from data (listwise deletion).
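
For intuition, the three corrections can be computed directly from the covariance matrix of wide-format data. The sketch below is illustrative (the `epsilon_sketch` helper is hypothetical, not Pingouin's implementation), assuming a matrix with subjects as rows and conditions as columns:

```julia
using Statistics, LinearAlgebra

# Epsilon corrections computed from the formulas above.
# Rows of X are subjects, columns are the repeated-measures conditions.
function epsilon_sketch(X::AbstractMatrix; correction::String="gg")
    n, k = size(X)
    correction == "lb" && return 1 / (k - 1)   # lower bound (one-way design)
    S = cov(X)                                 # covariance matrix
    mean_diag = mean(diag(S))                  # mean of the variances
    mean_S = mean(S)                           # grand mean of S
    row_means = mean(S, dims=2)
    num = k^2 * (mean_diag - mean_S)^2
    den = (k - 1) * (sum(S .^ 2) - 2k * sum(row_means .^ 2) + k^2 * mean_S^2)
    eps_gg = num / den                         # Greenhouse-Geisser
    correction == "gg" && return eps_gg
    # Huynh-Feldt, conventionally capped at 1
    min((n * (k - 1) * eps_gg - 2) / ((k - 1) * (n - 1 - (k - 1) * eps_gg)), 1.0)
end
```

On the wide-format data of the first example below, this sketch reproduces the "gg", "hf", and "lb" values shown.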

Examples

Using a wide-format dataframe

julia> data = DataFrame(A = [2.2, 3.1, 4.3, 4.1, 7.2],
                     B = [1.1, 2.5, 4.1, 5.2, 6.4],
                     C = [8.2, 4.5, 3.4, 6.2, 7.2])
julia> Pingouin.epsilon(data, correction="gg")
0.5587754577585018
julia> Pingouin.epsilon(data, correction="hf")
0.6223448311539789
julia> Pingouin.epsilon(data, correction="lb")
0.5

Now using a long-format dataframe

julia> data = Pingouin.read_dataset("rm_anova2")
julia> head(data)
6×4 DataFrame
│ Row │ Subject │ Time   │ Metric  │ Performance │
│     │ Int64   │ String │ String  │ Int64       │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1   │ 1       │ Pre    │ Product │ 13          │
│ 2   │ 2       │ Pre    │ Product │ 12          │
│ 3   │ 3       │ Pre    │ Product │ 17          │
│ 4   │ 4       │ Pre    │ Product │ 12          │
│ 5   │ 5       │ Pre    │ Product │ 19          │
│ 6   │ 6       │ Pre    │ Product │ 6           │

Let's first calculate the epsilon of the Time within-subject factor

julia> Pingouin.epsilon(data, dv="Performance", subject="Subject",
                     within="Time")
1.0

Since Time has only two levels (Pre and Post), the sphericity assumption is necessarily met, and therefore the epsilon adjustment factor is 1.

The Metric factor, however, has three levels:

julia> Pingouin.epsilon(data, dv=:Performance, subject=:Subject,
                     within=[:Metric])
0.9691029584899762

The epsilon value is very close to 1, meaning that there is no major violation of sphericity.

Now, let's calculate the epsilon for the interaction between the two repeated measures factors:

julia> Pingouin.epsilon(data, dv=:Performance, subject=:Subject,
                     within=[:Time, :Metric])
0.727166420214127
source
Pingouin.gzscore — Method
gzscore(x)

Geometric standard (Z) score.

Arguments

  • x::Array{<:Number}: Array of raw values

Returns

  • gzscore::Array{<:Number}: Array of geometric z-scores (same shape as x)

Notes

Geometric Z-scores are better measures of dispersion than arithmetic z-scores when the sample data come from a log-normally distributed population [1].

Given the raw scores $x$, the geometric mean $\mu_g$ and the geometric standard deviation $\sigma_g$, the standard score is given by the formula:

\[z = \frac{\log(x) - \log(\mu_g)}{\log(\sigma_g)}\]
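
Equivalently, the geometric z-score is an ordinary z-score computed in log space. A minimal sketch (the `gzscore_sketch` name is hypothetical; positive data are assumed so the logarithms are defined):

```julia
using Statistics

# log(μ_g) is the mean of the logs and log(σ_g) their (sample)
# standard deviation, so the formula reduces to a z-score of log.(x).
function gzscore_sketch(x::AbstractVector{<:Real})
    lx = log.(x)
    (lx .- mean(lx)) ./ std(lx)
end
```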

References

[1] https://en.wikipedia.org/wiki/Geometric_standard_deviation

Examples

Standardize a lognormal-distributed vector:

julia> raw = [1,4,5,4,1,2,5,8,6,6,9,8,3]
julia> z = Pingouin.gzscore(raw)
13-element Array{Float64,1}:
 -1.8599725059104346
  0.03137685347921089
  0.3358161014965816
  0.03137685347921089
 -1.8599725059104346
  ⋮
  0.5845610789821727
  0.5845610789821727
  1.1377453044851344
  0.9770515331740336
 -0.3611136007126501
source
Pingouin.homoscedasticity — Method
homoscedasticity(data[, dv, group, method, α])

Test equality of variance.

Arguments

  • data: Iterable. Can be either an array of iterables or a wide- or long-format dataframe.
  • dv::Union{Symbol, String, Nothing}: Dependent variable (only when $data$ is a long-format dataframe).
  • group::Union{Symbol, String, Nothing}: Grouping variable (only when $data$ is a long-format dataframe).
  • method::String: Statistical test. 'levene' (default) performs the Levene test and 'bartlett' performs the Bartlett test. The former is more robust to departure from normality.
  • α::Float64: Significance level.

Returns

  • stats::DataFrame:
    • W/T: Test statistic ('W' for Levene, 'T' for Bartlett),
    • pval: p-value,
    • equal_var: true if data has equal variance.

Notes

The Bartlett $T$ statistic [1] is defined as:

\[T = \frac{(N-k) \ln{s^{2}_{p}} - \sum_{i=1}^{k}(N_{i} - 1) \ln{s^{2}_{i}}}{1 + (1/(3(k-1)))((\sum_{i=1}^{k}{1/(N_{i} - 1))} - 1/(N-k))}\]

where $s_i^2$ is the variance of the $i^{th}$ group, $N$ is the total sample size, $N_i$ is the sample size of the $i^{th}$ group, $k$ is the number of groups, and $s_p^2$ is the pooled variance.

The pooled variance is a weighted average of the group variances and is defined as:

\[s^{2}_{p} = \sum_{i=1}^{k}(N_{i} - 1)s^{2}_{i}/(N-k)\]

The p-value is then computed using a chi-square distribution:

\[T \sim \chi^2(k-1)\]
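
The Bartlett formulas above translate almost directly into code; a minimal sketch assuming the data arrive as a vector of group samples (the `bartlett_sketch` helper is hypothetical, not Pingouin's API):

```julia
using Statistics, Distributions

# Bartlett's T statistic and chi-square p-value, following the
# definitions above. `groups` is a vector of per-group samples.
function bartlett_sketch(groups)
    k = length(groups)
    Ni = length.(groups)
    N = sum(Ni)
    si2 = var.(groups)                         # group variances
    sp2 = sum((Ni .- 1) .* si2) / (N - k)      # pooled variance
    T = ((N - k) * log(sp2) - sum((Ni .- 1) .* log.(si2))) /
        (1 + (sum(1 ./ (Ni .- 1)) - 1 / (N - k)) / (3 * (k - 1)))
    T, ccdf(Chisq(k - 1), T)                   # T ~ χ²(k−1)
end
```

On the array-of-arrays example below, this reproduces the documented T and p-value.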

The Levene $W$ statistic [2] is defined as:

\[W = \frac{(N-k)} {(k-1)} \frac{\sum_{i=1}^{k}N_{i}(\overline{Z}_{i.}-\overline{Z})^{2} } {\sum_{i=1}^{k}\sum_{j=1}^{N_i}(Z_{ij}-\overline{Z}_{i.})^{2} }\]

where $Z_{ij} = |Y_{ij} - \text{median}({Y}_{i.})|$, $\overline{Z}_{i.}$ are the group means of $Z_{ij}$ and $\overline{Z}$ is the grand mean of $Z_{ij}$.

The p-value is then computed using a F-distribution:

\[W \sim F(k-1, N-k)\]
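
A matching sketch for the Levene (Brown-Forsythe) statistic, under the same vector-of-groups assumption (the `levene_sketch` helper is hypothetical):

```julia
using Statistics, Distributions

# Levene's W with absolute deviations from the group medians
# (the Brown-Forsythe variant described above).
function levene_sketch(groups)
    k = length(groups)
    Ni = length.(groups)
    N = sum(Ni)
    Z = [abs.(g .- median(g)) for g in groups]   # Z_ij
    Zi = mean.(Z)                                # group means of Z
    Zbar = sum(sum.(Z)) / N                      # grand mean of Z
    num = sum(Ni .* (Zi .- Zbar) .^ 2)
    den = sum(sum((z .- zi) .^ 2) for (z, zi) in zip(Z, Zi))
    W = (N - k) / (k - 1) * num / den
    W, ccdf(FDist(k - 1, N - k), W)              # W ~ F(k−1, N−k)
end
```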

WARNING: Missing values are not supported for this function. Make sure to remove them beforehand.

References

[1] Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A, 160(901), 268-282.

[2] Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364-367.

Examples

  1. Levene test on a wide-format dataframe
julia> data = Pingouin.read_dataset("mediation");
julia> Pingouin.homoscedasticity(data[:, ["X", "Y", "M"]])
1×3 DataFrame
│ Row │ W       │ pval     │ equal_var │
│     │ Float64 │ Float64  │ Bool      │
├─────┼─────────┼──────────┼───────────┤
│ 1   │ 1.17352 │ 0.310707 │ 1         │
  2. Bartlett test using an array of arrays
julia> data = [[4, 8, 9, 20, 14], [5, 8, 15, 45, 12]];
julia> Pingouin.homoscedasticity(data, method="bartlett", α=.05)
1×3 DataFrame
│ Row │ T       │ pval     │ equal_var │
│     │ Float64 │ Float64  │ Bool      │
├─────┼─────────┼──────────┼───────────┤
│ 1   │ 2.87357 │ 0.090045 │ 1         │
  3. Long-format dataframe
julia> data = Pingouin.read_dataset("rm_anova2");
julia> Pingouin.homoscedasticity(data, "Performance", "Time")
1×3 DataFrame
│ Row │ W       │ pval      │ equal_var │
│     │ Float64 │ Float64   │ Bool      │
├─────┼─────────┼───────────┼───────────┤
│ 1   │ 3.1922  │ 0.0792169 │ 1         │
source
Pingouin.normality — Method
normality(data[, dv, group, method, α])

Univariate normality test.

Arguments

  • data: Iterable. Can be either a single list, 1D array, or a wide- or long-format dataframe.
  • dv::Union{Symbol, String, Nothing}: Dependent variable (only when $data$ is a long-format dataframe).
  • group::Union{Symbol, String, Nothing}: Grouping variable (only when $data$ is a long-format dataframe).
  • method::String: Normality test. 'shapiro' (default) performs the Shapiro-Wilk test using the AS R94 algorithm. If the kurtosis is higher than 3, it performs a Shapiro-Francia test for leptokurtic distributions. Supported values: ["shapiro", "jarque_bera"].
  • α::Float64: Significance level.

Returns

  • stats::DataFrame:
    • W: Test statistic,
    • pval: p-value,
    • normal: true if data is normally distributed.

Notes

The Shapiro-Wilk test calculates a $W$ statistic that tests whether a random sample $x_1, x_2, ..., x_n$ comes from a normal distribution.

The $W$ statistic is normalized ($W = (W - \mu) / \sigma$).

The null-hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes (>5000). Indeed, quoting from Wikipedia:

"Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case."

The Jarque-Bera test evaluates the null hypothesis that a real-valued vector $y$ is normally distributed. Note that the chi-squared approximation does not work well and its convergence is slow. In small samples, the test tends to be over-sized for nominal levels up to about 3% and under-sized for larger nominal levels (Mantalos, 2011).
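
For reference, the classic Jarque-Bera statistic combines the sample skewness and excess kurtosis; a minimal sketch of the textbook formula (small-sample details may differ from the exact implementation used by normality):

```julia
using Statistics, Distributions

# Classic Jarque-Bera statistic: JB = n/6 * (S² + (K − 3)²/4),
# where S is the sample skewness and K the sample kurtosis.
function jarque_bera_sketch(y::AbstractVector{<:Real})
    n = length(y)
    d = y .- mean(y)
    m2 = sum(d .^ 2) / n
    S = (sum(d .^ 3) / n) / m2^1.5
    K = (sum(d .^ 4) / n) / m2^2
    JB = n / 6 * (S^2 + (K - 3)^2 / 4)
    JB, ccdf(Chisq(2), JB)    # asymptotic χ²(2) p-value
end
```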

Note that missing values are automatically removed (casewise deletion).

References

  • Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591-611.
  • Mantalos, P. (2011). The three different measures of the sample skewness and kurtosis and the effects to the Jarque-Bera test for normality. International Journal of Computational Economics and Econometrics, 2(1).

Examples

  1. Shapiro-Wilk test on a 1D array
julia> dataset = Pingouin.read_dataset("anova")
julia> Pingouin.normality(dataset["Pain threshold"])
1×3 DataFrame
│ Row │ W        │ pval     │ normal │
│     │ Float64  │ Float64  │ Bool   │
├─────┼──────────┼──────────┼────────┤
│ 1   │ 0.971204 │ 0.800257 │ 1      │
  2. Wide-format dataframe using Jarque-Bera test
julia> dataset = Pingouin.read_dataset("mediation")
julia> Pingouin.normality(dataset, method="jarque_bera")
7×4 DataFrame
│ Row │ dv     │ W        │ pval        │ normal │
│     │ Symbol │ Float64  │ Float64     │ Bool   │
├─────┼────────┼──────────┼─────────────┼────────┤
│ 1   │ X      │ 1.42418  │ 0.490618    │ 1      │
│ 2   │ M      │ 0.645823 │ 0.724038    │ 1      │
│ 3   │ Y      │ 0.261805 │ 0.877303    │ 1      │
│ 4   │ Mbin   │ 16.6735  │ 0.000239553 │ 0      │
│ 5   │ Ybin   │ 16.6675  │ 0.000240265 │ 0      │
│ 6   │ W1     │ 5.40923  │ 0.0668961   │ 1      │
│ 7   │ W2     │ 80.6857  │ 3.01529e-18 │ 0      │
  3. Long-format dataframe
julia> dataset = Pingouin.read_dataset("rm_anova2")
julia> Pingouin.normality(dataset, :Performance, :Time)
2×4 DataFrame
│ Row │ Time   │ W        │ pval      │ normal │
│     │ String │ Float64  │ Float64   │ Bool   │
├─────┼────────┼──────────┼───────────┼────────┤
│ 1   │ Pre    │ 0.967718 │ 0.478771  │ 1      │
│ 2   │ Post   │ 0.940728 │ 0.0951576 │ 1      │
source
Pingouin.sphericity — Method
sphericity(data[, dv, within, subject, method, α])

Mauchly and JNS test for sphericity.

Arguments

  • data::DataFrame: DataFrame containing the repeated measurements. Both wide- and long-format dataframes are supported (see examples).
  • dv::Union{Nothing, String, Symbol}: Name of column containing the dependent variable.
  • within::Union{Array{String}, Array{Symbol}, Nothing, String, Symbol}: Name of column containing the within factor. If within is a list with two strings, this function tests sphericity of the interaction between the two within-subject factors.
  • subject::Union{Nothing, String, Symbol}: Name of column containing the subject identifier (only required if data is in long format).
  • method::String: Method to compute sphericity:
    • "jns": John, Nagao and Sugiura test.
    • "mauchly": Mauchly test (default).
  • α::Float64: Significance level

Returns

  • spher::Bool: True if data have the sphericity property.
  • W::Float64: Test statistic.
  • chi2::Float64: Chi-square statistic.
  • dof::Int64: Degrees of freedom.
  • pval::Float64: P-value.

Throws

DomainError: when testing for an interaction, if both within-subject factors have more than two levels (not yet supported in Pingouin).

Notes

The Mauchly $W$ statistic [1] is defined by:

\[W = \frac{\prod \lambda_j}{(\frac{1}{k-1} \sum \lambda_j)^{k-1}}\]

where $\lambda_j$ are the eigenvalues of the population covariance matrix (= double-centered sample covariance matrix) and $k$ is the number of conditions.

Then, the $W$ statistic is transformed into a chi-square score using the number of observations per condition $n$:

\[f = \frac{2(k-1)^2+k+1}{6(k-1)(n-1)}\]

\[\chi_w^2 = (f-1)(n-1) \text{log}(W)\]

The p-value is then approximated using a chi-square distribution:

\[\chi_w^2 \sim \chi^2(\frac{k(k-1)}{2}-1)\]
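
Putting the Mauchly formulas together; an illustrative sketch assuming wide-format input (rows = subjects, columns = conditions), with a hypothetical `mauchly_sketch` helper:

```julia
using Statistics, LinearAlgebra, Distributions

# Mauchly's W, its chi-square transform, and the approximate p-value,
# following the definitions above.
function mauchly_sketch(X::AbstractMatrix)
    n, k = size(X)
    S = cov(X)
    S = S .- mean(S, dims=1) .- mean(S, dims=2) .+ mean(S)  # double-center
    lam = sort(eigvals(Symmetric(S)), rev=true)[1:k-1]      # nonzero eigenvalues
    W = prod(lam) / (sum(lam) / (k - 1))^(k - 1)
    f = (2 * (k - 1)^2 + k + 1) / (6 * (k - 1) * (n - 1))
    chi2 = (f - 1) * (n - 1) * log(W)
    dof = k * (k - 1) ÷ 2 - 1
    W, chi2, dof, ccdf(Chisq(dof), chi2)
end
```

On the wide-format example below, this reproduces the documented W, chi2, dof, and p-value.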

The JNS $V$ statistic ([2], [3], [4]) is defined by:

\[V = \frac{(\sum_j^{k-1} \lambda_j)^2}{\sum_j^{k-1} \lambda_j^2}\]

\[\chi_v^2 = \frac{n}{2} (k-1)^2 (V - \frac{1}{k-1})\]

and the p-value approximated using a chi-square distribution

\[\chi_v^2 \sim \chi^2(\frac{k(k-1)}{2}-1)\]
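
And the corresponding JNS sketch under the same wide-format assumption (hypothetical `jns_sketch` helper):

```julia
using Statistics, LinearAlgebra, Distributions

# JNS V statistic and chi-square p-value from the formulas above.
function jns_sketch(X::AbstractMatrix)
    n, k = size(X)
    S = cov(X)
    S = S .- mean(S, dims=1) .- mean(S, dims=2) .+ mean(S)  # double-center
    lam = sort(eigvals(Symmetric(S)), rev=true)[1:k-1]
    V = sum(lam)^2 / sum(lam .^ 2)
    chi2 = n / 2 * (k - 1)^2 * (V - 1 / (k - 1))
    dof = k * (k - 1) ÷ 2 - 1
    chi2, ccdf(Chisq(dof), chi2)
end
```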

Missing values are automatically removed from data (listwise deletion).

References

[1] Mauchly, J. W. (1940). Significance test for sphericity of a normal n-variate distribution. The Annals of Mathematical Statistics, 11(2), 204-209.

[2] Nagao, H. (1973). On some test criteria for covariance matrix. The Annals of Statistics, 700-709.

[3] Sugiura, N. (1972). Locally best invariant test for sphericity and the limiting distributions. The Annals of Mathematical Statistics, 1312-1316.

[4] John, S. (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika, 59(1), 169-173.

See also http://www.real-statistics.com/anova-repeated-measures/sphericity/

Examples

Mauchly test for sphericity using a wide-format dataframe

julia> data = DataFrame(A = [2.2, 3.1, 4.3, 4.1, 7.2],
                     B = [1.1, 2.5, 4.1, 5.2, 6.4],
                     C = [8.2, 4.5, 3.4, 6.2, 7.2])
julia> Pingouin.sphericity(data)
│ Row │ spher │ W        │ chi2    │ dof     │ pval      │
│     │ Bool  │ Float64  │ Float64 │ Float64 │ Float64   │
├─────┼───────┼──────────┼─────────┼─────────┼───────────┤
│ 1   │ 1     │ 0.210372 │ 4.67663 │ 2.0     │ 0.0964902 │

John, Nagao and Sugiura (JNS) test

julia> Pingouin.sphericity(data, method="jns")[:pval]  # P-value only
0.045604240307520305

Now using a long-format dataframe

julia> data = Pingouin.read_dataset("rm_anova2")
julia> head(data)
6×4 DataFrame
│ Row │ Subject │ Time   │ Metric  │ Performance │
│     │ Int64   │ String │ String  │ Int64       │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1   │ 1       │ Pre    │ Product │ 13          │
│ 2   │ 2       │ Pre    │ Product │ 12          │
│ 3   │ 3       │ Pre    │ Product │ 17          │
│ 4   │ 4       │ Pre    │ Product │ 12          │
│ 5   │ 5       │ Pre    │ Product │ 19          │
│ 6   │ 6       │ Pre    │ Product │ 6           │

Let's first test sphericity for the Time within-subject factor

julia> Pingouin.sphericity(data, dv=:Performance, subject=:Subject,
                        within=:Time)
│ Row │ spher │ W       │ chi2    │ dof   │ pval    │
│     │ Bool  │ Float64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼─────────┼───────┼─────────┤
│ 1   │ 1     │ NaN     │ NaN     │ 1     │ 1.0     │

Since Time has only two levels (Pre and Post), the sphericity assumption is necessarily met.

The Metric factor, however, has three levels:

julia> Pingouin.sphericity(data, dv="Performance", subject="Subject",
                        within=["Metric"])[:pval]
0.8784417991645139

The p-value is large, and the test therefore indicates that there is no violation of sphericity.

Now, let's test sphericity of the interaction between the two repeated measures factors. The current implementation in Pingouin only works if at least one of the two within-subject factors has no more than two levels.

julia> Pingouin.sphericity(data, dv="Performance",
                        subject="Subject",
                        within=["Time", "Metric"])
│ Row │ spher │ W        │ chi2    │ dof     │ pval     │
│     │ Bool  │ Float64  │ Float64 │ Float64 │ Float64  │
├─────┼───────┼──────────┼─────────┼─────────┼──────────┤
│ 1   │ 1     │ 0.624799 │ 3.7626  │ 2.0     │ 0.152392 │

Here again, there is no violation of sphericity according to Mauchly's test.

source