Distributions

Statistical tests for multiple distributions, and tests for homoscedasticity.

Pingouin.anderson — Function
anderson(x[, dist])

Anderson-Darling test of distribution.

Arguments

  • x::Array{<:Number}: Array of sample data. Multiple arrays of possibly different lengths are also supported (see examples),
  • dist::Union{String, Distribution}: Distribution to test against: one of "norm", "expon", "logistic", "gumbel", or any Distributions.jl distribution.

Returns

  • H::Bool: true if the data can be assumed to come from the given distribution (i.e., the null hypothesis is not rejected),
  • P::Float64: p-value of the test.

(See HypothesisTests.OneSampleADTest for more details.)
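
The same test can be run directly through HypothesisTests.jl if the full test object is needed; a minimal sketch reproducing the custom-distribution case:

```julia
using HypothesisTests, Distributions

# Anderson-Darling test against a custom distribution, using the
# HypothesisTests.jl test that Pingouin.anderson builds on.
x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
t = OneSampleADTest(x, Normal(1, 5))
p = pvalue(t)   # compare against the chosen α to accept/reject
```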

Examples

  1. Test that an array comes from a normal distribution
julia> x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
julia> Pingouin.anderson(x, "norm")
(false, 8.571428568870942e-5)
  2. Test that an array comes from a custom distribution
julia> using Distributions
julia> x = [2.3, 5.1, 4.3, 2.6, 7.8, 9.2, 1.4]
julia> Pingouin.anderson(x, Normal(1,5))
(false, 0.04755873570126501)
  3. Test that two arrays come from a custom distribution
julia> using Distributions
julia> x = [[2.3, 5.1, 4.3, 2.6, 7.8], [1, 2.0, 3.5, 5.1]]
julia> Pingouin.anderson(x, Normal(1,5))
((true, true), (0.09387808627138938, 0.3413749361130328))
source
Pingouin.epsilon — Method
epsilon(data[, dv, within, subject, correction])

Epsilon adjustment factor for repeated measures.

Arguments

  • data::Union{Array{<:Number}, DataFrame}: DataFrame containing the repeated measurements. Both wide- and long-format dataframes are supported (see examples).
  • dv::Union{Symbol, String, Nothing}: Name of column containing the dependent variable.
  • within::Union{Array{String}, Array{Symbol}, Symbol, String, Nothing}: Name of column containing the within factor (only required if data is in long format). If within is a list with two strings, this function computes the epsilon factor for the interaction between the two within-subject factors.
  • subject::Union{Symbol, String, Nothing}: Name of column containing the subject identifier (only required if data is in long format).
  • correction::String: Specify the epsilon version:
    • "gg": Greenhouse-Geisser,
    • "hf": Huynh-Feldt,
    • "lb": Lower bound.

Returns

eps::Float64: Epsilon adjustment factor.

Notes

The lower bound epsilon is:

\[lb = \frac{1}{\text{dof}},\]

where the degrees of freedom $\text{dof}$ equal the number of groups $k$ minus 1 for a one-way design and $(k_1 - 1)(k_2 - 1)$ for a two-way design.

The Greenhouse-Geisser epsilon is given by:

\[\epsilon_{GG} = \frac{k^2(\overline{\text{diag}(S)} - \overline{S})^2}{(k-1)(\sum_{i=1}^{k}\sum_{j=1}^{k}s_{ij}^2 - 2k\sum_{j=1}^{k}\overline{s_i}^2 + k^2\overline{S}^2)}\]

where $S$ is the covariance matrix, $\overline{S}$ the grand mean of $S$, and $\overline{\text{diag}(S)}$ the mean of all the elements on the diagonal of $S$ (i.e., the mean of the variances).

The Huynh-Feldt epsilon is given by:

\[\epsilon_{HF} = \frac{n(k-1)\epsilon_{GG}-2}{(k-1) (n-1-(k-1)\epsilon_{GG})}\]

where $n$ is the number of observations.

Missing values are automatically removed from data (listwise deletion).
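
For intuition, the three corrections can be computed directly from the covariance matrix of wide-format data. The sketch below is illustrative (the `epsilon_sketch` helper is hypothetical, not Pingouin's implementation), assuming a matrix with subjects as rows and conditions as columns:

```julia
using Statistics, LinearAlgebra

# Epsilon corrections computed from the formulas above.
# Rows of X are subjects, columns are the repeated-measures conditions.
function epsilon_sketch(X::AbstractMatrix; correction::String="gg")
    n, k = size(X)
    correction == "lb" && return 1 / (k - 1)   # lower bound (one-way design)
    S = cov(X)                                 # covariance matrix
    mean_diag = mean(diag(S))                  # mean of the variances
    mean_S = mean(S)                           # grand mean of S
    row_means = mean(S, dims=2)
    num = k^2 * (mean_diag - mean_S)^2
    den = (k - 1) * (sum(S .^ 2) - 2k * sum(row_means .^ 2) + k^2 * mean_S^2)
    eps_gg = num / den                         # Greenhouse-Geisser
    correction == "gg" && return eps_gg
    # Huynh-Feldt, conventionally capped at 1
    min((n * (k - 1) * eps_gg - 2) / ((k - 1) * (n - 1 - (k - 1) * eps_gg)), 1.0)
end
```

On the wide-format data of the first example below, this sketch reproduces the "gg", "hf", and "lb" values shown.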

Examples

Using a wide-format dataframe

julia> data = DataFrame(A = [2.2, 3.1, 4.3, 4.1, 7.2],
                     B = [1.1, 2.5, 4.1, 5.2, 6.4],
                     C = [8.2, 4.5, 3.4, 6.2, 7.2])
julia> Pingouin.epsilon(data, correction="gg")
0.5587754577585018
julia> Pingouin.epsilon(data, correction="hf")
0.6223448311539789
julia> Pingouin.epsilon(data, correction="lb")
0.5

Now using a long-format dataframe

julia> data = Pingouin.read_dataset("rm_anova2")
julia> head(data)
6×4 DataFrame
│ Row │ Subject │ Time   │ Metric  │ Performance │
│     │ Int64   │ String │ String  │ Int64       │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1   │ 1       │ Pre    │ Product │ 13          │
│ 2   │ 2       │ Pre    │ Product │ 12          │
│ 3   │ 3       │ Pre    │ Product │ 17          │
│ 4   │ 4       │ Pre    │ Product │ 12          │
│ 5   │ 5       │ Pre    │ Product │ 19          │
│ 6   │ 6       │ Pre    │ Product │ 6           │

Let's first calculate the epsilon of the Time within-subject factor

julia> Pingouin.epsilon(data, dv="Performance", subject="Subject",
                     within="Time")
1.0

Since Time has only two levels (Pre and Post), the sphericity assumption is necessarily met, and therefore the epsilon adjustment factor is 1.

The Metric factor, however, has three levels:

julia> Pingouin.epsilon(data, dv=:Performance, subject=:Subject,
                     within=[:Metric])
0.9691029584899762

The epsilon value is very close to 1, meaning that there is no major violation of sphericity.

Now, let's calculate the epsilon for the interaction between the two repeated measures factors:

julia> Pingouin.epsilon(data, dv=:Performance, subject=:Subject,
                     within=[:Time, :Metric])
0.727166420214127
source
Pingouin.gzscore — Method
gzscore(x)

Geometric standard (Z) score.

Arguments

  • x::Array{<:Number}: Array of raw values

Returns

  • gzscore::Array{<:Number}: Array of geometric z-scores (same shape as x)

Notes

Geometric Z-scores are better measures of dispersion than arithmetic z-scores when the sample data come from a log-normally distributed population [1].

Given the raw scores $x$, the geometric mean $\mu_g$ and the geometric standard deviation $\sigma_g$, the standard score is given by the formula:

\[z = \frac{\log(x) - \log(\mu_g)}{\log(\sigma_g)}\]
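
Equivalently, the geometric z-score is an ordinary z-score computed in log space. A minimal sketch (the `gzscore_sketch` name is hypothetical; positive data are assumed so the logarithms are defined):

```julia
using Statistics

# log(μ_g) is the mean of the logs and log(σ_g) their (sample)
# standard deviation, so the formula reduces to a z-score of log.(x).
function gzscore_sketch(x::AbstractVector{<:Real})
    lx = log.(x)
    (lx .- mean(lx)) ./ std(lx)
end
```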

References

[1] https://en.wikipedia.org/wiki/Geometric_standard_deviation

Examples

Standardize a lognormal-distributed vector:

julia> raw = [1,4,5,4,1,2,5,8,6,6,9,8,3]
julia> z = Pingouin.gzscore(raw)
13-element Array{Float64,1}:
 -1.8599725059104346
  0.03137685347921089
  0.3358161014965816
  0.03137685347921089
 -1.8599725059104346
  ⋮
  0.5845610789821727
  0.5845610789821727
  1.1377453044851344
  0.9770515331740336
 -0.3611136007126501
source
Pingouin.homoscedasticity — Method
homoscedasticity(data[, dv, group, method, α])

Test equality of variance.

Arguments

  • data: Iterable. Can be either an array of iterables or a wide- or long-format dataframe.
  • dv::Union{Symbol, String, Nothing}: Dependent variable (only when $data$ is a long-format dataframe).
  • group::Union{Symbol, String, Nothing}: Grouping variable (only when $data$ is a long-format dataframe).
  • method::String: Statistical test. 'levene' (default) performs the Levene test and 'bartlett' performs the Bartlett test. The former is more robust to departure from normality.
  • α::Float64: Significance level.

Returns

  • stats::DataFrame:
    • W/T: Test statistic ('W' for Levene, 'T' for Bartlett),
    • pval: p-value,
    • equal_var: true if data has equal variance.

Notes

The Bartlett $T$ statistic [1] is defined as:

\[T = \frac{(N-k) \ln{s^{2}_{p}} - \sum_{i=1}^{k}(N_{i} - 1) \ln{s^{2}_{i}}}{1 + (1/(3(k-1)))((\sum_{i=1}^{k}{1/(N_{i} - 1))} - 1/(N-k))}\]

where $s_i^2$ is the variance of the $i^{th}$ group, $N$ is the total sample size, $N_i$ is the sample size of the $i^{th}$ group, $k$ is the number of groups, and $s_p^2$ is the pooled variance.

The pooled variance is a weighted average of the group variances and is defined as:

\[s^{2}_{p} = \sum_{i=1}^{k}(N_{i} - 1)s^{2}_{i}/(N-k)\]

The p-value is then computed using a chi-square distribution:

\[T \sim \chi^2(k-1)\]
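
The Bartlett formulas above translate almost directly into code; a minimal sketch assuming the data arrive as a vector of group samples (the `bartlett_sketch` helper is hypothetical, not Pingouin's API):

```julia
using Statistics, Distributions

# Bartlett's T statistic and chi-square p-value, following the
# definitions above. `groups` is a vector of per-group samples.
function bartlett_sketch(groups)
    k = length(groups)
    Ni = length.(groups)
    N = sum(Ni)
    si2 = var.(groups)                         # group variances
    sp2 = sum((Ni .- 1) .* si2) / (N - k)      # pooled variance
    T = ((N - k) * log(sp2) - sum((Ni .- 1) .* log.(si2))) /
        (1 + (sum(1 ./ (Ni .- 1)) - 1 / (N - k)) / (3 * (k - 1)))
    T, ccdf(Chisq(k - 1), T)                   # T ~ χ²(k−1)
end
```

On the array-of-arrays example below, this reproduces the documented T and p-value.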

The Levene $W$ statistic [2] is defined as:

\[W = \frac{(N-k)} {(k-1)} \frac{\sum_{i=1}^{k}N_{i}(\overline{Z}_{i.}-\overline{Z})^{2} } {\sum_{i=1}^{k}\sum_{j=1}^{N_i}(Z_{ij}-\overline{Z}_{i.})^{2} }\]

where $Z_{ij} = |Y_{ij} - \text{median}({Y}_{i.})|$, $\overline{Z}_{i.}$ are the group means of $Z_{ij}$ and $\overline{Z}$ is the grand mean of $Z_{ij}$.

The p-value is then computed using a F-distribution:

\[W \sim F(k-1, N-k)\]
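
A matching sketch for the Levene (Brown-Forsythe) statistic, under the same vector-of-groups assumption (the `levene_sketch` helper is hypothetical):

```julia
using Statistics, Distributions

# Levene's W with absolute deviations from the group medians
# (the Brown-Forsythe variant described above).
function levene_sketch(groups)
    k = length(groups)
    Ni = length.(groups)
    N = sum(Ni)
    Z = [abs.(g .- median(g)) for g in groups]   # Z_ij
    Zi = mean.(Z)                                # group means of Z
    Zbar = sum(sum.(Z)) / N                      # grand mean of Z
    num = sum(Ni .* (Zi .- Zbar) .^ 2)
    den = sum(sum((z .- zi) .^ 2) for (z, zi) in zip(Z, Zi))
    W = (N - k) / (k - 1) * num / den
    W, ccdf(FDist(k - 1, N - k), W)              # W ~ F(k−1, N−k)
end
```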

WARNING: Missing values are not supported for this function. Make sure to remove them beforehand.

References

[1] Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A, 160(901), 268-282.

[2] Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364-367.

Examples

  1. Levene test on a wide-format dataframe
julia> data = Pingouin.read_dataset("mediation");
julia> Pingouin.homoscedasticity(data[:, ["X", "Y", "M"]])
1×3 DataFrame
│ Row │ W       │ pval     │ equal_var │
│     │ Float64 │ Float64  │ Bool      │
├─────┼─────────┼──────────┼───────────┤
│ 1   │ 1.17352 │ 0.310707 │ 1         │
  2. Bartlett test using an array of arrays
julia> data = [[4, 8, 9, 20, 14], [5, 8, 15, 45, 12]];
julia> Pingouin.homoscedasticity(data, method="bartlett", α=.05)
1×3 DataFrame
│ Row │ T       │ pval     │ equal_var │
│     │ Float64 │ Float64  │ Bool      │
├─────┼─────────┼──────────┼───────────┤
│ 1   │ 2.87357 │ 0.090045 │ 1         │
  3. Long-format dataframe
julia> data = Pingouin.read_dataset("rm_anova2");
julia> Pingouin.homoscedasticity(data, "Performance", "Time")
1×3 DataFrame
│ Row │ W       │ pval      │ equal_var │
│     │ Float64 │ Float64   │ Bool      │
├─────┼─────────┼───────────┼───────────┤
│ 1   │ 3.1922  │ 0.0792169 │ 1         │
source
Pingouin.normality — Method
normality(data[, dv, group, method, α])

Univariate normality test.

Arguments

  • data: Iterable. Can be either a single list, 1D array, or a wide- or long-format dataframe.
  • dv::Union{Symbol, String, Nothing}: Dependent variable (only when $data$ is a long-format dataframe).
  • group::Union{Symbol, String, Nothing}: Grouping variable (only when $data$ is a long-format dataframe).
  • method::String: Normality test. 'shapiro' (default) performs the Shapiro-Wilk test using the AS R94 algorithm. If the kurtosis is higher than 3, it performs a Shapiro-Francia test for leptokurtic distributions. Supported values: ["shapiro", "jarque_bera"].
  • α::Float64: Significance level.

Returns

  • stats::DataFrame:
    • W: Test statistic,
    • pval: p-value,
    • normal: true if data is normally distributed.

Notes

The Shapiro-Wilk test calculates a $W$ statistic that tests whether a random sample $x_1, x_2, ..., x_n$ comes from a normal distribution.

The $W$ statistic is normalized ($W = (W - \mu) / \sigma$).

The null-hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes (>5000). Indeed, quoting from Wikipedia:

"Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case."

The Jarque-Bera test evaluates the null hypothesis that a real-valued vector $y$ is normally distributed. Note that the chi-squared approximation does not work well and its convergence is slow. In small samples, the test tends to be over-sized for nominal levels up to about 3% and under-sized for larger nominal levels (Mantalos, 2011).
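
For reference, the classic Jarque-Bera statistic combines the sample skewness and excess kurtosis; a minimal sketch of the textbook formula (small-sample details may differ from the exact implementation used by normality):

```julia
using Statistics, Distributions

# Classic Jarque-Bera statistic: JB = n/6 * (S² + (K − 3)²/4),
# where S is the sample skewness and K the sample kurtosis.
function jarque_bera_sketch(y::AbstractVector{<:Real})
    n = length(y)
    d = y .- mean(y)
    m2 = sum(d .^ 2) / n
    S = (sum(d .^ 3) / n) / m2^1.5
    K = (sum(d .^ 4) / n) / m2^2
    JB = n / 6 * (S^2 + (K - 3)^2 / 4)
    JB, ccdf(Chisq(2), JB)    # asymptotic χ²(2) p-value
end
```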

Note that missing values are automatically removed (casewise deletion).

References

  • Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591-611.
  • Mantalos, P. (2011). The three different measures of the sample skewness and kurtosis and the effects to the Jarque-Bera test for normality. International Journal of Computational Economics and Econometrics, 2(1).

Examples

  1. Shapiro-Wilk test on a 1D array
julia> dataset = Pingouin.read_dataset("anova")
julia> Pingouin.normality(dataset["Pain threshold"])
1×3 DataFrame
│ Row │ W        │ pval     │ normal │
│     │ Float64  │ Float64  │ Bool   │
├─────┼──────────┼──────────┼────────┤
│ 1   │ 0.971204 │ 0.800257 │ 1      │
  2. Wide-format dataframe using Jarque-Bera test
julia> dataset = Pingouin.read_dataset("mediation")
julia> Pingouin.normality(dataset, method="jarque_bera")
7×4 DataFrame
│ Row │ dv     │ W        │ pval        │ normal │
│     │ Symbol │ Float64  │ Float64     │ Bool   │
├─────┼────────┼──────────┼─────────────┼────────┤
│ 1   │ X      │ 1.42418  │ 0.490618    │ 1      │
│ 2   │ M      │ 0.645823 │ 0.724038    │ 1      │
│ 3   │ Y      │ 0.261805 │ 0.877303    │ 1      │
│ 4   │ Mbin   │ 16.6735  │ 0.000239553 │ 0      │
│ 5   │ Ybin   │ 16.6675  │ 0.000240265 │ 0      │
│ 6   │ W1     │ 5.40923  │ 0.0668961   │ 1      │
│ 7   │ W2     │ 80.6857  │ 3.01529e-18 │ 0      │
  3. Long-format dataframe
julia> dataset = Pingouin.read_dataset("rm_anova2")
julia> Pingouin.normality(dataset, :Performance, :Time)
2×4 DataFrame
│ Row │ Time   │ W        │ pval      │ normal │
│     │ String │ Float64  │ Float64   │ Bool   │
├─────┼────────┼──────────┼───────────┼────────┤
│ 1   │ Pre    │ 0.967718 │ 0.478771  │ 1      │
│ 2   │ Post   │ 0.940728 │ 0.0951576 │ 1      │
source
Pingouin.sphericity — Method
sphericity(data[, dv, within, subject, method, α])

Mauchly and JNS test for sphericity.

Arguments

  • data::DataFrame: DataFrame containing the repeated measurements. Both wide- and long-format dataframes are supported (see examples).
  • dv::Union{Nothing, String, Symbol}: Name of column containing the dependent variable.
  • within::Union{Array{String}, Array{Symbol}, Nothing, String, Symbol}: Name of column containing the within factor. If within is a list with two strings, this function tests sphericity of the interaction between the two within-subject factors.
  • subject::Union{Nothing, String, Symbol}: Name of column containing the subject identifier (only required if data is in long format).
  • method::String: Method to compute sphericity:
    • "jns": John, Nagao and Sugiura test.
    • "mauchly": Mauchly test (default).
  • α::Float64: Significance level

Returns

  • spher::Bool: True if data have the sphericity property.
  • W::Float64: Test statistic.
  • chi2::Float64: Chi-square statistic.
  • dof::Int64: Degrees of freedom.
  • pval::Float64: P-value.

Throws

DomainError: when testing for an interaction, if both within-subject factors have more than two levels (not yet supported in Pingouin).

Notes

The Mauchly $W$ statistic [1] is defined by:

\[W = \frac{\prod \lambda_j}{(\frac{1}{k-1} \sum \lambda_j)^{k-1}}\]

where $\lambda_j$ are the eigenvalues of the population covariance matrix (= double-centered sample covariance matrix) and $k$ is the number of conditions.

Then, the $W$ statistic is transformed into a chi-square score using the number of observations per condition $n$:

\[f = \frac{2(k-1)^2+k+1}{6(k-1)(n-1)}\]

\[\chi_w^2 = (f-1)(n-1) \text{log}(W)\]

The p-value is then approximated using a chi-square distribution:

\[\chi_w^2 \sim \chi^2(\frac{k(k-1)}{2}-1)\]
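
Putting the Mauchly formulas together; an illustrative sketch assuming wide-format input (rows = subjects, columns = conditions), with a hypothetical `mauchly_sketch` helper:

```julia
using Statistics, LinearAlgebra, Distributions

# Mauchly's W, its chi-square transform, and the approximate p-value,
# following the definitions above.
function mauchly_sketch(X::AbstractMatrix)
    n, k = size(X)
    S = cov(X)
    S = S .- mean(S, dims=1) .- mean(S, dims=2) .+ mean(S)  # double-center
    lam = sort(eigvals(Symmetric(S)), rev=true)[1:k-1]      # nonzero eigenvalues
    W = prod(lam) / (sum(lam) / (k - 1))^(k - 1)
    f = (2 * (k - 1)^2 + k + 1) / (6 * (k - 1) * (n - 1))
    chi2 = (f - 1) * (n - 1) * log(W)
    dof = k * (k - 1) ÷ 2 - 1
    W, chi2, dof, ccdf(Chisq(dof), chi2)
end
```

On the wide-format example below, this reproduces the documented W, chi2, dof, and p-value.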

The JNS $V$ statistic ([2], [3], [4]) is defined by:

\[V = \frac{(\sum_j^{k-1} \lambda_j)^2}{\sum_j^{k-1} \lambda_j^2}\]

\[\chi_v^2 = \frac{n}{2} (k-1)^2 (V - \frac{1}{k-1})\]

and the p-value approximated using a chi-square distribution

\[\chi_v^2 \sim \chi^2(\frac{k(k-1)}{2}-1)\]
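
And the corresponding JNS sketch under the same wide-format assumption (hypothetical `jns_sketch` helper):

```julia
using Statistics, LinearAlgebra, Distributions

# JNS V statistic and chi-square p-value from the formulas above.
function jns_sketch(X::AbstractMatrix)
    n, k = size(X)
    S = cov(X)
    S = S .- mean(S, dims=1) .- mean(S, dims=2) .+ mean(S)  # double-center
    lam = sort(eigvals(Symmetric(S)), rev=true)[1:k-1]
    V = sum(lam)^2 / sum(lam .^ 2)
    chi2 = n / 2 * (k - 1)^2 * (V - 1 / (k - 1))
    dof = k * (k - 1) ÷ 2 - 1
    chi2, ccdf(Chisq(dof), chi2)
end
```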

Missing values are automatically removed from data (listwise deletion).

References

[1] Mauchly, J. W. (1940). Significance test for sphericity of a normal n-variate distribution. The Annals of Mathematical Statistics, 11(2), 204-209.

[2] Nagao, H. (1973). On some test criteria for covariance matrix. The Annals of Statistics, 700-709.

[3] Sugiura, N. (1972). Locally best invariant test for sphericity and the limiting distributions. The Annals of Mathematical Statistics, 1312-1316.

[4] John, S. (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika, 59(1), 169-173.

See also http://www.real-statistics.com/anova-repeated-measures/sphericity/

Examples

Mauchly test for sphericity using a wide-format dataframe

julia> data = DataFrame(A = [2.2, 3.1, 4.3, 4.1, 7.2],
                     B = [1.1, 2.5, 4.1, 5.2, 6.4],
                     C = [8.2, 4.5, 3.4, 6.2, 7.2])
julia> Pingouin.sphericity(data)
│ Row │ spher │ W        │ chi2    │ dof     │ pval      │
│     │ Bool  │ Float64  │ Float64 │ Float64 │ Float64   │
├─────┼───────┼──────────┼─────────┼─────────┼───────────┤
│ 1   │ 1     │ 0.210372 │ 4.67663 │ 2.0     │ 0.0964902 │

John, Nagao and Sugiura (JNS) test

julia> Pingouin.sphericity(data, method="jns")[:pval]  # P-value only
0.045604240307520305

Now using a long-format dataframe

julia> data = Pingouin.read_dataset("rm_anova2")
julia> head(data)
6×4 DataFrame
│ Row │ Subject │ Time   │ Metric  │ Performance │
│     │ Int64   │ String │ String  │ Int64       │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1   │ 1       │ Pre    │ Product │ 13          │
│ 2   │ 2       │ Pre    │ Product │ 12          │
│ 3   │ 3       │ Pre    │ Product │ 17          │
│ 4   │ 4       │ Pre    │ Product │ 12          │
│ 5   │ 5       │ Pre    │ Product │ 19          │
│ 6   │ 6       │ Pre    │ Product │ 6           │

Let's first test sphericity for the Time within-subject factor

julia> Pingouin.sphericity(data, dv=:Performance, subject=:Subject,
                        within=:Time)
│ Row │ spher │ W       │ chi2    │ dof   │ pval    │
│     │ Bool  │ Float64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼─────────┼───────┼─────────┤
│ 1   │ 1     │ NaN     │ NaN     │ 1     │ 1.0     │

Since Time has only two levels (Pre and Post), the sphericity assumption is necessarily met.

The Metric factor, however, has three levels:

julia> Pingouin.sphericity(data, dv="Performance", subject="Subject",
                        within=["Metric"])[:pval]
0.8784417991645139

The p-value is large, and the test therefore indicates that there is no violation of sphericity.

Now, let's test sphericity of the interaction between the two repeated measures factors. The current implementation in Pingouin only works if at least one of the two within-subject factors has no more than two levels.

julia> Pingouin.sphericity(data, dv="Performance",
                        subject="Subject",
                        within=["Time", "Metric"])
│ Row │ spher │ W        │ chi2    │ dof     │ pval     │
│     │ Bool  │ Float64  │ Float64 │ Float64 │ Float64  │
├─────┼───────┼──────────┼─────────┼─────────┼──────────┤
│ 1   │ 1     │ 0.624799 │ 3.7626  │ 2.0     │ 0.152392 │

Here again, there is no violation of sphericity according to Mauchly's test.

source