Parametric and nonparametric statistics

Parametric statistics is a branch of statistics that assumes sample data come from a population following a probability distribution defined by a fixed set of parameters. Most well-known elementary statistical methods are parametric. Conversely, a non-parametric model differs precisely in that its parameter set (or feature set, in machine learning) is not fixed and can increase, or even decrease, as new relevant information is collected.

Because it relies on a fixed parameter set, a parametric model assumes more about a given population than non-parametric methods do. When those assumptions are correct, parametric methods produce more accurate and precise estimates than non-parametric methods, i.e. they have more statistical power. Because more is assumed, parametric methods have a greater chance of failing when the assumptions are not correct, and for this reason they are not robust statistical methods. On the other hand, parametric formulae are often simpler to write down and faster to compute, and this simplicity can make up for the lack of robustness, especially if care is taken to examine diagnostic statistics.

Nonparametric statistics are statistics not based on parameterized families of probability distributions. They include both descriptive and inferential statistics. The typical parameters are the mean, variance, etc. Unlike parametric statistics, nonparametric statistics make no assumptions about the probability distributions of the variables being assessed. The difference between parametric and non-parametric models is that the former have a fixed number of parameters, while the latter grow the number of parameters with the amount of training data. Note that "non-parametric" does not mean the model has no parameters; rather, the number and nature of the parameters are flexible and determined by the training data rather than fixed in advance.
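
To make the contrast concrete, the following Python sketch (assuming NumPy and SciPy are installed; the sample is made up) fits a parametric model, a normal distribution summarized by two parameters, alongside a nonparametric kernel density estimate whose complexity grows with the number of observations:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical skewed sample whose true distribution the analyst does not know.
    sample = rng.gamma(shape=2.0, scale=1.5, size=500)

    # Parametric: assume a normal population, so the whole model is a fixed
    # set of two parameters (mean and standard deviation).
    mu, sigma = sample.mean(), sample.std(ddof=1)
    parametric = stats.norm(loc=mu, scale=sigma)

    # Nonparametric: a kernel density estimate keeps one kernel per observation,
    # so its effective parameter set grows with the data.
    kde = stats.gaussian_kde(sample)

    print("parametric density at x = 2:", parametric.pdf(2.0))
    print("KDE density at x = 2:       ", kde(2.0)[0])

Because the sample is skewed, the normal fit places probability mass at negative values, while the kernel estimate adapts to the shape of the data at the cost of more computation.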

Properties of non-parametric testing in comparison with parametric testing:

  • Wider range of application.
  • More robust.
  • Simplicity (model structure is not specified a priori but is instead determined from data).
  • A larger sample size may be required to draw conclusions with the same degree of confidence.
  • Less powerful than the applicable parametric test, if one exists (see the simulation sketch after this list).
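
A minimal Python simulation (assuming NumPy and SciPy; the sample sizes and effect size are illustrative assumptions) shows this trade-off: on normal data the parametric two-sample t-test rejects a false null hypothesis slightly more often than the Mann-Whitney U test, while on heavy-tailed data the t-test loses much of its advantage and the rank-based test tends to do better:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def rejection_rate(sampler, n_sims=2000, alpha=0.05):
        # Fraction of simulated data sets in which each test detects the group difference.
        t_hits = u_hits = 0
        for _ in range(n_sims):
            a, b = sampler()
            t_hits += stats.ttest_ind(a, b).pvalue < alpha
            u_hits += stats.mannwhitneyu(a, b).pvalue < alpha
        return t_hits / n_sims, u_hits / n_sims

    # Normal data: t-test assumptions hold, so it should be (slightly) more powerful.
    normal = lambda: (rng.normal(0.0, 1.0, 30), rng.normal(0.5, 1.0, 30))
    # Heavy-tailed (Cauchy) data: t-test assumptions fail, ranks are more robust.
    cauchy = lambda: (rng.standard_cauchy(30), rng.standard_cauchy(30) + 0.5)

    print("normal data (t-test, Mann-Whitney):", rejection_rate(normal))
    print("Cauchy data (t-test, Mann-Whitney):", rejection_rate(cauchy))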

Applications

Applications of non-parametric methods:

  • Studying populations that take on a ranked order (such as movie reviews receiving one to four stars; see the example after this list)
  • The use of non-parametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences. In terms of levels of measurement, non-parametric methods result in “ordinal” data.
  • Situations where less is known about the application in question
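
For example (the ratings below are made up), one-to-four-star reviews can be compared with rank-based tools in Python without assuming the stars behave like interval-scale numbers:

    from scipy import stats

    # Hypothetical star ratings (1-4) given to the same ten films by two review sites.
    site_a = [4, 3, 3, 2, 4, 1, 3, 2, 4, 3]
    site_b = [3, 3, 2, 2, 4, 1, 2, 1, 4, 2]

    # Spearman's rank correlation uses only the ordering of the ratings, so it does
    # not require "4 stars" to mean twice as much as "2 stars".
    rho, p_value = stats.spearmanr(site_a, site_b)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")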

Models

Non-parametric models:

  • A histogram is a simple nonparametric estimate of a probability distribution.
  • Kernel density estimation provides better estimates of the density than histograms.
  • Nonparametric regression and semiparametric regression methods have been developed based on kernels, splines, and wavelets.
  • Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis without any distributional assumption.
  • A k-nearest-neighbours (KNN) classifier labels an unseen instance based on the k points in the training set that are nearest to it (a short sketch follows this list).
  • A support vector machine (with a Gaussian kernel) is a nonparametric large-margin classifier.
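
As a sketch of the nearest-neighbour idea (the training data are synthetic), a KNN classifier can be written in a few lines of Python with NumPy; its "parameters" are simply the stored training points, so the model grows with the data:

    import numpy as np

    def knn_predict(train_X, train_y, query, k=3):
        # Classify one query point by majority vote among its k nearest training points.
        distances = np.linalg.norm(train_X - query, axis=1)
        nearest = np.argsort(distances)[:k]
        return np.bincount(train_y[nearest]).argmax()

    # Synthetic 2-D training data: class 0 clustered near (0, 0), class 1 near (3, 3).
    rng = np.random.default_rng(7)
    train_X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
    train_y = np.array([0] * 20 + [1] * 20)

    print(knn_predict(train_X, train_y, np.array([2.5, 2.8])))  # expected: class 1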

Methods

The best-known non-parametric methods include the following (a few of them are demonstrated in a short Python sketch after the list):

  • Analysis of similarities
  • Anderson-Darling test: tests whether a sample is drawn from a given distribution
  • Statistical bootstrap methods: estimate the accuracy/sampling distribution of a statistic
  • Cochran’s Q: tests whether k treatments in randomized block designs with 0/1 outcomes have identical effects
  • Cohen’s kappa: measures inter-rater agreement for categorical items
  • Friedman two-way analysis of variance by ranks: tests whether k treatments in randomized block designs have identical effects
  • Kaplan-Meier: estimates the survival function from lifetime data, modeling censoring
  • Kendall’s tau: measures statistical dependence between two variables
  • Kendall’s W: a measure between 0 and 1 of inter-rater agreement
  • Kolmogorov-Smirnov test: tests whether a sample is drawn from a given distribution, or whether two samples are drawn from the same distribution
  • Kruskal-Wallis one-way analysis of variance by ranks: tests whether > 2 independent samples are drawn from the same distribution
  • Kuiper’s test: tests whether a sample is drawn from a given distribution, sensitive to cyclic variations such as day of the week
  • Logrank test: compares survival distributions of two right-skewed, censored samples
  • Mann-Whitney U or Wilcoxon rank sum test: tests whether two samples are drawn from the same distribution, as compared to a given alternative hypothesis.
  • McNemar’s test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and matched pairs of subjects, row and column marginal frequencies are equal
  • Median test: tests whether two samples are drawn from distributions with equal medians
  • Pitman’s permutation test: a statistical significance test that yields exact p values by examining all possible rearrangements of labels
  • Rank products: detects differentially expressed genes in replicated microarray experiments
  • Siegel-Tukey test: tests for differences in scale between two groups
  • Sign test: tests whether matched pair samples are drawn from distributions with equal medians
  • Spearman’s rank correlation coefficient: measures statistical dependence between two variables using a monotonic function
  • Squared ranks test: tests equality of variances in two or more samples
  • Tukey-Duckworth test: tests equality of two distributions by using ranks
  • Wald-Wolfowitz runs test: tests whether the elements of a sequence are mutually independent/random
  • Wilcoxon signed-rank test: tests whether matched pair samples are drawn from populations with different mean ranks
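
Many of the tests above are available directly in SciPy. The short Python sketch below (with made-up samples; the bootstrap call needs SciPy 1.7 or newer) runs three of them: the Kolmogorov-Smirnov test, the Kruskal-Wallis test, and a bootstrap confidence interval for a mean:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    a = rng.exponential(scale=1.0, size=40)   # hypothetical sample 1
    b = rng.exponential(scale=1.3, size=40)   # hypothetical sample 2
    c = rng.exponential(scale=2.0, size=40)   # hypothetical sample 3

    # Kolmogorov-Smirnov: is sample `a` drawn from a standard exponential distribution?
    print(stats.kstest(a, stats.expon.cdf))

    # Kruskal-Wallis: are the three independent samples drawn from the same distribution?
    print(stats.kruskal(a, b, c))

    # Bootstrap: nonparametric 95% confidence interval for the mean of `a`.
    print(stats.bootstrap((a,), np.mean, confidence_level=0.95).confidence_interval)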

See also

Statistical testing