Glossary of Terms (Part 2)
- negative
binomial distribution
- A discrete probability distribution useful for
characterizing the number of Bernoulli trials needed to obtain a specified number of successes.
For example, suppose machine parts are characterized as
defective or non-defective, and let the probability of a
defective part equal p. If you begin testing a sample of
parts to find a defective, then the number of parts which
must be tested before you find k defective parts follows
a negative binomial distribution. The geometric
distribution is a special case of the negative
binomial distribution where k=1. The negative binomial is
sometimes called the Pascal distribution.
Parameters: event probability 0<=p<=1, number of
successes k (positive integer)
Domain: X=k,k+1,k+2,...
Mean: k/p
Variance: k(1-p)/p^2
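The mean and variance can be checked numerically from the probability mass function. The following is a minimal sketch, not part of STATLETS: the function names and the truncation point x_max are assumptions made for illustration, and the results are compared with the standard values k/p and k(1-p)/p^2.

```python
from math import comb

def neg_binomial_pmf(x, k, p):
    """Probability that the k-th success occurs on trial number x (x >= k)."""
    if x < k:
        return 0.0
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)

def moments(k, p, x_max=2000):
    """Approximate the mean and variance by summing the PMF up to x_max."""
    xs = range(k, x_max + 1)
    mean = sum(x * neg_binomial_pmf(x, k, p) for x in xs)
    second = sum(x * x * neg_binomial_pmf(x, k, p) for x in xs)
    return mean, second - mean**2

# Hypothetical example: wait for k = 3 defectives when p = 0.2.
mean, var = moments(k=3, p=0.2)
print(mean, 3 / 0.2)               # both close to 15
print(var, 3 * 0.8 / 0.2**2)       # both close to 60
```

With k = 1 the same functions describe the geometric distribution.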
-
- nested model
- A model in which the various factors are contained
within one another in a specific hierarchical order. A
typical example is a study in which 15 batches are
selected, 5 samples are taken from each batch, and 2
measurements are made on each sample. In that case,
samples are said to be nested within batches and
measurements are nested within samples.
-
- Newman-Keuls
procedure
- A multiple comparisons procedure which allows you to
compare all pairs of means while controlling the overall alpha risk at a
specified level. It is a multiple-stage procedure based
on the Studentized range distribution. While it does not
provide interval estimates of the difference between each
pair of means, it does indicate which means are
significantly different from which others.
-
- nominal type
variable
- A type of variable for which there is no natural ordering
to the values which it can take. This is in contrast to
an ordinal variable.
-
- nonlinear models
- Models in which the terms do not enter in a purely
additive fashion. The Simple Regression Statlet allows
you to estimate various nonlinear models. Each of these
models can be transformed to a linear model by
transforming Y, X, or both.
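As an illustration of such a transformation, the reciprocal-X model Y = a + b/X becomes a straight line when Y is regressed on 1/X. The sketch below uses plain Python with hypothetical helper names (fit_line, fit_reciprocal_x); it is not the Statlet's own code.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = a + b*xs; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def fit_reciprocal_x(xs, ys):
    """Fit Y = a + b/X by regressing Y on the transformed variable 1/X."""
    return fit_line([1.0 / x for x in xs], ys)

# Made-up data generated exactly from a = 1, b = 2.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [1.0 + 2.0 / x for x in xs]
a, b = fit_reciprocal_x(xs, ys)
print(a, b)
```

Other models in the family are handled the same way, by transforming Y, X, or both before the linear fit.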
-
- nonparametric
methods
- Methods which test hypotheses using data samples without
making any rigid assumptions about the type of
distribution from which the data come. Most of the classical methods,
such as confidence intervals for the mean, t tests, and
least squares regression, assume that the data come from
a normal distribution. STATLETS contains various
nonparametric methods which can be employed to verify the
correctness of the classical methods, or as alternatives
to them. Many nonparametric tests are based on the ranks of the data rather than their
actual values. These tests are contained in various
Statlets throughout the package.
-
- nonstationarity
- A characteristic of a time series
for which the distribution changes over time.
-
- normal distribution
- A continuous probability distribution which is used to
characterize a wide variety of types of data. It is a symmetric distribution,
shaped like a bell, and is completely determined by its
mean and standard deviation. The normal distribution is
particularly important in statistics because of the
tendency for sample means to follow the normal
distribution (this is a result of the Central Limit
Theorem).
Most classical statistics procedures such as confidence
intervals rely on results from the normal
distribution. The normal is also known as the Gaussian
distribution, after Carl Friedrich Gauss.
Parameters: mean mu, standard deviation sigma>0
Domain: all real X
Mean: mu
Variance: sigma^2
-
- observation
- Repeated values of a data variable. The rows of a column
represent the observations.
-
- order statistics
- For a sample of data, the data values arranged from
smallest to largest. For example, the "first order
statistic" is equal to the minimum. For a sample of
size n, the "n-th order statistic" is equal to
the maximum.
Order statistics are used to measure the center of a data
set through the median,
data spread through the interquartile
range, and to judge whether data comes from a normal
distribution using a normal probability plot.
Certain nonparametric
tests are also based on the sample order statistics.
-
- ordinal
- A type of variable for which there is a natural ordering
to the values which it can take. It does not necessarily
have to be numeric. For example, the response to a
question on a survey is ordinal if it can take the values
"disagree strongly", "disagree",
"agree", and "agree strongly".
-
- outlier
- A data value which is unusual with respect to the group
of data in which it is found. It may be a single isolated
value far away from all the others, or a value which does
not follow the general pattern of the rest. Most classical
statistical techniques tend to be quite sensitive to
outliers, so that it is important to be on the alert for
them. Graphical techniques, particularly residual plots, are very
helpful in detecting the presence of outliers.
Some of the newer Exploratory
Data Analysis techniques and nonparametric procedures
are much less sensitive to outliers (such procedures are
said to be robust).
-
- outside point
- A point lying unusually far from the center of a sample
of data values. A point is called "outside" if
it lies more than 1.5 times the interquartile
range below the lower quartile
or above the upper quartile. It is called "far
outside" if it lies more than 3 times the
interquartile range away from the nearer quartile.
Box-and-whisker plots and stem-and-leaf displays routinely
list outside points.
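The 1.5 and 3 times interquartile range rules can be sketched as follows. Quartile definitions differ between packages; this example assumes the "inclusive" method of Python's statistics.quantiles, which may not match the quartiles STATLETS computes, and the function name classify_outside is hypothetical.

```python
from statistics import quantiles

def classify_outside(data):
    """Label points lying more than 1.5 or 3 IQRs beyond the nearer quartile."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    result = {"outside": [], "far outside": []}
    for x in sorted(data):
        # Distance beyond the nearer quartile (negative inside the box).
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            result["far outside"].append(x)
        elif distance > 1.5 * iqr:
            result["outside"].append(x)
    return result

# Made-up sample: quartiles are 3.5 and 8.5, so the IQR is 5.
flags = classify_outside([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100])
print(flags)
```

In this sample, 20 lies 11.5 above the upper quartile (more than 7.5 but not more than 15), while 100 lies far beyond 3 times the interquartile range.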
- P value
- A P-value indicates how unusual a computed test statistic
is compared with what would be expected under the null
hypothesis. A small value indicates that the null
hypothesis should be rejected at any significance level above
the calculated value. For example, if the P value equals
.0246, we would reject the null hypothesis at the 5%
significance level, but would not reject it at the 1%
significance level. P values are printed in procedures
such as Multiple Regression to determine whether the
estimated coefficients are significantly different from
zero.
- parameter
- A numeric value which characterizes a probability distribution.
The mean and variance are typical examples. Statistics are used to estimate
parameters.
-
- Pareto distribution
- A distribution used for random variables which are
constrained to be greater than or equal to 1.
Parameters: shape a>0
Domain: X>=1
Mean: a/(a-1) for a>1
Variance: [a/(a-2)] - [a/(a-1)]^2 for a>2
-
- partial
autocorrelation
- An estimate of the correlation between the data value at
time t and the data value at time (t-k), having accounted
for the correlation between the value at time t and the
values at lags less than k. Partial autocorrelations are
used when constructing ARIMA models for time series data.
For each partial autocorrelation, a corresponding standard error is calculated.
If the time series is random, all of the partial
autocorrelations should be within approximately +/- 2
standard errors. When constructing an autoregressive
model, a partial autocorrelation estimate extending
beyond this distance indicates the need for a coefficient
at the indicated time lag.
-
- partial correlation
- A measure of the strength of the relationship between two
numeric variables having accounted for their
joint relationship with one or more additional variables.
On a scale of -1 to +1, it measures the extent of the
unique correlation
between the two variables which is not shared with the
other variables.
- Pascal distribution
- Another name for the negative binomial
distribution.
-
- Pearson
correlation coefficient
- A statistic which measures the linear correlation between
two numeric variables. It ranges from -1 for perfect
negative correlation to +1 for perfect positive
correlation. It is calculated by dividing the covariance of the
variables by the square root of the product of their variances.
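The calculation described above can be written out directly. This is a minimal sketch in plain Python; the function name pearson_r and the data are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Covariance divided by the square root of the product of the variances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    return cov / sqrt(var_x * var_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive correlation
print(pearson_r([1, 2, 3], [3, 2, 1]))         # perfect negative correlation
```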
-
- Pearson's R
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It measures the degree of
association between the row and column variables using
the common correlation
coefficient. It ranges between -1 and +1 and is
relevant only if both variables are of interval type.
A P value is computed to test the
null hypothesis that the correlation equals 0.
-
- percentiles
- The p-th percentile of a set of data is that value which
is greater than or equal to at least p% of the data and which
is less than at most (100-p)% of the data. The lower quartile, median, and upper quartile
are the same as the 25th, 50th, and 75th percentiles.
-
- Poisson
distribution
- A distribution often used to express probabilities
concerning the number of events per unit. For example,
the number of computer malfunctions per year, or the
number of bubbles per square yard in a sheet of glass,
might follow a Poisson distribution. The distribution is
fully characterized by its mean, usually expressed in
terms of a rate.
Parameters: mean B>0
Domain: X=0,1,2,...
Mean: B
Variance: B
-
- pooled variance
- A single estimate of the common variance obtained by
combining the estimated variances from each of several
samples. When comparing several samples from different
populations, it is common to assume that the variances of the populations
are the same. Normally, the pooled estimate is obtained as a
weighted average based upon the degrees of freedom
in each of the samples.
Pooling of variance is justified only if one is confident
that the groups have equal variances. Tests for equality
of variance should be performed first to establish that
the variances are equal. If the variances are unequal,
separate variance estimates should be used for the
groups.
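The weighted average by degrees of freedom can be sketched as below. The function name pooled_variance and the two samples are made up for illustration; each sample contributes its variance weighted by (sample size - 1).

```python
from statistics import variance

def pooled_variance(samples):
    """Average the sample variances, weighting each by its degrees of freedom."""
    dfs = [len(s) - 1 for s in samples]
    weighted = sum(df * variance(s) for df, s in zip(dfs, samples))
    return weighted / sum(dfs)

# Variances are 1 (2 d.f.) and 20/3 (3 d.f.), so the pooled value is 22/5.
print(pooled_variance([[1, 2, 3], [2, 4, 6, 8]]))
```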
-
- population
- The set of all the observations which we wish to draw
conclusions about. A random
sample of observations is normally assumed to have
been drawn from a much larger population.
-
- power
- The probability of rejecting a null hypothesis when it is
false. Good tests have high power. For more details, see hypothesis tests.
-
- predicted values
- The values predicted by a statistical
model which was fit to the data. For example, when
fitting a line to values for X and Y, the predicted
values refer to the location of the line at a given value
of X.
-
- prediction limits
- Limits displayed in the Regression Statlets to show the
precision with which the fitted model can predict new
observations. If the limits are calculated at a
confidence level of 95%, then we would expect 95% of all
new observations to fall within these limits.
-
- probability
- A number between 0 and 1 which represents how likely an
event is to occur. Events with probability equal to 0
never occur. Events with probability equal to 1 always
occur.
In data analysis, probability is normally defined in
terms of the relative frequency of occurrence of an event
which can be repeated many times. For example, if you
repeatedly sample temperatures from a process and get
values below 150 degrees half the time, then the
"probability" of getting a reading below 150
degrees is equal to 0.5 or 50%.
In daily life, we sometimes use probability in a
different sense, i.e., to express our degree of belief
about the likelihood of an event which cannot be
repeated indefinitely under identical conditions. For
example, you might say that the chance of getting a raise
this year is "one in a million". Such
"subjective" probabilities are used in
statistical decision theory.
-
- probability
distribution
- A mathematical model describing the probability of
observing various values of a random
variable.
- probability
mass function
- For a discrete
distribution, the probability of obtaining each possible
value of the random variable.
-
- pure error
- Variability between observations made at the same values
of the independent
variable or variables.
- quantile plot
- A plot of the percentiles for
a sample of data.
The quantile plot shows the quantities (i-0.5)/(n+0.25)
versus the data values in sorted order, for i ranging
between 1 and n. It shows the cumulative proportion of
the data less than or equal to X.
-
- quantiles
- Those data values below which lie specified proportions
of the sample. For a sample of size n, the data value
which has rank i corresponds to the quantile at
(i-0.5)/(n+0.25).
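The pairing of ranked data values with their quantile positions can be sketched as follows, using the formula exactly as stated above. The function name quantile_positions is hypothetical.

```python
def quantile_positions(data):
    """Pair each sorted value with (i - 0.5) / (n + 0.25), where i is its rank."""
    n = len(data)
    return [(x, (i - 0.5) / (n + 0.25))
            for i, x in enumerate(sorted(data), start=1)]

for value, position in quantile_positions([7, 3, 5]):
    print(value, round(position, 4))
```

Plotting the positions against the sorted values gives the quantile plot.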
- quartiles
- Statistics which divide the observations in a numeric
sample into 4 intervals, each containing 25% of the data.
The lower, middle, and upper quartiles are computed by
ordering the data from smallest to largest and then
finding the values below which fall 25%, 50%, and 75% of
the data.
The middle quartile is usually called the median.
Quartiles are special cases of percentiles.
The lower quartile, median, and upper quartile are the
same as the 25th, 50th, and 75th percentiles.
Box and
whisker plots provide a graphical summary of a data
sample in terms of its quartiles. Upper and lower
quartiles are calculated in the One Variable Analysis
Statlet.
- R-squared
- A statistic employed in regression
analysis that measures how much variance has been
explained by the regression model. Specifically, it is
the proportion of the total variability (variance) in the
dependent
variable that can be explained by the independent
variables. R-squared is also employed as a measure of
goodness of fit of the model. R-squared ranges from 0% to
100%. If all the observations fall on the regression
line, R-squared is equal to 100%.
The variability in the dependent variable is partitioned
into two component sums of squares: variability explained
by the regression model and unexplained variation. To
calculate R-squared, you divide the regression sum of
squares by the total sum of squares. In a simple
regression, R-squared can also be obtained by squaring
the correlation
coefficient.
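Both routes to R-squared can be compared on a small example. This is a sketch for a simple regression only, with a hypothetical function name and made-up data; it returns R-squared as a proportion rather than a percentage.

```python
from math import sqrt

def r_squared_and_r(xs, ys):
    """Return (regression SS / total SS, correlation coefficient) for a line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    predicted = [intercept + slope * x for x in xs]
    ss_regression = sum((p - y_bar) ** 2 for p in predicted)
    return ss_regression / ss_total, sxy / sqrt(sxx * ss_total)

r2, r = r_squared_and_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(r2, r * r)   # the two values agree (about 0.6, i.e. 60%)
```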
-
- random sampling
- A sampling method in which all elements in the population
have an equal chance of being selected, and in which the
value of one observation does not affect the outcome of
other observations.
Random samples have important properties which are
necessary in many statistical tests.
- random variable
- A function which assigns a numerical value to all
possible outcomes of an experiment. The values of random
variables differ from one observation to the next in a
manner described by their probability
distribution.
-
- range
- The range of a sample is defined as the
(maximum value) - (minimum value)
It is sometimes used in place of the standard deviation to
measure the spread of a data sample. It is calculated in
the One Variable Analysis Statlet.
-
- rank correlation
- A number between -1 and +1 which measures the strength of
the relationship between two variables. Unlike the
ordinary correlation
coefficient, which is based on the actual data
values, this correlation is based on the ranks of the
data within each variable. It is thus less sensitive to outliers.
-
- ranks
- Values found by first ordering the observations
in a sample from smallest to largest. The smallest value
is assigned a rank of 1, the second smallest gets a rank
of 2, and so on. If several values are equal in
magnitude, their ranks are pooled and each is assigned
the average rank.
Ranks are used in many nonparametric
tests where it is not reasonable to assume that the
sample comes from a normal
distribution.
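The ranking rule, including the pooling of tied values, can be sketched as below. The function name average_ranks is hypothetical.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        pooled = (start + 1 + end + 1) / 2   # average of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = pooled
        start = end + 1
    return ranks

print(average_ranks([30, 10, 20, 20]))   # the two 20s share ranks 2 and 3
```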
-
- ratio type variable
- A type of variable for which the distance between data
values is meaningful and for which there is a natural
origin. For example, the heights of human beings are
measured on a ratio scale. For such a variable, it is
meaningful to say that one value is twice as large as
another.
-
- reciprocal-X model
- In the Simple Regression Statlet, a model which takes the
form:
Y = a + b/X
-
- reciprocal-Y model
- In the Simple Regression Statlet, a model which takes the
form:
Y = 1 / (a + b*X)
-
- rectangular
distribution
- Another name for the uniform
distribution.
-
- regression
coefficients
- Calculated in regression analyses, estimates of unknown parameters in a regression model.
Interpretation of the coefficients as indicators of the
relative importance of the variables is not appropriate
unless they are standardized, since the actual magnitude
will depend upon the units in which the variables are
measured.
-
- regression model
- A statistical model
relating a dependent
variable to one or more independent
variables.
-
- relative frequency
- A count of the number of occurrences of a data value in a
sample, or the number of values falling within a fixed
range, expressed as a proportion of the total number of
observations. Data can be summarized in this manner by
tabulating all the data values into distinct categories
and then counting the number of times each category
appears in the frequency distribution. This tabular
summary is called a frequency table. A graphical
representation would be a barchart or histogram. Relative
frequencies are calculated in the Tabulation and
Crosstabulation Statlets.
-
- relative
standard deviation
- Another name for the coefficient
of variation.
-
- reliable
- Statistical reliability refers to measurements that can
be repeated with time. For example, a variable or
statistic is said to be reliable if its value can be
measured in the same way in repeated experiments.
-
- residual
- The "error" left over after a statistical model is fit to
a sample of data. In a regression
analysis, the residuals are the differences between
the observed values and the values that are predicted by
the regression model. Specifically, they are the
deviations from the regression line. In a oneway analysis
of variance, the residuals are the observed values minus
the mean of the samples from which the observations come.
Analysis of residuals is an important step in any
analysis since the residuals are assumed to be normally
and independently distributed with constant variance.
Plots of residuals are often the best way to see if the
data violates these assumptions. When interpreting
residuals, it is often desirable to refer to Studentized residuals
because the magnitudes are easier to judge.
-
- residual plots
- Residual plots are provided as options in many Statlets.
They are used to display the deviations of the data
values from the fitted statistical
models. Residuals should be random, with zero mean
and constant variance.
-
- resistant
- A characteristic of a statistical procedure which is not
greatly influenced by the presence of outliers.
Procedures involving medians
and quartiles are in general
more resistant than those involving means and standard
deviations.
-
- resolution
- The number of dots or "pixels" (picture
elements) on a screen or printer. The quality of graphs
will depend on the resolution of an output device.
-
- robust methods
- Statistical procedures which are not greatly affected
when the underlying assumptions are not exactly met. The
use of confidence
intervals for the mean is an example of a procedure
which is robust against departures from the assumption that the data
follow a normal distribution. In general, nonparametric and EDA procedures tend to be
more robust than classical
methods.
- S-curve model
- In the Simple Regression Statlet, a model which takes the
form:
Y = exp(a + b/X)
-
- sample
- A set of observations, usually considered to have been
taken from a much larger population.
Statistics are numerical or graphical quantities
calculated from a sample. Since the data in one sample
will vary from that of another, so will the statistics
calculated from those samples.
-
- Scheffe intervals
- Intervals which bound the estimation error in each mean
using a method suggested by Scheffe. This method is
appropriate whether or not the group sizes are all the
same. Using the F-distribution,
it generates intervals which allow you to make all
possible linear comparisons among the sample means while
controlling the experimentwide error rate at a specified
level.
- seasonal adjustment
- Adjustment of time series data
to remove cyclical effects caused by periodic seasonal
factors. Data such as unemployment rates, which tend to
be higher at certain times of the year, are often
adjusted to allow better comparison across months.
In a multiplicative adjustment, each data value is
divided by the estimated seasonal index for the
corresponding month. Seasonal indices are scaled so that
an average month has an index of 100. An index of 110
would indicate a season 10% above average, while an index
of 90 would indicate a season 10% below average. In an
additive adjustment, an average month has an index of 0.
In the latter method, the index is subtracted from the
data value to form the adjusted data.
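The two kinds of adjustment can be sketched as follows, assuming the seasonal indices have already been estimated. The function names and the numbers are made up for illustration.

```python
def adjust_multiplicative(values, indices):
    """Divide each value by its seasonal index, where 100 is an average season."""
    return [v * 100.0 / idx for v, idx in zip(values, indices)]

def adjust_additive(values, indices):
    """Subtract each seasonal index, where 0 is an average season."""
    return [v - idx for v, idx in zip(values, indices)]

# A season 10% above average (index 110) and one 10% below (index 90).
print(adjust_multiplicative([220.0, 180.0], [110.0, 90.0]))
print(adjust_additive([105.0, 95.0], [5.0, -5.0]))
```

In both cases the adjusted series is flat, since the seasonal effect was the only source of variation in the made-up data.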
-
- serial correlation
- A situation in which the values in a sample are correlated based upon
their order. Most statistical procedures, except for
those in the time series
analysis section, assume that the data are not serially
correlated. Serial correlation can invalidate the P values calculated by common
statistical procedures.
-
- Shapiro-Wilk test
- A test to determine whether or not a sample comes from a normal distribution,
conducted by regressing the quantiles
of the observed data against those of the best-fitting
normal distribution.
-
- significance level
- The significance level of a test is the smallest alpha level at which
the null hypothesis would be rejected. Usually, if the
significance level is less than a number such as .05
(5%), the null hypothesis would be rejected in favor of
the alternative. In many cases, the significance level
can be thought of intuitively as the chance of getting a
sample like the one being analyzed if the null hypothesis
were true. A small significance level would imply that
getting such a sample was highly unlikely, suggesting
that the null hypothesis is probably not true. The
significance level is also called the P
value of the test.
-
- significant
- A term used in statistics which has a very precise
definition. It means that the observed event would occur
by chance under hypothesized conditions less than a
specified proportion of the time. For example, we may be
interested in knowing whether the means of two samples
are the same or different. If the sample means are
different enough, we might conclude on the basis of a
statistical test that the population means are
"significantly different at the 5% level". This
states that, if the means are actually the same, a
difference as large as we observed would occur less than
5% of the time.
-
- simple regression
- A statistical model
relating a single dependent variable
to a single independent
variable.
- skewed
- A characteristic applicable to probability distributions
or samples which refers to a lack of symmetry.
-
- skewness
- A statistic which measures the lack of symmetry in a
distribution. A plot of a skewed distribution would show
a long tail to either the left or the right.
Distributions with a longer upper tail are said to be
positively (right) skewed, while those with a longer
lower tail are negatively (left) skewed.
The skewness of data is usually measured through a
coefficient of skewness which is zero for symmetric
distributions such as the normal
or uniform distribution,
is greater than zero for positively skewed data, and is
less than zero for negatively skewed distributions. To
judge whether data departs significantly from a normal
distribution, a standardized
skewness statistic can also be computed. Skewness is
calculated in the One Variable Analysis Statlet.
-
- skychart
- A graphical summary of the joint frequency distribution
of two variables. It is created by plotting bars whose
heights are proportional to the frequencies in the cells
of a Crosstabulation.
-
- slope
- The term in the equation of a regression
line which multiplies the independent
variable X. If the equation of a line is given as Y =
A+B*X, then the slope is B. The slope's numerical value
indicates the steepness of a line while the algebraic
sign describes the relationship between X and Y. A
positive slope means that as X increases, Y increases.
The converse is true for a negative slope: as X
increases, Y decreases.
In a regression analysis, the slope tells you how much
increase there is in the predicted Y variable for every
unit change in the X variable.
-
- Somers' D
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It ranges from -1 to +1 and
is based on the number of concordant and discordant pairs
of observations. A concordant pair is one in which the
two variables (row and column) have the same relative
ranking (greater than or less than). A discordant pair is
one in which the two variables have the opposite ranking.
One variable (row or column) is considered to be the independent
variable, while the other is considered to be the dependent variable.
Both variables must be ordinal. A
correction is made for ties on the independent variable.
-
- square root-X model
- In the Simple Regression Statlet, a model which takes the
form:
Y = a + b*X^0.5
-
- square root-Y model
- In the Simple Regression Statlet, a model which takes the
form:
Y = (a + b*X)^2
-
- standard deviation
- A statistic which measures the variability or dispersion
of a set of data. It is calculated from the deviations
(distances) between each data value and the sample mean,
and is often represented by the letter "s". The
more dispersed the data are, the larger the standard
deviation.
The standard deviation squared is called the variance. For data which follows a normal distribution,
approximately 68% of all data will fall within one
standard deviation of the sample mean, 95% of all values
will fall within two standard deviations, and 99.7% of
all data will fall within three standard deviations. For
data from any distribution, AT LEAST 75% of all values
fall within plus and minus two standard deviations, while
AT LEAST 89% fall within three standard deviations.
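The percentages quoted above can be reproduced numerically. The normal figures follow from the error function, and the "at least" figures follow from Chebyshev's inequality, 1 - 1/k^2. The function names in this sketch are hypothetical.

```python
from math import erf, sqrt

def normal_coverage(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

def chebyshev_bound(k):
    """Minimum proportion within k standard deviations for any distribution (k > 1)."""
    return 1 - 1 / k**2

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
for k in (2, 3):
    print(k, round(chebyshev_bound(k), 4))
```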
-
- standard error
- The standard deviation of the sampling distribution of a
statistic. As such, it measures the precision of the
statistic as an estimate of a population or model
parameter.
A commonly computed standard error is the standard error
of the mean, defined as
(population standard deviation) / n^0.5
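In practice the population standard deviation is rarely known, so the sample standard deviation is usually substituted. The sketch below makes that substitution explicit; it is an assumption of the example, and the function name is hypothetical.

```python
from math import sqrt
from statistics import stdev

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n).

    Assumption: the sample standard deviation s stands in for the
    unknown population standard deviation.
    """
    return stdev(sample) / sqrt(len(sample))

print(standard_error_of_mean([1, 2, 3, 4]))
```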
- standard error
of means
- In regression analysis,
the standard deviation of the predicted mean for many new
observations.
-
- standard
error of predictions
- In regression analysis,
the standard deviation of the predicted value for a
single new observation.
-
- standard
error of the estimate
- In regression analysis,
the best estimate of the standard deviation of the residuals about the regression line.
It can be calculated by taking the square root of the
residual mean squared error. Typically, the smaller the
standard error, the better the fit of the regression
line.
- standard
normal distribution
- A normal distribution
with a mean equal to 0 and a standard deviation equal to
1. Also called the unit normal distribution.
-
- standardized
skewness
- A standardized form of the skewness
statistic which renders the statistic free of scale.
Standardized skewness can be used to test whether your
data comes from a normal
distribution. If it does, the statistic will fall
between -2 and +2 about 95% of the time.
-
- standardized
kurtosis
- A standardized form of the kurtosis statistic which
renders the statistic free of scale.
Standardized kurtosis can be used to test whether your
data comes from a normal
distribution. If it does, the statistic will fall
between -2 and +2 about 95% of the time.
-
- stationarity
- A characteristic of a time series
for which the distribution does not change over time.
-
- statistic
- Anything that can be calculated from a sample of data.
The most common use of the word is for summary measures
such as the sample mean and sample standard deviation,
although graphical displays such as histograms are also
statistics. When referring to the population mean or
standard deviation, the term parameter
is used.
Another use of the term statistics is to describe two
broad categories of data analysis: Descriptive and
Inferential Statistics. Descriptive statistics involves
numerical or graphical summaries of data. Inferential
statistics allows one to use sample statistics to make
statements about population parameters.
-
- statistical
inference
- The extension of sample results to
a larger population.
Descriptive statistics (such as the mean or a histogram)
provide concise methods for summarizing a lot of
information. However, it is inferential statistics that
allows one to make statements about the population from a
sample.
For example, it is often virtually impossible to measure
an entire population, but by statistical inference one
can use the measured sample statistics to make statements
about the unmeasured population (see estimation). However,
in order to use the power of statistical inference,
certain assumptions about the statistic must first be
met. For example, making correct inferences about a
population from a sample can often require that random sampling be employed.
-
- statistical model
- A model used to describe the relationship between a dependent variable
Y and one or more independent
variables. It takes the general form:
Y = f(X1,X2,...) + error
where f(X1,X2,...) is a mathematical function, and error
represents the random deviations from the model.
-
- stem-leaf display
- A cross between a table and a graph that shows individual
data values stacked up like a histogram. As a result,
the data is displayed in less space than a frequency
table and, unlike a table, affords visual examination of
the frequency distribution for skewness
and outliers.
The display consists of a "stem" which is a
vertical list of all the leading digits in the data to
the left of a central vertical line. On the right side of
the central vertical line are the "leaves",
which are the digits which immediately follow the stems.
A cumulative count for each row, called the
"depth", is shown along the far left of the
display. Outside points are
listed on separate low and high stems.
A stem-leaf display is created in the One Variable
Analysis Statlet.
-
- Student's t
distribution
- A probability distribution which is very similar in shape
to the standard
normal distribution. The mean of the t distribution
is always equal to 0, while the standard deviation is
usually slightly greater than 1. Only one parameter,
called the degrees
of freedom, is necessary to completely specify the
distribution.
Values of Student's t are frequently used in forming confidence
intervals for the mean when the variance is unknown,
testing if two sample means are significantly different,
or when testing the significance of coefficients in a regression model.
Parameters: degrees of freedom n>0
Domain: all real X
Mean: 0
Variance: n/(n-2) for n>2
-
- Studentized
residual
- A residual which has been divided
by its estimated standard error, where that standard
error is based upon fitting a statistical model using all
points except the point whose residual is to be computed.
Studentized residuals should behave like a sample from a normal distribution with
a mean of 0 and a standard deviation of 1. Studentized
residuals are a type of "standardized"
residual, which allows for easier interpretation of
residuals. Since each residual is expressed in terms of
the number of standard deviations away from 0,
Studentized residuals greater than 3 or less than -3
should occur very infrequently.
-
- sum of squares
- A mathematical quantity used in various statistical
procedures to measure variability. For example, the
sample variance is computed by summing the squared
deviations of each observation from the sample mean and
dividing by n-1.
-
- symmetric
- A characteristic of a sample or population which looks
the same to the right of the peak as it looks to the left
of the peak.
-
- symmetric
distribution
- A distribution which looks identical to the
left and right of its mean. For such a distribution, the
mean and median are the same. Common examples of
symmetric distributions are the normal and the uniform
distributions. A distribution which is not symmetric is
said to be skewed.
- t test
- A hypothesis test
based on Student's t
distribution. Commonly used to test hypotheses about
one or more population means or coefficients in a linear
regression model.
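For a single population mean, the test statistic can be sketched as follows. The function name `one_sample_t` and the sample values are hypothetical. Turning the statistic into a P-value requires the cumulative distribution of Student's t, which the Python standard library does not provide, so that step is left out of this sketch.

```python
import math
import statistics

def one_sample_t(data, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)      # sample standard deviation (n - 1 divisor)
    se = sd / math.sqrt(n)           # estimated standard error of the mean
    return (mean - mu0) / se, n - 1

t_stat, df = one_sample_t([5.1, 4.9, 5.3, 5.0, 5.2], mu0=5.0)
print(round(t_stat, 3), df)  # 1.414 4
```

The statistic is compared with a critical value of the t distribution with n-1 degrees of freedom to decide whether to reject the null hypothesis.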
-
- time series
- A sample of data values collected at equally spaced
points in time. Possible autocorrelation
between adjacent values makes it necessary to use special
statistical methods to analyze this type of data.
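One common way to quantify the dependence between adjacent values is the sample lag-1 autocorrelation. The sketch below uses a hypothetical function name and a toy series; it illustrates the idea and is not a full time series analysis.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of an equally spaced series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    # Products of deviations for each pair of adjacent values.
    numerator = sum(dev[t] * dev[t + 1] for t in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

r1 = lag1_autocorrelation([1, 2, 3, 4, 5])   # steadily rising series
print(r1)  # 0.4
```

Values well away from 0 suggest that adjacent observations are not independent.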
-
- triangular
distribution
- A distribution used for random variables which are
constrained to lie between two fixed limits. However,
unlike the uniform
distribution in which all values between the limits
are equally likely, the triangular distribution peaks at
a central value. It is characterized by three parameters:
lower limit, central value, and upper limit.
Parameters: lower limit a, central value c>a, upper
limit b>c
Domain: a<=X<=b
Mean: (a+b+c)/3
Variance: [a^2 + b^2 + c^2 - ab - ac - bc]/18
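The mean and variance can be checked numerically against the triangular density. The following sketch uses hypothetical function names and an arbitrary choice of limits (a=0, c=1, b=3); the midpoint-rule integration is only a rough check.

```python
def triangular_mean(a, c, b):
    """Mean for lower limit a, central value c and upper limit b."""
    return (a + b + c) / 3

def triangular_variance(a, c, b):
    """Variance for the same three parameters."""
    return (a * a + b * b + c * c - a * b - a * c - b * c) / 18

def triangular_pdf(x, a, c, b):
    """Density: rises linearly from a to the peak at c, then falls linearly to b."""
    if x < a or x > b:
        return 0.0
    if x <= c:
        return 2 * (x - a) / ((b - a) * (c - a))
    return 2 * (b - x) / ((b - a) * (b - c))

def numeric_moments(a, c, b, steps=20_000):
    """Midpoint-rule approximation of the mean and variance from the density."""
    h = (b - a) / steps
    xs = [a + (i + 0.5) * h for i in range(steps)]
    mean = sum(x * triangular_pdf(x, a, c, b) for x in xs) * h
    var = sum((x - mean) ** 2 * triangular_pdf(x, a, c, b) for x in xs) * h
    return mean, var

a, c, b = 0.0, 1.0, 3.0
print(triangular_mean(a, c, b), triangular_variance(a, c, b))
print(numeric_moments(a, c, b))
```

Both lines of output should agree to several decimal places (4/3 for the mean and 7/18 for the variance).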
-
- Type I
error
- Incorrectly rejecting a true null hypothesis. The
probability of such an error is the alpha risk. It is
common to set the probability of making a Type I error at
a number such as 5%. For more details, see hypothesis tests.
-
- Type II error
- Not rejecting a false null hypothesis. The probability of
such an error is called the beta risk. For more
details, see hypothesis
tests.
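The trade-off between the two error types can be illustrated with a one-sided z test. The sketch assumes normally distributed data with a known standard deviation, which is a simplification made for this example; the function name `beta_risk` and all numbers are hypothetical.

```python
from statistics import NormalDist

def beta_risk(mu0, mu1, sigma, n, alpha=0.05):
    """Beta risk of a one-sided z test of H0: mu = mu0 when the true mean is mu1.

    Assumes normal data with known standard deviation sigma and mu1 >= mu0.
    """
    se = sigma / n ** 0.5
    # H0 is rejected when the sample mean exceeds this value (Type I risk = alpha).
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Type II error: the sample mean stays below the critical value
    # even though the true mean is mu1.
    return NormalDist(mu1, se).cdf(critical)

beta = beta_risk(mu0=0.0, mu1=2.0, sigma=3.0, n=9)
print(round(beta, 3))
```

Lowering alpha moves the critical value upward, which increases the beta risk for the same sample size.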
- uncertainty
coefficient
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. On a scale of 0 to 1, it
measures the proportional reduction in the uncertainty
about the value of the row or column variable given
knowledge of the other.
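A common entropy-based form of this statistic can be sketched as below. The function names are hypothetical, and the Statlets may differ in details such as which variable is treated as the one being predicted.

```python
import math

def entropy(probabilities):
    """Shannon entropy (natural log); cells with zero probability are skipped."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def uncertainty_coefficient(table):
    """Proportional reduction in uncertainty about the row variable,
    given knowledge of the column variable.

    `table` is a list of rows of counts with at least two non-empty rows.
    """
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    cell_p = [count / total for row in table for count in row]
    h_row, h_col, h_joint = entropy(row_p), entropy(col_p), entropy(cell_p)
    # Conditional entropy: H(row | column) = H(joint) - H(column).
    return (h_row - (h_joint - h_col)) / h_row

independent = uncertainty_coefficient([[10, 10], [10, 10]])   # close to 0
dependent = uncertainty_coefficient([[20, 0], [0, 20]])       # close to 1
print(independent, dependent)
```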
- uniform
distribution
- A continuous probability distribution which is useful for
characterizing data which ranges over an interval of
values, each of which is equally likely. It is sometimes
called the rectangular distribution because of its shape
when plotted. The distribution is completely determined
by the smallest possible value "a" and the
largest possible value "b". For discrete data,
there is a related discrete
uniform distribution.
Parameters: lower limit a, upper limit b>a
Domain: a<=X<=b
Mean: (a+b)/2
Variance: (b-a)^2/12
- unit normal
distribution
- Another name for the standard normal
distribution.
-
- upper quartile
- The 75th percentile,
calculated by ordering the data from smallest to largest
and finding the value which lies 75% of the way up
through the data.
Box and
whisker plots are a graphical display of a sample in
terms of its quartiles.
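In Python, the quartiles of a sample can be obtained from the standard library, as sketched below with hypothetical data. Several interpolation conventions exist, so other software may return slightly different values for small samples.

```python
import statistics

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]   # hypothetical sample, unordered
# quantiles() orders the data and returns the three cut points for n=4;
# the last one is the upper quartile. method="inclusive" is one convention.
lower_q, median, upper_q = statistics.quantiles(data, n=4, method="inclusive")
print(upper_q)  # 7.0
```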
- variable
- A term used in statistics to describe the factors
that are to be studied. Data variables are described as
either:
1) Qualitative (Categorical) or
2) Quantitative (Numerical)
Qualitative variables are named categories of data. For
example, gender or machine number. The categories may be
coded numerically (e.g., female=1, male=2), but the
actual numbers have no true numerical meaning. An average
value for the variable gender, for instance, would not
make any sense.
On the other hand, quantitative variables are truly
numeric in nature. For example, taking the average of a
sample of item weights would have true meaning.
Quantitative variables can further be described as discrete or continuous.
In statistical analysis, it is important to know the
nature of your variables (categorical or numeric) and how
the variable is measured. The scale of measurement used
for your variable dictates what statistical procedures
can legitimately be applied to your data.
- variable data
- Data measured on a continuous scale, such as a person's
age in years.
-
- variance
- A statistic which measures how spread out or dispersed a
set of data is. The value calculated will always be
greater than or equal to zero, with larger values
corresponding to data which is more spread out. If all
data values are identical, the variance is equal to zero.
The square root of the variance is called the standard deviation. Since
the standard deviation is measured in the same units as
the data, it is more frequently used than the variance.
The variance and standard deviation are calculated in the
One Variable Analysis Statlet.
-
- variance component
- The variance due to a particular factor in a model which
decomposes the overall process variance into two or more
pieces.
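For a balanced design with a single random factor, the components can be estimated from the between-group and within-group mean squares. The sketch below assumes equal group sizes and uses a hypothetical function name and made-up data; it is one standard method-of-moments approach, not necessarily the one used by a given program.

```python
def variance_components(groups):
    """Method-of-moments estimates for a balanced one-way random effects model.

    `groups` is a list of equally sized lists of measurements, one per batch.
    Returns (between-batch component, within-batch component).
    """
    k = len(groups)        # number of batches
    n = len(groups[0])     # measurements per batch (assumed equal)
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (k * (n - 1))
    # A negative estimate is conventionally set to zero.
    between = max((ms_between - ms_within) / n, 0.0)
    return between, ms_within

print(variance_components([[10, 12], [14, 16], [18, 20]]))  # (15.0, 2.0)
```

Here the between-batch component (15) and the within-batch component (2) together make up the estimated overall variance.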
-
- viewpoint
- The location of the user's eye when viewing 3D graphs. It
can be changed by pressing the arrows on the graphics
panel.
- warning limits
- Limits placed on a control
chart for variables or attributes at 1 and 2
"sigma" to help determine how far points lie
from the centerline.
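Given a centerline and the standard deviation of the plotted statistic, the limits are simple to compute. In this sketch the function name, the centerline and the sigma value are all hypothetical.

```python
def warning_limits(center, sigma):
    """Return the 1-sigma and 2-sigma limit pairs around a centerline."""
    return {k: (center - k * sigma, center + k * sigma) for k in (1, 2)}

limits = warning_limits(center=10.0, sigma=2.0)
print(limits)  # {1: (8.0, 12.0), 2: (6.0, 14.0)}
```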
-
- Weibull
distribution
- A distribution used for random variables which are
constrained to be greater than or equal to 0. It is
characterized by two parameters: shape and scale. The
Weibull distribution is one of the few distributions
which can be used to model data which is negatively skewed.
Parameters: shape a>0, scale B>0
Domain: X>=0
Mean: (B/a) G(1/a)
Variance: (B^2/a) [2G(2/a) - (1/a){G(1/a)}^2]
where G(y) is the gamma function evaluated at y.
-
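The Weibull mean and variance can be evaluated with the gamma function from Python's standard library. The function names are hypothetical, and the formulas used are the standard ones for shape a and scale B, written in terms of G(1/a) and G(2/a).

```python
import math

def weibull_mean(shape, scale):
    """Mean (B/a) G(1/a), which equals B * G(1 + 1/a)."""
    return (scale / shape) * math.gamma(1 / shape)

def weibull_variance(shape, scale):
    """Variance (B^2/a) [2 G(2/a) - (1/a) G(1/a)^2]."""
    g1 = math.gamma(1 / shape)
    g2 = math.gamma(2 / shape)
    return (scale ** 2 / shape) * (2 * g2 - g1 ** 2 / shape)

# With shape 1 the Weibull reduces to the exponential distribution,
# whose mean is the scale and whose variance is the scale squared.
print(weibull_mean(1.0, 2.0), weibull_variance(1.0, 2.0))  # 2.0 4.0
```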
Revised: July 30, 1997.
Copyright © 1997 by NWP Associates, Inc.
All trademarks or product names mentioned herein are the property
of their respective owners.