- adjusted R-squared
- The regression R-squared statistic "corrected"
for the number of independent variables in a multiple
regression analysis. It is often used to compare models
involving different numbers of coefficients. The adjusted
R-squared statistic is interpreted as:
(1) a measure of the goodness of fit of the least squares
regression line.
(2) the proportion of variance in the dependent variable
accounted for by the independent variables.
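As an illustrative sketch (not part of STATLETS), the adjustment can be computed from the ordinary R-squared, the sample size, and the number of predictors; the function name is hypothetical:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared: penalizes R-squared for the number of predictors.

    r_squared : ordinary R-squared of the fitted model
    n         : number of observations
    p         : number of independent variables (excluding the intercept)
    """
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)
```

Adding a useless predictor lowers the adjusted value even though ordinary R-squared can only increase, which is why it is preferred for comparing models with different numbers of coefficients.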
-
- alpha risk
- The probability of rejecting a null hypothesis when it is
true. It is the probability of making a Type I error. For
more details, see hypothesis
tests.
-
- ARIMA models
- ARIMA (AutoRegressive Integrated Moving Average) models
are often used to forecast time series data.
They provide a flexible class of models which can
represent the dynamic behavior found in many sets of
data. To select an ARIMA model, you must specify several
parameters:
p = order of the autoregressive part of the model
d = order of differencing (if any)
q = order of the moving average part of the model.
Separate values are entered for the nonseasonal part of
the model and for the seasonal part. The model may or may
not include a constant term.
-
- aspect ratio
- For a graphics device, the ratio of the screen
dimensions, normally defined as
vertical screen dimension
---------------------------
horizontal screen dimension
Unless the surface on which your graph is displayed is
perfectly square, with an aspect ratio equal to 1,
objects such as circles will be distorted when displayed
unless the aspect ratio is adjusted for.
-
- autocorrelation
- At lag k, the correlation between the data value at time
t and the data value at time (t-k). Autocorrelations are
often calculated for time
series data to determine how the correlation between
data values varies with the distance or time
"lag" between them.
For each autocorrelation, a corresponding standard error
is calculated. If the time series is random, all of the
autocorrelations should be within approximately +/- 2
standard errors. Estimates extending beyond this distance
indicate significant correlation between data values
separated by the indicated time lag.
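A minimal sketch of the lag-k sample autocorrelation (the function name is illustrative, not a STATLETS routine):

```python
def autocorrelation(y, k):
    """Lag-k sample autocorrelation of the series y."""
    n = len(y)
    mean = sum(y) / n
    # Sum of cross-products between values k periods apart
    num = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, n))
    # Total sum of squared deviations
    den = sum((v - mean) ** 2 for v in y)
    return num / den
```

For a random series, each estimate should fall roughly within +/- 2/sqrt(n) of zero, which corresponds to the two-standard-error bounds described above.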
-
- autoregressive
models
- A statistical model for time series which
relates the observed data value at time t to the values
in previous time periods. Common models are:
Nonseasonal order 1: Yt = a + b*Yt-1 + At
Nonseasonal order 2: Yt = a + b*Yt-1 + c*Yt-2 + At
Seasonal order 1: Yt = a + b*Yt-s + At
Seasonal order 2: Yt = a + b*Yt-s + c*Yt-2s + At
where At is a random error term at time t.
These models can be fit using the Forecasting Models
Statlet in the time series section.
-
- average
- A statistic calculated by summing a set of data values
and dividing by the number of values. Used to measure the
center of a sample of variable data.
- Bernoulli trial
- An experiment with only two possible outcomes, such as
success or failure, heads or tails, or good or bad. The
probability of a selected outcome (such as a success) is
normally referred to as p. Related probability
distributions include the binomial distribution,
the geometric, and
the negative
binomial.
Parameter: event probability 0<=p<=1
Domain: X=0,1
Mean: p
Variance: p(1-p)
-
- beta distribution
- A distribution used for continuous random variables which
are constrained to lie between 0 and 1. It is
characterized by two parameters: shape and scale.
Parameters: shape a>0
scale B>0
Domain: 0<=X<=1
Mean: a/(a+B)
Variance: aB/[(a+B)^2 (a+B+1)]
-
- beta risk
- The probability of not rejecting a null hypothesis when
it is false. It is the probability of making a Type II error, or
(1.0-power). For more details, see hypothesis tests.
-
- binomial
distribution
- A distribution which gives the probability of observing X
successes in a fixed number (n) of independent Bernoulli trials. p
represents the probability of a success on a single
trial.
Parameters: event probability 0<=p<=1
number of trials: n>0
Domain: X=0,1,...,n
Mean: np
Variance: np(1-p)
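A short sketch of the binomial probability function (illustrative only):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) successes in n independent Bernoulli trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)
```

Summing x times the probability over all possible values of x recovers the mean np.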
-
- Bonferroni
intervals
- Used when comparing several means, these intervals bound
the estimation error in each mean using a method
suggested by Bonferroni. This method is appropriate
whether or not the group sizes are all the same. Using
the F-distribution, it
generates intervals which allow you to make a specified
number of linear comparisons among the sample means while
controlling the experiment-wide error rate at a specified
level. In the STATLETS ANOVA procedures, the number of
comparisons is set to allow you to compare the
differences between all pairs of means.
-
- box-and-whisker
plot
- A graphical display designed to provide a quick summary
of a sample of numeric data. It consists of a box
extending from the lower quartile to the upper
quartile, with a central line at the median.
Whiskers extend to the smallest and largest points which
are not classified as outside points.
- capability indices
- Indices computed in the Process Capability procedure to
measure how well a sample of data conforms to process
specifications. The indices include:
(1) CP = (USL - LSL) / (6*sigma)
(2) CPK = smaller of CPK(upper) and CPK(lower)
(3) CPK(upper) = (USL - sample mean) / (3*sigma)
(4) CPK(lower) = (sample mean - LSL) / (3*sigma)
where USL and LSL represent the upper and lower
specification limits, respectively. Normally, all of
these indices should be greater or equal to 1.33 for the
process to be deemed "capable" of meeting the
specifications.
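The formulas above can be sketched directly (a minimal illustration, not the Process Capability procedure itself):

```python
def capability_indices(mean, sigma, lsl, usl):
    """CP and CPK from the sample mean, sigma, and spec limits."""
    cp = (usl - lsl) / (6 * sigma)
    cpk_upper = (usl - mean) / (3 * sigma)
    cpk_lower = (mean - lsl) / (3 * sigma)
    # CPK is the smaller of the two one-sided indices
    return cp, min(cpk_upper, cpk_lower)
```

When the process is centered between the limits, CP and CPK agree; as the mean drifts toward either limit, CPK drops below CP.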
-
- case
- A case consists of measurements on variables for an
individual subject or experimental unit. For example, you
might take many measurements on a single person:
variables such as height, weight, gender, etc. The
individual person's measurements on all the variables
represents a case. A measurement on a single variable is
called an observation.
Sometimes the terms case and observation are used
interchangeably. Strictly speaking, however, a case
represents one computer record, or one row of data in a
spreadsheet, for an individual experimental unit. Each
individual case (or person in this example) will have a
row of data contained in the data file. The data file
will have n rows of data, where n is the number of cases.
-
- central limit
theorem
- The Central Limit Theorem is an important mathematical
result which states that for a random sample of
observations from any distribution with a finite mean and
a finite variance, the average will tend to follow a normal
distribution for large samples. This theorem is the
main justification for the widespread use of confidence intervals
based on the normal distribution and for t tests when
estimating the mean and comparing two means.
-
- chi-square
distribution
- A distribution used for random variables which are
constrained to be greater or equal to 0. It is
characterized by one parameter: degrees of freedom. The
chi-square distribution is used most often as the
sampling distribution for various statistical tests.
Parameter: degrees of freedom v (positive integer)
Domain: X>=0
Mean : v
Variance : 2v
-
- chi-square test
- A test performed on a two-way frequency table to test
whether two variables can be considered statistically
independent. It is calculated in the Crosstabulation and
Contingency Table Statlets.
In calculating the chi-square test, the observed
frequency in each cell is compared to the frequency which
would be expected if the row and column classifications
were independent. If the calculated statistic is large
(i.e., if its P value
is less than a predetermined significance level such as
.05), then the null hypothesis of independence must be
rejected.
The chi-square test is valid only if the expected
frequency in each cell is relatively large. If any
frequency is less than 5, a warning is displayed.
If the two-way table contains exactly 2 rows and 2
columns and the total count in the table does not exceed
100, Fisher's exact test
is also performed.
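The comparison of observed and expected frequencies can be sketched as follows (an illustration, not the Crosstabulation procedure itself):

```python
def chi_square_statistic(table):
    """Chi-square statistic for a two-way frequency table.

    table : list of rows, each a list of observed cell counts
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat
```

The statistic is then compared against a chi-square distribution with (rows-1)*(columns-1) degrees of freedom to obtain the P value.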
-
- classical methods
- The estimation of confidence
intervals and use of hypothesis
tests have been widely applied for over 50 years.
They are sometimes referred to as classical procedures,
in contrast to newer nonparametric
and EDA techniques which tend to be
much less sensitive (more robust) to the
assumption that data follows a normal
distribution.
-
- coefficient of
variation (CV)
- The ratio of the standard deviation divided by the mean,
multiplied by 100, so that it is expressed as a percent.
Sometimes called the relative standard deviation.
This summary statistic is often employed in the natural
sciences, where the standard deviation of measurement
error is often proportional to the magnitude of the
values being measured. Since the coefficient of variation
provides a measure of relative variation and is
scale-free, it is particularly useful in making
comparisons between different samples.
The CV is calculated in the One Variable Analysis and
Multiple Samples Statlets.
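A minimal sketch of the calculation (the function name is illustrative):

```python
def coefficient_of_variation(data):
    """CV: sample standard deviation as a percent of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((v - mean) ** 2 for v in data) / (n - 1)) ** 0.5
    return 100.0 * sd / mean
```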
-
- concordant
- A pair of cases for two ordered data variables in which
values for the first case are either both higher or both
lower than the values of the variables for the second
case. For example, the following pair is concordant:
X1 X2
10 100
20 150
-
- See also discordant.
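Counting concordant and discordant pairs can be sketched as follows (an illustration; statistics such as Kendall's tau are built from these counts):

```python
def count_pairs(x, y):
    """Counts of (concordant, discordant) pairs for two ordered variables."""
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            # Positive product: both variables move the same direction
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return concordant, discordant
```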
-
- conditional gamma
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It ranges from -1 to +1 and
is based on the number of concordant
and discordant pairs of
observations. A concordant pair is one in which the two
variables (row and column) have the same relative ranking
(greater than or less than). A discordant pair is one in
which the two variables have the opposite ranking. Both
variables must be ordinal. No
correction is made for ties.
-
- conditional
sum of squares
- In a multiple
regression analysis, the sums of squares attributable
to each individual independent
variable when entered into the model in the order
specified. They are used to determine the contribution
each independent variable makes to the regression model
as it is added. If the variables are highly correlated with each other,
however, this contribution could vary greatly if the
variables were entered in a different order.
-
- confidence interval
- A statistic
constructed from a set of data to provide an interval
estimate for a parameter.
For example, when estimating the mean of a normal
distribution, the sample average provides a point
estimate or best guess about the value of the mean.
However, this estimate is almost surely not exactly
correct. A confidence interval provides a range of values
around that estimate to show how precise the estimate is.
The confidence level associated with the interval,
usually 90%, 95%, or 99%, is the percentage of times in
repeated sampling that the intervals will contain the
true value of the unknown parameter.
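A minimal large-sample sketch for the mean (using the normal value 1.96 for 95% confidence; for small samples a t value would be used instead):

```python
import math

def mean_confidence_interval(data, z=1.96):
    """Approximate large-sample interval estimate for the mean."""
    n = len(data)
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / (n - 1)
    half = z * math.sqrt(var / n)  # half-width of the interval
    return mean - half, mean + half
```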
-
- confidence limits
- Limits displayed in the Simple Regression
Statlet to show the precision by which the fitted model
has been estimated. If the limits are drawn at a
confidence level of 95%, then we would expect the mean of
many new observations taken at a given value of X to fall
within these limits.
-
- contingency
coefficient
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It measures the degree of
association between the values of the row and column
variables on a scale of 0 to 1, based on the usual chi-square statistic. It
cannot in general attain the value 1 for all tables.
-
- continuous
- A type of random variable which may take any value over
an interval. This is contrasted with a discrete random variable, which may
take only a limited set of values.
-
- control chart
- A chart used to determine whether the distribution of
data values generated by a process is stable over time. A
control chart plots a statistic versus time.
-
- correlation
- A measure of association between two variables. It
measures how strongly the variables are related, or
change, with each other. If two variables tend to move up
or down together, they are said to be positively
correlated. If they tend to move in opposite directions,
they are said to be negatively correlated. Correlations
are computed in the Multiple Regression Statlet.
The most common statistic for measuring association is
the Pearson
correlation coefficient.
-
- correlation
coefficient
- Denoted as "r", a measure of the linear
relationship between two variables. The absolute value of
"r" provides an indication of the strength of
the relationship.
The value of "r" varies between positive 1 and
negative 1, with -1 or 1 indicating a perfect linear
relationship, and r = 0 indicating no relationship. The
sign of the correlation coefficient indicates whether the
slope of the line is positive or negative when the two
variables are plotted in a scatterplot.
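A minimal sketch of the Pearson correlation coefficient (illustrative only):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two numeric variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```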
-
- covariance
- A measure of the joint variability of a pair of numeric
variables. It is based upon the sum of crossproducts of
the values of the variables.
-
- covariate
- A quantitative factor which varies together with the dependent variable. In an
analysis of covariance, the relationship between the
dependent variable and the covariate is first adjusted
for before the effects of the other factors are examined.
-
- Cramer's V
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It measures the degree of
association between the values of the row and column
variables on a scale of 0 to 1, based on the usual chi-square statistic. Unlike
the contingency
coefficient, it can attain the value 1 for all
tables.
-
- cumulative
frequency
- The number of observations falling in a given class in a
frequency table, plus all observations falling in earlier
classes. Cumulative frequencies are calculated in the
Tabulation and Crosstabulation Statlets.
-
- cumulative
probability
- The probability that a random variable will be less than
or equal to a specified value. For example, the
cumulative probability equals 0.5 for the median.
-
- cumulative
relative frequency
- The number of observations falling in a given class in a
frequency table, plus all observations falling in earlier
classes, divided by the total number of observations.
- data file
- A collection of observations on several characteristics,
arranged in the form of a spreadsheet. Each row of the
file represents a single case or
observation. Each column represents a single
characteristic or variable.
-
- degrees of freedom
- A term used in statistics to characterize the number of
independent pieces of information contained in a
statistic. For example, if we begin with a random sample
of n observations and estimate the mean by the sample
average, we are left with only (n-1) independent
measurements from which to estimate the variance or
deviations around the mean. In a simple regression, where
we estimate both an intercept and a slope, only (n-2)
degrees of freedom remain to measure variability around
the fitted line.
-
- density function
- A mathematical function used to determine probabilities
for a continuous random
variable. The bell-shaped curve corresponding to a normal
distribution is one example. To determine the
probability of finding a value between two limits, the
area under the density function between those limits is
computed.
-
- dependent variable
- The variable whose behavior is to be measured as a result
of an experiment. For example, in a consumer research
study on the effects of types of packaging on purchase
behavior (where the size, shape and colors of the box are
the independent variables),
the dependent variable would be the quantity of product
purchased.
Ideally, the dependent variable should be reliable,
sensitive, easy to measure, and distributed in a way that
conforms to the assumptions of a statistical model.
By convention, the dependent variable is plotted along
the vertical Y-axis with the independent variable on the
horizontal X-axis.
-
- DFITS
- A statistic computed when fitting a multiple regression
model to measure the change in each predicted value
which would occur if a single data value was deleted.
Large values correspond to points which have a big
influence on the fitted model.
-
- differencing
- An operation often used to make a nonstationary time series
approximately stationary.
A nonstationary time series is one which does not have a
fixed mean. The types of differencing normally performed
are:
Nonseasonal of order 1: Yt - Yt-1
Nonseasonal of order 2: (Yt - Yt-1) - (Yt-1 - Yt-2)
Seasonal of order 1: Yt - Yt-s
Seasonal of order 2: (Yt - Yt-s) - (Yt-s - Yt-2s)
Mixed seasonal and nonseasonal: (Yt - Yt-1) - (Yt-s - Yt-s-1)
where s is the length of seasonality (such as 12 for
monthly data).
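Differencing at a given lag can be sketched in a few lines (lag 1 for nonseasonal, lag s for seasonal); higher orders are obtained by differencing again:

```python
def difference(y, lag=1):
    """First-order difference of series y at the given lag."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]
```

For example, second-order nonseasonal differencing is difference(difference(y)).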
-
- discordant
- A pair of cases for two ordered data variables in which
the value of one variable for the first case is higher
(or lower) than its value in the second case, and the
relative relationship is switched for the second
variable. For example, the following pair is discordant:
X1 X2
10 100
20 50
-
- See also concordant.
-
- discrete
- A type of random variable which may take on only a
limited set of values, such as 1,2,3,...,10. The list may
be finite, or there may be an infinite number of values.
A discrete random variable is to be contrasted with a continuous random variable.
-
- discrete
uniform distribution
- A discrete distribution which allocates equal
probabilities to all integer values between a lower and
upper limit.
Parameters: lower limit a
upper limit b>a
Domain: X=a,a+1,a+2,...,b
Mean: (a+b)/2
Variance: [(b-a+1)^2 - 1]/12
-
- distribution
- A probability function which describes the relative
frequency of occurrence of data values when sampled from
a population. Distributions are either continuous, typically used for
variables which can be measured, or discrete,
typically used for data that are the result of counts.
-
- double
exponential distribution
- Another name for the Laplace
distribution.
-
- double
reciprocal model
- A model fit in the Simple Regression Statlet which takes
the form:
Y = 1 / (a + b/X)
-
- Duncan's procedure
- A multiple comparisons procedure which allows you to
compare all pairs of means while controlling the overall alpha risk at a specified level.
It is a multiple-stage procedure based on the Studentized
range distribution. While it does not provide interval
estimates of the difference between each pair of means,
it does indicate which means are significantly different
from which others.
-
- Durbin-Watson
statistic
- A statistic employed in multiple regression
analysis to determine if sequential (adjacent) residuals are
correlated. One of the assumptions of regression analysis
is that the residuals (errors) are independent of each
other. Sometimes, however, the data set may unknowingly
contain an "order effect," meaning that a
previous measurement could influence the outcome of the
successive observations.
If the residuals are not correlated, the Durbin-Watson
statistic should be close to 2. Small values (less than
about 1.4) indicate positive correlation between
successive residuals, while large values indicate a
negative correlation. A graphical means to examine an
"order effect" is to plot the residuals against
row order.
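A minimal sketch of the statistic from a list of residuals (illustrative only):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; near 2 suggests uncorrelated residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den
```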
- EDA
- A class of statistical techniques which are designed to
let the analyst display data and extract information from
it in a manner which is not overly sensitive to any
underlying assumptions about how that data is
distributed. These Exploratory Data Analysis (EDA)
techniques include box
and whisker plots, stem-and-leaf
displays, and the widespread use of medians
in place of means.
Many popular EDA techniques were developed by Prof. John
Tukey and his colleagues.
-
- Erlang distribution
- A distribution used for continuous random variables which
are constrained to be greater or equal to 0. It is
characterized by two parameters: shape and scale. It is a
special case of the gamma
distribution in which the shape parameter is required
to be an integer.
Parameters: shape a (integer >=1), scale B>0
Domain: X>=0
Mean: aB
Variance: aB^2
-
- estimation
- The process of using a sample to estimate features of a
population. Using a sample statistic, such as the
sample mean, to obtain a best estimate of a population parameter (the
population mean) is called point estimation. This is
distinguished from interval estimation, in which an
interval or range of values is provided for the measure
of interest.
-
- eta
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It ranges between 0 and 1.
When squared, eta represents the proportion of variation
in the dependent variable
which can be explained by knowledge of the independent variable. It
is only appropriate when the dependent variable is of interval type and the
independent variable is nominal or ordinal.
-
- exponential
distribution
- A continuous probability distribution useful for
characterizing random variables which may only take
positive values. It is often used to characterize the
time between events such as arrivals of customers at a
store. The distribution is completely determined by its
mean. For the exponential distribution, the mean and
standard deviation are equal. It is highly skewed to the right,
peaking at zero and decaying in a smooth (exponential)
fashion.
Parameter: mean B>0
Domain: X>=0
Mean: B
Variance: B^2
-
- exponential model
- A model fit in the Simple Regression Statlet which takes
the form:
Y = exp(a + b*X)
-
- exponential
smoothing
- A statistical technique commonly used to forecast time series data or
to smooth the values on a control
chart. A forecast function is estimated from previous
data using a weighted least squares technique. The degree
to which data in the far past is weighted relative to the
near past is governed by the value of one or more
smoothing constants, which must be between 0 and 1. In
general, the smaller the smoothing constant, the more
weight is given to the far past.
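Simple (single) exponential smoothing with one smoothing constant can be sketched as:

```python
def simple_exponential_smoothing(y, alpha):
    """Smooth series y with constant 0 < alpha <= 1.

    Smaller alpha gives more weight to the far past.
    """
    smoothed = [y[0]]
    for value in y[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
```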
-
- extreme
value distribution
- A distribution used for random variables which are
constrained to be greater or equal to 0. It is
characterized by two parameters: mode and scale.
Parameters: mode a>0, scale b>0
Domain: all real X
Mean: a-0.57721b
Variance: (3.14159265b)^2 / 6
- F distribution
- Also called the variance ratio distribution, it is used
for random variables which are constrained to be greater
or equal to 0. It is characterized by two parameters:
numerator degrees of
freedom and denominator degrees of freedom. It is
used most often as the sampling distribution for test
statistics which are created as the ratio of two variance
estimates.
Parameters: numerator degrees of freedom v (positive
integer),denominator degrees of freedom w (positive
integer)
Domain: X>0
Mean: w/(w-2) for w>2
Variance: [2w^2(v+w-2)]/[v(w-2)^2(w-4)] for w>4
-
- factor
- A variable in a statistical
model whose effect on the dependent variable or
variables is to be studied.
-
- Fisher's exact test
- This test is calculated by the Crosstabulation and
Contingency Table procedures whenever the sum of the
counts in the table does not exceed 100. It is a test of
the hypothesis that the row and column classifications
are independent. However, unlike the chi-square test, it can be
used with small cell counts, since it computes the exact
probability of obtaining a table similar to (or more
unusual than) that observed.
If the calculated P value
is small, i.e., less than a predetermined significance
level such as .05, then the hypothesis of independence
must be rejected.
-
- frequency
- The number of occurrences of a data value in your sample,
or the number of values falling within a fixed range.
Data can be summarized in this manner by tabulating all
the data values into distinct categories and then
counting the number of times each category appears in the
frequency distribution. This tabular summary is called a
frequency table. A graphical representation would be a
barchart or histogram.
Frequencies are calculated in the Tabulation,
Crosstabulation, and One Variable Analysis Statlets.
- gamma distribution
- A distribution used for continuous random variables which
are constrained to be greater or equal to 0. It is
characterized by two parameters: shape and scale. The
gamma distribution is often used to model data which is
positively skewed.
Parameters: shape a>0, scale B>0
Domain: X>=0
Mean: aB
Variance: aB^2
- Gaussian
distribution
- Another name for the normal
distribution.
- geometric
distribution
- A discrete probability distribution useful for
characterizing the time between Bernoulli trials. For
example, suppose machine parts are characterized as
defective or non-defective, and let the probability of a
defective part equal p. If you begin testing a sample of
parts to find a defective, then the number of parts which
must be tested before the first defective is found
follows a geometric distribution.
Parameters: event probability 0<=p<=1
Domain: X=0,1,2,...
Mean: (1-p)/p
Variance: (1-p)/p^2
-
- geometric mean
- A statistic calculated by multiplying n data values
together and taking the n-th root of the result. It is
often used as a measure of central tendency for
positively skewed
distributions. The geometric mean may also be calculated
by computing the arithmetic mean of
the logarithms of the data values and taking the inverse
logarithm of the result.
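The logarithmic route described above can be sketched as:

```python
import math

def geometric_mean(data):
    """Geometric mean via the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in data) / len(data))
```

This is equivalent to multiplying the n values together and taking the n-th root, but avoids overflow for long samples.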
- heteroscedasticity
- A term which refers to situations in which the
variability of the residuals
is not constant. Most statistical procedures such as
regression and analysis of variance assume that the
variability of the residuals is constant everywhere. If
heteroscedasticity is observed, it may often be removed
by transforming the dependent
variable using a square root or a logarithm.
- histogram
- A graphical display showing the distribution of data
values in a sample by dividing the range of the data into
non-overlapping intervals and counting the number of
values which fall into each interval. These counts are
called frequencies. Bars are
plotted with height proportional to the frequencies.
Histograms are calculated in several of the Analyze
Statlets.
-
- HSD intervals
- The "honestly significant difference" method of
comparing several pairs of means based upon the work of
John Tukey. The HSD technique allows you to compare all
pairs of means and be assured that, if there are no
differences, you will not detect a difference anywhere
more than a stated percentage of the time (the alpha risk).
The HSD method allows you to specify a "family
confidence level" for the comparisons which you are
performing, as opposed to the LSD
method which controls the error rate for a single
comparison. The HSD intervals will be wider than the LSD
intervals, making it harder to declare a pair of means to
be significantly different. The HSD test is therefore
more conservative.
-
- hypothesis tests
- Tests based on a sample of data to determine which of two
different states of nature is true. The two states of
nature are commonly called the null hypothesis, which
gets the benefit of the doubt, and the alternative
hypothesis. Important concepts in hypothesis testing
include the alpha risk, the beta risk, Type I errors, Type II errors,
and power.
- independent
- A property which results when the outcome of one trial
does not depend in any way on what happens in other
trials. For example, the tossing of dice are said to be
independent events if the first toss of the die in no way
influences the outcome of the second toss.
Two observations are said to be statistically independent
when the value of one observation does not influence, or
change, the value of another. Most statistical procedures
assume that the available data represents a random sample of
independent observations.
- independent
variable
- Independent variables are the factors whose effects are
to be studied and manipulated in an experiment. They are
called independent because the experimenter is free to
choose their levels. Examples are aging time, dollars
spent on advertising, assembly line number, or type of
catalyst.
There are two types of independent variables which are
often treated differently in statistical analyses:
(1) quantitative variables which differ in amounts that
can be ordered (e.g. weight of zinc in a battery,
temperature of a process, age of subject).
(2) qualitative variables which differ in
"types" that can not be ordered (e.g. method of
training, brand of car, gender of subject).
By convention when graphing data, the independent
variable is plotted along the X-axis with the dependent variable on the
Y-axis.
-
- interaction
- A situation in which the effect of one factor
depends upon the level of another factor. Interactions
are included in statistical
models whenever the factors do not act in a purely
additive manner.
-
- intercept
- The constant term in the equation of a regression line.
If the equation of the line is given as Y = A+B*X, then
the intercept is A. The intercept is the point on the
regression line where the X variable, also known as the
independent variable, equals 0.
-
- interquartile range
- The distance between the upper and lower quartiles. As a
measure of variability, it is less sensitive than the standard
deviation or range
to the possible presence of outliers.
The interquartile range is calculated in the One Variable
Analysis Statlet. It is also used to define the box in a box-and-whisker plot.
-
- interval type
variable
- A type of variable in which the distance between values
of that variable is meaningful. For example, temperature
is measured on an interval scale. However, there may be
no natural origin, as when temperature is measured in
degrees Celsius or Fahrenheit. A variable with a natural
origin is said to be measured on a ratio scale.
- Kendall's tau b
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. It ranges from -1 to +1 and
is based on the number of concordant
and discordant pairs of
observations. A concordant pair is one in which the two
variables (row and column) have the same relative ranking
(greater than or less than). A discordant pair is one in
which the two variables have the opposite ranking. Both
variables must be ordinal.
A correction is made for tied pairs.
-
- Kendall's tau c
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets.
-
- Kruskal-Wallis test
- A test which compares the medians
of multiple samples using a nonparametric
technique. It first combines all of the data and
sorts them from smallest to largest, giving a rank of 1
to the smallest value and a rank of n to the largest
value. It then calculates the average rank of the values
within each group and computes a statistic to determine
whether there are significant differences between those
average ranks. The P
value on the table is of most interest. If it falls
below 0.05, we may conclude with 95% confidence that
there are significant differences between the medians of
the various groups.
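The ranking steps above can be sketched as follows (an illustration assuming no tied values; the procedure itself also applies a tie correction):

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples (no tie correction)."""
    # Pool all values, remembering which group each came from
    pooled = sorted((v, gi) for gi, group in enumerate(groups) for v in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    # Compare the groups' average ranks
    return 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)
```

H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom to obtain the P value.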
-
- kurtosis
- A measure of the peakedness of a distribution calculated
in several Statlets. This statistic is useful in
determining how far your data departs from a normal
distribution.
For the normal distribution, the theoretical kurtosis
value equals 0 and the distribution is described as
mesokurtic. (Note: some authors define kurtosis such that
a normal distribution has a value = 3. In STATLETS, the 3
has been subtracted away.) If the distribution has long
tails (i.e., an excess of data values near the mean and
far from it) like the t-distribution, the statistic will
be greater than 0. Such distributions are called
"leptokurtic". Values of kurtosis less than 0
result from curves that have a flatter top than the
normal distribution. They are called
"platykurtic". To judge whether data departs
significantly from a normal distribution, a standardized
kurtosis statistic can also be computed.
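A minimal moment-based sketch of excess kurtosis (normal = 0, as in STATLETS' convention of subtracting 3):

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis; 0 for a normal distribution."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0
```

A flat-topped (platykurtic) sample gives a negative value; heavy tails give a positive one.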
- lambda
- A statistic calculated in the Crosstabulation and
Contingency Table Statlets. On a scale of 0 to 1, it
measures the relative improvement in predicting either
rows or columns given knowledge of the other.
-
- Laplace
distribution
- A distribution used for continuous random variables which
is more peaked than the normal. It
is characterized by two parameters: mean and scale. The
Laplace distribution is sometimes called the "double
exponential" distribution.
Parameters: mean a, scale B>0
Domain: all real X
Mean: a
Variance: 2*B^2
-
- least squares means
- The predicted values for the dependent variable at each
level of a selected factor, averaged over all the levels
of the other factors. In an unbalanced design, these
adjusted means will not equal the simple level means. The
least squares means try to place each level of the factor
on an equal footing by predicting the response at the
same levels of the other factors.
-
- leverage
- A measure of how much influence a single observation has
on a fitted regression
model. Leverage is important since isolated points
far from all the others may have a major impact on the
fitted model. The regression Statlets list points whose
leverage is very large, so that you may assess whether
those points are improperly distorting the estimated
model.
-
- linear model
- In the Simple Regression Statlet, a model which takes the
form:
Y = a + b*X
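The coefficients a and b are found by least squares. A minimal sketch of the computation (illustrative only; the Statlet's own routine is not shown):

```python
def fit_line(x, y):
    # Least squares estimates: b = Sxy / Sxx, a = ybar - b * xbar.
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b
```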
-
- linear
statistical model
- A model in which the coefficients enter in an additive
manner. Most models estimated by STATLETS are linear
models.
-
- log probit model
- In the Simple Regression Statlet, a model which takes the
form:
Y = normal(a + b*ln(X))
where normal() denotes the standard normal cumulative
distribution function.
It can only be fit if all values of Y lie between 0 and
1.
-
- logarithmic-X model
- In the Simple Regression Statlet, a model which takes the
form:
Y = a + b*ln(X)
-
- logistic
distribution
- A distribution used for continuous random variables. It
is similar in shape to the normal distribution but has
heavier tails.
Parameters: mean a, standard deviation B>0
Domain: all real X
Mean: a
Variance: B^2
-
- logistic model
- In the Simple Regression Statlet, a model which takes the
form:
Y = exp(a + b*X)/(1 + exp(a + b*X))
It can only be fit if all values of Y lie between 0 and
1.
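When all Y values lie strictly between 0 and 1, one simple way to fit this form is to linearize with the logit transform ln(Y/(1-Y)) = a + b*X and apply ordinary least squares. This is a sketch under that assumption, not necessarily the fitting method STATLETS uses.

```python
import math

def fit_logistic(x, y):
    # Linearize: ln(Y / (1 - Y)) = a + b*X, requires 0 < Y < 1.
    z = [math.log(yi / (1 - yi)) for yi in y]
    n = len(x)
    xbar = sum(x) / n
    zbar = sum(z) / n
    b = (sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
         / sum((xi - xbar) ** 2 for xi in x))
    a = zbar - b * xbar
    return a, b

def predict_logistic(a, b, xi):
    # Back-transform to the original S-shaped curve.
    e = math.exp(a + b * xi)
    return e / (1 + e)
```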
-
- lognormal
distribution
- A distribution which is used for random variables which
are constrained to be greater or equal to 0. It is
characterized by two parameters: mean and standard
deviation. The lognormal distribution is often used to
model data which is positively skewed. If a sample of
data comes from a lognormal distribution, then the log of
the data can be considered to have come from a normal
distribution.
Parameters: log mean mu, log standard deviation sigma>0
Domain: X>0
Mean: exp(mu + sigma^2/2)
Variance: exp(2*mu + sigma^2)*(exp(sigma^2) - 1)
-
- lower quartile
- The 25th percentile,
calculated by ordering the data from smallest to largest
and finding the value which lies 25% of the way up
through the data.
Box and whisker plots
are a graphical display of a sample in terms of its
quartiles.
-
- LSD intervals
- The Least Significant Difference method, attributed to
Fisher, is a method of controlling Type I errors when
comparing several pairs of means. Here alpha is defined
as the number of Type I errors with respect to tests on
differences between means divided by the number of
comparisons. In this method of comparisons, the level of
significance is applied to each pairwise comparison as
contrasted with the HSD
"experimentwide" error rate.
-
- Mahalanobis
distance
- A statistic which measures the distance of a single data
point from the sample mean or centroid in the space of
the independent variables used to fit a multiple regression
model. It provides a way of finding points which are
far from all of the others in a multidimensional space.
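A minimal sketch of the squared distance from the sample centroid, assuming numpy is available (illustrative; not the STATLETS routine):

```python
import numpy as np

def mahalanobis_sq(point, data):
    # Squared distance of a point from the sample centroid,
    # scaled by the sample covariance of the data.
    x = np.asarray(data, dtype=float)
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    d = np.asarray(point, dtype=float) - mu
    return float(d @ np.linalg.inv(cov) @ d)
```

Unlike plain Euclidean distance, this accounts for the scale and correlation of the independent variables, so points far from the cloud in any direction stand out.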
-
- maximum
- In a sample of data, the largest observation.
-
- mean
- A statistic which measures the center of a sample of data
by adding up the observations and dividing by the number
of data points. It may be thought of as the center of
mass or balancing point for the data, i.e., that point at
which a ruler would balance if all the data values were
placed along it at their appropriate numerical values.
Regardless of the distribution from which the data comes,
the Central Limit
Theorem shows that as the sample size increases,
sample means will tend to follow a normal
distribution. Unlike the sample median,
outliers can have a
large impact on the calculated sample mean.
-
- mean absolute error
- The mean of the absolute value of the residuals from a fitted
statistical model.
-
- measurement
- In statistical analysis, the "level of
measurement" determines how a variable should be
analyzed. Variables can be classified into different
groups by how they are measured as follows:
(1) Nominal
variables are named categories, for example
"gender of worker." Numerical values can be
assigned to these nominal variables for analysis, but the
numbers have no true numerical meaning - they are just
"codes" for data analysis purposes (e.g.,
female=1, male=2).
(2) Ordinal variables
consist of categories which can be arranged in order. For
example, "job satisfaction" might be measured
on a scale of 1 to 10. The numbers assigned to this
variable allow you to put the responses in order (from
low to high), but the actual distances between the
numeric codes have no true numeric meaning.
(3) Interval variables
provide measurement where the distance between values is
meaningful but where there is no true zero point. An
example is the model year of a car.
(4) Ratio
variables are measured on a scale where the
proportions, or ratios, between items are meaningful. In
other words, there is a true zero point on the scale. An
example is the weight of an automobile in pounds.
Nominal and ordinal variables are both said to be
"categorical" or "qualitative" data,
and interval and ratio can both be described as
"numerical" or "quantitative" data.
Numerical data can further be broken down into discrete or continuous
variables.
-
- median
- A statistic which measures the center of a set of data by
finding that value which divides the data in half. A
technical definition is that the median is the value
which is greater than or equal to half of the values in
the data set and less than or equal to half the values.
To compute the median yourself, you would sort the data
from smallest to largest and:
(1) for an odd number of observations, pick the center
value.
(2) for an even number of observations, pick the value
halfway between the center two.
For a symmetric
distribution such as the normal distribution, the
median is the same as the mean. For a
distribution which is skewed
to the right (left), the median is typically smaller
(larger) than the mean.
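The two rules above can be written directly:

```python
def median(data):
    s = sorted(data)          # order from smallest to largest
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]         # odd count: the center value
    return (s[mid - 1] + s[mid]) / 2  # even count: halfway between center two
```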
-
- minimum
- In a sample of data, the smallest observation.
-
- missing values
- In a sample of data, desired observations which could not
be obtained.
When entering data into STATLETS, missing values are
normally entered as blank cells. This is true if entering
the data directly into the STATLET spreadsheet, or if
importing the data from another software package.
Most procedures handle missing values in one of the
following manners:
(1) by excluding all rows which have missing values in
any of the variables being analyzed. This is called the
"listwise" method, and it is the most common.
It is used in Statlets such as Multiple Regression.
(2) by excluding all observations which are missing for
each variable or pair of variables separately. This is
frequently called the "pairwise" method. It is
an option in certain Statlets.
(3) by estimating an appropriate replacement value, which
is done in the time series analysis Statlets which must
preserve the sequential nature of a set of data.
Internally, missing values are stored as blanks for
character variables and as the number -32768 for numeric
variables. Normally, users need not be concerned with
this internal representation.
-
- mode
- A statistic defined as the most frequently occurring data
value. It is sometimes used as an alternative to the mean or median as
a measure of central tendency. It is calculated in the
Analyze Statlets. If more than one value occurs equally
often, no value is printed.
The mode is a particularly useful summary statistic when
the data is measured on a nominal
scale.
For grouped data, the mode is usually defined as the
midpoint of the interval containing the highest frequency
count. In some distributions, there may be more than one
mode: two high points (bimodal) or many high points
(multimodal distributions).
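A sketch of the rule for ungrouped data, returning None when more than one value occurs equally often (mirroring the "no value is printed" behavior described above):

```python
from collections import Counter

def mode(data):
    # Most frequently occurring value, or None on a tie.
    counts = Counter(data).most_common()
    best = counts[0][1]
    modes = [value for value, c in counts if c == best]
    return modes[0] if len(modes) == 1 else None
```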
-
- moving average
- The average of the most recent K data values, where K is
the "order" or "span" of the moving
average. This is one of the methods often used for
forecasting time
series data. Large values for K give good results for
very stable series. Smaller values are needed for series
which tend to change level frequently.
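A one-step-ahead forecast based on the most recent K values can be sketched as:

```python
def moving_average_forecast(series, k):
    # Forecast the next value as the mean of the last k observations.
    if len(series) < k:
        raise ValueError("need at least k observations")
    return sum(series[-k:]) / k
```

A large k smooths heavily (good for stable series); a small k tracks level changes more quickly.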
-
- moving average
model
- A type of ARIMA model which
relates the observed data value at time t to random
errors in the current and previous time periods. Common
models are:
Nonseasonal order 1: Y(t) = a + A(t) - b*A(t-1)
Nonseasonal order 2: Y(t) = a + A(t) - b*A(t-1) - c*A(t-2)
Seasonal order 1: Y(t) = a + A(t) - b*A(t-s)
Seasonal order 2: Y(t) = a + A(t) - b*A(t-s) - c*A(t-2s)
where A(t) is a random error term at time t.
These models can be fit using the Forecasting Statlet in
the time series section.
-
- moving range chart
- A control chart which plots
the moving range of groups of 2 observations. It is used
when plotting individuals data where true subgroups are
not available. The moving range of 2 is equivalent to
computing the absolute value of differences between
consecutive points.
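Since the moving range of 2 is just the absolute difference between consecutive points, the plotted values reduce to:

```python
def moving_ranges(values):
    # Absolute differences between consecutive individual observations.
    return [abs(b - a) for a, b in zip(values, values[1:])]
```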
-
- multicollinearity
- A condition in which the predictor variables in a regression model are themselves
highly correlated. Multicollinearity often leads to
models in which the coefficients are poorly estimated
and, while a fitted model may be good for predictive
purposes, it can be difficult to interpret the relative
effects of the various predictor variables. One of the
important properties of a designed experiment is that it
avoids such a condition.
-
- multiple
regression model
- A statistical
model relating a single dependent
variable to two or more independent variables.
-
- multiplicative
model
- In the Simple Regression Statlet, a model which takes the
form:
Y = a*X^b
The model is fit in the metric
ln(Y) = ln(a) + b*ln(X)
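Fitting in the log-log metric can be sketched as follows: least squares on the transformed data, with a recovered by exponentiating the fitted intercept. This is an illustration, not the Statlet's own routine, and it assumes all X and Y values are positive.

```python
import math

def fit_multiplicative(x, y):
    # Fit ln(Y) = ln(a) + b*ln(X) by least squares; requires X, Y > 0.
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    xbar = sum(lx) / n
    ybar = sum(ly) / n
    b = (sum((u - xbar) * (v - ybar) for u, v in zip(lx, ly))
         / sum((u - xbar) ** 2 for u in lx))
    a = math.exp(ybar - b * xbar)  # back-transform the intercept
    return a, b
```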
Title Page
Revised: July 30, 1997.
Copyright © 1997 by NWP Associates, Inc.
All trademarks or product names mentioned herein are the property
of their respective owners.