# Glossary of Terms


# - A -

adjusted R-squared
The regression R-squared statistic "corrected" for the number of independent variables in a multiple regression analysis. It is often used to compare models involving different numbers of coefficients. The adjusted R-squared statistic is interpreted as:

(1) a measure of the goodness of fit of the least squares regression line.

(2) the proportion of variance in the dependent variable accounted for by the independent variables.
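
As a rough illustration of the "correction", the sketch below applies the usual textbook adjustment, 1 - (1 - R-squared)(n - 1)/(n - k - 1), for a model with k independent variables plus a constant term. The function name and the example numbers are hypothetical, and the glossary does not state the exact formula STATLETS uses.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjust R-squared for k independent variables fitted to n observations.

    Assumes the model also contains a constant term, so the residual
    degrees of freedom are n - k - 1.
    """
    if n - k - 1 <= 0:
        raise ValueError("need more observations than coefficients")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical example: R-squared of 0.90 from 3 predictors and 24 observations.
print(round(adjusted_r_squared(0.90, n=24, k=3), 4))
```

Adding predictors can only raise the ordinary R-squared, whereas the adjusted value falls when a new variable does not explain enough extra variance to justify the lost degree of freedom.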

alpha risk
The probability of rejecting a null hypothesis when it is true. It is the probability of making a Type I error. For more details, see hypothesis tests.

ARIMA models
ARIMA (AutoRegressive Integrated Moving Average) models are often used to forecast time series data. They provide a flexible class of models which can represent the dynamic behavior found in many sets of data. To select an ARIMA model, you must specify several parameters:

p = order of the autoregressive part of the model
d = order of differencing (if any)
q = order of the moving average part of the model.

Separate values are entered for the nonseasonal part of the model and for the seasonal part. The model may or may not include a constant term.

aspect ratio
For a graphics device, the ratio of the screen dimensions, normally defined as

vertical screen dimension
---------------------------
horizontal screen dimension

Unless the surface on which your graph is displayed is perfectly square (an aspect ratio equal to 1), objects such as circles will appear distorted if the aspect ratio is not adjusted for.

autocorrelation
At lag k, the correlation between the data value at time t and the data value at time (t-k). Autocorrelations are often calculated for time series data to determine how the correlation between data values varies with the distance or time "lag" between them.

For each autocorrelation, a corresponding standard error is calculated. If the time series is random, all of the autocorrelations should be within approximately +/- 2 standard errors of zero. Estimates extending beyond this distance indicate significant correlation between data values separated by the indicated time lag.
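
A minimal sketch of the calculation follows. It uses the common sample autocorrelation formula and, as a simplifying assumption, the rough standard error 1/sqrt(n) that applies to a purely random series. The function names and the data are hypothetical, and the software may use a more refined standard error.

```python
import math


def autocorrelation(values, k):
    """Sample autocorrelation between values at time t and time t - k."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((y - mean) ** 2 for y in values)
    numer = sum((values[t] - mean) * (values[t - k] - mean) for t in range(k, n))
    return numer / denom


def white_noise_standard_error(n):
    """Approximate standard error of an autocorrelation for a random series."""
    return 1.0 / math.sqrt(n)


series = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical data
r1 = autocorrelation(series, 1)
limit = 2 * white_noise_standard_error(len(series))
print(round(r1, 3), round(limit, 3), abs(r1) > limit)
```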

autoregressive models
A statistical model for time series which relates the observed data value at time t to the values in previous time periods. Common models are:

Nonseasonal order 1: $Y_t = a + b Y_{t-1} + A_t$
Nonseasonal order 2: $Y_t = a + b Y_{t-1} + c Y_{t-2} + A_t$
Seasonal order 1: $Y_t = a + b Y_{t-s} + A_t$
Seasonal order 2: $Y_t = a + b Y_{t-s} + c Y_{t-2s} + A_t$

where $A_t$ is a random error term at time t and s is the length of seasonality (such as 12 for monthly data). These models can be fit using the Forecasting Models Statlet in the time series section.
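
To make the nonseasonal order 1 model concrete, the sketch below estimates a and b by ordinary least squares, regressing each value on the one before it. This is only one possible estimation method, chosen for brevity. The function name is hypothetical, and the Forecasting Models Statlet may estimate the coefficients differently.

```python
def fit_ar1(values):
    """Least-squares estimates of a and b in the nonseasonal order 1 model."""
    x = values[:-1]  # value in the previous period
    y = values[1:]   # value in the current period
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical noise-free series generated with a = 2 and b = 0.5.
a, b = fit_ar1([10, 7, 5.5, 4.75, 4.375])
print(round(a, 6), round(b, 6))
```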

average
A statistic calculated by summing a set of data values and dividing by the number of values. Used to measure the center of a sample of variable data.

# - B -

Bernoulli trial
An experiment with only two possible outcomes, such as success or failure, heads or tails, or good or bad. The probability of a selected outcome (such as a success) is normally referred to as p. Related probability distributions include the binomial distribution, the geometric, and the negative binomial.

Parameter: event probability 0<=p<=1

Domain: X=0,1

Mean: p

Variance: p(1-p)

beta distribution
A distribution used for continuous random variables which are constrained to lie between 0 and 1. It is characterized by two parameters: shape and scale.

Parameters: shape a>0
scale B>0

Domain: 0<=X<=1

Mean: a/(a+B)
Variance: $aB / [(a+B)^2 (a+B+1)]$

beta risk
The probability of not rejecting a null hypothesis when it is false. It is the probability of making a Type II error, or (1.0-power). For more details, see hypothesis tests.

binomial distribution
A distribution which gives the probability of observing X successes in a fixed number (n) of independent Bernoulli trials. p represents the probability of a success on a single trial.

Parameters: event probability 0<=p<=1
number of trials: n>0

Domain: X=0,1,...,n

Mean: np

Variance: np(1-p)
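
The mean and variance above can be checked numerically. The sketch below builds the binomial probabilities from the standard formula C(n, x) p^x (1-p)^(n-x) and sums over the domain. The function name and the values of n and p are hypothetical.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent Bernoulli trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.3  # hypothetical number of trials and event probability
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * pr for x, pr in enumerate(probs))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))
print(round(mean, 6), round(variance, 6))  # compare with n*p and n*p*(1-p)
```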

Bonferroni intervals
Used when comparing several means, these intervals bound the estimation error in each mean using a method suggested by Bonferroni. This method is appropriate whether or not the group sizes are all the same. Using Student's t distribution, it generates intervals which allow you to make a specified number of linear comparisons among the sample means while controlling the experiment-wide error rate at a specified level. In the STATLETS ANOVA procedures, the number of comparisons is set to allow you to compare the differences between all pairs of means.

box-and-whisker plot
A graphical display designed to provide a quick summary of a sample of numeric data. It consists of a box extending from the lower quartile to the upper quartile, with a central line at the median. Whiskers extend to the smallest and largest points which are not classified as outside points.

# - C -

capability indices
Indices computed in the Process Capability procedure to measure how well a sample of data conforms to process specifications. The indices include:

(1) CP = (USL - LSL) / (6*sigma)

(2) CPK = smaller of CPK(upper) and CPK(lower)

(3) CPK(upper) = (USL - sample mean) / (3*sigma)

(4) CPK(lower) = (sample mean - LSL) / (3*sigma)

where USL and LSL represent the upper and lower specification limits, respectively. Normally, all of these indices should be greater than or equal to 1.33 for the process to be deemed "capable" of meeting the specifications.
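
The four indices translate directly into code. The sketch below is a plain transcription of the formulas in this entry. The function name, the dictionary keys, and the example process values are hypothetical.

```python
def capability_indices(mean, sigma, lsl, usl):
    """Return CP, CPK(upper), CPK(lower) and CPK for the given limits."""
    cp = (usl - lsl) / (6 * sigma)
    cpk_upper = (usl - mean) / (3 * sigma)
    cpk_lower = (mean - lsl) / (3 * sigma)
    return {
        "CP": cp,
        "CPK(upper)": cpk_upper,
        "CPK(lower)": cpk_lower,
        "CPK": min(cpk_upper, cpk_lower),
    }


# Hypothetical process: the mean sits closer to the upper limit.
indices = capability_indices(mean=10.1, sigma=0.1, lsl=9.6, usl=10.4)
for name, value in indices.items():
    print(name, round(value, 3), "capable" if value >= 1.33 else "below 1.33")
```

In this example CP meets the 1.33 guideline but CPK does not, because the process is off-center: CP looks only at the spread, while CPK also penalizes a mean that drifts toward one limit.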

case
A case consists of measurements on variables for an individual subject or experimental unit. For example, you might take many measurements on a single person: variables such as height, weight, gender, etc. The individual person's measurements on all the variables represent a case. A measurement on a single variable is called an observation.

Sometimes the terms case and observation are used interchangeably. Strictly speaking, however, a case represents one computer record, or one row of data in a spreadsheet, for an individual experimental unit. Each individual case (or person in this example) will have a row of data contained in the data file. The data file will have n rows of data, where n is the number of cases.

central limit theorem
The Central Limit Theorem is an important mathematical result which states that for a random sample of observations from any distribution with a finite mean and a finite variance, the average will tend to follow a normal distribution for large samples. This theorem is the main justification for the widespread use of confidence intervals based on the normal distribution and for t tests when estimating the mean and comparing two means.

chi-square distribution
A distribution used for random variables which are constrained to be greater than or equal to 0. It is characterized by one parameter: degrees of freedom. The chi-square distribution is used most often as the sampling distribution for various statistical tests.

Parameter: degrees of freedom v (positive integer)

Domain: X>=0

Mean: v

Variance: 2v

chi-square test
A test performed on a two-way frequency table to test whether two variables can be considered statistically independent. It is calculated in the Crosstabulation and Contingency Table Statlets.

In calculating the chi-square test, the observed frequency in each cell is compared to the frequency which would be expected if the row and column classifications were independent. If the calculated statistic is large (i.e., if its P value is less than a predetermined significance level such as .05), then the null hypothesis of independence must be rejected.

The chi-square test is valid only if the expected frequency in each cell is relatively large. If any frequency is less than 5, a warning is displayed.

If the two-way table contains exactly 2 rows and 2 columns and the total count in the table does not exceed 100, Fisher's exact test is also performed.
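
The comparison of observed and expected frequencies can be sketched as follows. The code computes the standard Pearson chi-square statistic and flags expected frequencies below 5. The function name and the example table are hypothetical, and the P value is omitted because it requires the chi-square distribution function.

```python
def chi_square_independence(table):
    """Chi-square statistic for independence in a two-way frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected frequency if rows and columns were independent.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    small_expected = any(exp < 5 for exp_row in expected for exp in exp_row)
    return statistic, dof, small_expected


observed = [[20, 30], [30, 20]]  # hypothetical 2-by-2 table
statistic, dof, small_expected = chi_square_independence(observed)
print(statistic, dof, small_expected)
```

With 1 degree of freedom, the statistic would be compared with the 0.05 critical value of about 3.84.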

classical methods
The estimation of confidence intervals and use of hypothesis tests have been widely applied for over 50 years. They are sometimes referred to as classical procedures, in contrast to newer nonparametric and EDA techniques which tend to be much less sensitive (more robust) to the assumption that data follows a normal distribution.

coefficient of variation (CV)
The ratio of the standard deviation divided by the mean, multiplied by 100, so that it is expressed as a percent. Sometimes called the relative standard deviation.

This summary statistic is often employed in the natural sciences, where the standard deviation of measurement error is often proportional to the magnitude of the values being measured. Since the coefficient of variation provides a measure of relative variation and is scale-free, it is particularly useful in making comparisons between different samples.

The CV is calculated in the One Variable Analysis and Multiple Samples Statlets.
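
A short sketch shows why the statistic is described as scale-free. It assumes the sample standard deviation (divisor n - 1). The function name and the data are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


grams = [2, 4, 6]                          # hypothetical measurements
milligrams = [1000 * g for g in grams]     # same data in different units
print(coefficient_of_variation(grams), coefficient_of_variation(milligrams))
```

Changing the units rescales the mean and the standard deviation by the same factor, so the ratio is unchanged.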

concordant
A pair of cases for two ordered data variables in which values for the first case are either both higher or both lower than the values of the variables for the second case. For example, the following pair is concordant:

X1 X2
10 100
20 150

conditional gamma
A statistic calculated in the Crosstabulation and Contingency Table Statlets. It ranges from -1 to +1 and is based on the number of concordant and discordant pairs of observations. A concordant pair is one in which the two variables (row and column) have the same relative ranking (greater than or less than). A discordant pair is one in which the two variables have the opposite ranking. Both variables must be ordinal. No correction is made for ties.

conditional sum of squares
In a multiple regression analysis, the sums of squares attributable to each individual independent variable when entered into the model in the order specified. They are used to determine the contribution each independent variable makes to the regression model as it is added. If the variables are highly correlated with each other, however, this contribution could vary greatly if the variables were entered in a different order.

confidence interval
A statistic constructed from a set of data to provide an interval estimate for a parameter. For example, when estimating the mean of a normal distribution, the sample average provides a point estimate or best guess about the value of the mean. However, this estimate is almost surely not exactly correct. A confidence interval provides a range of values around that estimate to show how precise the estimate is. The confidence level associated with the interval, usually 90%, 95%, or 99%, is the percentage of times in repeated sampling that the intervals will contain the true value of the unknown parameter.

confidence limits
Limits displayed in the Simple Regression Statlet to show the precision with which the fitted model has been estimated. If the limits are drawn at a confidence level of 95%, then we can be 95% confident that the mean of many new observations taken at a given value of X would fall within these limits.

contingency coefficient
A statistic calculated in the Crosstabulation and Contingency Table Statlets. It measures the degree of association between the values of the row and column variables on a scale of 0 to 1, based on the usual chi-square statistic. It cannot in general attain the value 1 for all tables.

continuous
A type of random variable which may take any value over an interval. This is contrasted with a discrete random variable, which may take only a limited set of values.

control chart
A chart used to determine whether the distribution of data values generated by a process is stable over time. A control chart plots a statistic versus time.

correlation
A measure of association between two variables. It measures how strongly the variables are related, that is, how closely they change together. If two variables tend to move up or down together, they are said to be positively correlated. If they tend to move in opposite directions, they are said to be negatively correlated. Correlations are computed in the Multiple Regression Statlet.

The most common statistic for measuring association is the Pearson correlation coefficient.

correlation coefficient
Denoted as "r", a measure of the linear relationship between two variables. The absolute value of "r" provides an indication of the strength of the relationship.

The value of "r" varies between -1 and +1, with -1 or +1 indicating a perfect linear relationship and r = 0 indicating no linear relationship. The sign of the correlation coefficient indicates whether the slope of the line is positive or negative when the two variables are plotted in a scatterplot.
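
The sketch below computes the Pearson correlation coefficient from its usual definition, the sum of crossproducts of deviations divided by the square root of the product of the sums of squared deviations. The function name and the data are hypothetical. The last example illustrates that r measures only linear association.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


print(pearson_r([1, 2, 3], [2, 4, 6]))    # points on a rising line
print(pearson_r([1, 2, 3], [6, 4, 2]))    # points on a falling line
xs = [-2, -1, 0, 1, 2]
print(pearson_r(xs, [v * v for v in xs]))  # exact curve, but not a line
```

In the third case Y is completely determined by X, yet the coefficient is zero because the relationship is symmetric rather than linear.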

covariance
A measure of the joint variability of a pair of numeric variables. It is based upon the sum of crossproducts of the values of the variables.

covariate
A quantitative factor which varies together with the dependent variable. In an analysis of covariance, the relationship between the dependent variable and the covariate is first adjusted for before the effects of the other factors are examined.

Cramer's V
A statistic calculated in the Crosstabulation and Contingency Table Statlets. It measures the degree of association between the values of the row and column variables on a scale of 0 to 1, based on the usual chi-square statistic. Unlike the contingency coefficient, it can attain the value 1 for all tables.
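
The difference between the two chi-square-based measures can be seen in a small example. The sketch assumes the standard definitions, C = sqrt(chi2 / (chi2 + n)) for the contingency coefficient and V = sqrt(chi2 / (n * (min(rows, columns) - 1))) for Cramer's V. The function names and the table are hypothetical.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (obs - r * c / total) ** 2 / (r * c / total)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )


def contingency_coefficient(table):
    chi2 = chi_square_statistic(table)
    n = sum(sum(row) for row in table)
    return math.sqrt(chi2 / (chi2 + n))


def cramers_v(table):
    chi2 = chi_square_statistic(table)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


perfect = [[10, 0], [0, 10]]  # hypothetical table with perfect association
print(round(contingency_coefficient(perfect), 4), cramers_v(perfect))
```

Even with perfect association in this 2-by-2 table, the contingency coefficient stays well below 1, while Cramer's V reaches 1.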

cumulative frequency
The number of observations falling in a given class in a frequency table, plus all observations falling in earlier classes. Cumulative frequencies are calculated in the Tabulation and Crosstabulation Statlets.

cumulative probability
The probability that a random variable will be less than or equal to a specified value. For example, the cumulative probability equals 0.5 for the median.

cumulative relative frequency
The number of observations falling in a given class in a frequency table, plus all observations falling in earlier classes, divided by the total number of observations.

# - D -

data file
A collection of observations on several characteristics, arranged in the form of a spreadsheet. Each row of the file represents a single case or observation. Each column represents a single characteristic or variable.

degrees of freedom
A term used in statistics to characterize the number of independent pieces of information contained in a statistic. For example, if we begin with a random sample of n observations and estimate the mean by the sample average, we are left with only (n-1) independent measurements from which to estimate the variance or deviations around the mean. In a simple regression, where we estimate both an intercept and a slope, only (n-2) degrees of freedom remain to measure variability around the fitted line.

density function
A mathematical function used to determine probabilities for a continuous random variable. The bell-shaped curve corresponding to a normal distribution is one example. To determine the probability of finding a value between two limits, the area under the density function between those limits is computed.

dependent variable
The variable whose behavior is to be measured as a result of an experiment. For example, in a consumer research study on the effects of types of packaging on purchase behavior (where the size, shape and colors of the box are the independent variables), the dependent variable would be the quantity of product purchased.

Ideally, the dependent variable should be reliable, sensitive, easy to measure, and distributed in a way that conforms to the assumptions of a statistical model.

By convention, the dependent variable is plotted along the vertical Y-axis with the independent variable on the horizontal X-axis.

DFITS
A statistic computed when fitting a multiple regression model to measure the change in each predicted value which would occur if a single data value were deleted. Large values correspond to points which have a large influence on the fitted model.

differencing
An operation often used to make a nonstationary time series approximately stationary. A nonstationary time series is one which does not have a fixed mean. The types of differencing normally performed are:

Nonseasonal of order 1: $Y_t - Y_{t-1}$
Nonseasonal of order 2: $(Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})$
Seasonal of order 1: $Y_t - Y_{t-s}$
Seasonal of order 2: $(Y_t - Y_{t-s}) - (Y_{t-s} - Y_{t-2s})$
Mixed seasonal and nonseasonal: $(Y_t - Y_{t-1}) - (Y_{t-s} - Y_{t-s-1})$

where s is the length of seasonality (such as 12 for monthly data).
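
All of the differencing operations listed above can be built from a single lagged difference applied once or twice. The sketch below illustrates this. The function name and the data are hypothetical.

```python
def difference(values, lag=1):
    """Subtract from each value the value observed `lag` periods earlier."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


squares = [1, 4, 9, 16, 25]                 # hypothetical series with a curved trend
order1 = difference(squares)                # nonseasonal of order 1
order2 = difference(order1)                 # nonseasonal of order 2
print(order1, order2)

quarterly = [1, 2, 3, 4, 2, 3, 4, 5]        # hypothetical data with s = 4
seasonal = difference(quarterly, lag=4)     # seasonal of order 1
mixed = difference(difference(quarterly), lag=4)  # mixed seasonal and nonseasonal
print(seasonal, mixed)
```

Each differencing pass shortens the series by the lag used, so a mixed difference with s = 4 loses five observations in total.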

discordant
A pair of cases for two ordered data variables in which the value of one variable for the first case is higher (or lower) than its value in the second case, and the relative relationship is switched for the second variable. For example, the following pair is discordant:

X1 X2
10 100
20 50

discrete
A type of random variable which may take on only a countable set of values, such as 1,2,3,...,10. The list may be finite, or it may be countably infinite (such as 0,1,2,...). A discrete random variable is to be contrasted with a continuous random variable.

discrete uniform distribution
A discrete distribution which allocates equal probabilities to all integer values between a lower and upper limit.

Parameters: lower limit a
upper limit b>a

Domain: X=a,a+1,a+2,...,b

Mean: (a+b)/2

Variance: $[(b-a+1)^2 - 1]/12$

distribution
A probability function which describes the relative frequency of occurrence of data values when sampled from a population. Distributions are either continuous, typically used for variables which can be measured, or discrete, typically used for data that are the result of counts.

double exponential distribution
Another name for the Laplace distribution.

double reciprocal model
A model fit in the Simple Regression Statlet which takes the form:

Y = 1 / (a + b/X)

Duncan's procedure
A multiple comparisons procedure which allows you to compare all pairs of means while controlling the overall alpha risk at a specified level. It is a multiple-stage procedure based on the Studentized range distribution. While it does not provide interval estimates of the difference between each pair of means, it does indicate which means are significantly different from which others.

Durbin-Watson statistic
A statistic employed in multiple regression analysis to determine if sequential (adjacent) residuals are correlated. One of the assumptions of regression analysis is that the residuals (errors) are independent of each other. Sometimes, however, the data set may unknowingly contain an "order effect," meaning that a previous measurement could influence the outcome of the successive observations.

If the residuals are not correlated, the Durbin-Watson statistic should be close to 2. Small values (less than about 1.4) indicate positive correlation between successive residuals, while large values indicate a negative correlation. A graphical means to examine an "order effect" is to plot the residuals against row order.
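
The statistic itself is simple to compute: the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses that standard definition. The function name and the residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in row (time) order."""
    numer = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denom = sum(e ** 2 for e in residuals)
    return numer / denom


runs = [1, 1, -1, -1]          # neighbors tend to agree: positive correlation
alternating = [1, -1, 1, -1]   # neighbors tend to differ: negative correlation
print(durbin_watson(runs), durbin_watson(alternating))
```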

# - E -

EDA
A class of statistical techniques which are designed to let the analyst display data and extract information from it in a manner which is not overly sensitive to any underlying assumptions about how that data is distributed. These Exploratory Data Analysis (EDA) techniques include box-and-whisker plots, stem-and-leaf displays, and the widespread use of medians in place of means.

Many popular EDA techniques were developed by Prof. John Tukey and his colleagues.

Erlang distribution
A distribution used for continuous random variables which are constrained to be greater than or equal to 0. It is characterized by two parameters: shape and scale. It is a special case of the gamma distribution in which the shape parameter is required to be an integer.

Parameters: shape a (integer >=1), scale B>0

Domain: X>=0

Mean: aB

Variance: $aB^2$

estimation
The process of using a sample to estimate features of a population. Using a sample statistic, such as the sample mean, to obtain a best estimate of a population parameter (the population mean) is called point estimation. This is distinguished from interval estimation, in which an interval or range of values is provided for the measure of interest.

eta
A statistic calculated in the Crosstabulation and Contingency Table Statlets. It ranges between 0 and 1. When squared, eta represents the proportion of variation in the dependent variable which can be explained by knowledge of the independent variable. It is only appropriate when the dependent variable is of interval type and the independent variable is nominal or ordinal.

exponential distribution
A continuous probability distribution useful for characterizing random variables which may only take positive values. It is often used to characterize the time between events such as arrivals of customers at a store. The distribution is completely determined by its mean. For the exponential distribution, the mean and standard deviation are equal. It is highly skewed to the right, peaking at zero and decaying in a smooth (exponential) fashion.

Parameter: mean B>0

Domain: X>=0

Mean: B

Variance: $B^2$

exponential model
A model fit in the Simple Regression Statlet which takes the form:

Y = exp(a + b*X)

exponential smoothing
A statistical technique commonly used to forecast time series data or to smooth the values on a control chart. A forecast function is estimated from previous data using a weighted least squares technique. The degree to which data in the far past is weighted relative to the near past is governed by the value of one or more smoothing constants, which must be between 0 and 1. In general, the smaller the smoothing constant, the more weight is given to the far past.
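
The role of the smoothing constant is easiest to see in the simplest case. The sketch below shows simple exponential smoothing with a single constant in its recursive form, started at the first data value. That starting choice, the function name, and the data are assumptions made for illustration, and the software's forecasting procedures may be more elaborate.

```python
def exponential_smooth(values, alpha):
    """Simple exponential smoothing with one smoothing constant alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be greater than 0 and at most 1")
    smoothed = [values[0]]  # assumption: start from the first observation
    for y in values[1:]:
        # New level = alpha * newest value + (1 - alpha) * previous level.
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


data = [10, 20, 30]  # hypothetical series
print(exponential_smooth(data, 0.5))
print(exponential_smooth(data, 1.0))
```

With alpha equal to 1 the smoothed series simply reproduces the data, while smaller values let older observations keep more influence on the current level.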

extreme value distribution
A distribution used for continuous random variables which may take any real value. It is characterized by two parameters: mode and scale.

Parameters: mode a>0, scale b>0

Domain: all real X

Mean: a-0.57721b

Variance: $(\pi b)^2/6$

# - F -

F distribution
Also called the variance ratio distribution, it is used for random variables which are constrained to be greater than or equal to 0. It is characterized by two parameters: numerator degrees of freedom and denominator degrees of freedom. It is used most often as the sampling distribution for test statistics which are created as the ratio of two variance estimates.

Parameters: numerator degrees of freedom v (positive integer), denominator degrees of freedom w (positive integer)

Domain: X>0

Mean: w/(w-2) for w>2

Variance: $2w^2(v+w-2) / [v(w-2)^2(w-4)]$ for w>4

factor
A variable in a statistical model whose effect on the dependent variable or variables is to be studied.

Fisher's exact test
This test is calculated by the Crosstabulation and Contingency Table procedures whenever the sum of the counts in the table does not exceed 100. It is a test of the hypothesis that the row and column classifications are independent. However, unlike the chi-square test, it can be used with small cell counts, since it computes the exact probability of obtaining a table similar to (or more unusual than) that observed.

If the calculated P value is small, i.e., less than a predetermined significance level such as .05, then the hypothesis of independence must be rejected.

frequency
The number of occurrences of a data value in your sample, or the number of values falling within a fixed range. Data can be summarized in this manner by tabulating all the data values into distinct categories and then counting the number of times each category appears in the frequency distribution. This tabular summary is called a frequency table. A graphical representation would be a barchart or histogram. Frequencies are calculated in the Tabulation, Crosstabulation, and One Variable Analysis Statlets.

# - G -

gamma distribution
A distribution used for continuous random variables which are constrained to be greater than or equal to 0. It is characterized by two parameters: shape and scale. The gamma distribution is often used to model data which is positively skewed.

Parameters: shape a>0, scale B>0

Domain: X>=0

Mean: aB

Variance: $aB^2$

Gaussian distribution
Another name for the normal distribution.

geometric distribution
A discrete probability distribution useful for characterizing the number of Bernoulli trials which occur before the first success. For example, suppose machine parts are characterized as defective or non-defective, and let the probability of a defective part equal p. If you begin testing a sample of parts to find a defective one, then the number of parts which must be tested before the first defective is found follows a geometric distribution.

Parameter: event probability 0<p<=1

Domain: X=0,1,2,...

Mean: (1-p)/p

Variance: $(1-p)/p^2$

geometric mean
A statistic calculated by multiplying n data values together and taking the n-th root of the result. It is often used as a measure of central tendency for positively skewed distributions. The geometric mean may also be calculated by computing the arithmetic mean of the logarithms of the data values and taking the inverse logarithm of the result.
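
Both routes to the geometric mean described above give the same answer, as the following sketch shows. The function names and the data are hypothetical. The logarithm route is usually safer for long lists, since the direct product can overflow or underflow.

```python
import math


def geometric_mean_product(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1.0 / len(values))


def geometric_mean_logs(values):
    """Inverse logarithm of the arithmetic mean of the logarithms."""
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


data = [1, 10, 100]  # hypothetical positively skewed sample
print(geometric_mean_product(data), geometric_mean_logs(data))
print(sum(data) / len(data))  # arithmetic mean, pulled up by the largest value
```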

# - H -

heteroscedasticity
A term which refers to situations in which the variability of the residuals is not constant. Most statistical procedures such as regression and analysis of variance assume that the variability of the residuals is constant everywhere. If heteroscedasticity is observed, it may often be removed by transforming the dependent variable using a square root or a logarithm.

histogram
A graphical display showing the distribution of data values in a sample by dividing the range of the data into non-overlapping intervals and counting the number of values which fall into each interval. These counts are called frequencies. Bars are plotted with height proportional to the frequencies. Histograms are calculated in several of the Analyze Statlets.

HSD intervals
The "honestly significant difference" method of comparing several pairs of means based upon the work of John Tukey. The HSD technique allows you to compare all pairs of means and be assured that, if there are no differences, you will not detect a difference anywhere more than a stated percentage of the time (the alpha risk).

The HSD method allows you to specify a "family confidence level" for the comparisons which you are performing, as opposed to the LSD method which controls the error rate for a single comparison. The HSD intervals will be wider than the LSD intervals, making it harder to declare a pair of means to be significantly different. The HSD test is therefore more conservative.

hypothesis tests
Tests based on a sample of data to determine which of two different states of nature is true. The two states of nature are commonly called the null hypothesis, which gets the benefit of the doubt, and the alternative hypothesis. Important concepts in hypothesis testing include the alpha risk, the beta risk, Type I errors, Type II errors, and power.

# - I -

independent
A property which results when the outcome of one trial does not depend in any way on what happens in other trials. For example, successive tosses of a die are said to be independent events if the first toss in no way influences the outcome of the second toss.

Two observations are said to be statistically independent when the value of one observation does not influence, or change, the value of another. Most statistical procedures assume that the available data represents a random sample of independent observations.

independent variable
Independent variables are the factors whose effects are to be studied and manipulated in an experiment. They are called independent because the experimenter is free to choose their levels. Examples are aging time, dollars spent on advertising, assembly line number, or type of catalyst.

There are two types of independent variables which are often treated differently in statistical analyses:

(1) quantitative variables which differ in amounts that can be ordered (e.g. weight of zinc in a battery, temperature of a process, age of subject).

(2) qualitative variables which differ in "types" that cannot be ordered (e.g. method of training, brand of car, gender of subject).

By convention when graphing data, the independent variable is plotted along the X-axis with the dependent variable on the Y-axis.

interaction
A situation in which the effect of one factor depends upon the level of another factor. Interactions are included in statistical models whenever the factors do not act in a purely additive manner.

intercept
The constant term in the equation of a regression line. If the equation of the line is given as Y = A+B*X, then the intercept is A. The intercept is the value of Y on the regression line at the point where the X variable, also known as the independent variable, equals 0.

interquartile range
The distance between the upper and lower quartiles. As a measure of variability, it is less sensitive than the standard deviation or range to the possible presence of outliers.

The interquartile range is calculated in the One Variable Analysis Statlet. It is also used to define the box in a box-and-whisker plot.

interval type variable
A type of variable in which the distance between values of that variable is meaningful. For example, temperature is measured on an interval scale. However, there may be no natural origin, as when temperature is measured in degrees Celsius or Fahrenheit. A variable with a natural origin is said to be measured on a ratio scale.

# - K -

Kendall's tau b
A statistic calculated in the Crosstabulation and Contingency Table Statlets. It ranges from -1 to +1 and is based on the number of concordant and discordant pairs of observations. A concordant pair is one in which the two variables (row and column) have the same relative ranking (greater than or less than). A discordant pair is one in which the two variables have the opposite ranking. Both variables must be ordinal. A correction is made for tied pairs.

Kendall's tau c
A statistic calculated in the Crosstabulation and Contingency Table Statlets.

Kruskal-Wallis test
A test which compares the medians of multiple samples using a nonparametric technique. It first combines all of the data and sorts them from smallest to largest, giving a rank of 1 to the smallest value and a rank of n to the largest value. It then calculates the average rank of the values within each group and computes a statistic to determine whether there are significant differences between those average ranks. The P value on the table is of most interest. If it falls below 0.05, the differences between the medians of the various groups are statistically significant at the 5% significance level.
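
The ranking steps described above can be sketched as follows. The code computes the standard Kruskal-Wallis H statistic, giving tied values the average of their ranks but applying no further tie correction. The function name and the data are hypothetical, and the P value is omitted because it requires the chi-square distribution function.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic computed from pooled ranks."""
    pooled = sorted(v for group in groups for v in group)
    n = len(pooled)
    # Map each distinct value to its rank; ties share the mean of their ranks.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    total = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


separated = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # groups do not overlap
mixed = [[1, 4], [2, 3]]                        # groups share the same average rank
print(round(kruskal_wallis_h(separated), 6), round(kruskal_wallis_h(mixed), 6))
```

The statistic grows as the average ranks of the groups move apart, and it is zero when every group has the same average rank.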

kurtosis
A measure of the peakedness of a distribution calculated in several Statlets. This statistic is useful in determining how far your data departs from a normal distribution.

For the normal distribution, the theoretical kurtosis value equals 0 and the distribution is described as mesokurtic. (Note: some authors define kurtosis such that a normal distribution has a value = 3. In STATLETS, the 3 has been subtracted away.) If the distribution has long tails (i.e., an excess of data values near the mean and far from it) like the t-distribution, the statistic will be greater than 0. Such distributions are called "leptokurtic". Values of kurtosis less than 0 result from curves that have a flatter top than the normal distribution. They are called "platykurtic". To judge whether data departs significantly from a normal distribution, a standardized kurtosis statistic can also be computed.

# - L -

lambda
A statistic calculated in the Crosstabulation and Contingency Table Statlets. On a scale of 0 to 1, it measures the relative improvement in predicting either rows or columns given knowledge of the other.

Laplace distribution
A distribution used for continuous random variables which is more peaked than the normal. It is characterized by two parameters: mean and scale. The Laplace distribution is sometimes called the "double exponential" distribution.

Parameters: mean a, scale B>0

Domain: all real X

Mean: a

Variance: 2/B^2

least squares means
The predicted values for the dependent variable at each level of a selected factor, averaged over all the levels of the other factors. In an unbalanced design, these adjusted means will not equal the simple level means. The least squares means try to place each level of the factor on an equal footing by predicting the response at the same levels of the other factors.

leverage
A measure of how much influence a single observation has on a fitted regression model. Leverage is important since isolated points far from all the others may have a major impact on the fitted model. The regression Statlets list points whose leverage is very large, so that you may assess whether those points are improperly distorting the estimated model.
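
For the special case of a simple regression with one independent variable, leverage has a closed form: 1/n plus the squared distance of the X value from the mean of X, divided by the total sum of squared distances. The Python sketch below uses that formula. The function name is hypothetical, and the cutoff STATLETS uses to decide that a leverage is "very large" is not stated here, so none is applied.

```python
def leverages(x):
    """Leverage of each observation in a simple regression of Y on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


# The isolated point at X = 20 receives almost all of the available leverage.
h = leverages([1, 2, 3, 4, 20])
```

The leverages always sum to the number of estimated coefficients (2 for a straight line), so one value close to 1 leaves little for the remaining points.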

linear model
In the Simple Regression Statlet, a model which takes the form:

Y = a + b*X
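
A minimal Python sketch of how the intercept a and slope b might be estimated by ordinary least squares is given below. The function name is hypothetical and the sketch omits the standard errors and other output a regression procedure would normally report.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the model Y = a + b*X."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx          # slope
    a = ybar - b * xbar    # intercept
    return a, b


a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lying exactly on Y = 1 + 2*X
```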

linear statistical model
A model which is linear in its coefficients, i.e., each coefficient multiplies a known function of the data and the resulting terms are added together. Most models estimated by STATLETS are linear models.

log probit model
In the Simple Regression Statlet, a model which takes the form:

Y = normal(a + b*ln(X))

where normal() denotes the cumulative standard normal distribution function. It can only be fit if all values of Y lie between 0 and 1.

logarithmic-X model
In the Simple Regression Statlet, a model which takes the form:

Y = a + b*ln(X)

logistic distribution
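A distribution used for continuous random variables which may take any real value. It is symmetric and similar in shape to the normal distribution, but with heavier tails. It is characterized by two parameters: mean and standard deviation.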

Parameters: mean a, standard deviation B>0

Domain: all real X

Mean: a

Variance: B^2

logistic model
In the Simple Regression Statlet, a model which takes the form:

Y = exp(a + b*X)/(1 + exp(a + b*X))

It can only be fit if all values of Y lie between 0 and 1.
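
The restriction on Y can be seen from a short Python sketch of the model function and its inverse, the logit transformation ln(Y/(1-Y)), which is only defined for Y strictly between 0 and 1. That the model is fit in this transformed metric is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def logistic(a, b, x):
    """Value of exp(a + b*x) / (1 + exp(a + b*x))."""
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))


def logit(y):
    """Inverse of the logistic curve; undefined outside the open interval (0, 1)."""
    if not 0 < y < 1:
        raise ValueError("Y must lie strictly between 0 and 1")
    return math.log(y / (1 - y))


# Applying the logit to a model value recovers the straight line a + b*x.
z = logit(logistic(1.0, 2.0, 0.3))
```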

lognormal distribution
A distribution which is used for random variables which are constrained to be greater than 0. It is characterized by two parameters: mean and standard deviation. The lognormal distribution is often used to model data which is positively skewed. If a sample of data comes from a lognormal distribution, then the log of the data can be considered to have come from a normal distribution.

Parameters: mean mu, standard deviation sigma>0

Domain: X>0

Mean: mu

Variance: sigma^2

lower quartile
The 25th percentile, calculated by ordering the data from smallest to largest and finding the value which lies 25% of the way up through the data.

Box and whisker plots are a graphical display of a sample in terms of its quartiles.

LSD intervals
The Least Significant Difference method, attributed to Fisher, is a method of controlling Type I errors when comparing several pairs of means. Here alpha is defined as the number of Type I errors with respect to tests on differences between means divided by the number of comparisons. In this method, the level of significance is applied to each pairwise comparison, in contrast to the "experimentwide" error rate of the HSD method.

# - M -

Mahalanobis distance
A statistic which measures the distance of a single data point from the sample mean or centroid in the space of the independent variables used to fit a multiple regression model. It provides a way of finding points which are far from all of the others in a multidimensional space.
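
For the case of exactly two independent variables, the calculation can be sketched in Python without any matrix library, since the 2 by 2 sample covariance matrix can be inverted by hand. The function name is hypothetical, and whether STATLETS reports the distance or the squared distance is not stated here; the sketch returns the distance itself.

```python
import math


def mahalanobis_2d(points, p):
    """Mahalanobis distance of p from the centroid of a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mx, p[1] - my
    # d' * inverse(S) * d written out for the 2 by 2 case.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


corners = [(0, 0), (2, 0), (0, 2), (2, 2)]
d = mahalanobis_2d(corners, (2, 2))
```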

maximum
In a sample of data, the largest observation.

mean
A statistic which measures the center of a sample of data by adding up the observations and dividing by the number of data points. It may be thought of as the center of mass or balancing point for the data, i.e., that point at which a ruler would balance if all the data values were placed along it at their appropriate numerical values. For data from almost any distribution (strictly, any distribution with finite variance), the Central Limit Theorem shows that as the sample size increases, sample means will tend to follow a normal distribution. Unlike the sample median, the sample mean can be strongly affected by outliers.

mean absolute error
The mean of the absolute value of the residuals from a fitted statistical model.
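
A minimal Python sketch of the calculation, assuming the observed and fitted values are available as two lists of equal length (the function name is hypothetical):

```python
def mean_absolute_error(observed, fitted):
    """Average of |observed - fitted| over all data points."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    return sum(abs(r) for r in residuals) / len(residuals)


mae = mean_absolute_error([3, 5, 7], [2, 5, 9])  # residuals are 1, 0 and -2
```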

measurement
In statistical analysis, the "level of measurement" determines how a variable should be analyzed. Variables can be classified into different groups by how they are measured as follows:

(1) Nominal variables are named categories, for example "gender of worker." Numerical values can be assigned to these nominal variables for analysis, but the numbers have no true numerical meaning - they are just "codes" for data analysis purposes (e.g., female=1, male=2).

(2) Ordinal variables consist of categories which can be arranged in order. For example, "job satisfaction" might be measured on a scale of 1 to 10. The numbers assigned to this variable allow you to put the responses in order (from low to high), but the actual distances between the numeric codes have no true numeric meaning.

(3) Interval variables provide measurement where the distance between values is meaningful but where there is no true zero point. An example is the model year of a car.

(4) Ratio variables are measured on a scale where the proportions, or ratios, between items are meaningful. In other words, there is a true zero point on the scale. An example is the weight of an automobile in pounds.

Nominal and ordinal variables are both said to be "categorical" or "qualitative" data, and interval and ratio can both be described as "numerical" or "quantitative" data. Numerical data can further be broken down into discrete or continuous variables.

median
A statistic which measures the center of a set of data by finding that value which divides the data in half. A technical definition is that the median is the value which is greater than or equal to half of the values in the data set and less than or equal to half the values. To compute the median yourself, you would sort the data from smallest to largest and:

(1) for an odd number of observations, pick the center value.
(2) for an even number of observations, pick the value halfway between the center two.

For a symmetric distribution such as the normal distribution, the median is the same as the mean. For a distribution which is skewed to the right (left), the median is typically smaller (larger) than the mean.
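
The two rules above can be sketched in Python as follows. The function name is hypothetical, and the last lines illustrate how a single large value pulls the mean, but not the median, to the right.

```python
def median(data):
    """Center value for odd n, halfway between the center two for even n."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


skewed = [1, 2, 3, 4, 100]           # skewed to the right
center = median(skewed)              # 3
average = sum(skewed) / len(skewed)  # 22.0, larger than the median
```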

minimum
In a sample of data, the smallest observation.

missing values
In a sample of data, desired observations which could not be obtained.

When entering data into STATLETS, missing values are normally entered as blank cells. This is true whether the data is entered directly into the STATLETS spreadsheet or imported from another software package.

Most procedures handle missing values in one of the following manners:

(1) by excluding all rows which have missing values in any of the variables being analyzed. This is called the "listwise" method, and it is the most common. It is used in Statlets such as Multiple Regression.

(2) by excluding, separately for each variable or pair of variables, only those observations which are missing for that variable or pair. This is frequently called the "pairwise" method. It is an option in certain Statlets.

(3) by estimating an appropriate replacement value, which is done in the time series analysis Statlets which must preserve the sequential nature of a set of data.

Internally, missing values are stored as blanks for character variables and as the number -32768 for numeric variables. Normally, users need not be concerned with this internal representation.
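
The difference between the listwise and pairwise methods described above can be illustrated with a short Python sketch. Here each row is a tuple of values and Python's None stands in for a blank cell; this representation and the function names are assumptions made for illustration only.

```python
def listwise_rows(rows):
    """Keep only rows with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r)]


def pairwise_values(rows, i, j):
    """Keep rows where both variable i and variable j are present."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]


rows = [(1, 2, 3), (4, None, 6), (7, 8, None), (None, 11, 12)]
complete = listwise_rows(rows)         # only the first row survives
pair_02 = pairwise_values(rows, 0, 2)  # rows 1 and 2 are usable for this pair
```

The pairwise method retains more data for each individual calculation, at the cost of basing different calculations on different subsets of rows.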

mode
A statistic defined as the most frequently occurring data value. It is sometimes used as an alternative to the mean or median as a measure of central tendency. It is calculated in the Analyze Statlets. If more than one value occurs equally often, no value is printed.

The mode is a particularly useful summary statistic when the data is measured on a nominal scale.

For grouped data, the mode is usually defined as the midpoint of the interval containing the highest frequency count. In some distributions, there may be more than one mode: two high points (bimodal) or many high points (multimodal distributions).
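
For ungrouped data, finding the mode amounts to counting how often each value occurs. The Python sketch below (hypothetical function name) returns every value tied for the highest count, leaving it to the caller to decide how to report ties.

```python
from collections import Counter


def modes(data):
    """Sorted list of all values which share the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


single = modes([1, 2, 2, 3])      # one mode
double = modes([1, 1, 2, 2, 3])   # two values occur equally often
```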

moving average
The average of the most recent K data values, where K is the "order" or "span" of the moving average. This is one of the methods often used for forecasting time series data. Large values for K give good results for very stable series. Smaller values are needed for series which tend to change level frequently.
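
A minimal Python sketch of a moving average forecast, in which the forecast for the next period is simply the average of the last K observations (the function name is hypothetical):

```python
def moving_average_forecast(series, k):
    """Average of the most recent k values of the series."""
    if k < 1 or k > len(series):
        raise ValueError("k must be between 1 and the length of the series")
    return sum(series[-k:]) / k


series = [10, 12, 11, 13, 20]
short_span = moving_average_forecast(series, 2)  # reacts quickly to the jump to 20
long_span = moving_average_forecast(series, 5)   # smooths the jump away
```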

moving average model
A type of ARIMA model which relates the observed data value at time t to random errors in the current and previous time periods. Common models are:

Nonseasonal order 1: Y(t) = a + A(t) - b*A(t-1)
Nonseasonal order 2: Y(t) = a + A(t) - b*A(t-1) - c*A(t-2)
Seasonal order 1: Y(t) = a + A(t) - b*A(t-s)
Seasonal order 2: Y(t) = a + A(t) - b*A(t-s) - c*A(t-2s)

where A(t) is a random error term at time t and s is the length of the seasonal cycle. These models can be fit using the Forecasting Statlet in the time series section.
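
To make the nonseasonal order 1 model concrete, the Python sketch below builds the series from a given list of error terms: each value is the constant a, plus the current error, minus b times the previous error. The function name is hypothetical, and the sketch only generates data from known coefficients; it does not estimate them.

```python
def ma1_series(a, b, errors):
    """Values of the order 1 model for t = 1 .. len(errors) - 1.

    errors[t] plays the role of the random error at time t; the first
    error is needed only as the "previous" error for t = 1.
    """
    return [a + errors[t] - b * errors[t - 1] for t in range(1, len(errors))]


y = ma1_series(10.0, 0.5, [1.0, -1.0, 2.0])
```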

moving range chart
A control chart which plots the moving range of groups of 2 observations. It is used when plotting individuals data where true subgroups are not available. The moving range of 2 is equivalent to computing the absolute value of differences between consecutive points.
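
The moving range of 2 can be sketched in Python as the absolute difference between each pair of consecutive points (the function name is hypothetical):

```python
def moving_ranges(data):
    """Absolute differences between consecutive observations."""
    return [abs(b - a) for a, b in zip(data, data[1:])]


mr = moving_ranges([5, 7, 4, 4, 9])  # one fewer value than the original series
```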

multicollinearity
A condition in which the predictor variables in a regression model are themselves highly correlated. Multicollinearity often leads to models in which the coefficients are poorly estimated and, while a fitted model may be good for predictive purposes, it can be difficult to interpret the relative effects of the various predictor variables. One of the important properties of a designed experiment is that it avoids such a condition.

multiple regression model
A statistical model relating a single dependent variable to two or more independent variables.

multiplicative model
In the Simple Regression Statlet, a model which takes the form:

Y = a*X^b

The model is fit in the metric

ln(Y) = ln(a) + b*ln(X)
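
A minimal Python sketch of fitting in the log metric: a straight line is fit by least squares to ln(Y) against ln(X), and the intercept is then transformed back to give a. The function name is hypothetical, and all X and Y values are assumed to be positive so that the logarithms exist.

```python
import math


def fit_multiplicative(x, y):
    """Estimate (a, b) in Y = a * X**b by least squares on the log scale."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    xbar = sum(lx) / n
    ybar = sum(ly) / n
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(lx, ly))
    sxx = sum((u - xbar) ** 2 for u in lx)
    b = sxy / sxx
    a = math.exp(ybar - b * xbar)  # the intercept of the log fit is ln(a)
    return a, b


# Data generated exactly from Y = 3 * X**2.
a, b = fit_multiplicative([1, 2, 4, 8], [3, 12, 48, 192])
```

Because the fit minimizes squared errors in ln(Y) rather than in Y, the result will generally differ from a direct nonlinear fit of the original equation when the data do not follow the model exactly.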