Glossary of Terms (Part 2)

**negative binomial distribution**

- A discrete probability distribution useful for
characterizing the number of Bernoulli trials needed to
obtain a given number of successes.
For example, suppose machine parts are characterized as
defective or non-defective, and let the probability of a
defective part equal p. If you test parts one at a time,
then the number of parts which must be tested until you
find k defective parts follows
a negative binomial distribution. The geometric
distribution is a special case of the negative
binomial distribution where k=1. The negative binomial is
sometimes called the Pascal distribution.

Parameters: event probability 0<p<=1, number of successes k (positive integer)

Domain: X=k,k+1,k+2,...

Mean: k/p

Variance: k(1-p)/p^{2}

**nested model**

- A model in which the various factors are contained within one another in a specific hierarchical order. A typical example is a study in which 15 batches are selected, 5 samples are taken from each batch, and 2 measurements are made on each sample. In that case, samples are said to be nested within batches and measurements are nested within samples.
**Newman-Keuls procedure**

- A multiple comparisons procedure which allows you to compare all pairs of means while controlling the overall alpha risk at a specified level. It is a multiple-stage procedure based on the Studentized range distribution. While it does not provide interval estimates of the difference between each pair of means, it does indicate which means are significantly different from which others.
**nominal type variable**

- A type of variable for which there is no natural ordering to the values which it can take. This is in contrast to an ordinal variable.
**nonlinear models**

- Models in which the terms do not enter in a purely additive fashion. The Simple Regression Statlet allows you to estimate various nonlinear models. Each of these models can be transformed to a linear model by transforming Y, X, or both.
**nonparametric methods**

- Methods which test hypotheses using data samples without making any rigid assumptions about the type of distribution from which the data come. Most of the classical methods, such as confidence intervals for the mean, t tests, and least squares regression, assume that the data come from a normal distribution. STATLETS contains various nonparametric methods which can be employed to verify the correctness of the classical methods, or as alternatives to them. Many nonparametric tests are based on the ranks of the data rather than their actual values. These tests are contained in various Statlets throughout the package.
**nonstationarity**

- A characteristic of a time series for which the distribution changes over time.
**normal distribution**

- A continuous probability distribution which is used to
characterize a wide variety of types of data. It is a symmetric distribution,
shaped like a bell, and is completely determined by its
mean and standard deviation. The normal distribution is
particularly important in statistics because of the
tendency for sample means to follow the normal
distribution (this is a result of the Central Limit
Theorem).

Most classical statistical procedures such as confidence intervals rely on results from the normal distribution. The normal is also known as the Gaussian distribution after Carl Friedrich Gauss.

Parameters: mean mu, standard deviation sigma>0

Domain: all real X

Mean: mu

Variance: sigma^{2}
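
Because the distribution is completely determined by its mean and standard deviation, the probability of falling within a given number of standard deviations of the mean is the same for every normal distribution. The following is a minimal sketch of that calculation in Python; the function names `normal_cdf` and `prob_within` are illustrative and not part of STATLETS.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative probability P(X <= x) for a normal(mu, sigma) variable."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that X lies within k standard deviations of the mean."""
    upper = normal_cdf(mu + k * sigma, mu, sigma)
    lower = normal_cdf(mu - k * sigma, mu, sigma)
    return upper - lower

# The result does not depend on the particular mean or standard deviation.
for k in (1, 2, 3):
    print(k, round(prob_within(k, mu=100.0, sigma=15.0), 4))
```

The three probabilities come out close to 68%, 95%, and 99.7%.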

**observation**

- Repeated values of a data variable. The rows of a column represent the observations.
**order statistics**

- For a sample of data, the data values arranged from
smallest to largest. For example, the "first order
statistic" is equal to the minimum. For a sample of
size n, the "n-th order statistic" is equal to
the maximum.

Order statistics are used to measure the center of a data set through the median, data spread through the interquartile range, and to judge whether data comes from a normal distribution using a normal probability plot.

Certain nonparametric tests are also based on the sample order statistics.

**ordinal**

- A type of variable for which there is a natural ordering to the values which it can take. It does not necessarily have to be numeric. For example, the response to a question on a survey is ordinal if it can take the values "disagree strongly", "disagree", "agree", and "agree strongly".
**outlier**

- A data value which is unusual with respect to the group
of data in which it is found. It may be a single isolated
value far away from all the others, or a value which does
not follow the general pattern of the rest. Most classical
statistical techniques tend to be quite sensitive to
outliers, so that it is important to be on the alert for
them. Graphical techniques, particularly residual plots, are very
helpful in detecting the presence of outliers.

Some of the newer Exploratory Data Analysis techniques and nonparametric procedures are much less sensitive to outliers (such procedures are said to be robust).

**outside point**

- A point lying unusually far from the center of a sample of data values. A point is called "outside" if it lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile. It is called "far outside" if it lies more than 3 times the interquartile range away from the nearer quartile.
- Box-and-whisker plots and stem-and-leaf displays routinely list outside points.
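
The rule above can be sketched in Python as follows. This is only an illustration: the function name `classify_point` is hypothetical, and the quartiles are passed in as arguments because different packages compute quartiles in slightly different ways.

```python
def classify_point(value, lower_quartile, upper_quartile):
    """Label a value as 'inside', 'outside', or 'far outside'."""
    iqr = upper_quartile - lower_quartile
    # Distance from the nearer quartile (zero if between the quartiles).
    if value < lower_quartile:
        distance = lower_quartile - value
    elif value > upper_quartile:
        distance = value - upper_quartile
    else:
        distance = 0.0
    if distance > 3.0 * iqr:
        return "far outside"
    if distance > 1.5 * iqr:
        return "outside"
    return "inside"

# Quartiles of 10 and 20 give an interquartile range of 10.
for v in (12, 35, 36, 51, -6):
    print(v, classify_point(v, 10, 20))
```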

**P value**

- A P value indicates how unusual a computed test statistic is compared with what would be expected under the null hypothesis. A small value indicates that the null hypothesis should be rejected at any significance level above the calculated value. For example, if the P value equals .0246, we would reject the null hypothesis at the 5% significance level, but would not reject it at the 1% significance level. P values are printed in procedures such as Multiple Regression to determine whether the estimated coefficients are significantly different from zero.
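
The decision rule in the example can be written as a one-line comparison. The sketch below is illustrative only; `reject_null` is a hypothetical helper name.

```python
def reject_null(p_value, significance_level):
    """Reject the null hypothesis when the level exceeds the P value."""
    return p_value < significance_level

p = 0.0246
print(reject_null(p, 0.05))  # rejected at the 5% level
print(reject_null(p, 0.01))  # not rejected at the 1% level
```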

**parameter**

- A numeric value which characterizes a probability distribution. The mean and variance are typical examples. Statistics are used to estimate parameters.
**Pareto distribution**

- A distribution used for random variables which are
constrained to be greater than or equal to 1.

Parameters: shape a>0

Domain: X>=1

Mean: a/(a-1) for a>1

Variance: [a/(a-2)] - [a/(a-1)]^{2} for a>2

**partial autocorrelation**

- An estimate of the correlation between the data value at
time t and the data value at time (t-k), having accounted
for the correlation between the value at time t and the
values at lags less than k. Partial autocorrelations are
used when constructing ARIMA models for time series data.

For each partial autocorrelation, a corresponding standard error is calculated. If the time series is random, all of the partial autocorrelations should be within approximately +/- 2 standard errors of zero. When constructing an autoregressive model, a partial autocorrelation estimate extending beyond this distance indicates the need for a coefficient at the indicated time lag.

**partial correlation**

- A measure of the strength of the relationship between two numeric variables having accounted for their joint relationship with one or more additional variables. On a scale of -1 to +1, it measures the extent of the unique correlation between the two variables which is not shared with the other variables.

**Pascal distribution**

- Another name for the negative binomial distribution.
**Pearson correlation coefficient**

- A statistic computed to calculate the correlation between two numeric variables. It ranges from -1 for perfect negative correlation to +1 for perfect positive correlation. It is calculated by dividing the covariance of the variables by the square root of the product of their variances.
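
As a rough illustration of the calculation just described, the sketch below divides the sample covariance by the square root of the product of the sample variances. The function name `pearson_r` is illustrative, and the n-1 divisor is an assumption (it cancels in the ratio, so the result is the same with a divisor of n).

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance / sqrt(variance_x * variance_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfect positive correlation
print(pearson_r([1, 2, 3], [6, 4, 2]))   # perfect negative correlation
```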
**Pearson's R**

- A statistic calculated in the Crosstabulation and Contingency Table Statlets. It measures the degree of association between the row and column variables using the common correlation coefficient. It ranges between -1 and +1 and is relevant only if both variables are of interval type. A P value is computed to test the null hypothesis that the correlation equals 0.
**percentiles**

- The p-th percentile of a set of data is that value which is greater than or equal to at least p% of the data and which is less than at most (100-p)% of the data. The lower quartile, median, and upper quartile are the same as the 25th, 50th, and 75th percentiles.
**Poisson distribution**

- A distribution often used to express probabilities
concerning the number of events per unit. For example,
the number of computer malfunctions per year, or the
number of bubbles per square yard in a sheet of glass,
might follow a Poisson distribution. The distribution is
fully characterized by its mean, usually expressed in
terms of a rate.

Parameters: mean B>0

Domain: X=0,1,2,...

Mean: B

Variance: B

**pooled variance**

- A single estimate of the common variance obtained by
combining the estimated variances from each of several
samples. When comparing several samples from different
populations, it is common to assume that the variances of
the populations are the same. Normally, the pooled estimate
is obtained as a weighted average of the sample variances,
with weights based upon the degrees of freedom
in each of the samples.
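
A minimal sketch of the weighted average is shown below, assuming each sample contributes n-1 degrees of freedom. Weighting each sample variance by its degrees of freedom is the same as adding up the within-sample sums of squared deviations and dividing by the total degrees of freedom. The function names are illustrative.

```python
def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own mean."""
    mean = sum(sample) / len(sample)
    return sum((value - mean) ** 2 for value in sample)

def pooled_variance(samples):
    """Average of the sample variances, weighted by degrees of freedom."""
    total_ss = sum(sum_sq_dev(s) for s in samples)   # sum of (n_i - 1) * s_i^2
    total_df = sum(len(s) - 1 for s in samples)      # sum of (n_i - 1)
    return total_ss / total_df

# Sample variances are 1 (2 d.f.) and 20/3 (3 d.f.).
print(pooled_variance([[1, 2, 3], [2, 4, 6, 8]]))
```

Here the pooled value is (2*1 + 3*(20/3)) / 5 = 4.4, which lies between the two sample variances and closer to the one with more degrees of freedom.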

Pooling of variance is justified only if one is confident that the groups have equal variances. Tests for equality of variance should be performed first to establish that the variances are equal. If the variances are unequal, separate variance estimates should be used for the groups.

**population**

- The set of all the observations which we wish to draw conclusions about. A random sample of observations is normally assumed to have been drawn from a much larger population.
**power**

- The probability of rejecting a null hypothesis when it is false. Good tests have high power. For more details, see hypothesis tests.
**predicted values**

- The values predicted by a statistical model which was fit to the data. For example, when fitting a line to values for X and Y, the predicted values refer to the location of the line at a given value of X.
**prediction limits**

- Limits displayed in the Regression Statlets to show the precision by which the fitted model can predict new observations. If the limits are calculated at a confidence level of 95%, then we would expect 95% of all new observations to fall within these limits.
**probability**

- A number between 0 and 1 which represents how likely an
event is to occur. Events with probability equal to 0
never occur. Events with probability equal to 1 always
occur.

In data analysis, probability is normally defined in terms of the relative frequency of occurrence of an event which can be repeated many times. For example, if you repeatedly sample temperatures from a process and get values below 150 degrees half the time, then the "probability" of getting a reading below 150 degrees is equal to 0.5 or 50%.

In daily life, we sometimes use probability in a different sense, i.e., to express our degree of belief about the likelihood of an event which cannot be repeated indefinitely under identical conditions. For example, you might say that the chance of getting a raise this year is "one in a million". Such "subjective" probabilities are used in statistical decision theory.

**probability distribution**

- A mathematical model describing the probability of observing various values of a random variable.

**probability mass function**

- For a discrete distribution, the probability of obtaining each possible value of the random variable.
**pure error**

- Variability between observations made at the same values of the independent variable or variables.

**quantile plot**

- A plot of the percentiles for a sample of data.
- The quantile plot shows the quantities (i-0.5)/(n+0.25) versus the data values in sorted order, for i ranging between 1 and n. It shows the cumulative proportion of the data less than or equal to X.
**quantiles**

- Those data values below which lie specified proportions of the sample. For a sample of size n, the data value which has rank i corresponds to the quantile at (i-0.5)/(n+0.25).
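
The correspondence between ranks and quantile levels can be sketched as follows, using the formula exactly as stated in this glossary. The function name `quantile_levels` is illustrative.

```python
def quantile_levels(data):
    """Pair each sorted data value with the level (i - 0.5) / (n + 0.25)."""
    ordered = sorted(data)
    n = len(ordered)
    return [(value, (i - 0.5) / (n + 0.25))
            for i, value in enumerate(ordered, start=1)]

for value, level in quantile_levels([7, 3, 9, 5]):
    print(value, round(level, 4))
```

Plotting the levels against the sorted values gives the quantile plot described above.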

**quartiles**

- Statistics which divide the observations in a numeric
sample into 4 intervals, each containing 25% of the data.
The lower, middle, and upper quartiles are computed by
ordering the data from smallest to largest and then
finding the values below which fall 25%, 50%, and 75% of
the data.

The middle quartile is usually called the median.

Quartiles are special cases of percentiles. The lower quartile, median, and upper quartile are the same as the 25th, 50th, and 75th percentiles.

Box and whisker plots provide a graphical summary of a data sample in terms of its quartiles. Upper and lower quartiles are calculated in the One Variable Analysis Statlet.

**R-squared**

- A statistic employed in regression
analysis that measures how much variance has been
explained by the regression model. Specifically, it is
the proportion of the total variability (variance) in the
dependent
variable that can be explained by the independent
variables. R-squared is also employed as a measure of
goodness of fit of the model. R-squared ranges from 0% to
100%. If all the observations fall on the regression
line, R-squared is equal to 100%.

The variability in the dependent variable is partitioned into two component sums of squares: variability explained by the regression model and unexplained variation. To calculate R-squared, you divide the regression sum of squares by the total sum of squares. In a simple regression, R-squared can also be obtained by squaring the correlation coefficient.

**random sampling**

- A sampling method in which all elements in the population
have an equal chance of being selected, and in which the
value of one observation does not affect the outcome of
other observations.

Random samples have important properties which are necessary in many statistical tests.

**random variable**

- A function which assigns a numerical value to all possible outcomes of an experiment. The values of random variables differ from one observation to the next in a manner described by their probability distribution.
**range**

- The range of a sample is defined as the

(maximum value) - (minimum value)

It is sometimes used in place of the standard deviation to measure the spread of a data sample. It is calculated in the One Variable Analysis Statlet.

**rank correlation**

- A number between -1 and +1 which measures the strength of the relationship between two variables. Unlike the ordinary correlation coefficient, which is based on the data values themselves, this correlation is based on the ranks of the data within each variable. It is thus less sensitive to outliers.
**ranks**

- Values found by first ordering the observations
in a sample from smallest to largest. The smallest value
is assigned a rank of 1, the second smallest gets a rank
of 2, and so on. If several values are equal in
magnitude, their ranks are pooled and each is assigned
the average rank.
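
The ranking rule, including the averaging of tied ranks, can be sketched as below. This is an illustrative implementation only; `average_ranks` is a hypothetical name.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Sorted positions are 0-based, so add 1 to obtain ranks.
        shared_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks

print(average_ranks([10, 20, 20, 5]))  # [2.0, 3.5, 3.5, 1.0]
```

The two values of 20 would occupy ranks 3 and 4, so each is assigned 3.5.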

Ranks are used in many nonparametric tests where it is not reasonable to assume that the sample comes from a normal distribution.

**ratio type variable**

- A type of variable for which the distance between data values is meaningful and for which there is a natural origin. For example, the heights of human beings are measured on a ratio scale. For such a variable, it is meaningful to say that one value is twice as large as another.
**reciprocal-X model**

- In the Simple Regression Statlet, a model which takes the
form:

Y = a + b/X

**reciprocal-Y model**

- In the Simple Regression Statlet, a model which takes the
form:

Y = 1 / (a + b*X)

**rectangular distribution**

- Another name for the uniform distribution.
**regression coefficients**

- Calculated in regression analyses, estimates of unknown parameters in a regression model. Interpretation of the coefficients as indicators of the relative importance of the variables is not appropriate unless they are standardized, since the actual magnitude will depend upon the units in which the variables are measured.
**regression model**

- A statistical model relating a dependent variable to one or more independent variables.
**relative frequency**

- A count of the number of occurrences of a data value in a sample, or the number of values falling within a fixed range, expressed as a proportion of the total number of observations. Data can be summarized in this manner by tabulating all the data values into distinct categories and then counting the number of times each category appears in the frequency distribution. This tabular summary is called a frequency table. A graphical representation would be a barchart or histogram. Relative frequencies are calculated in the Tabulation and Crosstabulation Statlets.
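
A small sketch of the tabulation for categorical data values follows; it is illustrative only and uses the hypothetical name `relative_frequencies`.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each distinct value to its count divided by the sample size."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

print(relative_frequencies(["a", "b", "a", "c"]))
```

The proportions always sum to 1, which is what distinguishes a relative frequency table from a table of raw counts.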
**relative standard deviation**

- Another name for the coefficient of variation.
**reliable**

- Statistical reliability refers to measurements that can be repeated with time. For example, a variable or statistic is said to be reliable if its value can be measured in the same way in repeated experiments.
**residual**

- The "error" left over after a statistical model is fit to
a sample of data. In a regression
analysis, the residuals are the differences between
the observed values and the values that are predicted by
the regression model. Specifically, they are the
deviations from the regression line. In a oneway analysis
of variance, the residuals are the observed values minus
the mean of the sample from which each observation comes.

Analysis of residuals is an important step in any analysis since the residuals are assumed to be normally and independently distributed with constant variance. Plots of residuals are often the best way to see if the data violate these assumptions. When interpreting residuals, it is often desirable to refer to Studentized residuals because the magnitudes are easier to judge.

**residual plots**

- Residual plots are provided as options in many Statlets. They are used to display the deviations of the data values from the fitted statistical models. Residuals should be random, with zero mean and constant variance.
**resistant**

- A characteristic of a statistical procedure which is not greatly influenced by the presence of outliers. Procedures involving medians and quartiles are in general more resistant than those involving means and standard deviations.
**resolution**

- The number of dots or "pixels" (picture elements) on a screen or printer. The quality of graphs will depend on the resolution of an output device.
**robust methods**

- Statistical procedures which are not greatly affected when the underlying assumptions are not exactly met. The use of confidence intervals for the mean is an example of a procedure which is robust against departures from the assumption that the data follow a normal distribution. In general, nonparametric and EDA procedures tend to be more robust than classical methods.

**S-curve model**

- In the Simple Regression Statlet, a model which takes the
form:

Y = exp(a + b/X)

**sample**

- A set of observations, usually considered to have been taken from a much larger population. Statistics are numerical or graphical quantities calculated from a sample. Since the data in one sample will vary from that of another, so will the statistics calculated from those samples.
**Scheffe intervals**

- Intervals which bound the estimation error in each mean using a method suggested by Scheffe. This method is appropriate whether or not the group sizes are all the same. Using the F-distribution, it generates intervals which allow you to make all possible linear comparisons among the sample means while controlling the experimentwide error rate at a specified level.

**seasonal adjustment**

- Adjustment of time series data
to remove cyclical effects caused by periodic seasonal
factors. Data such as unemployment rates, which tend to
be higher at certain times of the year, are often
adjusted to allow better comparison across months.

In a multiplicative adjustment, each data value is divided by the estimated seasonal index for the corresponding month. Seasonal indices are scaled so that an average month has an index of 100. An index of 110 would indicate a season 10% above average, while an index of 90 would indicate a season 10% below average. In an additive adjustment, an average month has an index of 0. In the latter method, the index is subtracted from the data value to form the adjusted data.

**serial correlation**

- A situation in which the values in a sample are correlated based upon their order. Most statistical procedures, except for those in the time series analysis section, assume that the data are not serially correlated. Serial correlation can invalidate the P values calculated by common statistical procedures.
**Shapiro-Wilk test**

- A test to determine whether or not a sample comes from a normal distribution, conducted by regressing the quantiles of the observed data against those of the best-fitting normal distribution.
**significance level**

- The significance level of a test is the smallest alpha level at which the null hypothesis would be rejected. Usually, if the significance level is less than a number such as .05 (5%), the null hypothesis would be rejected in favor of the alternative. In many cases, the significance level can be thought of intuitively as the chance of getting a sample like the one being analyzed if the null hypothesis were true. A small significance level would imply that getting such a sample was highly unlikely, suggesting that the null hypothesis is probably not true. The significance level is also called the P value of the test.
**significant**

- A term used in statistics which has a very precise definition. It means that the observed event would occur by chance under hypothesized conditions less than a specified proportion of the time. For example, we may be interested in knowing whether the means of two samples are the same or different. If the sample means are different enough, we might conclude on the basis of a statistical test that the population means are "significantly different at the 5% level". This states that, if the means are actually the same, a difference as large as we observed would occur less than 5% of the time.
**simple regression**

- A statistical model relating a single dependent variable to a single independent variable.

**skewed**

- A characteristic applicable to probability distributions or samples which refers to a lack of symmetry.
**skewness**

- A statistic which measures the lack of symmetry in a
distribution. A plot of a skewed distribution would show
a long tail to either the left or the right.
Distributions with a longer upper tail are said to be
positively (right) skewed, while those with a longer
lower tail are negatively (left) skewed.

The skewness of data is usually measured through a coefficient of skewness which is zero for symmetric distributions such as the normal or uniform distribution, is greater than zero for positively skewed data, and is less than zero for negatively skewed distributions. To judge whether data departs significantly from a normal distribution, a standardized skewness statistic can also be computed. Skewness is calculated in the One Variable Analysis Statlet.

**skychart**

- A graphical summary of the joint frequency distribution of two variables. It is created by plotting bars whose heights are proportional to the frequencies in the cells of a Crosstabulation.
**slope**

- The term in the equation of a regression
line which multiplies the independent
variable X. If the equation of a line is given as Y =
A+B*X, then the slope is B. The slope's numerical value
indicates the steepness of a line while the algebraic
sign describes the relationship between X and Y. A
positive slope means that as X increases, Y increases.
The converse is true for a negative slope: as X
increases, Y decreases.

In a regression analysis, the slope tells you how much increase there is in the predicted Y variable for every unit change in the X variable.

**Somers' D**

- A statistic calculated in the Crosstabulation and Contingency Table Statlets. It ranges from -1 to +1 and is based on the number of concordant and discordant pairs of observations. A concordant pair is one in which the two variables (row and column) have the same relative ranking (greater than or less than). A discordant pair is one in which the two variables have the opposite ranking. One variable (row or column) is considered to be the independent variable, while the other is considered to be the dependent variable. Both variables must be ordinal. A correction is made for ties on the independent variable.
**square root-X model**

- In the Simple Regression Statlet, a model which takes the
form:

Y = a + b*X^{0.5}

**square root-Y model**

- In the Simple Regression Statlet, a model which takes the
form:

Y = (a + b*X)^{2}

**standard deviation**

- A statistic which measures the variability or dispersion
of a set of data. It is calculated from the deviations
(distances) between each data value and the sample mean,
and is often represented by the letter "s". The
more dispersed the data are, the larger the standard
deviation.

The standard deviation squared is called the variance. For data which follows a normal distribution, approximately 68% of all data will fall within one standard deviation of the sample mean, 95% of all values will fall within two standard deviations, and 99.7% of all data will fall within three standard deviations. For data from any distribution, AT LEAST 75% of all values fall within plus and minus two standard deviations, while AT LEAST 89% fall within three standard deviations.

**standard error**

- The standard deviation of the sampling distribution of a
statistic. As such, it measures the precision of the
statistic as an estimate of a population or model
parameter.

A commonly computed standard error is the standard error of the mean, which for a sample of size n is defined as

(population standard deviation) / n^{0.5}
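
In practice the population standard deviation is rarely known. The sketch below therefore assumes the common substitution of the sample standard deviation (n-1 divisor) in its place; the function name `standard_error_of_mean` is illustrative.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n).

    Assumption: the sample standard deviation s stands in for the
    unknown population standard deviation.
    """
    return statistics.stdev(sample) / math.sqrt(len(sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(standard_error_of_mean(data), 4))
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.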

**standard error of means**

- In regression analysis, the standard deviation of the predicted mean for many new observations.
**standard error of predictions**

- In regression analysis, the standard deviation of the predicted value for a single new observation.
**standard error of the estimate**

- In regression analysis, the best estimate of the standard deviation of the residuals about the regression line. It can be calculated by taking the square root of the residual mean squared error. Typically, the smaller the standard error, the better the fit of the regression line.

**standard normal distribution**

- A normal distribution with a mean equal to 0 and a standard deviation equal to 1. Also called the unit normal distribution.
**standardized skewness**

- A standardized form of the skewness
statistic which renders the statistic free of scale.

Standardized skewness can be used to test whether your data comes from a normal distribution. If it does, the statistic will fall between -2 and +2 about 95% of the time.

**standardized kurtosis**

- A standardized form of the kurtosis statistic which renders the statistic free of scale.

Standardized kurtosis can be used to test whether your data comes from a normal distribution. If it does, the statistic will fall between -2 and +2 about 95% of the time.

**stationarity**

- A characteristic of a time series for which the distribution does not change over time.
**statistic**

- Anything that can be calculated from a sample of data.
The most common use of the word is for summary measures
such as the sample mean and sample standard deviation,
although graphical displays such as histograms are also
statistics. When referring to the population mean or
standard deviation, the term parameter
is used.

Another use of the term statistics is to describe two broad categories of data analysis: Descriptive and Inferential Statistics. Descriptive statistics involves numerical or graphical summaries of data. Inferential statistics allows one to use sample statistics to make statements about population parameters.

**statistical inference**

- The extension of sample results to
a larger population.

Descriptive statistics (such as the mean or a histogram) provide concise methods for summarizing a lot of information. However, it is inferential statistics that allows one to make statements about the population from a sample.

For example, it is often virtually impossible to measure an entire population, but by statistical inference one can use the measured sample statistics to make statements about the unmeasured population (see estimation). However, in order to use the power of statistical inference, certain assumptions about the statistic must first be met. For example, making correct inferences about a population from a sample can often require that random sampling be employed.

**statistical model**

- A model used to describe the relationship between a dependent variable
Y and one or more independent
variables. It takes the general form:

Y = f(X1,X2,...) + error

where f(X1,X2,...) is a mathematical function, and error represents the random deviations from the model.

**stem-leaf display**

- A cross between a table and a graph that shows individual
data values stacked up like a histogram. As a result,
the data are displayed in less space than a frequency
table and, unlike a table, the display affords visual
examination of the frequency distribution for skewness
and outliers.

The display consists of a "stem" which is a vertical list of all the leading digits in the data to the left of a central vertical line. On the right side of the central vertical line are the "leaves", which are the digits which immediately follow the stems. A cumulative count for each row, called the "depth", is shown along the far left of the display. Outside points are listed on separate low and high stems.

A stem-leaf display is created in the One Variable Analysis Statlet.

**Student's t distribution**

- A probability distribution which is very similar in shape
to the standard
normal distribution. The mean of the t distribution
is always equal to 0, while the standard deviation is
usually slightly greater than 1. Only one parameter,
called the degrees
of freedom, is necessary to completely specify the
distribution.

Values of Student's t are frequently used in forming confidence intervals for the mean when the variance is unknown, testing if two sample means are significantly different, or when testing the significance of coefficients in a regression model.

Parameters: degrees of freedom n>0

Domain: all real X

Mean: 0

Variance: n/(n-2) for n>2

**Studentized residual**

- A residual which has been divided by its estimated standard error, where that standard error is based upon fitting a statistical model using all points except the point whose residual is to be computed. Studentized residuals should behave like a sample from a normal distribution with a mean of 0 and a standard deviation of 1. Studentized residuals are a type of "standardized" residual, which allows for easier interpretation of residuals. Since each residual is expressed in terms of the number of standard deviations away from 0, Studentized residuals greater than 3 or less than -3 should occur very infrequently.
**sum of squares**

- A mathematical quantity used in various statistical procedures to measure variability. For example, the sample variance is computed by summing the squared deviations of each observation from the sample mean and dividing by n-1.
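
The sample variance example can be sketched in a few lines of Python. The function names are illustrative only.

```python
def sum_of_squares(sample):
    """Sum of squared deviations of each observation from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((value - mean) ** 2 for value in sample)

def sample_variance(sample):
    """Sum of squares divided by n - 1."""
    return sum_of_squares(sample) / (len(sample) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5
print(sum_of_squares(data))        # 32.0
print(sample_variance(data))       # 32 / 7
```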
**symmetric**

- A characteristic of a sample or population which looks the same to the right of the peak as it looks to the left of the peak.
**symmetric distribution**

- A distribution which looks identical to the left and right of its mean. For such a distribution, the mean and median are the same. Common examples of symmetric distributions are the normal and the uniform distributions. A distribution which is not symmetric is said to be skewed.

**t test**

- A hypothesis test based on Student's t distribution. Commonly used to test hypotheses about one or more population means or coefficients in a linear regression model.
**time series**

- A sample of data values collected at equally spaced points in time. Possible autocorrelation between adjacent values makes it necessary to use special statistical methods to analyze this type of data.
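One common way to quantify the dependence between adjacent values is the lag-1 sample autocorrelation. The sketch below uses the usual estimator (lagged cross-products of deviations from the mean, divided by the total sum of squares); the function name and the data are illustrative.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return numer / denom

trend = [1, 2, 3, 4, 5]            # steadily rising values
print(autocorrelation(trend))      # 4 / 10 = 0.4
```

A value near 0 suggests little dependence between neighbouring observations; values far from 0 are a sign that methods assuming independent observations may be inappropriate.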
**triangular distribution**

- A distribution used for random variables which are
constrained to lie between two fixed limits. However,
unlike the uniform
distribution in which all values between the limits
are equally likely, the triangular distribution peaks at
a central value. It is characterized by three parameters:
lower limit, central value, and upper limit.

Parameters: lower limit a, central value c>a, upper limit b>c

Domain: a<=X<=b

Mean: (a+b+c)/3

Variance: [a^{2}+ b^{2}+ c^{2}- ab - ac - bc]/18
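The mean and variance formulas can be checked numerically. The following sketch compares them with a simulated sample drawn using Python's `random.triangular`; the parameter values, seed, and sample size are arbitrary choices for illustration.

```python
import random
from statistics import fmean, variance

def triangular_mean(a, c, b):
    """Mean of a triangular distribution with lower limit a, central value c, upper limit b."""
    return (a + b + c) / 3

def triangular_variance(a, c, b):
    """Variance of the same distribution."""
    return (a * a + b * b + c * c - a * b - a * c - b * c) / 18

a, c, b = 0.0, 1.0, 4.0
random.seed(1)
# Note the argument order of random.triangular: (low, high, mode).
draws = [random.triangular(a, b, c) for _ in range(20000)]

print(triangular_mean(a, c, b), fmean(draws))
print(triangular_variance(a, c, b), variance(draws))
```

The simulated mean and variance should agree with the formulas to within sampling error.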

**Type I error**

- Incorrectly rejecting a true null hypothesis. The probability of such an error is the alpha risk. It is common to set the probability of making a Type I error at a number such as 5%. For more details, see hypothesis tests.
**Type II error**

- Not rejecting a false null hypothesis. The probability of such an error is called the beta risk. For more details, see hypothesis tests.
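The two kinds of error can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: it uses a two-sided z test of the null hypothesis "mean = 0" with a known standard deviation (chosen for simplicity, rather than a t test), and the sample size, alternative mean, number of trials, and function name are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, trials=4000, seed=0):
    """Fraction of simulated samples in which H0: mean = 0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = fmean(sample) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 is true: every rejection is a Type I error, so this estimates the alpha risk.
type_i_rate = rejection_rate(true_mean=0.0)
# H0 is false: every failure to reject is a Type II error, so this estimates the beta risk.
type_ii_rate = 1 - rejection_rate(true_mean=0.5)

print(type_i_rate, type_ii_rate)
```

With these settings the estimated alpha risk should be close to the 5% that was specified, while the beta risk depends on how far the true mean lies from the hypothesized value and on the sample size.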

**uncertainty coefficient**

- A statistic calculated in the Crosstabulation and Contingency Table Statlets. On a scale of 0 to 1, it measures the proportional reduction in the uncertainty about the value of the row or column variable given knowledge of the other.
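The glossary does not give the formula used by the Statlets. The sketch below assumes the standard entropy-based definition (often called Theil's U), in which the uncertainty of the row variable is its entropy and the coefficient is the fraction of that entropy removed by knowing the column. The function names and tables are illustrative.

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of a set of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def uncertainty_coefficient(table):
    """U(row | column) for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    columns = list(zip(*table))
    col_totals = [sum(col) for col in columns]
    total = sum(row_totals)
    h_row = entropy(row_totals)
    # Entropy of the row variable within each column, weighted by column frequency.
    h_row_given_col = sum(
        (col_total / total) * entropy(col)
        for col, col_total in zip(columns, col_totals)
        if col_total > 0
    )
    # Undefined (division by zero) if all observations fall in a single row.
    return (h_row - h_row_given_col) / h_row

print(uncertainty_coefficient([[10, 0], [0, 10]]))   # column determines row
print(uncertainty_coefficient([[5, 5], [5, 5]]))     # column tells nothing about row
```

The first table gives a value of 1 and the second a value of 0, matching the two ends of the scale described above.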

**uniform distribution**

- A continuous probability distribution which is useful for
characterizing data which ranges over an interval of
values, each of which is equally likely. It is sometimes
called the rectangular distribution because of its shape
when plotted. The distribution is completely determined
by the smallest possible value "a" and the
largest possible value "b". For discrete data,
there is a related discrete
uniform distribution.

Parameters: lower limit a, upper limit b>a

Domain: a<=X<=b

Mean: (a+b)/2

Variance: (b-a)^{2}/12
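As a quick numerical check, the standard closed forms for the continuous uniform distribution (mean at the midpoint of the interval, variance equal to the squared width divided by 12) can be compared with a simulated sample. The limits, seed, and sample size below are arbitrary.

```python
import random
from statistics import fmean, variance

def uniform_mean(a, b):
    """Midpoint of the interval [a, b]."""
    return (a + b) / 2

def uniform_variance(a, b):
    """Squared width of the interval divided by 12."""
    return (b - a) ** 2 / 12

a, b = 2.0, 10.0
random.seed(2)
draws = [random.uniform(a, b) for _ in range(20000)]

print(uniform_mean(a, b), fmean(draws))
print(uniform_variance(a, b), variance(draws))
```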

**unit normal distribution**

- Another name for the standard normal distribution.
**upper quartile**

- The 75th percentile,
calculated by ordering the data from smallest to largest
and finding the value which lies 75% of the way up
through the data.

Box and whisker plots are a graphical display of a sample in terms of its quartiles.
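Several conventions exist for locating a percentile in a finite sample, and the glossary does not say which one the Statlets use. The sketch below assumes one common choice: find the position 75% of the way from the first to the last ordered value and interpolate linearly between neighbours when that position is not a whole number. The function name is illustrative.

```python
def upper_quartile(data):
    """75th percentile by linear interpolation between ordered values."""
    ordered = sorted(data)
    position = 0.75 * (len(ordered) - 1)   # 0-based position in the ordered data
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return float(ordered[lower])
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

print(upper_quartile([9, 1, 5, 3, 7, 2, 8, 4, 6]))   # position 6 of 1..9 -> 7.0
print(upper_quartile([1, 2, 3, 4]))                  # position 2.25 -> 3.25
```

Other conventions can give slightly different answers for small samples, though they converge as the sample size grows.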

**variable**

- A term used in statistics to describe the factors
that are to be studied. Data variables are described as
either:

1) Qualitative (Categorical), or

2) Quantitative (Numerical)

Qualitative variables are named categories of data, for example, gender or machine number. The categories may be coded numerically (e.g., female=1, male=2), but the actual numbers have no true numerical meaning. An average value for the variable gender, for instance, would not make any sense.

On the other hand, quantitative variables are truly numeric in nature. For example, taking the average of a sample of item weights would have true meaning. Quantitative variables can further be described as discrete or continuous.

In statistical analysis, it is important to know the nature of your variables (categorical or numeric) and how the variable is measured. The scale of measurement used for your variable dictates what statistical procedures can legitimately be applied to your data.

**variable data**

- Data measured on a continuous scale, such as a person's age in years.
**variance**

- A statistic which measures how spread out or dispersed a set of data is. The value calculated will always be greater than or equal to zero, with larger values corresponding to data which is more spread out. If all data values are identical, the variance is equal to zero. The square root of the variance is called the standard deviation. Since the standard deviation is measured in the same units as the data, it is more frequently used than the variance. The variance and standard deviation are calculated in the One Variable Analysis Statlet.
**variance component**

- The variance due to a particular factor in a model which decomposes the overall process variance into two or more pieces.
**viewpoint**

- The location of the user's eye when viewing 3D graphs. It can be changed by pressing the arrows on the graphics panel.

**warning limits**

- Limits placed on a control chart for variables or attributes at 1 and 2 "sigma" to help determine how far points lie from the centerline.
**Weibull distribution**

- A distribution used for random variables which are
constrained to be greater than or equal to 0. It is
characterized by two parameters: shape and scale. The
Weibull distribution is one of the few distributions
which can be used to model data which is negatively skewed.

Parameters: shape a>0, scale B>0

Domain: X>=0

Mean: (B/a) G(1/a)

Variance: (B^{2}/a) [2G(2/a) - (1/a){G(1/a)}^{2}]

where G(y) is the gamma function evaluated at y.
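The Weibull mean and variance can be evaluated with Python's `math.gamma`. This is a minimal sketch using the standard closed forms for shape a and scale B, namely mean B·G(1 + 1/a) and variance B²·[G(1 + 2/a) − G(1 + 1/a)²], rewritten in terms of G(1/a) and G(2/a); the function names are illustrative.

```python
import math

def weibull_mean(shape, scale):
    """Mean: (B/a) * G(1/a), which equals B * G(1 + 1/a)."""
    return (scale / shape) * math.gamma(1 / shape)

def weibull_variance(shape, scale):
    """Variance: (B^2/a) * [2*G(2/a) - (1/a) * G(1/a)^2]."""
    g1 = math.gamma(1 / shape)
    g2 = math.gamma(2 / shape)
    return (scale ** 2 / shape) * (2 * g2 - (1 / shape) * g1 ** 2)

# With shape 1 the Weibull reduces to an exponential distribution with mean B.
print(weibull_mean(1.0, 2.0))       # 2.0
print(weibull_variance(1.0, 2.0))   # 4.0
```

The shape-1 case is a convenient sanity check, since an exponential distribution with mean B has variance B squared.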

Copyright © 1997 by NWP Associates, Inc.

All trademarks or product names mentioned herein are the property of their respective owners.