\documentclass[11pt]{article}

\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage{indentfirst}
\usepackage{natbib}
\usepackage[colorlinks=true,allcolors=blue]{hyperref}
\usepackage{url}
\usepackage{doi}

\newcommand{\fatdot}{\,\cdot\,}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\let\code=\texttt
\DeclareMathOperator{\pr}{pr}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Fuzzy Rank Tests and Confidence Intervals}

\begin{document}

\title{Fuzzy Rank Tests and Confidence Intervals}
\author{Charles J. Geyer}
\maketitle

\begin{abstract}
How to do exact-exact (rather than only conservative-exact) sign,
signed rank, and rank sum hypothesis tests, whether or not there are
tied ranks.  Also how to do the corresponding confidence intervals.
Exact-exact procedures must be either randomized or fuzzy.  This
package provides the latter.
\end{abstract}

\section{License}

This work is licensed under a Creative Commons Attribution-ShareAlike
4.0 International License
\url{http://creativecommons.org/licenses/by-sa/4.0/}.

\section{R}

\begin{itemize}
\item The version of R used to make this document is
    \Sexpr{getRversion()}.
\item The version of the \texttt{knitr} package used to make this
    document is \Sexpr{packageVersion("knitr")}.
\item The version of the \texttt{fuzzyRankTests} package used to make
    this document is \Sexpr{packageVersion("fuzzyRankTests")}.
\end{itemize}

<<options>>=
options(keep.source = TRUE, width = 80)
@

<<library>>=
library(fuzzyRankTests)
@

\section{Introduction}

\subsection{What This is About}

We deal with three tests of statistical hypotheses:
\begin{itemize}
\item the sign test,
\item Wilcoxon's signed rank test, and
\item Wilcoxon's rank sum test (also called the Mann-Whitney test).
\end{itemize}
And we deal with two issues that arise with them.
\begin{itemize}
\item Like all tests with discrete test statistics, these tests cannot
    be exact unless they are randomized.
\item Tied data and tied ranks complicate the situation.
\end{itemize}

Assumptions:
\begin{itemize}
\item One Sample or Paired Comparison
    \begin{itemize}
    \item Sign test: no assumptions.
    \item Signed rank test: symmetric population distribution.
    \item $t$ test: normal population distribution.
    \end{itemize}
\item Two Independent Samples
    \begin{itemize}
    \item Rank sum test: one population distribution is the other
        shifted.
    \item $t$ test: both population distributions normal with the same
        variance.
    \end{itemize}
\end{itemize}
This package does not do $t$ tests; see R function \code{t.test} in
core R for that.  We include them only to show that the assumptions
become more restrictive as one goes down the list.

For non-fuzzy tests the assumptions above need the additional
assumption that the population distribution is continuous, so that
there are no tied data or tied ranks.  As will be seen, fuzzy tests
and confidence intervals do not need this assumption.

\subsection{Fuzzy Tests and Confidence Intervals}

Despite being the official theory of testing statistical hypotheses
since it was invented by Neyman and Pearson in the 1930s
\citep[Chapters~3 and~4]{tsh-4th-ed} and despite being taught to all
PhD statistics students, the theory of randomized hypothesis tests
gets little application (I have never seen it used) because of the
arbitrariness of the artificial randomization.  Two statisticians can
analyze exactly the same data using exactly the same hypothesis test
and come to opposite decisions because of the artificial
randomization.
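To make that arbitrariness concrete, here is a minimal sketch in base
R (hypothetical numbers, not any function in this package) of a
classical randomized upper-tailed sign test in which the significance
level falls strictly between two achievable $P$-values, so the test
must randomize, and two analyses of the very same data can disagree.

<<randomized-arbitrariness>>=
# hypothetical example: n = 10 observations, w = 8 positive signs
n <- 10
w <- 8
alpha <- 0.05
p.above <- pbinom(w, n, 1/2, lower.tail = FALSE)            # pr(W > w)
p.at.or.above <- pbinom(w - 1, n, 1/2, lower.tail = FALSE)  # pr(W >= w)
# probability of rejection for the classical randomized test
phi <- (alpha - p.above) / (p.at.or.above - p.above)
# two statisticians, same data, different artificial randomization
u1 <- runif(1)
u2 <- runif(1)
c(phi = phi, reject.1 = u1 < phi, reject.2 = u2 < phi)
@

The two rejection indicators need not agree, and nothing about the
data determines which one gets reported.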
\citet{geyer-meeden} proposed a simple fix for this issue:
``unrandomize'' randomized tests in the sense that one reports not a
decision or a $P$-value or a confidence interval that purports to be a
realization of some random process (the artificial randomness in the
hypothesis test) but rather (a description of) the probability
distribution of that random quantity.  That is, we report
\emph{abstract} randomness rather than \emph{realized} randomness.

In more detail, a randomized hypothesis test rejects the null
hypothesis with probability $\phi(X)$ when test statistic $X$ is
observed.  This function $\phi$ is called the \emph{critical function}
of the test.  \citet{geyer-meeden} point out that the critical
function also depends on the significance level $\alpha$ and the value
of the parameter hypothesized under the null hypothesis (for
one-tailed tests, the boundary point of the composite null
hypothesis).  So they write the critical function
$\phi(x, \alpha, \theta)$.  And they say the result of the test is to
report this critical function, not some realization of some random
variable related to it.

\citet{geyer-meeden} go on to point out three different
interpretations of the critical function.
\begin{itemize}
\item The function $\phi(\fatdot, \alpha, \theta)$ is the critical
    function of the randomized test, as considered classically.
\item The function $\phi(x, \fatdot, \theta)$ is (the distribution
    function of) the abstract randomized (also called \emph{fuzzy})
    $P$-value of the randomized test.
\item The function $1 - \phi(x, \alpha, \fatdot)$ is (the membership
    function of) the \emph{fuzzy confidence interval} that is dual to
    the randomized test.
\end{itemize}

There is no difference between $\phi(x)$ used classically and
$\phi(x, \alpha, \theta)$ used by \citet{geyer-meeden} when considered
as a function of $x$ for fixed $\alpha$ and $\theta$.  It is the same
function of $x$ either way.  \citet{geyer-meeden} say what one should
report is the number $\phi(x, \alpha, \theta)$ rather than a decision
(accept or reject the null hypothesis) that purportedly has this
number as its probability of rejection.

In order for the function $\phi(x, \fatdot, \theta)$ to be a
distribution function, the hypothesis test need only have nested
critical regions \citep[equation~(1.4) and the surrounding
discussion]{geyer-meeden} and be continuous (properties our
applications have).  If we were to generate a random variable $P$
having this distribution function, then rejecting the null hypothesis
when $P < \alpha$ would be the classical randomized test.  Hence this
is the $P$-value of that test.  \citet{geyer-meeden} are only saying
that rather than simulating such a $P$ and reporting that number, one
should report its distribution, as described by the distribution
function $\phi(x, \fatdot, \theta)$ or perhaps by the corresponding
probability density function.

The function $1 - \phi(x, \alpha, \fatdot)$ takes values between zero
and one, including (if the test is actually randomized) values
strictly between zero and one.  \citet{geyer-meeden} suggest we
interpret this as the membership function of a fuzzy set, as in fuzzy
set theory \citep*{fuzzy-book}.  One interprets the membership
function as saying to what degree the point is in the fuzzy set.
\citet{geyer-meeden} say one should interpret it like partial credit
on a test question.  After all, that is what probability does.
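Anticipating the formulas in Section~\ref{sec:what-we-do} below, here
is a minimal sketch in base R (not code from this package, and the
function name \code{membership} is made up for illustration) of this
membership function for the fuzzy confidence interval dual to the
two-tailed sign test, evaluated at a single point $\theta$.

<<membership-sketch>>=
# minimal sketch, base R only: membership of the point theta in the
# fuzzy confidence interval dual to the two-tailed sign test, assuming
# no data value is tied with theta.  The two-tailed test statistic is
# W = |T - n/2|, where T counts observations above theta.
membership <- function(z, theta, alpha = 0.05) {
    n <- length(z)
    w <- abs(sum(z > theta) - n / 2)
    supp <- abs(0:n - n / 2)
    prob <- dbinom(0:n, n, 1/2)
    p.above <- sum(prob[supp > w])           # pr(W > w)
    p.at.or.above <- sum(prob[supp >= w])    # pr(W >= w)
    phi <- (alpha - p.above) / (p.at.or.above - p.above)
    1 - min(max(phi, 0), 1)
}
membership(c(-0.8, 1.2, 1.9, 2.6, 3.1, 4.4, 5.0), 1.0)
@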
The coverage probability of the interval is
$$
   E_\theta\{1 - \phi(X, \alpha, \theta)\} = 1 - \alpha
$$
and this means point $x$ is being given ``partial credit''
$1 - \phi(x, \alpha, \theta)$ when $\theta$ is the true unknown
parameter value.

\subsection{Tied Data or Tied Ranks}

Tied data (data points tied with the hypothesized value under the null
hypothesis) or tied ranks (for the signed rank test or for the rank
sum test) bring more issues.  We deal with these using the methods of
\citet{thompson-geyer}.  Now our model has data in two parts: the
observable part $x$ and the unobservable part $y$ (also called missing
data, latent variables, random effects, or a hidden layer).  So we
write the critical function of our randomized test
$\psi(x, y, \alpha, \theta)$.  Then
\begin{equation} \label{eq:average-critical-function}
   \phi(x, \alpha, \theta) = E_\theta\{ \psi(x, Y, \alpha, \theta) \}
\end{equation}
is the critical function for the test based on the observed data $x$.

\subsection{What this Package Does} \label{sec:what-we-do}

For all three hypothesis tests this package does, the null
distribution of the test statistic is discrete and symmetric.  Let $T$
be the test statistic for an upper-tailed test and $\tau$ be the
center of symmetry of its null distribution.  Then $- T$ is the test
statistic for the lower-tailed test, and $\abs{T - \tau}$ is the test
statistic for the two-tailed test.

In all three cases, the fuzzy $P$-value is uniformly distributed on
the interval with endpoints $\pr_\theta(W > w)$ and
$\pr_\theta(W \ge w)$, where $W$ is the test statistic considered as a
random variable and $w$ is its observed value.  Hence the critical
function of the test is
$$
   \phi(w, \alpha, \theta)
   =
   \begin{cases}
   0, & \alpha \le \pr_\theta(W > w)
   \\
   \frac{\alpha - \pr_\theta(W > w)}{\pr_\theta(W = w)},
   & \pr_\theta(W > w) < \alpha < \pr_\theta(W \ge w)
   \\
   1, & \pr_\theta(W \ge w) \le \alpha
   \end{cases}
$$
when there are no ties in the data or the ranks.

When there are ties in the data or the ranks, we assume the data have
been measured with inadequate precision: if more precise measurement
had been used, there would be no ties in the data or the ranks.  We
assume that all orderings of the hypothetical precise data consistent
with the observed (imprecise) data are equiprobable (since there are
no data favoring any such ordering).  Thus the critical function when
there are ties is just the average
\eqref{eq:average-critical-function} of the critical functions for the
precise data (with no ties) consistent with the observed imprecise
data.

\subsection{Ordered Categorical Data}

We do not recommend the procedures in this package as competitors for
procedures for ordered categorical data \citep[Sections~8.2
and~8.3]{agresti}.  If one has ordered categorical response data, then
one should probably use statistical models and procedures designed
specifically for that.  But if the ordered categories have arisen from
imprecise measurement, then one could also justify using the fuzzy
procedures this package provides for such data.

\subsection{Other Procedures for Tied Data or Tied Ranks}

We take \citet*{hollander-et-al} to be authoritative about existing
practice.

\subsubsection{Sign Test}

For the sign test, their recommended procedure is to report the usual
$P$-value for a discrete test, $\pr_\theta(W \ge w)$, when there are
no ties (no data values equal to the value hypothesized by the null
hypothesis).
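To see the relationship in a toy no-ties example (hypothetical
numbers, base R only): the conventional $P$-value
$\pr_\theta(W \ge w)$ is what R function \code{binom.test} reports,
and it is the upper endpoint of the support of the fuzzy $P$-value,
which is uniformly distributed on the interval computed here.

<<conventional-vs-fuzzy>>=
# hypothetical example: w = 15 positive signs among n = 20 nonzero
# differences, upper-tailed sign test
n <- 20
w <- 15
c(lower = pbinom(w, n, 1/2, lower.tail = FALSE),      # pr(W > w)
  upper = pbinom(w - 1, n, 1/2, lower.tail = FALSE))  # pr(W >= w)
binom.test(w, n, alternative = "greater")$p.value     # pr(W >= w) again
@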
When there are ties, \citet[Subsection Ties of
Section~3.4]{hollander-et-al} say one should eliminate the ties from
the data and then proceed as above.

We say this is unacceptable.  It is cherry-picking data that favor the
alternative hypothesis (suppressing data that favor the null
hypothesis).  This correction for ties, although widely used, can
never be justified.

To be fair to \citet{hollander-et-al}, they say (Comment~34 of
Section~3.4) that one should not use their recommended procedure when
the ties ``represent a sizable percentage of the total.''  So they
already recognize the wrongness.

They also give two other procedures.
\begin{itemize}
\item A randomized procedure, which is what we ``unrandomize,''
    turning it into a fuzzy $P$-value.  They do not like randomized
    procedures and hence do not recommend them.  But neither do we.
    Hence the unrandomization, which escapes their criticism.
\item A conservative procedure that counts all ties in favor of the
    null hypothesis.  Our procedure also calculates this: its
    $P$-value is the upper endpoint of the support of the distribution
    of our fuzzy $P$-value.  So we take that into account (including
    exactly how conservative it is).
\end{itemize}

\subsubsection{Signed Rank Test}

This section is much like the preceding one \emph{mutatis mutandis}.
The issues surrounding exactness and ties are much the same.  Ranks
bring in a few technical details, which we do not need to emphasize
because the computer does all the work of dealing with them.

For the signed rank test, the recommended procedure of
\citet[Section~3.1]{hollander-et-al} is to report the usual $P$-value
for a discrete test, $\pr_\theta(W \ge w)$, when there are no ties
(neither data values equal to the value hypothesized by the null
hypothesis nor tied ranks).

When there are ties, \citet[Subsection Ties of
Section~3.1]{hollander-et-al} say one should (i) eliminate data values
equal to the value hypothesized by the null hypothesis and (ii) use
average ranks when there are tied ranks.  Using average ranks changes
the null distribution of the test statistic to something not easily
understood, so one uses the asymptotic normal distribution of the test
statistic under the null hypothesis, with its asymptotic variance
corrected for ties.

We say (i) is unacceptable.  It is cherry-picking data that favor the
alternative hypothesis (suppressing data that favor the null
hypothesis).  Although widely used, it can never be justified.  We
also do not need (ii) because we use unrandomized randomized tests
(Section~\ref{sec:what-we-do} above) instead.

To be fair to \citet{hollander-et-al}, they say (Comments~9 and~10 of
Section~3.1) that one should not use their recommended procedure
unless the ``zero values are a very small percentage'' of the total.
So they already recognize the wrongness.

They also give two other procedures.
\begin{itemize}
\item A randomized procedure, which is what we ``unrandomize,''
    turning it into a fuzzy $P$-value.  They do not like randomized
    procedures and hence do not recommend them.  But neither do we.
    Hence the unrandomization, which escapes their criticism.
\item A conservative procedure that counts all ties in favor of the
    null hypothesis.  Our procedure also calculates this: its
    $P$-value is the upper endpoint of the support of the distribution
    of our fuzzy $P$-value.  So we take that into account (including
    exactly how conservative it is).  (The sketch after this list
    makes the contrast between the conservative and the
    ties-eliminating procedures concrete in the simplest case, the
    sign test.)
\end{itemize}
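Here is a minimal numerical sketch (base R only, hypothetical numbers,
sign test for simplicity) of the two extremes just described: the
conservative procedure counts every tied observation against the
alternative, while the ties-eliminating procedure silently drops them.

<<conservative-vs-cherry-picking>>=
# hypothetical example: n = 20 observations, d = 4 tied with the
# hypothesized value, w = 12 of the remaining 16 positive;
# upper-tailed sign test
n <- 20
d <- 4
w <- 12
# conservative: all d ties counted as negative signs
pbinom(w - 1, n, 1/2, lower.tail = FALSE)      # pr(W >= w), n kept
# ties eliminated (the procedure we call cherry-picking)
pbinom(w - 1, n - d, 1/2, lower.tail = FALSE)  # pr(W >= w), n reduced
@

The first number is the upper endpoint of the support of the fuzzy
$P$-value; the second is smaller, which is the direction of the bias.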
They also discuss (Comment~11 of Section~3.1) another procedure that
keeps the tied ranks but uses intensive computation to calculate the
exact permutation distribution conditional on the pattern of ties.
Since we have an alternative, we are not interested in this either.

\subsubsection{Rank Sum Test}

For some reason, the discussion in \citet{hollander-et-al} of this
test is not parallel to the other two.  They do not discuss randomized
versions of this test, although such versions obviously exist and work
just as well as for the other two.  Hence this package does the fuzzy
hypothesis tests and confidence intervals, which are justified in the
same way as for the other two procedures.

\section{Examples}

\subsection{Sign Test}

\subsubsection{No Zero Values}

For an example with no zero values, we do Example~3.5 in
\citet{hollander-et-al}.
<<sign-test-no-zeroes>>=
z <- c(-0.8, 7.5, 46.9, 17.6, -4.6, 54.0, 48.3, 3.9, 16.7, 19.7,
    -8.5, 7.1, 40.7, 23.8, 14.8, 20.6, 25.0, 24.7, -1.8, 21.9,
    4.7, 24.7, 52.8, 8.5, 1.9)
fuzzy.sign.test(z, alternative = "greater")
@
Since (the support of the distribution of) the fuzzy $P$-value is far
below common criteria of statistical significance, this is strong
evidence against the null hypothesis.  Note that the upper endpoint of
the support of (the distribution of) the fuzzy $P$-value is the
conventional $P$-value given by \citet{hollander-et-al}.

A 95\% fuzzy confidence interval for the median difference is given by
<<beak-ci, fig.cap="95\\% fuzzy confidence interval for the median difference.">>=
fuzzy.sign.ci(z) |> plot()
@
Figure~\ref{fig:beak-ci} shows (the membership function of) this fuzzy
confidence interval.

Although we say this example has no ties, that means it has no ties at
the hypothesized value under the null hypothesis, which in this case
is zero.  It does have ties at the upper endpoint of the support of
the fuzzy confidence interval, which affects the value of the
membership function at that point.

\subsubsection{With Zero Values}

For an example with zero values, we make up some data.
<<sign-test-with-zeroes>>=
z <- c(-1.3, -0.4, 0.0, 0.0, 0.3, 0.5, 0.9, 1.1, 1.1, 1.1,
    2.3, 2.5, 3.1, 4.5, 5.5)
fuzzy.sign.test(z)
@
This might be called borderline statistically significant.  It is
equivocal.

We can plot the probability density function
(Figure~\ref{fig:sign.test.with.zeroes.plot.pdf}).
<<sign.test.with.zeroes.plot.pdf, fig.cap="PDF of the fuzzy $P$-value for the sign test with zero values.">>=
fuzzy.sign.test(z) |> plot()
@
Or we can plot the cumulative distribution function
(Figure~\ref{fig:sign.test.with.zeroes.plot.cdf}).
<<sign.test.with.zeroes.plot.cdf, fig.cap="CDF of the fuzzy $P$-value for the sign test with zero values.">>=
fuzzy.sign.test(z) |> plot(type = "cdf")
@
It is left as an exercise for the reader, if he or she is interested,
to remove the zeroes from the data, redo the test, and then try to
defend those results.  (We do not think any defense can be valid.)

The interpretation of the PDF
(Figure~\ref{fig:sign.test.with.zeroes.plot.pdf}) is that the area
under the curve to the left of $\alpha$ is the probability the null
hypothesis is rejected at level $\alpha$.  The interpretation of the
CDF (Figure~\ref{fig:sign.test.with.zeroes.plot.cdf}) is that the
height of the curve at $\alpha$ is the probability the null hypothesis
is rejected at level $\alpha$.

The 95\% fuzzy confidence interval is shown in
Figure~\ref{fig:sign.ci.with.zeroes}.
<<sign.ci.with.zeroes, fig.cap="95\\% fuzzy confidence interval for the median.">>=
fuzzy.sign.ci(z) |> plot()
@

\subsection{Signed Rank Test}

Again, to illustrate the issues with ties, we just make up some data.
Figure~\ref{fig:signed.rank.pdf} is the PDF of the fuzzy $P$-value.
<<signed.rank.pdf, fig.cap="PDF of the fuzzy $P$-value for the signed rank test.">>=
z <- c(-2.2, -1.3, -0.3, 0.0, 0.0, 0.3, 0.5, 0.9, 1.1, 1.3,
    1.3, 2.3, 2.5, 3.1, 4.5, 5.5)
fuzzy.signrank.test(z) |> plot()
@
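For comparison (not part of this package's recommended workflow), base
R's \code{wilcox.test} applied to the same data drops the zeroes, uses
average ranks, and falls back on the normal approximation, warning
that it cannot compute an exact $P$-value in the presence of zeroes
and ties.

<<wilcox-comparison>>=
wilcox.test(z)
@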
And Figure~\ref{fig:signed.rank.cdf} is the CDF of the fuzzy
$P$-value.
<<signed.rank.cdf, fig.cap="CDF of the fuzzy $P$-value for the signed rank test.">>=
fuzzy.signrank.test(z) |> plot(type = "cdf")
@
And Figure~\ref{fig:signed.rank.ci} is (the membership function of)
the 95\% fuzzy confidence interval.
<<signed.rank.ci, fig.cap="95\\% fuzzy confidence interval for the center of symmetry.">>=
fuzzy.signrank.ci(z) |> plot()
@

\subsection{Rank Sum Test}

Again, to illustrate the issues with ties, we just make up some data.
Figure~\ref{fig:rank.sum.pdf} is the PDF of the fuzzy $P$-value.
<<rank.sum.pdf, fig.cap="PDF of the fuzzy $P$-value for the rank sum test.">>=
x <- c(1, 2, 3, 4, 4, 4, 5, 6, 7)
y <- c(4, 5, 7, 7, 8, 9, 10, 11)
fuzzy.ranksum.test(x, y) |> plot()
@
And Figure~\ref{fig:rank.sum.ci} is (the membership function of) the
95\% fuzzy confidence interval.
<<rank.sum.ci, fig.cap="95\\% fuzzy confidence interval for the shift parameter.">>=
fuzzy.ranksum.ci(x, y) |> plot()
@

\begin{thebibliography}{}

\bibitem[Agresti(2013)]{agresti}
Agresti, A. (2013).
\newblock \emph{Categorical Data Analysis}, third edition.
\newblock John Wiley \& Sons, Hoboken, NJ.

\bibitem[Geyer and Meeden(2005)]{geyer-meeden}
Geyer, C.~J. and Meeden, G.~D. (2005).
\newblock Fuzzy and randomized confidence intervals and $P$-values
    (with discussion).
\newblock \emph{Statistical Science}, \textbf{20}, 358--387.
\newblock \doi{10.1214/088342305000000340}.

\bibitem[Hollander et al.(2014)Hollander, Wolfe, and Chicken]{hollander-et-al}
Hollander, M., Wolfe, D.~A., and Chicken, E. (2014).
\newblock \emph{Nonparametric Statistical Methods}, third edition.
\newblock John Wiley \& Sons, Hoboken, NJ.

\bibitem[Klir et al.(1997)Klir, St.~Clair, and Yuan]{fuzzy-book}
Klir, G.~J., St.~Clair, U.~H., and Yuan, B. (1997).
\newblock \emph{Fuzzy Set Theory: Foundations and Applications}.
\newblock Prentice Hall, Upper Saddle River, NJ.

\bibitem[Lehmann and Romano(2022)]{tsh-4th-ed}
Lehmann, E.~L. and Romano, J.~P. (2022).
\newblock \emph{Testing Statistical Hypotheses}, fourth edition.
\newblock Springer, Cham.

\bibitem[Thompson and Geyer(2007)]{thompson-geyer}
Thompson, E.~A. and Geyer, C.~J. (2007).
\newblock Fuzzy $P$-values in latent variable problems.
\newblock \emph{Biometrika}, \textbf{94}, 49--60.
\newblock \doi{10.1093/biomet/asm001}.

\end{thebibliography}

\end{document}