3 Log-Rank Test

The Kaplan-Meier estimator is used to estimate a survival curve for one group. In many applications, however, the main scientific question is comparative: do two groups have the same survival experience, or is one group systematically surviving longer than the other? The log-rank test is the standard nonparametric tool for answering this question.

3.1 Fisher’s Exact Test for Equal Binomial Parameters

Before studying the log-rank test, it is helpful to recall a simpler testing problem. Suppose two independent groups have binomial outcomes with success probabilities \(p_1\) and \(p_2\). We want to test \[ H_0: p_1 = p_2 \qquad \text{versus} \qquad H_1: p_1 \ne p_2. \]

The data are often summarized in a \(2\times 2\) table:

	Success	Failure	Total
Group 1	\(X_1\)	\(n_1-X_1\)	\(n_1\)
Group 2	\(X_2\)	\(n_2-X_2\)	\(n_2\)
Total	\(D\)	\(n-D\)	\(n\)

where \(D=X_1+X_2\) and \(n=n_1+n_2\).

Under the null hypothesis, if we condition on the row totals and the total number of successes \(D\), then the number of successes in Group 1 has a hypergeometric distribution: \[ X_1 \mid D \sim \text{Hypergeometric}(n, n_1, D). \] Therefore \[ E(X_1 \mid D)=\frac{n_1D}{n}, \] and \[ \mathrm{Var}(X_1 \mid D) = \frac{n_1n_2D(n-D)}{n^2(n-1)}. \]

Fisher’s exact test is built from this conditional distribution. The key idea is that, under the null, the observed number of successes in Group 1 should be close to its expected number. This same observed-minus-expected logic reappears in the log-rank test.
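The conditional distribution above can be checked numerically. The following is a minimal sketch in base R on a hypothetical table (the counts are made up for illustration). It enumerates the hypergeometric probabilities with dhyper(), compares the enumerated mean and variance with the closed-form expressions, and rebuilds the two-sided Fisher p-value as the total probability of all tables that are no more likely than the observed one. The small tolerance factor is an assumption chosen to guard against floating-point ties.

```r
# Hypothetical 2x2 table (illustrative counts only).
n1 <- 12; n2 <- 15          # group sizes
x1 <- 9;  x2 <- 4           # successes in each group
D  <- x1 + x2               # total number of successes
n  <- n1 + n2               # total sample size

# Support and probabilities of X1 | D under H0.
# dhyper(x, m, n, k): m = size of Group 1, n = size of Group 2, k = total successes.
support <- max(0, D - n2):min(D, n1)
prob    <- dhyper(support, n1, n2, D)

# Mean and variance by direct enumeration ...
mean_enum <- sum(support * prob)
var_enum  <- sum((support - mean_enum)^2 * prob)

# ... and from the closed-form expressions in the text.
mean_formula <- n1 * D / n
var_formula  <- n1 * n2 * D * (n - D) / (n^2 * (n - 1))

# Two-sided Fisher p-value: add up every table at most as likely as the observed one.
p_obs    <- dhyper(x1, n1, n2, D)
p_manual <- sum(prob[prob <= p_obs * (1 + 1e-7)])

# Cross-check against the built-in test (rows = groups, columns = outcome).
tab <- matrix(c(x1, x2, n1 - x1, n2 - x2), nrow = 2,
              dimnames = list(c("Group 1", "Group 2"), c("Success", "Failure")))
p_fisher <- fisher.test(tab)$p.value

c(mean_enum = mean_enum, mean_formula = mean_formula,
  var_enum = var_enum, var_formula = var_formula,
  p_manual = p_manual, p_fisher = p_fisher)
```

The enumerated and closed-form moments should agree up to floating-point error, and the hand-built p-value should match the one reported by fisher.test().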

3.1.1 The Chi-Square Test for a \(2\times 2\) Table

When the sample size is not too small, we often replace Fisher’s exact test by a large-sample chi-square test. Let \(a\) denote the Group 1 success count in the \(2\times 2\) table. Under the null hypothesis, \[ E(a)=\frac{n_1D}{n}, \] and \[ \mathrm{Var}(a)=\frac{n_1n_2D(n-D)}{n^2(n-1)}. \] This leads to the test statistic \[ T = \frac{[a-E(a)]^2}{\mathrm{Var}(a)} = \frac{\left(a-\frac{n_1D}{n}\right)^2} {\frac{n_1n_2D(n-D)}{n^2(n-1)}}, \] which is approximately chi-square with 1 degree of freedom under the null hypothesis.

This form is worth remembering because it has exactly the same structure as the log-rank test:

take an observed count,
subtract its expected value under the null,
square the result,
divide by its variance.
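The four steps can be sketched in a few lines of base R. The counts below are hypothetical. Because the variance uses the hypergeometric factor \(n-1\), the statistic \(T\) equals the usual uncorrected Pearson chi-square statistic multiplied by \((n-1)/n\); the code checks this relation against chisq.test().

```r
# Hypothetical counts for illustration.
n1 <- 40; n2 <- 50          # group sizes
a  <- 22                    # observed successes in Group 1
x2 <- 15                    # observed successes in Group 2
D  <- a + x2                # total successes
n  <- n1 + n2               # total sample size

expected <- n1 * D / n                                   # E(a) under H0
variance <- n1 * n2 * D * (n - D) / (n^2 * (n - 1))      # Var(a) under H0
T_stat   <- (a - expected)^2 / variance                  # (observed - expected)^2 / variance
p_value  <- pchisq(T_stat, df = 1, lower.tail = FALSE)

# Uncorrected Pearson statistic for the same table, rescaled by (n - 1) / n.
tab     <- matrix(c(a, x2, n1 - a, n2 - x2), nrow = 2)
pearson <- unname(chisq.test(tab, correct = FALSE)$statistic)

c(T_stat = T_stat, pearson_rescaled = pearson * (n - 1) / n, p_value = p_value)
```

For moderate or large \(n\) the factor \((n-1)/n\) is close to 1, so the two versions of the test rarely lead to different conclusions.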

3.1.2 Stratified \(2\times 2\) Tables and the Mantel-Haenszel Idea

Suppose now that instead of one \(2\times 2\) table, we observe a sequence of such tables, indexed by \(j=1,\dots,k\). For example, the tables may come from different centers in a multicenter trial or from different time points. Let \(a_j\) denote the Group 1 success count in table \(j\), with expectation \(E(a_j)\) and variance \(\mathrm{Var}(a_j)\) under the null.

The Mantel-Haenszel test combines these tables through \[ T_{MH} = \frac{\left[\sum_{j=1}^k (a_j-E(a_j))\right]^2} {\sum_{j=1}^k \mathrm{Var}(a_j)}, \] which is also approximately chi-square with 1 degree of freedom.

This is the immediate bridge to the log-rank test. In the survival setting, each event time creates a local \(2\times 2\) table, and the log-rank statistic is obtained by combining these local observed-minus-expected contributions exactly in the Mantel-Haenszel spirit.
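As an illustration of the Mantel-Haenszel combination, the sketch below computes \(T_{MH}\) by hand for three hypothetical strata and compares it with mantelhaen.test() from base R. The continuity correction is switched off so that the built-in statistic has the same form as the formula above. All counts and stratum labels are made up.

```r
# Hypothetical 2 x 2 x 3 array: rows = group, columns = outcome, third index = stratum.
tabs <- array(
  c(12,  8, 18, 22,
    15,  9, 25, 31,
     7,  5, 13, 15),
  dim = c(2, 2, 3),
  dimnames = list(group   = c("Group 1", "Group 2"),
                  outcome = c("Success", "Failure"),
                  stratum = c("Center 1", "Center 2", "Center 3"))
)

a  <- tabs[1, 1, ]              # Group 1 successes in each stratum
n1 <- colSums(tabs[1, , ])      # Group 1 size in each stratum
n2 <- colSums(tabs[2, , ])      # Group 2 size in each stratum
D  <- colSums(tabs[, 1, ])      # total successes in each stratum
n  <- n1 + n2

E_a <- n1 * D / n                                   # stratum-specific expectations
V_a <- n1 * n2 * D * (n - D) / (n^2 * (n - 1))      # stratum-specific variances

# Sum the observed-minus-expected terms first, then square and standardize.
T_MH <- sum(a - E_a)^2 / sum(V_a)

mh <- mantelhaen.test(tabs, correct = FALSE)
c(by_hand = T_MH, built_in = unname(mh$statistic))
```

Note that the differences are summed before squaring, so strata in which the effect points in opposite directions can cancel each other.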

3.2 Testing Equality of Two Survival Curves

Now suppose we have two groups of right-censored survival data. Let \(S_1(t)\) and \(S_2(t)\) denote their survival functions. The basic hypothesis is \[ H_0: S_1(t)=S_2(t)\ \text{for all } t \] versus the alternative that the two survival curves differ.

Let the distinct observed event times in the pooled sample be \[ t_{(1)} < t_{(2)} < \cdots < t_{(k)}. \] At each event time \(t_{(j)}\), define

\(n_{1j}\): number at risk in Group 1 just before time \(t_{(j)}\),
\(n_{2j}\): number at risk in Group 2 just before time \(t_{(j)}\),
\(n_j=n_{1j}+n_{2j}\): total number at risk,
\(d_{1j}\): number of events in Group 1 at time \(t_{(j)}\),
\(d_{2j}\): number of events in Group 2 at time \(t_{(j)}\),
\(d_j=d_{1j}+d_{2j}\): total number of events.

At each event time, we can think of the data as forming a local \(2\times 2\) table:

	Event at \(t_{(j)}\)	No event at \(t_{(j)}\)	Total at risk
Group 1	\(d_{1j}\)	\(n_{1j}-d_{1j}\)	\(n_{1j}\)
Group 2	\(d_{2j}\)	\(n_{2j}-d_{2j}\)	\(n_{2j}\)
Total	\(d_j\)	\(n_j-d_j\)	\(n_j\)

Under the null hypothesis that the two groups have the same survival experience, and conditional on the risk set sizes and the total number of events \(d_j\), the number of Group 1 events \(d_{1j}\) has the same hypergeometric distribution as the Group 1 success count in Fisher’s exact test, with \((n, n_1, D)\) replaced by \((n_j, n_{1j}, d_j)\). Therefore \[ E(d_{1j}\mid n_{1j},n_{2j},d_j)=\frac{n_{1j}d_j}{n_j}, \] and \[ \mathrm{Var}(d_{1j}\mid n_{1j},n_{2j},d_j) = \frac{n_{1j}n_{2j}d_j(n_j-d_j)}{n_j^2(n_j-1)}. \]

This leads to the log-rank statistic \[ U = \sum_{j=1}^k (d_{1j}-e_{1j}), \qquad e_{1j}=\frac{n_{1j}d_j}{n_j}, \] with variance \[ V = \sum_{j=1}^k v_j, \qquad v_j= \frac{n_{1j}n_{2j}d_j(n_j-d_j)}{n_j^2(n_j-1)}. \]

The standardized test statistic is \[ Z = \frac{U}{\sqrt{V}}, \] and for two groups it is common to report \[ \chi^2 = \frac{U^2}{V}, \] which is approximately chi-square with 1 degree of freedom under the null hypothesis.
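The definitions above translate directly into code. The following base R function is a minimal sketch, not a replacement for a tested implementation: the name logrank_two_group and its arguments are hypothetical, it assumes exactly two groups and a status coded 1 (or TRUE) for an event, and it sets \(v_j=0\) when only one subject remains at risk, where the variance formula would otherwise be \(0/0\).

```r
# Two-group log-rank test built from the local 2x2 tables (illustrative sketch).
logrank_two_group <- function(time, status, group) {
  group <- factor(group)
  stopifnot(nlevels(group) == 2)
  in_g1 <- group == levels(group)[1]

  # Distinct event times in the pooled sample.
  event_times <- sort(unique(time[status == 1]))

  per_time <- t(sapply(event_times, function(tj) {
    at_risk  <- time >= tj                       # at risk just before t_j
    is_event <- time == tj & status == 1
    n1 <- sum(at_risk & in_g1)
    n2 <- sum(at_risk & !in_g1)
    n  <- n1 + n2
    d1 <- sum(is_event & in_g1)
    d  <- sum(is_event)
    e1 <- n1 * d / n                             # expected Group 1 events
    v  <- if (n > 1) n1 * n2 * d * (n - d) / (n^2 * (n - 1)) else 0
    c(time = tj, n1 = n1, n2 = n2, d1 = d1, d = d, e1 = e1, v = v)
  }))

  U <- sum(per_time[, "d1"] - per_time[, "e1"])  # observed minus expected
  V <- sum(per_time[, "v"])
  list(table = per_time, U = U, V = V,
       Z = U / sqrt(V), chisq = U^2 / V,
       p_value = pchisq(U^2 / V, df = 1, lower.tail = FALSE))
}

# Usage on a small hypothetical dataset (1 = event, 0 = censored).
res <- logrank_two_group(
  time   = c(1, 3, 4, 6, 2, 5, 7, 8),
  status = c(1, 1, 0, 1, 1, 0, 1, 1),
  group  = rep(c("A", "B"), each = 4)
)
res$table
res[c("U", "V", "chisq", "p_value")]
```

The returned table has one row per event time, so each row can be read as one of the local \(2\times 2\) tables described above.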

3.2.1 Relationship with Fisher’s Exact Test

The connection between the two procedures is conceptually very important:

Fisher’s exact test compares observed and expected successes in one \(2\times 2\) table.
The log-rank test compares observed and expected events in a sequence of \(2\times 2\) tables, one for each event time.
The log-rank test then adds those observed-minus-expected contributions over time.

So the log-rank test may be viewed as a survival-data analogue of repeatedly applying the Fisher-table idea across all failure times.

3.3 A Hand Calculation of the Log-Rank Test

Consider the following small study comparing two treatment groups.

Table 3.1: A small two-group dataset for illustrating the log-rank test.

Subject	Group	Time	Status
A1	A	2	Event
A2	A	3	Event
A3	A	4	Event
A4	A	6	Censored
A5	A	7	Event
B1	B	5	Censored
B2	B	8	Event
B3	B	9	Event
B4	B	10	Censored
B5	B	11	Event

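The event times are \(2\), \(3\), \(4\), \(7\), \(8\), \(9\), and \(11\). However, after time 7 there are no Group A subjects left at risk, so at times \(8\), \(9\), and \(11\) we have \(n_{1j}=0\), and hence \(d_{1j}=e_{1j}=0\) and \(v_j=0\). (At time \(11\) only one subject remains at risk, so the formula for \(v_j\) reads \(0/0\) and is taken to be 0.) These event times contribute nothing to \(U\) or to \(V\). Therefore the relevant calculations are: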

Table 3.2: Hand calculation of the log-rank statistic for Group A.

Event time	\(n_{1j}\)	\(n_{2j}\)	\(d_{1j}\)	\(d_j\)	\(e_{1j}=\frac{n_{1j}d_j}{n_j}\)	\(v_j\)
2	5	5	1	1	\(5/10 = 0.500\)	\(1/4 = 0.250\)
3	4	5	1	1	\(4/9 = 0.444\)	\(20/81 = 0.247\)
4	3	5	1	1	\(3/8 = 0.375\)	\(15/64 = 0.234\)
7	1	4	1	1	\(1/5 = 0.200\)	\(4/25 = 0.160\)

Now add the observed counts and the expected counts: \[ O_1 = \sum d_{1j} = 4, \] \[ E_1 = \sum e_{1j} = \frac{1}{2}+\frac{4}{9}+\frac{3}{8}+\frac{1}{5} \approx 1.519, \] and \[ V = \sum v_j = \frac{1}{4}+\frac{20}{81}+\frac{15}{64}+\frac{4}{25} \approx 0.891. \]

Therefore \[ U = O_1 - E_1 \approx 4 - 1.519 = 2.481, \] and \[ \chi^2 = \frac{U^2}{V} \approx \frac{(2.481)^2}{0.891} \approx 6.904. \]

Comparing this to a chi-square distribution with 1 degree of freedom gives a \(p\)-value of about \(0.009\). Thus, at the 5% level, we would reject the null hypothesis of equal survival curves and conclude that the two groups have different survival experiences. Because Group A has more observed events than expected under the null, Group A appears to have worse survival in this example. With only ten subjects the chi-square approximation is rough, so the \(p\)-value here is best read as an illustration of the mechanics.
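The hand calculation can be cross-checked in R. The sketch below enters the data of Table 3.1 and passes them to survdiff() from the survival package, which is discussed in the next section. The data frame name toy and the 0/1 coding of the status column are choices made here for illustration.

```r
library(survival)

# Data from Table 3.1 (status: 1 = event, 0 = censored).
toy <- data.frame(
  time   = c(2, 3, 4, 6, 7, 5, 8, 9, 10, 11),
  status = c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1),
  group  = rep(c("A", "B"), each = 5)
)

fit <- survdiff(Surv(time, status) ~ group, data = toy)

# Observed and expected numbers of events per group, and the chi-square statistic.
data.frame(group = names(fit$n), observed = fit$obs, expected = fit$exp)
fit$chisq
pchisq(fit$chisq, df = 1, lower.tail = FALSE)
```

The expected count for Group A and the chi-square statistic should agree with \(E_1\) and \(\chi^2\) from the hand calculation up to rounding.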

3.4 Log-Rank Test in `R`

The survival package implements the log-rank test through the function survdiff(). We again use the lung dataset.

library(survival)

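# Keep the three variables we need and drop rows with missing values.
lung2 <- na.omit(survival::lung[, c("time", "status", "sex")])
# In lung, status is coded 1 = censored, 2 = dead; recode so that TRUE = death.
lung2$status <- lung2$status == 2
# In lung, sex is coded 1 = male, 2 = female.
lung2$sex <- factor(lung2$sex, levels = c(1, 2), labels = c("Male", "Female"))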

head(lung2)

  time status  sex
1  306   TRUE Male
2  455   TRUE Male
3 1010  FALSE Male
4  210   TRUE Male
5  883   TRUE Male
6 1022  FALSE Male

We can now run the log-rank test:

lr <- survdiff(Surv(time, status) ~ sex, data = lung2)
lr

Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung2)

             N Observed Expected (O-E)^2/E (O-E)^2/V
sex=Male   138      112     91.6      4.55      10.3
sex=Female  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.001

To extract the \(p\)-value:

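# Upper-tail probability computed directly; this avoids the rounding error of 1 - pchisq(...) for tiny p-values.
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)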
p_value

[1] 0.001311165

In this dataset the test statistic is about \(10.33\) with 1 degree of freedom, giving a \(p\)-value of about \(0.0013\). This is strong evidence against the null hypothesis of equal survival curves for males and females in the lung data. This conclusion is consistent with the Kaplan-Meier plot from the previous chapter, where the female survival curve lies above the male curve for most of the follow-up period.

It is also useful to look at the observed and expected numbers of events reported by survdiff(). If one group has more observed events than expected under the null, then that group tends to have poorer survival.
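One way to inspect these quantities is to pull them out of the fitted object. The sketch below repeats the data preparation so that it runs on its own, and uses the components n, obs, and exp of the object returned by survdiff(). The data frame name oe is arbitrary.

```r
library(survival)

# Same preprocessing as above: TRUE = death, labelled sex factor.
lung2 <- na.omit(survival::lung[, c("time", "status", "sex")])
lung2$status <- lung2$status == 2
lung2$sex <- factor(lung2$sex, levels = c(1, 2), labels = c("Male", "Female"))

lr <- survdiff(Surv(time, status) ~ sex, data = lung2)

# Observed and expected events per group, and their difference.
oe <- data.frame(
  group     = names(lr$n),
  observed  = lr$obs,
  expected  = lr$exp,
  O_minus_E = lr$obs - lr$exp
)
oe
```

With two groups the two differences sum to zero, so a positive value for one group (more events than expected) is mirrored by a negative value of the same size for the other.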

3.5 Weighted Log-Rank Tests

The ordinary log-rank test gives equal weight to each event time through \[ U = \sum_{j=1}^k (d_{1j}-e_{1j}). \] More generally, a weighted log-rank test uses \[ U_w = \sum_{j=1}^k w_j(d_{1j}-e_{1j}), \] with variance \[ V_w = \sum_{j=1}^k w_j^2 v_j. \]

The choice of weights determines which parts of the survival curve receive more emphasis:

Log-rank: \(w_j=1\), giving equal weight across event times.
Breslow-Gehan: \(w_j=n_j\), so larger risk sets receive more weight and early differences are emphasized.
Tarone-Ware: uses intermediate weighting, often \(w_j=\sqrt{n_j}\).
Peto-Peto: uses weights based on the estimated survival function.
Fleming-Harrington: uses a flexible family of weights that can emphasize early or late differences.

Table 3.3: Common choices of weights in weighted log-rank tests. Here \(\widetilde{S}\) and \(\widehat{S}\) denote pooled estimates of the common survival function under \(H_0\).

Weight function	Test name	Emphasis	Sensitive to censoring pattern?
\(w(t)=1\)	Mantel log-rank test	Later times / overall differences	No
\(w(t)=n(t)\)	Wilcoxon-Gehan-Breslow test	Earlier times	Yes
\(w(t)=\sqrt{n(t)}\)	Tarone-Ware test	Intermediate between log-rank and Gehan-Breslow	Yes
\(w(t)=\widetilde{S}(t)\)	Peto-Peto-Prentice test	Earlier times	No
\(w(t)=\widehat{S}(t-)^p\{1-\widehat{S}(t-)\}^q\)	Fleming-Harrington test	Depends on \((p,q)\)	No

Table 3.3 summarizes a useful practical point: different choices of weights target different parts of the follow-up period. The ordinary log-rank test gives the same formal weight to all event times. Because it does not down-weight late event times, where risk sets are usually small, it places relatively more emphasis on later differences than the Gehan-Breslow or Tarone-Ware tests. It is often most powerful for proportional-hazards alternatives and tends to reflect broad, overall differences across the whole study period.

Weighted versions are useful when the survival curves are expected to differ mainly early in follow-up or mainly late in follow-up. For example, if treatment effects appear only after a delay, a test that gives more weight to later event times may be more sensitive than the ordinary log-rank test.
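To see how the weights enter the calculation, the sketch below reuses the four rows of Table 3.2 (the later event times contribute zero whatever the weight) and evaluates \(U_w^2/V_w\) for three weight choices. The helper weighted_chisq is a hypothetical name introduced here.

```r
# Counts from Table 3.2 (event times 2, 3, 4, 7; Group A is Group 1).
n1 <- c(5, 4, 3, 1)
n2 <- c(5, 5, 5, 4)
d1 <- c(1, 1, 1, 1)
d  <- c(1, 1, 1, 1)

n  <- n1 + n2
e1 <- n1 * d / n
v  <- n1 * n2 * d * (n - d) / (n^2 * (n - 1))

# Weighted log-rank chi-square for a given weight vector w.
weighted_chisq <- function(w) {
  U_w <- sum(w * (d1 - e1))
  V_w <- sum(w^2 * v)
  U_w^2 / V_w
}

stats <- c(
  logrank       = weighted_chisq(rep(1, length(n))),  # w_j = 1
  gehan_breslow = weighted_chisq(n),                  # w_j = n_j
  tarone_ware   = weighted_chisq(sqrt(n))             # w_j = sqrt(n_j)
)
stats
pchisq(stats, df = 1, lower.tail = FALSE)
```

Multiplying all weights by a constant leaves the statistic unchanged, so only the relative sizes of the \(w_j\) matter.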

3.5.1 What Does “Sensitive to Censoring Pattern” Mean?

Some weighted tests use functions such as \(n_j\) or \(\sqrt{n_j}\) in their weights. These quantities depend directly on how many subjects remain under observation at each time. Because censoring changes the number at risk, it also changes the weights. This is what is meant by a test being sensitive to the censoring pattern.

More concretely:

If one study has heavy early censoring, then the later risk sets become much smaller.
A test using weights based on \(n_j\) may then put much less emphasis on later event times than it would in a study with lighter censoring.
Thus, two studies with similar event-time behavior but different censoring behavior can lead to different effective weight patterns.

This does not mean the test is invalid. Rather, it means its practical emphasis can be influenced not only by the event process but also by how censoring is distributed over time. That is one reason why the ordinary log-rank test and the Peto-Peto or Fleming-Harrington families are often preferred when one wants weights with a cleaner interpretation.
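A small numerical illustration may help. The two risk-set sequences below are hypothetical: both studies have one event at each of five event times, but in the second study ten subjects are censored between the first and second event. The code compares the share of total weight that \(w_j=n_j\) assigns to each event time, alongside the constant log-rank weights.

```r
# Two hypothetical studies with the same event times but different censoring.
event_time <- 1:5
n_light <- c(20, 19, 18, 17, 16)   # no censoring: only events leave the risk set
n_heavy <- c(20,  9,  8,  7,  6)   # ten subjects censored between times 1 and 2

# Share of the total weight carried by each event time.
share <- function(w) w / sum(w)

gehan_light <- share(n_light)
gehan_heavy <- share(n_heavy)
logrank     <- share(rep(1, length(event_time)))

round(data.frame(time = event_time, gehan_light, gehan_heavy, logrank), 3)
```

Under heavy early censoring the first event time carries a much larger share of the \(n_j\)-based weight, even though the event times themselves are identical in the two studies. The log-rank shares do not change.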

3.5.2 More on the Fleming-Harrington Family

The Fleming-Harrington class is especially important because it provides a flexible two-parameter family of weights: \[ w_j = \widehat{S}(t_{(j)}-)^p \{1-\widehat{S}(t_{(j)}-)\}^q, \] where \(\widehat{S}\) is a pooled estimate of the common survival function under the null hypothesis, and \(p,q \ge 0\) are chosen in advance.

The parameters \(p\) and \(q\) control which part of the follow-up period receives more emphasis:

If \(p=0\) and \(q=0\), then \(w_j=1\) and we recover the ordinary log-rank test.
If \(p>0\) and \(q=0\), then larger weights are placed earlier in follow-up, because \(\widehat{S}(t)\) is largest near the beginning.
If \(p=0\) and \(q>0\), then larger weights are placed later in follow-up, because \(1-\widehat{S}(t)\) grows over time.
If both \(p>0\) and \(q>0\), then the test can emphasize middle portions of the follow-up period.

Common examples are:

FH(1,0): emphasizes early differences.
FH(0,1): emphasizes late differences.
FH(1,1): emphasizes differences in the middle of follow-up, since \(\widehat{S}(1-\widehat{S})\) is largest when \(\widehat{S}\) is near \(0.5\).

This flexibility makes the Fleming-Harrington family useful when scientific knowledge suggests how the curves might differ. For instance:

if a treatment is expected to have an immediate effect, an early-weighted choice may be sensible;
if the treatment effect is delayed, a late-weighted choice such as FH(0,1) may be more appropriate.

However, the choice of \((p,q)\) should be made a priori based on scientific considerations, not after inspecting the data. Otherwise the testing procedure can become a fishing expedition and the nominal significance level is no longer trustworthy.
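The effect of \((p,q)\) on the weights can be tabulated directly. The sketch below uses the pooled risk sets from the small dataset of Table 3.1 (seven event times, one event each), computes the pooled Kaplan-Meier estimate, shifts it to obtain \(\widehat{S}(t_{(j)}-)\), and evaluates the weight for four choices of \((p,q)\). The function name fh_weight is hypothetical.

```r
# Pooled risk-set sizes and event counts at the event times of Table 3.1.
event_time <- c(2, 3, 4, 7, 8, 9, 11)
n_j <- c(10, 9, 8, 5, 4, 3, 1)
d_j <- rep(1, length(event_time))

# Pooled Kaplan-Meier estimate just after each event time, then shifted by one
# position to give the left-continuous value S(t_j -), with S(t_1 -) = 1.
S_hat   <- cumprod(1 - d_j / n_j)
S_minus <- c(1, head(S_hat, -1))

# Fleming-Harrington weight at each event time (R evaluates 0^0 as 1).
fh_weight <- function(p, q) S_minus^p * (1 - S_minus)^q

weights <- data.frame(
  time  = event_time,
  FH_00 = fh_weight(0, 0),   # ordinary log-rank
  FH_10 = fh_weight(1, 0),   # early emphasis
  FH_01 = fh_weight(0, 1),   # late emphasis
  FH_11 = fh_weight(1, 1)    # middle emphasis
)
round(weights, 3)
```

Reading down the columns shows the pattern described above: the FH(1,0) weights fall over time, the FH(0,1) weights rise, and the FH(1,1) weights are small at both ends.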

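In R, survdiff() weights each event time by \(\widehat{S}(t)^\rho\), where \(\widehat{S}\) is the Kaplan-Meier estimate of survival. The standard log-rank test corresponds to rho = 0, and the choice rho = 1 gives the Peto-Peto modification of the Gehan-Wilcoxon test: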

survdiff(Surv(time, status) ~ sex, data = lung2, rho = 1)

Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung2, rho = 1)

             N Observed Expected (O-E)^2/E (O-E)^2/V
sex=Male   138     70.4     55.6      3.95      12.7
sex=Female  90     28.7     43.5      5.04      12.7

 Chisq= 12.7  on 1 degrees of freedom, p= 4e-04

The survdiff() function covers the ordinary log-rank test and the Harrington-Fleming \(G^\rho\) family through the argument rho; in the notation above, this corresponds to the Fleming-Harrington weight with \(p=\rho\) and \(q=0\). For more general Fleming-Harrington \((p,q)\) choices, analysts often use additional survival-analysis packages that allow both parameters to vary.

In practice, the ordinary log-rank test remains the default choice, but weighted log-rank tests are an important extension when the scientific question suggests that some time regions should matter more than others.

3.6 Summary

The log-rank test is the standard nonparametric method for comparing two survival curves in the presence of censoring. At each event time it compares the observed number of events in one group with the expected number under the null hypothesis of equal survival, and then sums these differences over time. This makes its connection with Fisher’s exact test very natural: both methods are built from observed-versus-expected counts in \(2\times 2\) tables. The ordinary log-rank test is easily implemented in R using survdiff(), and weighted log-rank tests extend the same idea when early and late differences should receive different emphasis.