Modifying a Built-in Stata Command
I recently came across an issue in which modifying a shipped-with-Stata command was the easier way to address. I’m writing this to document the approach I took in case it comes up again and proves useful to remember the steps I took.
(Note: If you see ... in Stata code snippets below, it
indicates portions of code I excluded for brevity.)
The Problem
A client was working with survival data and using the sts
test command to show that the number of observed and expected
events was similar. The sts test command produces a
p-value, as shown below, but the journal the client was submitting to
rightly prefers confidence intervals over p-values. So the client was
hoping to bootstrap sts test to generate confidence
intervals for the expected number of events. The problem was
that sts test returns very little information and thus
bootstrap wouldn’t work as normal (as it requires return’d
or ereturn’d values to operate).
. webuse stan3
(Heart transplant data)
. sts test posttran
Failure _d: died
Analysis time _t: t1
ID variable: id
Equality of survivor functions
Log-rank test
| Observed Expected
posttran | events events
---------+-------------------------
0 | 30 31.20
1 | 45 43.80
---------+-------------------------
Total | 75 75.00
chi2(1) = 0.13
Pr>chi2 = 0.7225
. return list
scalars:
r(chi2) = .1261354821436887
r(df) = 1
While I could probably have figured out the math that Stata uses to
calculate those expected events, it seemed probably easier and
definitely more fun to hack the sts test command.
Finding Where to Modify
The which command will tell you whether a command exists in
an ado file, or is “built-in”, which cannot be modified as easily. Since
the “test” of sts test is a subcommand, we can find
where sts is defined.
. which sts /Applications/Stata/ado/base/s/sts.ado *! version 8.8.0 23apr2022
Opening that file, I quickly found the test subcommand:
program define Test, rclass ...
Searching the file for “Expected” or “events” or any other text in
the sts test output came up blank, so the actual
calcuations must take place elsewhere.
The sts test command runs a variety of different tests, and
I noticed that a good chunk of the code was dedicated to determining
which test the user request, and setting the cmd macro.
...
else if "`tware'"~="" {
local cmd "tware_st"
}
...
else if "`peto'"~="" {
local cmd "peto_st"
}
...
else local cmd "logrank"
...
Finally, near the bottom of the command, the cmd macro is
used:
...
`vv' `cmd' _t _d `w' if `touse', strata(`strata') /*
*/ t0(_t0) `id' `by' `options' `detail' `trend' `p' `q'
...
Ignoring the vv macro (which handles version
if applicable), it’s clear that the cmd macro must be an
actual command. And the particular test the client was working with was
the logrank test, therefore I was looking for the logrank
command.
. which logrank /Applications/Stata/ado/base/l/logrank.ado *! version 7.1.17 10nov2021
Inside I found the following code:
...
di in smcl in gr _n _col(`len') `" {c |} Observed Expected"'
local pad = `len' - `len1'
if `"`strata'"'==`""' { local dup `" events"' }
else local dup `" events*"'
di in smcl in gr `"`ttl'"' _skip(`pad') `"{c |} events `dup'"'
di in smcl in gr "{hline `len'}{c +}{hline 25}"
local sum 0
local i 1
local gstr = (bsubstr("`:type `grp''", 1, 3)=="str")
while `i' <= _N {
if (`gstr') {
local x : di udsubstr(`grp'[`i'], 1, 255)
}
else {
local x = `grp'[`i']
}
local pad = `len' - udstrlen(`"`x'"')-1
di in smcl in gr _skip(`pad') `"`x' "' "{c |}" in ye /*
*/ %10.0g `wo'[1,`i'] `" "' %10.2f `w'[1,`i']
local sum = `sum' + `wo'[1,`i']
local i = `i' + 1
}
di in smcl in gr "{hline `len'}{c +}{hline 25}"
local pad = `len' - 6
di in smcl in gr _skip(`pad') `"Total "' `"{c |}"' in ye /*
*/ %10.0g `sum' `" "' %10.2f `sum'
...
This is very confusing code to look at at first (for some reason all
shipped-with-Stata code uses the shortest possible versions of the
command names, making things even more obtuse) but we can tell that this
is producing the table output we’re looking for. di
is display, so each di line is printing
something out. The first couple are printing “Observed”, “Expected”, and
“events”, so that’s the head of the table, and the last di
is printing “Total” which is the end of the table. The sts
test (and logrank) command takes in a categorical
variable and prints a row per level, so the while loop must
be going through each level of the variable. Inside the loop, there is
only a single di statement:
...
di in smcl in gr _skip(`pad') `"`x' "' "{c |}" in ye /*
*/ %10.0g `wo'[1,`i'] `" "' %10.2f `w'[1,`i']
...
Note the reference to two matrix extractions: wo[1, i]
and w[1, i]. These insert the observed (wo)
and expected (w) number of events!
Making the Modification
So ultimately, we just need to return the w matrix. Finding
the other returns,
...
ret scalar df = colsof(`w') - 1
ret scalar chi2 = `V'[1,1]
...
we can add our own return that stores each expected value into a scalar:
matrix events=`w'
local i 1
while `i' <= _N {
return scalar e`i' = events[1,`i']
local i = `i' + 1
}
Now we can save this, re-open Stata, and it works!
. webuse stan3
(Heart transplant data)
. bootstrap e1=r(e1) e2=r(e2), reps(10): sts test posttran
(running sts on estimation sample)
warning: sts does not set e(sample), so no observations will be excluded from
the resampling because of missing values or other reasons. To
exclude observations, press Break, save the data, drop any
observations that are to be excluded, and rerun bootstrap.
Bootstrap replications (10): .........10 done
Bootstrap results Number of obs = 172
Replications = 10
Command: sts test posttran
e1: r(e1)
e2: r(e2)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
e1 | 31.19955 7.380891 4.23 0.000 16.73327 45.66583
e2 | 43.80045 5.779255 7.58 0.000 32.47332 55.12758
------------------------------------------------------------------------------
An Added Complication
The client was working on a virtual Windows machine which did not have
permission to overwrite the logrank.ado file. It is easy enough to make
a copy of logrank.ado, rename logrank
to logrank2 and treat it as a user-written ado file, but
we’d have to also create a custom version of sts test which
might be slightly messy due to it being a subcommand.
Instead, we can use the sts.ado to figure out to get the comparable
logrank command for a given sts test command.
...
`vv' `cmd' _t _d `w' if `touse', strata(`strata') /*
*/ t0(_t0) `id' `by' `options' `detail' `trend' `p' `q'
...
As mentioned above, vv is just version control, so here we have
some variables passed to logrank (_t,
_d, and whatever is inside w), and a bunch of
options. We can modify our version of logrank to print out
all these, to determine what is actually being passed.
program define logrank /* timevar [deadvar] [, by(group) t0(t0) id(tvid)] */, rclass
version 6.0, missing
syntax varlist(min=1 max=2) [if] [in] [fw iw] [, /*
*/ BY(varlist) CHECK Detail ID(varname) LOGRANK /*
*/ MAT(string) T0(varname) noTItle /*
*/ STrata(varlist) TVid(varname) trend DINOTE]
display "varlist: `varlist'"
display "t0: `t0'"
display "id: `id'"
etc...
Ultimately it ended up that the command was:
. logrank _t _d, by(posttran) id(id) t0(_t0)
Equality of survivor functions
Log-rank test
| Observed Expected
posttran | events events
---------+-------------------------
0 | 30 31.20
1 | 45 43.80
---------+-------------------------
Total | 75 75.00
chi2(1) = 0.13
Pr>chi2 = 0.7225