  • Program similar to rangestat that can handle categorical and multiple criteria

    Hi,
    Are there any programs similar to rangestat (from SSC) that can handle categorical variables (perhaps with an inlist() instead of an inrange() option) and multiple criteria?
    Thanks!

  • #2
    Give us specific problems that you want to solve. It's even possible that rangestat or rangerun (also SSC) could solve them!



    • #3
      Let's say we are interested in a binary variable Y, and I basically want to calculate the min of that Y across observations that fall within a geographic region and time interval. The geographic region is defined by FIPS state code, and let's assume that I have a function that tells me which states to include in the scope (e.g. if observation_i is CA then I'm interested in NV, AZ, OR). Our goal is to generate the statistic of interest (i.e. the min of Y over states and years) for different time and geographical scopes. Am I right in observing that rangestat only works with numeric conditions? If that is the case, I was thinking of handling the state conditions manually while incorporating the year conditions in an inner loop with rangestat, as in the rough sketch below.
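
      Something like this is what I have in mind (the state list and the +/-2 year window are hard-coded just to illustrate, and whether the current state's own values should count is a detail I would still need to handle):

      Code:
      * keep only the states in CA's scope (plus CA itself), then let
      * rangestat handle the year window; repeat for each state of interest
      preserve
      keep if inlist(fips, "CA", "NV", "AZ", "OR")
      rangestat (min) minY=Y, interval(year -2 2)
      keep if fips == "CA"
      save "results_CA.dta", replace
      restore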
      P.S. Finding programs that can deal with multiple conditions is secondary since I can always create two rangestat loops.
      Thanks!



      • #4
        On an international forum it's best not to assume that everyone else is in the same country. As it happens I am a geographer and think I can decode some of this, including CA, NV, AZ and OR.

        If regions are disjoint then this sounds reasonable. If you're talking about different neighbours for different states then I can't see how this fits within the scope of rangestat, but I think we need at least a toy example to think about this further.
        Last edited by Nick Cox; 08 Mar 2018, 14:42.



        • #5
          Thanks, Nick! Apologies for the abbreviations; I will keep that in mind next time. I'll try to come up with some pseudocode to motivate an example, but I think the problem is outside the scope of rangestat because:
          1. It involves different neighbouring states as the state of interest changes.
          2. You can't really subset a dataset based on a range of states, only select from a list.



          • #6
            One question is, how big is your dataset - how many observations per state per year (if year is your time interval)?

            I'm thinking of applying expand to your data so that each observation from, for example, CA is expanded into 3 observations, identified by a new variable "neighborof" with values NV, AZ, and OR, and so on for every state. The analysis for CA then consists of the observations for which neighborof=="CA".

            Perhaps there are better approaches than this, but it is a way of making use of the powerful rangestat and rangerun commands.
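
            Here's a rough sketch of the idea (file and variable names like "mydata.dta" and "pairs.dta" are made up, and I use joinby rather than a literal expand to create the duplicates; "pairs.dta" would hold one observation per (st, neighborof) pair, e.g. CA paired with NV, AZ, and OR):

            Code:
            * duplicate each observation once per state whose analysis it feeds
            use "mydata.dta", clear            // observations with st, year, Y
            joinby st using "pairs.dta"        // adds the neighborof variable
            * statistic over each state's scope, within a +/-2 year window
            rangestat (min) minY=Y, interval(year -2 2) by(neighborof)
            keep if neighborof == "CA"         // the analysis sample for CA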



            • #7
              A lot depends on the details of what you are trying to do. I'll guess that what you call geographical scopes are neighboring states. I'll also assume that you want to calculate a statistic over a range of years relative to the current observation.

              In the example below, I create two small datasets that define the geographic scope to use for CA and NV and panel data for a few states over 10 years.

              Since all observations from the same state have the same geographic scope, I use a loop over each fips code in the panel dataset and use merge to define the scope for the current state. I then set Y2use to the value of Y if the state is part of the scope for the current state. rangestat is then used to calculate the statistics over a -2/+2 year range.

              As usual, you can always check results for any given observation using simple Stata commands so I spot check an observation for CA and one for NV.

              Code:
              clear
              input str2 fips
              "AZ"
              "NV"
              "OR"
              end
              save "scope_CA.dta", replace
              
              clear
              input str2 fips
              "AZ"
              "CA"
              "ID"
              "OR"
              "UT"
              end
              save "scope_NV.dta", replace
              
              clear
              input str2 fips
              "AZ"
              "CA"
              "ID"
              "NV"
              "OR"
              "UT"
              end
              expand 10
              bysort fips: gen year = 2000 + _n
              set seed 31231
              gen Y = runiform()
              
              * loop over states
              gen meanY = .
              gen nobs = .
              levelsof fips, local(states) clean
              // override to limit example to two states
              local states CA NV
              foreach state of local states {
                  dis "doing `state'"
                  merge m:1 fips using "scope_`state'.dta", assert(master match)
                  gen Y2use = Y if _merge == 3
                  tab fips if !mi(Y2use)
                  rangestat (mean) Y2use (count) Y2use, interval(year -2 2)
                  replace meanY = Y2use_mean if fips == "`state'"
                  replace nobs = Y2use_count if fips == "`state'"
                  drop Y2use* _merge
              }
              
              * spot check a CA observation
              list in 17
              sum Y if inlist(fips,"AZ","NV","OR") & inrange(year, year[17]-2, year[17]+2)
              
              * spot check an NV observation
              list in 38
              sum Y if inlist(fips,"AZ","CA","ID","OR","UT") & inrange(year, year[38]-2, year[38]+2)
              and the spot-check results:
              Code:
              . * spot check a CA observation
              . list in 17
              
                   +------------------------------------------+
                   | fips   year          Y      meanY   nobs |
                   |------------------------------------------|
               17. |   CA   2007   .8792565   .4366143     15 |
                   +------------------------------------------+
              
              . sum Y if inlist(fips,"AZ","NV","OR") & inrange(year, year[17]-2, year[17]+2)
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                         Y |         15    .4366143    .2609226   .0175193   .8861946
              
              . 
              . * spot check an NV observation
              . list in 38
              
                   +------------------------------------------+
                   | fips   year          Y      meanY   nobs |
                   |------------------------------------------|
               38. |   NV   2008   .1444371   .4818619     25 |
                   +------------------------------------------+
              
              . sum Y if inlist(fips,"AZ","CA","ID","OR","UT") & inrange(year, year[38]-2, year[38]+2)
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                         Y |         25    .4818619    .3035426   .0398817   .9074047
              
              .



              • #8
                @Robert: Thanks for the toy example; your understanding of what I meant by geographical scope and time is spot on.
                @Robert and @William: Unfortunately, we are dealing with at least one million entries (observations at a HCPCS-ICD9-State-Year level), and I am interested in how the statistic of interest (in my case the min of the subset, in Robert's example the mean) evolves as I change the time and geographical conditions. Right now I am thinking of an outer loop that handles the geographical subset manually (i.e. selecting more and more neighbouring states as the scope expands from the closest to the farthest states) and an inner loop with rangestat that handles the year-based subset. I might try William's method with -expand-, but I would need to repeat it for 50 different distance settings as the scope grows from 0 neighbouring states to 49, so I'm not sure there are any efficiency gains at that point. Thoughts?
                Last edited by Hong-Yi Tu Ye; 08 Mar 2018, 22:38.



                • #9
                  Again, the devil is in the details. It kind of looks like you want to perform a grid search over distance and time and you need to calculate all possibilities. I would not try to expand the data as that only makes rangestat work harder. The best approach is probably to handle each iteration separately using the smallest possible dataset. For example, in the outer loop handling each state, reduce to observations within the geographic scope, loop over a range of years and save results for the current state separately.



                  • #10
                    I had some alone time on a train this afternoon so I thought I could use the opportunity to try out a strategy of looping over states and running rangestat only on observations within the geographical scope of the current state. Also, since the desired measure is only needed for observations within the current state, I use an invalid interval to skip calculating results for out-of-state observations.

                    Since this strategy requires a fair amount of data management gymnastics, I use runby (from SSC) to handle the looping over states. With runby, you write a little Stata program that will run on each by-group subset. The mechanics are quite flexible and what's left in memory when the program terminates is considered results for the by-group and accumulates. When all the by-groups have been processed, the accumulated results replace the data in memory.

                    Without a "function" to generate the geographical scope, I improvised and assigned states randomly (15% of other states are considered within the desired scope). Note that the current state is paired with itself; this is needed so that observations from the current state will be part of the sample used with rangestat. If you do not want the values for current state to be taken into consideration when calculating the desired statistic, they should simply be replaced with missing values within the subsample. For those who want to play along, here's the code I used to create a list of 50 states (+ DC); a pairing, for each of these states, with random states in long form; a sample dataset with about 1 million obs, and a 10% sample of the full dataset.

                    Code:
                    clear all
                    
                    * dataset of 50 states + DC -- 2 letter abbrev
                    gen st = ""
                    local i 0
                    foreach s in AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA ///
                            MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN ///
                            TX UT VA VT WA WI WV WY {
                        set obs `++i'
                        replace st = "`s'" in `i'
                    }
                    save "st.dta", replace
                    
                    * create geographic scope variables, -runby- is from SSC
                    set seed 12312
                    program random_match
                        local current = st
                        use "st.dta", clear
                        rename st st_2use
                        gen st = "`current'"
                        keep if runiform() < .15 | st_2use == "`current'"
                    end
                    runby random_match, by(st)
                    isid st st_2use, sort
                    save "scope_long.dta", replace
                    
                    * create dataset with about 1M obs
                    use "st.dta", clear
                    expand 20000
                    bysort st: gen Y = runiform()
                    gen year = runiformint(2000, 2017)
                    gen long id = _n
                    save "Y_data.dta", replace
                    
                    * create a 10% sample
                    keep if runiform() < .1
                    save "Y_data_10%.dta", replace
                    With these datasets saved in the current directory, I use runby to loop over the dataset of states (one observation per state). Each state forms a by-group with a single observation. When the scope_it program is called, it first loads the relevant state pairings stored in "scope_long.dta". These are then merged with the 10% sample dataset and reduced to matching observations.

                    I then loop over calls to rangestat using a time window that extends back in time. Valid lower and upper bounds for the interval are defined only for the current state; all other observations will not get results (rangestat does not calculate results if the interval is invalid). Note that I use (mean) instead of (min) because it reduces the likelihood that a spot check will fail if the current observation has the lowest value for Y. Each call to rangestat within the loop generates separate variables for the statistic and the observation count. The last command reduces the dataset to observations from the current state. These are considered results and accumulate.

                    Code:
                    clear all
                    
                    * use -runby- (from SSC) to loop over states
                    use "st.dta", clear
                    program scope_it
                        local current = st
                        use "scope_long.dta" if st == "`current'", clear
                        keep st_2use
                        rename st_2use st
                        merge 1:m st using "Y_data_10%.dta", assert(match using) keep(match) nogen
                        sort year
                        dis _n "sample for states within the geographical scope of `current'"
                        tab year st
                        forvalues i = 0/5 {
                            gen low  = cond(st == "`current'", year-`i', 1)
                            gen high = cond(st == "`current'", year, 0)
                            rangestat (mean) minY_`i'=Y (count) n_`i'=Y, interval(year low high)
                            drop low high
                        }
                        keep if st == "`current'"
                    end
                    runby scope_it, by(st) verbose
                    
                    isid id, sort
                    save "results_10%.dta", replace
                    It's a bit tricky to spot check results because you have to be able to recreate the correct scope used to calculate results for the target observation. Here's code that does just that for two observations in AK and two in CA.

                    Code:
                    * show that the results contain the original variables and observations
                    use "results_10%.dta", clear
                    cf st Y year id using "Y_data_10%.dta", all
                    
                    * merge geographical scope for AK with results
                    use "scope_long.dta" if st == "AK", clear
                    keep st_2use
                    list
                    rename st_2use st
                    merge 1:m st using "results_10%.dta"
                    sort id
                    cf st Y year id using "Y_data_10%.dta", all // show that nothing changed
                    
                    * spot check for some observations in AK
                    list in 56
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[56]-`j', year[56]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    list in 1933
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[1933]-`j', year[1933]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    
                    * merge geographical scope for CA with results
                    use "scope_long.dta" if st == "CA", clear
                    keep st_2use
                    list
                    rename st_2use st
                    merge 1:m st using "results_10%.dta"
                    sort id
                    cf st Y year id using "Y_data_10%.dta", all // show that nothing changed
                    
                    * spot check for some observations in CA
                    list in 8031
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[8031]-`j', year[8031]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    list in 9983
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[9983]-`j', year[9983]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    Here are the results for the AK spot checks:

                    Code:
                    . * spot check for some observations in AK
                    . list in 56
                    
                         +------------------------------------------------------------------------------------+
                     56. | st |        Y | year |  id |    minY_0 | n_0 |    minY_1 |  n_1 |    minY_2 |  n_2 |
                         | AK | .4255773 | 2014 | 582 | .52047304 | 912 | .51272567 | 1834 | .50705929 | 2725 |
                         |----------------------+-------------------------------------------------------------|
                         |     minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5  |       _merge  |
                         |  .50640955  |  3657  |  .50784093  |  4492  |  .50467394  |  5379  |  matched (3)  |
                         +------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[56]-`j', year[56]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .52047304 n = 912
                    minY_1 = .51272567 n = 1834
                    minY_2 = .50705929 n = 2725
                    minY_3 = .50640955 n = 3657
                    minY_4 = .50784093 n = 4492
                    minY_5 = .50467394 n = 5379
                    
                    . list in 1933
                    
                          +-------------------------------------------------------------------------------------+
                    1933. | st |        Y | year |    id |    minY_0 | n_0 |    minY_1 |  n_1 |   minY_2 |  n_2 |
                          | AK | .9820464 | 2007 | 19343 | .51469251 | 907 | .50133866 | 1801 | .4989838 | 2696 |
                          |----------------------+--------------------------------------------------------------|
                          |     minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5  |       _merge   |
                          |  .50115699  |  3551  |  .49818084  |  4459  |  .49963947  |  5352  |  matched (3)   |
                          +-------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[1933]-`j', year[1933]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .51469251 n = 907
                    minY_1 = .50133866 n = 1801
                    minY_2 = .4989838 n = 2696
                    minY_3 = .50115699 n = 3551
                    minY_4 = .49818084 n = 4459
                    minY_5 = .49963947 n = 5352
                    
                    .
                     and the results for the CA spot checks:

                    Code:
                    . * spot check for some observations in CA
                    . list in 8031
                    
                          +------------------------------------------------------------------------------------+
                    8031. | st |        Y | year |    id |   minY_0 | n_0 |    minY_1 | n_1 |    minY_2 |  n_2 |
                          | CA | .8724612 | 2017 | 80312 | .4918918 | 441 | .49346747 | 898 | .50508394 | 1317 |
                          |----------------------+------------------------------------+------------------------|
                          |     minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5  |       _merge  |
                          |  .49826881  |  1763  |  .49352776  |  2191  |  .49645966  |  2635  |  matched (3)  |
                          +------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[8031]-`j', year[8031]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .4918918 n = 441
                    minY_1 = .49346747 n = 898
                    minY_2 = .50508394 n = 1317
                    minY_3 = .49826881 n = 1763
                    minY_4 = .49352776 n = 2191
                    minY_5 = .49645966 n = 2635
                    
                    . list in 9983
                    
                          +-------------------------------------------------------------------------------------+
                    9983. | st |        Y | year |    id |    minY_0 | n_0 |    minY_1 | n_1 |    minY_2 |  n_2 |
                          | CA | .1749907 | 2010 | 99984 | .49497293 | 457 | .50147114 | 894 | .49737837 | 1392 |
                          |-------------------------------------------------------------------------------------|
                          |    minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5   |       _merge   |
                          |  .5018406  |  1835  |  .50560071  |  2241  |  .50362051  |  2677   |  matched (3)   |
                          +-------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[9983]-`j', year[9983]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .49497293 n = 457
                    minY_1 = .50147114 n = 894
                    minY_2 = .49737837 n = 1392
                    minY_3 = .5018406 n = 1835
                    minY_4 = .50560071 n = 2241
                    minY_5 = .50362051 n = 2677
                    
                    .
                    The code can easily be extended to the full 1M-observation sample. On my laptop, the full-sample run takes just under 5 minutes.



                    • #11
                      Hi,
                      Robert Picard
                      I have been experimenting with runby and I think I am close to getting what I want, but I have one quick follow-up: is it possible to pass arguments to the program as part of a runby command?
                      Something like:
                      Code:
                      runby calcMean arg1("input"), by(state)
                      I was reading through the runby documentation and my impression is that it is not possible to pass additional arguments through to the program, but perhaps I'm missing something.
                      Thanks!
                      Last edited by Hong-Yi Tu Ye; 17 Mar 2018, 13:32.



                      • #12
                        With runby, your user-written program will be called once for each by-group in the data. I see no point in having runby pass the same argument to the same program as it processes by-groups. In that case, why not bake the argument into your program? If that's not something you can/want to do, you can also store the argument in a scalar before calling runby. Since the scope of a scalar is global, you can access it within your program. If the argument(s) are more complicated, you can save them in a dataset/file and read them within the program.
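
                        For example, here's a minimal sketch of the scalar idea (reusing the calcMean and state names from #11; the Y variable and the cutoff value are made up for illustration):

                        Code:
                        * store the "argument" in a scalar before calling -runby-;
                        * scalars have global scope, so the program can read it directly
                        scalar cutoff = 100
                        program calcMean
                            * the data in memory is the current by-group
                            summarize Y if Y < scalar(cutoff), meanonly
                            generate meanY = r(mean)
                        end
                        runby calcMean, by(state)
                        * a global macro (e.g. $cutoff) would work in much the same way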



                        • #13
                          Thanks, your global macro suggestion worked!

