  • Program similar to rangestat that can handle categorical and multiple criteria

    Hi,
    Are there any programs similar to rangestat (from SSC) that can handle categorical variables (perhaps with an inlist() instead of an inrange() option) and multiple criteria?
    Thanks!

  • #2
    Give us specific problems that you want to solve. It's even possible that rangestat or rangerun (also SSC) could solve them!



    • #3
      Let's say we are interested in a binary variable Y, and I basically want to calculate the min of that Y across observations that fall within a geographic region and time interval. The geographic region is defined by FIPS state code, and let's assume that I have a function that tells me which states to include in the scope (e.g. if observation_i is CA then I'm interested in NV, AZ, OR). Our goal is to generate the statistic of interest (i.e. the min of Y over states and years) for different time and geographical scopes. Am I right in observing that rangestat only works with numeric conditions? If that is the case, I was thinking of handling the state conditions manually while incorporating the year conditions in an inner loop with rangestat, as in the rough sketch below.
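
      Something like this is what I have in mind (the state list and the +/-2 year window are hard-coded just to illustrate, and whether the current state's own values should count is a detail I would still need to handle):

      Code:
      * keep only the states in CA's scope (plus CA itself), then let
      * rangestat handle the year window; repeat for each state of interest
      preserve
      keep if inlist(fips, "CA", "NV", "AZ", "OR")
      rangestat (min) minY=Y, interval(year -2 2)
      keep if fips == "CA"
      save "results_CA.dta", replace
      restore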
      P.S. Finding programs that can deal with multiple conditions is secondary since I can always create two rangestat loops.
      Thanks!



      • #4
        On an international forum it's best not to assume that everyone else is in the same country. As it happens I am a geographer and think I can decode some of this, including CA, NV, AZ and OR.

        If regions are disjoint then this sounds reasonable. If you're talking about different neighbours for different states then I can't see how this fits within the scope of rangestat, but I think we need at least a toy example to think about this further.
        Last edited by Nick Cox; 08 Mar 2018, 14:42.



        • #5
          Thanks, Nick! Apologies for the abbreviations; I will keep that in mind next time. I'll try to come up with some pseudocode to motivate an example, but I think the problem is outside the scope of rangestat because:
          1. It involves different neighbouring states as the state of interest changes.
          2. You can't really subset a dataset based on a range of states, only select from a list.



          • #6
            One question is, how big is your dataset - how many observations per state per year (if year is your time interval)?

            I'm thinking of applying expand to your data so that each observation from, for example, CA is expanded into 3 observations, identified by a new variable "neighborof" with values NV, AZ, and OR, and so on for every state. The analysis for CA then consists of the observations for which neighborof=="CA".

            Perhaps there are better approaches than this, but it is a way of making use of the powerful rangestat and rangerun commands.
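
            Here's a rough sketch of the idea (file and variable names like "mydata.dta" and "pairs.dta" are made up, and I use joinby rather than a literal expand to create the duplicates; "pairs.dta" would hold one observation per (st, neighborof) pair, e.g. CA paired with NV, AZ, and OR):

            Code:
            * duplicate each observation once per state whose analysis it feeds
            use "mydata.dta", clear            // observations with st, year, Y
            joinby st using "pairs.dta"        // adds the neighborof variable
            * statistic over each state's scope, within a +/-2 year window
            rangestat (min) minY=Y, interval(year -2 2) by(neighborof)
            keep if neighborof == "CA"         // the analysis sample for CA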



            • #7
              A lot depends on the details of what you are trying to do. I'll guess that what you call geographical scopes are neighboring states. I'll also assume that you want to calculate a statistic over a range of years relative to the current observation.

              In the example below, I create two small datasets that define the geographic scope to use for CA and NV and panel data for a few states over 10 years.

              Since all observations from the same state have the same geographic scope, I use a loop over each fips code in the panel dataset and use merge to define the scope for the current state. I then set Y2use to the value of Y if the state is part of the scope for the current state. rangestat is then used to calculate the statistics over a -2/+2 year range.

              As usual, you can always check results for any given observation using simple Stata commands so I spot check an observation for CA and one for NV.

              Code:
              clear
              input str2 fips
              "AZ"
              "NV"
              "OR"
              end
              save "scope_CA.dta", replace
              
              clear
              input str2 fips
              "AZ"
              "CA"
              "ID"
              "OR"
              "UT"
              end
              save "scope_NV.dta", replace
              
              clear
              input str2 fips
              "AZ"
              "CA"
              "ID"
              "NV"
              "OR"
              "UT"
              end
              expand 10
              bysort fips: gen year = 2000 + _n
              set seed 31231
              gen Y = runiform()
              
              * loop over states
              gen meanY = .
              gen nobs = .
              levelsof fips, local(states) clean
              // override to limit example to two states
              local states CA NV
              foreach state of local states {
                  dis "doing `state'"
                  merge m:1 fips using "scope_`state'.dta", assert(master match)
                  gen Y2use = Y if _merge == 3
                  tab fips if !mi(Y2use)
                  rangestat (mean) Y2use (count) Y2use, interval(year -2 2)
                  replace meanY = Y2use_mean if fips == "`state'"
                  replace nobs = Y2use_count if fips == "`state'"
                  drop Y2use* _merge
              }
              
              * spot check a CA observation
              list in 17
              sum Y if inlist(fips,"AZ","NV","OR") & inrange(year, year[17]-2, year[17]+2)
              
              * spot check an NV observation
              list in 38
              sum Y if inlist(fips,"AZ","CA","ID","OR","UT") & inrange(year, year[38]-2, year[38]+2)
              and the spot-check results:
              Code:
              . * spot check a CA observation
              . list in 17
              
                   +------------------------------------------+
                   | fips   year          Y      meanY   nobs |
                   |------------------------------------------|
               17. |   CA   2007   .8792565   .4366143     15 |
                   +------------------------------------------+
              
              . sum Y if inlist(fips,"AZ","NV","OR") & inrange(year, year[17]-2, year[17]+2)
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                         Y |         15    .4366143    .2609226   .0175193   .8861946
              
              . 
              . * spot check an NV observation
              . list in 38
              
                   +------------------------------------------+
                   | fips   year          Y      meanY   nobs |
                   |------------------------------------------|
               38. |   NV   2008   .1444371   .4818619     25 |
                   +------------------------------------------+
              
              . sum Y if inlist(fips,"AZ","CA","ID","OR","UT") & inrange(year, year[38]-2, year[38]+2)
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                         Y |         25    .4818619    .3035426   .0398817   .9074047
              
              .



              • #8
                @Robert: Thanks for the toy example; your understanding of what I meant by geographical scope and time is spot on.
                @Robert and @William: Unfortunately, we are dealing with at least one million entries (observations at a HCPCS-ICD9-State-Year level), and I am interested in how the statistic of interest (in my case the min of the subset, in Robert's example the mean) evolves as I change the time and geographical conditions. Right now I am thinking of an outer loop that handles the geographical subset manually (i.e. selecting more and more neighbouring states as the scope expands from the closest to the farthest states) and an inner loop with rangestat that handles the year-based subset. I might try William's method with -expand-, but I would need to repeat it for 50 different distance settings as the scope grows from 0 neighbouring states to 49, so I'm not sure there are any efficiency gains at that point. Thoughts?
                Last edited by Hong-Yi Tu Ye; 08 Mar 2018, 22:38.



                • #9
                  Again, the devil is in the details. It kind of looks like you want to perform a grid search over distance and time and you need to calculate all possibilities. I would not try to expand the data as that only makes rangestat work harder. The best approach is probably to handle each iteration separately using the smallest possible dataset. For example, in the outer loop handling each state, reduce to observations within the geographic scope, loop over a range of years and save results for the current state separately.



                  • #10
                    I had some alone time on a train this afternoon so I thought I could use the opportunity to try out a strategy of looping over states and running rangestat only on observations within the geographical scope of the current state. Also, since the desired measure is only needed for observations within the current state, I use an invalid interval to skip calculating results for out-of-state observations.

                    Since this strategy requires a fair amount of data management gymnastics, I use runby (from SSC) to handle the looping over states. With runby, you write a little Stata program that will run on each by-group subset. The mechanics are quite flexible and what's left in memory when the program terminates is considered results for the by-group and accumulates. When all the by-groups have been processed, the accumulated results replace the data in memory.

                    Without a "function" to generate the geographical scope, I improvised and assigned states randomly (15% of other states are considered within the desired scope). Note that the current state is paired with itself; this is needed so that observations from the current state will be part of the sample used with rangestat. If you do not want the values for current state to be taken into consideration when calculating the desired statistic, they should simply be replaced with missing values within the subsample. For those who want to play along, here's the code I used to create a list of 50 states (+ DC); a pairing, for each of these states, with random states in long form; a sample dataset with about 1 million obs, and a 10% sample of the full dataset.

                    Code:
                    clear all
                    
                    * dataset of 50 states + DC -- 2 letter abbrev
                    gen st = ""
                    local i 0
                    foreach s in AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA ///
                            MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN ///
                            TX UT VA VT WA WI WV WY {
                        set obs `++i'
                        replace st = "`s'" in `i'
                    }
                    save "st.dta", replace
                    
                    * create geographic scope variables, -runby- is from SSC
                    set seed 12312
                    program random_match
                        local current = st
                        use "st.dta", clear
                        rename st st_2use
                        gen st = "`current'"
                        keep if runiform() < .15 | st_2use == "`current'"
                    end
                    runby random_match, by(st)
                    isid st st_2use, sort
                    save "scope_long.dta", replace
                    
                    * create dataset with about 1M obs
                    use "st.dta", clear
                    expand 20000
                    bysort st: gen Y = runiform()
                    gen year = runiformint(2000, 2017)
                    gen long id = _n
                    save "Y_data.dta", replace
                    
                    * create a 10% sample
                    keep if runiform() < .1
                    save "Y_data_10%.dta", replace
                    With these datasets saved in the current directory, I use runby to loop over the dataset of states (one observation per state). Each state forms a by-group with a single observation. When the scope_it program is called, it first loads the relevant state pairings stored in "scope_long.dta". These are then merged with the 10% sample dataset and reduced to matching observations.

                    I then loop over calls to rangestat using a time window that extends back in time. Valid lower and upper bounds for the interval are defined only for the current state; all other observations will not get results (rangestat does not calculate results if the interval is invalid). Note that I use (mean) instead of (min) because it reduces the likelihood that a spot check will fail if the current observation has the lowest value for Y. Each call to rangestat within the loop generates separate variables for the statistic and the observation count. The last command reduces the dataset to observations from the current state. These are considered results and accumulate.

                    Code:
                    clear all
                    
                    * use -runby- (from SSC) to loop over states
                    use "st.dta", clear
                    program scope_it
                        local current = st
                        use "scope_long.dta" if st == "`current'", clear
                        keep st_2use
                        rename st_2use st
                        merge 1:m st using "Y_data_10%.dta", assert(match using) keep(match) nogen
                        sort year
                        dis _n "sample for states within the geographical scope of `current'"
                        tab year st
                        forvalues i = 0/5 {
                            gen low  = cond(st == "`current'", year-`i', 1)
                            gen high = cond(st == "`current'", year, 0)
                            rangestat (mean) minY_`i'=Y (count) n_`i'=Y, interval(year low high)
                            drop low high
                        }
                        keep if st == "`current'"
                    end
                    runby scope_it, by(st) verbose
                    
                    isid id, sort
                    save "results_10%.dta", replace
                    It's a bit tricky to spot check results because you have to be able to recreate the correct scope used to calculate results for the target observation. Here's code that does just that for two observations in AK and two in CA.

                    Code:
                    * show that the results contain the original variables and observations
                    use "results_10%.dta", clear
                    cf st Y year id using "Y_data_10%.dta", all
                    
                    * merge geographical scope for AK with results
                    use "scope_long.dta" if st == "AK", clear
                    keep st_2use
                    list
                    rename st_2use st
                    merge 1:m st using "results_10%.dta"
                    sort id
                    cf st Y year id using "Y_data_10%.dta", all // show that nothing changed
                    
                    * spot check for some observations in AK
                    list in 56
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[56]-`j', year[56]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    list in 1933
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[1933]-`j', year[1933]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    
                    * merge geographical scope for CA with results
                    use "scope_long.dta" if st == "CA", clear
                    keep st_2use
                    list
                    rename st_2use st
                    merge 1:m st using "results_10%.dta"
                    sort id
                    cf st Y year id using "Y_data_10%.dta", all // show that nothing changed
                    
                    * spot check for some observations in CA
                    list in 8031
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[8031]-`j', year[8031]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    list in 9983
                    forvalues j=0/5 {
                        sum Y if _merge == 3 & inrange(year, year[9983]-`j', year[9983]), meanonly
                        dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                    }
                    Here are the results for the AK spot checks:

                    Code:
                    . * spot check for some observations in AK
                    . list in 56
                    
                         +------------------------------------------------------------------------------------+
                     56. | st |        Y | year |  id |    minY_0 | n_0 |    minY_1 |  n_1 |    minY_2 |  n_2 |
                         | AK | .4255773 | 2014 | 582 | .52047304 | 912 | .51272567 | 1834 | .50705929 | 2725 |
                         |----------------------+-------------------------------------------------------------|
                         |     minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5  |       _merge  |
                         |  .50640955  |  3657  |  .50784093  |  4492  |  .50467394  |  5379  |  matched (3)  |
                         +------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[56]-`j', year[56]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .52047304 n = 912
                    minY_1 = .51272567 n = 1834
                    minY_2 = .50705929 n = 2725
                    minY_3 = .50640955 n = 3657
                    minY_4 = .50784093 n = 4492
                    minY_5 = .50467394 n = 5379
                    
                    . list in 1933
                    
                          +-------------------------------------------------------------------------------------+
                    1933. | st |        Y | year |    id |    minY_0 | n_0 |    minY_1 |  n_1 |   minY_2 |  n_2 |
                          | AK | .9820464 | 2007 | 19343 | .51469251 | 907 | .50133866 | 1801 | .4989838 | 2696 |
                          |----------------------+--------------------------------------------------------------|
                          |     minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5  |       _merge   |
                          |  .50115699  |  3551  |  .49818084  |  4459  |  .49963947  |  5352  |  matched (3)   |
                          +-------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[1933]-`j', year[1933]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .51469251 n = 907
                    minY_1 = .50133866 n = 1801
                    minY_2 = .4989838 n = 2696
                    minY_3 = .50115699 n = 3551
                    minY_4 = .49818084 n = 4459
                    minY_5 = .49963947 n = 5352
                    
                    .
                     and the results for the CA spot checks:

                    Code:
                    . * spot check for some observations in CA
                    . list in 8031
                    
                          +------------------------------------------------------------------------------------+
                    8031. | st |        Y | year |    id |   minY_0 | n_0 |    minY_1 | n_1 |    minY_2 |  n_2 |
                          | CA | .8724612 | 2017 | 80312 | .4918918 | 441 | .49346747 | 898 | .50508394 | 1317 |
                          |----------------------+------------------------------------+------------------------|
                          |     minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5  |       _merge  |
                          |  .49826881  |  1763  |  .49352776  |  2191  |  .49645966  |  2635  |  matched (3)  |
                          +------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[8031]-`j', year[8031]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .4918918 n = 441
                    minY_1 = .49346747 n = 898
                    minY_2 = .50508394 n = 1317
                    minY_3 = .49826881 n = 1763
                    minY_4 = .49352776 n = 2191
                    minY_5 = .49645966 n = 2635
                    
                    . list in 9983
                    
                          +-------------------------------------------------------------------------------------+
                    9983. | st |        Y | year |    id |    minY_0 | n_0 |    minY_1 | n_1 |    minY_2 |  n_2 |
                          | CA | .1749907 | 2010 | 99984 | .49497293 | 457 | .50147114 | 894 | .49737837 | 1392 |
                          |-------------------------------------------------------------------------------------|
                          |    minY_3  |   n_3  |     minY_4  |   n_4  |     minY_5  |   n_5   |       _merge   |
                          |  .5018406  |  1835  |  .50560071  |  2241  |  .50362051  |  2677   |  matched (3)   |
                          +-------------------------------------------------------------------------------------+
                    
                    . forvalues j=0/5 {
                      2.         sum Y if _merge == 3 & inrange(year, year[9983]-`j', year[9983]), meanonly
                      3.         dis as txt "minY_`j' = " as res r(mean) as txt " n = " as res r(N)
                      4. }
                    minY_0 = .49497293 n = 457
                    minY_1 = .50147114 n = 894
                    minY_2 = .49737837 n = 1392
                    minY_3 = .5018406 n = 1835
                    minY_4 = .50560071 n = 2241
                    minY_5 = .50362051 n = 2677
                    
                    .
                    The code can easily be extended to the full 1M-observation sample. On my laptop, the full-sample run takes just under 5 minutes.



                    • #11
                      Hi,
                      Robert Picard
                      I have been experimenting with runby and I think I am close to getting what I want, but I have one quick follow-up: is it possible to pass arguments to the program as part of a runby command?
                      Something like:
                      Code:
                      runby calcMean arg1("input"), by(state)
                      I was reading through the runby documentation and my impression is that it is not possible to pass additional arguments through to the program, but perhaps I'm missing something.
                      Thanks!
                      Last edited by Hong-Yi Tu Ye; 17 Mar 2018, 13:32.



                      • #12
                        With runby, your user-written program will be called once for each by-group in the data. I see no point in having runby pass the same argument to the same program as it processes by-groups. In that case, why not bake the argument into your program? If that's not something you can/want to do, you can also store the argument in a scalar before calling runby. Since the scope of a scalar is global, you can access it within your program. If the argument(s) are more complicated, you can save them in a dataset/file and read them within the program.
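
                        For example, here's a minimal sketch of the scalar idea (reusing the calcMean and state names from #11; the Y variable and the cutoff value are made up for illustration):

                        Code:
                        * store the "argument" in a scalar before calling -runby-;
                        * scalars have global scope, so the program can read it directly
                        scalar cutoff = 100
                        program calcMean
                            * the data in memory is the current by-group
                            summarize Y if Y < scalar(cutoff), meanonly
                            generate meanY = r(mean)
                        end
                        runby calcMean, by(state)
                        * a global macro (e.g. $cutoff) would work in much the same way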



                        • #13
                          Thanks, your global macro suggestion worked!

