
  • Creating a Loop over a Loop

    Greetings

    I would like to run the following operation 1,000 times and store the coefficient estimates from each subsample. Here is a summary of what I am doing. I have a data set of about 64,033 individuals, and it has been difficult to create a weight matrix because of memory constraints. I use the loop below to draw a random sample from the full data set and create weights based on that sample; I also calculate the mean HIV prevalence of all the neighbors in my sample (HIV status is my dependent variable). After this I keep only the sample that has been drawn and run a probit regression on it. However, I am trying to repeat this operation 1,000 times and store the regression estimates from each randomly drawn sample. I need to create a loop that will enable me to do this, but it has been a challenge; could anyone assist? The procedure I am trying to repeat 1,000 times is given below.

    *Load full data
    use "D:\Users\mi8318ch\Stata Files\Medical Study\DHS Data\Panel Data DHS\New Merge\DHSFinalAnalysis.dta"

    /*Create a random variable to select sub-sample*/
    quietly generate random = runiform()
    quietly generate sample = random<.1

    /*Creating a temporal variable to stack the spatial lagged variable*/
    quietly generate Wy = .
    gsort -sample random
    quietly generate idtemp = _n
    /*Calculating the spatial lagged variable*/
    /* for the dependent variable */
    /*Identifying the beginning and end of the calculation to be made*/
    quietly summarize idtemp if sample==1

    /*Identifying the # of observations to loop on*/
    global nstart = r(min)
    global nend = r(max)

    /*Starting the calculation for the same time period*/
    forvalues i = $nstart/$nend {
    *display `i'
    /*Determining the number of nearest neighbours*/
    global nn = 5
    /*Identifying the geographical coordinate of the observation*/
    quietly summarize longnum if idtemp==`i'
    global xi = r(mean)
    quietly summarize latnum if idtemp==`i'
    global yi = r(mean)
    /*Calculating the distance to all other observations*/
    quietly generate distance = sqrt((longnum - $xi)^2 + (latnum - $yi)^2)
    /*Creating a temporary variable*/
    quietly generate temp = sample==0
    quietly replace temp = 2 if idtemp==`i'
    /*Sorting observations*/
    gsort distance
    /*Calculating the mean hiv prevalence of nearest neighbours*/
    quietly summarize hivpositive if idtemp!=`i' & _n<=($nn + 1)
    /*Replacing the value for the spatial lagged variable*/
    quietly replace Wy = r(mean) if idtemp==`i'
    /*Dropping unnecessary and temporary variables*/
    drop distance temp
    }

    /*Save the temporary data set with a subsample*/
    keep if sample==1
    drop idtemp
    * save temporary data (that is, the random sample)
    save "D:\Users\mi8318ch\Stata Files\Medical Study\DHS Data\Panel Data DHS\New Merge\Temporary data\TempDHSdata4.dta", replace

    /*Make the analysis*/

    svy: probit hivpositive churchkm Wy churchW lnhospitalkm dhsprotestant2 Age Age2 Married female urban1 river10kmdum Explorer50kmdum Rail50kmdum lnElevationMean i.Province1 i.WealthIndex HIVKnowledge i.occupation2 i.highested
    estimates store reg_1


  • #2
    Wrap the calculations in an -eclass- program that returns the probit results, and then use the -simulate- command to iterate it.

    Alternatively, have you considered doing this with -bootstrap-?



    • #3
      Hi Clyde

      Thank you so much for your response, much appreciated. I have not used -eclass- or -bootstrap- before; could you recommend any forums where the use of these programs is discussed?



      • #4
        Frankly, I always have difficulty using the search facilities on this Forum. Some people seem to easily find whatever they are looking for, but I usually strike out.

        In any case, -eclass- is easy. You write a program that contains all the calculations you want to do on the random sample, culminating in your probit command. When you write that program, in the -program define …- statement, add the -eclass- option. That way -simulate- will know that your program returns estimation results. And that also means that without your having to do anything more, -simulate- will also retrieve the coefficients that -probit- leaves behind. Don't forget to specify the -saving()- option in your -simulate- command. And don't forget to include the code that selects the random sample inside the program: -simulate- does not do that for you.
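        To make that concrete, here is a minimal skeleton of the pattern (the file, program, and variable names are placeholders, not your actual ones):
        Code:
        * minimal sketch: all names below are placeholders
        capture program drop mysim
        program define mysim, eclass
            * reload the full data on every call
            use "fulldata.dta", clear
            * select the random subsample inside the program;
            * -simulate- will not do this for you
            generate byte insample = runiform() < .1
            keep if insample
            * ... construct Wy and any other derived variables here ...
            probit y x1 x2
        end
        *
        set seed 12345
        simulate _b _se, reps(1000) saving(sim_results, replace): mysim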

        As for -bootstrap-, on second thought, I don't think your problem really lends itself to it. The advantage of -bootstrap- over -simulate- is that it does do the sampling for you--but it does so using sampling with replacement (there is nothing wrong with that but it appears not to be what you have in mind), and sometimes it does not "play nicely" with programs that filter the data with -if- conditions, as yours does. While there are ways to work around that last problem, they can be complicated and also they may substantially slow down the execution, which might add up to a real chunk of time when replicated x 1000.

        Finally, one question for you. Do you really want that -save- command? It will leave you only with the very last sample since the command simply overwrites the same file in each iteration. What use will you make of that saved file? And while you could modify that command to save a separate file at each iteration, that will put 1,000 new files, each 10% of the size of your starting data, on your hard drive. What would you do with those? I can see perhaps keeping that save command in there while you are debugging your program, so you can examine it and see if your various transformations and filtering conditions have been applied properly. But when it comes time to actually run the 1,000 iterations, I think it would be better to comment out that line.



        • #5
          Beyond what Clyde mentions, I have a possible suggestion, which arises from a few lines of your code. You have:
          Code:
          global xi = r(mean)
          quietly summarize latnum if idtemp==`i'
          global yi = r(mean)
          /*Calculating the distance to all other observations*/
          quietly generate distance = sqrt((longnum - $xi)^2 + (latnum - $yi)^2)
          This is a difficult and unusual way to find the nearest geographical neighbors of each observation. It took me quite a while to realize that you were using -summarize- to obtain the value of a single observation. In any event, you're reinventing the wheel here. The user-contributed program -geonear- (-ssc describe geonear-) is *the* way to address all "nearest geographical neighbors" problems. I'm not the author, just a fan. (It solved a nearest neighbor problem for me in 5 min. for which another program was estimated to take 6 months!)

          For example, here's an illustration similar to your situation. It took about 3 min. on my machine to find the nearest 5 neighbors of all 64,000 observations.
          Code:
          //Simulate data.
          clear
          set seed 54321
          set obs 64000
          gen long id_self = _n
          gen double latitude = 20 + (60 - 20) * runiform()
          gen double longitude = -120 + (120 - 80) * runiform()
          gen hivpos = runiform()/100
          rename id_self id_neighbor
          tempfile nborfile
          save `nborfile'
          // end simulate data
          //  
          // The real work. Will result in each observation being paired
          // with its nearest five neighbors using long layout.
          rename id_neighbor id_self
          geonear id_self latitude longitude using "`nborfile'",  ///
             long nearcount(5) ignoreself ///
             neighbors(id_neighbor latitude longitude) report(5)
          //  
          // Obtain neighbors' hiv data for each observation and get the mean.
          merge m:1 id_neighbor using "`nborfile'", keep(master match) nogen
          egen hiv_neigh = mean(hivpos), by(id_self)
          // Only need one observation on self to carry neighbor mean.
          bysort id_self: keep if _n == 1

          Another question: It sounds like you plan to find the nearest neighbors 1,000 times. Is this really what you want? My best understanding is that you want to pull out the geographic component of your response with a weighted mean of the 5 nearest neighbors. What are you getting by doing it multiple times? One use of -geonear- will give you the mean of all your 64,000 observations relative to their neighbors. Were you perhaps doing it 1,000 times because you couldn't work with more than 6,400 or so observations at a time with your code? If so, -geonear- solves that issue. And why only 5 neighbors? You could include a lot more, if relevant, and could, if you like, weight them by distance if that's meaningful, or restrict to those within a certain radius, etc. I'm only modestly familiar with the spatial regression approach you're trying to work out here, but something seems odd to me, and I think one of the more spatially expert folks could help you out if you gave some more explanation.


          Other comments:
          Your use of -global- macros is ill-advised: globals are a dangerous practice and would confuse many experienced programmers. Defining a global only 3 lines before you use its value is particularly unusual. Every use you have for a global here would be better fulfilled by a local, and more generally, there's almost never any reason to use a global in Stata.

          Having
          Code:
          global nn = 5
          inside the loop is confusing. Its value won't change from iteration to iteration, so you only need to set it once, before the loop. That won't actually affect efficiency, but it could be confusing, as the reason to put something inside a loop is because it will change from one iteration to the next.
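          For instance, the top of the loop could be restructured like this, using locals throughout (just a sketch; the loop body stays as you have it, with the macro references changed to match):
          Code:
          local nn 5
          quietly summarize idtemp if sample==1
          local nstart = r(min)
          local nend = r(max)
          forvalues i = `nstart'/`nend' {
              * ... body as before, with $nn, $xi, $yi written as `nn', `xi', `yi' ...
          }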



          • #6
            Hi Clyde
            Thank you for explaining how -eclass- works; I will try it out and see how far I get. Regarding the -save- command, I will ultimately not use it.



            • #7
              Hi Mike

              Thank you for your detailed response. I have been trying to find an easier way of creating nearest neighbor weights with a large data set, and -geonear- is by far the easiest solution I have found. The reason I was calculating the nearest neighbors 1,000 times is, as you stated, that my code could not work with the entire sample: since I was resampling about 1,000 times, I had to make new nearest neighbor calculations for each subsample. Also, Stata cannot calculate anything over an 11,000 x 11,000 weight matrix; the memory does not allow for it. That is why I ultimately took the subsampling route. I have also noted your comment on the use of -global- macros. I will try out the -geonear- program. Thank you again for all your suggestions.



              • #8
                Your question inspired me to do some experimenting with -geonear-. One thing I found is that its run time goes up modestly with increasing numbers of neighbors, so if that matters to you, you might experiment in that direction. Also, although you might think that run time would go up as the square of the number of units, its performance is more nearly linear in this respect, at least at realistic sample sizes. Even with more neighbors, you should be able to keep your full sample and run within a reasonable amount of time.



                • #9
                  Hi Mike

                  Thank you for this suggestion.

                  Kind Regards



                  • #10
                    Hi Clyde

                    After trying out the different suggestions, I ended up using your initial suggestion: "Wrap the calculations in an -eclass- program that returns the probit results, and then use the -simulate- command to iterate it."

                    However, I would like your advice on whether I have set up the program the right way and whether I am simulating it correctly.

                    program define subsampling, eclass
                    drop _all
                    use "D:\Users\mi8318ch\Stata Files\Medical Study\DHS Data\Panel Data DHS\New Merge\DHSFinalAnalysis.dta"
                    keep hivpositive churchkm lnhospitalkm dhsprotestant2 Age Age2 Married female urban1 river10kmdum Explorer50kmdum Rail50kmdum lnElevationMean Province1 WealthIndex HIVKnowledge occupation2 highested v023 v021 v005 longnum latnum
                    gen wgt=v005/10^6
                    svyset [pweight=wgt], psu(v021) strata(v023)
                    quietly generate random = runiform()
                    quietly generate sample = random<.1
                    quietly generate Wy = .
                    gsort -sample random
                    quietly generate idtemp = _n
                    quietly summarize idtemp if sample==1
                    global nstart = r(min)
                    global nend = r(max)
                    /*Starting the calculation for the same time period*/
                    forvalues i = $nstart/$nend {
                    *display `i'
                    /*Determining the number of nearest neighbours*/
                    global nn = 10
                    /*Identifying the geographical coordinate of the observation*/
                    quietly summarize longnum if idtemp==`i'
                    global xi = r(mean)
                    quietly summarize latnum if idtemp==`i'
                    global yi = r(mean)
                    /*Calculating the distance to all other observations*/
                    quietly generate distance = sqrt((longnum - $xi)^2 + (latnum - $yi)^2)
                    /*Creating a temporary variable*/
                    quietly generate temp = sample==0
                    quietly replace temp = 2 if idtemp==`i'
                    /*Sorting observations*/
                    gsort distance
                    /*Calculating the mean hiv prevalence of nearest neighbours*/
                    quietly summarize hivpositive if idtemp!=`i' & _n<=($nn + 1)
                    /*Replacing the value for the spatial lagged variable*/
                    quietly replace Wy = r(mean) if idtemp==`i'
                    /*Dropping unnecessary and temporary variables*/
                    drop distance temp
                    }
                    keep if sample==1
                    drop idtemp
                    svy: probit hivpositive churchkm Wy lnhospitalkm dhsprotestant2 Age Age2 Married female urban1 river10kmdum Explorer50kmdum Rail50kmdum lnElevationMean i.Province1 i.WealthIndex HIVKnowledge i.occupation2 i.highested
                    end

                    simulate _b _se, reps(1000): subsampling



                    • #11
                      Offhand, it looks mostly ok. The only glaring thing is that you did not specify the -saving()- option in your -simulate- command, so you will run the 1000 iterations, but everything done will be discarded!

                      I'm not sure I grasp what is going on in your -forvalues i = $nstart/$nend- loop. It seems you are trying to find observations that are geographically near each other, or something like that. I wonder if you would benefit from using Robert Picard's -geonear- program (available from SSC) for this rather than homebrewing. If -geonear- does what you need, it will not only be less error prone than crafting your own code, it will run much faster. Since you're planning to iterate this 1,000 times, even small speed improvements will make a noticeable difference. I don't use -geonear- myself (because I don't work with spatial data), so I can't say for sure whether it's what you need, but I urge you to check it out. I've heard it's pretty close to an all-purpose program for managing spatial data.

                      Also, I notice that you are selecting samples by comparing a random number to 0.1. That means that your sample size is itself random rather than fixed. If that's truly what you want to simulate, that's fine. But it's kind of unusual. In this type of analysis people usually stipulate a desired fixed sample size and code the sampling accordingly.
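                      If a fixed size is what you intend, one simple sketch that keeps your sample-flag structure intact (the 6,400 figure is purely illustrative):
                      Code:
                      quietly generate double random = runiform()
                      sort random
                      * flags exactly 6,400 randomly chosen observations
                      quietly generate byte sample = _n <= 6400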

                      A couple of other things to consider. Don't forget to set your random number seed before you run this. (And don't put the -set seed- command inside program subsampling!) Also, probit regression estimations sometimes fail to converge. So you probably should specify a maximum number of iterations in your -probit- command so it doesn't just waste time. The -iterate()- option of -probit- will let you do that. And then add e(converged) to the list of parameters in your -simulate- command. Then when you have all your results, you can exclude the non-converged iterations from further analysis.
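                      Putting those pieces together, the final call might look something like this sketch (the results file name is hypothetical), with -iterate(50)- or the like added to the -probit- command inside your program:
                      Code:
                      set seed 12345
                      simulate _b _se converged=e(converged), reps(1000) ///
                          saving(subsample_results, replace): subsampling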

