
  • How to randomly assign individuals to different groups (other than the actual group)

    Hello, I have a panel dataset that records individual-level health across months within a single year. Individuals live in different counties, each county has a different level of pollution, and the number of people per county varies.

    So, my outcome variable is health, the explanatory variable is pollution. My main regression examines the impact of pollution on (individual's) health.

    Now I want to randomly assign individuals to counties other than their actual county, where pollution differs from that of their original county, to check whether the results are robust.

    How can I achieve this?

  • #2
    To give a helpful answer to your question, I need more information.

    1. I would guess that you want to repeatedly reassign ("shuffle") the county values, and perform some estimation command for each such assignment. Is this correct? If so, what estimation would that be? And, if so, what about your data makes conventional statistical procedures undesirable for your situation?

    2. If #1 is correct, do you want confidence intervals, or p-values?

    3. Do you want to reassign the county values in such a way that an individual stays in the same (reassigned) county for every month? Presuming that, do you want county values reassigned based on "month 1," "month N", or what?

    4. Please show an example of your data. In this regard, see item 12.2 of the StataList FAQ that newer participants on StataList are asked to read.

    I suspect that your data structure and goal will not fit nicely with either -permute- or -bootstrap-, although those might be a good starting point for your consideration. A better command might be the more versatile user-written command -ritest- (-search ritest-).



    • #3
      Depends on whether you never want the county to be the same.

      I'd think you'd want to assign randomly without restriction (imposing the null). For that, use shufflevar. It maintains the sample sizes by county, since it just shuffles the county identifier. It creates a new variable, so the original data are left untouched. Set up a loop, shuffle repeatedly, and store the estimates in some way.

      Crude example. There are other ways to store results. Could also move R to a new frame so as not to impact the original data.

      Code:
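      sysuse auto, clear
      set seed 12345    // arbitrary seed so the shuffles are reproducible

      * Baseline estimate to compare the permuted estimates against
      reg price weight i.foreign , r
      scalar beta = _b[weight]

      * One row per replication: coefficient and 5% rejection indicator
      matrix R = J(100,2,.)
      matrix colnames R = coef reject

      forv i = 1/100 {
          quietly {
              shufflevar foreign    // creates foreign_shuffled
              reg price weight i.foreign_shuffled , r
              matrix R[`i',1] = r(table)[1,1]           // coefficient on weight
              matrix R[`i',2] = r(table)[4,1] < 0.05    // p-value on weight below 0.05
          }
      }
      svmat R , n(col)
      summ coef reject
      * svmat pads the 74-car dataset to 100 rows, so count rows directly
      qui count if coef > beta & !missing(coef)
      di "Permuted beta > beta = " r(N)/100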
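
The remark above about moving the results to a new frame could look roughly like the sketch below. It is only an illustration under assumptions: it needs Stata 16 or later for frames, the frame name results and the helper variables k, u and j are made up for this example, the shuffle is done by hand so nothing has to be installed, and it tracks the coefficient on the shuffled variable with a two-sided comparison rather than the weight coefficient used above.

```stata
sysuse auto, clear
set seed 12345                       // arbitrary seed, for reproducibility only

* Baseline coefficient on the variable that will be shuffled
regress price weight i.foreign, vce(robust)
scalar beta = _b[1.foreign]

* Hypothetical frame that collects one coefficient per replication
frame create results coef

generate long k = _n                 // remembers the original row order
forvalues i = 1/100 {
    quietly {
        capture drop u j foreign_shuffled
        generate double u = runiform()
        sort u
        generate long j = _n         // random rank of each row
        sort k
        * Row k takes the value held by row j, i.e. a random permutation
        generate byte foreign_shuffled = foreign[j]
        regress price weight i.foreign_shuffled, vce(robust)
        frame post results (_b[1.foreign_shuffled])
    }
}

* Share of permuted coefficients at least as large in magnitude as the baseline
frame results: count if abs(coef) >= abs(beta)
display "Two-sided permutation p-value = " r(N)/100
```

Because the replications live in their own frame, the auto data keep their 74 rows and no padding observations are created.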




      • #4
        I had thought of suggesting -shufflevar- as well, but -search shufflevar- no longer finds it.



        • #5
          https://ideas.repec.org/c/boc/bocode/s457116.html



          • #6
            Originally posted by Mike Lacy

            Thank you Mike. See my sample data below. Note that individuals stay in the same county (countyid) across all months.

            The regression that I run is this:

            reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)

            Now I want to randomly assign individuals to other counties where pollution is different. Would ritest help in this case? Could you suggest code for it, or another approach?
            id  countyid  month  pollution  health  age (years)
             1         1      1        100       9           20
             1         1      2         80       8           20
             2         1      1        100       8           22
             2         1      2         80       7           22
             2         1      3         70       5           22
             3         2      1         50       6           19
             3         2      2         60       8           19
             3         2      3         30       7           19
             4         2      1         50       6           18
             4         2      2         60       9           18
             5         3      1         50       8           19
             5         3      2         80       8           19
             6         4      1         60       9           16
             6         4      2         60       8           16
             6         4      3         70       6           16
             7         2      1         50       6           18
             7         2      2         60       5           18
             7         2      3         30       4           18
             8         3      1         50       8           17
             8         3      2         80       9           17



            • #7
              Originally posted by George Ford

              Thank you George. I just shared my sample dataset above. Could you kindly share your further thoughts/comments/suggestions, please?



              • #8
                Why not shuffle the pollution variable?



                • #9
                  Code:
                  * Unrestricted permutation of pollution, 500 replications
                  ritest pollution _b[pollution] , r(500) : reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)
                  
                  * I think strata(month) permutes so that values are chosen only from the same month.
                  ritest pollution _b[pollution] , r(500) strata(month) : reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)
                  * Same, but using the t-statistic as the test statistic
                  ritest pollution _b[pollution]/_se[pollution] , r(500) strata(month) : reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)



                  • #10
                    If no id ever changes county, the county effects are already captured by the id effects, so you can drop countyid from absorb() (reghdfe tells you when a fixed effect is redundant).
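
Before dropping countyid, it may be worth confirming that nobody moves. The check below is a minimal sketch that uses only the id, countyid and month columns of the sample data posted in #6; the variable name moved is made up for this illustration.

```stata
clear
input id countyid month
1 1 1
1 1 2
2 1 1
2 1 2
2 1 3
3 2 1
3 2 2
3 2 3
4 2 1
4 2 2
5 3 1
5 3 2
6 4 1
6 4 2
6 4 3
7 2 1
7 2 2
7 2 3
8 3 1
8 3 2
end

* Within each id, compare the smallest and largest county code
bysort id (countyid): generate byte moved = countyid[1] != countyid[_N]
quietly count if moved
display "Observations belonging to movers: " r(N)

* If that count is 0, county effects are nested in the id effects, e.g.
* reghdfe health pollution control, absorb(id month) vce(cluster countyid)
```

If any individual did move, countyid would no longer be redundant and should stay in absorb().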



                    • #11
                      Originally posted by George Ford
                      Hi George, thank you very much. How do you make sense of strata and cluster in this case? Should county be cluster or strata?



                      • #12
                        If you care about the time-series properties, then use month as the strata variable.

                        Strata: Values are only permuted within each stratum (month in your case). This preserves the time structure of your data.
                        Cluster: The entire cluster is permuted as a unit, which would keep all observations for an individual/group together.
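
To make the strata idea concrete, here is a by-hand sketch of a single within-month shuffle of pollution across counties, using the sample data from #6. It illustrates the description above and is not a reproduction of what ritest does internally. The names pollution_perm, k, u, j and key are made up for this example, and it assumes pollution is constant within each county-month, which holds in the sample rows.

```stata
clear
input id countyid month pollution health age
1 1 1 100 9 20
1 1 2 80 8 20
2 1 1 100 8 22
2 1 2 80 7 22
2 1 3 70 5 22
3 2 1 50 6 19
3 2 2 60 8 19
3 2 3 30 7 19
4 2 1 50 6 18
4 2 2 60 9 18
5 3 1 50 8 19
5 3 2 80 8 19
6 4 1 60 9 16
6 4 2 60 8 16
6 4 3 70 6 16
7 2 1 50 6 18
7 2 2 60 5 18
7 2 3 30 4 18
8 3 1 50 8 17
8 3 2 80 9 17
end

set seed 12345                      // arbitrary seed, for reproducibility only

preserve
* Collapse to one row per county-month
keep countyid month pollution
duplicates drop
isid countyid month                 // fails if pollution varies within a cell

* Shuffle pollution across counties, separately within each month
sort month countyid
generate long k = _n                // position in county order
generate double u = runiform()
sort month u
generate long j = _n                // position in random order, same month block
sort k
generate pollution_perm = pollution[j]

keep countyid month pollution_perm
tempfile key
save `key'
restore

* Every individual receives the permuted pollution of their county-month
merge m:1 countyid month using `key', nogenerate
```

Each county keeps its residents and its months, but in any given month it is assigned the pollution level of a randomly chosen county observed in that month. Wrapping the shuffle and the regression on pollution_perm in a loop, and collecting the coefficients, would give a permutation distribution to compare with the original estimate.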
