Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to randomly assign values depending on distribution within another variable

    Dear Statalist,

    i have quite a large dataset of administrative data (about 800.000 observations) and need to assign a random state to each observation.
    Every state has an original state assigned to it, but I want to anonymize the data in such a way that one cannot deduce to the actual state of the observation.
    This means, as it is German data, that obs x is randomly assigned e.g. 1 for Schleswig-Holstein, 2 for Hamburg and so forth.
    Problem: The assignment has to be correct with respect to east/west Germany (states 1-11 are west, 12-16 are east) AND I want to reproduce the actual distribution of states.

    The distribution of original states looks like that


    state || Freq.
    ------------+-----------------------------------
    1 | 34,178
    2 | 13,142
    3 | 74,775
    4 | 7,298
    5 | 195,622
    6 | 59,847
    7 | 55,253
    8 | 161,160
    10 | 8,843
    11 | 19,395
    12 | 19,766
    13 | 18,860
    14 | 45,521
    15 | 25,814
    16 | 29,925
    ------------+-----------------------------------
    Total | 769,399


    I have written some code that more or less does the job but I want a nicer and more generalizable solution.
    The code I wrote was as follows:
    Variables: Regio_U is the region (1 for west, 2 for east), state is the original state assignment variable

    set seed 123

    gen double zufall_west = uniform() if state !=. & Regio_U == 1
    sort zufall_west
    gen count = _n if zufall_west !=.

    gen double zufall_ost = uniform() if state !=. & Regio_U == 2
    sort zufall_ost
    gen count_ost = _n if zufall_ost !=.

    sort count
    gen Land_U = 1 if count >=1 & count <=34178
    replace Land_U = 2 if count > 34178 & count <=47320
    replace Land_U = 3 if count > 47320 & count <=122095
    replace Land_U = 4 if count > 122095 & count <=129393
    replace Land_U = 5 if count > 129393 & count <=325015
    replace Land_U = 6 if count > 325015 & count <=384862
    replace Land_U = 7 if count > 384862 & count <=437992
    replace Land_U = 8 if count > 437992 & count <=598634
    replace Land_U = 10 if count > 598634 & count <=607389
    replace Land_U = 11 if count > 607389 & count <=626692

    sort count_ost
    replace Land_U = 12 if count_ost>=1 & count_ost<=19395
    replace Land_U = 13 if count_ost>19395 & count_ost<=39102
    replace Land_U = 14 if count_ost>39102 & count_ost<=84303
    replace Land_U = 15 if count_ost>84303 & count_ost<=110037
    replace Land_U = 16 if count_ost>110037 & count_ost<=140000

    This means that i did the random assignment by randomly ordering the observations.
    The big problem I have is that I did not find out how to randomly assign the observations according to the distribution in the original state variable.

    I hope I could make the problem clear.

    Thank you very much for your help in advance


    EDIT:
    I forgot one thing: As it is a Panel, the state assignment should stay the same over the years for one the same Observation (if possible)
    Last edited by Philipp Schrauth; 10 Feb 2016, 02:02.

  • #2
    So, I think this does it:

    Code:
    // CREATE A CROSSWALK BETWEEN REAL STATE # 
    // AND A NEWLY CREATED RANDOMIZED STATE #
    clear
    set obs 16
    gen real_state = _n
    set seed 1234 // OR ANY OTHER INTEGER YOU LIKE
    gen shuffle = runiform()
    gen byte east = (real_state >= 12)
    by east (shuffle), sort: gen randomized_state = _n
    replace randomized_state = 11 + randomized_state if east
    sort real_state
    keep real_state randomized_state
    list, noobs clean
    save state_crosswalk, replace
    
    // NOW APPLY IT TO YOUR DATA
    use my_data, clear
    rename state real_state
    merge m:1 real_state using state_crosswalk, assert(match) nogenerate
    order randomized_state, before(real_state)
    drop real_state
    save my_data_with_randomized_state, replace
    And you can keep the state_crosswalk.dta file around so that you can refer to it in the future if you ever have to unscramble the state variable.

    Comment


    • #3
      Thank you very much for that fast answer.

      There remains a problem though, as I probably did not express my question properly.

      Your solution results in a random state assignment, which always assigns a random state to a fixed real state.
      Thus, all observations with real state being 1 now might have a random state 4 assigned to them.

      The Problem is: thereby one can still deduce which real state lies behind it by looking at frequency tables.

      To make my question a little more clear:
      I have several businesses (with a business number) over a 15 years period.
      Now, business one lies in state 2 but I want a random state assigned to it, let's say 5 (over all the years).
      Business two also lies in state 2, but I would like it to be in random state 8.
      In the end, I want every business (depending on east and west) to have a random state assigned to, but frequencies of appearances should be similar to the frequencies in the table stated above.
      Thus, there are still about 34K Businesses with a random state 1 assigned to them (even some, which might have been in one before) and about 13K in state 2 and so on.

      For now, I changed my former code a little, resulting in the following:

      Code:
      gen double randomstate_west = uniform() if realstate !=. & west== 1
      sort randomstate_west
      gen Land_B = .
      replace Land_B = 1 in 1/35139
      replace Land_B = 2 in 35140/45593
      replace Land_B = 3 in 45594/122928
      replace Land_B = 4 in 122929/129740
      replace Land_B = 5 in 129741/329208
      replace Land_B = 6 in 329209/388916
      replace Land_B = 7 in 388916/448449
      replace Land_B = 8 in 448450/612351
      replace Land_B = 10 in 612352/622340
      replace Land_B = 11 in 622341/639066
      
      gen double randomstate_east = uniform() if realstate !=. & east == 1
      sort randomstate_east
      replace Land_B = 12 in 1/25482
      replace Land_B = 13 in 25483/48125
      replace Land_B = 14 in 48126/101089
      replace Land_B = 15 in 101090/131994
      replace Land_B = 16 in 131995/166783

      Not nice though.
      Last edited by Philipp Schrauth; 11 Feb 2016, 02:17.

      Comment


      • #4
        I'm not sure I understand, but perhaps this does it. In particular, I don't know if the table of frequencies in #1 is frequencies states of businesses or frequencies of states of business-year observations. What I'm writing below will reproduce the distribution of states of businesses.

        Code:
        // CREATE A FILE OF STATES & BUSINESSES
        use my_data, clear
        keep business state
        duplicates drop
        gen byte east = (state >= 12)
        sort east state business
        
        // MAKE TEMPFILE OF STATES IN SORTED ORDER
        // EACH OCCURING WITH SAME FREQUENCY AS IN my_data
        preserve
        tempfile states
        keep state
        rename state randomized_state
        save `states'
        restore
        
        // NOW SHUFFLE THE ORDER OF THE BUSINESSES
        // MAINTAINING EAST WEST DISTINCTION
        set seed 1234 // OR OTHER SEED OF OUR CHOICE
        gen shuffle = runiform()
        sort east shuffle
        keep business
        
        // NOW MERGE BACK WITH THE ORIGINAL ORDER OF THE STATES
        merge 1:1 _n using `states', assert(match) nogenerate
        
        // AND NOW MERGE THIS BACK WITH THE ORIGINAL DATA
        merge 1:m business using my_data, assert(match) nogenerate
        Note: untested, but I'm pretty confident this is right.


        Comment


        • #5
          Hello,

          I stumbled onto this thread since I have a similar problem with a database that needs to be put into shape for research. I'm hoping you could help me out, Clyde, since I'm in need of something a bit more specific.

          I understand that the code you wrote for Phillipp works when you're working within the same database, but what happens when you want to apply the distribution of states of one subset of data to another subset of data (which doesn't have to have the same number of observations)? I have a number of databases that I need to put together, but since they come from different sources there are variables that are missing in some databases but not in the others. I'd like to "guess" the values contained in these (categorical) variables using the distribution of said variable in those databases where it exists, using other variables in common so as to minimize error while assigning values discriminately.

          I'm specifically having trouble with this case: all the values for the variable marital_status are missing in database A, but not in databases B and C. Also, all three databases have the variables age, gender, economic_dependents, and school_years. What I'm trying to do is to join the three databases with append and divide economic_dependents, age and school_years into quintiles, so as to create subgroups within subgroups (with the different possible ages, school years, gender, etc.) and then extract the distribution of marital_status (which is comprised of five states) from the databases B and C. My objective is to then assign values for marital_status for database A, using the subgroups, according to the distribution of marital_status given gender and age, school years and economic_dependents quintile groups.

          I hope I could make myself clear, and thanks in advance.

          Andrés

          Comment


          • #6
            The way to deal with missing values that is similar, but has better statistical properties is called multiple imputation, and is implemented in Stata as the mi suit of commands. To get started see help mi
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------

            Comment


            • #7
              Maarten, thank you very much! I've been reading through Stata's material on the subject and have one question only concerning the registering of variables. As I understood it, registering variables as passive implies that these variables are correlated with the imputed variable and as such has to be taken into account, while registering (or not registering at all) variables as regular implies that said variable isn't correlated with the variable. In other words, could I just create a dataset by dropping all variables I think are not correlated to the imputed variable and then registering the remaining variables as passive?

              Comment


              • #8
                Andres, here is what the help file says about "passive":
                Passive variables are algebraic combinations of imputed, regular, or other passive variables.
                an example might be:
                Code:
                mi passive: egen total=rowtotal(v1-v5)

                Comment


                • #9
                  Thank you all very much. I've been reading about and working with this method of filling in missing values and it served my needs perfectly. Thanks for all the help!

                  Comment

                  Working...
                  X