Dear Statalist,
i have quite a large dataset of administrative data (about 800.000 observations) and need to assign a random state to each observation.
Every state has an original state assigned to it, but I want to anonymize the data in such a way that one cannot deduce to the actual state of the observation.
This means, as it is German data, that obs x is randomly assigned e.g. 1 for Schleswig-Holstein, 2 for Hamburg and so forth.
Problem: The assignment has to be correct with respect to east/west Germany (states 1-11 are west, 12-16 are east) AND I want to reproduce the actual distribution of states.
The distribution of original states looks like that
state || Freq.
------------+-----------------------------------
1 | 34,178
2 | 13,142
3 | 74,775
4 | 7,298
5 | 195,622
6 | 59,847
7 | 55,253
8 | 161,160
10 | 8,843
11 | 19,395
12 | 19,766
13 | 18,860
14 | 45,521
15 | 25,814
16 | 29,925
------------+-----------------------------------
Total | 769,399
I have written some code that more or less does the job but I want a nicer and more generalizable solution.
The code I wrote was as follows:
Variables: Regio_U is the region (1 for west, 2 for east), state is the original state assignment variable
set seed 123
gen double zufall_west = uniform() if state !=. & Regio_U == 1
sort zufall_west
gen count = _n if zufall_west !=.
gen double zufall_ost = uniform() if state !=. & Regio_U == 2
sort zufall_ost
gen count_ost = _n if zufall_ost !=.
sort count
gen Land_U = 1 if count >=1 & count <=34178
replace Land_U = 2 if count > 34178 & count <=47320
replace Land_U = 3 if count > 47320 & count <=122095
replace Land_U = 4 if count > 122095 & count <=129393
replace Land_U = 5 if count > 129393 & count <=325015
replace Land_U = 6 if count > 325015 & count <=384862
replace Land_U = 7 if count > 384862 & count <=437992
replace Land_U = 8 if count > 437992 & count <=598634
replace Land_U = 10 if count > 598634 & count <=607389
replace Land_U = 11 if count > 607389 & count <=626692
sort count_ost
replace Land_U = 12 if count_ost>=1 & count_ost<=19395
replace Land_U = 13 if count_ost>19395 & count_ost<=39102
replace Land_U = 14 if count_ost>39102 & count_ost<=84303
replace Land_U = 15 if count_ost>84303 & count_ost<=110037
replace Land_U = 16 if count_ost>110037 & count_ost<=140000
This means that i did the random assignment by randomly ordering the observations.
The big problem I have is that I did not find out how to randomly assign the observations according to the distribution in the original state variable.
I hope I could make the problem clear.
Thank you very much for your help in advance
EDIT:
I forgot one thing: As it is a Panel, the state assignment should stay the same over the years for one the same Observation (if possible)
i have quite a large dataset of administrative data (about 800.000 observations) and need to assign a random state to each observation.
Every state has an original state assigned to it, but I want to anonymize the data in such a way that one cannot deduce to the actual state of the observation.
This means, as it is German data, that obs x is randomly assigned e.g. 1 for Schleswig-Holstein, 2 for Hamburg and so forth.
Problem: The assignment has to be correct with respect to east/west Germany (states 1-11 are west, 12-16 are east) AND I want to reproduce the actual distribution of states.
The distribution of original states looks like that
state || Freq.
------------+-----------------------------------
1 | 34,178
2 | 13,142
3 | 74,775
4 | 7,298
5 | 195,622
6 | 59,847
7 | 55,253
8 | 161,160
10 | 8,843
11 | 19,395
12 | 19,766
13 | 18,860
14 | 45,521
15 | 25,814
16 | 29,925
------------+-----------------------------------
Total | 769,399
I have written some code that more or less does the job but I want a nicer and more generalizable solution.
The code I wrote was as follows:
Variables: Regio_U is the region (1 for west, 2 for east), state is the original state assignment variable
set seed 123
gen double zufall_west = uniform() if state !=. & Regio_U == 1
sort zufall_west
gen count = _n if zufall_west !=.
gen double zufall_ost = uniform() if state !=. & Regio_U == 2
sort zufall_ost
gen count_ost = _n if zufall_ost !=.
sort count
gen Land_U = 1 if count >=1 & count <=34178
replace Land_U = 2 if count > 34178 & count <=47320
replace Land_U = 3 if count > 47320 & count <=122095
replace Land_U = 4 if count > 122095 & count <=129393
replace Land_U = 5 if count > 129393 & count <=325015
replace Land_U = 6 if count > 325015 & count <=384862
replace Land_U = 7 if count > 384862 & count <=437992
replace Land_U = 8 if count > 437992 & count <=598634
replace Land_U = 10 if count > 598634 & count <=607389
replace Land_U = 11 if count > 607389 & count <=626692
sort count_ost
replace Land_U = 12 if count_ost>=1 & count_ost<=19395
replace Land_U = 13 if count_ost>19395 & count_ost<=39102
replace Land_U = 14 if count_ost>39102 & count_ost<=84303
replace Land_U = 15 if count_ost>84303 & count_ost<=110037
replace Land_U = 16 if count_ost>110037 & count_ost<=140000
This means that i did the random assignment by randomly ordering the observations.
The big problem I have is that I did not find out how to randomly assign the observations according to the distribution in the original state variable.
I hope I could make the problem clear.
Thank you very much for your help in advance
EDIT:
I forgot one thing: As it is a Panel, the state assignment should stay the same over the years for one the same Observation (if possible)
Comment