How to randomly assign values depending on distribution within another variable

Philipp Schrauth

Join Date: Feb 2016

Posts: 31
#1

How to randomly assign values depending on distribution within another variable

10 Feb 2016, 01:25

Dear Statalist,

i have quite a large dataset of administrative data (about 800.000 observations) and need to assign a random state to each observation.
Every state has an original state assigned to it, but I want to anonymize the data in such a way that one cannot deduce to the actual state of the observation.
This means, as it is German data, that obs x is randomly assigned e.g. 1 for Schleswig-Holstein, 2 for Hamburg and so forth.
Problem: The assignment has to be correct with respect to east/west Germany (states 1-11 are west, 12-16 are east) AND I want to reproduce the actual distribution of states.

The distribution of original states looks like that

state || Freq.
------------+-----------------------------------
1 | 34,178
2 | 13,142
3 | 74,775
4 | 7,298
5 | 195,622
6 | 59,847
7 | 55,253
8 | 161,160
10 | 8,843
11 | 19,395
12 | 19,766
13 | 18,860
14 | 45,521
15 | 25,814
16 | 29,925
------------+-----------------------------------
Total | 769,399

I have written some code that more or less does the job but I want a nicer and more generalizable solution.
The code I wrote was as follows:
Variables: Regio_U is the region (1 for west, 2 for east), state is the original state assignment variable

set seed 123

gen double zufall_west = uniform() if state !=. & Regio_U == 1
sort zufall_west
gen count = _n if zufall_west !=.

gen double zufall_ost = uniform() if state !=. & Regio_U == 2
sort zufall_ost
gen count_ost = _n if zufall_ost !=.

sort count
gen Land_U = 1 if count >=1 & count <=34178
replace Land_U = 2 if count > 34178 & count <=47320
replace Land_U = 3 if count > 47320 & count <=122095
replace Land_U = 4 if count > 122095 & count <=129393
replace Land_U = 5 if count > 129393 & count <=325015
replace Land_U = 6 if count > 325015 & count <=384862
replace Land_U = 7 if count > 384862 & count <=437992
replace Land_U = 8 if count > 437992 & count <=598634
replace Land_U = 10 if count > 598634 & count <=607389
replace Land_U = 11 if count > 607389 & count <=626692

sort count_ost
replace Land_U = 12 if count_ost>=1 & count_ost<=19395
replace Land_U = 13 if count_ost>19395 & count_ost<=39102
replace Land_U = 14 if count_ost>39102 & count_ost<=84303
replace Land_U = 15 if count_ost>84303 & count_ost<=110037
replace Land_U = 16 if count_ost>110037 & count_ost<=140000

This means that i did the random assignment by randomly ordering the observations.
The big problem I have is that I did not find out how to randomly assign the observations according to the distribution in the original state variable.

I hope I could make the problem clear.

Thank you very much for your help in advance

EDIT:
I forgot one thing: As it is a Panel, the state assignment should stay the same over the years for one the same Observation (if possible)

Last edited by Philipp Schrauth; 10 Feb 2016, 02:02.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30003

10 Feb 2016, 11:12

So, I think this does it:

Code:

// CREATE A CROSSWALK BETWEEN REAL STATE # 
// AND A NEWLY CREATED RANDOMIZED STATE #
clear
set obs 16
gen real_state = _n
set seed 1234 // OR ANY OTHER INTEGER YOU LIKE
gen shuffle = runiform()
gen byte east = (real_state >= 12)
by east (shuffle), sort: gen randomized_state = _n
replace randomized_state = 11 + randomized_state if east
sort real_state
keep real_state randomized_state
list, noobs clean
save state_crosswalk, replace

// NOW APPLY IT TO YOUR DATA
use my_data, clear
rename state real_state
merge m:1 real_state using state_crosswalk, assert(match) nogenerate
order randomized_state, before(real_state)
drop real_state
save my_data_with_randomized_state, replace

And you can keep the state_crosswalk.dta file around so that you can refer to it in the future if you ever have to unscramble the state variable.

Comment

Philipp Schrauth

Join Date: Feb 2016

Posts: 31
#3

11 Feb 2016, 02:06

Thank you very much for that fast answer.

There remains a problem though, as I probably did not express my question properly.

Your solution results in a random state assignment, which always assigns a random state to a fixed real state.
Thus, all observations with real state being 1 now might have a random state 4 assigned to them.

The Problem is: thereby one can still deduce which real state lies behind it by looking at frequency tables.

To make my question a little more clear:
I have several businesses (with a business number) over a 15 years period.
Now, business one lies in state 2 but I want a random state assigned to it, let's say 5 (over all the years).
Business two also lies in state 2, but I would like it to be in random state 8.
In the end, I want every business (depending on east and west) to have a random state assigned to, but frequencies of appearances should be similar to the frequencies in the table stated above.
Thus, there are still about 34K Businesses with a random state 1 assigned to them (even some, which might have been in one before) and about 13K in state 2 and so on.

For now, I changed my former code a little, resulting in the following:

Code:

gen double randomstate_west = uniform() if realstate !=. & west== 1 sort randomstate_west gen Land_B = . replace Land_B = 1 in 1/35139 replace Land_B = 2 in 35140/45593 replace Land_B = 3 in 45594/122928 replace Land_B = 4 in 122929/129740 replace Land_B = 5 in 129741/329208 replace Land_B = 6 in 329209/388916 replace Land_B = 7 in 388916/448449 replace Land_B = 8 in 448450/612351 replace Land_B = 10 in 612352/622340 replace Land_B = 11 in 622341/639066 gen double randomstate_east = uniform() if realstate !=. & east == 1 sort randomstate_east replace Land_B = 12 in 1/25482 replace Land_B = 13 in 25483/48125 replace Land_B = 14 in 48126/101089 replace Land_B = 15 in 101090/131994 replace Land_B = 16 in 131995/166783

Not nice though.

Last edited by Philipp Schrauth; 11 Feb 2016, 02:17.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30003

11 Feb 2016, 08:29

I'm not sure I understand, but perhaps this does it. In particular, I don't know if the table of frequencies in #1 is frequencies states of businesses or frequencies of states of business-year observations. What I'm writing below will reproduce the distribution of states of businesses.

Code:

// CREATE A FILE OF STATES & BUSINESSES
use my_data, clear
keep business state
duplicates drop
gen byte east = (state >= 12)
sort east state business

// MAKE TEMPFILE OF STATES IN SORTED ORDER
// EACH OCCURING WITH SAME FREQUENCY AS IN my_data
preserve
tempfile states
keep state
rename state randomized_state
save `states'
restore

// NOW SHUFFLE THE ORDER OF THE BUSINESSES
// MAINTAINING EAST WEST DISTINCTION
set seed 1234 // OR OTHER SEED OF OUR CHOICE
gen shuffle = runiform()
sort east shuffle
keep business

// NOW MERGE BACK WITH THE ORIGINAL ORDER OF THE STATES
merge 1:1 _n using `states', assert(match) nogenerate

// AND NOW MERGE THIS BACK WITH THE ORIGINAL DATA
merge 1:m business using my_data, assert(match) nogenerate

Note: untested, but I'm pretty confident this is right.

Comment

Andres Santos

Join Date: Jun 2016

Posts: 4
#5

27 Jun 2016, 01:19

Hello,

I stumbled onto this thread since I have a similar problem with a database that needs to be put into shape for research. I'm hoping you could help me out, Clyde, since I'm in need of something a bit more specific.

I understand that the code you wrote for Phillipp works when you're working within the same database, but what happens when you want to apply the distribution of states of one subset of data to another subset of data (which doesn't have to have the same number of observations)? I have a number of databases that I need to put together, but since they come from different sources there are variables that are missing in some databases but not in the others. I'd like to "guess" the values contained in these (categorical) variables using the distribution of said variable in those databases where it exists, using other variables in common so as to minimize error while assigning values discriminately.

I'm specifically having trouble with this case: all the values for the variable marital_status are missing in database A, but not in databases B and C. Also, all three databases have the variables age, gender, economic_dependents, and school_years. What I'm trying to do is to join the three databases with append and divide economic_dependents, age and school_years into quintiles, so as to create subgroups within subgroups (with the different possible ages, school years, gender, etc.) and then extract the distribution of marital_status (which is comprised of five states) from the databases B and C. My objective is to then assign values for marital_status for database A, using the subgroups, according to the distribution of marital_status given gender and age, school years and economic_dependents quintile groups.

I hope I could make myself clear, and thanks in advance.

Andrés
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3436
#6

27 Jun 2016, 01:34

The way to deal with missing values that is similar, but has better statistical properties is called multiple imputation, and is implemented in Stata as the mi suit of commands. To get started see help mi

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Comment
Andres Santos

Join Date: Jun 2016

Posts: 4
#7

27 Jun 2016, 10:29

Maarten, thank you very much! I've been reading through Stata's material on the subject and have one question only concerning the registering of variables. As I understood it, registering variables as passive implies that these variables are correlated with the imputed variable and as such has to be taken into account, while registering (or not registering at all) variables as regular implies that said variable isn't correlated with the variable. In other words, could I just create a dataset by dropping all variables I think are not correlated to the imputed variable and then registering the remaining variables as passive?
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4447
#8

27 Jun 2016, 11:26

Andres, here is what the help file says about "passive":

Passive variables are algebraic combinations of imputed, regular, or other passive variables.

an example might be:

Code:

mi passive: egen total=rowtotal(v1-v5)
Comment
Andres Santos

Join Date: Jun 2016

Posts: 4
#9

28 Jun 2016, 22:41

Thank you all very much. I've been reading about and working with this method of filling in missing values and it served my needs perfectly. Thanks for all the help!
Comment

Announcement