random assignment of birthplace

River Huang

Join Date: Mar 2016

Posts: 1908
#1

06 May 2018, 01:32

Dear All, I found this question here http://bbs.pinggu.org/forum.php?mod=...=1#pid50869982. The purpose is to randomly assign a new place of birth, drawn from the original values of `placeofborn'.

Code:

clear
input str7 name int years str9 placeofborn
""        1998 ""
"John"    1998 "Nashville"
"Brian"   1998 "Nork York"
"Mary"    1998 ""
"Charles" 1998 ""
""        1999 ""
"Susan"   1999 "San Diego"
"Brian"   1999 "Nork York"
"Frank"   1999 "Chicago"
end

Note that:
Observations without a name (e.g., the first and sixth observations) should be left untouched.

All other observations must be (randomly) assigned a new place of birth (possibly the same as the original one), whether or not the original place of birth (`placeofborn') is recorded.

Any suggestions are appreciated. Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

06 May 2018, 09:03

The question, as posed here, does not tell us enough to provide a solution. Why isn't the following code a solution - it meets the few constraints posed:

Code:

replace placeofborn = "New York" if name != ""

Other questions.
Does "name" uniquely identify individuals - is the "Brian" in observations 3 and 8 the same individual, so the place of birth must be the same?

If so, is there an ID code in the data used to distinguish distinct individuals, rather than the name, which may be shared by two or more individuals?

Must every occurrence of "Chicago" - regardless of the name it is associated with - be replaced with the same city?

Must the city names be chosen from among the cities present in the data, including "missing"?

Must the city names be real cities, or can they be invented - City001, City002, ... for example?

But the fundamental question is what is the objective that the user wants to accomplish? The answer very much depends on that.
River Huang

Join Date: Mar 2016

Posts: 1908
#3

06 May 2018, 17:38

Hi, William, The point is to "randomly" assign a new place of birth to everyone with a name. My understanding is as follows.
Does "name" uniquely identify individuals - is the "Brian" in observations 3 and 8 the same individual, so the place of birth must be the same? Let's assume this is true in the data, so that the place of birth must be the same.

If so, is there an ID code in the data used to distinguish distinct individuals, rather than the name, which may be shared by two or more individuals? So far, only the name can be used to distinguish distinct individuals.

Must every occurrence of "Chicago" - regardless of the name it is associated with - be replaced with the same city? I don't understand this question.

Must the city names be chosen from among the cities present in the data, including "missing"? Yes, please choose the names from the list of city names in the data, but exclude the "missing" one.

Must the city names be real cities, or can they be invented - City001, City002, ... for example? Answered in the point above.

But the fundamental question is what is the objective that the user wants to accomplish? The answer very much depends on that. I don't really know at this point.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

06 May 2018, 18:27

With regard to bullet 3, if both Frank and George have placeofborn=="Chicago", must they both be assigned to the same new placeofborn?

Another way of thinking about this is, are we trying to preserve "birthplace effects" but just relabel them? Because if Frank and George have different birthplaces where they once had the same birthplace, then the estimate of birthplace effects will be changed.

This leads to another question. Here's the short form of the question: what distribution is used for the random assignment of birthplaces to individuals?

I can think of three types of random assignment. Suppose we have a list of 100 individuals, and among them there are 30 distinct birthplaces (one of which is ""), and one of those birthplaces is Chicago, which is the birthplace of 10 individuals.
1. For each individual we have a card with their birthplace. So we have a pack of 100 cards with the birthplaces, and we shuffle the cards, and assign the birthplace on the top card to the first individual, then the birthplace on the second card to the second individual, and so on.

2. For each individual we have a card with their birthplace. So we have a pack of 100 cards with their birthplaces, and we shuffle the cards, and assign the birthplace on the top card to the first individual, then return it to the pack, reshuffle, then assign the birthplace on the card that is now on top to the second individual, and so on.

3. We have a pack of 30 cards, where each distinct birthplace appears on a single card. We now follow the procedure in the previous bullet to assign birthplaces.

In the first option, the probability distribution of birthplaces is unchanged: since there were 10 individuals from Chicago originally, exactly 10 of the 100 individuals will be assigned a new birthplace of Chicago.

In the third option, each individual has the same probability of being assigned each of the cities as their new birthplace: the probability of being assigned Chicago is 1/30, so we expect 3.33 individuals to be assigned a new birthplace of Chicago.

In the second option, each individual will have the same probability distribution for the city they are assigned as their new birthplace: the probability of being assigned Chicago is 10/100, so we expect 10 individuals to be assigned a new birthplace of Chicago. But unlike the first option, it could be more or it could be less.
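The three options can be compared side by side in a small sketch. The data are hypothetical (100 people, 30 distinct birthplaces, 10 of them from Chicago), and the variable names id, place, u, r, code, pick and opt1-opt3 are invented for illustration; none of this comes from the original data.

```stata
// Hypothetical data: 100 people, 30 distinct birthplaces, 10 from Chicago
clear
set seed 12345                       // arbitrary seed, for reproducibility only
set obs 100
generate long id = _n
generate str10 place = cond(_n <= 10, "Chicago", "City" + string(mod(_n, 29) + 1))

// Option 1: shuffle the 100 cards and deal them out (no replacement)
generate double u = runiform()
sort u
generate long r = _n                 // random rank, a permutation of 1..100
sort id
generate opt1 = place[r]

// Option 2: draw one of the 100 cards for each person, with replacement
generate opt2 = place[runiformint(1, _N)]

// Option 3: draw one of the 30 distinct birthplaces, each equally likely
encode place, generate(code)         // codes 1..30, value label named "code"
summarize code, meanonly
generate long pick = runiformint(1, r(max))
label values pick code
decode pick, generate(opt3)

// Option 1 always keeps 10 Chicago births; options 2 and 3 vary with the seed
count if opt1 == "Chicago"
count if opt2 == "Chicago"
count if opt3 == "Chicago"
```

Only the first count is fixed by construction; the other two depend on the random draws.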

But then, if you tell me that, with regard to bullet 3, both Frank and George must have the same new birthplace, then the problem reduces to something like the following.
We have a list of 30 distinct birthplaces, and a pack of 30 cards, where each distinct birthplace appears on a single card. We shuffle the pack, and assign the birthplace on the first card to every individual whose birthplace was the first on the list; the second card to the second on the list, etc.
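A minimal sketch of this relabelling, assuming a tiny hypothetical dataset in which Frank and George share a birthplace. The names u, r, newplace and the tempfile map are invented for illustration.

```stata
// Hypothetical data: Frank and George share a birthplace
clear
input str7 name str9 placeofborn
"John"   "Nashville"
"Frank"  "Chicago"
"George" "Chicago"
"Susan"  "San Diego"
end
set seed 2018                        // arbitrary seed, for reproducibility only

// Build a one-to-one map from each distinct birthplace to a shuffled one
preserve
keep placeofborn
duplicates drop
generate double u = runiform()
sort u
generate long r = _n                 // random permutation of 1.._N
sort placeofborn
generate newplace = placeofborn[r]
keep placeofborn newplace
tempfile map
save `map'
restore

// Everyone with the same old birthplace gets the same new birthplace
merge m:1 placeofborn using `map', nogenerate
list, noobs
```

Because the map is a permutation of the distinct birthplaces, people who shared a birthplace still share one afterwards, and people who differed still differ.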

This leads back to my original question: why can we not just assign everyone the same new birthplace? If that's not acceptable, what is required to be acceptable? How is the random distribution to be generated?

And the answers to these questions depend on the use to which the data is to be put. Without that information, no proposed "random assignment" can be evaluated as acceptable or unacceptable. The question is unanswerable as it now stands.
River Huang

Join Date: Mar 2016

Posts: 1908
#5

06 May 2018, 18:57

Dear William, With regard to bullet 3, if both Frank and George have placeofborn=="Chicago", must they both be assigned to the same new placeofborn? Not necessarily.

After your explanation, I find that this question is more complicated than I thought. However, I think that the person who raised this question wants the following.
First, delete the observations without a name.

Given a list of "distinct" places of birth (placeofborn, not including ""), use the third option to randomly assign a birth place to each person/name, with or without a placeofborn.

No other relationship needs to be addressed at this moment.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

William Lisowski

Join Date: Dec 2014
Posts: 10150
#6

06 May 2018, 19:42

Code:

clear
input str7 name int years str9 placeofborn
""        1998 ""        
"John"    1998 "Nashville"
"Brian"   1998 "Nork York"
"Mary"    1998 ""        
"Charles" 1998 ""        
""        1999 ""        
"Susan"   1999 "San Diego"
"Brian"   1999 "Nork York"
"Frank"   1999 "Chicago"  
end
tempfile master
save `master'
// confirm each name has the same placeofborn in each year
bysort name (years): assert placeofborn==placeofborn[1]
// create a list of distinct values of placeofborn
drop name years
drop if missing(placeofborn)
duplicates drop placeofborn, force
sort placeofborn
generate newid = _n
rename placeofborn newplace
list, noobs
tempfile new
save `new'
local maxid = newid[_N]
// assign a random newid to each individual
use `master', clear
set seed 42
bysort name (years): generate newid = runiformint(1,`maxid') if !missing(name) & _n ==1
bysort name (years): replace newid = newid[1]
// merge on the newplace values
merge m:1 newid using `new'
drop if _merge==2
sort name years
list, noobs sepby(name)

Code:

. // confirm each name has the same placeofborn in each year
. bysort name (years): assert placeofborn==placeofborn[1]

. // create a list of distinct values of placeofborn
. drop name years

. drop if missing(placeofborn)
(4 observations deleted)

. duplicates drop placeofborn, force

Duplicates in terms of placeofborn

(1 observation deleted)

. sort placeofborn

. generate newid = _n

. rename placeofborn newplace

. list, noobs

  +-------------------+
  |  newplace   newid |
  |-------------------|
  |   Chicago       1 |
  | Nashville       2 |
  | Nork York       3 |
  | San Diego       4 |
  +-------------------+

. tempfile new

. save `new'
file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_20547.000002 saved

. local maxid = newid[_N]

. // assign a random newid to each individual
. use `master', clear

. set seed 42

. bysort name (years): generate newid = runiformint(1,`maxid') if !missing(name) & _n ==1
(3 missing values generated)

. bysort name (years): replace newid = newid[1]
(1 real change made)

. // merge on the newplace values
. merge m:1 newid using `new'

    Result                           # of obs.
    -----------------------------------------
    not matched                             3
        from master                         2  (_merge==1)
        from using                          1  (_merge==2)

    matched                                 7  (_merge==3)
    -----------------------------------------

. drop if _merge==2
(1 observation deleted)

. sort name years

. list, noobs sepby(name)

  +-------------------------------------------------------------------+
  |    name   years   placeof~n   newid    newplace            _merge |
  |-------------------------------------------------------------------|
  |            1998                   .               master only (1) |
  |            1999                   .               master only (1) |
  |-------------------------------------------------------------------|
  |   Brian    1998   Nork York       4   San Diego       matched (3) |
  |   Brian    1999   Nork York       4   San Diego       matched (3) |
  |-------------------------------------------------------------------|
  | Charles    1998                   3   Nork York       matched (3) |
  |-------------------------------------------------------------------|
  |   Frank    1999     Chicago       4   San Diego       matched (3) |
  |-------------------------------------------------------------------|
  |    John    1998   Nashville       1     Chicago       matched (3) |
  |-------------------------------------------------------------------|
  |    Mary    1998                   4   San Diego       matched (3) |
  |-------------------------------------------------------------------|
  |   Susan    1999   San Diego       1     Chicago       matched (3) |
  +-------------------------------------------------------------------+


River Huang

Join Date: Mar 2016

Posts: 1908
#7

06 May 2018, 20:23

Dear William, Many thanks for your time and effort.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Romalpa Akzo

Join Date: Oct 2017
Posts: 369
#8

07 May 2018, 00:41

encode and decode would provide a concise solution.

Code:

encode placeofborn, gen(_place)
sum _place, meanonly
replace _place = runiformint(1,`r(max)') if !missing(name)
decode _place, gen(newplace)
drop _place


River Huang

Join Date: Mar 2016

Posts: 1908
#9

07 May 2018, 03:00

Hi, Romalpa, Many thanks for this interesting/concise suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

07 May 2018, 05:24

Two comments about the code I presented in post #6.

Code:

set seed 42

is included to ensure that the results are replicable from run to run. For more about this, see the output of help seed.

Code:

bysort name (years): generate newid = runiformint(1,`maxid') if !missing(name) & _n ==1
bysort name (years): replace newid = newid[1]

ensures that every observation for a given name is assigned the same value for newplace.
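Both points can be checked on a small hypothetical dataset. In the sketch below, the four rows, the upper bound 4 and the variable newid2 are assumptions made for illustration: the draw is repeated after resetting the seed, and the two results are compared.

```stata
// Hypothetical data: Brian appears twice, one row has no name
clear
input str7 name int years
"Brian" 1998
"Brian" 1999
"Mary"  1998
""      1998
end

set seed 42
bysort name (years): generate newid = runiformint(1, 4) if !missing(name) & _n == 1
bysort name (years): replace newid = newid[1]

// Resetting the seed and repeating the draw gives the same values
set seed 42
bysort name (years): generate newid2 = runiformint(1, 4) if !missing(name) & _n == 1
bysort name (years): replace newid2 = newid2[1]
assert newid == newid2

// Every observation for a given name carries the same newid
bysort name: assert newid == newid[1]
```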
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#11

07 May 2018, 06:19

William's comments make sense if only one newplace is expected per person (with a unique name). In that case, an additional line in my suggestion in #8 might help.

Code:

encode placeofborn, gen(_place)
sum _place, meanonly
replace _place = runiformint(1,`r(max)') if !missing(name)
bys name: replace _place = _place[1]
decode _place, gen(newplace)
drop _place
Light Ma

Join Date: May 2018

Posts: 3
#12

07 May 2018, 08:15

Hi, Romalpa, Thank you for your advice. However, there is still a problem: the newly generated variable "_place" has many missing values that should not be there. That is to say, some people had a "placeofborn", but now they don't.
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#13

07 May 2018, 09:11

encode is working properly: it generates the numeric variable "_place", which is missing only when the corresponding (string) placeofborn is missing.

I guess the issue you describe might be that some placeofborn values are not truly missing (i.e., empty strings) but instead consist of one or several spaces. If my guess is correct, running this line before my suggestion would solve it.

Code:

replace placeofborn =trim(placeofborn)
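Whether such whitespace-only values are present can be checked before trimming. The sketch below uses three hypothetical rows and strtrim(), which, as far as I know, is the name recent Stata documentation uses for trim().

```stata
// Hypothetical rows: a real value, a single space, and an empty string
clear
input str9 placeofborn
"Chicago"
" "
""
end

// Values that look empty but are not the empty string
count if placeofborn != "" & strtrim(placeofborn) == ""

// Trim, then confirm that no whitespace-only values remain
replace placeofborn = strtrim(placeofborn)
count if placeofborn != "" & strtrim(placeofborn) == ""
assert r(N) == 0
```

This only handles ordinary spaces; tabs or other invisible characters would need separate treatment.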
Light Ma

Join Date: May 2018

Posts: 3
#14

07 May 2018, 11:03

Yep, you are right, thanks a lot. Another question: I used the code you gave on another variable, "gt", which is an integer variable coded 0-1, but it didn't work. How can I solve this new problem? Here is my code; would you please correct it? Thank you again.
tostring gt,gen(gt_1)
encode gt_1, gen(gt_2)
sum gt_2, meanonly
replace gt_2 = runiformint(1,`r(max)') if !missing(gt)
decode gt_2, gen(gt_3)
drop gt_3
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#15

07 May 2018, 17:01

encode and decode are required for an (original) string variable. For a numeric variable in [0,1] like your gt, there is no need for such a roundabout route.

Code:

gen gt_new=runiform() if !missing(name)
bys name: replace gt_new=gt_new[1]

Last edited by Romalpa Akzo; 07 May 2018, 17:05.
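Note that runiform() returns continuous values strictly between 0 and 1. If gt is meant to stay an integer taking only the values 0 and 1, one possible variant, offered as an assumption about the intent rather than as the poster's method, draws from runiformint(0, 1) instead. The four rows below are hypothetical.

```stata
// Hypothetical data: gt is a 0/1 indicator, one row has no name
clear
input str7 name byte gt
"Brian" 1
"Brian" 1
"Mary"  0
""      .
end
set seed 42                          // arbitrary seed, for reproducibility only

// One integer draw (0 or 1) per name, copied to all rows of that name
bysort name: generate byte gt_new = runiformint(0, 1) if !missing(name) & _n == 1
bysort name: replace gt_new = gt_new[1]

// Named rows get 0 or 1; unnamed rows stay missing
assert inlist(gt_new, 0, 1) if !missing(name)
assert missing(gt_new) if missing(name)
```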