duplicate observations by group following a pattern

gustavoz

Join Date: Jan 2015

Posts: 3
#1

duplicate observations by group following a pattern

12 May 2015, 09:22

Hello,

I have a dataset of 1000 firms (id_firm) over 10 years (year). Each year, a firm produces a single 2-digit output (idmain2_out) using several 2-digit inputs (id2_in), so that each observation is described by id_firm year idmain2_out id2_in as is shown below:

Code:

id_firm year idmain2_out id2_in 1 1990 44 01 1 1990 44 02 1 1991 50 20 ... 2 1990 28 33 ... 3 1990 44 01 3 1990 44 06

It follows that every year, to produce each output (year idmain2_out), a group of inputs (id2_in) is used; and firms that produce an output use some of those inputs and some others not. For example, in 1990 idmain2_out={44} was produced using id2_in = {01, 02, 06}; firm 1 used id2_in = {01, 02} to produce idmain2_out={44}, while firm 3 used id2_in={02, 06} to do it.

By id_firm year idmain2_out, I want to create(expand) new observations so that it shows all the inputs that were used by all firms that produced that output that year, along with an indicator variable (use) which takes the value 1 if the firm used that input and 0 if the firm didn't use that input. The new dataset should look the following way

Code:

id_firm year idmain2_out id2_in use 1 1990 44 01 1 1 1990 44 02 1 1 1990 44 06 0 1 1991 50 20 1 ... 2 1990 28 33 1 ... 3 1990 44 01 1 3 1990 44 02 0 3 1990 44 06 1

I haven't figured out a way to do this so I appreciate your help very much. (I also want to know how to create appropriate lists of observations in the forum )

Best,
Gustavo
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30353

12 May 2015, 10:47

Code:

clear*

input id_firm year idmain2_out id2_in
1 1990 44 01
1 1990 44 02
1 1991 50 20
2 1990 28 33
3 1990 44 01
3 1990 44 06
end
format id2_in %02.0f

list, noobs clean

levelsof id2_in, local(inputs)
foreach inp of local inputs {
 gen byte use`inp' = (id2_in == `inp')
}
collapse (max) use*, by(id_firm year idmain2_out)
reshape long use, i(id_firm year idmain2_out) j(id2_in)
format id2_in %02.0f

list, noobs clean

Comment

Roberto Ferrer

Join Date: Apr 2014
Posts: 449

12 May 2015, 11:54

One way is

Code:

clear
set more off

input ///
id_firm year idmain2_out id2_in
1 1990 44 01
1 1990 44 02
1 1991 50 20
2 1990 28 33
3 1990 44 01
3 1990 44 06
end

gen used = 1

tempfile orig
save "`orig'"

list, sepby(id_firm)

*----- what you want -----

duplicates drop year idmain2_out id2_in, force
drop id_firm used

tempfile outin
save "`outin'"

// list all inputs used each year, for each output
sort year idmain2_out id2_in
list, sepby(year idmain2_out)

clear
use "`orig'"

// keep one obs per (firm year output)
bysort id_firm year : keep if _n == 1
drop id2_in used

// bring in inputs
joinby year idmain2_out using "`outin'"  

// bring in usage
merge 1:1 id_firm year id2_in using "`orig'", nogen
replace used = 0 if missing(used)

// final result
sort id_firm year id2_in
list, sepby(id_firm year)

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

Comment

gustavoz

Join Date: Jan 2015

Posts: 3
#4

12 May 2015, 12:33

Originally posted by Clyde Schechter View Post

Code:

clear* input id_firm year idmain2_out id2_in 1 1990 44 01 1 1990 44 02 1 1991 50 20 2 1990 28 33 3 1990 44 01 3 1990 44 06 end format id2_in %02.0f list, noobs clean levelsof id2_in, local(inputs) foreach inp of local inputs { gen byte use`inp' = (id2_in == `inp') } collapse (max) use*, by(id_firm year idmain2_out) reshape long use, i(id_firm year idmain2_out) j(id2_in) format id2_in %02.0f list, noobs clean

Thank you for your help Clyde. In fact, I wanted to reshape wide the database after adding the observations. You did it the other way round. However, I still need to remove the inputs that no firm used to produce the output in a particular year. As result of the code, the observation {id_firm, year, idmain2_out, id2_in, use} = {1,1990,44,20,0} should not exist or, alternatively, the variable use should display a missing value because in that year (1990) no firm used id2_in=20 to produce dimain2_out=44.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#5

12 May 2015, 13:18

So I think you can clean up the observations you don't want with this code immediately after what I showed you above.

Code:

egen ever_used = max(use), by(id2_in year idmain2_out) drop if ever_used == 0 drop ever_used list, noobs clean

And, of course, you can -reshape wide-, if you like, after that. Just remember, though, that the vast majority of Stata commands work much more felicitously with long layouts than with wide. So think about it carefully.
2 likes
Comment
Raeedur Rahman

Join Date: May 2015

Posts: 70
#6

12 May 2015, 21:27

clyde sir what is the command of finding out duplicate data??
Comment
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#7

13 May 2015, 01:39

what is the command of finding out duplicate data?

Either one of them :

Code:

duplicates report varlist duplicates list varlist duplicates tag varlist

See help duplicates for further details.
Comment

Announcement