Create additional varaibles based on values of an existing variable

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#16

09 Jun 2016, 13:00

Hi Clyde,

As regards to your post #14, if psic is not always the lowest number among all sic,

e.g. for firm X, psic=3500, and the other segment sic=3100, 3200.

can I stiil use

Code:

// ASSIGN SEQUENCE NUMBERS TO EACH SEGMENT WITHIN A FIRM by gvkey sic, sort: gen int seq = 1 if _n == 1 & sic != psic by gvkey (sic): replace seq = sum(seq)

instead of using
isid gvkey sic gen not_psic = (sic != psic) // 0 FOR PSIC, 1 FOR OTHERS by gvkey (not_psic sic), sort: gen seq = _n-1 // SORT THE PSIC TO THE TOP OF THE LIST drop not_psic
Comment
Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#17

09 Jun 2016, 13:28

Hi Clyde ,

Given that psic is not always the lowest number, would you still advise replace the code as you indicated in #14.

Thanks,

Rochelle
Comment

Rochelle Zhang

Join Date: Jul 2025
Posts: 0

#18

09 Jun 2016, 14:37

Dear all,

I have to change my data a little- leave out segments with missing sales. This will reduce my data size , so when i do reshape, the computer will not freeze.

data have

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(gvkey sic psic) byte segsale
1000 3500 3500 40 
1000 3200 . 10 
2000 3500 3500 15 
1200 3300 . 12 
1200 3500 3500 49 
end

my desired output is

Code:

clear
input int(gvkey sic psic) byte segsale int(totsale_psic sic1) byte totsale_sic1 int sic2 byte totsale_sic2
1000 3500 3500 40 104 . . . . 
1000 3200 . 10 . 3200 10 . . 
2000 3500 3500 15 104 . . . . 
1200 3300 . 12 . . . 3300 12 
1200 3500 3500 49 104 . . . . 
end

gvkey is firm identifier.
psic is the primary segment id - segment with largest sales.
each firm could operates in one segment, or multiple segment , e.g. firm 1000 operates in 2 segments 3500, 3200
segsale is sales in each segment

desired output shows new variables
totsale_psic : this is the sum of segsale across the same psic
totsale_sic1: the sum of segsale across the first non psic segment (in my case it is 3200, given only 1 firm has has segment in 3200 with sales 10, so total is 10)
totsale_sic2: the sum of segsale across the second non psic segment (in my case it is 3300, given only 1 firm has segment 3300 with sales 12, so total is 12)

my code (Clyde wrote the code for me for a different data), I want to ask you help to edit it to produce the output above.

Code:

use have,clear

* combine segments if they have the same sic
collapse (sum) segsale=segsale, by(gvkey sic) 
* this above adds up sales, which means I need to redefine pri seg 

*drop psic

gen sale=float(segsale)
bysort gvkey : egen maxsale=max(segsale)
gen psic=sic if sale==maxsale
drop maxsale sale

// VERIFY EACH GVKEY HAS ONLY ONE NON-MISSINSG VALUE OF PSIC
// AND SPREAD THAT TO ALL OBSERVATIONS OF THAT GVKEY
by gvkey (psic), sort: assert psic == psic[1] if !missing(psic)
by gvkey (psic): replace psic = psic[1]

Code:

// CALCULATE Across each PSIC cohort - TOTAL SALES IN EACH SEGMENT
egen tot_sales_sic = total(segsale), by(psic sic)

isid gvkey sic
gen not_psic = (sic != psic) // 0 FOR PSIC, 1 FOR OTHERS
by gvkey (not_psic sic), sort: gen seq = _n-1 // SORT THE PSIC TO THE TOP OF THE LIST
drop not_psic

* count # of segments
summ seq, meanonly
local seq_max = r(max)

// RESHAPE WIDE
drop segsale
reshape wide tot_sales_sic sic, i(gvkey) j(seq)

// MERGE BACK THE ORIGINAL DATA
merge 1:m gvkey using UnStretch, assert(match) nogenerate
sort gvkey sic
order sic psic, after(gvkey)


forvalues i = 0/`seq_max' {
    replace tot_sales_sic`i' = . if sic != sic`i'
    replace sic`i' = . if sic != sic`i'
}

Best,
Rochelle

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30114

#19

09 Jun 2016, 15:09

The code you sent does not quite run with the example data. The -merge- command refers to a file UnStretch, which you do not create or show us in this post. Based on earlier posts in the thread, I'm guessing that UnStretch is a copy of the original data, so I've modified your code accordingly to experiment with it.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(gvkey sic psic) byte segsale
1000 3500 3500 40 
1000 3200 . 10 
2000 3500 3500 15 
1200 3300 . 12 
1200 3500 3500 49 
end

tempfile holding
save `holding'

* combine segments if they have the same sic
collapse (sum) segsale=segsale, by(gvkey sic) 
* this above adds up sales, which means I need to redefine pri seg 

*drop psic

gen sale=float(segsale)
bysort gvkey : egen maxsale=max(segsale)
gen psic=sic if sale==maxsale
drop maxsale sale

// VERIFY EACH GVKEY HAS ONLY ONE NON-MISSINSG VALUE OF PSIC
// AND SPREAD THAT TO ALL OBSERVATIONS OF THAT GVKEY
by gvkey (psic), sort: assert psic == psic[1] if !missing(psic)
by gvkey (psic): replace psic = psic[1]

// CALCULATE Across each PSIC cohort - TOTAL SALES IN EACH SEGMENT
egen tot_sales_sic = total(segsale), by(psic sic)

isid gvkey sic
gen not_psic = (sic != psic) // 0 FOR PSIC, 1 FOR OTHERS
by gvkey (not_psic sic), sort: gen seq = _n-1 // SORT THE PSIC TO THE TOP OF THE LIST
stop
drop not_psic

* count # of segments
summ seq, meanonly
local seq_max = r(max)

// RESHAPE WIDE
drop segsale
reshape wide tot_sales_sic sic, i(gvkey) j(seq)
assert sic0 == psic
drop sic0
rename tot_sales_sic0 tot_sales_psic

// MERGE BACK THE ORIGINAL DATA
merge 1:m gvkey using `holding', assert(match) nogenerate
sort gvkey sic
order sic psic, after(gvkey)


forvalues i = 0/`seq_max' {
    replace tot_sales_sic`i' = . if sic != sic`i'
    replace sic`i' = . if sic != sic`i'
}

is my starting point.

Getting to the desired output you show is not just a matter of editing. You're asking for different results. In the data in #18, firm 1200 only has two segments: 3500 (its psic) and 3300. The code was designed to create sic1 total_sales_sic1 sic2 total_sales_sic2, etc. based on the number of segments within firms. So there is no reason why firm 1200 (or any other firm in your example) should have an sic2 and total_sales_sic2, as firm has only one segment other than its psic. Based on your original problem description, the output you are getting now is correct and the output you are now requesting would be wrong..

Now maybe you actually want the sic1 sic2 etc., to refer to sic's within psic, not within firm. If that's what you want, the code can be changed to produce that. But before I do it, I want to be sure that's really what you're asking for, because it is different from what you originally described in the early posts.

Comment

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#20

09 Jun 2016, 15:40

Dear Clyde,

You are correct, the unstrech refers to data have in my post #18.

data have has 3 firms identified by gvkey. these 3 firms form a cohort identified by psic. If I may, let us present 3500 is plastic toy manufacturer, so the three firms primary business is making plastic toys, but each of them also do other lines of business as denoted by 3200, 3300.

data want in #18:

totsale_psic is the sum of primary segment sale for this cohort of 3 firms
totsale_sic1: is the sum of sales for 3200 segment or this cohort of 3 firms = (10+0+0)
totsale_sic2: is the sum of sales for 3300 segment or this cohort of 3 firms = (0+0+12)

so from data have to the data want , I need to add on 2 new sic variable , and the corresponding sum of sales across all three firms.

My real data has other cohorts, e.g. another primary segment cohort psic=6781, it has 100 firms in this cohort, and they have 20 distinct sic, and I need to create the columns of 20 sic , and the total sales across the cohort. Later I would adapt this code to all other cohorts with different psic.

so, data want is a data structure that list all sic in columns for each of the firm in the cohort. I hope this makes sense, and data want in #18 is the desired outcome.

I want to thank you for again for reaching my long posts and help me continuously !!!

p.s. I hope it makes sense to show the new data (different from post #1), which delete missing sales rows in data want , and help with reshape later on.

Rochelle
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30114

#21

09 Jun 2016, 16:32

OK. The code below will do it. The only aspect of the results data set that I did not replicate is the order of observations. You have gvkey 2000 in the middle as observation #3, whereas I have it at the end because I have sorted on gvkey within psic. It isn't clear to me what the pattern is that puts 2000 in the middle. Except for that, however, this code produces your output exactly as you wanted.

The code has changed from earlier versions in several places, so I have not highlighted them. Just treat this as an entirely new file of code, even though much of it is as before.

Code:

set more off
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(gvkey sic psic) byte segsale
1000 3500 3500 40 
1000 3200 . 10 
2000 3500 3500 15 
1200 3300 . 12 
1200 3500 3500 49 
end

tempfile holding
save `holding'

* combine segments if they have the same sic
collapse (sum) segsale=segsale, by(gvkey sic) 
* this above adds up sales, which means I need to redefine pri seg 

gen sale=float(segsale)
bysort gvkey : egen maxsale=max(segsale)
gen psic=sic if sale==maxsale
drop maxsale sale

// VERIFY EACH GVKEY HAS ONLY ONE NON-MISSINSG VALUE OF PSIC
// AND SPREAD THAT TO ALL OBSERVATIONS OF THAT GVKEY
by gvkey (psic), sort: assert psic == psic[1] if !missing(psic)
by gvkey (psic): replace psic = psic[1]

// CALCULATE Across each PSIC cohort - TOTAL SALES IN EACH SEGMENT
egen tot_sales_sic = total(segsale), by(psic sic)

isid psic gvkey sic
gen not_psic = (sic != psic) // 0 FOR PSIC, 1 FOR OTHERS
by psic not_psic sic, sort: gen seq = 1 if not_psic // SORT THE PSIC TO THE TOP OF THE LIST
by psic: replace seq = sum(seq)
drop not_psic

* count # of segments
summ seq, meanonly
local seq_max = r(max)

// RESHAPE WIDE
drop segsale
reshape wide tot_sales_sic sic, i(psic gvkey) j(seq)
assert sic0 == psic

// MERGE BACK THE ORIGINAL DATA
merge 1:m gvkey using `holding', assert(match) nogenerate
sort gvkey sic
order sic psic, after(gvkey)


forvalues i = 0/`seq_max' {
    replace tot_sales_sic`i' = . if sic != sic`i'
    replace sic`i' = . if sic != sic`i'
}
drop sic0
rename tot_sales_sic0 tot_sales_psic
order gvkey sic psic segsale tot_sales_psic, first
gen byte not_psic = (sic != psic)
sort psic gvkey not_psic sic 
drop not_psic

Comment

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#22

09 Jun 2016, 20:40

It works great for my reduced size data

Sincerely,

Rochelle
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment