Conversion suggestion

Tariq Abdullah

Join Date: Apr 2021
Posts: 366

24 Nov 2023, 17:10

In the data below, whenever probability_weight for a given cpc4 is at least 0.5, I need to create a final variable, naics07_final, that takes the value of naics07_2 for that observation.

NAICS codes identify US industries. The complication is that several sectors span more than one two-digit code: 31-33 are manufacturing, 44-45 are retail trade, and 48-49 are transportation and warehousing. For each cpc4, I want to sum the probability weights of the rows that fall within the same sector (31-33, 44-45, or 48-49), so that I can check whether the combined probability_weight for that cpc4 reaches my 0.5 threshold within a single NAICS sector. Can anyone tell me how to do this?

Code:

* Define labels for industries
label define industry_labels ///
    11 "Agriculture, Forestry, Fishing and Hunting" ///
    31 "Manufacturing" ///
    32 "Manufacturing" ///
    33 "Manufacturing" ///
    42 "Wholesale Trade" ///
    44 "Retail Trade" ///
    45 "Retail Trade" ///
    48 "Transportation and Warehousing" ///
    48 "Transportation and Warehousing" ///
    49 "Transportation and Warehousing"

My initial idea is the following, but it does not sum the probabilities across 31-33, 44-45, and 48-49:

Code:

gen naics07_final = .
replace naics07_final = naics07_2 if probability_weight >= 0.5 & !missing(probability_weight, cpc4)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 cpc4 byte naics07_2 float probability_weight
"A01B" 11        1
"A01C" 11        1
"A01D" 11 .1844643
"A01D" 23 .8155357
"A01F" 11 .4023609
"A01F" 23 .5976391
"A01G" 11 .8568231
"A01G" 22 .0223281
"A01G" 23 .1208488
"A01H" 11  .097597
"A01H" 31 .9024029
"A01J" 31        1
"A01K" 11 .9318168
"A01K" 23 .0312263
"A01K" 31 .0369569
"A01L" 32 .9562588
"A01L" 33 .0437413
"A01M" 11  .787376
"A01M" 23 .1239584
"A01M" 32 .0886656
"A01N" 11 .1843447
"A01N" 31 .0819803
"A01N" 32  .733675
"A21B" 31        1
"A21C" 31        1
"A21D" 31        1
"A22B" 11 .0859366
"A22B" 31 .8831139
"A22B" 33 .0309495
"A22C" 11 .2808393
"A22C" 31 .7191607
"A23B" 11 .2927963
"A23B" 31 .7072037
"A23C" 11 .0557356
"A23C" 31 .9442644
"A23D" 11 .1084658
"A23D" 31 .8915342
"A23F" 31        1
"A23G" 31        1
"A23J" 11 .1541328
"A23J" 31 .8458672
"A23K" 11 .3007674
"A23K" 31 .6992326
"A24C" 33 .0269198
"A24D" 31 .9561672
"A24D" 32 .0438328
"A24F" 31        1
"A41B" 31 .9496588
"A41B" 32 .0503412
"A41C" 31 .8472361
"A41C" 32 .1527639
"A41D" 31        1
"A41F" 31 .9394662
"A41F" 32 .0316944
"A41F" 33 .0288394
"A41G" 32 .9763525
"A41G" 33 .0236475
"A41H" 31 .7940938
"A41H" 32  .047757
"A41H" 33 .1581493
end
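
The sector-sum idea described above could be sketched roughly as follows. This is only a sketch under some assumptions: the variable names sector and sector_weight are made up for illustration, each multi-code sector is represented by its first code (31, 44, 48), which is an arbitrary choice, and the few input rows are copied from the dataex listing so that the block runs on its own.

```stata
* Small excerpt of the example data (rows copied from the listing above)
clear
input str4 cpc4 byte naics07_2 float probability_weight
"A01N" 11 .1843447
"A01N" 31 .0819803
"A01N" 32  .733675
"A24C" 33 .0269198
"A41H" 31 .7940938
"A41H" 32  .047757
"A41H" 33 .1581493
end

* Collapse multi-code sectors to one representative code
* (assumption: 31 = manufacturing, 44 = retail, 48 = transportation)
gen byte sector = naics07_2
replace sector = 31 if inrange(naics07_2, 31, 33)
replace sector = 44 if inrange(naics07_2, 44, 45)
replace sector = 48 if inrange(naics07_2, 48, 49)

* Sum probability_weight within each cpc4-sector pair
bysort cpc4 sector: egen double sector_weight = total(probability_weight)

* Assign the sector code only where the summed weight reaches the threshold
gen byte naics07_final = sector if sector_weight >= 0.5 & !missing(sector)
```

With this approach every row belonging to the qualifying sector carries the sector code, and rows in other sectors stay missing. If the same value is wanted on all rows of a cpc4, something like `bysort cpc4 (naics07_final): replace naics07_final = naics07_final[1]` should fill it in, since missing values sort last.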

Last edited by Tariq Abdullah; 24 Nov 2023, 17:43.

Tariq Abdullah

Join Date: Apr 2021
Posts: 366

24 Nov 2023, 17:55

I figured out a better way: for each unique cpc4, I now take the naics07_2 code with the highest probability_weight among all the codes it relates to.

Code:

* Sort within cpc4 so the row with the largest weight comes last
bysort cpc4 (probability_weight): gen max_weight = probability_weight[_N]
* Rows that do not carry the maximum weight are left missing
gen naics_final = naics07_2 if probability_weight == max_weight
drop max_weight
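
A possible variation on the maximum-weight approach is to write the winning code onto every row of the cpc4 and to flag ties. The names naics_top and tie are hypothetical, the rule that the larger code wins on a tie is an arbitrary assumption, and the sketch assumes there are no missing weights (missing values would sort last and be picked up as the maximum). The input rows are copied from the dataex listing in the first post.

```stata
* Small excerpt of the example data (rows copied from the first post)
clear
input str4 cpc4 byte naics07_2 float probability_weight
"A01D" 11 .1844643
"A01D" 23 .8155357
"A01N" 11 .1843447
"A01N" 31 .0819803
"A01N" 32  .733675
end

* Sort within cpc4 so the highest-weight row is last; naics07_2 breaks
* ties deterministically (assumption: on a tie, the larger code wins)
bysort cpc4 (probability_weight naics07_2): gen byte naics_top = naics07_2[_N]

* Flag cpc4 groups whose two largest weights are equal, for manual review
by cpc4: gen byte tie = _N > 1 & probability_weight[_N] == probability_weight[_N - 1]
```

Because the result sits on every row, the data can then be reduced to one row per cpc4 if that is what the later steps need.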
