  • Conversion suggestion

    From the following data, if the probability_weight for a given cpc4 is at least 0.5, I need to create a final variable, naics07_final, that takes the value of naics07_2 for that observation.

    NAICS codes represent US industries. The issue is that some sectors span several 2-digit codes: 31-33 are manufacturing, 44-45 are retail trade, and 48-49 are transportation and warehousing. For each cpc4, I want to sum the probability weights of the codes that fall within 31-33, 44-45, or 48-49. That way I can tell whether the combined probability_weight for that cpc4 reaches my threshold of 0.5 within a single NAICS sector. Can anyone tell me how to do this?

    Code:
    * Define labels for industries
    label define industry_labels ///
        11 "Agriculture, Forestry, Fishing and Hunting" ///
        31 "Manufacturing" ///
        32 "Manufacturing" ///
        33 "Manufacturing" ///
        42 "Wholesale Trade" ///
        44 "Retail Trade" ///
        45 "Retail Trade" ///
        48 "Transportation and Warehousing" ///
        49 "Transportation and Warehousing"
    My initial idea is the following, but it does not sum the probabilities across 31-33, 44-45, and 48-49:

    Code:
    gen naics07_final = .
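    * cpc4 is a string, so test it with missing(); also guard against missing weights, which Stata treats as larger than any number
    replace naics07_final = naics07_2 if probability_weight >= 0.5 & !missing(probability_weight) & !missing(cpc4)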
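    One way the summing described above could be done is sketched here. It is only a sketch: sector and sector_weight are illustrative names that do not appear in the original post, and the few input rows are copied from the -dataex- listing so that the block runs on its own. It assumes that codes 31-33, 44-45, and 48-49 should each collapse to the first code of their range.

    ```stata
    * Sketch only: sector and sector_weight are illustrative names, not from the post.
    * A few rows copied from the -dataex- example so the block runs on its own.
    clear
    input str4 cpc4 byte naics07_2 float probability_weight
    "A01L" 32 .9562588
    "A01L" 33 .0437413
    "A01N" 11 .1843447
    "A01N" 31 .0819803
    "A01N" 32  .733675
    end

    * Collapse the split 2-digit codes into one sector code each;
    * codes outside these ranges are copied unchanged.
    recode naics07_2 (31/33 = 31) (44/45 = 44) (48/49 = 48), generate(sector)

    * Sum the probability weights within each cpc4-sector pair.
    bysort cpc4 sector: egen double sector_weight = total(probability_weight)

    * Keep the sector code only where the summed weight reaches the threshold.
    generate byte naics07_final = sector if sector_weight >= 0.5

    list, sepby(cpc4)
    ```

    With these rows, both A01L observations and the two manufacturing observations of A01N should end up with naics07_final equal to 31, while the A01N row for code 11 stays missing. Note that when a cpc4 splits exactly 0.5/0.5 between two sectors, both sectors pass the threshold, so such cases would need a separate rule.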
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4 cpc4 byte naics07_2 float probability_weight
    "A01B" 11        1
    "A01C" 11        1
    "A01D" 11 .1844643
    "A01D" 23 .8155357
    "A01F" 11 .4023609
    "A01F" 23 .5976391
    "A01G" 11 .8568231
    "A01G" 22 .0223281
    "A01G" 23 .1208488
    "A01H" 11  .097597
    "A01H" 31 .9024029
    "A01J" 31        1
    "A01K" 11 .9318168
    "A01K" 23 .0312263
    "A01K" 31 .0369569
    "A01L" 32 .9562588
    "A01L" 33 .0437413
    "A01M" 11  .787376
    "A01M" 23 .1239584
    "A01M" 32 .0886656
    "A01N" 11 .1843447
    "A01N" 31 .0819803
    "A01N" 32  .733675
    "A21B" 31        1
    "A21C" 31        1
    "A21D" 31        1
    "A22B" 11 .0859366
    "A22B" 31 .8831139
    "A22B" 33 .0309495
    "A22C" 11 .2808393
    "A22C" 31 .7191607
    "A23B" 11 .2927963
    "A23B" 31 .7072037
    "A23C" 11 .0557356
    "A23C" 31 .9442644
    "A23D" 11 .1084658
    "A23D" 31 .8915342
    "A23F" 31        1
    "A23G" 31        1
    "A23J" 11 .1541328
    "A23J" 31 .8458672
    "A23K" 11 .3007674
    "A23K" 31 .6992326
    "A24C" 33 .0269198
    "A24D" 31 .9561672
    "A24D" 32 .0438328
    "A24F" 31        1
    "A41B" 31 .9496588
    "A41B" 32 .0503412
    "A41C" 31 .8472361
    "A41C" 32 .1527639
    "A41D" 31        1
    "A41F" 31 .9394662
    "A41F" 32 .0316944
    "A41F" 33 .0288394
    "A41G" 32 .9763525
    "A41G" 33 .0236475
    "A41H" 31 .7940938
    "A41H" 32  .047757
    "A41H" 33 .1581493
    end
    Last edited by Tariq Abdullah; 24 Nov 2023, 17:43.

  • #2
    I figured out a better way: for each unique cpc4, I take the naics07_2 code with the largest probability_weight among all the codes it maps to.

    Code:
    * Sort within cpc4 so the largest weight is last, then store it on every row
    bysort cpc4 (probability_weight): gen max_weight = probability_weight[_N]
    * Keep the NAICS code only on the row(s) holding that maximum; other rows stay missing
    gen naics_final = naics07_2 if probability_weight == max_weight
    drop max_weight
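
    A possible variant of the maximum approach is sketched here, in case the chosen code is wanted on every row of a cpc4 and ties need to be spotted. The variable tie is an illustrative name that does not appear in the original post, and the input rows are copied from the -dataex- listing in #1 so that the block runs on its own.

    ```stata
    * Sketch only: tie is an illustrative name, not from the post.
    * A few rows copied from the -dataex- example so the block runs on its own.
    clear
    input str4 cpc4 byte naics07_2 float probability_weight
    "A01B" 11        1
    "A01G" 11 .8568231
    "A01G" 22 .0223281
    "A01G" 23 .1208488
    end

    * Sort so the largest weight is last within each cpc4,
    * then copy that row's code to every row of the group.
    bysort cpc4 (probability_weight): generate byte naics_final = naics07_2[_N]

    * Flag groups where the two largest weights are equal (a tie).
    by cpc4: generate byte tie = _N > 1 & probability_weight[_N] == probability_weight[_N-1]

    list, sepby(cpc4)
    ```

    When two codes tie for the maximum, the sort order between them is arbitrary, so the flag is worth checking before relying on naics_final. Also, the maximum is not guaranteed to reach 0.5 when a cpc4 splits across three or more codes, so a threshold check could still be added if that requirement matters.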
