  • Counting frequency of dominant values

    Hello,

    In a dataset like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double id byte var1 double var2
     1 1  -3348.331566996959
     1 2 -12258.347805444035
     2 1 -3385.7240619345557
     2 2   -11301.4957054189
     3 1  -3082.825885040739
     3 2 -12374.821660737736
     4 1 -3309.0077761199655
     4 2 -11923.645354958966
     5 1 -2363.4950444452707
     5 2 -11739.878389288227
     6 1  -2870.765075241693
     6 2 -12223.491101587631
     7 1  -3440.455118031651
     7 2 -11586.180210576631
     8 1  -2879.530331466697
     8 2 -12411.194472623254
     9 1 -3663.8540427039675
     9 2 -12311.093745624179
    10 1 -1943.8513780855646
    10 2 -11848.575169763335
    11 1  -2562.368252950966
    11 2 -12486.757790837975
    12 1  -3053.403217860106
    12 2 -11279.761851692387
    13 1 -2915.6880516384244
    13 2  -11538.35238849248
    14 1  -2721.273809145597
    14 2 -10878.271406140717
    15 1 -3130.6859456758925
    15 2 -11355.943344553834
    16 1 -2696.5362969727626
    16 2 -11989.199934413056
    17 1 -3729.6202625214105
    17 2 -11662.125717279263
    18 1 -2718.9885319780406
    18 2 -12067.438722107863
    19 1  -2710.313557039223
    19 2  -10764.77844327837
    20 1  -2230.475260946375
    20 2 -11335.060265923423
    21 1  -3902.974735628137
    21 2 -12250.962858532945
    22 1 -3303.7587423982905
    22 2  -11591.06043152441
    23 1 -3178.3873813100963
    23 2 -12695.341629879618
    24 1 -3810.8403785234577
    24 2 -11663.350758506991
    25 1 -2871.7974042681444
    25 2 -12641.502334431827
    26 1 -3206.1072144838154
    26 2  -12607.27114065293
    27 1 -2492.1329315288945
    27 2 -11866.613732960985
    28 1 -3195.9760630471433
    28 2 -11831.669068914915
    29 1 -3327.0103659714664
    29 2 -12644.977265524569
    30 1  -3022.603080068251
    30 2 -10994.706326254727
    31 1 -2978.0023660306656
    31 2  -12235.86303415122
    32 1  -2942.442443576155
    32 2 -12473.511900603506
    33 1  -2805.400236408546
    33 2 -12077.562027808272
    34 1  -3687.215808364619
    34 2 -12197.331192554011
    35 1  -3525.406469649003
    35 2  -12288.98830282034
    36 1 -3515.9871033834975
    36 2 -12426.948061753421
    37 1   -3843.07926629132
    37 2 -11730.793600243547
    38 1 -2653.4191486085533
    38 2 -11112.137535825386
    39 1 -2230.9316323818152
    39 2 -11534.205230296784
    40 1 -3349.0024872846225
    40 2 -12436.152580542652
    41 1    -3225.2432746388
    41 2 -11796.185232441696
    42 1   -3391.61736312115
    42 2 -12801.819794989711
    43 1 -2095.1826594466825
    43 2 -11610.383493799716
    44 1 -2315.8516760156344
    44 2 -11761.863263104753
    45 1  -2342.693801672127
    45 2 -12201.238232981135
    46 1 -3911.2950036704856
    46 2 -12695.670118855902
    47 1 -3689.6496802706606
    47 2 -12636.210532571324
    48 1  -2519.131784578982
    48 2 -13130.544173341606
    49 1 -2960.2125844698403
    49 2 -11083.943158574133
    50 1  -2793.171211062792
    50 2 -12783.317898960162
    end

    I want to be able to sort by var2 and count how frequently var1 == 1 has a lower value than var1 == 2.

    In this example dataset it is pretty clear that the response is 100%, because if I sort by var2, I can easily check that the first 50 observations have var1 == 1 and the last 50 observations have var1 == 2.

    However, I will be handling datasets with a huge number of var1 groups (var1 == 1/n), and I will need to calculate the frequency of "dominance" for each group.


    I'm sorry if my problem is not clear. I'm not even sure how to state what I need.

    Thank you in advance,

    Rafael

  • #2
    In this data, for every single value of id, it is always the case that var2 is lower when var1 == 2. So the answer here is not 100% as you say; it is 0%.

    This particular calculation is easier to do if you go to a wide data layout.
    Code:
    reshape wide var2, i(id) j(var1)
    egen wanted = mean(var21 < var22)
    "I will be handling datasets with huge amount of var1 groups ( var1 == 1/n)"
    I do not understand what this means. var1 == 1/n? What is n?
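
    The same calculation can also be done without reshaping. A sketch, assuming exactly two observations per id, with var1 taking the values 1 and 2:
    Code:
    sort id var1
    by id: gen byte id1_lower = var2[1] < var2[2] if _n == 1
    summarize id1_lower
    The mean reported by summarize is the proportion of id's for which the var1 == 1 observation has the lower var2.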



    • #3
      Dear Clyde,

      Thank you for your rapid response.

      I meant that I will be handling more subgroups. For instance, in the following example I present 4 var1 values (var1 == 1/4).

      Can you please show me the solution here? I'm still not sure how to solve this.

      Thank you in advance,

      Rafael

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input double id byte var1 double var2
       9 1 -1632.6402417300533
       7 1 -18.507656963302907
       3 1  151.35777707753186
       1 1  421.66964785682285
       6 1  437.77851313727206
      10 1   638.5943531589855
       8 1  1209.1098113790804
       4 1   1375.141707205933
       5 1   1422.659396254885
       2 1  3795.6596173632224
       2 4  18377.976575367684
       7 4  19788.321715661747
       4 4   21680.87145057269
      10 4  22723.774813856275
       3 4  23141.348214001715
       5 4   24382.38382469559
       1 4  25725.296968826944
      10 3  26615.742651873414
       6 4  27018.924183134244
       8 4  29137.511925730752
       9 4   29292.36521476868
       5 2   30944.19739789841
       2 3  30962.295947890663
       3 3  31030.425096840492
       9 3   32862.08894372136
       8 2   32971.03405345717
       2 2   33676.63954807641
       1 2   34353.56454772337
       5 3    35089.0382685053
      10 2    36402.7017502104
       6 2   36643.27497597544
       4 3   37479.16667006486
       7 3    38269.5594322172
       3 2   39056.13420993909
       6 3   40702.59505045218
       4 2   40928.62703145932
       9 2   41494.02207821586
       7 2    42838.5748765135
       8 3   45091.57252979315
       1 3   45254.38580844723
      end

      ps: I edited the dataex.
      Last edited by Rafael Pacheco; 25 Nov 2023, 22:45.



      • #4
        So, something along these lines:
        Code:
        isid id var1, sort
        by id (var1): assert _N == 4
        
        //    IDENTIFY VALUE OF VAR1 ASSOCIATED WITH SMALLEST VAR2
        by id (var2), sort: gen byte var1_lowest_var2 = var1[1]
        
        //    IDENTIFY PROPORTION OF ID'S FOR WHICH VAR1_LOWEST_VAR2 IS VAR1
        by var1, sort: egen proportion = mean(var1_lowest_var2 == var1)
        
        //    IDENTIFY VALUE OF VAR1 WITH GREATEST PROPORTION
        sort proportion
        gen byte dominant_var1 = var1[_N]
        Now, there is a gap in how you have specified your problem, and this code handles that gap badly. The gap is that you do not say what to do if there are ties. This can happen at a couple of levels. Within an id, there may be ties for the smallest value of var2, so it is unclear which var1 value to pick as the var1 with the lowest var2. Then, when we are picking the value of var1 that has the greatest proportion of id's for which it is the var1 with the lowest var2 (the "dominant" var1 value), there can be ties in those proportions.

        Stata's default behavior in situations like this is to break the tie randomly and irreproducibly. Although that is seldom a desirable solution to the problem, in the absence of any specification of how ties are to be broken, this is the best one can do.

        So you need to give some thought to this issue of breaking ties. The code can be modified to handle any kind of tie-breaking rule you come up with.
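
        For example, one deterministic rule (an assumption, not something that has been specified in this thread) is to break every tie toward the smallest var1 value. A sketch of how the last step of the code above could be changed under that rule:
        Code:
        //    AMONG TIED HIGHEST PROPORTIONS, PICK THE SMALLEST VAR1
        gsort proportion -var1
        gen byte dominant_var1 = var1[_N]
        gsort sorts proportion ascending and var1 descending, so the last observation carries the highest proportion and, among ties, the smallest var1. The within-id ties can be handled the same way, by adding var1 to the sort key: by id (var2 var1).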
