  • Counting frequency of dominant values

    Hello,

    In a dataset like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double id byte var1 double var2
     1 1  -3348.331566996959
     1 2 -12258.347805444035
     2 1 -3385.7240619345557
     2 2   -11301.4957054189
     3 1  -3082.825885040739
     3 2 -12374.821660737736
     4 1 -3309.0077761199655
     4 2 -11923.645354958966
     5 1 -2363.4950444452707
     5 2 -11739.878389288227
     6 1  -2870.765075241693
     6 2 -12223.491101587631
     7 1  -3440.455118031651
     7 2 -11586.180210576631
     8 1  -2879.530331466697
     8 2 -12411.194472623254
     9 1 -3663.8540427039675
     9 2 -12311.093745624179
    10 1 -1943.8513780855646
    10 2 -11848.575169763335
    11 1  -2562.368252950966
    11 2 -12486.757790837975
    12 1  -3053.403217860106
    12 2 -11279.761851692387
    13 1 -2915.6880516384244
    13 2  -11538.35238849248
    14 1  -2721.273809145597
    14 2 -10878.271406140717
    15 1 -3130.6859456758925
    15 2 -11355.943344553834
    16 1 -2696.5362969727626
    16 2 -11989.199934413056
    17 1 -3729.6202625214105
    17 2 -11662.125717279263
    18 1 -2718.9885319780406
    18 2 -12067.438722107863
    19 1  -2710.313557039223
    19 2  -10764.77844327837
    20 1  -2230.475260946375
    20 2 -11335.060265923423
    21 1  -3902.974735628137
    21 2 -12250.962858532945
    22 1 -3303.7587423982905
    22 2  -11591.06043152441
    23 1 -3178.3873813100963
    23 2 -12695.341629879618
    24 1 -3810.8403785234577
    24 2 -11663.350758506991
    25 1 -2871.7974042681444
    25 2 -12641.502334431827
    26 1 -3206.1072144838154
    26 2  -12607.27114065293
    27 1 -2492.1329315288945
    27 2 -11866.613732960985
    28 1 -3195.9760630471433
    28 2 -11831.669068914915
    29 1 -3327.0103659714664
    29 2 -12644.977265524569
    30 1  -3022.603080068251
    30 2 -10994.706326254727
    31 1 -2978.0023660306656
    31 2  -12235.86303415122
    32 1  -2942.442443576155
    32 2 -12473.511900603506
    33 1  -2805.400236408546
    33 2 -12077.562027808272
    34 1  -3687.215808364619
    34 2 -12197.331192554011
    35 1  -3525.406469649003
    35 2  -12288.98830282034
    36 1 -3515.9871033834975
    36 2 -12426.948061753421
    37 1   -3843.07926629132
    37 2 -11730.793600243547
    38 1 -2653.4191486085533
    38 2 -11112.137535825386
    39 1 -2230.9316323818152
    39 2 -11534.205230296784
    40 1 -3349.0024872846225
    40 2 -12436.152580542652
    41 1    -3225.2432746388
    41 2 -11796.185232441696
    42 1   -3391.61736312115
    42 2 -12801.819794989711
    43 1 -2095.1826594466825
    43 2 -11610.383493799716
    44 1 -2315.8516760156344
    44 2 -11761.863263104753
    45 1  -2342.693801672127
    45 2 -12201.238232981135
    46 1 -3911.2950036704856
    46 2 -12695.670118855902
    47 1 -3689.6496802706606
    47 2 -12636.210532571324
    48 1  -2519.131784578982
    48 2 -13130.544173341606
    49 1 -2960.2125844698403
    49 2 -11083.943158574133
    50 1  -2793.171211062792
    50 2 -12783.317898960162
    end

    I want to be able to sort by var2 and count how frequently var1 == 1 has a lower value than var1 == 2.

    In this example dataset it is pretty clear that the response is 100%, because if I sort by var2, I can easily check that the first 50 observations have var1 == 1 and the last 50 observations have var1 == 2.

    However, I will be handling datasets with a huge number of var1 groups (var1 == 1/n), and I will need to calculate the frequency of "dominance" for each group.


    I'm sorry if my problem is not clear. I'm not even sure how to state what I need.

    Thank you in advance,

    Rafael

  • #2
    In this data, for every single value of id, it is always the case that var2 is lower when var1 == 2. So the answer here is not 100% as you say; it is 0%.

    This particular calculation is easier to do if you go to a wide data layout.
    Code:
    reshape wide var2, i(id) j(var1)
    egen wanted = mean(var21 < var22)
    "I will be handling datasets with huge amount of var1 groups ( var1 == 1/n)"
    I do not understand what this means. var1 == 1/n? What is n?
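
    The same calculation can also be done without reshaping. A sketch, assuming exactly two observations per id, with var1 taking the values 1 and 2:
    Code:
    sort id var1
    by id: gen byte id1_lower = var2[1] < var2[2] if _n == 1
    summarize id1_lower
    The mean reported by summarize is the proportion of id's for which the var1 == 1 observation has the lower var2.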



    • #3
      Dear Clyde,

      Thank you for your rapid response.

      I meant that I will be handling more subgroups. For instance, in the following example I present 4 var1 values (var1 == 1/4).

      Can you please show me the solution here? I'm still not sure how to solve this.

      Thank you in advance,

      Rafael

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input double id byte var1 double var2
       9 1 -1632.6402417300533
       7 1 -18.507656963302907
       3 1  151.35777707753186
       1 1  421.66964785682285
       6 1  437.77851313727206
      10 1   638.5943531589855
       8 1  1209.1098113790804
       4 1   1375.141707205933
       5 1   1422.659396254885
       2 1  3795.6596173632224
       2 4  18377.976575367684
       7 4  19788.321715661747
       4 4   21680.87145057269
      10 4  22723.774813856275
       3 4  23141.348214001715
       5 4   24382.38382469559
       1 4  25725.296968826944
      10 3  26615.742651873414
       6 4  27018.924183134244
       8 4  29137.511925730752
       9 4   29292.36521476868
       5 2   30944.19739789841
       2 3  30962.295947890663
       3 3  31030.425096840492
       9 3   32862.08894372136
       8 2   32971.03405345717
       2 2   33676.63954807641
       1 2   34353.56454772337
       5 3    35089.0382685053
      10 2    36402.7017502104
       6 2   36643.27497597544
       4 3   37479.16667006486
       7 3    38269.5594322172
       3 2   39056.13420993909
       6 3   40702.59505045218
       4 2   40928.62703145932
       9 2   41494.02207821586
       7 2    42838.5748765135
       8 3   45091.57252979315
       1 3   45254.38580844723
      end

      ps: I edited the dataex.
      Last edited by Rafael Pacheco; 25 Nov 2023, 22:45.



      • #4
        So, something along these lines:
        Code:
        isid id var1, sort
        by id (var1): assert _N == 4
        
        //    IDENTIFY VALUE OF VAR1 ASSOCIATED WITH SMALLEST VAR2
        by id (var2), sort: gen byte var1_lowest_var2 = var1[1]
        
        //    IDENTIFY PROPORTION OF ID'S FOR WHICH VAR1_LOWEST_VAR2 IS VAR1
        by var1, sort: egen proportion = mean(var1_lowest_var2 == var1)
        
        //    IDENTIFY VALUE OF VAR1 WITH GREATEST PROPORTION
        sort proportion
        gen byte dominant_var1 = var1[_N]
        Now, there is a gap in how you have specified your problem, and this code handles that gap badly. The gap is that you do not say what to do if there are ties. This can happen at a couple of levels. Within an id, there may be ties for the smallest value of var2, so it is unclear which var1 value to pick as the var1 with the lowest var2. Then, when we are picking the value of var1 that has the greatest proportion of id's for which it is the var1 with the lowest var2 (the "dominant" var1 value), there can be ties in those proportions.

        Stata's default behavior in situations like this is to break the tie randomly and irreproducibly. Although that is seldom a desirable solution to the problem, in the absence of any specification of how ties are to be broken, this is the best one can do.

        So you need to give some thought to this issue of breaking ties. The code can be modified to handle any kind of tie-breaking rule you come up with.
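
        For example, one deterministic rule (an assumption, not something that has been specified in this thread) is to break every tie toward the smallest var1 value. A sketch of how the last step of the code above could be changed under that rule:
        Code:
        //    AMONG TIED HIGHEST PROPORTIONS, PICK THE SMALLEST VAR1
        gsort proportion -var1
        gen byte dominant_var1 = var1[_N]
        gsort sorts proportion ascending and var1 descending, so the last observation carries the highest proportion and, among ties, the smallest var1. The within-id ties can be handled the same way, by adding var1 to the sort key: by id (var2 var1).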
