Sorting data ascending and descending based on variables by group

Adam Gallaher

Join Date: Feb 2021
Posts: 41

Sorting data ascending and descending based on variables by group

10 Oct 2022, 11:14

Hello, I am looking to sort by dataset by Annual_MWh (largest to smallest) and by moderate_2022 (smallest to largest). This is to get the cheapest projects that produce the most energy. Once sorted I want to keep a subset of the entire dataset selecting an observation from each group (town_no) until all observations sum up to 163441 (Annual_MWh). I am not sure how to sort and select x observations by town_no until I reach the Annual_RPS_MWhs criteria.

Any help is appreciated, thank you.

What I have so far is:

Code:

gsort Moderate_2022 -Annual_MWh
gen Annual_RPS_MWhs = sum(Annual_MWh)
keep if Annual_RPS_MWhs <= 163441 | Annual_RPS_MWhs < 165000

dataex town_no town Annual_MWh moderate_2022 in 1/50

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int town_no str16 town float(Annual_MWh moderate_2022)
 53 "Franklin"     1164.5305 .04218284
 66 "Harwinton"    1155.1118 .04252679
168 "Woodbury"     1168.1362 .04205263
 98 "Norfolk"      1155.2405 .04252206
130 "Southbury"    1149.0099 .04275263
 74 "Litchfield"   1158.7079 .04239481
 66 "Harwinton"    1163.7751 .04221022
117 "Redding"      1160.5948 .04232588
 17 "Bristol"      1159.5659 .04236344
 78 "Mansfield"     1127.379 .04357293
 14 "Branford"     1171.5803 .04192901
120 "Roxbury"      1164.8457 .04217142
 87 "Morris"       1149.4806 .04273513
 85 "Monroe"       1193.6835 .04252437
120 "Roxbury"       1202.671 .04220659
 28 "Colchester"   1193.0332 .04254755
 77 "Manchester"   1179.5166 .04303512
 19 "Brooklyn"      1188.686 .04270315
 61 "Haddam"       1203.8093 .04216668
  5 "Barkhamsted"   1179.927 .04302016
 81 "Middlebury"   1204.8436 .04213048
153 "Watertown"    1204.8436 .04213048
168 "Woodbury"     1187.9658 .04272904
169 "Woodstock"    1191.3422 .04260794
 44 "East Haven"   1202.6604 .04220696
135 "Stamford"      1215.113 .04177442
 27 "Clinton"        1213.57 .04182753
 66 "Harwinton"    1193.6156 .04252679
 31 "Cornwall"     1193.9282 .04251565
 32 "Coventry"     1196.6082 .04242044
 97 "Newtown"      1209.0032 .04198553
109 "Plainfield"   1188.9944 .04269208
  5 "Barkhamsted"  1190.1978 .04264891
 66 "Harwinton"    1195.6924 .04245292
 81 "Middlebury"   1192.7975 .04255595
166 "Wolcott"      1197.9694 .04237223
169 "Woodstock"    1180.0409   .043016
 38 "Durham"       1199.9655 .04230175
 70 "Killingworth"    1201.9 .04223366
152 "Waterford"      1224.58 .04145147
 30 "Columbia"     1203.0897  .0421919
 32 "Coventry"      1195.866 .04244676
 75 "Lyme"         1198.0687 .04236872
108 "Oxford"        1193.616 .04252678
 70 "Killingworth"  1241.613 .04220162
134 "Stafford"      1238.725 .04230001
 72 "Ledyard"      1246.0105 .04205268
 97 "Newtown"        1252.84 .04182344
 38 "Durham"       1238.2623 .04231581
120 "Roxbury"       1246.007 .04205279
end

------------------ copy up to and including the previous line ------------------

Listed 50 out of 4295 observations

Tags: None

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1478
#2

10 Oct 2022, 11:43

I'm not sure your problem is well defined. Since Annual_MWh and moderate_2022 are continuous variables, there appear to be almost no ties between them. You can probably sort by one OR the other. Also, whether you do

Code:

gsort moderate_2022 -Annual_MWh

or

Code:

gsort -Annual_MWh moderate_2022

will produce different results. Could you clarify?
Comment
Adam Gallaher

Join Date: Feb 2021

Posts: 41
#3

10 Oct 2022, 11:51

I should have explained more. moderate_2022 is a calculated value of the levelized cost of energy based on cost assumptions and the annual_MWhs variable. So I am looking for out of the entire dataset, the projects that produce the most amount of energy at the lowest LCOE (moderate_2022). I also want to subset the data to produce a even or nearly even number of observations across my groups (town_no) up to the point where all projects in that subset = 163441 MWhs or 165000 MWhs.

Does that clarify things a bit?
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1478
#4

10 Oct 2022, 11:53

Assuming you are happy with the way you were sorting the data, perhaps you need something like this? This keeps the best observation for each town, and then keeps the towns within the Annual_MWh criterion.

Code:

gen neg_annual_mwh = -Annual_MWh bys town_no (moderate_2022 neg_annual_mwh): keep if _n == 1 sort moderate_2022 neg_annual_mwh drop neg_annual_mwh gen Annual_RPS_MWhs = sum(Annual_MWh) keep if Annual_RPS_MWhs <= 163441

Edit: this was cross-posted with #3. That seems to suggest you don't want one observation per town, you actually want the best projects within the cumulative limit for EACH town. If so, this is not the code that will do it.

Last edited by Hemanshu Kumar; 10 Oct 2022, 11:57.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1478
#5

10 Oct 2022, 12:04

Adam Gallaher I am still not sure your problem is well-specified.
Given that moderate_2022 and Annual_MWh are continuous variables, you need to be more specific than "projects that produce the most amount of energy at the lowest LCOE". You can sort by the lowest LCOE, or by the most amount of energy, but in what way do you want to sort by BOTH criteria?

Your cumulative target is also unclear to me. Your code keep if Annual_RPS_MWhs <= 163441 | Annual_RPS_MWhs < 165000 has a redundancy -- whenever Annual_RPS_MWhs is less than 163,441, it is also less than 165,000. So do you just want keep if Annual_RPS_MWhs < 165000 ? Or do you have something else in mind?
Comment
Adam Gallaher

Join Date: Feb 2021

Posts: 41
#6

10 Oct 2022, 12:37

Hemanshu Kumar I understand what you are saying. What I was thinking was, of all possible projects in a given group (town_no) I would like to sort by the projects (observations) that produce the most energy (annual_MWhs) at the lowest cost (moderate_2022). However, perhaps it is easier to simply identify all possible projects in a given group (town_no) which have the lowest cost (moderate_2022).

If I gen annual_RPS_MWhs I cannot guarantee that a value is equal to or less than 163,441 so I might end up with values totaling less than 163,441 in which case those results are not satisfactory. This is why I added the or (|) less than 165,000 to ensure I find values totaling to 163,411. I guess I could only use the second search criteria and keep if annual_RPS_MWhs > 165,000
Comment

Announcement

Sorting data ascending and descending based on variables by group

Comment

Comment

Comment

Comment

Comment