Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sorting data ascending and descending based on variables by group

    Hello, I am looking to sort by dataset by Annual_MWh (largest to smallest) and by moderate_2022 (smallest to largest). This is to get the cheapest projects that produce the most energy. Once sorted I want to keep a subset of the entire dataset selecting an observation from each group (town_no) until all observations sum up to 163441 (Annual_MWh). I am not sure how to sort and select x observations by town_no until I reach the Annual_RPS_MWhs criteria.

    Any help is appreciated, thank you.

    What I have so far is:
    Code:
    gsort Moderate_2022 -Annual_MWh
    gen Annual_RPS_MWhs = sum(Annual_MWh)
    keep if Annual_RPS_MWhs <= 163441 | Annual_RPS_MWhs < 165000
    dataex town_no town Annual_MWh moderate_2022 in 1/50

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int town_no str16 town float(Annual_MWh moderate_2022)
     53 "Franklin"     1164.5305 .04218284
     66 "Harwinton"    1155.1118 .04252679
    168 "Woodbury"     1168.1362 .04205263
     98 "Norfolk"      1155.2405 .04252206
    130 "Southbury"    1149.0099 .04275263
     74 "Litchfield"   1158.7079 .04239481
     66 "Harwinton"    1163.7751 .04221022
    117 "Redding"      1160.5948 .04232588
     17 "Bristol"      1159.5659 .04236344
     78 "Mansfield"     1127.379 .04357293
     14 "Branford"     1171.5803 .04192901
    120 "Roxbury"      1164.8457 .04217142
     87 "Morris"       1149.4806 .04273513
     85 "Monroe"       1193.6835 .04252437
    120 "Roxbury"       1202.671 .04220659
     28 "Colchester"   1193.0332 .04254755
     77 "Manchester"   1179.5166 .04303512
     19 "Brooklyn"      1188.686 .04270315
     61 "Haddam"       1203.8093 .04216668
      5 "Barkhamsted"   1179.927 .04302016
     81 "Middlebury"   1204.8436 .04213048
    153 "Watertown"    1204.8436 .04213048
    168 "Woodbury"     1187.9658 .04272904
    169 "Woodstock"    1191.3422 .04260794
     44 "East Haven"   1202.6604 .04220696
    135 "Stamford"      1215.113 .04177442
     27 "Clinton"        1213.57 .04182753
     66 "Harwinton"    1193.6156 .04252679
     31 "Cornwall"     1193.9282 .04251565
     32 "Coventry"     1196.6082 .04242044
     97 "Newtown"      1209.0032 .04198553
    109 "Plainfield"   1188.9944 .04269208
      5 "Barkhamsted"  1190.1978 .04264891
     66 "Harwinton"    1195.6924 .04245292
     81 "Middlebury"   1192.7975 .04255595
    166 "Wolcott"      1197.9694 .04237223
    169 "Woodstock"    1180.0409   .043016
     38 "Durham"       1199.9655 .04230175
     70 "Killingworth"    1201.9 .04223366
    152 "Waterford"      1224.58 .04145147
     30 "Columbia"     1203.0897  .0421919
     32 "Coventry"      1195.866 .04244676
     75 "Lyme"         1198.0687 .04236872
    108 "Oxford"        1193.616 .04252678
     70 "Killingworth"  1241.613 .04220162
    134 "Stafford"      1238.725 .04230001
     72 "Ledyard"      1246.0105 .04205268
     97 "Newtown"        1252.84 .04182344
     38 "Durham"       1238.2623 .04231581
    120 "Roxbury"       1246.007 .04205279
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 50 out of 4295 observations


  • #2
    I'm not sure your problem is well defined. Since Annual_MWh and moderate_2022 are continuous variables, there appear to be almost no ties between them. You can probably sort by one OR the other. Also, whether you do
    Code:
    gsort moderate_2022 -Annual_MWh
    or

    Code:
    gsort -Annual_MWh moderate_2022
    will produce different results. Could you clarify?

    Comment


    • #3
      I should have explained more. moderate_2022 is a calculated value of the levelized cost of energy based on cost assumptions and the annual_MWhs variable. So I am looking for out of the entire dataset, the projects that produce the most amount of energy at the lowest LCOE (moderate_2022). I also want to subset the data to produce a even or nearly even number of observations across my groups (town_no) up to the point where all projects in that subset = 163441 MWhs or 165000 MWhs.

      Does that clarify things a bit?

      Comment


      • #4
        Assuming you are happy with the way you were sorting the data, perhaps you need something like this? This keeps the best observation for each town, and then keeps the towns within the Annual_MWh criterion.

        Code:
        gen neg_annual_mwh = -Annual_MWh
        bys town_no (moderate_2022 neg_annual_mwh): keep if _n == 1
        
        sort moderate_2022 neg_annual_mwh  
        drop neg_annual_mwh
        gen Annual_RPS_MWhs = sum(Annual_MWh)
        keep if Annual_RPS_MWhs <= 163441
        Edit: this was cross-posted with #3. That seems to suggest you don't want one observation per town, you actually want the best projects within the cumulative limit for EACH town. If so, this is not the code that will do it.
        Last edited by Hemanshu Kumar; 10 Oct 2022, 11:57.

        Comment


        • #5
          Adam Gallaher I am still not sure your problem is well-specified.
          • Given that moderate_2022 and Annual_MWh are continuous variables, you need to be more specific than "projects that produce the most amount of energy at the lowest LCOE". You can sort by the lowest LCOE, or by the most amount of energy, but in what way do you want to sort by BOTH criteria?
          • Your cumulative target is also unclear to me. Your code keep if Annual_RPS_MWhs <= 163441 | Annual_RPS_MWhs < 165000 has a redundancy -- whenever Annual_RPS_MWhs is less than 163,441, it is also less than 165,000. So do you just want keep if Annual_RPS_MWhs < 165000 ? Or do you have something else in mind?

          Comment


          • #6
            Hemanshu Kumar I understand what you are saying. What I was thinking was, of all possible projects in a given group (town_no) I would like to sort by the projects (observations) that produce the most energy (annual_MWhs) at the lowest cost (moderate_2022). However, perhaps it is easier to simply identify all possible projects in a given group (town_no) which have the lowest cost (moderate_2022).

            If I gen annual_RPS_MWhs I cannot guarantee that a value is equal to or less than 163,441 so I might end up with values totaling less than 163,441 in which case those results are not satisfactory. This is why I added the or (|) less than 165,000 to ensure I find values totaling to 163,411. I guess I could only use the second search criteria and keep if annual_RPS_MWhs > 165,000

            Comment

            Working...
            X