
  • Exporting Mean/Median of Variable X by variables Y and Z

    I am using data from the IPUMS-CPS ASEC series. I am looking to calculate the median income for each age and each education category. A sample of the data for 16-year-olds only is as follows:
    Code:
    year    age   educ    inc    weight
    1992    16    "HS"    1000    20
    1992    16    "CG"    1200    23
    1993    16    "CG"    1400    24
    1994    16    "HS"    1500    19
    In short, there are data from multiple years with different educational categories. What I want to do is calculate the median income for 16-year-olds (and subsequently for all ages from 16 to 65) by age and education.

    I have tried the following:
    Code:
    bysort age educ: egen inc_median_age_educ = median(inc)
    I think this works well, but there are two issues:

    1. I cannot include the weight variable in this calculation, which makes the result wrong. Going by this FAQ, I am also not sure I can include two grouping variables: https://www.stata.com/support/faqs/d...ry-statistics/. Kindly correct me if I'm wrong.

    2. Would there be a way to export this to an external txt/xls file rather than adding an additional column? For now I am using the following:

    Code:
    #delimit ;
    tabulate age [aw = weight], summ(inc_median_age_educ) means;
    /*or*/
    
    /*doesn't work, creates N and mean for age and education separately*/
    tabout age educ [aw = weight] using try.xls, replace format(8) ptotal(none) lines(none)
    sum cells(N inc_median_age_educ mean inc_median_age_educ);
    Would there be a way to compute this mean/median with tabout, since it allows exporting without creating additional columns?

    I am using Stata/SE 16.1 on Windows 10.

  • #2
    I think you may want to look at the collapse command. A starting point would be
    Code:
    collapse (median) median_inc_age_educ=inc [aw = weight], by(age educ)
    which replaces the dataset in memory with one observation for each combination of age and educ. You can then use -export delimited- or -export excel- to create a text file or Excel worksheet containing this data.
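    For example, a minimal sketch of the export step (the filename medians.xlsx is just a placeholder):
    Code:
    * assumes the -collapse- command above has already been run
    export excel using "medians.xlsx", firstrow(variables) replace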



    • #3
      You don't say what kind of weights the weight variable is. If it is an fweight or aweight, you can do this:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int year byte age str4 educ int inc byte weight
      1992 16 `""HS""' 1000 20
      1992 16 `""CG""' 1200 23
      1993 16 `""CG""' 1400 24
      1994 16 `""HS""' 1500 19
      end
      
      capture program drop one_group
      program define one_group
          summ inc [aweight = weight], detail // OR fweight ALSO ALLOWED
          gen weighted_median_inc = `r(p50)'
          exit
      end
      
      runby one_group, by(age educ)
      -runby- is written by Robert Picard and me, and is available from SSC.

      If these are iweights or pweights, then I think you would instead have to use -qreg- to estimate the weighted median. But that won't work in your example data because there aren't enough observations for -qreg- to run.
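      A minimal sketch of that approach, assuming pweights and enough observations in each group (the program name one_group_pw is just illustrative):

      Code:
      capture program drop one_group_pw
      program define one_group_pw
          // constant-only median regression: the estimated constant is the weighted median
          qreg inc [pweight = weight]
          gen weighted_median_inc = _b[_cons]
          exit
      end
      
      runby one_group_pw, by(age educ)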

      As for exporting the results somewhere else, after you have created the weighted median variable you can do:

      Code:
      collapse (first) weighted_median_inc, by(age educ)
      and then you can export that in the usual ways (-export delimited-, -export excel-, whatever.)
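      For example, a minimal sketch (the filename medians.csv is just a placeholder):
      Code:
      * assumes the -collapse (first)- step above has already been run
      export delimited using "medians.csv", replace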

      In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
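
      A minimal sketch of that workflow (the variable list and observation range here are just illustrative):
      Code:
      ssc install dataex        // only needed if -dataex- is not already installed
      help dataex               // read the instructions
      dataex year age educ inc weight in 1/10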

      Added: Crossed with #2. William Lisowski's solution is better.



      • #4
        Thanks William and Clyde for your replies.
        I have two follow-up questions; please look into them if you can.

        This is a sample dataset generated using -dataex- and -randomtag-, which I hadn't used previously, sorry about that! If you require more observations, please let me know.
        I think the weights are aweights.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int year byte age int educ_new float incwage_defl double asecwt
        1971 52 4        0 1266.69
        1971 52 1    47234 1419.43
        1980 32 4 41101.16  245.14
        1980 31 3  26117.1  720.98
        1982 44 2  28961.4  365.43
        1987 23 2 18337.28 1774.19
        1997 50 2        0 3563.42
        1998 33 3        0    2435
        1999 37 2 26306.28   331.7
        2002 39 4    65870 2512.95
        2002 41 2    10351  806.42
        2008 47 4        0 1707.23
        2009 64 1        0 1195.99
        2011 19 1        0 1191.23
        2012 58 2    66690 1219.54
        end
        label values age age_lbl
        label def age_lbl 19 "19", modify
        label def age_lbl 23 "23", modify
        label def age_lbl 31 "31", modify
        label def age_lbl 32 "32", modify
        label def age_lbl 33 "33", modify
        label def age_lbl 37 "37", modify
        label def age_lbl 39 "39", modify
        label def age_lbl 41 "41", modify
        label def age_lbl 44 "44", modify
        label def age_lbl 47 "47", modify
        label def age_lbl 50 "50", modify
        label def age_lbl 52 "52", modify
        label def age_lbl 58 "58", modify
        label def age_lbl 64 "64", modify
        label values educ_new neweduc
        label def neweduc 1 "Less than high school", modify
        label def neweduc 2 "High school", modify
        label def neweduc 3 "Some college", modify
        label def neweduc 4 "College degree and above", modify
        1. I looked into the collapse command; however, I would prefer not to change the dataset itself, but rather just generate tables if possible. -tabulate- and -tabstat- were the options I knew of, so if you have any suggestions for something I can look into that keeps the dataset intact, please let me know!

        2. Since median cannot be used with gen, should I be using something like this (egen)?

        Code:
        capture program drop one_group
        program define one_group
            summ incwage_defl [aweight = asecwt], detail // OR fweight ALSO ALLOWED
            egen weighted_median_inc = `median()'
            exit
        end
        
        runby one_group, by(age educ)



        • #5
          2. Since median cannot be used with gen, should I be using something like this (egen)?
          No. The code I showed in #3 correctly generates a weighted median variable. `r(p50)' is the way you access the median after -summ, detail-, and after most commands that return a median. The code you propose in #4 will not work: it will just give you a syntax error.
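
          For example, a quick way to see this, using Stata's shipped auto dataset:
          Code:
          sysuse auto, clear
          summarize price, detail
          display r(p50)    // the median reported by -summarize, detail-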

          To create the tables you want without modifying the data set (much) you can do this:
          Code:
          egen age_educ = group(age educ_new), label
          tabstat incwage_defl [aweight = asecwt], by(age_educ) statistics(p50)
          provided your weight is an aweight or fweight.

          If the new variable age_educ really irritates you, you can always -drop- it after the tabulation.



          • #6
            Got it Clyde, thank you!
            Thanks for explaining the median syntax, and -tabstat- does the trick.



            • #7
              Dear Clyde Schechter

              I want to create a remoteness index between two countries. That is:

              [Image: Remoteness.PNG — formula for the remoteness index]

              The first term is the natural logarithm of the mean distance of country i from all of its trading partners except j, the second term is the natural logarithm of the mean distance of country j from all of its trading partners except i, and the index is the simple average of the two terms. The data in my case look like:
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str3(iso_i iso_j) double distance
              "ABW" "AFG"   13257.814453125
              "ABW" "AGO"   9516.9130859375
              "ABW" "AIA" 983.2682495117188
              "ABW" "ALB"      9091.7421875
              "AFG" "ABW"   13257.814453125
              "AFG" "AGO"   7607.7998046875
              "AFG" "AIA"   12292.158203125
              "AFG" "ALB"  4342.11669921875
              "AGO" "ABW"   9516.9130859375
              "AGO" "AFG"   7607.7998046875
              "AGO" "AIA"   8901.4638671875
              "AGO" "ALB"  5623.82666015625
              end
              The distance measures are symmetric; the distance between ABW-AFG is the same as the distance between AFG-ABW and so on.

              Thanks,
              (Ridwan)






              • #8
                This question has nothing to do with the topic of this thread. While it is easy to think of Forum threads as dialogs between a questioner and a responder, they are also read, and searched for, by other Forum members and visitors to the site. Those people will generally choose what to read based on the thread title. By including extraneous material, you make those who came looking for the topic in the thread title waste their time, and you prevent those who are searching for the off-topic material from finding it. In the future, please always start a new thread when you change topic.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str3(iso_i iso_j) double distance
                "ABW" "AFG"   13257.814453125
                "ABW" "AGO"   9516.9130859375
                "ABW" "AIA" 983.2682495117188
                "ABW" "ALB"      9091.7421875
                "AFG" "ABW"   13257.814453125
                "AFG" "AGO"   7607.7998046875
                "AFG" "AIA"   12292.158203125
                "AFG" "ALB"  4342.11669921875
                "AGO" "ABW"   9516.9130859375
                "AGO" "AFG"   7607.7998046875
                "AGO" "AIA"   8901.4638671875
                "AGO" "ALB"  5623.82666015625
                end
                
                //  VERIFY NECESSARY ASSUMPTIONS
                //  ASSUMPTION 1: ISO_I AND ISO_J ARE ALWAYS DIFFERENT
                assert iso_i != iso_j
                
                //  ASSUMPTION 2: EVERY ISO_I ISO_J PAIR HAS SAME DISTANCE AS ISO_J ISO_i
                gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                by pair (distance), sort: assert distance[1] == distance[_N]
                
                tempfile copy
                save `copy'
                
                //  FOR PAIRS ISO_I ISO_J THAT ONLY OCCUR IN ONE ORDER, CREATE
                //  A REVERSED ORDER PAIR
                by pair: gen expander = cond(_N == 1, 2, 1)
                expand expander
                by pair, sort: gen temp = iso_i if expander == 2 & _n == 2
                by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                by pair: replace iso_j = temp if expander == 2 & _n == 2
                
                drop expander temp
                isid iso_i iso_j
                
                rangestat (mean) t1 = distance, by(iso_i) excludeself interval(distance . .)
                rangestat (mean) t2 = distance, by(iso_j) excludeself interval(distance . .)
                gen wanted = 0.5*(log(t1) + log(t2))
                
                merge 1:1 iso_i iso_j using `copy', keep(match using) nogenerate
                There are key assumptions required for this code to work properly, and they are verified at the top of the code.
                Note: -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

                Added: The formula you cite in #7 is incorrect. In both terms, N-1 should be (N-1).
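
                In other words, the intended index is presumably something like this (my notation; d denotes bilateral distance, and each sum runs over the N-1 distances being averaged):

                \[
                \mathrm{Rem}_{ij} = \frac{1}{2}\left[\ln\!\left(\frac{1}{N-1}\sum_{k \neq j} d_{ik}\right) + \ln\!\left(\frac{1}{N-1}\sum_{k \neq i} d_{jk}\right)\right]
                \]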
                Last edited by Clyde Schechter; 25 Mar 2023, 10:17.



                • #9
                  Thank you, Clyde Schechter. I am sorry for posting this query here. I shall be very careful in the future.


                  Added: I am sorry for missing it before, but in the original data, some observations give a country's distance from itself (a non-negative number). That is, there exist pairs where iso_i == iso_j (ABW-ABW, AFG-AFG, etc.). Does this make any difference in creating the remoteness index? And what do you mean by "iso_i and iso_j are always different" (assumption 1)? I am not clearly getting it.

                  Thanks,
                  (Ridwan)
                  Last edited by Ridwan Sheikh; 25 Mar 2023, 13:15.



                  • #10
                    What I meant when I said iso_i and iso_j must be different is precisely what you now say is not true of your data. You have said that you want to compute the remoteness of a country from itself with this formula, which my code in #8 would not permit.

                    Here is modified code that will allow iso_i == iso_j:
                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str3(iso_i iso_j) double distance
                    "ABW" "AFG"   13257.814453125
                    "ABW" "AGO"   9516.9130859375
                    "ABW" "AIA" 983.2682495117188
                    "ABW" "ALB"      9091.7421875
                    "AFG" "ABW"   13257.814453125
                    "AFG" "AGO"   7607.7998046875
                    "AFG" "AIA"   12292.158203125
                    "AFG" "ALB"  4342.11669921875
                    "AGO" "ABW"   9516.9130859375
                    "AGO" "AFG"   7607.7998046875
                    "AGO" "AIA"   8901.4638671875
                    "AGO" "ALB"  5623.82666015625
                    "AGO" "AGO"  1234.567890
                    end
                    
                    //  VERIFY NECESSARY ASSUMPTIONS
                    
                    //  ASSUMPTION: EVERY ISO_I ISO_J PAIR HAS SAME DISTANCE AS ISO_J ISO_i
                    gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                    by pair (distance), sort: assert distance[1] == distance[_N]
                    
                    // ASSUMPTION: NO PAIR APPEARS TWICE IN THE DATA IN THE SAME ORDER
                    isid iso_i iso_j
                    
                    tempfile copy
                    save `copy'
                    
                    //  FOR PAIRS ISO_I ISO_J THAT ONLY OCCUR IN ONE ORDER, CREATE
                    //  A REVERSED ORDER PAIR
                    by pair: gen expander = cond(_N == 1, 2, 1)
                    expand expander
                    by pair, sort: gen temp = iso_i if expander == 2 & _n == 2
                    by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                    by pair: replace iso_j = temp if expander == 2 & _n == 2
                    
                    drop expander temp
                    encode iso_i, gen(n_iso_i)
                    encode iso_j, gen(n_iso_j)
                    
                    rangestat (count) N1 = n_iso_j (mean) t1 = distance, by(iso_i) excludeself interval(distance . .)
                    rangestat (count) N2 = n_iso_i (mean) t2 = distance, by(iso_j) excludeself interval(distance . .)
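                    //  FOR SELF-PAIRS (ISO_I == ISO_J), REMOVE THE SELF-DISTANCE FROM THE COMPUTED MEANS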
                    replace t1 = (N1*t1 - distance)/(N1-1) if iso_i == iso_j
                    replace t2 = (N2*t2 - distance)/(N2-1) if iso_i == iso_j
                    gen wanted = 0.5*(log(t1) + log(t2))
                    by pair, sort: keep if _n == 1
                    drop N1 N2 t1 t2 n_iso_*
                    
                    merge 1:m pair using `copy', keep(match using) nogenerate



                    • #11
                      Thanks Clyde Schechter !
                      When I run
                      Code:
                      gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                      
                      by pair (distance), sort: assert distance[1] == distance[_N]
                      It throws an error message: (23,526 contradictions in 25,200 by-groups) assertion is false.
                      Symmetric distance is the more plausible assumption: for most pairs, the (iso_i, iso_j) distance is the same as the (iso_j, iso_i) distance. But possibly because of missing values and rounding errors (2 decimal places), the assertion fails. I have briefly checked the data (some observations), and it seems that dist_i,j = dist_j,i (symmetric distance) in most cases.

                      Please get back to me on how we should proceed in this case.

                      I am attaching my original data file here. If you can have a look at it with this code, that would be of great help.


                      Thanks and regards,
                      (Ridwan)


                      Attached Files
                      Last edited by Ridwan Sheikh; 27 Mar 2023, 04:42.



                      • #12
                        Well, on your assurance that the differences between the distances are due to missing values and small rounding errors, if this were my project I would verify that these are the only problems, and then use the average of the distances reported for the same pair. To verify that these are the only problems with distance asymmetry, I would set a tolerance level for the distance and -assert- that the difference between them is at most that much. In the code below, I have set the tolerance at 0.05, which means that the two values would round to the same value to 1 decimal place. If you want assurance that they would round to the same value to 2 decimal places, you would use 0.005 instead.

                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str3(iso_i iso_j) double distance
                        "ABW" "AFG"   13257.814453125
                        "ABW" "AGO"   9516.9130859375
                        "ABW" "AIA" 983.2682495117188
                        "ABW" "ALB"      9091.7421875
                        "AFG" "ABW"   13257.814453125
                        "AFG" "AGO"   7607.7998046875
                        "AFG" "AIA"   12292.158203125
                        "AFG" "ALB"  4342.11669921875
                        "AGO" "ABW"   9516.9130859375
                        "AGO" "AFG"   7607.7998046875
                        "AGO" "AIA"   8901.4638671875
                        "AGO" "ALB"  5623.82666015625
                        "AGO" "AGO"  1234.567890
                        end
                        
                        //  VERIFY NECESSARY ASSUMPTIONS
                        
                        //  ASSUMPTION: EVERY ISO_I ISO_J PAIR HAS SAME DISTANCE AS ISO_J ISO_i
                        gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                        by pair (distance), sort: egen max_distance = max(distance)
                        by pair (distance): egen min_distance = min(distance)
                        assert max_distance - min_distance <= 0.05
                        by pair (distance): egen mean_distance = mean(distance)
                        
                        // ASSUMPTION: NO PAIR APPEARS TWICE IN THE DATA IN THE SAME ORDER
                        isid iso_i iso_j
                        
                        tempfile copy
                        save `copy'
                        
                        //  FOR PAIRS ISO_I ISO_J THAT ONLY OCCUR IN ONE ORDER, CREATE
                        //  A REVERSED ORDER PAIR
                        by pair: gen expander = cond(_N == 1, 2, 1)
                        expand expander
                        by pair, sort: gen temp = iso_i if expander == 2 & _n == 2
                        by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                        by pair: replace iso_j = temp if expander == 2 & _n == 2
                        
                        drop expander temp
                        encode iso_i, gen(n_iso_i)
                        encode iso_j, gen(n_iso_j)
                        
                        rangestat (count) N1 = n_iso_j (mean) t1 = mean_distance, by(iso_i) excludeself interval(distance . .)
                        rangestat (count) N2 = n_iso_i (mean) t2 = mean_distance, by(iso_j) excludeself interval(distance . .)
                        replace t1 = (N1*t1 - mean_distance)/(N1-1) if iso_i == iso_j
                        replace t2 = (N2*t2 - mean_distance)/(N2-1) if iso_i == iso_j
                        gen wanted = 0.5*(log(t1) + log(t2))
                        by pair, sort: keep if _n == 1
                        drop N1 N2 t1 t2 n_iso_*
                        
                        merge 1:m pair using `copy', keep(match using) nogenerate
                        By the way, I never download attachments from people I do not know, so I have not looked at your full data set. I appreciate your willingness to make the full data set available. But I think working off the same -dataex- example as before does the trick here. If this code proves unsuitable for your data, please post back with another -dataex- example that exhibits whatever problem arises.



                        • #13
                          Thank you very much, Clyde Schechter.

                          I tried running your earlier code from #10 using another measure of distance, and the assertion we set about symmetric distance now holds. But I am facing a different issue.
                          1) When I run the following lines, they do not make any real changes to the data.

                          Code:
                          by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                          (0 real changes made)
                          
                          by pair: replace iso_j = temp if expander == 2 & _n == 2
                          (0 real changes made)
                          2) The following code
                          Code:
                           rangestat (count) N1 = n_iso_j (mean) t1 = distcap, by(iso_i) excludeself interval(distcap. .)
                          displayed an error message: factor variables and time-series operators not allowed.

                          When I run this code on the following dataex example
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input str3(iso_i iso_j) double distance
                          "ABW" "ABW"  5.225314617156982
                          "ABW" "AFG"    13257.814453125
                          "ABW" "AGO"    9516.9130859375
                          "ABW" "AIA"  983.2682495117188
                          "AFG" "ABW"    13257.814453125
                          "AFG" "AFG"  303.7613525390625
                          "AFG" "AGO"    7607.7998046875
                          "AFG" "AIA"    12292.158203125
                          "AGO" "ABW"    9516.9130859375
                          "AGO" "AFG"    7607.7998046875
                          "AGO" "AGO"  419.9666442871094
                          "AGO" "AIA"    8901.4638671875
                          "AIA" "ABW"  983.2682495117188
                          "AIA" "AFG"    12292.158203125
                          "AIA" "AGO"    8901.4638671875
                          "AIA" "AIA" 3.7986903190612793
                          end

                          everything works fine. All the variables get calculated as coded. My original data (which is also attached in #11) is an exact replica of the dataex example above. I am clueless as to why things work out with the dataex example but not with the original file. That is why I thought it would be of great help if you could replicate this code (as in #10). Sorry for asking for something that is not your style of work.


                          Regards,
                          (Ridwan)

