Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding difference between variables based on identifier

    Hi All,

    I am attempting to calculate the difference in means between two variables existing in different geographic regions.

    I have two variables avg1 and avg2 that calculate the mean score of a specific variable based on each communities location within the country. There two averages represent different community types (eg. "Rural vs Urban"). I now want to calculate the difference in the averages between rural and urban communities in the same geographic location.

    So the variables I am working with are:
    identifier = the variable that categorizes the communities (amalgamates a regional and census division code to identify specific locations) STRING
    avg1 = average wellbeing score of Urban communities within a specific region, based on identifier. FLOAT var
    avg2 = average wellbeing score of Rural Communities, within a specific region, based on identifier.FLOAT


    I tried codes like gen difference = avg1 - avg2 but that doesn't work, and it doesn't specific how to select only those avg1 and avg2s that occur in the same geo graphic region.

    I apologize in advance if this is unclear, I'm a newbie at Stata.

    Many thanks in advance for the help,
    Linz
    Last edited by linz mcginn; 06 Apr 2016, 12:17.

  • #2
    As urged in your last thread, example datasets help mightily. Please do look at FAQ Advice #12 and drop in quickly at #18 before returning.

    Here is a guess at what is happening.

    Presumably the urban and rural communities are in different observations (rows) of your dataset. Nothing in your code tells Stata to look elsewhere than in the current observations. Indeed presumably your first average is not even defined (non-missing) for regions in the second group, and vice versa.

    Here is an example you can run. You can copy and paste the code into a new do-file editor window in a new Stata.

    We want the difference in mean mpg between foreign and domestic cars with the same report record. The trick is just to ignore the foreign cars while calculating the mean for the domestic cars, but to put results in all pertinent observations, and vice versa. Then the differences are just differences in each observation. Those differences are necessarily repeated.

    Code:
    * start copy here 
    sysuse auto, clear 
    egen meanf = mean(cond(foreign == 1, mpg, .)), by(rep78) 
    egen meand = mean(cond(foreign == 0, mpg, .)), by(rep78) 
    gen diff = meanf - meand 
    tabdisp rep78, cell(meanf meand diff
    * stop copy here 
    
    * output should be 
    
    ----------------------------------------------
    Repair    |
    Record    |
    1978      |      meanf       meand        diff
    ----------+-----------------------------------
            1 |                     21            
            2 |                 19.125            
            3 |   23.33333          19    4.333334
            4 |   24.88889    18.44444    6.444445
            5 |   26.33333          32   -5.666666
            . |         14       23.25       -9.25
    ----------------------------------------------
    Actually, this kind of thing is tricky not just for beginners. There's a tutorial at http://www.stata-journal.com/sjpdf.h...iclenum=dm0055

    There are other ways to do it too.



    Comment


    • #3
      Thanks Nick, have read the FAQs, will endeavor to improve my posts, your feedback and assistance are appreciated.

      Just to clarify a bit about my dataset:

      Urban and Rural communities are in different rows and my problem is in trying to tell Stata to only look in the rows that have the same identifier and then subtract one mean from the other.

      There are often multiple urban communities and one or two rural communities within each 'identifier' region. There are many identifier regions where there are only urban or only rural communities.

      However, only one average is reported in each row. Where urban communities have a mean score reported the spot for rural mean is blank, and vice versa.

      I ran your code and did a sum to get this output:
      Code:
      . sum diff
      
       Variable Obs Mean Std. Dev. Min Max
            
       diff 3242 .0298086 .3146975 -2.375 2.314007
      
      .
      Which looked encouraging but on closer inspection with the Data Editor, in fact the majority of rows are reporting either a zero or a blank. I will go through the tutorial you suggest above but if there's something obvious I am missing please let me know. Thanks
      Last edited by linz mcginn; 06 Apr 2016, 13:08.

      Comment


      • #4
        My code is designed to cope with the situation sketched in your opening.

        But (broken record here) you'd need to post

        your exact code

        your exact results for an example dataset (just a small part of your data would suffice)

        and

        your precise comments on why results are puzzling

        to allow specific comment.

        Comment


        • #5
          Found my error!

          My code:
          Code:
          egen meanf = mean(cond(rural == 1, var1, .)), by(identifier)
          egen meand = mean(cond(urban == 0, var1, .)), by(identifier)
          gen diff = meanf - meand
          Here var1 is a community specific well-being measurement, that returns a specific value for each community. I am trying to determine the difference in mean var1 between rural and urban communities in the same geographic region.

          As you can see I coded the urban syntax incorrectly, I have since changed the 0 to a 1 and now I have been able to calculate the difference.

          Thanks for your assistance and patient Nick!

          Comment


          • #6
            Good. Here's another way to do it.

            Code:
            sysuse auto, clear 
            collapse (mean) mpg, by(foreign rep78) 
            reshape wide mpg, i(rep78) j(foreign) 
            gen diff = mpg1 - mpg0 
            list , sep(0) 
            
                 +---------------------------------------+
                 | rep78      mpg0      mpg1        diff |
                 |---------------------------------------|
              1. |     1        21         .           . |
              2. |     2    19.125         .           . |
              3. |     3        19   23.3333    4.333334 |
              4. |     4   18.4444   24.8889    6.444445 |
              5. |     5        32   26.3333   -5.666666 |
              6. |     .     23.25        14       -9.25 |
                 +---------------------------------------+
            Inserting list or edit commands along the way will show more of what is happening. This produces a reduced dataset, good for some purposes but not all; it could always be merged back with the main dataset.

            Comment


            • #7
              egen's tag() function does precisely this. See the help for egen. For more than you want to know, see http://www.statalist.org/forums/foru...n-gets-tag-ged

              Comment

              Working...
              X