Finding difference between variables based on identifier

linz mcginn

Join Date: Mar 2016

Posts: 12
#1

Finding difference between variables based on identifier

06 Apr 2016, 11:28

Hi All,

I am attempting to calculate the difference in means between two variables existing in different geographic regions.

I have two variables avg1 and avg2 that calculate the mean score of a specific variable based on each communities location within the country. There two averages represent different community types (eg. "Rural vs Urban"). I now want to calculate the difference in the averages between rural and urban communities in the same geographic location.

So the variables I am working with are:
identifier = the variable that categorizes the communities (amalgamates a regional and census division code to identify specific locations) STRING
avg1 = average wellbeing score of Urban communities within a specific region, based on identifier. FLOAT var
avg2 = average wellbeing score of Rural Communities, within a specific region, based on identifier.FLOAT

I tried codes like gen difference = avg1 - avg2 but that doesn't work, and it doesn't specific how to select only those avg1 and avg2s that occur in the same geo graphic region.

I apologize in advance if this is unclear, I'm a newbie at Stata.

Many thanks in advance for the help,
Linz

Last edited by linz mcginn; 06 Apr 2016, 12:17.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

06 Apr 2016, 11:59

As urged in your last thread, example datasets help mightily. Please do look at FAQ Advice #12 and drop in quickly at #18 before returning.

Here is a guess at what is happening.

Presumably the urban and rural communities are in different observations (rows) of your dataset. Nothing in your code tells Stata to look elsewhere than in the current observations. Indeed presumably your first average is not even defined (non-missing) for regions in the second group, and vice versa.

Here is an example you can run. You can copy and paste the code into a new do-file editor window in a new Stata.

We want the difference in mean mpg between foreign and domestic cars with the same report record. The trick is just to ignore the foreign cars while calculating the mean for the domestic cars, but to put results in all pertinent observations, and vice versa. Then the differences are just differences in each observation. Those differences are necessarily repeated.

Code:

* start copy here sysuse auto, clear egen meanf = mean(cond(foreign == 1, mpg, .)), by(rep78) egen meand = mean(cond(foreign == 0, mpg, .)), by(rep78) gen diff = meanf - meand tabdisp rep78, cell(meanf meand diff * stop copy here * output should be ---------------------------------------------- Repair | Record | 1978 | meanf meand diff ----------+----------------------------------- 1 | 21 2 | 19.125 3 | 23.33333 19 4.333334 4 | 24.88889 18.44444 6.444445 5 | 26.33333 32 -5.666666 . | 14 23.25 -9.25 ----------------------------------------------

Actually, this kind of thing is tricky not just for beginners. There's a tutorial at http://www.stata-journal.com/sjpdf.h...iclenum=dm0055

There are other ways to do it too.
Comment
linz mcginn

Join Date: Mar 2016

Posts: 12
#3

06 Apr 2016, 12:58

Thanks Nick, have read the FAQs, will endeavor to improve my posts, your feedback and assistance are appreciated.

Just to clarify a bit about my dataset:

Urban and Rural communities are in different rows and my problem is in trying to tell Stata to only look in the rows that have the same identifier and then subtract one mean from the other.

There are often multiple urban communities and one or two rural communities within each 'identifier' region. There are many identifier regions where there are only urban or only rural communities.

However, only one average is reported in each row. Where urban communities have a mean score reported the spot for rural mean is blank, and vice versa.

I ran your code and did a sum to get this output:

Code:

. sum diff Variable Obs Mean Std. Dev. Min Max diff 3242 .0298086 .3146975 -2.375 2.314007 .

Which looked encouraging but on closer inspection with the Data Editor, in fact the majority of rows are reporting either a zero or a blank. I will go through the tutorial you suggest above but if there's something obvious I am missing please let me know. Thanks

Last edited by linz mcginn; 06 Apr 2016, 13:08.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35734
#4

06 Apr 2016, 13:07

My code is designed to cope with the situation sketched in your opening.

But (broken record here) you'd need to post

your exact code

your exact results for an example dataset (just a small part of your data would suffice)

and

your precise comments on why results are puzzling

to allow specific comment.
Comment
linz mcginn

Join Date: Mar 2016

Posts: 12
#5

06 Apr 2016, 13:22

Found my error!

My code:

Code:

egen meanf = mean(cond(rural == 1, var1, .)), by(identifier) egen meand = mean(cond(urban == 0, var1, .)), by(identifier) gen diff = meanf - meand

Here var1 is a community specific well-being measurement, that returns a specific value for each community. I am trying to determine the difference in mean var1 between rural and urban communities in the same geographic region.

As you can see I coded the urban syntax incorrectly, I have since changed the 0 to a 1 and now I have been able to calculate the difference.

Thanks for your assistance and patient Nick!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35734

06 Apr 2016, 15:12

Good. Here's another way to do it.

Code:

sysuse auto, clear 
collapse (mean) mpg, by(foreign rep78) 
reshape wide mpg, i(rep78) j(foreign) 
gen diff = mpg1 - mpg0 
list , sep(0) 

     +---------------------------------------+
     | rep78      mpg0      mpg1        diff |
     |---------------------------------------|
  1. |     1        21         .           . |
  2. |     2    19.125         .           . |
  3. |     3        19   23.3333    4.333334 |
  4. |     4   18.4444   24.8889    6.444445 |
  5. |     5        32   26.3333   -5.666666 |
  6. |     .     23.25        14       -9.25 |
     +---------------------------------------+

Inserting list or edit commands along the way will show more of what is happening. This produces a reduced dataset, good for some purposes but not all; it could always be merged back with the main dataset.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35734
#7

07 Apr 2016, 13:24

egen's tag() function does precisely this. See the help for egen. For more than you want to know, see http://www.statalist.org/forums/foru...n-gets-tag-ged
Comment

Announcement

Finding difference between variables based on identifier

Comment

Comment

Comment

Comment

Comment

Comment