Generating New Variable As A Function of Group-Level Differences Of Another Variable

Ryan LeCount

Join Date: May 2015

Posts: 23
#1

Generating New Variable As A Function of Group-Level Differences Of Another Variable

14 Jan 2019, 21:25

Hello All, and Thanks in Advance.

I have a very large dataset for which I need to create a variable- no doubt using egen- but the process is not completely clear to me.

Cases are distributed across about 125 spatial groups and these spatial groupings in continuous variable (1-125).

I am interested in generating a variable that quantifies the mean difference in another binary subgroup variable *within spatial groups*

Example: What is the mean difference in height between subgroup A and subgroup B ----within spatial groups (1-125)

Variables: Height, Subgroup, Spatial Group

Thanks in advance for any advice, and I'm happy to provide further information and/or clarification.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#2

14 Jan 2019, 22:06

Your description leaves a fair amount to the imagination. Here's what I imagine your data to look like and what you would like to get from it. If I'm wrong, post back with example starting data (using the -dataex- command) and an illustration of what you want the end result to look like.

Code:

// CREATE DEMO DATA SET clear* set obs 125 gen group = _n expand 2 by group, sort: gen subgroup = cond(_n == 1, "A", "B") expand 5 set seed 1234 gen height = rnormal(67, 2) sort group subgroup // CALCULATE MEAN SUBGROUP HEIGHT DIFFERENCE WITHIN EACH GROUP by group subgroup, sort: egen subgroup_mean = mean(height) by group (subgroup), sort: gen mean_diff = height[1] - height[_N]

If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

David Benson

Join Date: Oct 2018
Posts: 489

14 Jan 2019, 22:38

EDIT: Crossed with Clyde's post in #2.

It would be *really* helpful if you could post a sample of your data using Stata's dataex command. There is a tutorial on using dataex on Youtube here

I have no idea if this is what your data look like at all, but here is an example:

Code:

* Example shared by -dataex-. To install: ssc install dataex
dataex id spat_group subgroup height
clear
input byte(id spat_group) str1 subgroup byte height
 1 1 "A" 59
 2 1 "A" 76
 3 1 "B" 48
 4 1 "B" 61
 5 1 "B" 76
 6 2 "A" 47
 7 2 "A" 59
 8 2 "A" 63
 9 2 "B" 72
10 2 "B" 60
11 3 "A" 64
12 3 "A" 72
13 3 "B" 67
14 3 "B" 73
15 3 "B" 71
16 4 "A" 49
17 4 "A" 71
18 4 "A" 42
19 4 "B" 69
20 4 "B" 65
end

. list, sepby(spat_group) noobs abbrev(12)

  +-------------------------------------+
  | id   spat_group   subgroup   height |
  |-------------------------------------|
  |  1            1          A       59 |
  |  2            1          A       76 |
  |  3            1          B       48 |
  |  4            1          B       61 |
  |  5            1          B       76 |
  |-------------------------------------|
  |  6            2          A       47 |
  |  7            2          A       59 |
  |  8            2          A       63 |
  |  9            2          B       72 |
  | 10            2          B       60 |
  |-------------------------------------|
  | 11            3          A       64 |
  | 12            3          A       72 |
  | 13            3          B       67 |
  | 14            3          B       73 |
  | 15            3          B       71 |
  |-------------------------------------|
  | 16            4          A       49 |
  | 17            4          A       71 |
  | 18            4          A       42 |
  | 19            4          B       69 |
  | 20            4          B       65 |
  +-------------------------------------+

sort spat_group subgroup id
egen avg_overall  = mean(height), by(spat_group )  // height avg by spatial_group
egen avg_subgroup = mean(height), by(spat_group subgroup)  // height avg by subgroup within spatial_group, 
bysort spat_group subgroup (id): gen n = _n  // just creating a counter var
order n, after(height)  // just moving the var to be after height
format avg_overall avg_subgroup %10.2fc

. list, sepby(spat_group subgroup ) noobs abbrev(12)

  +----------------------------------------------------------------------+
  | id   spat_group   subgroup   height   n   avg_overall   avg_subgroup |
  |----------------------------------------------------------------------|
  |  1            1          A       59   1         64.00          67.50 |
  |  2            1          A       76   2         64.00          67.50 |
  |----------------------------------------------------------------------|
  |  3            1          B       48   1         64.00          61.67 |
  |  4            1          B       61   2         64.00          61.67 |
  |  5            1          B       76   3         64.00          61.67 |
  |----------------------------------------------------------------------|
  |  6            2          A       47   1         60.20          56.33 |
  |  7            2          A       59   2         60.20          56.33 |
  |  8            2          A       63   3         60.20          56.33 |
  |----------------------------------------------------------------------|
  |  9            2          B       72   1         60.20          66.00 |
  | 10            2          B       60   2         60.20          66.00 |
  |----------------------------------------------------------------------|
  | 11            3          A       64   1         69.40          68.00 |
  | 12            3          A       72   2         69.40          68.00 |
  |----------------------------------------------------------------------|
  | 13            3          B       67   1         69.40          70.33 |
  | 14            3          B       73   2         69.40          70.33 |
  | 15            3          B       71   3         69.40          70.33 |
  |----------------------------------------------------------------------|
  | 16            4          A       49   1         59.20          54.00 |
  | 17            4          A       71   2         59.20          54.00 |
  | 18            4          A       42   3         59.20          54.00 |
  |----------------------------------------------------------------------|
  | 19            4          B       69   1         59.20          67.00 |
  | 20            4          B       65   2         59.20          67.00 |
  +----------------------------------------------------------------------+


. table spat_group subgroup, c(mean height) row col  format(%12.1fc) stubwidth(10)

-----------------------------------------------------
           |                 subgroup                
spat_group |            A             B         Total
-----------+-----------------------------------------
         1 |         67.5          61.7          64.0
         2 |         56.3          66.0          60.2
         3 |         68.0          70.3          69.4
         4 |         54.0          67.0          59.2
           |
     Total |         60.2          66.2          63.2
-----------------------------------------------------

This doesn't calculate the mean difference between subgroups (within a spat_group), but hopefully it can get you started (and others can come along to help).

Code:

gen byte sub_a = ( subgroup=="A")  // creates indicator for sub_group=="A"
regress height i.spat_group sub_a

      Source |       SS           df       MS      Number of obs   =        20
-------------+----------------------------------   F(4, 15)        =      1.06
       Model |  430.608333         4  107.652083   Prob > F        =    0.4079
    Residual |  1516.59167        15  101.106111   R-squared       =    0.2211
-------------+----------------------------------   Adj R-squared   =    0.0134
       Total |      1947.2        19  102.484211   Root MSE        =    10.055

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  spat_group |
          2  |  -2.841667    6.42534    -0.44   0.665    -16.53695    10.85362
          3  |        5.4   6.359437     0.85   0.409     -8.15482    18.95482
          4  |  -3.841667    6.42534    -0.60   0.559    -17.53695    9.853622
             |
       sub_a |  -4.791667   4.589529    -1.04   0.313    -14.57402    4.990682
       _cons |   65.91667   4.857101    13.57   0.000       55.564    76.26933
------------------------------------------------------------------------------

Last edited by David Benson; 14 Jan 2019, 22:45.

Comment

Ryan LeCount

Join Date: May 2015
Posts: 23

14 Mar 2019, 10:31

Thanks so much to those who have taken the time to respond and apologies for my delay in following up.

I have included the DATAEX summary below, but it occurs to me that I might frame the question more generally as my sense is that a solution is easy easier than might have been understood.

My dataset (below) includes variables at both individual and group levels. My analysis is only interested in the group level, thus I would like to generate a new dataset that aggregates all individual observations into group means. In this case, I have about ~ 27,000 observations which fall into about ~125 groups. Note that all of the observations included below are from the same group.

To be clear, my objective is that each of the individual observations be grouped to produce a variable which is the mean of all of the observations in that group. You will see that some variables are already distributed this way as they are the same value for each observation. Please let me know if more information is needed and I will follow up promptly. Thanks again- so much- for your assistance!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double group int income byte(height weight) double happiness
27053000101  41 0 1 8.4
27053000101  49 0 0 8.4
27053000101  62 1 0 8.4
27053000101  42 0 1 8.4
27053000101  38 1 1 8.4
27053000101  36 1 1 8.4
27053000101  58 0 1 8.4
27053000101  40 1 1 8.4
27053000101 106 1 1 8.4
27053000101  41 1 1 8.4
27053000101  51 1 0 8.4
27053000101 142 1 1 8.4
27053000101  40 1 1 8.4
27053000101  50 1 1 8.4
27053000101  80 1 1 8.4
27053000101  40 1 1 8.4
27053000101  32 1 1 8.4
27053000101  25 1 1 8.4
27053000101  52 0 1 8.4
27053000101  46 1 1 8.4
27053000101   . 1 1 8.4
27053000101  33 1 1 8.4
27053000101  36 0 1 8.4
27053000101  41 0 1 8.4
27053000101  30 1 1 8.4
27053000101  26 0 1 8.4
27053000101  38 0 1 8.4
27053000101  41 1 1 8.4
27053000101  51 0 1 8.4
27053000101  42 0 1 8.4
27053000101  28 0 0 8.4
27053000101   . 0 0 8.4
27053000101 131 1 1 8.4
27053000101 126 1 1 8.4
27053000101  59 1 1 8.4
27053000101   . 1 1 8.4
27053000101  36 0 1 8.4
27053000101   . 1 1 8.4
27053000101  42 0 0 8.4
27053000101  68 1 1 8.4
27053000101  95 1 0 8.4
27053000101  68 1 1 8.4
27053000101 112 1 1 8.4
27053000101  28 0 0 8.4
27053000101  62 1 1 8.4
27053000101 240 1 1 8.4
27053000101  64 1 1 8.4
27053000101  68 1 1 8.4
27053000101  25 1 1 8.4
27053000101   . 1 1 8.4
27053000101  42 0 0 8.4
27053000101  26 1 0 8.4
27053000101  28 0 1 8.4
27053000101  41 1 1 8.4
27053000101  86 1 1 8.4
27053000101 221 1 1 8.4
27053000101  28 1 1 8.4
27053000101  40 1 1 8.4
27053000101  97 0 1 8.4
27053000101  26 0 1 8.4
27053000101 360 0 0 8.4

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#5

14 Mar 2019, 11:23

-help collapse-
Comment

Announcement