How to calculate a mean value using bysort command?

Mohamud Hussein

Join Date: Jan 2015
Posts: 99

How to calculate a mean value using bysort command?

28 Jul 2017, 02:19

Hi,

I have the variable summarised below which capture 9 (aggregated) groups of foods. I would like to calculate a mean value of a set of other variables in my data such as unit_price (unitp), expenditure(exp), quantity etc. The ultimate goal is to regress (using sureg) mean exp on mean unip and other explanatory so that I get 9 coefficient estimates, one for each group.

Would the bysort command below be useful in this case?,

Code:

bysort itemid:egen avexp=mean (exp)

And if yes, how I can write the sureg command so I can get an estimate for each food group? Grateful if you can tell me macro option(s) to write the command line efficiently.

Code:

 itemid Freq.
Percent
Cum.




grains
74,936
15.81
15.81

fruit_vegs
124,895
26.35
42.16

pulses
27,755
5.86
48.01

meat
39,080
8.24
56.26

fish
11,816
2.49
58.75

milk_dairy
23,752
5.01
63.76

drinks_bever
31,444
6.63
70.39

oils_fats
23,547
4.97
75.36

others
116,800
24.64
100.00




Total
474,025
100.00

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

28 Jul 2017, 08:21

The code you show for the average expressions by itemid looks mostly correct. I would remove the space between mean and (. It may or may not cause problems, because sometimes the parser allows this, but it might. Safer to just not have that space there.

As for the rest of your question, you do not provide enough information about the data itself, nor the model you want to implement for anyone to give you a specific answer.

That said, if you are looking to do item-specific regression analyses, -sureg- may not be your best approach. -sureg- is more commonly used when you want to do simultaneous regressions with different outcome variables, and especially so if the predictor variables are not the same for each outcome but the error terms are likely to be correlated. This may well be your situation, in which case, do proceed. But what you describe sounds more like item-specific regressions of a single outcome variable and a single set of predictors. If that's what you have, the simplest way to do that is to write a single regression in which all of those predictors are interacted with itemid. Your description isn't clear enough for me to really know which direction is appropriate for you.
Comment
Mohamud Hussein

Join Date: Jan 2015

Posts: 99
#3

28 Jul 2017, 14:31

Thanks Clyde. I will remove the space.

Regarding my second question, the mean values for my regressor (-exp-) and explanatory variables vary across the 9 food groups under -itemid-, so I intend to run 9 simultaneous equations using -sureg-.

What I am not sure is how to write the command line correctly so that STATA recognises -exp- and my explanatory variables vary across the 9 groups. When I run the command below I got a single coefficient estimate rather than 9 different coefficient estimates.

Code:

sureg avexp avc

Code:

avexp | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexp | avc | -.5873171 .0253241 -23.19 0.000 -.6369514 -.5376828 _cons | 6.643892 .0460813 144.18 0.000 6.553575 6.73421 ------------------------------------------------------------------------------

Thanks,
Mohamud
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

28 Jul 2017, 15:41

The syntax of -sureg- requires each regression equation to be written out separately and bound in parentheses. Here's an example:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. sureg (price mpg headroom) (displacement trunk i.foreign)

Seemingly unrelated regression
--------------------------------------------------------------------------
Equation             Obs   Parms        RMSE    "R-sq"       chi2        P
--------------------------------------------------------------------------
price                 74       2    2615.502    0.2029      11.35   0.0034
displacement          74       2    61.77244    0.5414     102.93   0.0000
--------------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
         mpg |   -174.868   52.78979    -3.31   0.001    -278.3341   -71.40196
    headroom |  -238.6282   363.1862    -0.66   0.511    -950.4601    473.2036
       _cons |   10603.75   1855.711     5.71   0.000     6966.619    14240.87
-------------+----------------------------------------------------------------
displacement |
       trunk |   7.854589   1.696489     4.63   0.000     4.529532    11.17965
             |
     foreign |
    Foreign  |   -106.597   15.23018    -7.00   0.000    -136.4476   -76.74643
       _cons |   120.9346   26.18031     4.62   0.000     69.62218    172.2471
------------------------------------------------------------------------------

.

But it doesn't actually look like you're trying to do that If what you want is to separately regress avexp against avc in each of the 9 groups, that isn't a -sureg- problem. That's:

Code:

by itemid, sort: regress avexp avc

or, better,

Code:

regress avexp avc##i.itemid

Yet another way, that creates a data file with the regression results is:

Code:

tempfile results
statsby, saving(`results') by(itemid): regress avexp avc

which will give you a file, `results', with 9 observations, each containing the coefficient and standard error from the regression for one group.

Yet another approach similar to -statsby- , but faster to run and provides fuller output:

Code:

gen obs_no = _n
rangestat (reg) avexp avc, by(itemid) interval(obs_no . .)

To use that you have to install -rangestat- from SSC.

Last edited by Clyde Schechter; 28 Jul 2017, 15:49.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#5

28 Jul 2017, 17:15

On further thought, the approaches outlined in #4 are not workable because the variable avexp, according to #1 is actually constant within each value of itemid. Therefore any attempt to do separate regressions for each level of itemid, no matter how coded, will inevitably lead to null results: the outcome variable is just a constant in each regression.

So, either Mohamud Hussein's plan is ill-conceived, or I have misunderstood it. What I understand him to want to do cannot be done. I leave it to him to decide whether I have understood him correctly (in which case he needs to make a new plan) or to try to explain it more clearly.
Comment

Mohamud Hussein

Join Date: Jan 2015
Posts: 99

29 Jul 2017, 07:46

Thanks Clyde. None of the approaches suggestion in #4 allow estimation of a separate coefficient for each group; I agree therefore I need to come up a different strategy for the calculation of mean values.

Stepping back to where I started; the code below allowed me generate the 9 groups under the -itemid-. I then used the -bysort- command to account the above mean values over -itemid-

What I am not sure is what other strategy I can adopt to calculate the mean values for my variables, avc avp and avexp, so that I can estimate a separate coefficient for each of the 9 groups in -itemid-?

Code:

 generate group=0

. replace group=1 if inlist(itemid,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)

(74,936 real changes made)

. replace group=2 if inlist(itemid,20,21,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,58,59

> ,60,61,62,63,64,65,66,67,105,109,110)

(124,895 real changes made)

. replace group=3 if inlist(itemid,22,23,24,25,26,27,28,29)

(27,755 real changes made)

. replace group=4 if inlist(itemid,47,48,49,50,51,52,57,112,113,114)

(39,080 real changes made)

. replace group=5 if inlist(itemid,54,55,56)

(11,816 real changes made)

. replace group=6 if inlist(itemid,68,69,70,71,72,73,78)

(27,816 real changes made)

. replace group=7 if inlist(itemid,78,89,90,91,92,93,94,122)

(31,444 real changes made)

. replace group=8 if inlist(itemid,83,84,85,86,117,118)

(23,547 real changes made)

. replace group=9 if inlist(itemid,120,53,74,75,76,77,79,80,81,82,87,88,95,96,97,98,99,100,101,10

> 2,103,104,106,107,108,111,115,116,119,121)

(116,800 real changes made)

. drop itemid

. rename group itemid

. label define mylabel 1 grains 2 fruit_vegs 3 pulses 4 meat 5 fish 6 milk_dairy 7 drinks_bever 8

> oils_fats 9 others

. label values itemid mylabel

. tab itemid

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#7

29 Jul 2017, 10:23

What I am not sure is what other strategy I can adopt to calculate the mean values for my variables, avc avp and avexp, so that I can estimate a separate coefficient for each of the 9 groups in -itemid-?

It is not a matter of finding a different strategy to do this. The goal itself is inappropriate: you are trying to do something that cannot, in principle, be done. When you divide your population into groups, and then use a within-group mean value as the "outcome" in a regression within each group, you are trying to develop a regression model for a constant. So you just get a constant term equal to the group mean and all other coefficients zero.

So you need to rethink what you actually are trying to calculate here. It certainly would be possible to regress the group-mean values against other variables among all 9 groups in a single regression of your entire population. For example:

Code:

regress avexp avc

That will give you a single regression that expresses the between-groups relationship between group mean exp and value of avc.

It would also make sense to regress exp (not the group-average value) against avc separately within each group (and, a slight modification of the code in #4 shows you several way to do that). That will give you estimates of the within group association of exp with avc.

But I can't really advise you here. I don't know what these variable are, nor what your research question is. I can only say that the specific task you have set for yourself cannot be done in any meaningful way.
Comment
Mohamud Hussein

Join Date: Jan 2015

Posts: 99
#8

29 Jul 2017, 12:20

Many thanks Clyde.

My data is from one cross-section survey of food consumption by 4000 households. The data includes unit price for purchases, quantity purchased, demographics etc.

I am in an early exploratory stage of the data preparation but my ultimate goal is to run a QUADS demand system for M=9 equations based SURE, as discussed here:http://ageconsearch.umn.edu/bitstrea...art_st0029.pdf

At this stage I am trying to see whether the items aggregated under each food group makes sense, in terms signs of coefficients etc based on the mean values, an approach which is not right as you described. So my question is how I can best set up data so that I get the estimates for coefficients and henceforth elasticities for each of my of 9 groups without calculating a mean value for each group.

I initially thought using the above bysort command but replace -itemid- with group labels (grains, pulses etc ) but STATA does not recognise the lgroup abels as they are not variables. So I think the only option left with me is to generate a variable for group and then calculate a mean for that new variable?

[CODE][
bysort grains:egen avexp=mean (wexp2)

variable grains not found

/CODE]
Comment
Mohamud Hussein

Join Date: Jan 2015

Posts: 99
#9

29 Jul 2017, 12:38

Many thanks Clyde.

My data is from one cross-section survey of food consumption by 4000 households. The data includes unit price for purchases, quantity purchased, demographics etc.

I am in an early exploratory stage of the data preparation but my ultimate goal is to run a QUADS demand system for M=9 equations based SURE, as discussed here:http://ageconsearch.umn.edu/bitstrea...art_st0029.pdf

At this stage I am trying to see whether the items aggregated under each food group makes sense, in terms signs of coefficients etc based on the mean values, an approach which is not right as you described. So my question is how I can best set up data so that I get the estimates for coefficients and henceforth elasticities for each of my of 9 groups without calculating a mean value for each group.

I initially thought using the above bysort command but replace -itemid- with group labels (grains, pulses etc ) but STATA does not recognise the lgroup abels as they are not variables. So I think the only option left with me is to generate a variable for group and then calculate a mean for that new variable?

[CODE][
bysort grains:egen avexp=mean (wexp2)

variable grains not found

/CODE]
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#10

29 Jul 2017, 12:41

I think you need help here from an economist or econometrician. There are several who participate in this forum, and I hope one of them will pick up the thread from here.
Comment

Announcement

How to calculate a mean value using bysort command?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment