How to calculate the mean with repetition?

Christine Lee

Join Date: Mar 2023

Posts: 8
#1

How to calculate the mean with repetition?

04 Mar 2023, 15:32

I have a table with the columns Department_ID and Person_ID and Salary (among other columns). I want to calculate the average salary of the departments and save it in a new column "ave_Salary". There are sometimes multiple observation with the same Person_ID. How to calculate ave_Salary (i.e. as if Person_ID never repeats) without dropping all other columns?

I think the following is the code but I am not sure:

bys Department_ID (Person_ID): egen ave_Salary= mean(Salary)

Thank you very much for your help.

Last edited by Christine Lee; 04 Mar 2023, 15:49. Reason: added tag
Tags: bysort, repetition
Clyde Schechter

Join Date: Apr 2014

Posts: 30093
#2

04 Mar 2023, 17:06

The code you wrote will count a repeating person as many times as he or she repeats. So that's not what you want. I'm going to assume that the repeating person is only in one Department. (If the same person appears in multiple departments, then it's unclear how you would go about what you ask: if you don't want to count the person more than once, you would have to provide some rule for which department gets to count that person.)

Code:

// VERIFY ANY PERSON APPEARS IN ONLY ONE DEPARTMENT by person_id (department_id), sort: assert department_id[1] == department_id[_N] // CALCULATE AVERAGE SALARY IN EACH DEPARTMENT COUNTING EACH PERSON ONLY ONCE egen tag = tag(department_id person_id) by department_id, sort: egen ave_salary = mean(cond(tag, salary, .))

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Also, for your future posts, in Stata-speak we refer to data sets as data sets, not tables. What you have called columns we call variables. And, should it come up in the future, what you might call rows are called observations. Use of this terminology will help avoid confusion.
Comment
Christine Lee

Join Date: Mar 2023

Posts: 8
#3

04 Mar 2023, 23:27

Thank you for your help and suggestion for future posts Clyde!
Comment

Christine Lee

Join Date: Mar 2023
Posts: 8

05 Mar 2023, 00:55

I just realize that there are indeed repeating person in multiple department, since there is a "Year" variable in the data and people can switch department.

Data example (unit of Salary is in thousand):

Code:

input int PersonalID float Year int DepartmentID byte Salary
300 2018 14  1
405 2018 31  4
 40 2018 31  5
564 2018 31  1
300 2019 32 2
 26 2013 32  4
 67 2013 32 11
 62 2013 32  2
 35 2014 32  1
 26 2014 32  2
 67 2014 32  4
 67 2015 32  1
 26 2015 32  1
 35 2015 32  1
 35 2016 32  1
 26 2016 32  1
 58 2017 32  1
 26 2017 32  1
 17 2018 32  1
 26 2018 32  3
 26 2019 32  2
 58 2019 32  1
 71 2019 32  1
124 2019 36  1
 36 2013 49  1
 55 2013 49  3
118 2013 49  2
 55 2014 49  2

I intend to calculate the yearly salary mean of a department (no matter whether the person has already been in a different department before), my code is (correct me if it is wrong):

Code:

by PersonalID Year (DepartmentID), sort: assert DepartmentID[1] == DepartmentID[_N]

egen tag = tag(DepartmentID PersonalID Year)
by DepartmentID Year, sort: egen dy_Salary = mean(cond(tag, Salary, .))

Last edited by Christine Lee; 05 Mar 2023, 00:59.

Comment

Christine Lee

Join Date: Mar 2023

Posts: 8
#5

05 Mar 2023, 02:20

Another related question:
If the code

Code:

by PersonalID Year (DepartmentID), sort: assert DepartmentID[1] == DepartmentID[_N]

gives a false output, what would the calculated dy_Salary generated by the next two lines of code be?
i.e. the dataset would look like this (other variables omitted):

Code:

input int PersonalID float Year int DepartmentID byte Salary byte JobID 405 2018 31 4 2 405 2018 31 4 1 564 2018 31 1 4 300 2018 14 1 3 300 2019 32 2 3

If I want to weight the calculation of dy_Salary in each "DepartmentID-Year" combination by the number of unique occurrence/ frequency of JobID for a PersonalID in each combination (i.e. the fact that the Personal 405 having two JobID (1 and 2) in Department 31 and Year 2018 is accounted for), how should I change the code?
Would the following code work?

Code:

egen tag = tag(DepartmentID PersonalID Year JobID) by DepartmentID Year, sort: egen dy_Salary = mean(cond(tag, Salary, .))

Thank you all in advance!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30093
#6

05 Mar 2023, 12:02

I'm not sure I follow what you want to do now. Here's what I think you are asking for:
Calculate mean average salary in each department in each year.

If the same person appears in two or more different departments in the same year, they can be counted separately in each department.

If the same person appears in the same department and year in two or more different job titles, each of those can be counted separately.

But if the same person appears in the same department, year, and job title more than once, only count one of those appearances.

This seems a sensible thing to do, on condition that when you do have the same person appearing in the same department, year and job title more than once, all of those observations have the same value in the Salary variable--otherwise it is unclear which value of Salary to count and which to omit. So here is code that will verify this assumption, and if true, will do the calculation you ask for:

Code:

// VERIFY CONSISTENCY OF SALARY AMONG REPEATED OBSERVATIONS // OF SAME PERSON IN SAME JOB IN SAME DEPARTMENT IN SAME YEAR by DepartmentID Year PersonalID JobID (Salary), sort: /// assert Salary[1] == Salary[_N] // CALCULATE DEPARTMENT AVERAGE SALARIES egen byte include = tag(DepartmentID Year PersonalID JobID) by Year DepartmentID (PersonalID JobID), sort: egen avg_salary /// = mean(cond(include, Salary, .))

That said, I have to ask why your data set would have more than one record for the same person in the same department in the same year with the same job. Is that kind of duplication necessary, because there are other variables that distinguish these observation (maybe different locations within the same department or something like that)? If there is no real-world reason to have those, then it suggests that the data management leading to this data set was incorrect and should be fixed to produce a clean data set.
Comment
Christine Lee

Join Date: Mar 2023

Posts: 8
#7

05 Mar 2023, 23:11

Thank you again for your help Clyde!

Actually the dataset I have does not have "more than one record for the same person in the same department in the same year with the same job (or different job)" (BTW in my example the JobID is different, see the rows with PersonalID of 405). This hypothetical example was made up so that I would know what to do when I need to do this in another context. I admit that it is confusing and I should have thought up a better example.

This is (I believe) a more intuitive example (assuming that other variables that distinguish repeated observations are omitted from the data set below.

Code:

input int ReportID float Year int BrokerageID float SectorReturn byte SectorID 405 2018 31 4.4 2 405 2018 31 4.4 2 406 2018 31 4.4 2 564 2018 31 1.1 4 300 2018 14 2.2 3 300 2019 32 2.2 3

In the example above, brokerage (BrokerageID) issues analyst report (ReportID) for companies, which belong to different sectors (SectorID). Brokerage may issue report for companies in different sectors.

What I want to do: calculate a brokerage-level value of yearly weighted average SectorReturn by accounting for the number of times a sector is included in different reports. I.e. the weight is the the number of times a sector is included in different reports.

E.g. for Brokerage 31, the the return of Sector 2 is weighted by 2 since it appears in Report 405 and 406, while the return of Sector 4 is weighted by 1 since it only appears in Report 564.

How can such a yearly weighted average be calculated?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30093

06 Mar 2023, 09:45

Like this:

Code:

//  VERIFY CONSISTENCY OF SECTOR RETURN
by Year BrokerageID SectorID (SectorReturn), sort: assert SectorReturn[1] == SectorReturn[_N]

//  CALCULATE NUMBER OF REPORTS FOR EACH SECTOR
by Year BrokerageID SectorID (ReportID), sort: gen weight = sum(ReportID != ReportID[_n-1])
by Year BrokerageID SectorID (ReportID): replace weight = weight[_N]

//  CALCULATE WEIGHTED AVERAGE RETURN
egen tag = tag(Year BrokerageID SectorID)
by Year BrokerageID (SectorID), sort: egen numerator = total(cond(tag, weight*SectorReturn, .))
by Year BrokerageID (SectorID): egen denominator = total(cond(tag, weight, .))
gen wanted = numerator/denominator

Again, this is only sensible if the value of SectorReturn is consistent among all observations of the same BrokerageID, SectorID, and Year. The weighted average has been calculated including each SectorID in a given BrokerageID and Year only once, but given weight according to the number of distinct ReportIDs it is associated with.

Announcement