Discrepancy distinct counts STATA vs Excel

Laura Hill

Join Date: Aug 2021
Posts: 42

Discrepancy distinct counts STATA vs Excel

18 Aug 2021, 08:12

Dear all,

I had a question regarding a discrepancy I found between STATA and Excel when it comes to distinct count.
Namely, when counting distinct employee ID's, I get different outputs in STATA than in Excel.

I believe this is because of an error on my side (beginner in STATA), but I haven't been able to figure it out.

Unfortunately, I cannot share the data (personal data).
But essentially I have a dataset that looks as follows (my dataset contains +/- 1M observations)

CompanyID	employeeID	TypeCompany	Country
BE1234	p12345	public limited	Belgium
BE1234	p12345	public limited	Belgium
BE1234	p98765	public limited	Belgium
BE5678	p67891	private limited liability	Belgium
GB9123	p23456	private limited liability	UK
RU4567	p78912	professional partnership	Russia

Code:

bysort employeeID: gen distinctemployee= _n ==1
count if distinctemployee

If I then do

Code:

 bysort Country: count if distinctemployee

I get 2626 distinct employees in total.

However, when I then export the file to Excel. I notice that when using a pivot table and data model, and selecting distinct count, I have 2710 distinct employeeID's.

If I delete all non-Belgian companies from the onset from my dataset, based on Country variable, and then just do count if distinctemployee then I also get 2710.

But not when I drop the non-Belgian observations after creating the variable distinctemployee, then I get 2626 as well.

I'm trying to figure out what I did wrong. I'm assuming 2710 is the correct answer, rather than 2626.

Thanks in advance for your help!
Best regards,
Laura Hill

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

18 Aug 2021, 08:55

Is it possible that you have the same employeeID in more than one country?

Your Stata code will count the employeeID in just one of the countries in which they appear. Your Excel analysis likely counted the employeeIDs separately by country.

If the same employeeID appearing in two different countries means two distinct employees, try changing your Stata code to

Code:

bysort Country employeeID: gen distinctemployee= _n ==1

and see what happens. And try the simpler code designed for this purpose

Code:

egen distinctemployee = tag(Country employeeID)
2 likes
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2400
#3

18 Aug 2021, 09:05

Something else to consider is whether capitalization or leading/trailing blanks may affect the way Excel counts unique values. You might also consider running this before counting unique employee IDs.

Code:

replace employeeID = strtrim(strlower(employeeID))
2 likes
Comment
Laura Hill

Join Date: Aug 2021

Posts: 42
#4

18 Aug 2021, 10:08

William Lisowski Thank you so much! This solved the issue and both codes gave the result I needed (2710).
Indeed, there are employee IDs linked to two or more countries(something I wasn't aware of until now, which is very good to know).

Leonardo Guizzetti Thank you for your code, I will use it in the future to avoid any trailing blanks.

Thanks for all the help!

Last edited by Laura Hill; 18 Aug 2021, 10:14.
Comment

Announcement

Discrepancy distinct counts STATA vs Excel

Comment

Comment

Comment