Dear all,
I had a question regarding a discrepancy I found between STATA and Excel when it comes to distinct count.
Namely, when counting distinct employee ID's, I get different outputs in STATA than in Excel.
I believe this is because of an error on my side (beginner in STATA), but I haven't been able to figure it out.
Unfortunately, I cannot share the data (personal data).
But essentially I have a dataset that looks as follows (my dataset contains +/- 1M observations)
If I then do
I get 2626 distinct employees in total.
However, when I then export the file to Excel. I notice that when using a pivot table and data model, and selecting distinct count, I have 2710 distinct employeeID's.
If I delete all non-Belgian companies from the onset from my dataset, based on Country variable, and then just do count if distinctemployee then I also get 2710.
But not when I drop the non-Belgian observations after creating the variable distinctemployee, then I get 2626 as well.
I'm trying to figure out what I did wrong. I'm assuming 2710 is the correct answer, rather than 2626.
Thanks in advance for your help!
Best regards,
Laura Hill
I had a question regarding a discrepancy I found between STATA and Excel when it comes to distinct count.
Namely, when counting distinct employee ID's, I get different outputs in STATA than in Excel.
I believe this is because of an error on my side (beginner in STATA), but I haven't been able to figure it out.
Unfortunately, I cannot share the data (personal data).
But essentially I have a dataset that looks as follows (my dataset contains +/- 1M observations)
CompanyID | employeeID | TypeCompany | Country |
BE1234 | p12345 | public limited | Belgium |
BE1234 | p12345 | public limited | Belgium |
BE1234 | p98765 | public limited | Belgium |
BE5678 | p67891 | private limited liability | Belgium |
GB9123 | p23456 | private limited liability | UK |
RU4567 | p78912 | professional partnership | Russia |
Code:
bysort employeeID: gen distinctemployee= _n ==1 count if distinctemployee
Code:
bysort Country: count if distinctemployee
However, when I then export the file to Excel. I notice that when using a pivot table and data model, and selecting distinct count, I have 2710 distinct employeeID's.
If I delete all non-Belgian companies from the onset from my dataset, based on Country variable, and then just do count if distinctemployee then I also get 2710.
But not when I drop the non-Belgian observations after creating the variable distinctemployee, then I get 2626 as well.
I'm trying to figure out what I did wrong. I'm assuming 2710 is the correct answer, rather than 2626.
Thanks in advance for your help!
Best regards,
Laura Hill
Comment