Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Discrepancy distinct counts STATA vs Excel

    Dear all,

    I had a question regarding a discrepancy I found between STATA and Excel when it comes to distinct count.
    Namely, when counting distinct employee ID's, I get different outputs in STATA than in Excel.

    I believe this is because of an error on my side (beginner in STATA), but I haven't been able to figure it out.

    Unfortunately, I cannot share the data (personal data).
    But essentially I have a dataset that looks as follows (my dataset contains +/- 1M observations)
    CompanyID employeeID TypeCompany Country
    BE1234 p12345 public limited Belgium
    BE1234 p12345 public limited Belgium
    BE1234 p98765 public limited Belgium
    BE5678 p67891 private limited liability Belgium
    GB9123 p23456 private limited liability UK
    RU4567 p78912 professional partnership Russia

    Code:
    bysort employeeID: gen distinctemployee= _n ==1
    count if distinctemployee
    If I then do

    Code:
     bysort Country: count if distinctemployee
    I get 2626 distinct employees in total.

    However, when I then export the file to Excel. I notice that when using a pivot table and data model, and selecting distinct count, I have 2710 distinct employeeID's.


    If I delete all non-Belgian companies from the onset from my dataset, based on Country variable, and then just do count if distinctemployee then I also get 2710.

    But not when I drop the non-Belgian observations after creating the variable distinctemployee, then I get 2626 as well.

    I'm trying to figure out what I did wrong. I'm assuming 2710 is the correct answer, rather than 2626.

    Thanks in advance for your help!
    Best regards,
    Laura Hill



  • #2
    Is it possible that you have the same employeeID in more than one country?

    Your Stata code will count the employeeID in just one of the countries in which they appear. Your Excel analysis likely counted the employeeIDs separately by country.

    If the same employeeID appearing in two different countries means two distinct employees, try changing your Stata code to
    Code:
    bysort Country employeeID: gen distinctemployee= _n ==1
    and see what happens. And try the simpler code designed for this purpose
    Code:
    egen distinctemployee = tag(Country employeeID)

    Comment


    • #3
      Something else to consider is whether capitalization or leading/trailing blanks may affect the way Excel counts unique values. You might also consider running this before counting unique employee IDs.

      Code:
      replace employeeID = strtrim(strlower(employeeID))

      Comment


      • #4
        William Lisowski Thank you so much! This solved the issue and both codes gave the result I needed (2710).
        Indeed, there are employee IDs linked to two or more countries(something I wasn't aware of until now, which is very good to know).

        Leonardo Guizzetti Thank you for your code, I will use it in the future to avoid any trailing blanks.

        Thanks for all the help!
        Last edited by Laura Hill; 18 Aug 2021, 10:14.

        Comment

        Working...
        X