  • A doubt regarding duplicate identifier in a panel data set

    Dear Members,
    I have a doubt which I will illustrate with a sample dataset:




    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 ric int year str28 companyname byte industry
    "ONE.V" 1999 "01 Communique Laboratory Inc" .
    "ONE.V" 1999 "01 Communique Laboratory Inc" .
    "ONE.V" 2000 "01 Communique Laboratory Inc" .
    "ONE.V" 2000 "01 Communique Laboratory Inc" .
    "ONE.V" 2001 "01 Communique Laboratory Inc" .
    "ONE.V" 2001 "01 Communique Laboratory Inc" .
    "ONE.V" 2002 "01 Communique Laboratory Inc" .
    "ONE.V" 2002 "01 Communique Laboratory Inc" .
    "ONE.V" 2003 "01 Communique Laboratory Inc" .
    "ONE.V" 2003 "01 Communique Laboratory Inc" .
    "ONE.V" 2004 "01 Communique Laboratory Inc" .
    "ONE.V" 2004 "01 Communique Laboratory Inc" .
    "ONE.V" 2005 "01 Communique Laboratory Inc" .
    "ONE.V" 2005 "01 Communique Laboratory Inc" .
    "ONE.V" 2006 "01 Communique Laboratory Inc" .
    "ONE.V" 2006 "01 Communique Laboratory Inc" .
    "ONE.V" 2007 "01 Communique Laboratory Inc" .
    "ONE.V" 2007 "01 Communique Laboratory Inc" .
    "ONE.V" 2008 "01 Communique Laboratory Inc" .
    "ONE.V" 2008 "01 Communique Laboratory Inc" .
    "ONE.V" 2009 "01 Communique Laboratory Inc" .
    "ONE.V" 2009 "01 Communique Laboratory Inc" .
    "ONE.V" 2010 "01 Communique Laboratory Inc" .
    "ONE.V" 2010 "01 Communique Laboratory Inc" .
    "ONE.V" 2011 "01 Communique Laboratory Inc" .
    "ONE.V" 2011 "01 Communique Laboratory Inc" .
    "ONE.V" 2012 "01 Communique Laboratory Inc" .
    "ONE.V" 2012 "01 Communique Laboratory Inc" .
    "ONE.V" 2013 "01 Communique Laboratory Inc" .
    "ONE.V" 2013 "01 Communique Laboratory Inc" .
    "ONE.V" 2014 "01 Communique Laboratory Inc" .
    "ONE.V" 2014 "01 Communique Laboratory Inc" .
    "ONE.V" 2015 "01 Communique Laboratory Inc" .
    "ONE.V" 2015 "01 Communique Laboratory Inc" .
    "ONE.V" 2016 "01 Communique Laboratory Inc" .
    "ONE.V" 2016 "01 Communique Laboratory Inc" .
    "ONE.V" 2017 "01 Communique Laboratory Inc" .
    "ONE.V" 2017 "01 Communique Laboratory Inc" .
    "ONE.V" 2018 "01 Communique Laboratory Inc" .
    "ONE.V" 2018 "01 Communique Laboratory Inc" .
    "ONE.V" 2019 "01 Communique Laboratory Inc" .
    "ONE.V" 2019 "01 Communique Laboratory Inc" .
    "ONE.V" 2020 "01 Communique Laboratory Inc" .
    "ONE.V" 2020 "01 Communique Laboratory Inc" .
    "A.V"   1999 "Armor Minerals Inc"           .
    "A.V"   1999 "Armor Minerals Inc"           .
    "A.V"   2000 "Armor Minerals Inc"           .
    "A.V"   2000 "Armor Minerals Inc"           .
    "A.V"   2001 "Armor Minerals Inc"           .
    "A.V"   2001 "Armor Minerals Inc"           .
    "A.V"   2002 "Armor Minerals Inc"           .
    "A.V"   2002 "Armor Minerals Inc"           .
    "A.V"   2003 "Armor Minerals Inc"           .
    "A.V"   2003 "Armor Minerals Inc"           .
    "A.V"   2004 "Armor Minerals Inc"           .
    "A.V"   2004 "Armor Minerals Inc"           .
    "A.V"   2005 "Armor Minerals Inc"           .
    "A.V"   2005 "Armor Minerals Inc"           .
    "A.V"   2006 "Armor Minerals Inc"           .
    "A.V"   2006 "Armor Minerals Inc"           .
    "A.V"   2007 "Armor Minerals Inc"           .
    "A.V"   2007 "Armor Minerals Inc"           .
    "A.V"   2008 "Armor Minerals Inc"           .
    "A.V"   2008 "Armor Minerals Inc"           .
    "A.V"   2009 "Armor Minerals Inc"           .
    "A.V"   2009 "Armor Minerals Inc"           .
    "A.V"   2010 "Armor Minerals Inc"           .
    "A.V"   2010 "Armor Minerals Inc"           .
    "A.V"   2011 "Armor Minerals Inc"           .
    "A.V"   2011 "Armor Minerals Inc"           .
    "A.V"   2012 "Armor Minerals Inc"           .
    "A.V"   2012 "Armor Minerals Inc"           .
    "A.V"   2013 "Armor Minerals Inc"           .
    "A.V"   2013 "Armor Minerals Inc"           .
    "A.V"   2014 "Armor Minerals Inc"           .
    "A.V"   2014 "Armor Minerals Inc"           .
    "A.V"   2015 "Armor Minerals Inc"           .
    "A.V"   2015 "Armor Minerals Inc"           .
    "A.V"   2016 "Armor Minerals Inc"           .
    "A.V"   2016 "Armor Minerals Inc"           .
    "A.V"   2017 "Armor Minerals Inc"           .
    "A.V"   2017 "Armor Minerals Inc"           .
    "A.V"   2018 "Armor Minerals Inc"           .
    "A.V"   2018 "Armor Minerals Inc"           .
    "A.V"   2019 "Armor Minerals Inc"           .
    "A.V"   2019 "Armor Minerals Inc"           .
    "A.V"   2020 "Armor Minerals Inc"           .
    "A.V"   2020 "Armor Minerals Inc"           .
    "RP.V"  1999 "Replicel Life Sciences Inc"   .
    "Rp.V"  1999 "Replicel Life Sciences Inc"   .
    "Rp.V"  1999 "Replicel Life Sciences Inc"   .
    "RP.V"  2000 "Replicel Life Sciences Inc"   .
    "Rp.V"  2000 "Replicel Life Sciences Inc"   .
    "Rp.V"  2000 "Replicel Life Sciences Inc"   .
    "Rp.V"  2001 "Replicel Life Sciences Inc"   .
    "Rp.V"  2001 "Replicel Life Sciences Inc"   .
    "RP.V"  2001 "Replicel Life Sciences Inc"   .
    "Rp.V"  2002 "Replicel Life Sciences Inc"   .
    "RP.V"  2002 "Replicel Life Sciences Inc"   .
    "Rp.V"  2002 "Replicel Life Sciences Inc"   .
    "Rp.V"  2003 "Replicel Life Sciences Inc"   .
    "RP.V"  2003 "Replicel Life Sciences Inc"   .
    "Rp.V"  2003 "Replicel Life Sciences Inc"   .
    "Rp.V"  2004 "Replicel Life Sciences Inc"   .
    "Rp.V"  2004 "Replicel Life Sciences Inc"   .
    "RP.V"  2004 "Replicel Life Sciences Inc"   .
    "Rp.V"  2005 "Replicel Life Sciences Inc"   .
    "Rp.V"  2005 "Replicel Life Sciences Inc"   .
    "RP.V"  2005 "Replicel Life Sciences Inc"   .
    "RP.V"  2006 "Replicel Life Sciences Inc"   .
    "Rp.V"  2006 "Replicel Life Sciences Inc"   .
    "Rp.V"  2006 "Replicel Life Sciences Inc"   .
    "Rp.V"  2007 "Replicel Life Sciences Inc"   .
    "Rp.V"  2007 "Replicel Life Sciences Inc"   .
    "RP.V"  2007 "Replicel Life Sciences Inc"   .
    "Rp.V"  2008 "Replicel Life Sciences Inc"   .
    "RP.V"  2008 "Replicel Life Sciences Inc"   .
    "Rp.V"  2008 "Replicel Life Sciences Inc"   .
    "Rp.V"  2009 "Replicel Life Sciences Inc"   .
    "RP.V"  2009 "Replicel Life Sciences Inc"   .
    "Rp.V"  2009 "Replicel Life Sciences Inc"   .
    "RP.V"  2010 "Replicel Life Sciences Inc"   .
    "Rp.V"  2010 "Replicel Life Sciences Inc"   .
    "Rp.V"  2010 "Replicel Life Sciences Inc"   .
    "RP.V"  2011 "Replicel Life Sciences Inc"   .
    "Rp.V"  2011 "Replicel Life Sciences Inc"   .
    "Rp.V"  2011 "Replicel Life Sciences Inc"   .
    "RP.V"  2012 "Replicel Life Sciences Inc"   .
    "Rp.V"  2012 "Replicel Life Sciences Inc"   .
    "Rp.V"  2012 "Replicel Life Sciences Inc"   .
    "RP.V"  2013 "Replicel Life Sciences Inc"   .
    "Rp.V"  2013 "Replicel Life Sciences Inc"   .
    "Rp.V"  2013 "Replicel Life Sciences Inc"   .
    "Rp.V"  2014 "Replicel Life Sciences Inc"   .
    "RP.V"  2014 "Replicel Life Sciences Inc"   .
    "Rp.V"  2014 "Replicel Life Sciences Inc"   .
    "RP.V"  2015 "Replicel Life Sciences Inc"   .
    "Rp.V"  2015 "Replicel Life Sciences Inc"   .
    "Rp.V"  2015 "Replicel Life Sciences Inc"   .
    "Rp.V"  2016 "Replicel Life Sciences Inc"   .
    "Rp.V"  2016 "Replicel Life Sciences Inc"   .
    "RP.V"  2016 "Replicel Life Sciences Inc"   .
    "Rp.V"  2017 "Replicel Life Sciences Inc"   .
    "Rp.V"  2017 "Replicel Life Sciences Inc"   .
    "RP.V"  2017 "Replicel Life Sciences Inc"   .
    "Rp.V"  2018 "Replicel Life Sciences Inc"   .
    "Rp.V"  2018 "Replicel Life Sciences Inc"   .
    "RP.V"  2018 "Replicel Life Sciences Inc"   .
    "Rp.V"  2019 "Replicel Life Sciences Inc"   .
    "RP.V"  2019 "Replicel Life Sciences Inc"   .
    "Rp.V"  2019 "Replicel Life Sciences Inc"   .
    "Rp.V"  2020 "Replicel Life Sciences Inc"   .
    "RP.V"  2020 "Replicel Life Sciences Inc"   .
    "Rp.V"  2020 "Replicel Life Sciences Inc"   .
    "SAh.V" 1999 "Southern Arc Minerals Inc"    .
    "SAH.V" 1999 "Southern Arc Minerals Inc"    .
    "SAH.V" 1999 "Southern Arc Minerals Inc"    .
    "SAh.V" 1999 "Southern Arc Minerals Inc"    .
    "SAh.V" 2000 "Southern Arc Minerals Inc"    .
    "SAh.V" 2000 "Southern Arc Minerals Inc"    .
    "SAH.V" 2000 "Southern Arc Minerals Inc"    .
    "SAH.V" 2000 "Southern Arc Minerals Inc"    .
    "SAH.V" 2001 "Southern Arc Minerals Inc"    .
    "SAh.V" 2001 "Southern Arc Minerals Inc"    .
    "SAh.V" 2001 "Southern Arc Minerals Inc"    .
    "SAH.V" 2001 "Southern Arc Minerals Inc"    .
    "SAH.V" 2002 "Southern Arc Minerals Inc"    .
    "SAh.V" 2002 "Southern Arc Minerals Inc"    .
    "SAh.V" 2002 "Southern Arc Minerals Inc"    .
    "SAH.V" 2002 "Southern Arc Minerals Inc"    .
    "SAh.V" 2003 "Southern Arc Minerals Inc"    .
    "SAH.V" 2003 "Southern Arc Minerals Inc"    .
    "SAh.V" 2003 "Southern Arc Minerals Inc"    .
    "SAH.V" 2003 "Southern Arc Minerals Inc"    .
    "SAh.V" 2004 "Southern Arc Minerals Inc"    .
    "SAH.V" 2004 "Southern Arc Minerals Inc"    .
    "SAh.V" 2004 "Southern Arc Minerals Inc"    .
    "SAH.V" 2004 "Southern Arc Minerals Inc"    .
    "SAh.V" 2005 "Southern Arc Minerals Inc"    .
    "SAH.V" 2005 "Southern Arc Minerals Inc"    .
    "SAh.V" 2005 "Southern Arc Minerals Inc"    .
    "SAH.V" 2005 "Southern Arc Minerals Inc"    .
    "SAh.V" 2006 "Southern Arc Minerals Inc"    .
    "SAH.V" 2006 "Southern Arc Minerals Inc"    .
    "SAh.V" 2006 "Southern Arc Minerals Inc"    .
    "SAH.V" 2006 "Southern Arc Minerals Inc"    .
    "SAh.V" 2007 "Southern Arc Minerals Inc"    .
    "SAH.V" 2007 "Southern Arc Minerals Inc"    .
    "SAh.V" 2007 "Southern Arc Minerals Inc"    .
    "SAH.V" 2007 "Southern Arc Minerals Inc"    .
    "SAh.V" 2008 "Southern Arc Minerals Inc"    .
    "SAH.V" 2008 "Southern Arc Minerals Inc"    .
    "SAH.V" 2008 "Southern Arc Minerals Inc"    .
    "SAh.V" 2008 "Southern Arc Minerals Inc"    .
    "SAH.V" 2009 "Southern Arc Minerals Inc"    .
    "SAH.V" 2009 "Southern Arc Minerals Inc"    .
    "SAh.V" 2009 "Southern Arc Minerals Inc"    .
    "SAh.V" 2009 "Southern Arc Minerals Inc"    .
    "SAh.V" 2010 "Southern Arc Minerals Inc"    .
    "SAH.V" 2010 "Southern Arc Minerals Inc"    .
    "SAH.V" 2010 "Southern Arc Minerals Inc"    .
    "SAh.V" 2010 "Southern Arc Minerals Inc"    .
    "SAh.V" 2011 "Southern Arc Minerals Inc"    .
    "SAH.V" 2011 "Southern Arc Minerals Inc"    .
    "SAH.V" 2011 "Southern Arc Minerals Inc"    .
    "SAh.V" 2011 "Southern Arc Minerals Inc"    .
    "SAH.V" 2012 "Southern Arc Minerals Inc"    .
    "SAh.V" 2012 "Southern Arc Minerals Inc"    .
    "SAH.V" 2012 "Southern Arc Minerals Inc"    .
    "SAh.V" 2012 "Southern Arc Minerals Inc"    .
    "SAH.V" 2013 "Southern Arc Minerals Inc"    .
    "SAh.V" 2013 "Southern Arc Minerals Inc"    .
    "SAh.V" 2013 "Southern Arc Minerals Inc"    .
    "SAH.V" 2013 "Southern Arc Minerals Inc"    .
    "SAh.V" 2014 "Southern Arc Minerals Inc"    .
    "SAH.V" 2014 "Southern Arc Minerals Inc"    .
    "SAH.V" 2014 "Southern Arc Minerals Inc"    .
    "SAh.V" 2014 "Southern Arc Minerals Inc"    .
    "SAh.V" 2015 "Southern Arc Minerals Inc"    .
    "SAh.V" 2015 "Southern Arc Minerals Inc"    .
    "SAH.V" 2015 "Southern Arc Minerals Inc"    .
    "SAH.V" 2015 "Southern Arc Minerals Inc"    .
    "SAh.V" 2016 "Southern Arc Minerals Inc"    .
    "SAh.V" 2016 "Southern Arc Minerals Inc"    .
    "SAH.V" 2016 "Southern Arc Minerals Inc"    .
    "SAH.V" 2016 "Southern Arc Minerals Inc"    .
    "SAH.V" 2017 "Southern Arc Minerals Inc"    .
    "SAh.V" 2017 "Southern Arc Minerals Inc"    .
    "SAh.V" 2017 "Southern Arc Minerals Inc"    .
    "SAH.V" 2017 "Southern Arc Minerals Inc"    .
    "SAh.V" 2018 "Southern Arc Minerals Inc"    .
    "SAh.V" 2018 "Southern Arc Minerals Inc"    .
    "SAH.V" 2018 "Southern Arc Minerals Inc"    .
    "SAH.V" 2018 "Southern Arc Minerals Inc"    .
    "SAh.V" 2019 "Southern Arc Minerals Inc"    .
    "SAH.V" 2019 "Southern Arc Minerals Inc"    .
    "SAH.V" 2019 "Southern Arc Minerals Inc"    .
    "SAh.V" 2019 "Southern Arc Minerals Inc"    .
    "SAh.V" 2020 "Southern Arc Minerals Inc"    .
    "SAH.V" 2020 "Southern Arc Minerals Inc"    .
    "SAh.V" 2020 "Southern Arc Minerals Inc"    .
    "SAH.V" 2020 "Southern Arc Minerals Inc"    .
    end
    
    * Note that both ric and companyname are unit identifiers

    * I would like to set the panel, and I tried:
    encode companyname, gen(id)
    encode ric, gen(id2)
    
    xtset id year 
    repeated time values within panel
    r(451);
    
    xtset id2 year 
    repeated time values within panel
    r(451);
    
    *Removing duplicates
    duplicates report id year 
    /*
    -------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            2 |           88            44
            3 |           66            44
            4 |           88            66
    */
    sort id year
    quietly by id year: gen dup = cond(_N==1,0,_n)
    drop if dup>1
    *(154 observations deleted)
    xtset id year 
    /*
    xtset id year 
    
    Panel variable: id (strongly balanced)
     Time variable: year, 1999 to 2020
             Delta: 1 unit
    */
    
    
    xtset id2 year 
    /*
     xtset id2 year 
    
    Panel variable: id2 (unbalanced)
     Time variable: year, 1999 to 2020, but with gaps
             Delta: 1 unit
    
    . 
    */
    Why this difference? When I tried the user-written command -distinct- (from SSC: ssc install distinct), I got the following output:
    distinct id id2 
    /*
    
           |        Observations
           |      total   distinct
    -------+----------------------
        id |         88          4
       id2 |         88          6
    */
    * Though both id and id2 should uniquely identify each unit (companyname), why does id2 (from ric) have more distinct values than id (from companyname)?

  • #2
    Neelakanda, that is because "ric" mixes upper and lower case. For example, for the same company, "ric" appears as both "RP.V" and "Rp.V", and Stata treats them as different values.

    Code:
    replace ric = strupper(ric)
    duplicates drop
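
    As a quick check, one could re-run the panel setup after the fix (a sketch continuing from the -dataex- example in #1; the fresh -encode- into id3 is my addition, since id2 already exists in that session):

    Code:
    replace ric = strupper(ric)
    duplicates drop
    encode ric, gen(id3)   // re-encode after cleaning
    xtset id3 year         // should no longer report repeated time values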



    • #3
      You are coding ric inconsistently. Sometimes you use all upper case letters, but sometimes you don't. So, for example, Replicel Life Sciences' ric is sometimes RP.V and sometimes Rp.V. Similarly, Southern Arc Minerals Inc is sometimes associated with ric SAH.V and sometimes with SAh.V. That's why you have more id2's than id's. I suggest you -replace ric = trim(itrim(upper(ric)))-. That will put everything into all upper case and will also eliminate any initial or terminal blank padding, and any surplus internal blanks. (Those are the commonest kinds of data entry error for this kind of variable.)

      Added: Crossed with #2.



      • #4
        Thanks Fei Wang and Clyde Schechter. Yes, you are both right that in my example two companies have inconsistent ric values. Since this is a small dataset I could eyeball it and find them, but how does one do this in a large dataset (and how could you both find them so quickly)?
        ric and companyname are downloaded directly from the source, and I am not sure whether the same companies are given different ric values. In this example, as you both said, I can convert the cases, but that will not always be the fix, so is there a general command or piece of code to identify units whose ric differs during the period? For example, consider the sample data below.


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str3 ric str1 companyname int year
        "A.a" "a" 1991
        "a.A" "a" 1992
        "A.a" "a" 1993
        "A.a" "a" 1994
        "A1"  "b" 1991
        "A1"  "b" 1992
        "A1"  "b" 1993
        "A2"  "b" 1994
        "B1"  "c" 1991
        "B1"  "c" 1992
        "B1"  "c" 1993
        "B1"  "c" 1994
        end
        In this example, the ric for company "a" is A.a; however, there was a data entry mistake in 1992, where ric was miscoded as a.A. Similarly, for company b the ric is A1, but in 1994 it was miscoded as A2. In such cases, how do I find the companies that have different ric values? What is the general method? I hope I have made myself clear.



        • #5
          Neelakanda, if I understand correctly, you assume that "companyname" is correct throughout the data and that "ric" should be the same within a given "companyname". The following code should help you find which values of "companyname" falsely have multiple values of "ric".

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str3 ric str1 companyname int year
          "A.a" "a" 1991
          "a.A" "a" 1992
          "A.a" "a" 1993
          "A.a" "a" 1994
          "A1"  "b" 1991
          "A1"  "b" 1992
          "A1"  "b" 1993
          "A2"  "b" 1994
          "B1"  "c" 1991
          "B1"  "c" 1992
          "B1"  "c" 1993
          "B1"  "c" 1994
          end
          
          bys companyname (ric): gen miscode = ric[1] != ric[_N]
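
          To inspect the flagged companies, a follow-up along these lines may help (a sketch; -sepby()- just separates the listing by company):

          Code:
          list companyname ric year if miscode, sepby(companyname) noobs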
          The code only finds errors rather than correcting them. For correction, you need a more specific algorithm. For example, in your example in #4, what if the correct "ric" for company b is "A2" instead of "A1"? What are the criteria for judging which values are correct and which are wrong?



          • #6
            Dear Fei Wang
            You are absolutely right! I assumed that companyname is correct and ric could be wrong, but that may not always be the case. However, your code can help me spot the erroneous pairs so that I can record them and explore further where the error actually lies. Thanks a lot for the help!



            • #7
              Neelakanda, some further information: if the erroneous code is always in the minority, like "a.A" for company a and "A2" for company b, you may use the following code to correct the errors. But please make sure this algorithm is conceptually sound for your data (when values are half right and half wrong, the correction may be beyond control).

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str3 ric str1 companyname int year
              "A.a" "a" 1991
              "a.A" "a" 1992
              "A.a" "a" 1993
              "A.a" "a" 1994
              "A1"  "b" 1991
              "A1"  "b" 1992
              "A1"  "b" 1993
              "A2"  "b" 1994
              "B1"  "c" 1991
              "B1"  "c" 1992
              "B1"  "c" 1993
              "B1"  "c" 1994
              end
              
              bys companyname ric: gen f = -_N                                // the majority ric value gets the most negative f
              bys companyname (f): replace ric = ric[_n-1] if !mi(ric[_n-1])  // propagate the majority value down each company
              drop f
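
              To confirm that the correction worked, an -assert- in the spirit of the check in #5 may be useful; it errors out if any companyname still maps to more than one ric:

              Code:
              bys companyname (ric): assert ric[1] == ric[_N]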



              • #8
                Dear Fei Wang, the code works perfectly well even on the example in #1. Since such cases should be rare, the code you gave will help me a lot.



                • #9
                  ric and companynames are directly downloaded from the source
                  Moral of the story: never trust the data. Even the most carefully curated data sets from sources with good reputations often contain errors. Before beginning analysis, it is always wise to inspect the data carefully to verify that all numerical values are within allowable ranges, and that variables that should be consistent within or across observations are. If combining multiple data sets, it is also important to check that the variable names and storage types are consistent across data sets.



                  • #10
                    Dear Clyde Schechter, thanks for the excellent advice. Before importing the data, I always visually inspect them in MS Excel, but that is time-consuming and cumbersome when the data are very large.
                    Are there some hygiene checks to run on the data before starting analysis? I know this question is trivial, but some Stata FAQs use commands like -describe- and -codebook- before in-depth analysis. Can you suggest, in general, some commands that can be applied sequentially to understand the data, from string variables through continuous variables? You could just point me to some sources so that I can learn on my own; statistics textbooks mostly start with analysis, and the data-cleaning part is not dealt with seriously. If you could suggest, or explain, the steps you follow, that would be very helpful in the future as well. Once again, thanks.



                    • #11
                      While -describe- and -codebook- are helpful, to me the most useful command in the data cleaning process is -assert-. The reason is that it does not let you miss something you're trying to look for. Yes, it is nice to run a command like -summ age- and then look at the results to verify that the minimum and maximum age in the data set are in the range they should be. But when you do a lot of those, it is easy to miss something. By contrast, if you know from the start that every age in your data set should be, say, >= 18 because they must all be adults, -assert age >= 18- is foolproof. You cannot fail to notice a violation, because -assert- prints an error message in red and then, more to the point, it halts execution! So I do my range checking with -assert- statements. Similarly, for categorical variables, I use -assert- with the -inlist()- function. For example, if some variable is supposed to take on one of the values 1, 2, or 3 unless it is missing, we have -assert inlist(var, 1, 2, 3) | missing(var)-.
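
                      For instance, the range and category checks just described might look like this (a sketch with hypothetical variable names; note that missing values count as very large in Stata, so a bare -age >= 18- check would not catch a missing age):

                      Code:
                      assert age >= 18 & !missing(age)                // all subjects are adults, none missing
                      assert inlist(group, 1, 2, 3) | missing(group)  // only codes 1-3, or missing, allowed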

                      Another important thing to check is consistency where it is expected. For example, in a longitudinal data set, the same date of birth should be recorded for all observations of the same person.
                      Code:
                      by person_id (date_of_birth), sort: assert date_of_birth[1] == date_of_birth[_N]
                      It is sometimes important to check for consistency within single observations. For example, if the value of a variable called pregnant is yes, then sex should be female and age should be < 60 years (or an even lower limit). So again: -assert sex == "female":sex & age < 60 if pregnant- (modify that according to the actual coding of sex and pregnant in the data set).

                      Counting is important as well. If the data are supposed to contain one observation per id for each year between 2000 and 2010 inclusive:
                      Code:
                      by id (year), sort: assert year == 1999 + _n & _N == 11

                      Another useful command is -isid- for verifying that whatever variable(s) should uniquely identify observations actually does.
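
                      For the panel in #1, for example, after cleaning one might verify (a sketch):

                      Code:
                      isid id year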

                      String variables are trickier. It is much harder to specify validity conditions. Unless the string values are known to be case sensitive, I generally eliminate case variation, and deal with hidden blanks (which can be extremely difficult to find in debugging situations because you literally cannot see them), by routinely doing:
                      Code:
                      replace var = trim(itrim(lower(var)))
                      on all string variables. (Again, you have to be sure that differences in case or use of blanks are not meaningful before doing this, but they hardly ever are; variations in these things are usually errors, not true distinctions.) In some situations you can also check the length of all values of a string variable. For example, if you have US social security numbers in their usual XXX-XX-XXXX format, you can verify -assert strlen(social_security_number) == 11-.

                      In a project involving multiple linked data sets, it is important to verify the IDs of the observations in the different data sets match as they should. For example, if I have a master list of all people in a study, and I have another data set that contains, say, lab results on some people from the study, I use -merge- with the -assert()- option to verify that all of the IDs in the lab data match somebody in the master list. If there is supposed to be lab data on everybody, then I verify that all IDs in the master list data find a match in the lab data. Also important is if you have multiple linked data sets that contain common variables other than just the linking ID variable: in that case I verify that the common variables are the same in both data sets. It will not do if ID 51 is said to be 22 years old in one data set but 25 years old in another data set that is from the same time.

                      Before combining datasets with -merge- or -append-, it is helpful to use Mark Chatfield's -precombine-, which is available from the Stata Journal. It checks for compatibility of data types, identifies which variables are common to the data sets and which are not, and, crucially, identifies value-labeled numeric variables that have inconsistent labelings. That is a really important thing to check for, as combining data sets with conflicting labelings leads to a fatally flawed data set, yet one that on visual inspection appears perfectly normal.


                      Let me also emphasize that my practice is to do this when I build my Stata data sets, right after I import them from whatever source provided them. During data analysis, I will sometimes repeat some of these validations if they are necessary assumptions for the validity of the planned analysis, and especially if the data have changed since I first built the data sets. Building in all of these checks takes some time and effort, but it is well worth it. I'm never in a hurry to get the wrong answers; and wrong answers are pretty much all you can get from wrong data.
                      Last edited by Clyde Schechter; 07 Nov 2021, 20:01.



                      • #12
                        Dear Clyde Schechter
                        Thanks from the bottom of my heart. I hadn't used -assert- and wasn't aware of its many uses. Also, thanks for reminding me that it is very important to sit with the data and understand them before beginning the analysis. Though I do this in a limited fashion, I used to think it was a sheer waste of time, since people would ask why one should invest effort in small errors that cannot affect the results. Now I realize the need to know the data well before doing any analysis.

