  • How Can I Find Out Whether a Categorical Variable Is Coded Wrongly in Stata?

    Hi, I have a small dataset as follows:
    clear
    input byte (id male)
    1 0
    1 0
    1 0
    1 1
    2 1
    2 0
    2 0
    2 0
    3 0
    3 0
    3 0
    3 0
    4 0
    4 0
    4 1
    end

    It is evident that the male variable is incorrectly coded: it should be constant within each id, but ids 1, 2, and 4 contain both 0 and 1. How can I detect this with Stata code?
    Thank you for your help!

  • #2
    If you just want to find out whether there is such an issue:
    Code:
    . ta id male
    
               |         male
            id |         0          1 |     Total
    -----------+----------------------+----------
             1 |         3          1 |         4
             2 |         3          1 |         4
             3 |         4          0 |         4
             4 |         2          1 |         3
    -----------+----------------------+----------
         Total |        12          3 |        15
    For each id, there should be a positive count in only one column. Here ids 1, 2, and 4 have positive counts in both columns, so male varies within those ids.

    If your question is really about something else, please clarify.
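
    When the table is too large to read, the same yes/no check can be sketched without any table output. This is a minimal sketch, not code from the thread: it assumes male should be constant within id and rebuilds the 15-row example from #1 so that it runs on its own. The local macro rc is a name introduced here for illustration.

    ```stata
    * Rebuild the 15-row example from #1 so the sketch runs on its own
    clear
    set obs 15
    generate byte id = ceil(_n/4)
    generate byte male = inlist(_n, 4, 5, 15)

    * Sort male within id, then require the first and last values to agree.
    * -capture- keeps a failed assertion from stopping a do-file.
    * Note: a missing value sorts last, so it counts as a disagreement.
    capture bysort id (male): assert male[1] == male[_N]
    local rc = _rc

    if `rc' == 9 {
        display as text "male varies within at least one id"
    }
    else if `rc' == 0 {
        display as text "male is constant within every id"
    }
    ```

    A return code of 9 is what -assert- gives when the assertion is false; any other nonzero code would point to a different problem, such as a misspelled variable name.
    
    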


    • #3
      Also see FAQ https://www.stata.com/support/faqs/d...ions-in-group/


      • #4
        Whether wrong coding is evident or not depends on knowledge outside of your data, e.g. it may well be that for id==4 all values should be 1, not 0 (assuming that 0=female). That said, you can use -tab2- and -list-:
        Code:
        tab2 id male
        list id male if male==0
        By the way, you can improve your post by placing Stata commands between code delimiters.
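
        One way to look at the mixed ids more closely is to compute the share of rows coded 1 within each id. This is only a sketch on the example data from #1, and share_male is a variable name made up for illustration. The share does not tell you which value is correct; as noted above, that takes knowledge from outside the data.

        ```stata
        * Rebuild the 15-row example from #1 so the sketch runs on its own
        clear
        set obs 15
        generate byte id = ceil(_n/4)
        generate byte male = inlist(_n, 4, 5, 15)

        * Share of rows with male==1 within each id; 0 or 1 means one code only
        bysort id: egen double share_male = mean(male)

        * List only the ids that carry both codes, one block per id
        list id male share_male if share_male > 0 & share_male < 1, sepby(id)
        ```
        
        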


        • #5
          Thank you!


          • #6
            Originally posted by Dirk Enzmann:
            Whether wrong coding is evident or not depends on knowledge outside of your data, e.g. it may well be that for id==4 all values should be 1, not 0 (assuming that 0=female). That said, you can use -tab2- and -list-:
            Code:
            tab2 id male
            list id male if male==0
            By the way, you can improve your post by placing Stata commands between code delimiters.
            However, the real dataset has 100000000000 observations and the error message said, "too many values".
            Could you please show me an alternative way to solve this problem in Stata?

            Thank you!


            • #7
              With a data set that size, trying to actually create a list of all id's corresponding to males is almost guaranteed to fail due to limits of the output of -tab-, or the number of entries in a macro, or the dimensions of a matrix. Even if you were to succeed in this way, the resulting list would be much too large for you to work with in any reasonable way.

              In any case, what you are trying to do is obtain information about those id's where the variable male is coded inconsistently. The response in #3 by Andrew Musau points you in a direction that will succeed. Follow the link he posted there to see the details of the solution. The only change I would make is to use -browse if ...- instead of -list if ...- as the last command.

              Added: Be patient when trying this approach. Sorting a data set of that size, which is a crucial part of that code, is likely to take a long time.
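
              A related way to shrink the problem, sketched here on the example data from #1, is to reduce the data to one row per distinct (id, male) pair and keep only the ids that appear with more than one code. This is a variation and not the code from the FAQ linked in #3. Whether it is feasible on a data set of the size described is an assumption that depends on available memory, since -contract- still has to process every observation.

              ```stata
              * Rebuild the 15-row example from #1 so the sketch runs on its own
              clear
              set obs 15
              generate byte id = ceil(_n/4)
              generate byte male = inlist(_n, 4, 5, 15)

              * -contract- REPLACES the data in memory with one row per distinct
              * (id, male) pair plus a frequency variable _freq, so save the
              * full data set first or wrap this in -preserve- / -restore-.
              contract id male

              * Keep only the ids that occur with more than one value of male
              bysort id: keep if _N > 1

              * On real data, -browse- may be more practical than -list-
              list, sepby(id)
              ```

              What remains has at most a few rows per inconsistent id, and _freq shows how often each code occurs for that id.
              
              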

              More added: As an aside, viewing many of your recent posts here, it is apparent that you are a beginner with Stata. That is fine: we were all beginners once, and beginner questions are very welcome here. But I would argue that working with a data set containing 100,000,000,000 observations is not the best way to learn the Stata basics. It will bump up against many of Stata's limits that either cannot be overcome or require intermediate to advanced programming to work around, and the commands that do work regardless of size will in general be very slow.

              Moreover, I imagine that you are using a data set of that size as part of a serious project, where there are consequences for you, or worse, for others, for getting things wrong. This, too, is not the right setting for learning the basics.

              I urge you, if at all possible, to put this project aside, and invest some time in learning Stata fundamentals. William Lisowski offers excellent advice on how to go about this at https://www.statalist.org/forums/for...uplicate-dates. The time you spend on that will be amply and rapidly repaid.
              Last edited by Clyde Schechter; 17 Jul 2022, 14:25.


              • #8
                Thank you, Professor!
