
  • Keeping leading zeros when destringing a variable

    Hi,
    I have a string variable that I want to make numeric. Since the variable does not have a value label, I used the destring command to generate the numeric variable, but the new variable drops the leading zeros. For example, the old variable is 023052 and the new variable is 23052. But I want to keep the leading zeros.
    Thanks for helping with that.

  • #2
    You have asked this before and the answer was, and is, that you can't do this.

    A leading zero is at most something that can be displayed on instruction; it is not to be considered as something stored in a way that is accessible.

    The main problem is independent of variables and conversion commands. These examples

    Code:
    . di real("0042")
    42
    
    . di real("042")
    42
    
    . di real("42")
    42
    show that leading zeros disappear and, even more crucially, that the conversion is not reversible: string(42) yields only "42", so you would need to know the exact number of leading zeros to restore them afterwards. In short, destring is futile here if the aim is to preserve leading zeros.

    You didn't answer my question in your earlier thread on why you think you want to do this. http://www.statalist.org/forums/foru...t-in-the-other



    • #3
      Thanks Prof. Cox. I'm sorry I didn't answer your question before, but the reason is that I'm not sure why this variable has to be numeric!! Sorry, that seems weird, but this variable is my id variable (and since I'm working with a colleague on the data, he requires the id variable to be numeric, otherwise he can't run the regressions!!). So I looked on the web about this, and it seems that the opposite is true: the id variable should rather be a string! So I'm confused and really don't understand the logic behind all this!
      I appreciate any clarification of this issue.



      • #4
        It is generally better to have id variables be strings.

        So your colleague needs numeric ID variables for whatever reason. Why do the leading zeroes matter? Do you have IDs that are the same but for the leading zeroes? If not, just give him the -destring- version and don't worry about the leading zeroes. If you do have IDs that differ only by the presence/absence/number of leading zeroes, then, just make up a sequential numerical ID and give your colleague that. If it's an ID variable, all that matters is that it distinctly identify distinct cases in the data.

        Code:
        egen long pseudo_id = group(id)
        Make sure you save the file with both the original ID and the pseudo ID so that you can communicate if there are questions about particular observations.
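
        A minimal sketch of how one might check whether any IDs differ only by leading zeros before choosing between the two options above. It assumes the string variable is called id and contains only digits; id_num and clash are names made up for illustration.

        Code:
        * numeric copy of the string id; leading zeros are lost here
        destring id, generate(id_num)
        * flag numeric values that correspond to more than one distinct string id
        bysort id_num (id): generate byte clash = id[1] != id[_N]
        count if clash
        * if the count is not 0, fall back on a sequential identifier
        egen long pseudo_id = group(id)
        If the count is 0, no two IDs collapse to the same number and the destring version is safe to hand over.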



        • #5
          Salma,

          Just out of curiosity, what regression command is being used that requires numeric ID variables?

          Regards,
          Joe



          • #6
            I have a really bad idea that may solve the issue. Assuming they all have the same width, you add a constant (it looks like 1000000 will do) after destringing. This will resurrect the leading zeros, though all values will of course start with 1, e.g. 023052 becomes 1023052. Seems a strange thing to do, though. Best off just keeping them as strings if the leading 0's are meaningful.
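
            A rough sketch of the add-a-constant idea, under the stated assumption that every ID is exactly 6 digits wide (as in the example 023052). The names id_num and id_plus are hypothetical.

            Code:
            * stop if the same-width assumption does not hold
            assert strlen(id) == 6
            destring id, generate(id_num)
            * long holds 7-digit integers exactly
            generate long id_plus = id_num + 1000000
            format id_plus %7.0f
            With these assumptions "023052" ends up stored as 1023052, so the zeros survive after the leading 1.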



            • #7
              Clyde,
              The zeroes matter because the id may differ by the presence/absence or number of leading zeroes. So, I tried your solution. Hope it will solve the issue for my colleague!!
              Joe,
              I'm not sure yet. I'm just arranging my data to get it ready for analysis. I know that my colleague will run two tests: OLS test and fixed effects test. Sorry but things are a little ambiguous for me now.



              • #8
                The most obvious reason that numeric identifiers are needed is because a panel identifier should be numeric. In that case, don't destring at all, just use encode or egen's group() function.
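
                For illustration only, the encode route might look like this. It assumes the string identifier is called id; panel_id is a made-up name and year is a hypothetical time variable.

                Code:
                * integer codes 1, 2, ... with the original strings kept as value labels
                encode id, generate(panel_id)
                * declare the panel structure using the new numeric identifier
                xtset panel_id year
                The original strings, leading zeros included, remain visible as value labels, and the string variable id itself is left untouched.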



                • #9
                  A messy situation. Numbers distinguished only by leading zeros are not really distinguished; that is a problem if you use destring. And strings that look like numbers may lead to confusion if they are matched to different numbers that don't correspond; that is a possible source of confusion if you use encode, as Nick Cox recommends.

                  I would either use
                  gen my_id = "ID_" + id
                  encode my_id, gen(id_numeric)
                  That way, string ID numbers will not look like numeric ID numbers, and both can be used as needed.

                  Or add a very large constant to every number, as Ben Earnhart suggests.

                  I would also try to make sure that whoever supplied the data knows that this is not a sensible way to define an ID number.
                  02471 and 2471 can be confused by people as well as by computers.




                  • #10
                    If identifiers aren't used consistently when recording data, then there will certainly be problems. The advice to consider encode can't solve that. In fact, nothing can solve that except consistent recording.



                    • #11
                      Omar: I think you misunderstand what is perceived to be the problem and offer a solution that is not guaranteed to work.

                      If "05000028" is an example of a string identifier then you may need all its characters and you should almost never think about applying destring. destring is for variables that should be numeric, but have been somehow misread as string. The only exceptions to this would be if (for example) all identifiers contained leading zeros and those leading zeros are not informative. But in general string identifiers really are best left as string variables. If you need a numeric identifier, e.g. for panel work, solutions have already been discussed in this thread.

                      Furthermore, even if destring was applied mistakenly, applying a display format that shows leading zeros only affects what is displayed. It doesn't restore what was removed by destring. There is enough confusion on this point among Stata users who are still learning (see e.g. http://www.stata-journal.com/article.html?article=dm0067).
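
                      A small sketch of the display-format point, using two made-up 6-character IDs. The variable names id_num and id_back are hypothetical.

                      Code:
                      clear
                      input str6 id
                      "023052"
                      "000042"
                      end
                      destring id, generate(id_num)
                      * the format pads the display to 6 digits; the stored values are still 23052 and 42
                      format id_num %06.0f
                      list
                      * rebuilding the string works only because the width (6) is known in advance
                      generate str6 id_back = string(id_num, "%06.0f")
                      assert id_back == id
                      If the original IDs had varying widths, no single format could recover them, which is why the string variable should be kept.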






                      • #12
                        There is a big difference between you making small changes to your dataset that preserve the crucial information in your data (that's fine) and recommending as a general strategy what is not guaranteed to work. My general advice remains as given in #8.

                        This FAQ including information on getting numeric identifiers when you have string identifiers has been public since 1999: http://www.stata.com/support/faqs/da...ers/index.html
                        Last edited by Nick Cox; 15 Nov 2014, 07:02.



                        • #13
                          Note for later readers: My previous two replies won't make full sense because they were replies to a member who posted twice and then deleted their comments. The miniature debate was based in large part on a misunderstanding. As an hour has passed since posting, I can't delete my posts unilaterally.

