Keeping leading zeros when destringing a variable

salma ktat

Join Date: Jul 2014

Posts: 76
#1


07 Nov 2014, 09:27

Hi,
I have a string variable that I want to make numeric. Since the variable has no value labels, I used the destring command to generate the numeric variable, but the new variable drops the leading zeros. For example, the old value is 023052 and the new value is 23052. I want to keep the leading zeros.
Thanks for helping with that.
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

07 Nov 2014, 09:50

You have asked this before and the answer was, and is, that you can't do this.

A leading zero is at most something that can be displayed on instruction; it is not to be considered as something stored in a way that is accessible.

The main problem is independent of variables and conversion commands. These examples

Code:

. di real("0042")
42

. di real("042")
42

. di real("42")
42

show that leading zeros disappear. Even more crucially, the conversion is not reversible: string(42) yields only "42", so to restore leading zeros retrospectively you would need to know exactly how many there were. In short, destring is futile here if the aim is to preserve leading zeros.
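As a rough sketch of the irreversibility point, the expressions below can be typed directly at the prompt. The width of 4 in the last line is an assumption made purely for illustration; without knowing the original width there is nothing to pad back to.

```stata
* Sketch only: scalar expressions, no dataset needed
* Both strings convert to the same number, so this displays 1
display real("0042") == real("42")

* Converting back gives "42"; the zeros are not recovered
display string(real("0042"))

* Padding works only if the original width (assumed here to be 4) is known
display string(real("0042"), "%04.0f")
```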

You didn't answer my question in your earlier thread on why you think you want to do this. http://www.statalist.org/forums/foru...t-in-the-other
salma ktat

Join Date: Jul 2014

Posts: 76
#3

07 Nov 2014, 10:11

Thanks Prof. Cox. I'm sorry I didn't answer your question before but the reason is that I'm not sure why this variable has to be numeric!! Sorry, that seems weird but this variable is my id variable (and since I'm working with a colleague on the data, he requires the id variable to be numeric otherwise he can't run the regressions!!). So, I looked on the web about this, and it seems that it's rather the opposite which is true, I mean that the id variable has rather to be string! So, I'm confused and really don't understand the logic behind all this!
I appreciate any clarification of this issue.
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#4

07 Nov 2014, 10:18

It is generally better to have id variables be strings.

So your colleague needs numeric ID variables for whatever reason. Why do the leading zeroes matter? Do you have IDs that are the same but for the leading zeroes? If not, just give him the -destring- version and don't worry about the leading zeroes. If you do have IDs that differ only by the presence/absence/number of leading zeroes, then, just make up a sequential numerical ID and give your colleague that. If it's an ID variable, all that matters is that it distinctly identify distinct cases in the data.

Code:

egen long pseudo_id = group(id)

Make sure you save the file with both the original ID and the pseudo ID so that you can communicate if there are questions about particular observations.
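To see the difference on a toy dataset, here is a minimal sketch. The three ID values and the variable name id_destrung are made up for illustration; only id and pseudo_id follow the code above.

```stata
clear
input str6 id
"023052"
"23052"
"004711"
end

* destring maps "023052" and "23052" to the same number
destring id, generate(id_destrung)
duplicates report id_destrung

* group() assigns one integer per distinct string, so the IDs stay distinct
egen long pseudo_id = group(id)
isid pseudo_id

list, clean
```

The isid line stops with an error if pseudo_id fails to identify observations uniquely, so it doubles as a check.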
Joe Canner

Join Date: Mar 2014

Posts: 580
#5

07 Nov 2014, 10:29

Salma,

Just out of curiosity, what regression command is being used that requires numeric ID variables?

Regards,
Joe
ben earnhart

Join Date: May 2014

Posts: 1027
#6

07 Nov 2014, 10:37

I have a really bad idea that may solve the issue. Assuming the IDs all have the same width, you add a constant (looks like 1000000 will do) after destringing. This will resurrect the leading zeros, though all cases will, of course, start with 1. Seems a strange thing to do, though. Best off just keeping them as strings if the leading 0's are meaningful.
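A minimal sketch of that idea, assuming every ID is exactly 6 digits wide. The two ID values and the name id_num are made up for illustration. If widths differ (say "023052" and "23052"), both map to the same number and the trick fails, which is why the width check comes first.

```stata
clear
input str6 id
"023052"
"123052"
end

* Assumption: fixed width of 6; stop here if that does not hold
assert strlen(id) == 6

* The leading 1 acts as a placeholder that keeps the zeros visible
generate long id_num = 1000000 + real(id)
format id_num %7.0f
list, clean
```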
salma ktat

Join Date: Jul 2014

Posts: 76
#7

07 Nov 2014, 10:37

Clyde,
The zeroes matter because the id may differ by the presence/absence or number of leading zeroes. So, I tried your solution. Hope it will solve the issue for my colleague!!
Joe,
I'm not sure yet. I'm just arranging my data to get it ready for analysis. I know that my colleague will run two tests: OLS test and fixed effects test. Sorry but things are a little ambiguous for me now.
Nick Cox

Join Date: Mar 2014

Posts: 35734
#8

07 Nov 2014, 11:15

The most obvious reason that numeric identifiers are needed is because a panel identifier should be numeric. In that case, don't destring at all, just use encode or egen's group() function.
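A minimal sketch of the encode route for panel work, on made-up data. The variable names year, y and id_num and all values are assumptions for illustration only.

```stata
clear
input str6 id int year double y
"023052" 2012 1.5
"023052" 2013 1.7
"23052"  2012 2.1
"23052"  2013 2.4
end

* encode stores integer codes and keeps the original strings as value labels
encode id, generate(id_num)

* The numeric identifier can now be declared as the panel variable
xtset id_num year
list, clean
```

Because the original strings survive as value labels, "023052" and "23052" remain distinct panels and still display as they were recorded.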
Paul T Seed

Join Date: Apr 2014

Posts: 66
#9

11 Nov 2014, 02:58

A messy situation. Numbers distinguished only by leading zeros are not really distinguished; a problem if you use destring. And strings that look like numbers may lead to confusion if they are matched to different numbers that don't correspond; a possible source of confusion if you use encode, as Nick Cox recommends.

I would either use
gen my_id = "ID_" + id
encode my_id, gen(id_numeric)
That way, string ID numbers will not look like numeric ID numbers, and both can be used as needed.

Or add a very large constant to every number, as Ben Earnhart suggests.

I would also try to make sure that whoever supplied the data knows that this is not a sensible way to define an ID number.
02471 and 2471 can be confused by people as well as by computers.
Nick Cox

Join Date: Mar 2014

Posts: 35734
#10

11 Nov 2014, 03:05

If identifiers aren't used consistently when recording data, then there will certainly be problems. The advice to consider encode can't solve that. In fact, nothing can solve that except consistent recording.
Nick Cox

Join Date: Mar 2014

Posts: 35734
#11

15 Nov 2014, 06:04

Omar: I think you misunderstand what is perceived to be the problem and offer a solution that is not guaranteed to work.

If "05000028" is an example of a string identifier then you may need all its characters and you should almost never think about applying destring. destring is for variables that should be numeric, but have been somehow misread as string. The only exceptions to this would be if (for example) all identifiers contained leading zeros and those leading zeros are not informative. But in general string identifiers really are best left as string variables. If you need a numeric identifier, e.g. for panel work, solutions have already been discussed in this thread.

Furthermore, even if destring was applied mistakenly, applying a display format that shows leading zeros only affects what is displayed. It doesn't restore what was removed by destring. There is enough confusion on this point among Stata users who are still learning (see e.g. http://www.stata-journal.com/article.html?article=dm0067).
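The display-format point can be checked with a small sketch on a made-up one-observation dataset. The width of 6 is an assumption; the format only controls how the number is shown.

```stata
clear
set obs 1
generate long id = 23052

* A leading-zero format changes what list shows
format id %06.0f
list id

* The stored value is still 23052; nothing has been restored
assert id == 23052
```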
Nick Cox

Join Date: Mar 2014

Posts: 35734
#12

15 Nov 2014, 06:51

There is a big difference between you making small changes to your dataset that preserve the crucial information in your data (that's fine) and recommending as a general strategy what is not guaranteed to work. My general advice remains as given in #8.

This FAQ including information on getting numeric identifiers when you have string identifiers has been public since 1999: http://www.stata.com/support/faqs/da...ers/index.html

Last edited by Nick Cox; 15 Nov 2014, 07:02.
Nick Cox

Join Date: Mar 2014

Posts: 35734
#13

15 Nov 2014, 09:38

Note for later readers: My previous two replies won't make full sense because they were replies to a member who posted twice and then deleted their comments. The miniature debate was based in large part on a misunderstanding. As an hour has passed since posting, I can't delete my posts unilaterally.