force destring

Timea De Wispelaere

Join Date: Mar 2020

Posts: 95
#1

15 Mar 2020, 03:02

hi!

In a previous post someone mentioned that I have to be careful when forcing destring because some values might be lost. I have, however, forced destring on company codes. To check, I ran destring again, this time without force, and it went through anyway. However, the output differs between the two runs. Could you explain what the difference is, and whether forcing destring has implications for my data in this case? This is my code:

Code:

destring BVD, generate(numeric_BVD) force
BVD contains nonnumeric characters; numeric_BVD generated as long

clear
use "C:\Users\timea\OneDrive\Documents\master\THEsis\stata\cash\temp2009.dta", clear
destring BVD, generate(numeric_BVD)
BVD has all characters numeric; numeric_BVD generated as long

thank you!
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#2

15 Mar 2020, 03:44

You can check on it with something along the following lines.

Code:

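* stop if any company code is empty
assert !missing(BVD)
* position of the first character that is not a digit, minus sign, or period (0 if none)
generate byte has_nonnumeric = indexnot(BVD, "0123456789-.")
summarize has_nonnumeric, meanonly
if r(max) > 0 tabulate numeric_BVD if has_nonnumeric // or list BVD numeric_BVD if has_nonnumeric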

But, really, if company codes are IDs, then don't destring them. Keep them as string, and use -encode- instead.
Nick Cox

Join Date: Mar 2014

Posts: 35726
#3

15 Mar 2020, 03:59

Joseph Coveney gives excellent advice. Here are a few more points.

Code:

destring, generate() force

changes nothing in the dataset as it just adds a new variable.

Whether that is a good idea depends on the variable and the force needed. As Joseph says, using destring if you need encode is a bad idea.

The key to knowing what force deals with is

Code:

tab BVD if missing(real(BVD))

which shows you what needs force -- or other documented options, notably ignore().
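As a minimal sketch of that check, assuming a small made-up dataset (the three values below are hypothetical, not taken from the poster's file), the following lists the values that real() cannot convert and then tries ignore() for the one that is only a formatting issue:

```stata
clear
* hypothetical toy values, for illustration only
input str10 BVD
"1234"
"5,678"
"unknown"
end

* show exactly which values real() cannot convert to a number
tabulate BVD if missing(real(BVD))

* ignore(",") strips the commas before conversion;
* "unknown" is still nonnumeric, so destring without force declines to create n_BVD
destring BVD, generate(n_BVD) ignore(",")
```

Here ignore() handles the comma, while the remaining value still has to be dealt with explicitly before the conversion goes through.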

Recently, with a file I was looking at, destring refused to act on three variables. Close scrutiny revealed the text "unknown" in two observations. Using force here is not only fine, it's the easiest solution once you have identified the problem.
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#4

15 Mar 2020, 12:32

"Using force here is not only fine, it's the easiest solution once you have identified the problem."

The easiest, yes, but perhaps not always the best.

If the data set in question is a final version, then, yes, with that knowledge of what the problem is, -force- is your friend. But if this is a data set that might be updated, the update might introduce new non-numeric material, such as a mistyped number (2..5 where 2.5 was meant, for example) which -destring, force- will convert to a missing value without making you aware of the problem. So if there is a chance that the data set will change (or that the code will be reused with a different data set), then it would be safer to do this:

Code:

replace BVD = "" if BVD == "unknown"
destring BVD, generate(n_BVD)

That way, if the newer data contains problems other than "unknown," they won't be swept under the rug.
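A minimal sketch of the difference, assuming a hypothetical three-value toy dataset that contains both the known "unknown" text and a mistyped number (the variable name forced_BVD is made up for the illustration):

```stata
clear
* hypothetical toy values: one valid number, the known "unknown" text, and a typo
input str10 BVD
"2.5"
"unknown"
"2..5"
end

* with force, both "unknown" and the typo "2..5" silently become missing
destring BVD, generate(forced_BVD) force
count if missing(forced_BVD)

* the targeted fix blanks only "unknown", so the typo still blocks the conversion
replace BVD = "" if BVD == "unknown"
destring BVD, generate(n_BVD)

* n_BVD does not exist at this point, which is the signal that something else is wrong
capture confirm variable n_BVD
if _rc display "n_BVD was not created: inspect the remaining nonnumeric values"
```

The second destring declines to create n_BVD, so the typo gets noticed rather than turned into a missing value.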
Nick Cox

Join Date: Mar 2014

Posts: 35726
#5

15 Mar 2020, 12:40

My point is that force is fine if and only if you are completely confident that it solves the problem in hand (and I was). If a string variable is updated, then indeed you still need to check what inhibits or prohibits a destring.
Timea De Wispelaere

Join Date: Mar 2020

Posts: 95
#6

16 Mar 2020, 13:26

Thank you for your feedback. So if I understand you right, Nick Cox, if I destring a variable turnover but it does not work, and the tab command shows that the values that need to be forced are n.a., then force is no problem, since the destring command turns n.a. into . which is the same outcome anyway, right?
Timea De Wispelaere

Join Date: Mar 2020

Posts: 95
#7

16 Mar 2020, 14:52

also, could you tell me the difference between encode and destring? Thank you very much for your valuable answers
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#8

16 Mar 2020, 15:08

-destring- is used when a string variable contains only material that looks, to the human eye, like numbers and you want to be able to do calculations with those numbers. For example if you have a string variable whose values are "487.3", "99", "12345" etc., you can convert that to a numeric variable whose values are 487.3 (approximately), 99, and 12345 using -destring-.

-encode- is used when you have a string variable that defines a limited number of categories such as "Male" and "Female" or "France", "Germany", "India"..., or the like and you want to create a variable that represents those things numerically. The numbers that are used for this purpose are just arbitrary numbers and should not be used to perform calculations. It makes no sense to talk about the mean of "France", "Ireland" and "Russia." Numeric variables like this are useful to represent categorical variables in regression models, and also to designate panels in panel data. With -encode- the resulting numeric variable will have values like 1, 2, 3..., or, the values in a pre-specified value label whose name appears in the -label()- option of the command.

If you try to -destring- a variable that should be -encode-d, nothing terrible happens: Stata notices that the variable contains material that cannot be represented as numbers, refuses to do anything with the variable, and warns you about this.

If you -encode- a variable that should be handled with -destring-, however, you will get a dangerous variable. It will look to your eye like you have successfully converted "487.3", "99", "12345" into 487.3 (approximately), 99, and 12345. But what you will really have is a variable whose values are, respectively, 2, 3, and 1 (-encode- numbers the strings in alphabetical order, so "12345" comes first), but they are labeled to look like the numbers 487.3, 99, 12345. This is a trap, which you might only recognize when you see the results of some analysis. For example, if you summarize that variable you will see a mean of 2 rather than the 4310.4333 that you were expecting.
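The trap can be reproduced with a small sketch along these lines, using the same three strings (the variable names s, coded, and number are made up for the illustration):

```stata
clear
input str5 s
"487.3"
"99"
"12345"
end

* encode assigns arbitrary integer codes and attaches the strings as value labels
encode s, generate(coded)
* destring converts the strings to the numbers they represent
destring s, generate(number)

* with labels shown, coded looks like the real numbers
list s coded number
* nolabel reveals the underlying integer codes
list s coded number, nolabel

* the means differ: coded averages the codes, number averages the real values
summarize coded number
```

Listing with and without the nolabel option is a quick way to see whether a variable that looks numeric is really a labeled set of codes.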

Last edited by Clyde Schechter; 16 Mar 2020, 15:21.
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#9

16 Mar 2020, 18:25

Originally posted by Timea De Wispelaere

force is no problem, since the destring command turns n.a. into . which is the same outcome anyway, right?

In your particular case, maybe, but not always. There are occasions when not applicable (or not available?) for an observation is not the same as unknown or unknowable for a given analysis, and you'd want to keep them distinct.

In those cases you might be better off with something along the following lines.

Code:

replace outcome = ".n" if outcome == "n.a."
replace outcome = ".u" if missing(outcome)
destring outcome, replace
label define Outcomes .n "Not applicable" .u Unknown
label values outcome Outcomes
Timea De Wispelaere

Join Date: Mar 2020

Posts: 95
#10

17 Mar 2020, 03:00

Thank you very much for the understandable explanation! You really helped me further
Timea De Wispelaere

Join Date: Mar 2020

Posts: 95
#11

18 Mar 2020, 15:21

I am still struggling with destringing my companyID variable. Because I need to select the companies I need from a very big database, I have to destring them anyway, because encoding won't do the trick, right? All that does is give a code to the company number. I just don't see how destringing would be a problem... All the companies I need have a format 0123456789 and do not include any letters or other symbols, so my guess is that I won't lose any data points when forcing destring?
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#12

18 Mar 2020, 22:51

If your companies all have a format 0123456789 and do not include any letters or other symbols, you don't need the -force- option. So, yes, you can -destring- this if you want to: but don't use -force-. If it won't work without -force-, then your data are not what you think they are and you need to discover that and fix it before mangling the data and creating errors.

This is one of those uncommon situations where either -encode- or -destring- will work, because the strings read as numbers to the human eye. Since the function of these strings is to identify companies, and you are not going to do actual calculations with the numericalized company identifiers, there is really no reason to use -destring- here; -encode- would be more typical. But -destring- is possible and, if your data are truly the way you describe them, it won't do you any harm.
Timea De Wispelaere

Join Date: Mar 2020

Posts: 95
#13

21 Mar 2020, 07:47

thank you very much!