Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Numeric variable turned into string. How do i recode or destring it?

    Hello,

    Long story short I used SQL to add some data from the world bank about oil and gas rents into my already big dataset. Everything seemed to work fine at first except missing labels.
    Now I discover that it have also affected some data, where the values have gone from numeric to string. So how do I make amends and fix this.
    In hindsight I should just have learned how to properly merge datasets instead of being impatient.


    Before:
    Click image for larger version

Name:	colonial_original.png
Views:	1
Size:	12.6 KB
ID:	1508519



    After:
    Click image for larger version

Name:	colonial_new.png
Views:	1
Size:	12.5 KB
ID:	1508520

    Last edited by Martin Skou; 19 Jul 2019, 13:12.

  • #2
    It is strange for a Stata merge to replace existing variables in your data--so yes, my first suggestion would be to fix your merge. (Presumably, you are saving all your code and steps for both SQL and Stata in do-files or scripts, so recreating your results will involve just re-running the scripts from the top, and it will be easy to change the merge to be correct the first time. If you aren't saving your intermediate steps in scripts, I advise you to fix that aspect of your project before doing anything else).

    That said, if you just want to encode the variable, run
    Code:
    encode ht_colonial, label(ht_colonial) gen(ht_colonial_numeric)
    (it might not work exactly right if labels have changed, as with the label for group 0 in your example--but that could be fixed manually).

    Comment


    • #3
      Well, it may not be quite that simple. -codebook- has done us the favor of listing the new string values in sort order, which is the same order that -encode- uses when creating the numeric variable.

      So we see that "" will be encoded as missing value, which is fine. But then "0. Never colonized..." will encode as numeric value 1 (but labeled as 0. Never colonized...). Then "1. Dutch" will encode as 2, and "10. Australian" will encode as 3, and so on. Now, this may or may not matter. If the original ordering of the numeric codes assigned to each nationality are important, this will scramble it and make the variable problematic. If the original order was just arbitrary, then it's OK, although it might be confusing to have a variable whose actual numeric value is, say, 3, labeled as "10. ...".

      Comment


      • #4
        Thanks so much for the quick reply's to both of you. I did what Kye Lippold suggested. Clyde Schechter you were right about it being out of order, but the order does not matter as of now.

        I guess I will have to inspect every variable I use in my model to determine wether other variables suffer from the same issue. If so I will have to re-do the merging of the datasets.
        Last edited by Martin Skou; 19 Jul 2019, 14:07.

        Comment


        • #5
          I am pretty sure the order is handled correctly by the use of the label() option (in this case, having the old ht_colonial label defined is the key, as that lets -encode- map values to categories based on the label rather than alphabetically). For example:
          Code:
          clear
          input var
          1
          2
          10
          3
          end
          lab def varlab 1 "Val1" 2 "Val2" 3 "Val3" 10 "Val10"
          lab val var varlab
          decode var, gen(var_string) // string variable from numeric 
          codebook var_string // listed in alphabetical order
          encode var_string,  gen(var_encode_nolabel)
          codebook var_encode_nolabel // numeric values are incorrect for labels due to sorting, as stated
          encode var_string,  label(varlab) gen(var_encode)
          codebook var_encode // labels are correct with the label() option
          In Martin's case, there will still be errors in ordering to the extent that the string values don't match the labels. For example, the value for 0 is truncated (missing the word "power") in the example output, so that label won't match. But cases like "Australian" should be correctly encoded as a numeric value of 10.

          Comment

          Working...
          X