
  • Destring vs. encode

    Dear all,

    I have a question with regard to the commands destring and encode.
    In my regressions, I have “real” string variables that contain nonnumeric text (like country names or industries) and string variables that contain meaningful numeric text (like 5.2, 6.7, etc.). If I understand it right, I have to use encode for the variables with nonnumeric text and destring for the variables with numeric text.
    As I didn’t know the command destring, I converted all variables using encode. My regressions work perfectly, but when I look at the descriptive statistics, the values of the variables with numeric text are wrong.
    If I use destring for my numeric text variables and encode for the real string variables, I have two problems:
    (1) nearly all observations are dropped because of collinearity in my OLS regression
    (2) I get “no observations” for my IV regressions.

    I don’t really understand these two problems, as I don’t know how the two commands convert the variables. Is it possible to convert some of my variables with destring and others with encode within one dataset? And is there any possibility of solving my problems?

    Thanks
    Teresa


  • #2
    Your specific difficulties with your data are hard to diagnose. Collinearity of numeric variables can't be blamed on how they were converted from string, so far as I can see. The same applies to a report of "no observations". As you are not giving us a reproducible example showing your data, I don't know what to add, except comments on the difference between the two commands you ask about.

    encode and destring have quite different purposes despite their superficial similarity in mapping string variables to numeric. That is indeed why two separate commands were introduced. It is entirely likely in a large, complicated dataset that you may use both, depending on how data were given to you.

    encode is for mapping pure text (which could include numeric characters) to numeric variables: the mapping will be expressed by value labels. One kind of example is that you have answers to a survey question: how much do you like Stata? as values of a string variable howgoodStata expressing answers on an ordered or graded scale. The possible values are given as

    "outstanding!!"
    "excellent!"
    "good"

    and in this case you would want to define value labels and then apply encode using

    Code:
    label def howgood 1 "good" 2 "excellent!" 3 "outstanding!!"
    encode howgoodStata, gen(n_howgoodStata) label(howgood)
    so you are setting up a translation scheme to be used by encode and telling encode about it.

    The pitfall with encode is its default, which is to define labels in alphabetic order. If you merely issue

    Code:
    encode howgoodStata, gen(n_howgoodStata) label(howgood)
    then encode will draw up the scheme

    Code:
    label def howgood 1 "excellent!"  2 "good" 3 "outstanding!!"
    as part of the encode. In this and many other examples alphabetical order is not what you want, and results downstream will be at best awkward and at worst statistical garbage.

    destring has quite a different role. It is for cleaning up variables that "should be" numeric but somehow got imported as string. The "should be" could cover many reasons. Historically (Cox and Gould, STB-37: 34–37, 1997) it was that users were typing header text into Stata's data editor as if it were a spreadsheet. Later it was often that users were importing data from spreadsheets and that spreadsheet extras (e.g. intermediate rows with metadata) were causing variables to be read as string. Even when the extra material was deleted, the variables remained string, which is why destring was needed.
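
    As a minimal sketch of that cleanup (the variable name price is invented for illustration): suppose price was read as string because a stray header cell slipped into the data.

    Code:
    * price is string: "5.2", "6.7", ..., plus a stray header cell "price"
    destring price, replace force
    * force sets values that cannot be converted, such as "price", to missing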

    Now some of the main reasons that destring is needed are covered by some of its options:


    ignore(): remove specified nonnumeric characters, as characters or as bytes, and illegal Unicode characters

    percent: convert percent variables to fractional form

    dpcomma: convert variables with commas as decimals to period-decimal format
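
    As hedged one-line sketches of these options (variable names invented):

    Code:
    destring revenue, replace ignore("$,")    // strip currency symbols and commas
    destring share, replace percent           // "5%" becomes .05
    destring stringency, replace dpcomma      // "2,7" becomes 2.7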

    On this topic generally, in addition to the explanations in the documentation, a tutorial from 2002 could be of some use (it's a little out of date; for example, tostring is now an official command):


    SJ-2-3 pr0006 . . . . . . . . . . . . Speaking Stata: On numbers and strings
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
    Q3/02 SJ 2(3):314--329 (no commands)
    explains the use of numeric and string data types in Stata
    and how to convert from one kind to another

    http://www.stata-journal.com/sjpdf.h...iclenum=pr0006

    Final warning: neither encode nor destring is the way to tackle string dates.
    Last edited by Nick Cox; 31 Aug 2015, 04:05.



    • #3
      Thank you for your detailed answer.
      I used the dpcomma option and now I get my results without any problems.



      • #4
        I ran all my regressions with encode alone on the one hand, and with both encode and destring on the other.

        If I use encode for all my variables, I get significant results. However, if I use destring and encode, my estimators become insignificant. The reason might be that many observations are dropped due to collinearity. But why is there collinearity in the destring case?
        Above, Nick said that “Collinearity of numeric variables can't be blamed on how they were converted from string”. So how is this possible?

        My regression is
        Code:
        xi: ivreg2 lnGermanFDIs lnInfrastructureIndex lnQualityofPublicSchools lnCapitalLaborRatios lnOrganizedCrimeIndex lnDistancekm Commonlanguage  i.Country i.Year i.Industry (lnGDP lnEnvironmentalStringency lnEnvironmentalStability lnTariffRate lnIPRP = Tractorsagriculturalworker Landagriculturalworker Regionalcapitallaborratios RegionalOrganizedCrime Regionalpublicschoolquality Regionalinfrastructurequality Regionaltractorsagriculturalwo Regionallandagriculturalworker), robust endog(lnGDP lnEnvironmentalStringency lnEnvironmentalStability lnTariffRate lnIPRP)
        If I use encode, I get the following descriptive statistics
        [Image: encode.PNG — descriptive statistics (Variable, Observations, Mean, St.Dev., Min, Max)]

        If I use destring, I get the following descriptive statistics:

        [Image: destring.PNG — descriptive statistics (Variable, Observations, Mean, St.Dev., Min, Max)]

        Displayed this way, my variables are correct; e.g. the variable stringency of environmental policy takes values between 2.7 and 6.7.






        • #5
          This is the same question again, is it not?

          From what I can gather, encode is wrong here. If the strings are like "2,7", "2,8", etc., they will get encoded to 1, 2, etc. That is not what you want.
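
          As #3 found, the dpcomma option of destring handles this; a sketch with a placeholder variable name:

          Code:
          * encode would map "2,7" "2,8" ... to codes 1, 2, ... in alphabetical order
          * destring with dpcomma instead reads the comma as a decimal separator
          destring myvar, replace dpcomma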



          • #6

            Right, so I have to use destring. But I don't understand why in this case lots of variables are dropped due to collinearity. When I use encode, I don't have problems with collinearity.
            Shouldn't the relationship between variables be unaffected by how I convert them from string? That would imply that I have collinearity in both cases, destring and encode, or in neither. Or is it possible that my collinearity problem indeed comes from the way I convert the variables?



            • #7
              Try looking at a correlation matrix or scatter plot matrix to understand. That is the first course in statistics!

              So far as we can tell, encode here produces quasi-random garbage, because the alphanumeric order of the strings determines the numeric codes, and that order is partly meaningless. For example, "2" "20" "3" "30" will map to 1...4. Such variables are not highly correlated with each other; hence you appear to have no collinearity problem, but for the wrong reason.

              Conversely, collinearity that bites is likely to be real, arising from relationships between predictors.

              This is guesswork, however. You know that we do not have your dataset so a fortiori we cannot examine relationships in it.
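
              The alphanumeric-order point can be demonstrated in a few lines (made-up data):

              Code:
              clear
              input str2 s
              "2"
              "20"
              "3"
              "30"
              end
              encode s, gen(n)
              list s n, nolabel
              * n is 1 2 3 4: string sort order, not numeric value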
