Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • format variable as string if numeric

    Hi

    I have two more or less identical .csv files.

    However, if I insheet both .csv files by using

    Code:
    insheet using "example1.csv",delimiter (";")
    respectively

    Code:
    insheet using "example2.csv",delimiter (";")
    the first variable (code) of the first .csv file is formatted as string and the first variable (code) of the sencond .csv file is formatted as numeric.

    I checked both csv. files. In Excel they have the same format.

    So I have no idea what is the reason for that.

    However, I thought maybe it is possible to change the format afterwards.

    So what I would like to do is to to change the variable if it is numeric to string. If it is allready a string variable, it is ok.

    I use this command

    Code:
    tostring code, gen(code2) format(%15.0f)
            drop code
            ren code2 code
    This code works if the variable code is numeric. But if code is a string an error message appears that the variable "code" is allready formatted as string and therefore no new variable "code2" will be generated.

    So what I need is something like a condition that tells stata to change the format from numeric to string only if the variable is numeric.

    Any ideas?

  • #2
    If you want to change from numeric to string, or vice versa, that is a matter of changing storage type. In Stata format means "display format", which is secondary. You don't change types by changing format; nor is that the way to describe it. It's unfortunate, and clearly not your fault, that the term "format" is overloaded and used in different senses in computing.

    What you are asking for here is in effect the replace option of tostring. If the variable is already string, nothing bad happens. If the variable is numeric, you get the change you want.

    All that said, it's much more likely to me that you have it the wrong way round. You should be applying destring to reverse the incorrect string representation. But at some point you need to know why you are getting different entry of data as there could be implications for your data analysis if you are making the wrong decisions.

    Comment


    • #3
      Let me add more:

      A variable that is usually imported as numeric but sometimes imported as string sounds like one that "should be" numeric. So why is Stata importing it as string?

      You need to look to find out, but it could be something in just one value. If you are surprised at a variable being string, then destring alone is always safe, but never use destring, force unless (1) you really know what you are doing and (2) you are prepared to re-do everything if you get it wrong. Calling the option force is a StataCorp convention but the wording is not sufficient to deter some users. The option ought to be called mindlessviolenceandItakefullresponsibility or some such.

      A good routine check for puzzling strings is

      Code:
       
      tabulate puzzlingstring if missing(real(puzzlingstring))
      That shows the string values that can't be converted to numeric without yielding missing values. Often they will be something like

      "NA" and siblings
      hyphens used to indicate missing values
      header information imported as data
      letters mistyped for numbers

      Comment


      • #4
        I completely agree with Nick, the destring command is much more sensitive than the opposite tostring, but more frequently required.
        I would add that the force option is often misinterpreted. It does not force a variable to convert into numeric but considers them as missing values.
        I recommend to consider first the ignore() option, very useful for deleting blanks between numbers (when written 25 000 instead of 25000), or any given character you know it prevent Stata for seeing numbers while it is.

        Hence the name of the two options might be misleading. force ignores the observation and gives it a missing value, whereas ignore() ignores one (or many) character when reeding the observation, and then might transform string to numeric.

        Comment


        • #5
          I am always happy with complete agreement, but nevertheless I didn't mean to imply that destring is "much more" sensitive than tostring. Parental feelings aside, the intent of both commands is that there is no irreversible loss of information unless you ask for it.

          Otherwise Charlie captures the spirit of these commands quite well. I would summarize destring, force as yielding missing numeric values if and only if a string value cannot be converted to numeric non-missing values using real() after any other destring options specified have done their best.

          Comment


          • #6
            I would add one thing to Nick's list in #3 of what you may see when you run:

            Code:
            tabulate puzzlingstring if missing(real(puzzlingstring))
            Sometimes you will see things that look for all the world like correct numbers. This can result from the inclusion of non-printing characters in those entries. These characters do not show up when the data are displayed, but they are there to gum up the works. And chasing them down can be maddening.

            If you do see what look like perfectly good numbers, the next step should be

            Code:
            charlist puzzlingstring if missing(real(puzzlingstring))
            return list
            -charlist- is available from SSC and, if I recall correctly, is another of Nick's many contributions. The immediate output of -charlist- may not be very informative about non-printing characters because, typically, they are just as invisible there. But looking at the ASCII codes that come out of -return list-, and referring to a table of ASCII codes, you can ferret them out. (If you don't have a table of ASCII codes handy, you can download -asciiplot- from SSC [this one is also Nick's, along with Svend Juul and Michael Blasnik] and that command will display such a table in the Graph window.)

            Comment


            • #7
              Thanks to Clyde for the mention of charlist. I was thinking about it last night in terms of a kind of string variable checker that would report on numeric variables that resist destring, perhaps even several at once.

              Comment


              • #8
                @Nick Cox: Yes you were right! One of the datasets contained a string observation. Furthermore many thanks for all this helpful information!

                Comment

                Working...
                X