Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • destring -->contains nonnumeric characters-->charlist to identify-->very strange output

    destring varname, gen (varname1)-->contains nonnumeric characters-->charlist varname to identify all the different characters for the variable varname-->very strange output -->

     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\ ]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒ ÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô

    I could use destring varname, gen (varname1) force but I want to be able to see which characters cause the destring error "contains nonnumeric characters"

    Any advice is appreciated..

  • #2
    There is no easy advice here. Somehow or other your string variable contains dozens of different non-numeric characters. You need to back up and tell us what it is and what you expect it to contain. There are any number of possible explanations, including corruption of your data. But the chance of getting consistently meaningful numeric values from applying destring looks like zero. Also, your own output seems to answer your question, so what advice you seek is not clear.

    Note that you referred to charlist, which is user-written from SSC. See the FAQ Advice for the etiquette of explaining that. Also, see the same place for how to present code more readably.

    Comment


    • #3
      Wow aren't you the original author of the -destring- command? Apparently my dataset contains missing values (blank cells, no ".") which caused the strange output. I cannot use the ignore() for blank/empty cells. Normally I would drop the obs with the blank cells, however, I need those rows for other research purposes. I could use 'force' which would make a new var and replace the blank cells in "." However, I want to be 100% sure that only blank cells are converted into ".". At the moment, I'm trying to figure out how to do that and still be able to -destring- the variable. The code -real- isn't appropriate I believe.

      Comment


      • #4
        Yes; I was the original author of destring, and the author of charlist too, and of some other stuff, but unfortunately that doesn't allow me to add much to my previous answer. If the last output you give is the result of

        Code:
         
        charlist varname
        then as said destring can't help you to get numbers out of a mess. You haven't yet said what the variable should contain. Incidentally, using varname as a variable name can't help you or anybody else get a clear idea of what is happening. Conversely, if you are not using varname then you are not following the advice to state exactly what you typed.

        Comment


        • #5
          Very useful contribution to Stata. I'm using birth years as the varname (e.g. the format= 1970). So I guess that it's not common that charlist converts blank/empty cells into  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\ ]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒ ÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô

          Comment


          • #6
            From what you say, you have a string variable that should contain birth years. So a good value is something like "1970" and a bad value looks like line noise. You could try

            Code:
              
            count if missing(real(birthyear))
            and see whether bad values are concentrated in parts of the dataset. My leading guess remains that your data are corrupted somehow..

            Comment


            • #7
              You might try this:
              Code:
              preserve
              forvalues i=0/9{
              replace subinstr(varnamr,"`i'","")
              }
              replace varname=trim(varname)
              list if length(varname)>0
              restore
              this should show you the cases with non-numeric values, which should be progress.

              Comment


              • #8
                Thank you both. @Nick, count if missing(real(birthyear)) had 197 as a result. -Destring- with force led to "(197 missing values generated)".
                So only blank/empty cells led to the problem with charlist

                Comment


                • #9
                  -Destring- with force led to "(197 missing values generated)".
                  So only blank/empty cells led to the problem with charlist
                  This isn't an accurate assessment of what happened. Destring with the force option will create missing values for any observations that contain non-numeric values for the original variable. Your results from charlist make it pretty clear that the problem is not observations that are all blank. If that is the output from running charlist on the variable in question, you have values that contain a whole range of non-numeric characters. Truly empty observations, that is those where birthyear=="", wouldn't affect the output of charlist at all. Observations containing blanks would show up as blanks in charlist. You need to look carefully at the observations that are causing you problems. Ben's code is one way to achieve that.

                  Comment


                  • #10
                    I've manually checked 10 of those specific observations (in browse-mode and edit-mode). The cells of the 'birthyear' variable really seem to be blank. Originally it was a dataset in Excel, which got imported in Stata. The data was manually collected.

                    Comment


                    • #11
                      Minor point: there are two small errors in Ben Earnhart's code above: -subinstr()- requires 4 arguments, the last being a count of how many occurrences are to be replaced, with . being a code for "replace them all." There is also a typo in the first argument. So it should read:

                      Code:
                      preserve
                      forvalues i=0/9{
                      replace subinstr(varname,"`i'","", .) // underlining shows change from original
                      replace varname=trim(varname)
                      list if length(varname)>0
                      restore

                      Comment


                      • #12
                        Originally posted by Sarah Edgington View Post

                        This isn't an accurate assessment of what happened. Destring with the force option will create missing values for any observations that contain non-numeric values for the original variable. Your results from charlist make it pretty clear that the problem is not observations that are all blank. If that is the output from running charlist on the variable in question, you have values that contain a whole range of non-numeric characters. Truly empty observations, that is those where birthyear=="", wouldn't affect the output of charlist at all. Observations containing blanks would show up as blanks in charlist. You need to look carefully at the observations that are causing you problems. Ben's code is one way to achieve that.

                        count if missing(real(birthyear)) had 197 as a result .....this command counts how many missing cells (blank cells) there are right? If there are 197 blank cells according to that command and destring with force has 300 as a result (example) ...then that would mean that only 197/300 would be blank/missing right? However, 197/197 seems to be a nice check in my opinion. Please correct me if I'm wrong because I probably misunderstood you Sarah.

                        Comment


                        • #13
                          No.


                          Code:
                          . display missing(real("ABC"))
                          1
                          count if missing(real(birthyear)) will give you a count of all values of birthyear that are not convertible to numeric: that would include both blanks and observations with non-numeric characters in them. That it gives you the same result as the warning issued by -destring, force- is no surprise whatsoever and tells you nothing at all about the contents of those 197 cells, except that they are not numeric. You need to do what Nick, Sarah, Ben, and I have suggested.

                          Comment


                          • #14
                            Originally posted by Clyde Schechter View Post
                            Minor point: there are two small errors in Ben Earnhart's code above: -subinstr()- requires 4 arguments, the last being a count of how many occurrences are to be replaced, with . being a code for "replace them all." There is also a typo in the first argument. So it should read:

                            Code:
                            preserve
                            forvalues i=0/9{
                            replace subinstr(varname,"`i'","", .) // underlining shows change from original
                            replace varname=trim(varname)
                            list if length(varname)>0
                            restore
                            I ran the above code (with "}" )
                            That resulted in a list with obs which contained the specific 'birthyear'. I could only see and check the last results of the list. However, I don't understand the use of the list. When I browse my dataset after I sorted the birthyears, I see 197 blank cells.

                            Code:
                            . display missing(real("ABC"))
                            1
                            I tried the above code, however, nothing happened (also nothing when I ran that code without the "." at the beginning and the "1" at the end)

                            Comment


                            • #15
                              Originally posted by Clyde Schechter View Post
                              No.

                              count if missing(real(birthyear)) will give you a count of all values of birthyear that are not convertible to numeric: that would include both blanks and observations with non-numeric characters in them.
                              Originally posted by ben earnhart View Post
                              You might try this:

                              this should show you the cases with non-numeric values, which should be progress.
                              The above seems to contradict each other because count if missing(real(birthyear)) results in the number of blanks (based on the manual check with -sort-) and the code of Ben gives a list of the obs which contain non-blanks, which Ben calls non-numeric values. So, it appears that in this case something else is the problem. Otherwise count if missing(real(birthyear)) would include both blanks and observations with non-numeric characters in them.

                              Comment

                              Working...
                              X