Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    If you want to count the number of observations that just spaces or entirely missing for birthyear you want the following code
    Code:
    *create a new variable with leading and trailing spaces removed
    *this will cause any variable that is all spaces to be missing
    gen newby=trim(birthyear)
    
    **count if the new variable is missing
    count if missing(newby)
    To look at the values of the birthyears with non-numeric values you might try

    Code:
    ** prevent results scrolling past
    set more on
    
    **create a flag for observations that are not real numbers and also not blank or missing
    mark nonnum if missing(real(birthyear))==1 & missing(trim(birthyear))~=1
    
    **now look at the contents of birthyear for those observations
    list birthyear if nonnum==1

    Comment


    • #17
      1. Good pickup on the missing }!

      2. Sorry for the confusing code display. What you should type is:
      Code:
       display missing(real("ABC"))
      Stata will respond with "1" showing that if, you have an observation where birthyear contains non-numeric characters other than just blank, real(birthyear) will be missing. So, again, the conclusion is that count if missing(real(birthyear)) just gives a count of how many birthyears couldn't be converted to numeric: they may be blank, but they may not, and your -charlist- results strongly imply that at least one of them is very much not blank.

      3. If you run Ben Earnhart's code (with corrections, including the -list- command):

      Code:
      preserve
      forvalues i=0/9{
           replace subinstr(varname,"`i'","", .) // underlining shows change from original
           replace varname=trim(varname)
      }
      list if length(varname)>0
      restore
      you will get a listing of all the observations where the variable in question could not be converted to numeric and is not blank. That is because the code in between the curly braces will have eliminated all numeric characters and leading or trailing blanks, so anything that was not originally a number and not blank will have something left in it, but numbers and blanks will be empty.

      You really need to do this to see why you are getting such bizarre results from -charlist-. There is something quite wrong when -charlist- shows you that kind of wild output on a variable that you expect to be 4 digit numbers! You shouldn't use the data for analysis until you figure it out and fix it.

      Comment


      • #18

        Code:
          replace subinstr(varname,"`i'","", .)
        isn't legal code. Perhaps what is meant is
        Code:
             
         replace varname = subinstr(varname,"`i'","", .)

        Comment


        • #19
          Thank you Sarah

          Originally posted by Sarah Edgington View Post
          If you want to count the number of observations that just spaces or entirely missing for birthyear you want the following code
          Code:
          *create a new variable with leading and trailing spaces removed
          *this will cause any variable that is all spaces to be missing
          gen newby=trim(birthyear)
          
          **count if the new variable is missing
          count if missing(newby)
          To look at the values of the birthyears with non-numeric values you might try

          Code:
          ** prevent results scrolling past
          set more on
          
          **create a flag for observations that are not real numbers and also not blank or missing
          mark nonnum if missing(real(birthyear))==1 & missing(trim(birthyear))~=1
          
          **now look at the contents of birthyear for those observations
          list birthyear if nonnum==1

          Result of the first code:

          -->(197missing values generated)

          Result of the second code:

          -->197

          To look at the values of the birthyears with non-numeric values you might try

          Originally posted by Sarah Edgington View Post
          Code:
          ** prevent results scrolling past
          set more on
          I should have realized that.

          Originally posted by Sarah Edgington View Post
          Code:
          **create a flag for observations that are not real numbers and also not blank or missing
          mark nonnum if missing(real(birthyear))==1 & missing(trim(birthyear))~=1
          Originally posted by Sarah Edgington View Post
          Code:
          **now look at the contents of birthyear for those observations
          list birthyear if nonnum==1
          --> only 1 observation was listed with year 1947
          I manually checked what was wrong with that observation.... there was a space after 1947. After I deleted that space and ran your codes again, that problem was solved. However, the problem with the 197 (196) blank cells still remains

          Comment


          • #20
            replace birthyear=trim(birthyear) ....removed the space after the obs....but -destring- doesn't work after all, after the trim() command....it only works if I manually delete the space.

            Any suggestion?
            Last edited by LydiaSmit; 24 Jul 2014, 15:36.

            Comment


            • #21
              My apologies, trim() led to (1 real change made), however, the space still remains after the observation unless I manually delete it.

              Comment


              • #22
                When you say manually delete it, do you mean that you're deleting the space through the data editor?
                My guess is that what you're seeing as a space is not actually a space but rather some other character or set of weird characters that looks like a space when you list the variable or look in the data editor.

                I'm a little lost on what all you've done to your variable at this point. It sounds like what you've ended up with is one observation where there are some strange characters that look like space but aren't. It's hard at this point to guess at what exactly is in that observation or how it got there.

                What output (show exactly the output) do you get if you run the following code (all of it) starting from your original birthyear variable before you make any changes?

                Code:
                clonevar check=birthyear
                replace check=trim(check)
                replace check="" if ~missing(real(check))
                charlist check
                display r(ascii)

                Comment


                • #23
                  Originally posted by Sarah Edgington View Post
                  When you say manually delete it, do you mean that you're deleting the space through the data editor?
                  Yes indeed, so far that's the only solution. I'll try your code now. I also guess that the 'space' is actually a set of weird characters....the set of chars -charlist- gave me....even though that set of chars still remains after manually deleting the 'space' in the data editor.

                  I manually collected that data and put it in Excel, after that I imported it into Stata. Now I'm trying to remove that space based on that dataset (so without the influence of other codes/commands in my .do-file)
                  Last edited by LydiaSmit; 24 Jul 2014, 16:10.

                  Comment


                  • #24

                    Code:
                    clonevar check=age2013
                    (195 missing values generated)
                    
                    .
                    . replace check=trim(check)
                    (1 real change made)
                    
                    .
                    . replace check="" if ~missing(real(check))
                    (2574 real changes made)
                    
                    
                    charlist check
                            
                    
                     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\
                    > ]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½
                    > ¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô
                    
                    .
                    . display r(ascii)
                    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
                    > 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 6
                    > 9 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
                    Last edited by LydiaSmit; 24 Jul 2014, 16:22.

                    Comment


                    • #25
                      I find it puzzling that charlist gives you ascii codes for many fewer characters than what it shows you in the output.
                      Nonetheless it's pretty clear that there's something wrong with that observation. Possibly you copy and pasted the value into excel and accidentally copied a bunch of other characters somehow as well.

                      You also appear to have 196 observations for which birthyear is missing. If that's the number you expect based on what you entered in excel there's no problem. However, if you don't have exactly 196 blank cells in the original excel sheet, you're going to want to carefully compare your original excel data to the data Stata produces after importing the data. In my experience, Stata's import excel command is very good, but there's always the possibility that weird characters in a cell of a spreadsheet might cause import problems.

                      I'm generally opposed to editing raw data manually but this is one case where it probably makes sense to do so. You just want to make absolutely sure that you're editing the correct case and that you verify that the value you enter when you do the edit is correct. If it were me I would also put a note in my import dofile reminding myself of the issue and indicating that I should only import the corrected excel spreadsheet, not the original with the error.

                      Comment


                      • #26
                        I checked the Excel-file and found out that the value is stored as text instead of a number. However, even changing the format in Excel to number (for that specific cell) doesn't change the fact that Excel sees it as text. (based on the formula ISTEXT and a sum in another formula like 1947+5 gives the error #VALUE!) In Stata it has the same format as the other birth years.

                        drop if strmatch(birthyear,"*!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒ ÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô*")

                        That should work in theory, however, this error shows up: too few quotes r(132);

                        Last edited by LydiaSmit; 24 Jul 2014, 17:45.

                        Comment


                        • #27
                          I haven't followed along carefully, but it sounds like there is a problem with one cell, maybe one case, in an Excel file you created yourself. Assuming you know the correct value, why not just go ahead and enter it or reenter the entire row if necessary? I figure things like this are Acts of God and they're not worth spending the time to figure them out if a quick fix is possible. Maybe your dog accidentally jumped on your keyboard when the cursor was on that cell. ;-)
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam

                          Comment


                          • #28
                            Maybe I should just solve it in Excel, however, I don't like to give up on 'problems'. I can also manually change the cell in Stata and then it works, however, then the code is dangerous: replace birthyear = "1947" in 587

                            The change will be made before merging datasets and running other codes, however, it would be better if I can use a different code to change "1947 " into "1947" for that specific cell (without the need of giving a position/order number like 587 in the above case)

                            Comment


                            • #29
                              Solved it!

                              Comment


                              • #30
                                In case anybody else ever has a problem like this, what was the solution?
                                -------------------------------------------
                                Richard Williams, Notre Dame Dept of Sociology
                                StataNow Version: 19.5 MP (2 processor)

                                EMAIL: [email protected]
                                WWW: https://www3.nd.edu/~rwilliam

                                Comment

                                Working...
                                X