Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying completely missing (empty) rows

    Hi All,

    I have a data set comprising only off string values but with many missing values and sometimes completely missing rows. A similar small data set can be inputted as follows
    (operating system windows 7)

    Code:
    version 12.1
    clear all
    set obs 6
    input str1 var1 str1 var2 str3 var3
    a b c
    a c
    "" "" d
    "" "" ""
    "" k ""
    "" "" ""
    end
    Note that the fourth and the sixth rows/observations are completely missing or empty. What I want to do is delete these two rows only. The actual dataset has 29 variables and 1284 rows (and there are many such datasets) hence eyeballing and deleting the missing rows is slightly problematic. So if anyone knows how to do it with any particular set of commands for one dataset, it would greatly help me. All variables are string variables, no numeric variables.
    Last edited by Krishanu Karmakar; 22 Aug 2014, 11:18.

  • #2
    try the "rowmiss" option to the -egen- command; e.g., egen countmiis=rowmiss(var1 var2 var3)

    Comment


    • #3
      For a convenience command, find and install dropmiss (SJ).

      Code:
      search dropmiss
      dropmiss, obs
      For a principled approach,

      Code:
       
      drop if missing(var1) & missing(var2) & missing(var3)
      will work for your example. For more complicated examples, you could use the rowmiss() function of egen

      Code:
       
      egen nmiss = rowmiss(var*)
      drop if nmiss == 29

      Comment


      • #4
        Thank you both.

        Comment


        • #5
          Usual caveat when working with strings: "" is missing, while " " (blank space) is not. So if you are fully 100% confident in your data, then the suggested solutions will work. But if you have any doubt, might run a loop to run trim() on them first, which isn't too bad, given only 29 variables. Something like:
          Code:
           foreach var of varlist var1-var3 {
          replace `var'=trim(`var')
          }
          Could be a bit messy if you have a mix of strings and numeric variables, but again, with only 29 of them, not too bad, and better safe than sorry.

          Comment


          • #6
            Ben makes a good point. dropmiss (see #3) also has an option to catch multiple spaces when the user wants to regard such values as missing too.

            Comment


            • #7
              Yes that is a concern. Thank you for pointing it out. Till now I did not face the problem, but there is no assurance that I will not face the problem mentioned by Ben with my other datasets. I have been using the trim() function at a latter step, but I think I should use it earlier (in reference to my workflow) and using a loop like Ben's.

              Comment


              • #8
                I want to delete rows with missing observations across 110 variables. My dataset has noth string and numeric values.
                First I tried what Nick has suggested
                egen nmiss = rowmiss(var*) drop if nmiss == 110 It says var* not found. Then I realised my data has both string and numeric values. Next, I tried what Ben has suggested i.e. loop. foreach var of varlist hhid-gc { replace `var'=trim(`var') } Here, hhid and gc are my first and last variable respectively. But there is an error, type mismatch. I know this error is because there string as well as numeric data. I don't understand why m i getting this error ? From Ben's comment I gathered -trim- command allows string " " (singlr blank spaces).

                Comment


                • #9
                  Copying code carries an obligation to think about whether it makes sense for your situation.

                  The syntax var* makes sense if and only if your variable names have names like that.

                  The most that trim() -- which is a function, not a command -- can do for you is remove leading and trailing spaces from string variables, and it won't ignore numeric variables.

                  The help explains:

                  trim(s)
                  Domain: strings
                  Range: strings without leading or trailing blanks
                  Description: returns s without leading and trailing blanks; equivalent to ltrim(rtrim(s)). trim(" this ") = "this"
                  If spaces in string variables are to you not informative, you can do this, identifying the string variables and then looping over those alone:

                  Code:
                  ds, has(type string) 
                  foreach var in `r(varlist)' {
                       replace `var'=trim(`var') 
                  }
                  You can now install missings from the Stata Journal and then go

                  Code:
                  missings dropobs, force
                  See http://www.statalist.org/forums/foru...aging-missings

                  and in Stata

                  Code:
                  search missings
                  to find the latest download location.

                  Comment

                  Working...
                  X