  • looping over lines

    Hello,

    I am very new to Stata; I've just started today and I need some help.
    I have an xlsx file with some useless information in some lines. This information is text, like a header, but it appears twice in the file. The columns and rows don't have names. So I would like to loop over the lines and check whether each one contains a string, let's say "str"; if so, I will drop it.
    The pseudocode would be:
    for (i =0;i<nb_lines;i++){
    if line.contains("str"){
    drop line
    }
    }
    The data looks like this:
    the header
    useful info
    ....
    useful info
    .....
    useful info
    ....
    the header
    Thank you in advance,

    Lamya

  • #2
    In the future, please read the FAQ, especially #12, which shows how to give a sample of your data with dataex and how to use code delimiters.

    Read through help strmatch
    This will loop through all variables for each observation (row) and drop those observations not containing "string":



    Code:
    gen flag=.
    forvalues i = 1 / `=_N' {
        local counter=0
        foreach var of varlist _all {
            local j `=`var'[`i']'
            if `=strmatch("`j'" , "*string*")' == 1 local ++counter
        }
        if `counter'==0 replace flag=1 in `i'
    }
    *after examining the flag variable to ensure you got the right string:
    drop if flag==1
    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1



    • #3
      Thank you so much Carole.



      • #4
        I think Carole J. Wilson means keep not drop in her last line.

        Further, the double loop over observations and variables isn't needed here so far as I can see.

        Code:
        gen flag = 0
        quietly foreach var of varlist _all {
            replace flag = flag + strmatch(`var' , "*string*")
        }
        keep if flag
        has the same consequence
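
        A possible variant, offered as a sketch rather than as part of the original answer: strmatch() expects string arguments, so if the dataset holds numeric variables (the new flag variable is itself numeric), it may be safer to loop over the string variables only. The local name strvars is an illustrative choice, not something from the thread.

        ```stata
        * Sketch only: restrict the loop to string variables.
        * ds, has(type string) leaves their names in r(varlist).
        ds, has(type string)
        local strvars `r(varlist)'

        gen byte flag = 0
        quietly foreach var of local strvars {
            * mark the observation if this variable contains "string"
            replace flag = 1 if strmatch(`var', "*string*")
        }

        * flag == 1 marks observations with at least one match
        drop if flag
        ```

        Swap drop for keep depending on which observations are wanted.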



        • #5
          Nick's code is much cleaner and easier to read, but the original post requested to drop if the string was present.


          • #6
            I indeed changed drop to keep.
            Thank you Nick for this optimized answer; it worked too (I used drop this time).



            • #7
              I wanted to use your code, @Carole J. Wilson, to drop all the lines that come after a certain position (the line where I find a "string", which is unique in the file). I store the value of the position in the flag. How can I use the value of the flag?


              Code:
              gen flag=.
              forvalues i = 1 / `=_N' {
                  local counter=0
                  foreach var of varlist _all {
                      local j `=`var'[`i']'
                      if `=strmatch("`j'" , "*string*")' == 1 local ++counter
                  }
                  if `counter'==1 replace flag=`i' in `i'
              }
              drop in `flag'/`_N'
              Last edited by lamya kejji; 09 May 2016, 07:27.



              • #8
                Carole:

                Thanks for the quick reply. You are quite right that the original post asked for drop. I was reacting to your comment that your code
                would drop those observations not containing "string" (emphasis added).

                As Lamya appears now to be saying that keep is interesting too, we can perhaps all agree that the code finds observations with matches and can then be used to drop or keep depending on circumstance.



                • #9
                  Nick or Carole, could you please help with my second question?



                  • #10
                    If I understand the new question correctly an answer is

                    Code:
                    gen flag = 0
                    quietly foreach var of varlist _all {    
                         replace flag = flag + strmatch(`var' , "*string*")
                    }
                    
                    keep if sum(flag[_n-1]) < 1



                    • #11
                      After you get the flag variable (my way or Nick's), we'll create a counter variable id that is just the number of the line:

                      Code:
                      gen id=_n
                      sum id if flag==1
                      The resulting minimum value is the id number of the first time flag==1

                      Code:
                      drop if id > r(min)
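
                      The two steps above could also be run as one short do-file fragment. This is a sketch under assumptions: flag already exists and equals 1 on matching lines, and the meanonly option is my addition to suppress the output table while still leaving r(min) behind.

                      ```stata
                      * Sketch only: assumes flag == 1 marks lines containing the string.
                      gen long id = _n
                      summarize id if flag == 1, meanonly

                      * r(min) is the line number of the first flagged line.
                      * If nothing was flagged, r(min) is missing, no id exceeds
                      * missing, and so nothing is dropped.
                      drop if id > r(min)
                      drop id
                      ```

                      The flagged line itself is kept, because the comparison is strictly greater than r(min).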


                      • #12
                        Exactly! Thank you Nick.

                        Could you please explain the last line?



                        • #13
                          Aah, I understand now. Thank you Carole, this is really helping!



                          • #14
                            You should be able to work it out! If flag goes 0, 0, 0, 0, 0, ..., 1 then its cumulative sum is the same through the sequence 0, 0, 0, 0, 0, ..., 1, after which you don't care. The offset of 1 observation is needed if you want to keep the line which is flagged, which is implied by wanting to drop after that line.
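
                            To see the running sum at work, here is a small sketch on made-up data. The variable names text and before and the five toy lines are purely illustrative; the marker "string" sits on line 3.

                            ```stata
                            * Toy data (hypothetical): the marker appears on line 3.
                            clear
                            input str12 text
                            "useful info"
                            "useful info"
                            "a string"
                            "junk"
                            "junk"
                            end

                            gen flag = strmatch(text, "*string*")

                            * sum() is a running sum; flag[_n-1] is missing on line 1,
                            * and sum() treats missing as 0.
                            gen before = sum(flag[_n-1])
                            list, noobs

                            * flag   is 0 0 1 0 0
                            * before is 0 0 0 1 1
                            * so lines 1 to 3 (up to and including the flagged line) remain
                            keep if before < 1
                            ```

                            Dropping the [_n-1] offset would make the running sum reach 1 on the flagged line itself, so that line would be removed as well.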

                            Carole's technique is essentially equivalent, although in this problem it seems that we don't need to create a new variable as flag contains precisely the information we need already.

                            For write-ups of Carole's technique see references yielded by

                            http://www.stata-journal.com/sjsearc...h+observations
                            Last edited by Nick Cox; 09 May 2016, 08:15.



                            • #15
                              Many thanks, Nick!

