Remove subgroup of observations if conditions are true

Tasneem Mohammed

Join Date: Jul 2021

Posts: 9
#1

Remove subgroup of observations if conditions are true

19 May 2022, 04:09

My example dataset (see attached) has only one variable that contains observations in the form shown under the "base" variable in the picture above. All observations are formatted as strings.

The observations follow this pattern:
1) The string "Header" comes first.
2) A varying number of string observations come next
3) The string "End-Header" comes next.
4) A varying number of string observations then follow.
5) The string "Header" comes again to signify the start of the pattern again.

My objective is to write code that removes the observations between the two strings "Header" and "End-Header", including the "Header" and "End-Header" strings themselves. So, I wish to get to a final list of observations as shown under the "target" variable in the picture above. What is the most efficient way to do this?

I tried the following code without success. I get a "is not a valid command name" error. My plan was to generate a "counter" variable equalling 1 initially, and add one each time either "Header" or "End-Header" showed up while looping through the observations. I could then delete the observations coinciding with odd numbers of the counter variable, followed by deleting occurrences of "End-Header".

Code:

* Code adapted from https://www.stata.com/statalist/archive/2007-03/msg00525.html gen count = 1 local N = _N forvalue i = 2/`N'{ if base[`i'] == "Header" | base[`i'] == "End-Header"{ qui replace count = count[_n-1]+1 in `i' } } * drop odd occurences, then drop "End-Header" drop if mod(count,2) == 1 drop if base == "End-Header"

Any suggestions on why the code is not working, or on a better way to reach my objective would be greatly appreciated!

Stata/SE 16.0

Attached Files

example_dataset.dta (1.3 KB, 1 view)
Tags: cleaning data, looping

Nick Cox

Join Date: Mar 2014
Posts: 36058

19 May 2022, 04:36

This is your example dataset shown more accessibly.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 base
"Header"    
"A"        
"B"        
"End-Header"
"C"        
"D"        
"Header"    
"E"        
"F"        
"G"        
"End-Header"
"H"        
"I"        
end

Unfortunately your image of what you want is not readable. It needs to be .png, I think.

This conversation may give some ideas. I don't usually find that a loop over observations is needed if the data format is consistent.

Code:

. gen header = sum(strpos(base, "Header") > 0)

. l , sepby(header)

     +---------------------+
     |       base   header |
     |---------------------|
  1. |     Header        1 |
  2. |          A        1 |
  3. |          B        1 |
     |---------------------|
  4. | End-Header        2 |
  5. |          C        2 |
  6. |          D        2 |
     |---------------------|
  7. |     Header        3 |
  8. |          E        3 |
  9. |          F        3 |
 10. |          G        3 |
     |---------------------|
 11. | End-Header        4 |
 12. |          H        4 |
 13. |          I        4 |
     +---------------------+

. drop if strpos(base, "Header") > 0
(4 observations deleted)

. l , sepby(header)

     +---------------+
     | base   header |
     |---------------|
  1. |    A        1 |
  2. |    B        1 |
     |---------------|
  3. |    C        2 |
  4. |    D        2 |
     |---------------|
  5. |    E        3 |
  6. |    F        3 |
  7. |    G        3 |
     |---------------|
  8. |    H        4 |
  9. |    I        4 |
     +---------------+

. sort header, stable

. by header : gen j = _n

. reshape wide base, i(header) j(j)
(j = 1 2 3)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations                9   ->   4           
Number of variables                   3   ->   4           
j variable (3 values)                 j   ->   (dropped)
xij variables:
                                   base   ->   base1 base2 base3
-----------------------------------------------------------------------------

. l

     +--------------------------------+
     | header   base1   base2   base3 |
     |--------------------------------|
  1. |      1       A       B         |
  2. |      2       C       D         |
  3. |      3       E       F       G |
  4. |      4       H       I         |
     +--------------------------------+

Comment

Tasneem Mohammed

Join Date: Jul 2021
Posts: 9

19 May 2022, 20:15

Hi Nick,

Your suggestions worked like a charm. But, they went into more detail than what I needed, probably because my image showing the end goal was not readable.

Below, I will try to demonstrate what I wanted to achieve again so that other people who have a similar question find it useful.

I have the following example dataset. It is a single variable called "base", containing only strings.

Code:

     +------------------+
     |       base       |
     |------------------|
  1. |     Header       |
  2. |          A       |
  3. |          B       |
  4. | End-Header       |
     |------------------|
  5. |          C       |
  6. |          D       |
     |------------------|
  7. |     Header       |
  8. |          E       |
  9. |          F       |
 10. |          G       |
 11. | End-Header       |
     |------------------|
 12. |          H       |
 13. |          I       |
     +------------------+

I want to remove subgroups from my dataset of strings. Specifically, I want to remove observations from "Header" to "End-Header" inclusive, each time they occur in my dataset. Thus, my end goal is to only have "C", "D", "H" and "I" in my dataset. My final dataset should look like this:

Code:

     +------+
     | base |
     |------|
  1. |    C |
  2. |    D |
  3. |    H |
  4. |    I |
     +------+

Building on Nick's suggestion, I wrote the following code to reach my end goal:

Code:

* A variable that increments by 1 each time "Header" or "End-Header" appears as an observation
. gen header = sum(strpos(base, "Header") > 0)

     +---------------------+
     |       base   header |
     |---------------------|
  1. |     Header        1 |
  2. |          A        1 |
  3. |          B        1 |
  4. | End-Header        2 |
  5. |          C        2 |
     |---------------------|
  6. |          D        2 |
  7. |     Header        3 |
  8. |          E        3 |
  9. |          F        3 |
 10. |          G        3 |
     |---------------------|
 11. | End-Header        4 |
 12. |          H        4 |
 13. |          I        4 |
     +---------------------+

* Dropping "Header" and "End-Header" observations
. drop if strpos(base, "Header") > 0

     +---------------+
     | base   header |
     |---------------|
  1. |    A        1 |
  2. |    B        1 |
  3. |    C        2 |
  4. |    D        2 |
  5. |    E        3 |
     |---------------|
  6. |    F        3 |
  7. |    G        3 |
  8. |    H        4 |
  9. |    I        4 |
     +---------------+

* Observations I want (C,D,H,I) are listed next to even numbers. So, drop odd-numbered observations.
* For how this works, see Nick's explanation in RE: identifying even & odd numbers (stata.com)
. drop if mod(header,2) == 1

     +---------------+
     | base   header |
     |---------------|
  1. |    C        2 |
  2. |    D        2 |
  3. |    H        4 |
  4. |    I        4 |
     +---------------+

. drop header

Thus, I have removed the 2 subgroups of (Header ... End-Header) that were present in my dataset and I am left with what I only need (C,D,H,I). Hope this is useful to others!

Announcement