how to extract specific values from a variable

Kishan Bakrania

Join Date: Jun 2014

Posts: 18
#1

16 Jun 2014, 06:18

Dear all,

Forgive me if this question has been posted and answered before. I am really struggling to find the command or program needed to solve the Stata problem below, and I hope someone is able to help me.

My issue:

I have a binary variable 'x' (coded 0/1). For example:

ID  x
 1  1
 2  1
 3  1
 4  1
 5  1
 6  1
 7  1
 8  1
 9  0
10  0
11  0
12  0
13  1
14  1
15  1
16  1
17  1
18  1
19  0
20  1
21  0
22  1
23  1
24  1
25  1
26  1
27  0
28  1
29  1
 .  .
 .  .
 .  .

This variable continues for millions of observations. I need to extract specific information from 'x', and this is where I am struggling.

I need the length of every group of consecutive observations that has 5 or more consecutive 'x'=1 observations immediately before it and immediately after it.

This is best explained with an example. In the data above, I would require a list/output containing the following 2 values: 4 and 3. The 4 is there because four observations (ID = 9, 10, 11 and 12) lie between runs of 8 and 6 ones, and the 3 because three observations (ID = 19, 20 and 21) lie between runs of 6 and 5 ones.

Does anyone on here know how I can possibly do this? Thank you so much in advance.
Kishan Bakrania

Join Date: Jun 2014

Posts: 18
#2

16 Jun 2014, 06:48

PS. I have used the following thread, which has helped me slightly, but it doesn't do exactly what I need it to do :-(

How to count number observations with the same value in a row for a variable in STATA

http://www.talkstats.com


Nick Cox

Join Date: Mar 2014
Posts: 35818

16 Jun 2014, 06:49

I don't follow your example completely, as I don't understand the rule whereby 0 1 0 in 19 20 21 are taken as a group. I've followed the version of this problem cross-posted at http://www.talkstats.com/showthread....simple-problem

One way to do this requires a user-written program which you need to install first:

Code:

 
. ssc inst tsspell 

. tsset id
        time variable:  id, 1 to 29
                delta:  1 unit

. tsspell , cond(x == 1)

. egen max = max(_seq), by(_spell)

. drop _end _seq 

. gen prev1 = max[_n-1] if _spell ==0 & _spell[_n-1] > 0
(25 missing values generated)

. replace prev1 = prev1[_n-1] if _spell == 0 & prev1 == .
(3 real changes made)

. gsort -id

. gen post1 = max[_n-1] if _spell ==0 & _spell[_n-1] > 0
(25 missing values generated)

. replace post1 = post1[_n-1] if _spell == 0 & post1 == .
(3 real changes made)

. sort id

. l if inrange(prev1, 5, .) & inrange(post1, 5, .)

     +---------------------------------------+
     | id   x   _spell   max   prev1   post1 |
     |---------------------------------------|
  9. |  9   0        0     0       8       6 |
 10. | 10   0        0     0       8       6 |
 11. | 11   0        0     0       8       6 |
 12. | 12   0        0     0       8       6 |
     +---------------------------------------+

Please study the FAQ carefully before posting, especially Sections 8 and 18.


Kishan Bakrania

Join Date: Jun 2014

Posts: 18
#4

16 Jun 2014, 07:25

Thank you very much for the prompt reply and code, Dr Nick Cox. The rules in the FAQ are noted.

Sorry I didn't explain the example too well.

I meant to say:

I require the number of observations in each group that has 5 or more consecutive 'x'=1s both before and after it.

So, as your last line of code shows ids 9, 10, 11 and 12, I would require the number 4, which I can easily get by using 'count' instead of 'list'. But I also need to see the number 3, referring to the three individuals (ids 19, 20 and 21) who have 5 or more consecutive 'x'=1s both before and after them. I am really sorry I did not explain myself clearly enough before.

Thank you.
Nick Cox

Join Date: Mar 2014

Posts: 35818
#5

16 Jun 2014, 07:35

Same comment from me, as you are repeating what you said, not explaining it.

If 1s are allowed within 0s, you need to say what the rule is. What about 0 1 1 1 0 within 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1? And so forth.

It seems that you have rules for two or more kinds of spells. That's your choice, but no one can write code without precise definitions.
Kishan Bakrania

Join Date: Jun 2014

Posts: 18
#6

16 Jun 2014, 07:49

Thank you for your reply.

1s are allowed between 0s.

What I require in simple terms (what I should have started with):

The number of observations lying between two sets of 5 or more consecutive 1s.

i.e. your example:

For: 1 1 1 1 1 [0 1 1 1 0] 1 1 1 1 1

The required value would be 5, since the bracketed group of numbers is surrounded by 5 or more consecutive 1s on each side.

eg. 2

1 0 1 1 1 1 1 1 1 [0 0 0] 1 1 1 1 1

Here the required value would be 3, since the bracketed group of numbers is surrounded by 5 or more consecutive 1s on each side.

eg. 3

1 0 1 1 1 1 1 1 1 [0 0] 1 1 1 1 1 1 1 1 1 [0 1 0] 1 1 1 1 1

Here the required values would be a list/output displaying 2 and 3, since each bracketed group of numbers is surrounded by 5 or more consecutive 1s on each side.

Sorry for not being able to explain myself as clearly as I am hoping.

Thank you.
Nick Cox

Join Date: Mar 2014

Posts: 35818
#7

16 Jun 2014, 07:56

Sorry, but I don't follow this and in any case I must focus on something else. Someone else may be able to help.

Jeph Herrin

Join Date: Apr 2014
Posts: 335

16 Jun 2014, 08:22

First create sets of 0s and 1s, tag those that are 5+ 1s, then group the others ("bluenums") and get the size:

Code:

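* make a fake dataset
clear
set obs 10000
gen id=_n
gen x=rbinomial(1,.5)

* here's the part you need
* number each run of identical consecutive x values
gen set=0
sort id
replace set=cond(x !=x[_n-1],set[_n-1]+1,set[_n-1]) if _n>1
* size = length of the run each observation belongs to
bys set : gen size=_N
* five1s flags runs of 5 or more 1s; everything else is a "bluenum"
gen five1s=size>=5 & x
gen bluenums = !five1s
sort id
* merge adjacent bluenum runs into one group by carrying the set number forward
replace set=set[_n-1] if bluenums & !five1s[_n-1] & _n>1
* record each bluenum group's size once, on one observation per group
bys set : gen output = _N if _n==1 & bluenums
tab output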

hth,
Jeph
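
The run-grouping idea can be checked against the 29-observation example from post #1. What follows is a minimal sketch, not part of the original replies. It rebuilds that example from a hand-typed pattern string and adds one assumption: a group only counts when a run of 5 or more 1s sits on both sides of it, so a group touching the first or last observation is dropped. The variable names run, runlen, anchor, grp and gap are made up for illustration.

```stata
* Rebuild the 29-observation example from post #1 (pattern typed in by hand)
clear
set obs 29
gen id = _n
local pattern "11111111000011111101011111011"
gen byte x = real(substr("`pattern'", _n, 1))

* Number each run of identical consecutive values and record its length
sort id
gen long run = sum(x != x[_n-1])
bysort run: gen long runlen = _N

* anchor = 1 for observations inside a run of 5 or more 1s
gen byte anchor = x == 1 & runlen >= 5

* Number maximal stretches of anchor / non-anchor observations
sort id
gen long grp = sum(anchor != anchor[_n-1])

* Assumption: keep a non-anchor stretch only if it is neither the first
* nor the last stretch, so that it has an anchor run on both sides
summarize grp, meanonly
local last = r(max)
bysort grp (id): gen long gap = _N if _n == 1 & !anchor & grp > 1 & grp < `last'

list id gap if gap < ., noobs
```

On this example the listing should show 4 (starting at id 9) and 3 (starting at id 19). The trailing 0 1 1 at ids 27 to 29 is excluded because no run of 5 or more 1s follows it. Dropping the two grp conditions would count such edge groups as well.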


Kishan Bakrania

Join Date: Jun 2014
Posts: 18

16 Jun 2014, 08:48

Thank you so much. This is exactly what I was trying to do and explain earlier. Works perfectly fine. Thanks Jeph!!
