  • how to extract specific values from a variable

    Dear all,

    Forgive me if this question has been asked and answered before, but I am really struggling to find the command/program needed to solve the Stata problem below. I really hope someone can help.

    My issue:

    I have a categorical variable 'x' (coded: 0/1). For example:
    ID x
    1 1
    2 1
    3 1
    4 1
    5 1
    6 1
    7 1
    8 1
    9 0
    10 0
    11 0
    12 0
    13 1
    14 1
    15 1
    16 1
    17 1
    18 1
    19 0
    20 1
    21 0
    22 1
    23 1
    24 1
    25 1
    26 1
    27 0
    28 1
    29 1
    . .
    . .
    . .
    The variable continues for millions of observations. I need to extract specific information from 'x', and this is where I am struggling.

    I need the length of every group of consecutive observations that has 5 or more consecutive 'x' = 1 values immediately before it and immediately after it.

    This is best explained with an example. In the data above, I would need a list/output containing the following 2 values: 4 and 3. The 4 observations with ID = 9, 10, 11 and 12 are surrounded by 8 and 6 values of 1, and similarly the 3 observations with ID = 19, 20 and 21 are surrounded by 6 and 5 values of 1.

    Does anyone on here know how I can possibly do this? Thank you so much in advance.

  • #2
    PS. I have used the following thread, which has helped me slightly, but it doesn't do exactly what I need :-(

    Hi, I am struggling with the following problem, hope somebody can help. I need to create a new variable that counts the number of observation with the same value in a row for a variable in a dataset in STATA. For instance for the following variable a: a 0 0 1 1 1



    • #3
      I don't follow your example completely, as I don't understand the rule whereby 0 1 0 in 19 20 21 are taken as a group. I've followed the version of this problem cross-posted at http://www.talkstats.com/showthread....simple-problem

      One way to do this requires a user-written program which you need to install first:

      Code:
       
      . ssc inst tsspell 
      
      . tsset id
              time variable:  id, 1 to 29
                      delta:  1 unit
      
      . tsspell , cond(x == 1)
      
      . egen max = max(_seq), by(_spell)
      
      . drop _end _seq 
      
      . gen prev1 = max[_n-1] if _spell ==0 & _spell[_n-1] > 0
      (25 missing values generated)
      
      . replace prev1 = prev1[_n-1] if _spell == 0 & prev1 == .
      (3 real changes made)
      
      . gsort -id
      
      . gen post1 = max[_n-1] if _spell ==0 & _spell[_n-1] > 0
      (25 missing values generated)
      
      . replace post1 = post1[_n-1] if _spell == 0 & post1 == .
      (3 real changes made)
      
      . sort id
      
      . l if inrange(prev1, 5, .) & inrange(post1, 5, .)
      
           +---------------------------------------+
           | id   x   _spell   max   prev1   post1 |
           |---------------------------------------|
        9. |  9   0        0     0       8       6 |
       10. | 10   0        0     0       8       6 |
       11. | 11   0        0     0       8       6 |
       12. | 12   0        0     0       8       6 |
           +---------------------------------------+
      Please study the FAQ carefully before posting, especially Sections 8 and 18.



      • #4
        Originally posted by Nick Cox



        Thank you very much for the prompt reply and code Dr Nick Cox. Rules regarding FAQ are noted.

        Sorry I didn't explain the example too well.

        I meant to say:

        I need the number of observations in each group that has 5 or more consecutive 'x' = 1 values before and after it.

        Your last line of code lists IDs 9, 10, 11 and 12, so there I would need the number 4, which I can easily get by using 'count' instead of 'list'. But I also need to see the number 3, referring to the three individuals (IDs 19, 20 and 21) who have 5 or more consecutive 'x' = 1 values before and after them. I am really sorry I did not explain myself clearly enough before.

        Thank you.
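
A minimal sketch of that count using only built-in commands, for readers who have not installed tsspell. It rebuilds the 29 listed observations and treats only unbroken runs of 0s as candidate groups, which is the rule the code in #3 follows. The variable names run and len1 are illustrative assumptions, not part of the original posts.

```stata
* Rebuild the 29 observations listed in the question (0s at the IDs below)
clear
set obs 29
gen long id = _n
gen byte x = !inlist(id, 9, 10, 11, 12, 19, 21, 27)

* run: counter that increases by 1 each time x changes value
gen long run = sum(x != x[_n-1])

* len1: length of the run of 1s an observation belongs to (0 when x == 0)
bysort run (id): gen long len1 = cond(x == 1, _N, 0)
sort id

* prev1: length of the run of 1s just before each run of 0s, carried forward
gen long prev1 = len1[_n-1] if x == 0 & x[_n-1] == 1
replace prev1 = prev1[_n-1] if x == 0 & missing(prev1)

* post1: length of the run of 1s just after each run of 0s, carried backward
gen long post1 = len1[_n+1] if x == 0 & x[_n+1] == 1
gsort -id
replace post1 = post1[_n-1] if x == 0 & missing(post1)
sort id

* number of 0s bounded on both sides by runs of 5 or more 1s
count if inrange(prev1, 5, .) & inrange(post1, 5, .)
```

On these data the count is 4 (IDs 9 to 12). IDs 19 to 21 are not picked up, because the 1 at ID 20 splits them into separate runs under this rule.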



        • #5
          Same comment from me, as you are repeating what you said, not explaining it.

          If 1s are allowed within 0s, you need to say what the rule is. What about 0 1 1 1 0 within 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1? And so forth.

          It seems that you have rules for two or more kinds of spells. That's your choice, but no one can write code without precise definitions.



          • #6
            Thank you for your reply.

            1s are allowed between 0s.

            What I require in simple terms (what I should've started off with);

            The number of observations lying between two sets of 5 or more consecutive 1s.

            i.e. your example:

            For 1 1 1 1 1 [0 1 1 1 0] 1 1 1 1 1

            the required value would be 5, since the blue group of numbers (in brackets) is surrounded by 5 or more consecutive 1s on each side.

            eg. 2

            1 0 1 1 1 1 1 1 1 [0 0 0] 1 1 1 1 1

            Here the required value would be 3, since the blue group of numbers (in brackets) is surrounded by 5 or more consecutive 1s on each side.

            eg. 3

            1 0 1 1 1 1 1 1 1 [0 0] 1 1 1 1 1 1 1 1 1 [0 1 0] 1 1 1 1 1

            Here the required values would be a list/output displaying 2 and 3, since each blue group of numbers (in brackets) is surrounded by 5 or more consecutive 1s on each side.

            Sorry for not being able to explain myself as clearly as I am hoping.

            Thank you.






            • #7
              Sorry, but I don't follow this and in any case I must focus on something else. Someone else may be able to help.



              • #8
                First create sets of 0s and 1s, tag those that are 5+ 1s, then group the others ("bluenums") and get the size:

                Code:
                * make a fake dataset
                clear
                set obs 10000
                gen id=_n
                gen x=rbinomial(1,.5)
                
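                * here's the part you need
                * number the runs: set increases by 1 each time x changes
                gen set=0
                sort id
                replace set=cond(x !=x[_n-1],set[_n-1]+1,set[_n-1]) if _n>1
                * size = length of each run; five1s flags runs of 5+ consecutive 1s
                bys set : gen size=_N
                gen five1s=size>=5 & x
                * every observation outside those long runs of 1s is a "bluenums" one
                gen bluenums = !five1s
                sort id
                * merge adjacent short runs into one group by carrying set forward
                replace set=set[_n-1] if bluenums & !five1s[_n-1] & _n>1
                * record each group's size on one observation per group
                bys set : gen output = _N if _n==1 & bluenums
                tab output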
                hth,
                Jeph
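
A sketch of one possible refinement, under an assumption about the intended rule: the tabulation above also reports groups that touch the first or last observation, which have a long run of 1s on one side only. In eg. 3 of #6 the leading "1 0" is such a group, and the values asked for there were only 2 and 3. The code below rebuilds eg. 3 and keeps interior groups only; the names run, runlen, long1s, gap, gaplen and interior are hypothetical.

```stata
* eg. 3 from #6: 28 observations, 0s at positions 2, 10, 11, 21 and 23
clear
set obs 28
gen long id = _n
gen byte x = !inlist(id, 2, 10, 11, 21, 23)

* run: counter that increases by 1 each time x changes value
gen long run = sum(x != x[_n-1])
bysort run (id): gen long runlen = _N

* long1s: observation belongs to a run of 5 or more consecutive 1s
gen byte long1s = x == 1 & runlen >= 5

* gap: counter that increases each time long1s switches, so every
* stretch between two long runs of 1s gets a single identifier
sort id
gen long gap = sum(long1s != long1s[_n-1])
bysort gap (id): gen long gaplen = _N

* interior: the stretch touches neither the first nor the last observation
* (assumption: edge stretches should not be reported)
local total = _N
bysort gap (id): gen byte interior = !long1s & id[1] > 1 & id[_N] < `total'

* report each interior stretch once, on its first observation
bysort gap (id): gen long output = gaplen if _n == 1 & interior
sort id
list id output if output < ., noobs
```

On eg. 3 this lists 2 at id 10 and 3 at id 21. Dropping the interior condition gives the same groups as the code above, including the leading group of size 2.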



                • #9
                  Originally posted by Jeph Herrin

                  Thank you so much. This is exactly what I was trying to do and explain earlier. Works perfectly fine. Thanks Jeph!!
