Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with generating three binary variables using three different numeric variables

    Hello
    I am new to Stata, and need some help.
    I have a problem with finding the correct command to generate a binary variable using three different numeric variables.
    My dataset contains 167 observations. For one observation, there is three different numeric variables. The values of these variables represent the same answer (so 1 in var1 is the same as 1 in var2). For each of the variables there are three values, 1, 2 and 4.

    What I am looking for is how to generate binary variables where all values = 1 is summarized regardless of which variable it came from. Something like this:
    - sum_cvk_place1: value 1 = sum of ALL values = 1 in both var1 + var2 + var3. 0 = ALL values in var 1 + var 2 + var 3 not being = 1
    - sum cvk_place2: value 2 = sum of all values = 2 in both var1 + var2 + var3. 0 = ALL values in var1 + var2 + var3 not being = 2
    - sum cvk_place3: value 3 = sum of all values = 4 in both var1 + var2 + var3. 0 = ALL values in var1 + var2 + var3 not being = 4

    Missing values is not to be included.

    Thank you

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(cvk_place cvk_place_v2 cvk_place_v3)
    1 4 1
    1 . .
    1 2 .
    1 . .
    1 . .
    1 . .
    1 . .
    1 . .
    1 . .
    2 . .
    1 1 .
    1 . .
    1 . .
    1 . .
    1 . .
    1 . .
    4 2 .
    1 . .
    1 . .
    1 1 2
    end
    label values cvk_place cvk_place_
    label def cvk_place_ 1 "V.jugularis", modify
    label def cvk_place_ 2 "V.subclavia", modify
    label values cvk_place_v2 cvk_place_v2_
    label def cvk_place_v2_ 1 "V.jugularis", modify
    label def cvk_place_v2_ 2 "V.subclavia", modify
    label values cvk_place_v3 cvk_place_v3_
    label def cvk_place_v3_ 1 "V.jugularis", modify
    label def cvk_place_v3_ 2 "V.subclavia", modify




  • #2
    Thanks for providing the example data! One way to do this is to first count non-missing and then count 1s (and 2s, and so on), and then check if all valid codes are the value of interest:

    Code:
    egen nonmis = rownonmiss(cvk_place cvk_place_v2 cvk_place_v3)
    egen count1 = anycount(cvk_place cvk_place_v2 cvk_place_v3), values(1) 
    gen all1 = count1 == nonmis
    Wasn't clear to me how to interpret "ALL values in var 1 + var 2 + var 3 not being = 1", I took it to mean "not all non-missing are ones". For example row 3 has 1, 2, . so it is not all 1.

    Comment


    • #3
      The first problem is

      sum of ALL values = 1 in both var1 + var2 + var3. 0 = ALL values in var 1 + var 2 + var 3 not being = 1

      You may mean

      Code:
      gen y1 = 1 if var1 == 1 & var2 == 1 & var3 == 1 
      replace y1 = 0 if var1 != 1 & var2 != 1 & var3 != 1 
      which will leave y1 unassigned if any value is 1 but not all.

      Consider also expressions such as

      max(y1 == 1, y2 == 1, y3 == 1)

      and min() similarly which quantify "any" and "all" counts.

      Code:
      . di max(1 == 1, 2 == 1, 4 == 1)
      1
      
      . di min(1 == 1, 2 == 1, 4 == 1)
      0
      You can thus use expressions such

      Code:
      gen foo  = max(var1 == 1, var2 == 1, var3 == 1)
      Your data example raises a question of what your rules are with missings. I can't see any detail on that.




      Comment


      • #4
        Thank you so much for your answers!!! I have tried your commands, but I get the wrong results (ie the counts are wrong). It may be because I have not been correct in my question previously. I am sorry for the inconvenience.

        Yes I see now how my rules of missing are confusing.
        I will try to explain one more time.

        I want to create three binary variables:

        - New_var1: new value 1 = the sum/count of all the values (1) in the dataset. The sum/count I am looking for is 170. Value 0 = the sum/count of all the values not being value (1) in the dataset (so the count of all values (2) and (4)). The sum/count I am looking for is 30.

        - New_var2: new value 1 = the sum/count of all the values (2) in the dataset. The sum/count I am looking for is 18. Value 0 = the sum/count of alle the values not being value (2) in the dataset (sum/count of all the values (1) and (4)). The sum/count I am looking for is 182.

        - New_var3: new value 1 = the sum/count of all the values (4) in the dataset. The sum/count I am looking for is 12. Value 0 = the sum/count of all the values not being value (4) in the dataset (count/sum of all the values (1) and (2)). The sum/count I am looking for is 188.

        The missing values (.) in the originally dataset is not to be counted in any of the new variables.

        Comment


        • #5
          Helene, honestly, your description has been a puzzle for me. Anyhow, in the below code, I try to calculate all the relevant info in my imagination. Hope it might be useful for you to collect the thing you need.

          Code:
          clear
          input byte(cvk_place cvk_place_v2 cvk_place_v3)
          1 4 1
          1 . .
          1 2 .
          1 . .
          1 . .
          1 . .
          1 . .
          1 . .
          1 . .
          2 . .
          1 1 .
          1 . .
          1 . .
          1 . .
          1 . .
          1 . .
          4 2 .
          1 . .
          1 . .
          1 1 2
          end
          
          gen sum_all=0
          gen count_all = 0
          
          foreach i in 1 2 4 {
          egen count_`i' = anycount(cvk_place cvk_place_v2 cvk_place_v3), values(`i')
          egen count_not_`i' = rownonmiss(cvk_place cvk_place_v2 cvk_place_v3)
          replace count_not_`i' = count_not_`i' - count_`i'
          gen count_all_`i' = sum(count_`i')
          replace count_all_`i' = count_all_`i'[_N]
          gen sum_all_`i' = `i' * count_all_`i'
          replace sum_all = sum_all + sum_all_`i'
          replace count_all = count_all + count_all_`i'
          }
          
          foreach i in 1 2 4 {
          gen sum_all_not_`i' = sum_all - sum_all_`i'
          gen count_all_not_`i' = count_all - count_all_`i'
          }
          .

          Comment


          • #6
            The question is indeed puzzling. I wonder, looking again at the wording of #1, if what you want to know is what proportion of cases answer 1 in any of the three questions. In your example this is 18 out of 20 or 90%, which you can get most easily from the first line in Nick's suggestion at #3, followed by summarize:
            Code:
            . gen any1 = cvk_place ==1 | cvk_place_v2==1 | cvk_place_v3==1
            .
            . summarize any1
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                    any1 |         20          .9    .3077935          0          1
            Here the variable any1 is 1 if any of the three answers is 1 and 0 otherwise. You can do the same for codes 2 (you get 20% ) and 4 (answer 10%), easily verified visually. The percents add to more than 100% because some people provide more than one answer (for example the first observation answers 1 and 4).

            The description in #4 says you are looking for sum/counts of 170/30, 18/182 and 12/188 which all add to 200. Is this in the full dataset? In that case it looks consistent with my interpretation. But I can't get those numbers with your sample data.
            Last edited by German Rodriguez; 28 Oct 2017, 11:44.

            Comment


            • #7
              I want to know the proportion/count of the answer 1 in ALL of the three questions (ie the total of answer 1 in the firs observation and so on).

              For example:
              First observation: answer 1 two times
              Second observation: answer 1 one time
              Third obsercation: answer 1 one time

              So if I counted the answer 1 now after these three examples: 2 + 1 + 1 = 4

              So I have 167 observations.
              The total count/sum of ALL the answers together is 200 (because one person may answer to more than one question).
              - The sum of all the answer 1 should be 170
              - The sum of all the answer 2 should be 18
              - The sum of all the answer 4 should be 12


              I will provide you the full dataset, maybe you can help me.

              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte(cvk_place cvk_place_v2 cvk_place_v3)
              1 4 1
              1 . .
              1 2 .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              2 . .
              1 1 .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              4 2 .
              1 . .
              1 . .
              1 1 2
              1 2 .
              1 . .
              1 . .
              2 4 .
              4 1 1
              1 . .
              2 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 1 .
              1 . .
              1 . .
              1 1 .
              4 1 .
              1 . .
              1 1 .
              1 . .
              1 . .
              4 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 2 .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 1 .
              1 . .
              2 1 .
              2 . .
              1 . .
              1 . .
              1 . .
              1 1 .
              1 1 .
              2 . .
              1 . .
              1 . .
              1 . .
              1 1 .
              1 . .
              4 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              4 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              2 . .
              1 . .
              1 . .
              2 . .
              1 . .
              1 . .
              1 . .
              1 . .
              2 . .
              1 . .
              1 2 .
              1 . .
              1 . .
              4 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 1 .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              4 . .
              1 1 1
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 1 .
              4 . .
              1 . .
              1 1 .
              1 . .
              1 . .
              1 . .
              2 . .
              1 . .
              1 . .
              1 . .
              1 . .
              4 1 1
              1 . .
              1 1 2
              1 1 2
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              1 . .
              end
              label values cvk_place cvk_place_
              label def cvk_place_ 1 "V.jugularis", modify
              label def cvk_place_ 2 "V.subclavia", modify
              label values cvk_place_v2 cvk_place_v2_
              label def cvk_place_v2_ 1 "V.jugularis", modify
              label def cvk_place_v2_ 2 "V.subclavia", modify
              label values cvk_place_v3 cvk_place_v3_
              label def cvk_place_v3_ 1 "V.jugularis", modify
              label def cvk_place_v3_ 2 "V.subclavia", modify
              [/CODE]

              Comment


              • #8
                I will also provide you the code given by Nick in #3. I get the wrong numbers because I want the value 1 in the new variable to be a total of 170 and not 151. Value 0 should be 30 and not 16. The total should be 200.

                I much appreciate your help!!!

                Code:
                . gen any1 = cvk_place ==1 | cvk_place_v2==1 | cvk_place_v3==1
                
                . tab any1

                Comment


                • #9
                  Code:
                  any1 | Freq. Percent Cum.
                  ------------+-----------------------------------
                  0 | 16 9.58 9.58
                  1 | 151 90.42 100.00
                  ------------+-----------------------------------
                  Total | 167 100.00

                  Comment


                  • #10
                    Helene, hope the below code mightbe the thing you need .
                    Code:
                    foreach i in 1 2 4 {
                        egen count_`i' = anycount(cvk_place cvk_place_v2 cvk_place_v3), values(`i')
                        replace count_`i' = sum(count_`i')
                        replace count_`i' = count_`i'[_N]
                        }

                    Comment


                    • #11
                      Romalpa's code can be taken this way:

                      Code:
                      foreach i in 1 2 4 {
                          egen count_`i' = anycount(cvk_place cvk_place_v2 cvk_place_v3), values(`i')
                          su count_`i', meanonly            
                          di "sum for `i': "  r(sum)
                      }
                      That way, you don't over-write your data with a constant that summarize will give you any way.

                      Comment


                      • #12

                        I think it is important to distinguish variables from summary statistics, as Nick does in #11. Back in #2, count1 adds up to 170, and nonmiss adds to 200, as required. The proportion of 1s among all valid answers, or 170/200 = 0.85, can be obtained as
                        Code:
                        , gen prop1 = count1/nonmis
                         
                        . sum prop1 [fw=nonmis]
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                               prop1 |        200         .85    .3049013          0          1
                        In other words, you calculate the proportion for each observation, and then summarize using the number of responses as a frequency weight. Same can be done for 2 and 4 by looping as shown before.

                        Comment

                        Working...
                        X