Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to generate a new variable by calculation using two variables in the same id

    Hi everyone,

    I have a dataset where v1 is the household id that gives out one id per household.
    I have other variables called v2 and v3 which are just consisted of numbers.
    I want to generate a new variable 'v5' which is v3-v2.
    However, the problem is that there are only one value of v2 and v3 in the same household id, so the value of v3 and v2 are not in the same column.

    So in concise, I am wondering how I could generate v5
    1) when v2 and v3 are not in the same column
    2) using v2 and v3 that are bounded by the same household id

    For example, the data looks like this:
    v1 v2 v3 v4 v5
    1 10 . . 10
    1 . 20 .
    2 13 . . 7
    2 . 20 .
    3 11 . . 6
    3 . 17 .
    4 22 . . -11
    4 . . 2
    4 . 11 .

    Here are some things that I tried but did not succeed:

    1) replace the missing value of v2 with the values from v3
    2) tried this code (bysort v1: replace v4=v3-v2)

    I tried to look up some threads on this but couldn't find much succesfully. (If anyone could generously help me with tips on how to search on this matter, that would be great as well)

    Thank you in advance!

    P.S. I use STATA 14.0 for Mac

  • #2
    Code:
    bys v1(v2 v3): gen wanted=(v3[2]-v2[1]) if !missing(v2)

    Comment


    • #3
      Andrew Musau Thank you for the code! But the thing is that the value of v3 isn't always in the second position like when v1=4.

      Comment


      • #4
        That does not matter, at least with your example data in #1.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(v1 v2 v3 v4 v5)
        1 10  . .  10
        1  . 20 .   .
        2 13  . .   7
        2  . 20 .   .
        3 11  . .   6
        3  . 17 .   .
        4 22  . . -11
        4  .  . 2   .
        4  . 11 .   .
        end
        
        bys v1(v2 v3): gen wanted=(v3[2]-v2[1]) if !missing(v2)
        Res.:

        Code:
        . l, sepby(v1)
        
             +----------------------------------+
             | v1   v2   v3   v4    v5   wanted |
             |----------------------------------|
          1. |  1   10    .    .    10       10 |
          2. |  1    .   20    .     .        . |
             |----------------------------------|
          3. |  2   13    .    .     7        7 |
          4. |  2    .   20    .     .        . |
             |----------------------------------|
          5. |  3   11    .    .     6        6 |
          6. |  3    .   17    .     .        . |
             |----------------------------------|
          7. |  4   22    .    .   -11      -11 |
          8. |  4    .   11    .     .        . |
          9. |  4    .    .    2     .        . |
             +----------------------------------+
        Provide an example which contradicts the above.

        Comment


        • #5
          Andrew Musau I thought that the error message that I was getting was because of the mismatch between the data and the code, but I guess it is not.
          I was getting this error, "factor variables and time-series operators not allowed" when using that code.

          But I tweaked the code a bit and got it to work!
          The code that I used is:

          bysort v1: gen diff=v3[2]-v2[1]

          and then I just erased the duplicates.

          Thank you so much for your input! It helped me out greatly.
          Last edited by Graceh Kim; 16 Oct 2020, 10:25.

          Comment


          • #6
            You do not show the exact command that you ran and your error message. As far as I can see, there is no issue with the code in #2, which means that your error is as a result of something that you are not showing us. Present a data example using dataex as recommended in FAQ Advice # 12.

            Code:
            ssc install dataex
            dataex in 1/20

            Comment


            • #7
              The data example given in my question is the same as my real data, just in different numbers and in a longer series. To give you the same value from my real data it goes like this:
              v1 v2 v3 v4 v5
              1000200 78 . .
              1000200 . 78 .
              1000202 80 . .
              1000202 . 92 .
              1000210 23 . .
              1000210 . 42 .
              1000214 60 . .
              1000216 . . 20
              1000216 . 60 .
              When I used your code from (2), it gave me an error message, "factor variables and time-series operators not allowed", as I have already stated. From searching up about this error code, I thought it could be due to stata's version problem. I am not so familiar with this error.

              Also, I have given you the code: bysort v1: gen diff=v3[2]-v2[1], as I have stated before. I was able to reach this code with your idea so thank you for that.
              But it did made duplicates in the same values of v1 like in the below table.
              v1 v2 v3 v4 v5
              1000200 78 . . 0
              1000200 . 78 . 0
              1000202 80 . . -8
              1000202 . 92 . -8
              1000210 23 . . 19
              1000210 . 42 . 19
              1000214 60 . . 0
              1000216 . . 20 0
              1000216 . 60 .
              Last edited by Graceh Kim; 16 Oct 2020, 11:01.

              Comment


              • #8
                What version of Stata do you have? Do you get the same error if you run the code in #4 exactly?

                Comment

                Working...
                X