Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create new variables with values taken from another variable

    Hi,

    Here is a simplified version of my data:
    Row Year _id _nn
    1 39011 51805
    2 2006 51805 77823
    3 11921 33111
    4 2008 77823 69173
    5 14283 77823
    where _id and _nn respectively have no duplicates.

    I am looking for a way in which Stata could do the followings:
    1.Create a new variable (for example "NewYear")
    2. Then replace the variable "NewYear" with the value from "Year" if Stata could find overlapping/common values from variable "_nn" and "_id".
    For example, in row 1, the variable "NewYear" would be 2006 since the value of variable _nn contains the same value of 51805 with row 2 _id.
    3. For observations with no common values between _nn and _id, variable "NewYear" would be missing.
    4. For observations with value in "Year" variable, variable "NewYear" will take the value from "Year" variable

    The expected output is as follows:
    Row Year _id _nn NewYear Comments
    1 39011 51805 2006 Taken from Row 2
    2 2006 51805 77823 2006 No change since it got data in "Year" variable
    3 11921 33111 . Missing value since no common "_nn" with "_id" throughout the sample
    4 2008 77823 69173 2008 No change since it got data in "Year" variable
    5 14283 77823 2008 Taken from Row 4
    Where "Comments" is just for explanation purpose only.

    I have also refereed to http://www.stata.com/statalist/archive/2013-06/msg00057.html and http://www.stata.com/statalist/archi.../msg00508.html . But they seem not to be what I am looking for.

    Would appreciate if someone could assist me on this and I would improve my question if it is not clear enough.
    Thanks in advance.

    Regards


  • #2
    Hi Johnny,

    Welcome to Statalist! When sharing data on the forum it is strongly preferred that you use dataex (ssc install dataex) to produce a short snippet of your data that other members can easily copy in to a do-file and work with in Stata to help solve your problem. Do note this for future posts.

    For the first part of your problem I put together this code

    Code:
    //dataex output
    
    clear
    input int year float newyear long(_id _nn)
       .    . 39011 51805
    2006 2006 51805 77823
       .    . 11921 33111
    2008 2008 77823 69173
       .    . 14283 77823
    end
    
    //your code starts from here
    
    local cmd generate
    qui summ, meanonly
    forvalues a = 1/`r(N)'{
         forvalues e = 1/`r(N)'{
         `cmd' NewYear = year if _id[`a'] == _nn[`e'], a(year)
         local cmd replace
         }
    }
    I wasn't entirely sure what you meant by point 4 above, but if this snippet does not do everything you need do please try and clarify how it deviates from what you expect and I will put something else together.

    There is also no doubt a much better way of doing this. It works though, and unless you have a really large dataset and / or a really old computer it should be done in a jiffy!
    Last edited by Chris Larkin; 06 Feb 2017, 17:36.

    Comment


    • #3
      Hi Larkin,

      Thank you for the prompt reply.
      Will take note of the dataex in future.
      Might be a silly question, but may I know why the code is not working? No new variable of "NewYear" is generated although no error is shown.

      I have also tried this by myself , but it is not working (no observation in the end)
      Code:
      quietly su _id
      global minid `r(min)'
      global maxid `r(max)'
      
      quietly su _nn
      global minn1 `r(min)'
      global maxn1 `r(max)'
      
      gen NewYear=.
      forvalues i = $minid / $maxid {
      forvalues j = $minn1 / $maxn1 {
          replace NewYear= Year if `i'==`j'
          di `i'
      }
      }

      Thank you.
      Last edited by Johny Gul; 06 Feb 2017, 17:46.

      Comment


      • #4
        My code may not work because of the , a(year) option at the end of the line that generates the NewYear variable. Try removing that and it should work fine. It's a bit odd that when generate is stored within a local the options for the command disappear, but this is something for another day

        Your code should work in the end, but it'll take ages to get there. You only need to loop for as many rows as you have in your dataset squared.

        Comment


        • #5
          I was going to show you another way to do this, but it relied on

          where _id and _nn respectively have no duplicates
          being true. Unfortunately, the value 77823 appears twice in _nn, in the second and the last observations.

          Comment


          • #6
            Hi Chris,
            Thank you for the prompt reply. It seems like the variable "NewYear" is not generated despite me removing the option, a(year). Maybe I have done something wrong?
            May I know what do you mean by "You only need to loop for as many rows as you have in your dataset squared."? How can I incorporate your suggestion to my coding?


            Hi Clyde,
            My apologies for the mistake. The data should indeed be unique (no duplicates). Below is the expected outcome:
            Row Year _id _nn NewYear Comments
            1 39011 51805 2006 Taken from Row 2
            2 2006 51805 77823 2006 No change since it got data in "Year" variable
            3 11921 33111 . Missing value since no common "_nn" with "_id" throughout the sample
            4 2008 77823 69173 2008 No change since it got data in "Year" variable
            5 14283 16927 2010 Taken from Row 6
            6 2010 16927 42421 2010 No change since it got data in "Year" variable
            I am not sure this is the best way to reply each of you. Hope this reply is acceptable.

            Thank you.

            Regards

            Comment


            • #7
              Hi Johny,

              I'm not sure what's going wrong. I've copied and pasted my code verbatim (removing the , a(year) option) and it runs fine.

              Try doing it without the local cmd, so:

              Code:
              gen NewYear=.,a(year)
              qui summ, meanonly
              forvalues a = 1/`r(N)'{
                   forvalues e = 1/`r(N)'{
                   replace NewYear = year if _id[`a'] == _nn[`e']
                   }
              }

              Comment


              • #8
                I believe this does what you want.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte row int year long(_id _nn)
                1    . 39011 51805
                2 2006 51805 77823
                3    . 11921 33111
                4 2008 77823 69173
                5    . 14283 16927
                6 2010 16927 42421
                end
                
                preserve
                keep year _id
                rename _id _nn
                rename year newyear
                keep if !missing(newyear)
                tempfile holding
                save `holding'
                list
                
                restore
                gen newyear = year
                count
                merge 1:1 _nn using `holding',  update nogenerate
                sort row
                Please do note the use of -dataex- to create the example data. For future posting, please install it (-ssc install dataex-) and read the simple instructions in -help dataex-; use it whenever you post sample data going forward. That will make it as simple as possible for those who want to help you to create a completely faithful replica of your example with a simple copy paste operation.

                Comment

                Working...
                X