  • Assign value from different variable depending on value of other variable, without loop

    I'd like to create a variable with a value from a certain other variable, where the other variable the value comes from depends on the value of yet another variable. An example might help: the observations are journal articles, and I have the year they were published, as well as the number of citations each article got in each calendar year. I want a new variable that is the number of citations an article received 5 years after publication.

    I did this with a loop, but is there a nicer, loop-free way to write this code? Here's a working example:

    *Make the original data set
    clear all
    set seed 1492
    set obs 100
    *Values for each year (ex: citation count for an article in a given year)
    forvalues X=2000/2015 {
        gen _`X'value=round(runiform(0,10),1)
    }
    *Assign the year (ex: publication year of an article)
    gen year=2000+floor(runiform(0,8))

    *Create new var with value from other variable, which var depends on value of 'year'
    *Create var with value from 5 years after publication
    gen year5value=.
    forvalues X=2000/2007 {
        replace year5value=_`=`X'+5'value if year==`X'
    }

  • #2
    What about this?

    gen id=_n
    reshape long _@value, i(id) j(yr)
    keep if yr==year+5
    In case you want to sum all citations within 6 years (the publication year through year+5), you can do this instead:

    gen id=_n
    reshape long _@value, i(id) j(yr)
    keep if yr>=year & yr<=year+5
    collapse (sum) _value, by(id year)
    Last edited by Jean-Claude Arbaut; 17 May 2018, 18:36. Reason: year to year+5, it's 6


    • #3
      Jean-Claude's two solutions both use the reshape command, and they illustrate an important Stata principle worth mentioning.

      The first thing we notice is that your data is in what Stata would call a "wide" layout with the values of the citation count for different years in different variables. Jean-Claude transforms it to a "long" layout, where each observation has just one value of the citation count for one given year.

      The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. It certainly does make things much easier here, eliminating the need for looping over the variable names.
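
      To make the wide/long distinction concrete, here is a minimal sketch on a toy dataset. The two articles and their citation counts are made up for illustration; only the variable naming (_2000value, year, id, yr) follows the thread.

      ```stata
      * Toy data in wide layout: one observation per article,
      * one variable per calendar year (values are hypothetical)
      clear
      input id year _2000value _2001value _2002value
      1 2000 3 5 7
      2 2001 0 2 4
      end
      list

      * Long layout: one observation per article-year,
      * with the counts stacked in a single variable _value
      reshape long _@value, i(id) j(yr)
      list, sepby(id)
      ```

      After the reshape, a condition such as yr==year+1 can be written directly on observations, with no need to build variable names inside a loop.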

      In particular, if you are a former SAS user comfortable with SAS "arrays" of variables, and realize this would not need a loop in SAS, then I'm sorry to tell you that there is no similar construct in Stata. Even after several years of SAS withdrawal, I still occasionally regret the lack of that capability in Stata. But not enough to make me miss SAS.


      • #4
        If you don't want to lose the dataset shape, it's also possible to replace the solutions in my previous post by the following:

        gen id=_n
        reshape long _@value, i(id) j(yr)
        by id: egen ncit1=max(cond(yr==year+5,_value,0))
        by id: egen ncit2=total(cond(yr>=year & yr<=year+5,_value,0))
        reshape wide
        The trick here is to make sure the values of ncit1 and ncit2 are unique within each id group, or reshape wide will fail.
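
        The uniqueness condition can also be checked explicitly before reshaping back. The following is a sketch only: it rebuilds simulated data like that in the first post (using runiformint() for the random draws, which is an assumption of this sketch rather than the original code) and computes just ncit1.

        ```stata
        * Simulated data, similar in shape to the example in the first post
        clear all
        set seed 1492
        set obs 100
        forvalues X = 2000/2015 {
            gen _`X'value = runiformint(0, 10)
        }
        gen year = runiformint(2000, 2007)

        gen id = _n
        reshape long _@value, i(id) j(yr)
        bysort id (yr): egen ncit1 = max(cond(yr == year + 5, _value, 0))

        * -assert- stops with an error if ncit1 varies within any id
        by id: assert ncit1 == ncit1[1]

        reshape wide

        * Cross-check against the wide variables for one publication year
        assert ncit1 == _2005value if year == 2000
        ```

        If the first assert passes, reshape wide treats ncit1 as an ordinary id-level variable and carries it back unchanged.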

        William Lisowski
        It's possible to use Mata for that. However, I'm not sure it's possible to do this task without a loop. My idea was something like:

        But of course it's wrong: years[.,pubyr:-1999] does not select the correct column for each row; it selects all of the listed columns for every row. It is, however, easy to write a function to do that, using a loop.
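
        A minimal sketch of such a function follows. The name pickcol() is hypothetical, and the small years and pubyr matrices are made-up stand-ins (three columns for 2000 to 2002), not the data from the first post.

        ```stata
        mata:
        // Hypothetical helper: for each row i of X, return X[i, idx[i]].
        // Rows whose index is missing or out of range get a missing value.
        real colvector pickcol(real matrix X, real colvector idx)
        {
            real colvector res
            real scalar    i

            res = J(rows(X), 1, .)
            for (i = 1; i <= rows(X); i++) {
                if (idx[i] >= 1 & idx[i] <= cols(X)) res[i] = X[i, idx[i]]
            }
            return(res)
        }

        // Toy example: columns hold the counts for 2000, 2001, 2002
        years = (1, 2, 3 \ 4, 5, 6)
        pubyr = (2000 \ 2001)

        // Count one year after publication: column index is pubyr - 1999 + 1
        pickcol(years, pubyr :- 1999 :+ 1)
        end
        ```

        For the two toy rows this returns 2 and 6. With the real data the offset would be 5 rather than 1, and the matrices would presumably be read in with st_data().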

        A last remark to the OP, as I see round(runiform(0,10)) and floor(runiform(0,8)): see runiformint in help random number functions.
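
        As a sketch of what that could look like (the variable name cites is made up here; runiformint(a,b) draws integers uniformly on a, a+1, ..., b):

        ```stata
        clear
        set seed 1492
        set obs 100

        * Integer counts, each of 0, 1, ..., 10 equally likely
        gen cites = runiformint(0, 10)

        * Publication year, each of 2000, ..., 2007 equally likely
        gen year = runiformint(2000, 2007)
        ```

        Note that round(runiform(0,10)) is not quite the same thing: the endpoints 0 and 10 each come up only half as often as the interior integers, whereas floor(runiform(0,8)) and runiformint(0,7) have the same distribution.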

        Hope this helps

        Jean-Claude Arbaut
        Last edited by Jean-Claude Arbaut; 18 May 2018, 03:08.