  • Issue with Time/Duration of Egen operation

    Good evening all,

    Apologies for any mistakes in my attempt to explain the problem, and thanks in advance for any replies. I'll try to describe the problem both in words and with an attached data example. Further apologies for things being named monthV4lue and ye4r; for some reason I'm having trouble with A's today.

    I am using Stata MP 14.2 on a data set with tens of millions of observations. The data contain a row for every person-year combination (each person has 4 years of data). Each month of each year has a value, and since each person has four years of data, each row contains 48 monthValue columns.

    In the original data set, for example, the row for person A year 2011 has information filled in for monthValue1 through monthValue12, but has no data for monthValue13 through monthValue48. The row for person A year 2012 has information filled in for monthValue13 through monthValue24, but has no data for monthValue1 through monthValue12 and monthValue25 through monthValue48. Etc.

    I wanted to take the values for each row's non-missing monthValues and apply them to the other rows for the same person. For example, I wanted to make sure person A year 2011 had values for monthValue13 through monthValue24, taken from person A year 2012.

    I cannot provide exact details but have created a fake data set pair that I think captures the problem accurately. The original data set is named zach_adams_original.csv.

    So I ran a by-sorted egen max on the monthValues by person (see the code block below).

    Code:
    forval i = 1/48 {
        bys person (id): egen filledMonthValue`i' = max(monthValue`i')
    }
    For ease of display here, the new variables created by egen were then renamed to replace the original variables:
    Code:
    forval i = 1/48 {
        drop monthValue`i'
        rename filledMonthValue`i' monthValue`i'
    }


    The new data set is named zach_adams_post.csv. It contains filled data for monthValue1 through monthValue24, and missing data for monthValue25 through monthValue48 (since no one had data for those months in the sample/fake data set), across years 2011 to 2014 for each person.

    The problem is that while this method works, it takes an extremely long time (multiple hours or more, depending on how stressed the machine is and the size of the data set the code is run on). Is there a faster or more efficient way to perform this type of operation?

    Thank you!!
    Attached Files
    Last edited by Zach Adams; 28 Dec 2016, 14:06.

  • #2
    While I could probably come up with something that would shave a few seconds off the time to do this, I'm not going to.

    Instead, I'm going to propose that you not do this at all and do something entirely different, because the data set illustrated in zach_adams_post.csv does not really make sense from the perspective of data analysis in Stata.

    Each of the observations corresponding to a given person id ends up containing exactly the same information except for the year variable. It is quadruply redundant. Moreover, each of those observations contains variables recording data from years other than the year indicated by the year variable. Finally, the information is in a doubly-wide layout which will prove extremely cumbersome to work with for almost any analysis you want to do. I cannot think of a single analysis command in Stata that would require the data to be laid out this way, and I can think of only a few that would even be able to work with it.

    So the following code will create a very different layout (long): one observation for each person for each month from Jan 2011 through Dec 2014, and a single variable, mv, that contains the value for that person in that month. Nearly all Stata analyses will want the data to be laid out this way in order to work.

    Code:
    clear
    import delimited zach_adams_original.csv
    
    forvalues i = 1/12 {
        gen mv`i' = .
        forvalues y = 2011/2014 {
            local source = 12*(`y'-2011) + `i'
            replace mv`i' = monthv4lue`source' if ye4r == `y'
        }
    }
    order mv*, after(ye4r)
    
    drop monthv4lue*
    reshape long mv, i(id person ye4r) j(month)
    gen mdate = mofd(mdy(month, 1, ye4r))
    format mdate %tm
    drop month ye4r
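    A quick check of the `source' arithmetic in the inner loop above, for one concrete case:

```stata
* e.g., for ye4r == 2012 and i == 2:
* source = 12*(2012 - 2011) + 2 = 14, so mv2 in a 2012
* observation is copied from monthv4lue14
display 12*(2012 - 2011) + 2
```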
    As for the amount of time it will take: it will be slow. It might even be slower than your current solution, I don't know. This is a huge data set. In particular, the -reshape- will take a long time with this many observations. However, with the data now arrayed in such a way that Stata can actually analyze it, I think you will easily recoup the time in terms of ease of coding your analyses, and time saved avoiding further transformations to make analysis possible.

    Note that if your analyses will require you to mix observations from different time periods, you will be able to do that by -xtset person mdate-* and then using the L. and F. time series operators as appropriate.

    *Well, not quite. The problem is that person is a string variable, and you need a numeric variable for -xtset-. But then this will work:
    Code:
    egen n_person = group(person), label
    xtset n_person mdate
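    For example, once the panel is -xtset-, mixing observations across time periods might look like this (mv_lag1 and mv_lead12 are hypothetical names chosen for illustration):

```stata
* previous month's value for the same person
gen mv_lag1 = L.mv
* value 12 months ahead (same calendar month, next year)
gen mv_lead12 = F12.mv
```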

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      While I could probably come up with something that would shave a few seconds off the time to do this, I'm not going to. ...
      Thanks for your reply, Clyde. I'm going to think on this and experiment with some larger data sets to test timing.

      Would your solution remain the same if there were redundancies in the data such that

      person  year  monthValue1  monthValue2
      a       2011  4            .
      a       2011  .            .
      a       2011  .            1
      b       2012  1            5

      becomes

      person  year  monthValue1  monthValue2
      a       2011  4            1
      a       2011  4            1
      a       2011  4            1
      b       2012  1            5

      or (obviously) preferably

      person  year  monthValue1  monthValue2
      a       2011  4            1
      b       2012  1            5

      (across a substantially larger data set)?

      Comment


      • #4
        Would your solution remain the same if there were redundancies in the data such that
        The way your original data set example was set up, monthValue1 meant the value in Jan 2011, and its value was missing in observations with other years. The example you show would not be handled as suggested by my code, because for year 2012, my code looks for the first month's value in monthValue13 and the second in monthValue14, not monthValue1 and monthValue2. So my code would not produce the result you show from these data unless the values of monthValue13 and monthValue14 are also 1 and 5, respectively, in the person b, year 2012 observation.

        By the way, I wouldn't refer to what you show as redundancy in the data. It's inconsistency in the data. If the name monthValue1 means anything at all, it means the value for January 2011 for that person. Your new example provides contradictory values for that. My code in #2 would ignore those values and use whatever it found in monthValue13 and monthValue14 there.

        Added: By the way, with the data in #3, your own original solution would not produce the results you say you want there either. Under your solution, the final value of monthValue1 would be 4 and monthValue2 would be 5 in all years.

        Also, note that my code does not have variables monthvalue1, etc. in the results. It has 12 observations per year (one for each month), not one, and only a single variable, called mv, containing the corresponding value for that month.
        Last edited by Clyde Schechter; 28 Dec 2016, 16:16.

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          The way your original data set example was set up, monthValue1 meant the value in Jan 2011, and its value was missing in observations with other years. ...
          Clyde, I've worked on this a bit and forgot to respond earlier, but I thought it'd be good to come back with my solution. It uses the wonderful user-written package sreshape, detailed here: "A Sparser, Speedier Reshape", by Kenneth L. Simons, http://www.stata-journal.com/article...article=dm0090

          I've also modified the test data set to make my intentions clearer.

          Code:
          use "reshape magic sample data.dta"
          
          sreshape long monthValue, i(person year) j(month) missing(drop all)
          
          drop year
          
          reshape wide monthValue, i(person) j(month)

          Pictures of how the data look are attached below. This is much, much faster than egen or collapse, though it took some tinkering. On the full data set, I had to use reshape wide (rather than sreshape wide) for some reason (I guess sreshape may save some macros about what it already reshaped in the same file), and the monthValues had to be numeric rather than string.
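          Since the monthValues had to be numeric before the final reshape wide, a conversion step along these lines may be needed first (a sketch; it assumes the string variable contains only numeric text):

```stata
* convert the single long-format variable to numeric before reshape wide
destring monthValue, replace
```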

          Cheers,
          Zach


          [Attached images: reshape magic 1.jpg, reshape magic 2.jpg, reshape magic 3.jpg, reshape magic 4.jpg]

          Comment


          • #6
            I'm having some trouble uploading the .dta data attachment, so I've pasted the comma-separated values below:

            person,year,monthValue1,monthValue2,monthValue3,monthValue13,monthValue14,monthValue15,monthValue25,monthValue26,monthValue27
            a,2011,1,2,3,,,,,,
            a,2012,,,,12,12,12,,,
            a,2013,,,,,,,6,19,4
            b,2011,28,29,30,,,,,,
            b,2012,,,,62,72,84,,,
            b,2013,,,,,,,25,26,5
            c,2011,35,37,50,,,,,,
            c,2012,,,,12,26,92,,,
            c,2013,,,,,,,99,98,96

            Comment
