Good evening all,
Apologies if for any mistakes in trying to explain the problem I have and thanks in advance for any replies. I'll try to both describe the problem in words and with a data example attached. Further apologies for things being named monthV4lue and ye4r, for some reason I'm having trouble with A's today.
I am using Stata MP 14.2 on a data set of tens of millions of observation rows. The data contains a row for every person-year combination (each person has 4 years of data). Each month of each year has a value, and since each person has four years of data, each row contains 48 columns of monthValue data.
In the original data set, for example, the row for person A year 2011 has information filled in for monthValue1 through monthValue12, but has no data for monthValue13 through monthValue48. The row for person A year 2012 has information filled in for monthValue13 through monthValue24, but has no data for monthValue1 through monthValue12 and monthValue25 through monthValue48. Etc.
I wanted take the values for each row's non-missing monthValues and apply them to the other rows for the same person. For example, I wanted to make sure person A year 2011 had values for monthValue13 through monthValue24, taken from person A year 2014.
I cannot provide exact details but have created a fake data set pair that I think captures the problem accurately. The original data set is named zach_adams_original.csv.
So I ran a by-sorted egen max on the monthValues by person (see the below codeblock).
For the sake of ease of display here, the new variables created by egen were then renamed and replaced the original variables
The new data set is named zach_adams_post.csv and contains filled data across monthValues1 to monthValues24, missing data for monthValues25 to monthValues48 (since no one had data for these months in the sample/fake data set), across years 2011 to 2014 for each person.
The problem is that while this method works, it takes an extremely large chunk of time (multiple hours or more, depending on how stressed the machine is and the size of the data set this code is run on). Is there a faster or more efficient way to perform this type of operation?
Thank you!!
Apologies if for any mistakes in trying to explain the problem I have and thanks in advance for any replies. I'll try to both describe the problem in words and with a data example attached. Further apologies for things being named monthV4lue and ye4r, for some reason I'm having trouble with A's today.
I am using Stata MP 14.2 on a data set of tens of millions of observation rows. The data contains a row for every person-year combination (each person has 4 years of data). Each month of each year has a value, and since each person has four years of data, each row contains 48 columns of monthValue data.
In the original data set, for example, the row for person A year 2011 has information filled in for monthValue1 through monthValue12, but has no data for monthValue13 through monthValue48. The row for person A year 2012 has information filled in for monthValue13 through monthValue24, but has no data for monthValue1 through monthValue12 and monthValue25 through monthValue48. Etc.
I wanted take the values for each row's non-missing monthValues and apply them to the other rows for the same person. For example, I wanted to make sure person A year 2011 had values for monthValue13 through monthValue24, taken from person A year 2014.
I cannot provide exact details but have created a fake data set pair that I think captures the problem accurately. The original data set is named zach_adams_original.csv.
So I ran a by-sorted egen max on the monthValues by person (see the below codeblock).
Code:
forval i=1/48 { bys person (id): egen filledMonthValue`i'=max(monthValue`i') }
Code:
forval i=1/48 { drop monthValue`i' rename filledMonthValue`i' monthValue`i' }
The new data set is named zach_adams_post.csv and contains filled data across monthValues1 to monthValues24, missing data for monthValues25 to monthValues48 (since no one had data for these months in the sample/fake data set), across years 2011 to 2014 for each person.
The problem is that while this method works, it takes an extremely large chunk of time (multiple hours or more, depending on how stressed the machine is and the size of the data set this code is run on). Is there a faster or more efficient way to perform this type of operation?
Thank you!!
Comment