Question: Does Stata do some behind-the-scenes ordering in the reshape wide function?
I am currently migrating a data pipeline from Stata to R. It has mostly gone according to plan, except for one line of code that produces a different distribution in Stata than its equivalent in R:
In Stata:
Code:
reshape wide type accepted, i(id date) j(seq)
In R:
Code:
dcast(data, id + date ~ seq, value.var = c("type", "accepted"))
type is a variable with 5 values, originally stored as a string, then coded numerically and labelled. accepted is a yes/no variable with two values, Received or Refused. seq is a numerical sequence that counts the rows within each id-date. Observations may be duplicated on either type or accepted within an id-date instance, but the seq variable prevents full duplicates.
The resulting dataset, in both cases, has the column names <id, date, type_1, accepted_1, type_2, accepted_2, type_3, accepted_3>, where the suffix to the type and accepted variables is the seq value.
The distributions of the values in the type variables are exactly the same, but the distributions of the values in the accepted variables differ slightly.
I have looked at the dataset ordering in Stata; initially it is sorted with
Code:
sort id date type
Then type is coded from string to numeric and labelled. Then the data are sorted again with
Code:
gsort id date -accepted
I have tried to replicate this in R, without the labels, but to no avail; the distribution is still different.
So my question is, as mentioned above, is there anything "extra" that Stata does in the reshape wide function? The fact that type and accepted may be duplicated is problematic, of course; how does Stata handle this? I would also be happy to hear any other insights that might be relevant. Thank you!
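For concreteness, here is a minimal R sketch of the setup with made-up data. The values, the use of data.table, and the way seq is built here (a running count within id-date after sorting) are assumptions for illustration only, not my actual pipeline. Under that assumption, the row order within an id-date group decides which observation gets which seq value, so the sketch fixes the order explicitly before counting.

```r
library(data.table)

# Hypothetical toy data: two id-date groups, values invented for illustration.
data <- data.table(
  id       = c(1, 1, 1, 2, 2),
  date     = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                       "2020-02-01", "2020-02-01")),
  type     = c(1L, 1L, 3L, 2L, 2L),
  accepted = c("Received", "Refused", "Received", "Refused", "Refused")
)

# Mirror the second Stata sort: id and date ascending, accepted descending.
setorder(data, id, date, -accepted)

# Assumption: seq is a running row count within each id-date group,
# taken in the current row order.
data[, seq := seq_len(.N), by = .(id, date)]

# id-date-seq should identify rows uniquely; otherwise dcast would
# fall back to aggregating instead of simply spreading the values.
stopifnot(anyDuplicated(data, by = c("id", "date", "seq")) == 0L)

# Same call as in the question.
wide <- dcast(data, id + date ~ seq, value.var = c("type", "accepted"))
print(names(wide))

# Distribution of one of the accepted columns, for comparison with Stata.
print(table(wide$accepted_1, useNA = "ifany"))
```

Rows that tie on id, date and accepted keep their previous relative order in this sketch; whether the same holds on the Stata side is part of what I am asking.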