reshape wide confusion

Sarah Hirsch

Join Date: Jun 2023

Posts: 2
#1

12 Jun 2023, 14:45

Question: Does Stata do some behind-the-scenes ordering in the reshape wide function?

I am currently trying to migrate a data pipeline from Stata into R and it has mostly gone according to plan, except for one line of code that produces a different distribution in Stata than its equivalent in R:

In Stata:

Code:

reshape wide type accepted, i(id date) j(seq)

In R:

Code:

dcast(data, id + date ~ seq, value.var = c("type", "accepted"))

type is a variable with 5 values, originally stored as a string and then encoded as a labeled numeric variable. accepted is a yes/no variable with two values, Received or Refused. seq is a numeric sequence that counts the rows within each id-date. Observations may be duplicated on either type or accepted within an id-date instance, but the seq variable prevents full duplicates.

The resulting dataset, in both cases, has the column names <id, date, type_1, accepted_1, type_2, accepted_2, type_3, accepted_3>, where the suffix to the type and accepted variables is the seq value.

The distributions of the values in the type variables are exactly the same, but the distributions of the values in the accepted variables differ slightly.

I have looked at how the dataset is ordered in Stata; initially it is sorted with

Code:

sort id date type

Then type is encoded from string to numeric and labeled. Then the data are sorted again:

Code:

gsort id date -accepted

I have tried to replicate this in R, without the labels, but to no avail; the distribution is still different.

So my question is, as mentioned above, is there anything "extra" that Stata does in the reshape wide function? The fact that type and accepted may be duplicated is problematic, of course; how does Stata handle this? I would also be happy to hear any other insights that might be relevant. Thank you!
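One thing worth checking here is how ties are ordered before seq is assigned. Stata's sort, without the stable option, puts tied observations in random order, so if seq is created after a sort that leaves ties (an assumption; the thread does not say when seq is created), the same rows can receive different seq values in Stata and in R. The sketch below uses hypothetical toy data and a hypothetical helper, widen(), with data.table's dcast to show how two equally valid orderings of a tie on type give the same type_1 but a different accepted_1:

```r
library(data.table)

# Hypothetical toy data: one id-date whose two rows tie on type.
dt <- data.table(
  id       = c(1L, 1L),
  date     = as.IDate(c("2023-01-01", "2023-01-01")),
  type     = c("a", "a"),
  accepted = c("Refused", "Received")
)

# Hypothetical helper: number the rows within id-date in their current
# order, then reshape wide on that sequence.
widen <- function(d) {
  d <- copy(d)
  d[, seq := seq_len(.N), by = .(id, date)]
  dcast(d, id + date ~ seq, value.var = c("type", "accepted"))
}

# Sorting by id, date, type cannot resolve the tie, so both row orders
# below are valid outcomes of that sort. R's ordering is stable, so each
# call simply keeps the tied rows in their input order.
wide_a <- widen(dt[order(id, date, type)])
wide_b <- widen(dt[c(2, 1)][order(id, date, type)])

wide_a$type_1 == wide_b$type_1   # the type columns agree
wide_a$accepted_1                # "Refused"
wide_b$accepted_1                # "Received"
```

If this is what is happening, adding a tie-breaking variable to the sort on both sides (so that no ties remain within id-date) should make the two results comparable.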

Last edited by Sarah Hirsch; 12 Jun 2023, 14:48. Reason: Too many Code: tags
Nick Cox

Join Date: Mar 2014

Posts: 35662
#2

12 Jun 2023, 15:04

The new variable names should, I think, not have suffixes _1, _2, _3, etc., unless those are the string values of seq and you specify the option string.

So, please show a minimal data example matching your summary.
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#3

12 Jun 2023, 15:07

Can you provide some output? How much variation are we talking about? There are a few reasons the distribution might be different, but I don't think the order of the observations should affect the distribution of the data.
Sarah Hirsch

Join Date: Jun 2023

Posts: 2
#4

12 Jun 2023, 16:58

Let me give some thought to how to show the output. If I can create a minimal reproducible example, I will have solved my problem!
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#5

12 Jun 2023, 18:21

Thank you. You might just show us whatever statistics you use to demonstrate to yourself that the data are distributed differently; usually summary statistics for continuous variables, or frequency tables for categorical variables.
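As a rough sketch of what such a comparison could look like on the R side, assuming the wide result from Stata has been exported and read into R: the data frames stata_wide and r_wide and the helper freq() below are hypothetical stand-ins, not the poster's data.

```r
# Hypothetical stand-ins for the two wide results.
stata_wide <- data.frame(
  accepted_1 = c("Received", "Refused", "Received"),
  accepted_2 = c("Refused", NA, "Received")
)
r_wide <- data.frame(
  accepted_1 = c("Received", "Received", "Received"),
  accepted_2 = c(NA, "Refused", "Received")
)

# One frequency table per column, keeping missing values visible.
freq <- function(d) lapply(d, function(x) table(x, useNA = "ifany"))

freq(stata_wide)
freq(r_wide)

# Flag the columns whose frequency tables differ between the two results.
cols <- intersect(names(stata_wide), names(r_wide))
differs <- !mapply(identical, freq(stata_wide)[cols], freq(r_wide)[cols])
names(differs)[differs]
```

Posting the tables for the columns flagged this way would show how large the discrepancy is and which seq positions it affects.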
