  • reshape wide confusion

    Question: Does Stata do some behind-the-scenes ordering in the reshape wide command?

    I am currently trying to migrate a data pipeline from Stata into R and it has mostly gone according to plan, except for one line of code that produces a different distribution in Stata than its equivalent in R:

    In Stata:
    Code:
    reshape wide type accepted, i(id date) j(seq)
    In R:
    Code:
    dcast(data, id + date ~ seq, value.var = c("type", "accepted"))
    Type is a variable with 5 values, originally presented as a string, then labeled numerically. Accepted is a yes/no variable with two values, Received or Refused. seq is a numerical sequence that counts the number of rows per id-date. Observations may be duplicated either on type or accepted within an id-date instance, but the seq variable prevents full duplicates.

    The resulting dataset, in both cases, has the column names <id, date, type_1, accepted_1, type_2, accepted_2, type_3, accepted_3>, where the suffix to the type and accepted variables is the seq value.

    The distributions of the values in the type variables are exactly the same, but the distributions of the values in the accepted variables differ slightly.

    I have looked at the dataset ordering in Stata; initially it is ordered through
    Code:
    sort id date type
    Then type is coded from string to numeric and labeled, and the dataset is sorted again:
    Code:
    gsort id date -accepted
    I have tried to replicate this in R, without the labels, but to no avail; the distribution is still different.

    So my question, as mentioned above, is: does Stata do anything "extra" in the reshape wide command? The fact that type and accepted may be duplicated is problematic, of course; how does Stata handle this? I would also be happy to hear any other insights that might be relevant. Thank you!
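
    One possibility worth ruling out, offered as a sketch rather than a diagnosis: Stata's sort does not preserve the relative order of tied observations unless the stable option is specified. If seq is created (or re-created) after sorting on id, date, and type only, rows that tie on type could receive different seq values in Stata and in R. The base R example below uses made-up data and hypothetical helper names (add_seq, to_wide), and it assumes seq is a within-group row counter. It shows the type columns agreeing while the accepted columns do not.

```r
# Hypothetical toy data: one id-date group with two rows that tie on `type`
# but differ on `accepted`. None of these values come from the real dataset.
rows <- data.frame(
  id       = c(1, 1, 1),
  date     = c("2023-01-01", "2023-01-01", "2023-01-01"),
  type     = c(1, 1, 2),
  accepted = c("Received", "Refused", "Received"),
  stringsAsFactors = FALSE
)

# Assumption: seq is a row counter within id-date, assigned in current row order.
add_seq <- function(d) {
  d$seq <- ave(seq_len(nrow(d)), d$id, d$date, FUN = seq_along)
  d
}

# Reshape to wide with base R; columns come out as type.1, accepted.1, ...
to_wide <- function(d) {
  reshape(d, idvar = c("id", "date"), timevar = "seq",
          v.names = c("type", "accepted"), direction = "wide")
}

# Two orderings that are both valid results of sorting by id, date, type:
# the first two rows tie on type, so either may come first.
order_a <- rows[c(1, 2, 3), ]
order_b <- rows[c(2, 1, 3), ]

wide_a <- to_wide(add_seq(order_a))
wide_b <- to_wide(add_seq(order_b))

# type is the same under both orderings; accepted is not.
print(identical(wide_a$type.1, wide_b$type.1))         # TRUE
print(identical(wide_a$accepted.1, wide_b$accepted.1)) # FALSE
```

    If something like this is happening, sorting on enough variables to break every tie before seq is created (or using sort with the stable option in Stata) should make the two pipelines agree.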
    Last edited by Sarah Hirsch; 12 Jun 2023, 14:48. Reason: Too many Code: tags

  • #2
    The new variable names should not, I think, have suffixes _1, _2, _3, etc. unless those are the string values of seq and you specify the option string. With a numeric seq, reshape wide would create type1, accepted1, type2, and so on.

    So, please show a minimal data example matching your summary.



    • #3
      Can you provide some output? How much variation are we talking about? There are a few reasons the distribution might be different, but I don't think the order of the observations should affect the distribution of the data.



      • #4
        Let me give some thought to how to show the output. If I can create a minimal reproducible example, I will have solved my problem!



        • #5
          Thank you. You might just show us whatever statistics you use to demonstrate to yourself that the data are distributed differently; usually summary statistics for continuous variables, or frequency tables for categorical variables.
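
          As a rough illustration of the kind of output that would help, here is a minimal base R sketch with made-up data. The names wide_stata and wide_r are hypothetical stand-ins for the Stata result (exported and read into R) and the dcast result; the column names follow the ones given in the first post.

```r
# Hypothetical stand-ins for the two wide datasets; the values are invented.
wide_stata <- data.frame(
  id = c(1, 2), date = c("2023-01-01", "2023-01-02"),
  accepted_1 = c("Received", "Refused"),
  accepted_2 = c("Refused", NA),
  stringsAsFactors = FALSE
)
wide_r <- data.frame(
  id = c(1, 2), date = c("2023-01-01", "2023-01-02"),
  accepted_1 = c("Refused", "Refused"),
  accepted_2 = c("Received", NA),
  stringsAsFactors = FALSE
)

# Frequency table for one column, keeping missing values visible.
levels_all <- c("Received", "Refused")
freq <- function(x) table(factor(x, levels = levels_all), useNA = "always")

table_stata <- freq(wide_stata$accepted_1)
table_r     <- freq(wide_r$accepted_1)
print(table_stata)
print(table_r)

# Row-level check: which id-date pairs disagree on accepted_1?
both <- merge(wide_stata, wide_r, by = c("id", "date"),
              suffixes = c(".stata", ".r"))
differs  <- which(both$accepted_1.stata != both$accepted_1.r)
mismatch <- both[differs, c("id", "date")]
print(mismatch)
```

          The frequency tables show how large the discrepancy is, and the list of mismatching id-date pairs points to specific groups that can be inspected in the long data.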

