Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • "Pivotting" dataset

    Hello everyone,

    I am designing a quality control framework for a longitudinal cohort study that uses survey data. Every year, twenty new datasets must be checked for potential mistakes in data collection. For each dataset, the "diagnosis" is a dataset of the form:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id str64 question float(wrong correct)
    1 "A"  1  0
    1 "G" -3 -1
    2 "A"  8  9
    5 "E"   4  5
    5 "G"  2  6
    end
    
    
         | id   question   value   correct |
         |---------------------------------|
      1. |  1          A       1         0 |
      2. |  1          G      -3        -1 |
      3. |  2          A       8         9 |
      4. |  5          E       4         5 |
      5. |  5          G       2         6 |


    Where "id" is the participant id, "question" is the name of the variable that flagged an exception (the letters are an illustration, but could be any string), "wrong" is the wrong value, and "correct" is the correct value.

    From this, I want to be able to output a corrected version of the original dataset. So I want to transform the diagnosis to the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id A G E)
    1 0 -1 .
    2 9  . .
    5 .  6 5
    end
    
    
         | id   A    G   E |
         |-----------------|
      1. |  1   0   -1   . |
      2. |  2   9    .   . |
      3. |  5   .    6   5 |
    The variables are the distinct values of "question", the observations are grouped by "id", and the values are the "correct" values. That way, the result can be merged with the original set using -merge- (I already implemented this last part) to implement the corrections easily

    Question: Is there a way in Stata to obtain my second example from my first? For those familiar with Python data management, in pandas you would write
    Code:
    pivoted_df = df.pivot(index='id', columns='question')['correct']
    Thank you very much!

  • #2
    Code:
    drop wrong
    reshape wide correct, i(id) j(question) string
    rename correct* *
    
    list, noobs clean
    will work for your example. But it may not work more generally if the variable question can contain arbitrary strings. The problem, of course, is that you want the values of questions to become variable names, but not all strings are legal variable names. You can probably get around this by doing -replace question = strtoname(question)-, which will replace characters that are not legal in variable names with underscore characters (_). But that might be a problem if it turns two distinct values of question into the same one because they agree on all their legal characters. And it won't solve the potential problem of the contents of question being too long to be a variable name. You may have to implement some ad hoc solutions to those problems if they arise.

    Comment


    • #3
      Thank you, it works perfectly! I didn't realize -reshape- was so powerful.

      The problem, of course, is that you want the values of questions to become variable names, but not all strings are legal variable names
      The strings in "questions" are variable names in the original dataset that I am trying to correct. So I already know that they are valid variable names. I really appreciate you taking the time to answer in such details.

      edit: I realize that there is also a possibility for the variable names to be too long when "correct" is added in the beginning, but my variables are under 20 characters so it should be fine.
      Last edited by Baptiste Ottino; 24 May 2018, 10:04.

      Comment

      Working...
      X