"Pivotting" dataset

Baptiste Ottino

Join Date: May 2018

Posts: 21
#1

"Pivotting" dataset

24 May 2018, 08:51

Hello everyone,

I am designing a quality control framework for a longitudinal cohort study that uses survey data. Every year, twenty new datasets must be checked for potential mistakes in data collection. For each dataset, the "diagnosis" is a dataset of the form:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float id str64 question float(wrong correct) 1 "A" 1 0 1 "G" -3 -1 2 "A" 8 9 5 "E" 4 5 5 "G" 2 6 end | id question value correct | |---------------------------------| 1. | 1 A 1 0 | 2. | 1 G -3 -1 | 3. | 2 A 8 9 | 4. | 5 E 4 5 | 5. | 5 G 2 6 |

Where "id" is the participant id, "question" is the name of the variable that flagged an exception (the letters are an illustration, but could be any string), "wrong" is the wrong value, and "correct" is the correct value.

From this, I want to be able to output a corrected version of the original dataset. So I want to transform the diagnosis to the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id A G E) 1 0 -1 . 2 9 . . 5 . 6 5 end | id A G E | |-----------------| 1. | 1 0 -1 . | 2. | 2 9 . . | 3. | 5 . 6 5 |

The variables are the distinct values of "question", the observations are grouped by "id", and the values are the "correct" values. That way, the result can be merged with the original set using -merge- (I already implemented this last part) to implement the corrections easily

Question: Is there a way in Stata to obtain my second example from my first? For those familiar with Python data management, in pandas you would write

Code:

pivoted_df = df.pivot(index='id', columns='question')['correct']

Thank you very much!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#2

24 May 2018, 09:49

Code:

drop wrong reshape wide correct, i(id) j(question) string rename correct* * list, noobs clean

will work for your example. But it may not work more generally if the variable question can contain arbitrary strings. The problem, of course, is that you want the values of questions to become variable names, but not all strings are legal variable names. You can probably get around this by doing -replace question = strtoname(question)-, which will replace characters that are not legal in variable names with underscore characters (_). But that might be a problem if it turns two distinct values of question into the same one because they agree on all their legal characters. And it won't solve the potential problem of the contents of question being too long to be a variable name. You may have to implement some ad hoc solutions to those problems if they arise.
1 like
Comment
Baptiste Ottino

Join Date: May 2018

Posts: 21
#3

24 May 2018, 09:58

Thank you, it works perfectly! I didn't realize -reshape- was so powerful.

The problem, of course, is that you want the values of questions to become variable names, but not all strings are legal variable names

The strings in "questions" are variable names in the original dataset that I am trying to correct. So I already know that they are valid variable names. I really appreciate you taking the time to answer in such details.

edit: I realize that there is also a possibility for the variable names to be too long when "correct" is added in the beginning, but my variables are under 20 characters so it should be fine.

Last edited by Baptiste Ottino; 24 May 2018, 10:04.
Comment

Announcement

"Pivotting" dataset

Comment

Comment