Reshaping dta from long to wide format

Ardele Mandiriri

Join Date: Feb 2020

Posts: 3
#1

06 Feb 2020, 06:31

Hi there, I am fairly new to Stata.
I need help reshaping a dataset from long to wide. I have three variables: pid (unique identifier), Diagnosis, and DiagnosisDate, as illustrated below.
pid  Diagnosis  DiagnosisDate
1    HSV2       06-02-20
1    HSV2       06-02-19
2    TV         01-01-20
2    BV         01-01-20
3    Syphillis  06-05-19

I tried the command reshape wide Diagnosis, i(pid) j(DiagnosisDate) and got the error message "values of variable DiagnosisDate not unique within pid". I would really appreciate guidance on how to proceed to create one observation per pid instead of the existing multiple ones.

Many thanks
Ardele

William Lisowski

Join Date: Dec 2014
Posts: 10150
#2

06 Feb 2020, 13:48

If you reshaped your 5 example observations to a wide layout, what would you expect the results to be like? Make a table like the one in post #1 but with the Stata variable names for the wide layout and with the new 3 observations. What you are asking for with the command you tried is a layout with a diagnosis variable for each day that appears in your data, with the values missing for most variables in any given observation. And worse yet, what do you do with pid 2, who has two diagnoses on the same day?
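To see which observations trigger the "not unique within pid" error, one option is to tag the repeated pid/date combinations before attempting the reshape. This is a minimal sketch on the five example observations, assuming lowercase variable names (the original post capitalizes them):

```stata
clear
input byte pid str9 diagnosis str8 diagnosisdate
1 "HSV2"      "06-02-20"
1 "HSV2"      "06-02-19"
2 "TV"        "01-01-20"
2 "BV"        "01-01-20"
3 "Syphillis" "06-05-19"
end

// summarize how many pid/date combinations occur more than once
duplicates report pid diagnosisdate

// dup holds the number of extra copies of each pid/date combination
duplicates tag pid diagnosisdate, generate(dup)

// list the observations that prevent reshape from using the date as j()
list if dup > 0, clean
```

Here the two pid 2 observations share a date, so the date cannot serve as the j() variable for that pid.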

Perhaps the following is what you were hoping for.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte pid str9 diagnosis str8 diagnosisdate
1 "HSV2"      "06-02-20"
1 "HSV2"      "06-02-19"
2 "TV"        "01-01-20"
2 "BV"        "01-01-20"
3 "Syphillis" "06-05-19"
end

// convert date string to a Stata Internal Format date
// honestly, I don't know which is the month or year, the last one has to be the day
generate ddate = daily(diagnosisdate,"20YMD")
format %td ddate
list, clean abbreviate(16)
drop diagnosisdate

// including seq in the sort ensures that duplicate pid/ddate observations 
// will be sorted the same way every time the program is run
generate seq = _n
sort pid ddate seq
drop seq
by pid (ddate): generate diagnum = _n

reshape wide ddate diagnosis, i(pid) j(diagnum)
list, clean abbreviate(16)

Code:

. // convert date string to a Stata Internal Format date
. // honestly, I don't know which is the month or year, the last one has to be the day
. generate ddate = daily(diagnosisdate,"20YMD")

. format %td ddate

. list, clean abbreviate(16)

       pid   diagnosis   diagnosisdate       ddate  
  1.     1        HSV2        06-02-20   20feb2006  
  2.     1        HSV2        06-02-19   19feb2006  
  3.     2          TV        01-01-20   20jan2001  
  4.     2          BV        01-01-20   20jan2001  
  5.     3   Syphillis        06-05-19   19may2006  

. drop diagnosisdate

. 
. // including seq in the sort ensures that duplicate pid/ddate observations 
. // will be sorted the same way every time the program is run
. generate seq = _n

. sort pid ddate seq

. drop seq

. by pid (ddate): generate diagnum = _n

. 
. reshape wide ddate diagnosis, i(pid) j(diagnum)
(note: j = 1 2)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        5   ->       3
Number of variables                   4   ->       5
j variable (2 values)           diagnum   ->   (dropped)
xij variables:
                                  ddate   ->   ddate1 ddate2
                              diagnosis   ->   diagnosis1 diagnosis2
-----------------------------------------------------------------------------

. list, clean abbreviate(16)

       pid   diagnosis1      ddate1   diagnosis2      ddate2  
  1.     1         HSV2   19feb2006         HSV2   20feb2006  
  2.     2           TV   20jan2001           BV   20jan2001  
  3.     3    Syphillis   19may2006                        .
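A note on the date conversion: the "20YMD" mask above was a guess. If the strings are in fact day-month-year (an assumption, the thread does not confirm the order), a different mask is needed. A small sketch comparing the two readings of the same string:

```stata
// assumption: day-month-year, two-digit years belonging to 20xx
display %td daily("06-02-20", "DM20Y")

// the mask used in this thread reads the same string as year-month-day
display %td daily("06-02-20", "20YMD")
```

Only the mask in the generate command would change; the sort and reshape steps stay the same.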

But now let me tell you that organizing your data that way is almost certainly a bad idea for analysis in Stata. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. In a wide layout, determining which individuals have been diagnosed with TV will involve looping over your variables, while in a long layout

Code:

bysort pid: egen had_TV = max(cond(diagnosis=="TV",1,0))

does it in a single command.
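For comparison, here is a sketch of what the wide-layout version might look like. The data are the diagnosis columns of the wide result above, typed in by hand so the example runs on its own; the loop has to visit every diagnosis variable:

```stata
clear
input byte pid str9 diagnosis1 str9 diagnosis2
1 "HSV2"      "HSV2"
2 "TV"        "BV"
3 "Syphillis" ""
end

// start at 0 and switch to 1 if any diagnosis variable equals "TV"
generate byte had_TV = 0
foreach v of varlist diagnosis* {
    replace had_TV = 1 if `v' == "TV"
}
list, clean
```

The loop grows with the number of diagnoses per person, while the long-layout command is unchanged however many diagnoses there are.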

What you have looks like cross-sectional data, which Stata expects to find in a long layout. Since you are new to Stata, before embarking on your analysis, if you have not done so already, you should review the introductory material in the Stata Longitudinal Data/Panel Data Reference Manual PDF included in your Stata installation and accessible through Stata's Help menu.


Ardele Mandiriri

Join Date: Feb 2020

Posts: 3
#3

07 Feb 2020, 06:22

Hi William,

Thank you for your prompt feedback. The code you used reshaped the data exactly as I intended. The end goal after reshaping to wide is to merge this dta into a master dta, which will form a panel for a longitudinal analysis. Keeping the data long would mean a one (unique pid in the master dta) to many (duplicate pids in the current dta) merge and would create multiple observations for each pid. Given that the final goal is to create panel data, would the bysort approach still be preferable to reshaping?

Many thanks

Best
Ardele
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

07 Feb 2020, 06:58

You say "the final goal is to create panel data" and "I am fairly new to Stata". I recommended you read the introductory material in the Stata Longitudinal Data/Panel Data Reference Manual PDF. I repeat that recommendation now.

In Stata, analysis of panel data is carried out with data in a long layout - for each member of the panel, there is a separate observation for each time data is gathered for that member. The wide layout from post #2 will not be compatible with the commands you will use to carry out analysis of panel data.
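On the merge concern in post #3: a long layout does not rule out the merge, since Stata supports many-to-one merges directly. The following is a minimal sketch, assuming a master file with one observation per pid; the tempfile and the age variable with its values are hypothetical stand-ins for the real master dta:

```stata
// hypothetical master data: one observation per pid
clear
input byte pid byte age
1 34
2 27
3 41
end
tempfile master
save `master'

// long diagnosis data: several observations per pid
clear
input byte pid str9 diagnosis str8 diagnosisdate
1 "HSV2"      "06-02-20"
1 "HSV2"      "06-02-19"
2 "TV"        "01-01-20"
2 "BV"        "01-01-20"
3 "Syphillis" "06-05-19"
end

// many-to-one merge: each diagnosis observation picks up its pid's master variables
merge m:1 pid using `master'

// in this example every pid appears in both datasets
assert _merge == 3
list, clean
```

The result keeps one observation per diagnosis, with the master variables repeated within pid.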

Stata supplies exceptionally good documentation that amply repays the time spent studying it. Certainly, if you understand the requirements of the tools you use to carry out panel data analysis in Stata, you won't spend a lot of time preparing data that you then find to be in a form unsuitable for the analysis.

Let me add, if you are coming to Stata from, for example, SAS, it is not the case that "speaking Stata" is just "speaking SAS with a Stata accent". You need to think differently about your data, your programs, and your analyses. When I began using Stata in a serious way, after decades with SAS, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

Last edited by William Lisowski; 07 Feb 2020, 07:00.
Ardele Mandiriri

Join Date: Feb 2020

Posts: 3
#5

07 Feb 2020, 08:17

Thank you so much, William, for this recommendation. I will make sure I go through the recommended materials. Thank you once more for the guidance. I greatly appreciate it.