Reshaping?

Victor Dahl Mathiasen

Join Date: Nov 2017

Posts: 9
#1

Reshaping?

22 Nov 2017, 00:50

Hi,

In my datasheet (example attached) there are in some cases more than one observation for each record. Can I somehow combine each record into one observation with more than one variable?
I tried using reshape without any luck. To me it seems like it should be an easy enough task, but I spend a lot of time now trying the reshape command etc. I am not even sure it's the command to use..
The best layout would be something like: RECORD, DIAG1, DIAG2, DIAG3....

Sincerely
Victor

Attached Files
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

22 Nov 2017, 08:55

Code:

by v_recnum, sort: gen _j = _n reshape wide c_diag, i(v_recnum) j(_j)

will do this.

Please read the FAQ, especially #12, for excellent advice about how to post data examples (and code and output). In particular, it clearly states specificallynot to use screenshots. They are often unreadable. Even when they are readable, as yours turns out to be, were it necessary to import the data to Stata to test out some code to solve your problem, a screenshot is useless. In the future, always use the -dataex- command to show Stata data examples.
1 like
Comment
Victor Dahl Mathiasen

Join Date: Nov 2017

Posts: 9
#3

23 Nov 2017, 02:01

Thank you Clyde, it worked very well.
I read the FAQ and will keep this in mind for the future.
Comment
Victor Dahl Mathiasen

Join Date: Nov 2017

Posts: 9
#4

23 Nov 2017, 04:48

What should I do, if I wanted to do the same command in alphabetical or numerical order?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#5

25 Nov 2017, 11:57

It would be almost the same as in #2. Just change fhe first command to:

Code:

bysort v_recnum (c_diag): gen _j = _n

That way the numerical order established when _j is generated would be identical to the sort order of c_diag.
Comment

Victor Dahl Mathiasen

Join Date: Nov 2017
Posts: 9

30 Nov 2017, 06:53

Thank you for your previous helpful answers.
I have now encountered another problem. If i want to combine my observations based on a unique ID (example ID 1 to 3). How would I do that most easily? I have deleted all identical duplicates already. Following command gives me the error:
variable c_diag11 already defined
r(110);

Code:

by v_cpr, sort: gen _j = _n
reshape wide c_diag1 c_diag2 c_diag3 c_diag4 c_diag5 c_diag6 c_diag7 c_diag8 c_diag9 c_diag10 c_diag11 c_diag12 c_diag13, i(v_cpr) j(_j)

Code:

input double v_cpr str8(c_diag1 c_diag2 c_diag3 c_diag4 c_diag5 c_diag6 c_diag7 c_diag8 c_diag9 c_diag10 c_diag11 c_diag12 c_diag13)
1 "DI109"  "DI489"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DI489"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DI489"  "DR060"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DR060"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "40199"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "41099"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "53290"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "54090"  "78909"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "78909"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DE668"  "DI109"  "DI209"  "DN289A" "DZ035" ""      ""      ""      ""      ""      ""      "" ""
2 "DE780"  "DI109"  "DI208"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DH350"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DI130"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DI208"  "DI209"  "DI251"  ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DI208"  "DI251"  "DR079"  ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DM51"   "DM511D" ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DN031"  "DZ039"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DR079"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DZ034"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DM109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DM99"   "DM993"  "DZ038"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DN409"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DR391"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ016"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ035"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ769"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DN300"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DN810"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DO049"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DO200"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DO620"  "DO680"  "DO713"  "DO721"  "DO752" "DO802" "DO864" "DO990" "DZ291" "DZ370" "DZ390" "" ""
3 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""

end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#7

30 Nov 2017, 09:20

The problem is with your variable names. You start out with a variables c_diag1 through c_diag13. Now you try to reshape wide, with the j() variable taking numeric values 1 through 24. When Stata tries to do the -reshape-, it encounters a situation where it is asked to create a new variable c_diag11 (the result of postfixing _j = 1 to c_diag1) and finds that there is already a variable by that name. So you need to change the c_diag* variable names to avoid this. The following will work:

Code:

by v_cpr, sort: gen _j = _n rename c_diag* =_ reshape wide c_diag*_, i(v_cpr) j(_j)

By sticking an underscore character after the original number, you avoid name clashes.

I am once again wondering why you are doing this. Just in this example (and I fear your real data has even more variables) you are creating a new data set that has 313 variables, and most of the observations in it will be missing values. It is really hard for me to imagine how this can be useful for any kind of analysis.
Comment
Victor Dahl Mathiasen

Join Date: Nov 2017

Posts: 9
#8

01 Dec 2017, 02:13

Thank you once again.

Well, I must admit I am very new to Stata and these solutions might seem useless or odd. The thing is, I would like a data set with one observation pr unique ID as I have data from various sources (registries) that I want to combine. I was suggested to do so by colleagues. However, they are not very experienced with Stata either.

In this example, my thought was to create unique observations with all matching values, delete the duplicates and blanks if possible.

Hope to hear from you
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30089

01 Dec 2017, 08:11

When you have multiple observations on the same people, in Stata it is usually more effective to have each observation separate, rather than "stacking them horizontally" into a single observation. Not always, but usually. There is nothing inherently wrong with what you are doing, but I suspect that as you try to do more work with it you will find it cumbersome.

Let me give the benefit of the doubt that in your situation having all the diagnoses for a given person in a single observatoin is, in fact, the best way to proceed. (This sounds plausible to me.) Still, having so many of those observations consisting of missing values interspersed between non-missing values is almost certainly going to cause problems. So consider this alternative:

Code:

clear
input double v_cpr str8(c_diag1 c_diag2 c_diag3 c_diag4 c_diag5 c_diag6 c_diag7 c_diag8 c_diag9 c_diag10 c_diag11 c_diag12 c_diag13)
1 "DI109"  "DI489"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DI489"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DI489"  "DR060"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DR060"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
1 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "40199"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "41099"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "53290"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "54090"  "78909"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "78909"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DE668"  "DI109"  "DI209"  "DN289A" "DZ035" ""      ""      ""      ""      ""      ""      "" ""
2 "DE780"  "DI109"  "DI208"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DH350"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DI130"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DI208"  "DI209"  "DI251"  ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DI208"  "DI251"  "DR079"  ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DM51"   "DM511D" ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DN031"  "DZ039"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DR079"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DI109"  "DZ034"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DM109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DM99"   "DM993"  "DZ038"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DN409"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DR391"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ016"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ035"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
2 "DZ769"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DN300"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DN810"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DO049"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DO200"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
3 "DO620"  "DO680"  "DO713"  "DO721"  "DO752" "DO802" "DO864" "DO990" "DZ291" "DZ370" "DZ390" "" ""
3 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""

end

gen long obs_no = _n
reshape long c_diag, i(obs_no)
drop if missing(c_diag)
by v_cpr (c_diag), sort: replace _j = _n
drop obs_no
reshape wide c_diag, i(v_cpr) j(_j)

The resulting data set has all of the diagnosis codes for an individual in the same observation. But there are only 33 such variables, and most of the values are not missing. Moreover, the diagnosis codes are sorted in alphabetical order, with the missing values appearing at the end. This will probably prove easier to work with. What you lose with this layout is whatever information is proxied by the separate observations per person in the original data. (If there are other variables besides the c_diag* and v_cpr, and if those are fixed patient attributes that are, therefore, constant within all observations carring the same v_cpr, the -reshape-s will carry those along undisturbed. What you may lose is if there are other variables that are not constant among the observations for the same person which are in the data set, or which are not variables but are implied by virtue of a certain observation being the first vs second vs third... for a given person.)

Last edited by Clyde Schechter; 01 Dec 2017, 08:15.

Comment

Victor Dahl Mathiasen

Join Date: Nov 2017
Posts: 9

#10

18 Dec 2017, 05:05

Dear Clyde,

Again thank you. Your answers have taught me a lot. I have finally realized that it is not necessarily beneficial to "stack my observations horizontally". However, in my following question I think it is necessary. I have used the reshape command a couple of times with good outcomes. I am now trying to reshape the following data to a wide format to get a better overview of the text in v_fritekst. I would like k_rekvnr (ID) and k_matnr as my i(option), but they are not unique as each ID may have more than one k_matnr (material). k_matnr is the identifier for each material linking the different sequences (k_sekvensnr).

Code:

k_rekvnr k_matnr k_sekvensnr v_fritekst
"ID1" 1  1 "[01] Blodtilblandet aspirat indeholdende talrige bare kerner,"          
"ID1" 1  2 "makrofager, lymfocytter og dårligt bevaret epitheliale celler. Enkelte"
"ID1" 1  3 "af kernerne forekommer relativt store, men da langt de fleste af"        
"ID1" 1  4 "cellerne er i varierende grad af opløsning, kan de ikke betegnes som"  
"ID1" 1  5 "andet end atypiske, og histologisk undersøgelse tilrådes.[01] xxxx.:"  
"ID1" 1  6 "Autolysepræget finnålsaspirat med enkelte atypiske epitheliale"        
"ID1" 1  7 "celler10.10.2007Med venlig hilsenxxxx/su"                
"ID2" 1  1 "ANTAL KARYOTYPERET: 25     NORMALE: 25     UNORMALE: 0 KARYOTYPE:"      
"ID2" 1  2 "46,XX"                                                                  
"ID2" 2  1 "Lymfeknude med talrige, størrelsesvarierende og stedvis"                
"ID2" 2  2 "sammenflydende, epiteloidcellegranulomer med centrale, kasseøse"        
"ID2" 2  3 "nekroser. Der ses kun enkelte kæmpeceller med randstillede kerner. Der"
"ID2" 2  4 "har ikke kunnet påvises syrefaste stave, men det udelukker ikke"        
"ID2" 2  5 "muligheden af TB. Der er ingen tegn på lymfom eller malignitet"        
"ID2" 2  6 "iøvrigt."                                                              
"ID3"  1  1 "2 glas med udstryg af aspirat fra lymfeknude, bestående af store og"    
"ID3"  1  2 "små lymfocytter samt enkelte immunoblastlignende celler. Der er ingen"  
"ID3"  1  3 "epitelceller eller tegn på malignitet i øvrigt.  xxx"                                                          
"ID4"  1  1 "01+02: Lymfeknudevæv med fokale, større og mindre, nekrotiserende,"    
"ID4"  1  2 "granulomatøse processer, delvis associeret med epiteloide makrofager"  
"ID4"  1  3 "og multinukleære kæmpeceller af Langhansk type. Granulomerne synes at"
"ID4"  1  4 "repræsentere forskellige udviklingsstadier, idet nogle er relativt"    
"ID4"  1  5 "små, stellate, andre partielt ophelede med fibrose. 6.2.: Mikroskopi"  
"ID4"  1  6 "af Ziehl-Neelsen farvde snit afslører ikke syrefaste stave.xxx"

Something like this would be favorable:

Code:

k_rekvnr k_ matnr k_sekvensnr1 v_fritekst1 k_sekvensnr2 v_fritekst2 ........
ID1      1         1            text      2             text
ID2      1         1            text      2             text
ID2      2         1            text      2             text

I know this might seem like an odd question, but I hope you can help me once again.

Sincerely,
Victor

Last edited by Victor Dahl Mathiasen; 18 Dec 2017, 05:16.

Comment

Victor Dahl Mathiasen

Join Date: Nov 2017

Posts: 9
#11

18 Dec 2017, 06:02

Actually I've solved it in another way now. Anyways, if you see an easy solution I would be pleased to hear so
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#12

18 Dec 2017, 08:18

Well, to reshape wide, it is not necessary for the variable in the i() option to uniquely identify observations. In fact, if they do, the -reshape wide- has no effect at all. This is a one-liner:

Code:

reshape long v_fritekst, i(k_rekvnr k_matnr) j(k_sekvensnr)

This does not create a series of k_sekvensnr variables like the ones you show in your desired output. But those variables serve no purpose anyway. It is always going to be the case that k_sekvensnr1 = 1 and k_sekvensnr2 = 2 and k_sekvensnr3 = 3, etc. So these "variables" are predictable constants, which means you don't need them at all.

In the future, please use -dataex- to post example data. The listing you showed in #10 took longer to import into Stata than it took to solve your problem. Please help those who want to help you.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment