  • Can't reshape from long to wide - why?

    Dear Statalist,

    I am encountering issues while attempting to merge two datasets. My using dataset contains around 110,000 observations, and the merging dataset contains approximately 10,000 observations. The using dataset is in wide format, and the merging dataset is in long format.

    When I tried to merge them using 1:m, I received the following error message:

    variable id does not uniquely identify observations in the using data r(459);
    I confirmed there are no duplicates in the using dataset by using the duplicates report id command.

    To resolve this, I attempted to reshape the merging dataset into wide format before performing a 1:1 merge. However, I encountered this error message:

    Code:
     reshape wide niva inndato_aarmnd kodenavn kodenr kodeverdi nokkel, i(id) j(j)
    variable j not found
        Data are already wide.
    r(111);
    What could be the issue here? I have not experienced similar problems with merge or reshape on the using dataset previously.

    Thank you for your assistance.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id str1 niva str7 inndato_aarmnd str6 kodenavn double kodenr str6 kodeverdi str10 nokkel
    1 " " "2019-05" "ICD-10" 11 "M84"    "SOM_115782"
    1 " " "2016-10" "ICD-10" 11 "I42"    "SOM_77505" 
    1 " " "2021-04" "ICD-10" 11 "D12"    "SOM_140956"
    1 " " "2009-10" "ICD-10" 11 "I49"    "SOM_45799" 
    1 " " "2009-11" "ICD-10" 11 "M75"    "SOM_41812" 
    1 " " "2009-10" "ICD-10" 11 "M75"    "SOM_45798" 
    1 " " "2009-09" "ICD-10" 11 "I42"    "SOM_45797" 
    1 " " "2020-11" "ICD-10" 11 "I42"    "SOM_132140"
    1 " " "2019-06" "ICD-10" 21 "M84"    "SOM_115783"
    1 " " "2013-08" "NCMP"    1 "FYDE31" "SOM_6810"  
    1 " " "2013-06" "ICD-10" 11 "I49"    "SOM_6809"  
    1 " " "2009-09" "NCMP"    1 "FYDE30" "SOM_45797" 
    1 " " "2013-08" "ICD-10" 11 "I42"    "SOM_6810"  
    1 " " "2019-07" "ICD-10" 21 "M84"    "SOM_115784"
    1 " " "2009-10" "NCMP"    1 "FPFE30" "SOM_45799" 
    1 " " "2013-06" "NCMP"    1 "FPFE30" "SOM_6809"  
    2 " " "2009-07" "ICD-10" 11 "M77"    "SOM_41539" 
    2 " " "2009-12" "ICD-10" 11 "D48"    "SOM_41796" 
    2 " " "2015-12" "ICD-10" 11 "E88"    "SOM_30135" 
    2 " " "2016-06" "ICD-10" 11 "M24"    "SOM_84222" 
    3 "F" "2020-01" "ICD-10" 11 "M79"    "SOM_128909"
    3 "F" "2019-12" "ICD-10" 11 "M79"    "SOM_111711"
    3 " " "2022-09" "ICD-10" 11 "K58"    "SOM_166547"
    3 "F" "2020-01" "ICD-10" 11 "M79"    "SOM_128908"
    3 " " "2010-10" "ICD-10" 11 "K58"    "SOM_51103" 
    3 "F" "2019-12" "ICD-10" 11 "M79"    "SOM_111709"
    3 "S" "2020-01" "ICD-10" 11 "M79"    "SOM_128569"
    3 "F" "2019-12" "ICD-10" 11 "M79"    "SOM_111710"
    3 "S" "2019-12" "ICD-10" 11 "M79"    "SOM_111436"
    4 " " "2015-02" "ICD-10" 11 "D12"    "SOM_28068" 
    4 " " "2016-10" "ICD-10" 21 "K09"    "SOM_81518" 
    4 "F" "2015-01" "ICD-10" 31 "K21"    "SOM_20301" 
    4 " " "2017-04" "ICD-10" 11 "K09"    "SOM_95920" 
    4 "S" "2015-01" "ICD-10" 21 "D62"    "SOM_28067" 
    4 "F" "2015-01" "ICD-10" 11 "K92"    "SOM_20300" 
    4 "S" "2015-01" "ICD-10" 41 "K51"    "SOM_28067" 
    4 " " "2017-04" "ICD-10" 11 "K09"    "SOM_95921" 
    4 "F" "2015-01" "ICD-10" 31 "K21"    "SOM_20300" 
    4 "S" "2015-01" "ICD-10" 31 "K21"    "SOM_28067" 
    4 "F" "2015-01" "ICD-10" 11 "K92"    "SOM_20301" 
    4 "S" "2015-01" "ICD-10" 11 "K92"    "SOM_28067" 
    4 "F" "2015-01" "ICD-10" 21 "K57"    "SOM_20301" 
    4 " " "2017-11" "ICD-10" 11 "K01"    "SOM_95923" 
    4 "S" "2015-01" "ICD-10" 51 "K57"    "SOM_28067" 
    4 "F" "2015-01" "ICD-10" 41 "D62"    "SOM_20301" 
    4 " " "2017-05" "ICD-10" 11 "K09"    "SOM_95922" 
    4 " " "2016-10" "ICD-10" 11 "K01"    "SOM_81518" 
    4 "F" "2015-01" "ICD-10" 21 "D62"    "SOM_20300" 
    4 "F" "2015-01" "ICD-10" 11 "K51"    "SOM_20299" 
    5 " " "2020-05" "ICD-10" 11 "K58"    "SOM_129924"
    6 " " "2017-12" "ICD-10" 11 "K58"    "SOM_90025" 
    6 "F" "2022-02" "ICD-10" 21 "I48"    "SOM_176614"
    6 " " "2016-11" "ICD-10" 11 "I48"    "SOM_83851" 
    6 " " "2016-09" "NCMP"    1 "FPFE15" "SOM_83850" 
    6 "S" "2022-02" "NCMP"    1 "FPFE30" "SOM_173471"
    6 " " "2022-03" "ICD-10" 11 "I49"    "SOM_176615"
    6 " " "2008-08" "ICD-10" 21 "K21"    "SOM_34850" 
    6 " " "2022-03" "ICD-10" 11 "I49"    "SOM_176616"
    6 "S" "2022-02" "ICD-10" 21 "I48"    "SOM_173471"
    6 " " "2016-09" "NCMP"    2 "FPFE50" "SOM_83850" 
    6 " " "2008-10" "ICD-10" 11 "I49"    "SOM_34851" 
    6 " " "2018-02" "ICD-10" 11 "K29"    "SOM_103532"
    6 " " "2011-12" "ICD-10" 11 "I84"    "SOM_59185" 
    6 " " "2018-07" "ICD-10" 11 "M19"    "SOM_99863" 
    6 "F" "2022-02" "NCMP"    1 "FPFE30" "SOM_176614"
    6 " " "2021-06" "ICD-10" 11 "M13"    "SOM_144004"
    6 " " "2011-08" "ICD-10" 21 "I10"    "SOM_59183" 
    6 " " "2020-11" "ICD-10" 11 "K64"    "SOM_126120"
    6 " " "2012-01" "ICD-10" 11 "I84"    "SOM_68403" 
    6 " " "2016-09" "ICD-10" 11 "I48"    "SOM_83850" 
    6 " " "2011-10" "ICD-10" 21 "I10"    "SOM_59184" 
    6 " " "2021-02" "ICD-10" 11 "K57"    "SOM_150699"
    6 " " "2012-07" "ICD-10" 11 "M79"    "SOM_68404" 
    6 " " "2022-03" "NCMP"    1 "FPFE15" "SOM_176615"
    6 " " "2020-06" "ICD-10" 11 "M13"    "SOM_134925"
    6 " " "2023-03" "NCMP"    1 "FPFE30" "SOM_155429"
    6 " " "2023-03" "ICD-10" 31 "I48"    "SOM_155429"
    6 " " "2008-08" "ICD-10" 11 "K29"    "SOM_34850" 
    6 " " "2023-03" "ICD-10" 21 "I49"    "SOM_155429"
    6 " " "2011-08" "NCMP"    1 "FPFE50" "SOM_59183" 
    6 "F" "2022-02" "ICD-10" 21 "I48"    "SOM_176613"
    6 " " "2016-03" "ICD-10" 11 "M19"    "SOM_84471" 
    6 " " "2020-08" "ICD-10" 11 "M13"    "SOM_134926"
    6 " " "2016-05" "ICD-10" 11 "K58"    "SOM_83849" 
    6 " " "2011-10" "NCMP"    1 "FPFE15" "SOM_59184" 
    6 " " "2014-02" "ICD-10" 21 "M79"    "SOM_17538" 
    7 " " "2019-10" "ICD-10" 11 "D47"    "SOM_115147"
    7 " " "2010-09" "ICD-10" 21 "O99"    "SOM_49423" 
    7 " " "2020-11" "ICD-10" 11 "C90"    "SOM_123553"
    7 " " "2022-08" "ICD-10" 11 "D47"    "SOM_173913"
    7 " " "2021-10" "ICD-10" 11 "D47"    "SOM_147570"
    7 " " "2010-09" "ICD-10" 11 "O65"    "SOM_49423" 
    7 " " "2023-06" "ICD-10" 11 "D47"    "SOM_152471"
    7 " " "2020-01" "ICD-10" 11 "D47"    "SOM_123552"
    7 " " "2014-12" "ICD-10" 11 "J35"    "SOM_15946" 
    8 " " "2015-04" "ICD-10" 11 "M67"    "SOM_26692" 
    9 " " "2017-08" "ICD-10" 11 "K92"    "SOM_92266" 
    9 "S" "2023-04" "NCMP"    2 "FYFX05" "SOM_158121"
    9 " " "2023-05" "NCMP"    2 "FYFX05" "SOM_162008"
    9 "S" "2023-04" "NCMP"    4 "GDFX15" "SOM_158121"
    end
    Last edited by Sigrid Vikjord; 14 Jul 2024, 14:27.

  • #2
    Why not just work with both datasets long and then merge? What benefit is there to reshaping? In other words, what would you lose, or otherwise be unable to accomplish, by keeping both datasets long and then merging? By the way, do you actually want to merge these together, or do you seek merge's cousin, joinby?



    • #3
      variable id does not uniquely identify observations in the using data r(459);


      I confirmed there are no duplicates in the using dataset by using the duplicates report id command.
      You may well have persuaded yourself of that, but I am quite confident you are wrong. I have never known Stata to be wrong when it says that a variable fails to uniquely identify observations. I suspect you ran -duplicates report id- on the wrong data set, or something like that. Anyway, here's what you have to do:
      Code:
      use using_data_set, clear
      duplicates tag id, gen(flag)
      browse if flag
      This will show you the actual duplicate observations. Your next task will be to figure out, from what you see, why they are there and what to do about them. Several possibilities arise:
      1. The duplicate observations are all "empty." That is, they contain missing values on id and all other variables as well. This situation commonly arises when data are imported into Stata from other software, particularly from Excel. Sometimes such observations arise from data-management errors, so if your data set was not imported, you should review the data management to see why it produced these empty observations. You can, of course, remove them from the data set with -duplicates drop- without knowing how they got there. But the concern is that if they reflect an error in data management, then there may well be other errors in the data, and you should take this opportunity to fix them so your entire project does not become "garbage in, garbage out."
      2. The duplicate observations are not empty, but all the observations of a given value of id contain the exact same values on all other variables. These are, in other words, pure, exact duplicates. Again, it is easy enough to get rid of these with -duplicates drop-. But again, you should review the data management to see how they got there because it is probably due to a coding error, and there may be other implications of that error, and there may be still other coding errors that have corrupted your data. Once you have verified that the data management is otherwise correct, it is safe to eliminate these surplus observations.
      3. The duplicate observations are not empty, and they do not always agree on the other variables within an id group. Here there are two possibilities:
        1. There is another variable being overlooked. Perhaps you should be -merging- on id and date, not id alone.
        2. There is no variable being overlooked--these are contradictory observations on the same id. In this case you have a lot of work to do: you must find out how to reconcile the contradictions among these observations to arrive at a correct observation, and retain only that one. Again, usually this situation arises because of errors in the data management, and the same concerns about other errors apply.
      Concerning your -reshape- problem, the error message you are getting is self-explanatory: it says variable j is not found, and, indeed, your example data contains no variable named j. It's not at all clear how you would reshape this data into wide layout in any case, and most likely it would leave you with an awkward data set that is poorly suited to further analysis or data management in Stata. Long layout is better for most things in Stata, and -reshape wide- should be used sparingly and cautiously--for good reason only. So I endorse Jared Greathouse's advice to keep the data in its current layout unless you have a truly compelling reason to take it wide. Suffice it to say that the -reshape- error you are having is not a reason to do that at all, let alone a good one. I do not, at this point, agree with his suggestion of using -joinby-. After you have resolved the problem of the duplicate observations in the using data set, you will probably have some kind of reasonably straightforward -merge- of the two data sets. If you do not, at that point it is worth looking into -joinby- as an alternative. But first make sure you understand what -joinby- actually does: it is different from what -merge- does and is only occasionally what is wanted and appropriate.
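      [Editorial note: for completeness only, since the advice above is to stay long--if a wide layout were truly needed, the j variable that -reshape- complains about would first have to be created. A minimal, untested sketch, assuming a within-id sequence number is an acceptable j; variable names are taken from the dataex in #1:]
      Code:
      sort id inndato_aarmnd nokkel
      by id: gen j = _n
      reshape wide niva inndato_aarmnd kodenavn kodenr kodeverdi nokkel, i(id) j(j)
      With up to dozens of admissions per id, this would create a very large number of variables, which illustrates why the long layout is usually preferred.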



      • #4
        Dear Clyde and Jared,

        Thank you for your kind response.

        After loading my dataset and running the code Clyde provided, there were still 0 flagged duplicates (see screenshot). PID_109925 corresponds to id in the example dataset above.


        [Screenshot attachment: Capture.PNG]

        I then tried to merge again, but it was unsuccessful:

        merge m:1 PID_109925 using "M:\Sigrid\Datafil\2017_33246\
        > Datafiler\Rettede data fra NPR.dta"

        variable PID_109925 does not uniquely identify
        observations in the using data
        r(459);
        Would you have any suggestions on how to proceed with this issue? The solutions provided above seem to address cases where duplicates are present.


        Regarding the issue both of you raised about reshaping: Reshaping wasn't something I initially intended to do; it was an (unsuccessful) attempt to solve my merge problems. I am curious about your views on long versus wide data formats. My dataset is from a cohort study with three repeated surveys, spaced 10 years apart. I have always used it in wide format, as provided by the data manager. I primarily use it for regression analyses, including Cox regression. On rare occasions, I reshaped it when necessary.
        Do you recommend reshaping the entire dataset to long format for future use? I find it challenging to browse through my data in long format, partly due to the 13-digit PID, which cannot be changed due to linkage and traceability requirements.



        • #5
          Your problem is in the file "M:\Sigrid\Datafil\2017_33246\Datafiler\Rettede data fra NPR.dta". Check whether PID_109925 uniquely identifies observations in that file.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            Originally posted by Sigrid Vikjord View Post
            merge m:1 PID_109925 using "M:\Sigrid\Datafil\2017_33246\
            > Datafiler\Rettede data fra NPR.dta"

            variable PID_109925 does not uniquely identify
            observations in the using data
            r(459);
            If PID_109925 is a unique identifier in the master dataset, then just switch from m:1 to 1:m.

            Code:
            merge 1:m PID_109925 using "M:\Sigrid\Datafil\2017_33246\Datafiler\Rettede data fra NPR.dta"



            • #7
              Originally posted by Andrew Musau View Post
              If PID_109925 is a unique identifier in the master dataset, then just switch from m:1 to 1:m.
              Only if that is correct for the data. The fact that Sigrid was surprised by the error message suggests to me that it is not.



              • #8
                Correct, I assumed that.



                • #9
                  Dear Maarten,

                  I went through the merging dataset ("Rettede data fra NPR.dta") in the same manner Clyde suggested above (duplicates tag PID_109925, gen(flag) -> browse if flag), and it flags many duplicates (around 270,000). However, these do not appear to be real duplicates, as shown in the screenshot below. This dataset is from a national registry called NPR, covering everyone in a sample who was admitted on the date inndato_aarmnd (admission date, YYYY-MM) and coded with certain ICD-10 and NCMP (procedure) codes (kodeverdi, "code values"). I interpret the large number of "duplicates" as indicating that the dataset is in long format and that it should be possible to merge it using 1:m. I would then expect the number of observations in the master dataset to remain unchanged, but that is not the case, as shown below. I trust that the mistake is on my side, but I can't seem to figure out what I am doing wrong.

                  Again, I created a dummy ID variable to avoid revealing sensitive PID numbers, but id = PID_109925.



                  Dear Andrew,

                  PID_109925 is indeed unique in the master dataset. Switching to 1:m produced the result below. I'm not sure whether this is the correct outcome. I would prefer to end up with 108,000 individuals in the master dataset after the merge (one individual per row).

                  merge 1:m PID_109925 using "M:\Sigrid\Datafil\2017_33246\
                  > Datafiler\Rettede data fra NPR.dta"

                  Code:
                  Result                     Number of obs
                  -----------------------------------------
                  Not matched                       98,645
                      from master                   98,645  (_merge==1)
                      from using                         0  (_merge==2)

                  Matched                          270,163  (_merge==3)
                  -----------------------------------------



                  • #10
                    To exclude unmatched observations from the using dataset, you can specify

                    Code:
                    merge 1:m PID_109925 using "M:\Sigrid\Datafil\2017_33246\Datafiler\Rettede data fra NPR.dta", keep(master match)
                    You need however to figure out whether there are problems with your data.




                    • #11
                      Andrew Musau I'm not sure whether this fixes the problem as her output shows NO unmatched observations from the using dataset.

                      Sigrid Vikjord
                      Unfortunately, I'm not entirely sure you fully understand what happens in an m:1 or 1:m merge and what requirements the two data sets must meet. If your master data set consists of your desired 108,000 cases and you don't want to add any cases through the merge, but only want to add variables to your existing cases, then simply switching from a 1:1 to a 1:m merge is completely the wrong approach if it is done only to get around a legitimate Stata error message.

                      Based on your result from #9 it looks like this to me:
                      1) you have about 110,000 cases in master
                      2) you only have mergeable observations for about 11,350 cases (110,000 - 98,645) in using
                      3) However, since you have over 270,000 matches, approximately 24 observations from using are merged per PID_109925 in master.

                      Like many of the previous commentators, I strongly advocate taking care of your duplicates in your using data set before a merge and ideally only keeping one observation per PID_109925 at the end. There are countless ways to achieve this, but since you haven't posted a data example from your using dataset yet, we can't give you any good recommendations here.
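                      [Editorial note: for illustration, one of those many ways--keeping only the most recent registry record per PID--could be sketched as follows. This is hypothetical: whether "most recent" is the right rule depends on the research question, and the filename is a placeholder:]
                      Code:
                      use "using_data.dta", clear            // placeholder filename
                      sort PID_109925 inndato_aarmnd
                      by PID_109925: keep if _n == _N        // keep the last admission per person
                      Alternatives include -collapse-, -duplicates drop-, or an -egen--based selection rule.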



                      • #12
                        I agree with you, Benno, naturally. At face value, there does not appear to be anything wrong with the merge, and the option -keep(master match)- will not change anything. It is up to the OP to understand the structure of her data. Why are there multiple instances of PID_109925 in the using dataset? If it is patient data, it may be that patients had multiple visits recorded. The master dataset could then be holding demographic information, for example. A 1:m merge will result in a combined dataset that is larger than the master dataset.

                        Code:
                        clear
                        input float(pid) str10(gender)
                        1 "female"
                        2 "male"
                        end
                        
                        *MASTER
                        list
                        
                        
                        clear
                        input float(pid) str10(visit_no test_result)
                        1 1 "X"
                        1 2 "Y"
                        1 3 "Z"
                        2 1 "Z"
                        2 2 "Y"
                        2 3 "X"
                        end
                        
                        *USING
                        list, sepby(pid)
                        
                        clear
                        input float(pid) str10(visit_no gender test_result)
                        1 1 "female" "X"
                        1 2 "female" "Y"
                        1 3 "female" "Z"
                        2 1 "male" "Z"
                        2 2 "male" "Y"
                        2 3 "male" "X"
                        end
                        
                        *COMBINED
                        list, sepby(pid)

                        Res.:

                        Code:
                        . *MASTER
                        
                        .
                        . list
                        
                             +--------------+
                             | pid   gender |
                             |--------------|
                          1. |   1   female |
                          2. |   2     male |
                             +--------------+
                        
                        
                        .
                        . *USING
                        
                        .
                        . list, sepby(pid)
                        
                             +---------------------------+
                             | pid   visit_no   test_r~t |
                             |---------------------------|
                          1. |   1          1          X |
                          2. |   1          2          Y |
                          3. |   1          3          Z |
                             |---------------------------|
                          4. |   2          1          Z |
                          5. |   2          2          Y |
                          6. |   2          3          X |
                             +---------------------------+
                        
                        .
                        
                        . *COMBINED
                        
                        .
                        . list, sepby(pid)
                        
                             +------------------------------------+
                             | pid   visit_no   gender   test_r~t |
                             |------------------------------------|
                          1. |   1          1   female          X |
                          2. |   1          2   female          Y |
                          3. |   1          3   female          Z |
                             |------------------------------------|
                          4. |   2          1     male          Z |
                          5. |   2          2     male          Y |
                          6. |   2          3     male          X |
                             +------------------------------------+
                        In a case such as the one above, if you want the combined dataset to have only two observations like the master dataset, then the question becomes which ones to keep. Perhaps there is a second variable in both the master and using datasets that will allow for a 1:1 match, such as the combination of PID and date. This could solve the selection issue for you. The bottom line is to understand your data thoroughly and then make a decision on how to proceed.
                        Last edited by Andrew Musau; 16 Jul 2024, 10:09.



                        • #13
                          Benno Schoenberger I provided example data from my using dataset in my first post in this thread (#1 above).

                          I understand what you mean about duplicates and how merging works (at least at a basic level). I can't exclude any duplicates in my using file because, as Andrew points out, they represent repeated hospital admissions for the same individuals. I need help merging these two datasets, as my usual merging strategy doesn't seem to work with this newly received data file. I'm not skilled enough in data handling to understand why.


                          Andrew Musau Yes, you are correct. As explained in my post #9, these are ICD-10 (diagnosis) codes and NCMP (procedure) codes from different hospital admissions in the same individuals, collected from a national registry. This means that one individual could have had multiple visits, i.e., several lines per PID. I want to combine this with data in my master dataset, where each individual has all variables recorded in wide format, one line per PID. I posted an example dataset from my master datafile below. The result I want is that the ICD-10 and NCMP codes from each visit end up as new variables linked to the same PID.

                          For instance, in the using dataset, Patient 1 has been admitted with COPD exacerbation (coded with a certain ICD10 code) three times, recorded as follows (long form):

                          Patient 1 - admission date 1 - ICD10
                          Patient 1 - admission date 2 - ICD10
                          Patient 1 - admission date 3 - ICD10

                          When I merge it with my master dataset, I want the structure to be like this (wide form):

                          Patient 1 | admission date 1 - admission date 2 - admission date 3 | ICD10_1 ICD10_2 ICD10_3

                          How can I achieve this?
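                          [Editorial note: one possible route, sketched under the assumption that admissions can simply be numbered within each patient; filenames are placeholders and the code is untested against the real data:]
                          Code:
                          use "registry_data.dta", clear                   // placeholder for the NPR file
                          sort id admissiondateYYMM ICD10orNCMP
                          by id: gen visit = _n                            // the j variable -reshape- needs
                          reshape wide admissiondateYYMM ICD10orNCMP, i(id) j(visit)
                          tempfile registry_wide
                          save `registry_wide'

                          use "master_data.dta", clear                     // placeholder for the cohort file
                          merge 1:1 id using `registry_wide'
                          Note that with up to roughly 24 admissions per person this creates many new variables, which is why the thread's advice is to stay long unless a wide layout is truly required.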



                          Dataex from master data:

                          Code:
                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input float id double(Sex BMI_1995 BMI_2005 BMI_2015)
                            1 0    . 19.7    .
                            2 1 26.8 28.8    .
                            3 0 27.1 26.6 24.3
                            4 0 19.7    .    .
                            5 1 26.4 28.2 31.5
                            6 1 26.4 28.5 27.2
                            7 0    . 26.9   27
                            8 1 21.7    .    .
                            9 0    .    . 20.9
                           10 0 21.3    .    .
                           11 0 26.3    . 25.3
                           12 0 27.7 33.4 31.1
                           13 1 27.5   27 26.4
                           14 1 28.2    .    .
                           15 1 28.3    .    .
                           16 0    .    .    .
                           17 1 29.1    .    .
                           18 0    .    .    .
                           19 1 28.1 35.9   36
                           20 1    . 25.4    .
                           21 1 27.5 30.6 28.5
                           22 1 31.2 34.5 32.9
                           23 1 34.5 28.9    .
                           24 1 25.4 28.7    .
                           25 0 25.7    .    .
                           26 1    .    .   18
                           27 1 29.9    .    .
                           28 0 29.4 31.8    .
                           29 1 24.7 24.5    .
                           30 0    . 25.2 25.6
                           31 1 23.5 25.6 29.6
                           32 0 30.3 35.9    .
                           33 0 28.2 21.3 17.5
                           34 0 18.7    .    .
                           35 1 32.6 34.3    .
                           36 1 29.6    .    .
                           37 0 20.8 24.6 26.2
                           38 1 23.3    .    .
                           39 1    .    .    .
                           40 0 20.9 20.4 20.1
                           41 0 31.3    .    .
                           42 1    . 39.6   38
                           43 0    . 25.9 24.3
                           44 1 26.3 26.7 25.8
                           45 0    .    . 21.8
                           46 0    . 25.1    .
                           47 1    . 26.4 26.7
                           48 1 25.3    .    .
                           49 1 25.2    .    .
                           50 0    .    . 22.1
                           51 0    .    . 26.6
                           52 1   23    .    .
                           53 0 27.5    .    .
                           54 0 28.4    .    .
                           55 0   31 32.7 31.7
                           56 1    .    . 23.8
                           57 1 23.3    .    .
                           58 0    .    . 26.6
                           59 0    . 24.9 23.1
                           60 0 24.4 23.3 25.7
                           61 0 28.5 30.4    .
                           62 0 25.5    .    .
                           63 0 23.5 25.2 26.1
                           64 1    .    . 23.1
                           65 0 46.2    .    .
                           66 0    .    .    .
                           67 0    .    . 24.7
                           68 1 28.4    .    .
                           69 1 24.5    .    .
                           70 1 28.5    .    .
                           71 0    .    . 17.8
                           72 0 18.9    .    .
                           73 0    .    . 21.4
                           74 0    . 25.5   29
                           75 1    . 24.6    .
                           76 1    . 28.9    .
                           77 0 28.7    .    .
                           78 0 18.2    .    .
                           79 1 27.6 28.5 27.1
                           80 1    .    . 31.9
                           81 0 24.8 26.3   27
                           82 1 27.3 29.9 26.6
                           83 1 27.3 26.9 29.9
                           84 1    . 28.1    .
                           85 0   27    .    .
                           86 0   29    .    .
                           87 0 21.3    .    .
                           88 1    . 32.6    .
                           89 1 28.6    .    .
                           90 0 26.1 28.2    .
                           91 0 35.2    .    .
                           92 1    .    . 19.8
                           93 1    .   23 22.9
                           94 1    .    .    .
                           95 0    . 26.2    .
                           96 1 27.7    .    .
                           97 0    .    . 37.4
                           98 0    .    .    .
                           99 0    .    . 26.2
                          100 0 25.3    . 20.1
                          end
                          label values Sex Sex
                          label def Sex 0 "Kvinne", modify
                          label def Sex 1 "Mann", modify

                          Dataex from using data:

                          Code:
                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input float id str7 admissiondateYYMM str6 ICD10orNCMP
                          1 "2021-04" "D12"  
                          1 "2013-08" "FYDE31"
                          1 "2009-10" "M75"  
                          1 "2009-10" "I49"  
                          1 "2016-10" "I42"  
                          1 "2009-10" "FPFE30"
                          1 "2020-11" "I42"  
                          1 "2019-05" "M84"  
                          1 "2013-06" "I49"  
                          1 "2019-06" "M84"  
                          1 "2013-06" "FPFE30"
                          1 "2009-11" "M75"  
                          1 "2013-08" "I42"  
                          1 "2009-09" "I42"  
                          1 "2019-07" "M84"  
                          1 "2009-09" "FYDE30"
                          2 "2009-12" "D48"  
                          2 "2016-06" "M24"  
                          2 "2009-07" "M77"  
                          2 "2015-12" "E88"  
                          3 "2019-12" "M79"  
                          3 "2019-12" "M79"  
                          3 "2019-12" "M79"  
                          3 "2020-01" "M79"  
                          3 "2022-09" "K58"  
                          3 "2010-10" "K58"  
                          3 "2020-01" "M79"  
                          3 "2020-01" "M79"  
                          3 "2019-12" "M79"  
                          4 "2017-04" "K09"  
                          4 "2015-01" "K92"  
                          4 "2015-01" "K57"  
                          4 "2015-01" "K21"  
                          4 "2015-01" "K51"  
                          4 "2015-01" "K92"  
                          4 "2016-10" "K09"  
                          4 "2015-01" "D62"  
                          4 "2015-01" "K57"  
                          4 "2015-01" "D62"  
                          4 "2017-11" "K01"  
                          4 "2015-01" "K51"  
                          4 "2017-05" "K09"  
                          4 "2015-02" "D12"  
                          4 "2015-01" "K92"  
                          4 "2015-01" "K21"  
                          4 "2016-10" "K01"  
                          4 "2015-01" "D62"  
                          4 "2017-04" "K09"  
                          4 "2015-01" "K21"  
                          5 "2020-05" "K58"  
                          6 "2022-02" "I48"  
                          6 "2008-08" "K21"  
                          6 "2012-07" "M79"  
                          6 "2011-10" "FPFE15"
                          6 "2011-12" "I84"  
                          6 "2011-08" "I10"  
                          6 "2022-03" "I49"  
                          6 "2023-03" "I48"  
                          6 "2016-09" "FPFE15"
                          6 "2018-02" "K29"  
                          6 "2022-02" "I48"  
                          6 "2008-10" "I49"  
                          6 "2022-02" "FPFE30"
                          6 "2022-02" "FPFE30"
                          6 "2022-03" "I49"  
                          6 "2016-09" "FPFE50"
                          6 "2020-08" "M13"  
                          6 "2022-03" "FPFE15"
                          6 "2021-06" "M13"  
                          6 "2020-06" "M13"  
                          6 "2012-01" "I84"  
                          6 "2022-02" "I48"  
                          6 "2018-07" "M19"  
                          6 "2014-02" "M79"  
                          6 "2016-09" "I48"  
                          6 "2017-12" "K58"  
                          6 "2016-05" "K58"  
                          6 "2016-03" "M19"  
                          6 "2008-08" "K29"  
                          6 "2020-11" "K64"  
                          6 "2011-08" "FPFE50"
                          6 "2016-11" "I48"  
                          6 "2021-02" "K57"  
                          6 "2023-03" "FPFE30"
                          6 "2023-03" "I49"  
                          6 "2011-10" "I10"  
                          7 "2021-10" "D47"  
                          7 "2010-09" "O65"  
                          7 "2020-11" "C90"  
                          7 "2019-10" "D47"  
                          7 "2022-08" "D47"  
                          7 "2023-06" "D47"  
                          7 "2014-12" "J35"  
                          7 "2020-01" "D47"  
                          7 "2010-09" "O99"  
                          8 "2015-04" "M67"  
                          9 "2023-04" "I50"  
                          9 "2023-04" "GDFC00"
                          9 "2023-04" "I50"  
                          9 "2008-05" "I48"  
                          end
                          Last edited by Sigrid Vikjord; 16 Jul 2024, 15:27.

                          Comment


                          • #14
Well, in your example data for the using data set there are numerous sets of completely duplicate observations. For example, three observations all have id = 3, admission date December 2019, and ICD10orNCMP = "M79"; that is just one of many such sets. You say you cannot remove any duplicates, but why not? The duplicates contain no information that is not found in the original observation, and there is no meaning to giving a patient the same diagnosis code on the same date. It just doesn't make any sense. It seems to me there are a few possibilities:
                            1. As mentioned previously, this situation can arise from defective data management leading to surplus duplicate observations.
2. Another possibility arises because your "date" is actually just a month and year. It is conceivable that each observation represents a separate visit, and it is plausible that the same person might have multiple visits in the same month with the same diagnosis. The fact that the data refer to admission dates, however, makes this less likely: few people have multiple hospital admissions in the same month. So I'm skeptical. But you know where the data come from and what they are supposed to mean.
                            Notwithstanding my doubts, I'm going to assume that reason 2 applies here, as it makes it possible to go forward. But if reason 2 is not really what's happening, you are just building a data set that is teeming with garbage.
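Before committing to that assumption, it may be worth quantifying how many surplus copies there actually are. A minimal check, assuming the using data set (with the variable names shown in the dataex excerpt) is in memory:

Code:
* count and list observations that are complete copies of another observation
duplicates report id admissiondateYYMM ICD10orNCMP
duplicates tag id admissiondateYYMM ICD10orNCMP, gen(surplus)
list id admissiondateYYMM ICD10orNCMP if surplus, sepby(id)
drop surplus

If the report shows that surplus observations are rare and scattered, defective data management (reason 1) becomes the more plausible explanation.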

                            Code:
                            use `using_data', clear
                            
                            // CHANGE STRING DATE VARIABLE TO NUMERIC
                            gen admission_date = monthly(admissiondateYYMM, "YM")
                            assert missing(admission_date) == missing(admissiondateYYMM)
                            format admission_date %tmMonCCYY
                            drop admissiondateYYMM
                            
                            // PREPARE DATA FOR RESHAPING TO WIDE
                            sort id admission_date, stable
                            by id (admission_date): gen seq = _n
                            
reshape wide admission_date ICD10orNCMP, i(id) j(seq)
                            
                            merge 1:1 id using `master_data'
                            Replace the references to `master_data' and `using_data' by the actual names (without surrounding ` ') of the two data sets.

Notes: Your admissiondateYYMM variable encodes monthly date information as a string. The encoding you have used is somewhat workable, in that it at least sorts correctly into chronological order, but it is useless for calculations such as the elapsed time between two visits. It is therefore better to convert this string variable into a Stata internal-format numeric date variable.
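To illustrate the elapsed-time point: once admission_date is a numeric monthly date, the difference between two values is simply a count of months. A sketch, to be run while the data are still long (variable names as in the code above):

Code:
sort id admission_date
by id (admission_date): gen months_since_prev = admission_date - admission_date[_n-1]

No such subtraction is possible with string dates like "2009-10".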

The end result is what you said you wanted. You will probably come to regret it, because this data set is unwieldy. Even in the example data it takes 72 variables to capture the admission dates and codes for each person, and in the full data set there will likely be more. Tasks like identifying patients who have particular codes, or patients seen in certain time intervals, will be awkward and cumbersome in this layout. The original long layout is much better for almost everything you would want to do in Stata. But this is what you asked for, and now you have it.
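For contrast, here is a sketch of the kind of one-liners the long layout permits; the code "I48" and the date cutoff are just illustrative examples:

Code:
* flag every patient who ever received code I48
by id, sort: egen ever_I48 = max(ICD10orNCMP == "I48")

* keep only admissions from January 2015 onward
keep if admission_date >= tm(2015m1)

In the wide layout, each of these would require looping over dozens of variables.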


                            Comment


                            • #15
I can imagine where the duplicates in the using data come from. The data set posted here is presumably only a section of a larger one, which probably contains other variables such as the exact day of the hospital visit. If so, there are probably no duplicates in the entire data set and you can proceed with Clyde's code. But if these observations are perfect duplicates that carry no additional information, and therefore offer no further potential for analysis, I would recommend eliminating them before reshaping, something Clyde did not do here:
                              Code:
                              duplicates drop
                              or if you want to inspect them first:
                              Code:
duplicates tag, gen(dups)
                              br if dups                        // stop execution after this line ;-)
                              bysort id admission_date ICD10orNCMP dups: keep if _n==1
                              drop dups
And if there are other variables in the full data set that differ between the duplicates shown in the example, you will probably have to take further data-preparation steps before using Clyde's code: it might run without errors but would not produce the desired results.
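If the duplicates do turn out to encode repeat visits, one such preparation step would be to keep a single observation per id-month-code combination while preserving the visit count. A sketch:

Code:
contract id admission_date ICD10orNCMP, freq(n_visits)

contract replaces the data with one observation per distinct combination plus a frequency variable, so the number of visits is not lost before reshaping.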

                              Clyde Schechter
A question on my own behalf: why do you use the stable option when sorting? That is unusual in my experience. Or is it just personal style?

                              Comment
