obtaining distinct count of variables in long data

Bob Green

Join Date: Oct 2015

Posts: 85
#1

obtaining distinct count of variables in long data

05 May 2018, 06:47

Hello,

I am hoping for advice on obtaining count data from a dataset in which individual data is located in multiple rows.A complication is that two DiagnosisIDs may be made on the same date.

In the example below I want to (a) obtain a count of distinct DiagnosisIDs by StudyID and (b) identify by StudyID a count of instances when DiagnosisID has changed from the first DiagnosisID made .

A dummy data example is below. In this example StudyID 6283 has 1 distinct DiagnosisID value while StudyID 3757 has 5. 6283 demonstrated there was no change in DiagnosisID, whereas there was for 3757.

I will eventually look at changes within orders, but being about to obtain the count data would be a very useful step. List is helpful, however, I want to be able to merge the count data into another data file.

Any assistance is appreciated.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int StudyID str10 OrderMadeDate str9 OffenceDate str10 DiagnosisDate int DiagnosisID 6283 "5/26/2016" "4/17/2014" "9/23/2013" 1519 6283 "5/26/2016" "4/17/2014" "12/24/2013" 1519 6283 "5/26/2016" "4/17/2014" "11/10/2013" 1519 3757 "12/03/2016" "2/07/2016" "8/29/2013" 1527 3757 "12/03/2016" "2/07/2016" "7/17/2014" 1779 3757 "12/03/2016" "2/07/2016" "11/19/2015" 1779 3757 "12/03/2016" "2/07/2016" "11/26/2015" 1779 3757 "12/03/2016" "2/07/2016" "10/24/2015" 1780 3757 "12/03/2016" "2/07/2016" "8/29/2015" 1784 3757 "12/03/2016" "2/07/2016" "9/26/2015" 1784 3757 "12/03/2017" "2/07/2017" "10/08/2016" 1784 3757 "12/03/2016" "2/07/2016" "9/21/2015" 1800 end
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35641
#2

05 May 2018, 08:21

Thanks for the data example. https://www.stata-journal.com/articl...article=dm0042 gives a survey of technique in this territory. dm0042 is thus revealed as an otherwise unpredictable search term that will identify many related threads on Statalist.

Code:

. egen Tag = tag(DiagnosisID StudyID) . egen Distinct = total(Tag), by(StudyID) . . tabdisp StudyID, c(Distinct) ---------------------- StudyID | Distinct ----------+----------- 3757 | 5 6283 | 1 ----------------------
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#3

05 May 2018, 14:25

Nick many thanks. I'll experiment and see if I can build on this and add the total number of DiagnosisID so I can identify instances where may be change. The levelsof command sounds useful, though the article identifies the problem of obtain too many values which is likely. Thanks again.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35641
#4

06 May 2018, 03:01

I don't see why levelsof could help here, as you want to create a new variable. or why further experiment is required. The code in #2 solves (a) in #1 so far as I can see. The number of (distinct) diagnoses different from the first is just one fewer than the number of (distinct) diagnoses. If you've some different definition in mind please be specific on what it is.
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#5

06 May 2018, 16:22

Nick thanks. This is step one of several steps, so this is what I inelegantly meant by experimentation. I ran the same code on another variable, Diagnosis dates and can now look at distinct diagnoses and the number of distinct dates (times) a diagnosis was made.

The next step and this is where I incorreclty thought levelsof may have been of use, I want to look at the different diagnoses to see if the diferences are substantive or not. For example, being diagnosed at time 1 with a bipolar disorder in an acute state and at time 2 bipolar disorder in remission, is very different to being diagnoses with bipolar at time 1 and intoxication at time 2. DiagnosisID contains a code of diagnoses.

Nested within StudyID is OffenceDate, of which there can be a large number. For each OffenceDate there are diagnosis dates in the previous 12 months. These 12 month periods can overlap so where there are multiple offences there is a lot of data. Obtaining distinct counts is my first step in trying to make sense of this

There are several questions I have:

1) For each StudyID identifying instances where there was more than one DiagnosisID for a DiagnosisDate? Two diagnoses on the same date could be mistaken for diagnosis change.
2) a count of distinct DiagnosisID by StudyID

I think answering 1 & 2, can address the nesting issue above.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35641
#6

06 May 2018, 18:07

Isn't 2) of #5 the same as already answered?

1) looks to be the same stuff with different variables:

Code:

egen Tag2 = tag(DiagnosisID DiagnosisDate StudyID) egen Distinct2 = total(Tag2), by(StudyID DiagnosisDate)

You want observations with more than 1 as the value.
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#7

06 May 2018, 22:42

Thanks again Nick. This code detected one instance of a seeming change that isn't, which I'll need to factor into analysis.

I see I wasn't clear enough. By " count of distinct DiagnosisID by StudyID" I meant a count of the individual DiagnosisIDs. This is what I had in mind. In the real data there are 71 distinct StudyID and 65 DiagnosisID, with a maximum of 7 diagnoses.

dx1 dx2 dx3 dx4 dx5 dx6 dx7
3757 1527 1779 1780 1784 1800 0 0
6283 1519 0 0 0 0 0 0

Last edited by Bob Green; 06 May 2018, 22:45.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35641
#8

07 May 2018, 01:11

Sorry, you've lost me. Here for example are variables dx1 on which didn't appear before. #1 gave me the impression that separate diagnoses were given in separate observations. Are you now saying they are in different variables?

Also the wording

By " count of distinct DiagnosisID by StudyID" I meant a count of the individual DiagnosisIDs

doesn't help me. Perhaps you should give a very simple example with the numbers you expect for new variables with the rules for them. Otherwise I am not helping you, it seems.
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#9

07 May 2018, 05:53

Sorry, I can see I should have explained the situation in more clearly (it was in my head). In the example above, StudyID 6283 has 1 distinct diagnosis 1519 and StudyID 3757 has 5 distinct diagnoses: 1527 1779 1780 1784 1800

Considering output that counts StudyID by DiagnosiID, StudyID would be one variable. In this example, to count diagnosis there would need to be an additional 5 columns (one for each of the five DiagnosisID values). Lets call these var1 to var5 (I used dx 1-dx7 - which I now see wasn't helpful).

Both StudyID cases would have a value in var 1 (1519 and 1527 respectively). StudyID 6283 has no further diagnoses, so the remaining spaces could be left blank (I added a 0, again without making it clear what I had done). For StudyID 3757, var2 to var5 would have the values: 1779 1780 1784 1800 respectively. Is this clearer?
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35641

#10

07 May 2018, 16:34

What you're asking for can be achieve with reshape wide. I am not clear why you want that layout. The long layout you have already is better for most Stata
purposes.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int StudyID str10 OrderMadeDate str9 OffenceDate str10 DiagnosisDate int DiagnosisID
6283 "5/26/2016"  "4/17/2014" "9/23/2013"   1519
6283 "5/26/2016"  "4/17/2014" "12/24/2013"  1519
6283 "5/26/2016"  "4/17/2014" "11/10/2013" 1519
3757 "12/03/2016" "2/07/2016" "8/29/2013"   1527
3757 "12/03/2016" "2/07/2016" "7/17/2014"   1779
3757 "12/03/2016" "2/07/2016" "11/19/2015"  1779
3757 "12/03/2016" "2/07/2016" "11/26/2015"  1779
3757 "12/03/2016" "2/07/2016" "10/24/2015"  1780
3757 "12/03/2016" "2/07/2016" "8/29/2015"   1784
3757 "12/03/2016" "2/07/2016" "9/26/2015"   1784
3757 "12/03/2017" "2/07/2017" "10/08/2016"  1784
3757 "12/03/2016" "2/07/2016" "9/21/2015"   1800
end

drop *Date 
duplicates drop 
sort StudyID, stable 
by StudyID : gen Which = _n 
reshape wide DiagnosisID, i(StudyID) j(Which) 

list 

     +----------------------------------------------------------------+
     | StudyID   Diagno~1   Diagno~2   Diagno~3   Diagno~4   Diagno~5 |
     |----------------------------------------------------------------|
  1. |    3757       1527       1779       1780       1784       1800 |
  2. |    6283       1519          .          .          .          . |
     +----------------------------------------------------------------+

Comment

Bob Green

Join Date: Oct 2015

Posts: 85
#11

07 May 2018, 19:42

Many thanks Nick. With a maximum of 7 columns I can readily see the story of how an individual's diagnosis has changed/not changed. In the case of #3757 there has been a substantive change from Diagnosis1 to Diagnosi2, the remaining diagnoses are variants of diagnosis 2. In relation to the actual data, instead of over 1200 rows I can now easily review 71 rows.
Comment

Announcement

obtaining distinct count of variables in long data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment