Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • obtaining distinct count of variables in long data

    Hello,

    I am hoping for advice on obtaining count data from a dataset in which individual data is located in multiple rows.A complication is that two DiagnosisIDs may be made on the same date.

    In the example below I want to (a) obtain a count of distinct DiagnosisIDs by StudyID and (b) identify by StudyID a count of instances when DiagnosisID has changed from the first DiagnosisID made .

    A dummy data example is below. In this example StudyID 6283 has 1 distinct DiagnosisID value while StudyID 3757 has 5. 6283 demonstrated there was no change in DiagnosisID, whereas there was for 3757.

    I will eventually look at changes within orders, but being about to obtain the count data would be a very useful step. List is helpful, however, I want to be able to merge the count data into another data file.

    Any assistance is appreciated.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int StudyID str10 OrderMadeDate str9 OffenceDate str10 DiagnosisDate int DiagnosisID
    6283 "5/26/2016"  "4/17/2014" "9/23/2013"   1519
    6283 "5/26/2016"  "4/17/2014" "12/24/2013"  1519
    6283 "5/26/2016"  "4/17/2014" "11/10/2013" 1519
    3757 "12/03/2016" "2/07/2016" "8/29/2013"   1527
    3757 "12/03/2016" "2/07/2016" "7/17/2014"   1779
    3757 "12/03/2016" "2/07/2016" "11/19/2015"  1779
    3757 "12/03/2016" "2/07/2016" "11/26/2015"  1779
    3757 "12/03/2016" "2/07/2016" "10/24/2015"  1780
    3757 "12/03/2016" "2/07/2016" "8/29/2015"   1784
    3757 "12/03/2016" "2/07/2016" "9/26/2015"   1784
    3757 "12/03/2017" "2/07/2017" "10/08/2016"  1784
    3757 "12/03/2016" "2/07/2016" "9/21/2015"   1800
    end

  • #2
    Thanks for the data example. https://www.stata-journal.com/articl...article=dm0042 gives a survey of technique in this territory. dm0042 is thus revealed as an otherwise unpredictable search term that will identify many related threads on Statalist.

    Code:
    . egen Tag = tag(DiagnosisID StudyID)
    
    . egen Distinct = total(Tag), by(StudyID)
    
    .
    . tabdisp StudyID, c(Distinct)
    
    ----------------------
      StudyID |   Distinct
    ----------+-----------
         3757 |          5
         6283 |          1
    ----------------------

    Comment


    • #3
      Nick many thanks. I'll experiment and see if I can build on this and add the total number of DiagnosisID so I can identify instances where may be change. The levelsof command sounds useful, though the article identifies the problem of obtain too many values which is likely. Thanks again.

      Comment


      • #4
        I don't see why levelsof could help here, as you want to create a new variable. or why further experiment is required. The code in #2 solves (a) in #1 so far as I can see. The number of (distinct) diagnoses different from the first is just one fewer than the number of (distinct) diagnoses. If you've some different definition in mind please be specific on what it is.

        Comment


        • #5
          Nick thanks. This is step one of several steps, so this is what I inelegantly meant by experimentation. I ran the same code on another variable, Diagnosis dates and can now look at distinct diagnoses and the number of distinct dates (times) a diagnosis was made.

          The next step and this is where I incorreclty thought levelsof may have been of use, I want to look at the different diagnoses to see if the diferences are substantive or not. For example, being diagnosed at time 1 with a bipolar disorder in an acute state and at time 2 bipolar disorder in remission, is very different to being diagnoses with bipolar at time 1 and intoxication at time 2. DiagnosisID contains a code of diagnoses.

          Nested within StudyID is OffenceDate, of which there can be a large number. For each OffenceDate there are diagnosis dates in the previous 12 months. These 12 month periods can overlap so where there are multiple offences there is a lot of data. Obtaining distinct counts is my first step in trying to make sense of this

          There are several questions I have:

          1) For each StudyID identifying instances where there was more than one DiagnosisID for a DiagnosisDate? Two diagnoses on the same date could be mistaken for diagnosis change.
          2) a count of distinct DiagnosisID by StudyID

          I think answering 1 & 2, can address the nesting issue above.

          Comment


          • #6
            Isn't 2) of #5 the same as already answered?

            1) looks to be the same stuff with different variables:

            Code:
            egen Tag2 = tag(DiagnosisID DiagnosisDate StudyID)
            
            egen Distinct2 = total(Tag2), by(StudyID DiagnosisDate)
            You want observations with more than 1 as the value.

            Comment


            • #7
              Thanks again Nick. This code detected one instance of a seeming change that isn't, which I'll need to factor into analysis.

              I see I wasn't clear enough. By " count of distinct DiagnosisID by StudyID" I meant a count of the individual DiagnosisIDs. This is what I had in mind. In the real data there are 71 distinct StudyID and 65 DiagnosisID, with a maximum of 7 diagnoses.


              dx1 dx2 dx3 dx4 dx5 dx6 dx7
              3757 1527 1779 1780 1784 1800 0 0
              6283 1519 0 0 0 0 0 0
              Last edited by Bob Green; 06 May 2018, 22:45.

              Comment


              • #8
                Sorry, you've lost me. Here for example are variables dx1 on which didn't appear before. #1 gave me the impression that separate diagnoses were given in separate observations. Are you now saying they are in different variables?

                Also the wording

                By " count of distinct DiagnosisID by StudyID" I meant a count of the individual DiagnosisIDs
                doesn't help me. Perhaps you should give a very simple example with the numbers you expect for new variables with the rules for them. Otherwise I am not helping you, it seems.

                Comment


                • #9
                  Sorry, I can see I should have explained the situation in more clearly (it was in my head). In the example above, StudyID 6283 has 1 distinct diagnosis 1519 and StudyID 3757 has 5 distinct diagnoses: 1527 1779 1780 1784 1800

                  Considering output that counts StudyID by DiagnosiID, StudyID would be one variable. In this example, to count diagnosis there would need to be an additional 5 columns (one for each of the five DiagnosisID values). Lets call these var1 to var5 (I used dx 1-dx7 - which I now see wasn't helpful).

                  Both StudyID cases would have a value in var 1 (1519 and 1527 respectively). StudyID 6283 has no further diagnoses, so the remaining spaces could be left blank (I added a 0, again without making it clear what I had done). For StudyID 3757, var2 to var5 would have the values: 1779 1780 1784 1800 respectively. Is this clearer?

                  Comment


                  • #10
                    What you're asking for can be achieve with reshape wide. I am not clear why you want that layout. The long layout you have already is better for most Stata
                    purposes.

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input int StudyID str10 OrderMadeDate str9 OffenceDate str10 DiagnosisDate int DiagnosisID
                    6283 "5/26/2016"  "4/17/2014" "9/23/2013"   1519
                    6283 "5/26/2016"  "4/17/2014" "12/24/2013"  1519
                    6283 "5/26/2016"  "4/17/2014" "11/10/2013" 1519
                    3757 "12/03/2016" "2/07/2016" "8/29/2013"   1527
                    3757 "12/03/2016" "2/07/2016" "7/17/2014"   1779
                    3757 "12/03/2016" "2/07/2016" "11/19/2015"  1779
                    3757 "12/03/2016" "2/07/2016" "11/26/2015"  1779
                    3757 "12/03/2016" "2/07/2016" "10/24/2015"  1780
                    3757 "12/03/2016" "2/07/2016" "8/29/2015"   1784
                    3757 "12/03/2016" "2/07/2016" "9/26/2015"   1784
                    3757 "12/03/2017" "2/07/2017" "10/08/2016"  1784
                    3757 "12/03/2016" "2/07/2016" "9/21/2015"   1800
                    end
                    
                    drop *Date 
                    duplicates drop 
                    sort StudyID, stable 
                    by StudyID : gen Which = _n 
                    reshape wide DiagnosisID, i(StudyID) j(Which) 
                    
                    list 
                    
                         +----------------------------------------------------------------+
                         | StudyID   Diagno~1   Diagno~2   Diagno~3   Diagno~4   Diagno~5 |
                         |----------------------------------------------------------------|
                      1. |    3757       1527       1779       1780       1784       1800 |
                      2. |    6283       1519          .          .          .          . |
                         +----------------------------------------------------------------+

                    Comment


                    • #11
                      Many thanks Nick. With a maximum of 7 columns I can readily see the story of how an individual's diagnosis has changed/not changed. In the case of #3757 there has been a substantive change from Diagnosis1 to Diagnosi2, the remaining diagnoses are variants of diagnosis 2. In relation to the actual data, instead of over 1200 rows I can now easily review 71 rows.

                      Comment

                      Working...
                      X