Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping?

    Hi,

    In my datasheet (example attached) there are in some cases more than one observation for each record. Can I somehow combine each record into one observation with more than one variable?
    I tried using reshape without any luck. To me it seems like it should be an easy enough task, but I spend a lot of time now trying the reshape command etc. I am not even sure it's the command to use..
    The best layout would be something like: RECORD, DIAG1, DIAG2, DIAG3....

    Sincerely
    Victor
    Attached Files

  • #2
    Code:
    by v_recnum, sort: gen _j = _n
    reshape wide c_diag, i(v_recnum) j(_j)
    will do this.

    Please read the FAQ, especially #12, for excellent advice about how to post data examples (and code and output). In particular, it clearly states specificallynot to use screenshots. They are often unreadable. Even when they are readable, as yours turns out to be, were it necessary to import the data to Stata to test out some code to solve your problem, a screenshot is useless. In the future, always use the -dataex- command to show Stata data examples.

    Comment


    • #3
      Thank you Clyde, it worked very well.
      I read the FAQ and will keep this in mind for the future.

      Comment


      • #4
        What should I do, if I wanted to do the same command in alphabetical or numerical order?

        Comment


        • #5
          It would be almost the same as in #2. Just change fhe first command to:
          Code:
          bysort v_recnum (c_diag): gen _j = _n
          That way the numerical order established when _j is generated would be identical to the sort order of c_diag.

          Comment


          • #6
            Thank you for your previous helpful answers.
            I have now encountered another problem. If i want to combine my observations based on a unique ID (example ID 1 to 3). How would I do that most easily? I have deleted all identical duplicates already. Following command gives me the error:
            variable c_diag11 already defined
            r(110);


            Code:
            by v_cpr, sort: gen _j = _n
            reshape wide c_diag1 c_diag2 c_diag3 c_diag4 c_diag5 c_diag6 c_diag7 c_diag8 c_diag9 c_diag10 c_diag11 c_diag12 c_diag13, i(v_cpr) j(_j)
            Code:
            input double v_cpr str8(c_diag1 c_diag2 c_diag3 c_diag4 c_diag5 c_diag6 c_diag7 c_diag8 c_diag9 c_diag10 c_diag11 c_diag12 c_diag13)
            1 "DI109"  "DI489"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            1 "DI489"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            1 "DI489"  "DR060"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            1 "DR060"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            1 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "40199"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "41099"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "53290"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "54090"  "78909"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "78909"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DE668"  "DI109"  "DI209"  "DN289A" "DZ035" ""      ""      ""      ""      ""      ""      "" ""
            2 "DE780"  "DI109"  "DI208"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DH350"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DI130"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DI208"  "DI209"  "DI251"  ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DI208"  "DI251"  "DR079"  ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DM51"   "DM511D" ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DN031"  "DZ039"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DR079"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DI109"  "DZ034"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DM109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DM99"   "DM993"  "DZ038"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DN409"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DR391"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DZ016"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DZ035"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            2 "DZ769"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            3 "DN300"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            3 "DN810"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            3 "DO049"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            3 "DO200"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            3 "DO620"  "DO680"  "DO713"  "DO721"  "DO752" "DO802" "DO864" "DO990" "DZ291" "DZ370" "DZ390" "" ""
            3 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
            
            end

            Comment


            • #7
              The problem is with your variable names. You start out with a variables c_diag1 through c_diag13. Now you try to reshape wide, with the j() variable taking numeric values 1 through 24. When Stata tries to do the -reshape-, it encounters a situation where it is asked to create a new variable c_diag11 (the result of postfixing _j = 1 to c_diag1) and finds that there is already a variable by that name. So you need to change the c_diag* variable names to avoid this. The following will work:

              Code:
              by v_cpr, sort: gen _j = _n
              rename c_diag* =_
              reshape wide c_diag*_, i(v_cpr) j(_j)
              By sticking an underscore character after the original number, you avoid name clashes.

              I am once again wondering why you are doing this. Just in this example (and I fear your real data has even more variables) you are creating a new data set that has 313 variables, and most of the observations in it will be missing values. It is really hard for me to imagine how this can be useful for any kind of analysis.

              Comment


              • #8
                Thank you once again.

                Well, I must admit I am very new to Stata and these solutions might seem useless or odd. The thing is, I would like a data set with one observation pr unique ID as I have data from various sources (registries) that I want to combine. I was suggested to do so by colleagues. However, they are not very experienced with Stata either.

                In this example, my thought was to create unique observations with all matching values, delete the duplicates and blanks if possible.

                Hope to hear from you

                Comment


                • #9
                  When you have multiple observations on the same people, in Stata it is usually more effective to have each observation separate, rather than "stacking them horizontally" into a single observation. Not always, but usually. There is nothing inherently wrong with what you are doing, but I suspect that as you try to do more work with it you will find it cumbersome.

                  Let me give the benefit of the doubt that in your situation having all the diagnoses for a given person in a single observatoin is, in fact, the best way to proceed. (This sounds plausible to me.) Still, having so many of those observations consisting of missing values interspersed between non-missing values is almost certainly going to cause problems. So consider this alternative:

                  Code:
                  clear
                  input double v_cpr str8(c_diag1 c_diag2 c_diag3 c_diag4 c_diag5 c_diag6 c_diag7 c_diag8 c_diag9 c_diag10 c_diag11 c_diag12 c_diag13)
                  1 "DI109"  "DI489"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  1 "DI489"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  1 "DI489"  "DR060"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  1 "DR060"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  1 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "40199"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "41099"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "53290"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "54090"  "78909"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "78909"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DE668"  "DI109"  "DI209"  "DN289A" "DZ035" ""      ""      ""      ""      ""      ""      "" ""
                  2 "DE780"  "DI109"  "DI208"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DH350"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DI130"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DI208"  "DI209"  "DI251"  ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DI208"  "DI251"  "DR079"  ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DM51"   "DM511D" ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DN031"  "DZ039"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DR079"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DI109"  "DZ034"  ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DM109"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DM99"   "DM993"  "DZ038"  ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DN409"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DR391"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DZ016"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DZ035"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  2 "DZ769"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  3 "DN300"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  3 "DN810"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  3 "DO049"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  3 "DO200"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  3 "DO620"  "DO680"  "DO713"  "DO721"  "DO752" "DO802" "DO864" "DO990" "DZ291" "DZ370" "DZ390" "" ""
                  3 "DZ039"  ""       ""       ""       ""      ""      ""      ""      ""      ""      ""      "" ""
                  
                  end
                  
                  gen long obs_no = _n
                  reshape long c_diag, i(obs_no)
                  drop if missing(c_diag)
                  by v_cpr (c_diag), sort: replace _j = _n
                  drop obs_no
                  reshape wide c_diag, i(v_cpr) j(_j)
                  The resulting data set has all of the diagnosis codes for an individual in the same observation. But there are only 33 such variables, and most of the values are not missing. Moreover, the diagnosis codes are sorted in alphabetical order, with the missing values appearing at the end. This will probably prove easier to work with. What you lose with this layout is whatever information is proxied by the separate observations per person in the original data. (If there are other variables besides the c_diag* and v_cpr, and if those are fixed patient attributes that are, therefore, constant within all observations carring the same v_cpr, the -reshape-s will carry those along undisturbed. What you may lose is if there are other variables that are not constant among the observations for the same person which are in the data set, or which are not variables but are implied by virtue of a certain observation being the first vs second vs third... for a given person.)
                  Last edited by Clyde Schechter; 01 Dec 2017, 08:15.

                  Comment


                  • #10
                    Dear Clyde,

                    Again thank you. Your answers have taught me a lot. I have finally realized that it is not necessarily beneficial to "stack my observations horizontally". However, in my following question I think it is necessary. I have used the reshape command a couple of times with good outcomes. I am now trying to reshape the following data to a wide format to get a better overview of the text in v_fritekst. I would like k_rekvnr (ID) and k_matnr as my i(option), but they are not unique as each ID may have more than one k_matnr (material). k_matnr is the identifier for each material linking the different sequences (k_sekvensnr).

                    Code:
                    k_rekvnr k_matnr k_sekvensnr v_fritekst
                    "ID1" 1  1 "[01] Blodtilblandet aspirat indeholdende talrige bare kerner,"          
                    "ID1" 1  2 "makrofager, lymfocytter og dårligt bevaret epitheliale celler. Enkelte"
                    "ID1" 1  3 "af kernerne forekommer relativt store, men da langt de fleste af"        
                    "ID1" 1  4 "cellerne er i varierende grad af opløsning, kan de ikke betegnes som"  
                    "ID1" 1  5 "andet end atypiske, og histologisk undersøgelse tilrådes.[01] xxxx.:"  
                    "ID1" 1  6 "Autolysepræget finnålsaspirat med enkelte atypiske epitheliale"        
                    "ID1" 1  7 "celler10.10.2007Med venlig hilsenxxxx/su"                
                    "ID2" 1  1 "ANTAL KARYOTYPERET: 25     NORMALE: 25     UNORMALE: 0 KARYOTYPE:"      
                    "ID2" 1  2 "46,XX"                                                                  
                    "ID2" 2  1 "Lymfeknude med talrige, størrelsesvarierende og stedvis"                
                    "ID2" 2  2 "sammenflydende, epiteloidcellegranulomer med centrale, kasseøse"        
                    "ID2" 2  3 "nekroser. Der ses kun enkelte kæmpeceller med randstillede kerner. Der"
                    "ID2" 2  4 "har ikke kunnet påvises syrefaste stave, men det udelukker ikke"        
                    "ID2" 2  5 "muligheden af TB. Der er ingen tegn på lymfom eller malignitet"        
                    "ID2" 2  6 "iøvrigt."                                                              
                    "ID3"  1  1 "2 glas med udstryg af aspirat fra lymfeknude, bestående af store og"    
                    "ID3"  1  2 "små lymfocytter samt enkelte immunoblastlignende celler. Der er ingen"  
                    "ID3"  1  3 "epitelceller eller tegn på malignitet i øvrigt.  xxx"                                                          
                    "ID4"  1  1 "01+02: Lymfeknudevæv med fokale, større og mindre, nekrotiserende,"    
                    "ID4"  1  2 "granulomatøse processer, delvis associeret med epiteloide makrofager"  
                    "ID4"  1  3 "og multinukleære kæmpeceller af Langhansk type. Granulomerne synes at"
                    "ID4"  1  4 "repræsentere forskellige udviklingsstadier, idet nogle er relativt"    
                    "ID4"  1  5 "små, stellate, andre partielt ophelede med fibrose. 6.2.: Mikroskopi"  
                    "ID4"  1  6 "af Ziehl-Neelsen farvde snit afslører ikke syrefaste stave.xxx"
                    Something like this would be favorable:

                    Code:
                    k_rekvnr k_ matnr k_sekvensnr1 v_fritekst1 k_sekvensnr2 v_fritekst2 ........
                    ID1      1         1            text      2             text
                    ID2      1         1            text      2             text
                    ID2      2         1            text      2             text
                    I know this might seem like an odd question, but I hope you can help me once again.


                    Sincerely,
                    Victor
                    Last edited by Victor Dahl Mathiasen; 18 Dec 2017, 05:16.

                    Comment


                    • #11
                      Actually I've solved it in another way now. Anyways, if you see an easy solution I would be pleased to hear so

                      Comment


                      • #12
                        Well, to reshape wide, it is not necessary for the variable in the i() option to uniquely identify observations. In fact, if they do, the -reshape wide- has no effect at all. This is a one-liner:

                        Code:
                        reshape long v_fritekst, i(k_rekvnr k_matnr) j(k_sekvensnr)
                        This does not create a series of k_sekvensnr variables like the ones you show in your desired output. But those variables serve no purpose anyway. It is always going to be the case that k_sekvensnr1 = 1 and k_sekvensnr2 = 2 and k_sekvensnr3 = 3, etc. So these "variables" are predictable constants, which means you don't need them at all.

                        In the future, please use -dataex- to post example data. The listing you showed in #10 took longer to import into Stata than it took to solve your problem. Please help those who want to help you.

                        Comment

                        Working...
                        X