  • merge long and wide dataset

    Hi all,

    I have multiple wide and long datasets stored in a single folder, and I am trying to write code to merge them all. Some datasets have one line per participant (unique id = identano), and others have multiple lines per participant (still identano for each participant, plus visitnum for the visit number: baseline, visit 1, etc.). I would like Stata to recognize automatically whether a dataset is wide or long and build my single working dataset.

    I wrote the code below, but it is still not working. Can someone here help me? I imagine there are also many ways to make this code simpler and shorter. Thanks so much to you all!

    Best,
    Pierre


    Code:
    local dir "xx"
    cd "`dir'"

    local files : dir "`dir'" files "*.dta"

    local base_datasets ""
    local visit_datasets ""

    foreach file of local files {
        use "`file'", clear

        capture confirm variable visitnum
        if !_rc {
            local visit_datasets "`visit_datasets' `file'"
        }
        else {
            local base_datasets "`base_datasets' `file'"
        }
    }

    foreach base_file of local base_datasets {
        if `base_merged' == 0 {
            use "`base_file'", clear
            local base_merged = 1
        }
        else {
            merge 1:1 identano using "`base_file'", assert(match using) nogenerate
        }
    }

    save base_temp.dta, replace

    local visit_merged = 0
    foreach visit_file of local visit_datasets {
        use "`visit_file'", clear

        reshape wide *, i(identano) j(visitnum)

        if `visit_merged' == 0 {
            save visit_temp.dta, replace
            local visit_merged = 1
        }
        else {
            merge 1:1 identano using visit_temp.dta, assert(match using) nogenerate
            save visit_temp.dta, replace
        }
    }

    use base_temp.dta, clear

    merge 1:1 identano using visit_temp.dta, assert(match using) nogenerate

    save dataset_final.dta, replace

    erase base_temp.dta
    erase visit_temp.dta

  • #2
    Here are some suggestions for a follow-up post that may help people answer your question:
    1) Describe in detail what you mean by "not working." There are many ways in which something can "not work," and answering without this knowledge is very difficult.
    2) To aid in #1, using -set trace on- would help. In fact, the information it gives might enable you to answer your own question.
    3) Explain how and where your local base_merged is assigned a value. (Perhaps you think base_merged has a default value of 0, but that is not true.) If any of your locals are assigned values outside the code shown here, the same advice applies to them.
    4) Show examples of some of your data files, using -dataex-. Among other things, this will help in understanding the features of your files that make it possible to distinguish your different types of files automatically.
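
    A minimal sketch of the kind of initialization point 3 is about (hypothetical; it assumes the intent was to -use- the first base file and -merge- the rest):

    Code:
    * locals have no default value in Stata, so initialize before testing
    local base_merged = 0
    foreach base_file of local base_datasets {
        if `base_merged' == 0 {
            use "`base_file'", clear      // first file becomes the master
            local base_merged = 1
        }
        else {
            merge 1:1 identano using "`base_file'", assert(match using) nogenerate
        }
    }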



    • #3
      Hi Mike,

      Thanks for your response. I want to merge multiple long and wide datasets. My wide datasets, like the one called infl.dta below, have one line per participant (identano).

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int identano float infl
      759 46.48893
      462 47.54221
       20 53.46077
      288 64.74852
      563 61.09239
      end
      Then I have long datasets like the one below, where each participant is still identified by identano, but each participant has two lines: one for visitnum 0 (the baseline visit) and one for visitnum 1 (the year 1 data); of course, there are multiple variables. I would like to merge my multiple long and multiple wide datasets automatically (they are all located in the same folder) to produce a single dataset, which would be my unique working dataset for all analyses.

      Example long dataset, called mmse here.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int identano byte visitnum str1(mmseres_1 mmseres_2)
      805 0 "1" "1"
      805 1 "1" "1"
      489 0 "1" "1"
      489 1 "1" "1"
       21 0 "1" "1"
      end
      I imagine this is quite simple and very basic: just merge two long datasets and two wide datasets, with a unique participant identifier (identano) and a unique visit identifier (visitnum).

      Any idea how I could do that? It would save me an enormous amount of time. Thanks so much to you all!

      Best
      Pierre



      • #4
        Well, the following code, begun in the working directory that contains the files, can do this for you:

        Code:
        clear*
        
        local files: dir "." files "*.dta"
        local first_file: word 1 of `files'
        local files: list files - first_file
        
        
        use `"`first_file'"'
        confirm var identano
        capture confirm var visitnum
        if c(rc) == 0 {
            local master_dimensions 2
        }
        else {
            local master_dimensions 1
        }
        
        foreach f of local files {
            capture isid identano using `"`f'"'
            if c(rc) == 0 {
                local using_dimensions 1
            }
            else {
                capture isid identano visitnum using `"`f'"'
                if c(rc) == 0 {
                    local using_dimensions 2
                }
                else {
                    display as error `"`f' not uniquely identifiable with variables identano and visitnum"'
                    exit 9
                }
            }
            
            display `"`f', `master_dimensions', `using_dimensions'"'
            if `master_dimensions' == 1 & `using_dimensions' == 1 {
                merge 1:1 identano using `"`f'"', nogenerate
            }
            else if `master_dimensions' == 1 & `using_dimensions' == 2 {
                merge 1:m identano using `"`f'"', nogenerate
                local master_dimensions = 2
            }
            else if `master_dimensions' == 2 & `using_dimensions' == 1 {
                merge m:1 identano using `"`f'"', nogenerate
            }
            else {
                merge 1:1 identano visitnum using `"`f'"', nogenerate
            }
        }
        That said, I advise against doing this. If there really are so many files that this will save you a lot of time, then it is very likely that there are incompatibilities among these files, or other problems combining them, that need your attention, not blitzing over. In my experience, large collections of files, even from reputable sources, are seldom compatible as received and usually need considerable cleaning to make them compatible. And, in my experience, clinical datasets are among the worst in this regard.

        By incompatible I mean files that contain what is supposed to be the same variable, but with slightly different names in the two files. Or, worse, a variable that has the same name in two files but actually refers to different things in each. Or a variable that is numeric in one file while the corresponding variable (same name) is string in another. Or two variables that are both numeric but represent different coding schemes (conflicting value labels, for example) for the same construct. Also, you are proposing to -merge- all of these files: if two of them contain the same variable, is one of them supposed to update or replace the other? If so, the order in which they are handled is important, and the -merge- command needs to be modified with the -update- or -update replace- options accordingly. But that same specification could be wrong as applied to other files. So unless you have already considered and resolved all these potential problems, automating this mass merge is really asking for trouble.
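
        A sketch of the -update- variants mentioned above (the file names here are hypothetical):

        Code:
        * update: using-file values only fill in missing values in the master
        merge 1:1 identano using "labs_2021.dta", update nogenerate

        * update replace: nonmissing using-file values overwrite conflicting
        * master values instead of only filling in missings
        merge 1:1 identano using "labs_2022.dta", update replace nogenerate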

        To help you identify some of the incompatibilities that may be present among these files, I recommend you install Mark Chatfield's -precombine- command from SSC. Running it on the list of files will alert you to potential incompatibilities so you can fix them first. (It will not, however, help you with the problem of update/replace issues.)
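
        For example, a minimal way to run -precombine- across every .dta file in the working directory might be (assuming the file names contain no spaces):

        Code:
        * one-time installation from SSC
        ssc install precombine

        * check all .dta files in the current directory for incompatibilities
        local files : dir "." files "*.dta"
        precombine `files'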



        • #5
          Hello Clyde,

          Thank you so much for all your help. It is much appreciated. I will really pay attention to what you said, especially since I have many datasets. The precombine command is very helpful too.

          Thank you again!

          Best,
          Pierre
