  • Merging a large number of .dta-files

    Hey!

    We're a group of students who are new to Stata, so our skills in Stata are fairly limited. Our problem is with merging a large number of files, horizontally and vertically. First I'll give an example of the data we have and how it is structured:

    Session 1
    001_uniquefilename1.dta
    001_uniquefilename2.dta
    001_uniquefilename3.dta
    etc...
    Session 2
    002_uniquefilename1.dta
    002_uniquefilename2.dta
    002_uniquefilename3.dta
    etc...
    etc...

    For a total of ~300 sessions. We aim to merge the data files for each session horizontally and then append the sessions vertically. The unique ID variable has a different name in some of the files in each session; if Stata cannot accommodate this, we will rename it so that it is the same in every file.

    Furthermore, some files have multiple entries per unique ID. Let's say the unique ID is year, e.g.:

    001_uniquefilename1.dta:
    1 entry per year, 8 dimensions
    001_uniquefilename2.dta:
    6 entries per year, 6 dimensions
    001_uniquefilename3.dta:
    24 entries per year, 8 dimensions
    001_uniquefilename4.dta:
    24 entries per year, 8 dimensions (same IDs as uniquefilename3 for the 24 entries)

    We are trying to merge these so that each entry in uniquefilename1 is duplicated 6 times to match the 6 entries in uniquefilename2, and each of the 6 entries in the resulting file is then duplicated 24 times for uniquefilename3. On the third merge, with uniquefilename4, nothing should be duplicated again, because the IDs of its 24 entries match those in uniquefilename3.

    As we have very limited experience with Stata, our googling has not been very effective, and we are getting the impression that files to be merged must be in sequential order and share the same name except for a numeric identifier, such as data001, data002, data003, etc. We were wondering if anyone more experienced could point us in the right direction.

  • #2
    Welcome to Statalist.
    Please understand that there is an unofficial policy on here to not just provide exact code for student assignments.
    However, I would want to point you in the direction of
    merge: https://www.stata.com/manuals13/dmerge.pdf
    append: https://www.stata.com/manuals13/dappend.pdf
    foreach: https://www.stata.com/manuals13/pforeach.pdf

    I'll also add that for appending, or what you call vertical merging, Stata does not require IDs to be the same across datasets. Rather, it will identify matching variables and append the values from each dataset in the column for the corresponding variable.
    Merging does require IDs that match across datasets, at least if those IDs refer to the same subject/person.
    If you have one set of variables for person A&B in one dataset, and another set of variables for A&B (or any overlapping set of individuals) in another dataset, you want to use merge.
    If you have a set of variables for persons A&B in one dataset, and the same (or overlapping) set of variables for persons C&D, or for persons A&B at another moment in time, you use append.
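
    To make the distinction concrete, here is a minimal sketch on toy data, not a solution for the actual files. The variables (year, entry, a, b) and the tempfile names are made up for illustration. It assumes the key is unique in one of the two files, which is what merge m:1 expects; the single row per year is then repeated for every matching entry.

    ```stata
    * Hypothetical toy data: one row per year
    clear
    input year a
    2001 1.5
    2002 2.5
    end
    tempfile one_per_year
    save `one_per_year'

    * Hypothetical toy data: several entries per year
    clear
    input year entry b
    2001 1 10
    2001 2 11
    2002 1 20
    2002 2 21
    end

    * If the key has a different name in one file, rename it before merging,
    * e.g. "rename yr year" (yr is a made-up name).

    * merge: same years, new variables. With m:1, the one row per year in the
    * using file is copied onto every matching entry in memory.
    merge m:1 year using `one_per_year'
    drop _merge
    tempfile session1
    save `session1'

    * append: same variables, new observations (e.g. another session)
    clear
    input year entry b a
    2003 1 30 3.5
    end
    append using `session1'
    list, sepby(year)
    ```

    Run it as a do-file, since the tempfile macros only exist while the do-file is running. After each merge it is worth tabulating _merge before dropping it, to see which observations did not match.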

    I'd advise you to go through the manuals, do further reading on here or what you can find via web searches, and come back here when you have more specific questions, including a piece of code that you have written so far.


    • #3
      Thank you, Jorrit!

      I should have explained better in my first post: this is not a student assignment, but a small part of preparing data for a machine learning bachelor thesis using R.

      Stata is another tool we are trying out for this part of the data prep. We will read up more and try some code, and I'll post more specific questions and code examples later.

      Thank you, again.
