Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help: Merging datasets, unexisting observation added

    Hi

    I'm running analyses on baseline and follow-up for an intervention meant to improve how schools support children with diabetes. The baseline observations and follow-up observations are each in their own dataset. For schools that received the intervention, I have to only look at those that even had students with diabetes. There are two variables here: "Intervention", where 1 is "yes, they did receive the intervention", and 2 is "no, they didn't". And "Diabetes", where 1 is "yes, they did have students with diabetes, 2 is "no, they didn't", and 3 is "don't know". So I run som drop commands in the baseline dataset:

    drop if Intervention==1 & Diabetes==2
    drop if Intervention==1 & Diabetes==3
    ​​​​​​​drop if Intervention==1 & Diabetes==.

    I tabulate the two variables to confirm that there are 21 observations, where the school did receive the intervention and did have students with diabetes. I then merge the baseline dataset with the follow-up dataset (the latter has no Diabetes variable, but does have the Intervention variable) using the variable "ID", where each observation has a unique number, as the key variable. The code looks as follows:

    merge 1:m ID using "H:\Documents\Data\Follow-up.dta", nogenerate force

    I then run the same tabulation as before, and I get 22 observations. How is that possible? There were no new observations for Diabetes in the follow-up dataset, because the variable wasn't even in that dataset. How do I get one extra observation that shouldn't even exist?

    I also tried m:m. instead of 1:m, but same problem. It should be noted that I'm fairly new to Stata. I used to work with SPSS. In Stata, I've mostly done a lot of tabulate and recode, some append and stack, but I've only recently learned to merge. I checked the dataeditor, and there were two observations with the same ID and the same values across all variables. I figure that I just have to delete one of them and move on, but I can't help but wonder, how it even got there, and how I might avoid it in the future, so I don't have to check every time. Also, there seems to be one other observation missing, because it's the same total number of observations between the follow-up dataset and the merged datase; one observation duplicated, one missing.

    Please help. What am I doing wrong? Thanks in advance.

  • #2
    Well, trying -merge m:m- was certainly wrong--it always is! But since you didn't get what you hoped for from it, you, fortunately, abandoned it.

    Probably you aren't doing anything wrong in terms of your code. Most likely the problem is that the data are not what you think they are. But to pinpoint that, we would need to see the data itself. It seems that in the Follow-up data there is some ID that has more than one observation, and you do not expect that. So the first thing is to find the ID that has the surplus observation:
    Code:
    use "H:/Documents/Data-Follow-up.dta", clear
    keep if Intervention == 1
    duplicates tag ID, gen(flag)
    browse if flag
    This will show you the duplicated ID. If you look at those observation carefully you may be able to figure out how that happened, and go back and fix up the data management that created this data set so that you get a correct data set.

    I will speculate that since you also note that one ID seems to be missing in the merged data, it may be that in one of the duplicated ID observations, the ID value itself is a mistake and should be the ID of the missing observation. But that's just speculation. You have to work with the actual data to figure out what went wrong.

    If you need additional assistance with this, please show example data from both data sets that exhibits the problem you are encountering, and use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      Added in edit: this crossed with post #2, and makes many of the same points.

      Without having seen the output of your merge command it is difficult to tell what the problem is. What I have written below is my best guess based on the likely possibilities.

      First, it is clear from your attempt of merge m:m that you have not read the full documentation for the merge command. The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

      m:m merges

      m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

      Because m:m merges are such a bad idea, we are not going to show you an example.
      So in the future, avoid merge m:m - experienced users on Statalist have generally never needed to use it, and one who has, on a single occasion, also recognizes that there were other ways of accomplishing what he needed.

      With regard to the merge 1:m I will guess, lacking other information, that there was an ID in your follow-up ("using") dataset that did not match to an ID in your baseline ("master") dataset. The output of help merge - and the full PDF documentation referenced above - explains that by default all observations from both datasets are retained in the output file, and the _merge variable (which you suppressed with the nogenerate option) tells you, for each observation, whether it was only in the master dataset (_merge==1) or in the using dataset (_merge==2) or was in both datasets (_merge==3).

      In any event, you should rerun your merge 1:m command, omitting the nogenerate option, and then examine your output in the Data Editor to understand what merge did. The report that merge produces will tell you the overall counts of matches and each type of mismatch.

      You can later
      Code:
      drop _merge
      to remove it from your dataset when it is no longer needed.

      Comment


      • #4
        Thank you both for your input. It's for work, so I'm pretty sure that I'm not allowed to share the data for confidentiality reasons, though I do acknowledge that it would make helping much easier. I'm relieved that it's probably not my coding that's the problem, when I'm not trying to fix it with m:m at least. Turned out there were actually 2 duplicate observations, one where Intervention is 1, and one where Intervention is 2. I dropped those two observations, everything seemed to check out, and I carried on with my analysis, only to discover halfway through that there are 4 missing observations from the follow-up dataset. It's someone else more experienced than me, who's in charge of creating the datasets, I only do the the analyses. Seems there must be some mistake in at least one of the datasets to be merged. Wouldn't be the first time. Twice in almost half a year I've been asked to analyze variables that weren't even IN the dataset that was provided. Seems someone is going to have to review the datasets, or they'll have to make do with analyses with a few missing observations out of hundreds. Thanks again for clearing this up, as far as can be done without confidential example data.

        Comment


        • #5
          please read the FAQ which has advice on what to do when your data are confidential

          Comment

          Working...
          X