Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Joinby two variables

    Dear statalisters,

    I am trying to merge two datasets and I have some problems. I started yesterday merging using
    Code:
    joinby firm
    and everything was ok. Today I am trying to use
    Code:
    joinby country year
    but I have a problem, I think I créate duplicate data. My master data has 1 million observation and a size about 1,3 GB and the second database is about 170,000 observation and a size of 10MB. The final database is about 20 GB and 20 millón observations.

    Do you know why is that change in size and observations? I think there are some duplicates, how can I see if there are duplicates and what can I do if there are?

    Thank you very much for your help.

    Alejandro

  • #2
    -duplicates report- will tell you how many observations appear as duplicates and how many copies of them there are. If you want to identify them in your data set so you can do something with them, look -duplicates tag- will do that. Read -help duplicates- for the details of these commands and how to use them.

    If there are duplicates in your -joinby- result they necessarily arise from duplicates in the original datasets. -joinby- will not create duplicates that were not there to start with. From what you describe, it sounds like for each observation in the master data, there are, on average, 20 observations in the second data set that agree with it on country and year. Does that seem unreasonably large? If it does, perhaps you meant to also restrict the pairings to those that also agree on some other variable(s). If so, the variable(s) in question must be included in the varlist of the -joinby- command.

    Comment


    • #3
      Joinby forms all pairwise combinations within groups. So if the variable is not unique in one of your datasets you will create duplicates. Stata has a duplicates command you could use.

      I personally would use a merge and verify beforehand that your data are unique.
      Code:
      bysort country year: assert _N == 1
      If you want to check duplicates after the joinby you could try
      Code:
      bysort country year: gen var_count = _N
      br if var_count > 1

      Comment


      • #4
        Hello, thank you for your answers.
        From what you describe, it sounds like for each observation in the master data, there are, on average, 20 observations in the second data set that agree with it on country and year. Does that seem unreasonably large? If it does, perhaps you meant to also restrict the pairings to those that also agree on some other variable(s)


        That is correct, I am trying to merge political parties influencing firms decisions, and I have observation lets say for France 1995 to 2015. But for every observation of firms I have a year and country of observation, lets say firm Renault country france year 2000 income 1,000, then I want a merge a name of a political party for france and year 2000. The second database only have a country, year and a name of the political party.

        Thats why I dont understand the reasons for having more observation after the merge.I did the same after a concatenate and I have the same results.

        At the other side I tested
        Code:
        bysort country year: assert _N == 1
        and received a message saying
        1,463 contradictions in 1,463 by-groups
        Hope to be clear.

        Thank you again,

        Alejandro

        Comment


        • #5
          From what you've described, you should have a unique list in your "second database" with the country and year. I would try doing a merge.

          Something along the lines of
          Code:
          use "Dataset1.dta" , clear 
          merge m:1 country year using "Dataset2.dta" 
          tab _merge
          Also you should try the code below in your using dataset ("second database")
          Code:
           
           bysort country year: assert _N == 1

          Comment


          • #6
            So, focusing on Renault, in France, year 2000, how many political parties were there in France in year 2000. I would be willing to bet it's at least 3 and if minor parties are included in your data set there could well be many more. For present-day France, Wikipedia lists 3 major national parties and 20 minor national parties. So your observation for Renault, in France, year 2000 would lead to somewhere between 3 and 23 observations in the resulting data set after the -joinby-.

            Justin Blasongame and I seem to have different impressions regarding the nature of your data. From my reading of your descriptions, you would not have unique identification of observations by country and year in either of your data sets. But he is certainly correct that if country and year do identify unique observations in one of your data sets, you should use -merge- rather than -joinby-.

            Comment


            • #7
              Thank you very much !!!! Justin and Clyde for your advice !!! it Works !! I used the merge that Justin said, since I have only one party on dataset 2, the party with the largest presence in the government.
              I would like to ask you just one more question. When I do the merge, I add to the dataset1 countries from dataset2 that I dont have in dataset1, are there any option to merge that allow to merge only if I have a match between dataset1 and dataset2 not adding the extra countries?
              Thank you very much, you saved me a lot of time.
              Alejandro

              Comment


              • #8
                Yes, you could do the following:
                Code:
                merge m:1 country year using "Dataset2.dta"
                keep if _merge == 3
                drop _merge
                Or in one line:
                Code:
                merge m:1 country year using "Dataset2.dta" ,  keep(match) nogenerate

                Comment

                Working...
                X