Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets having different years

    Dear Statalist Members,

    I am truly confused about how to merge two different datasets. I have created an example of the master dataset and the other dataset, please see below. The datasets should be matched based on country names (variable named ctr_n). However, since the years are different (variable named yr), I cannot merge them.

    I would be really happy if you can help me. Thank you in advance.

    This is the master dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 ctr_n int yr str7 GADM_1
    "Angola" 1992 "."      
    "Angola" 1992 "."      
    "Angola" 1992 "."      
    "Angola" 1992 "."      
    "Angola" 1992 "Bengo"  
    "Angola" 1992 "Bengo"  
    "Angola" 1992 "Bengo"  
    "Angola" 1992 "Bie"    
    "Angola" 1992 "Bie"    
    "Angola" 1992 "Bie"    
    "Angola" 1992 "Cabinda"
    "Angola" 1992 "Cabinda"
    "Angola" 1992 "Zaire"  
    "Angola" 1992 "Zaire"  
    "Angola" 1992 "Zaire"  
    "Angola" 2008 "Bengo"  
    "Angola" 2008 "Bengo"  
    "Angola" 2008 "Bengo"  
    "Angola" 2008 "Bie"    
    "Angola" 2008 "Bie"    
    "Angola" 2008 "Cabinda"
    "Angola" 2008 "Cabinda"
    "Angola" 2008 "Cabinda"
    "Angola" 2008 "Cabinda"
    "Angola" 2008 "Cabinda"
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Bengo"  
    "Angola" 2012 "Cabinda"
    "Angola" 2017 "Bie"    
    "Angola" 2017 "Bie"    
    "Angola" 2017 "Zaire"  
    "Angola" 2017 "Zaire"  
    end

    This is the other dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 ctr_n int(yr refugeepop longitude latitude) str7 GADM_1
    "Angola" 2000  545     .     . "Bengo"  
    "Angola" 2000   27     .     . "Cunene"
    "Angola" 2000 1396 13234 -8812 "Luanda"
    "Angola" 2000  127     .     . "Malanje"
    "Angola" 2000 5955     .     . "Malanje"
    "Angola" 2000   79     .     . "Moxico"
    "Angola" 2001    5     .     . "Cunene"
    "Angola" 2001  610 13234 -8812 "Luanda"
    "Angola" 2001  476     .     . "Luanda"
    "Angola" 2001    1     .     . "Malanje"
    "Angola" 2001 6544     .     . "Malanje"
    "Angola" 2001  752     .     . "Malanje"
    "Angola" 2001   11     .     . "Malanje"
    "Angola" 2001    3     .     . "Malanje"
    "Angola" 2002  476     .     . "Luanda"
    "Angola" 2002  610 13234 -8812 "Luanda"
    "Angola" 2002   11     .     . "Malanje"
    "Angola" 2002    3     .     . "Malanje"
    "Angola" 2002    1     .     . "Malanje"
    "Angola" 2002 6544     .     . "Malanje"
    "Angola" 2002  752     .     . "Malanje"
    "Angola" 2003   11     .     . "Malanje"
    "Angola" 2003    1     .     . "Malanje"
    "Angola" 2003 7051     .     . "Malanje"
    "Angola" 2003    3     .     . "Malanje"
    "Angola" 2004    1     .     . "Malanje"
    "Angola" 2004 7051     .     . "Malanje"
    "Angola" 2004    3     .     . "Malanje"
    "Angola" 2004   11     .     . "Malanje"
    "Angola" 2005 7066     .     . "Malanje"
    "Angola" 2005   11     .     . "Malanje"
    "Angola" 2005    1     .     . "Malanje"
    "Angola" 2005    3     .     . "Malanje"
    "Angola" 2006    3     .     . "Malanje"
    "Angola" 2006    1     .     . "Malanje"
    "Angola" 2006 6217     .     . "Malanje"
    "Angola" 2006   11     .     . "Malanje"
    "Angola"    .    .     .     . "Luanda"
    end
    Last edited by Cansu Oymak; 27 May 2023, 13:14. Reason: Important is to get a balanced panel per country, but the years will not be the same across countries. The N-T dimensions will be “country GADM_1 year”.

  • #2
    These data sets should not be -merge-d at all; -append- is the appropriate way to combine these.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      These data sets should not be -merge-d at all; -append- is the appropriate way to combine these.
      Dear Clyde,

      Thank you very much for your reply. However, I am trying to have a balanced panel. I understand your point of view when saying append rather than merge. For this country's case (Angola), it has 4 election years (1992, 2008, 2012, and 2017) at the regional level (GADM_1 variable) in the master dataset. The using dataset has information on the number of refugees at the GADM_1 level, but the reporting year (yr variable) is different and spans from 2000 to 2016. I would like to merge these two datasets rather than appending. Important is to get a balanced panel per country, but the years will not be the same across countries. The N-T dimensions will be “country GADM_1 year”. In that sense, the merged dataset should start with 1992 for Angola, it should also have 1993, 1994, ..., and 2017. I hope it is now clear. Thank you very much.

      Comment


      • #4
        Well, it is a bit clearer, but not entirely clear.

        There is no -merge- or -append- that will give you results with all years from 1992 through 2017 when the data sets you are starting with contain only 1992, 2008, 2012, 2017, and 2000 through 2016 respectively. If you want results for years 1993 through 1999, you will either have to find that data somewhere, or decide on an approach to imputation: neither -merge- nor -append- will do that for you.

        In the original example data, the sets of years in the two data sets did not overlap, which is what led me to propose combining them with -append-. I now understand that the example data was, in this aspect, misleading. So, these data set can be -merge-d as follows:

        Code:
        use dataset1, clear
        merge 1:1 country year GADM_1 using dataset2
        Even now, though, I cannot be certain if this gets you (closer to) what you want. Again in the example data, the GADM_1 values shown in the two data sets are mostly different, so that there will be only a few observations that pair up in the two data sets. Most of the observations in the results will have data only from one data set or the other. Hopefully, in the full data, the same GADM_1 values will appear in both data sets for any given country and year and this won't be a problem.

        But, no matter what, you will still be left with the absence of any data for years 1993 through 1999. You will have to decide on what approach you want to use to impute those values.

        Comment

        Working...
        X