Pooling the data from different data sources

Omar Shaher

Join Date: Feb 2019

Posts: 164
#1

Pooling the data from different data sources

12 Oct 2020, 18:57

Dear researchers,

I am interested in studying specific factors across countries for the period 2000-2019. The population (i.e. the list of countries) was identified from one database, but because of data unavailability the variables for these countries were collected from different data sources. Some of these sources cover only a subset of the countries, some provide only certain variables for certain years, and others provide certain variables for the whole period 2000-2019. After importing the data set into Stata, the panel is reported as balanced. I can explain this: for a given country and year, some variables may be missing while other variables for that same country and year are available.

But I think I should pool the data instead of treating it as a panel, because I collected the data from different sources and the years for which each variable is available vary across databases. If this makes sense, is there any paper or reference that supports pooling data collected from different databases?

Many thanks in advance.
Omar Shaher

Join Date: Feb 2019

Posts: 164
#2

13 Oct 2020, 17:12

Since I haven't received an answer yet, let me make the question clearer:

1- The data cover the period 2000-2019.
2- I have 10 variables.
3- The sample is 100 countries.
4- The variables are not available in one source (that is, several sources are used).
5- Not all countries are available in all sources.
6- In some sources there are no data from 2016 to 2019 for some variables.
7- All data have been arranged as a balanced panel: country-years and variables for which there are no values have been left as they are (i.e. missing values).
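To make the structure in points 4-7 concrete, here is a minimal Python sketch (not Stata, and not my actual data). The two toy sources, the variable names x1 and x2, and the helper functions are all hypothetical. It shows how every country-year row can exist, so the panel looks balanced, while only some rows have values for all variables.

```python
from itertools import product

# Hypothetical toy sources, each mapping (country, year) -> value.
source_a = {("A", 2000): 1.0, ("A", 2001): 1.2, ("B", 2000): 0.8}     # x1; no value for B in 2001
source_b = {("A", 2000): 10.0, ("B", 2000): 12.0, ("B", 2001): 13.0}  # x2; no value for A in 2001

countries = ["A", "B"]
years = [2000, 2001]


def build_panel(countries, years, sources):
    """Create one row per country-year; values absent from a source stay None."""
    rows = []
    for country, year in product(countries, years):
        row = {"country": country, "year": year}
        for name, source in sources.items():
            row[name] = source.get((country, year))
        rows.append(row)
    return rows


def complete_cases(rows, variables):
    """Keep only the rows where every listed variable is non-missing."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


panel = build_panel(countries, years, {"x1": source_a, "x2": source_b})
usable = complete_cases(panel, ["x1", "x2"])

print(len(panel), "country-year rows in the balanced grid")
print(len(usable), "rows with both x1 and x2 observed")
```

The point of the sketch is that "balanced" describes the country-year grid only; a regression that needs all variables at once would use just the complete rows.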

The most important point in the study is that I don't want to control for countries (i.e. no country dummies or -fe- in the code), but I will include year dummies. That leaves two options: either Generalized Least Squares (GLS) using -xtreg-, or pooled regression using -reg-.
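To show what I mean by the pooled option, here is a minimal Python sketch of pooled OLS with year dummies and no country dummies, roughly the point estimates one would expect from something like -reg y x i.year- in Stata. The data, the variable names y and x, and both helper functions are hypothetical; the sketch drops incomplete rows (listwise deletion) and does not compute standard errors.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def pooled_ols_year_dummies(rows, y, xs):
    """Pooled OLS of y on xs plus year dummies; countries are not controlled for."""
    # Listwise deletion: keep rows where the outcome and all regressors are observed.
    used = [r for r in rows if r[y] is not None and all(r[v] is not None for v in xs)]
    years = sorted({r["year"] for r in used})
    dummy_years = years[1:]  # the first year is the omitted base category
    design = [
        [1.0] + [r[v] for v in xs] + [1.0 if r["year"] == t else 0.0 for t in dummy_years]
        for r in used
    ]
    target = [r[y] for r in used]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, target)) for i in range(k)]
    names = ["_cons"] + list(xs) + [f"year_{t}" for t in dummy_years]
    return dict(zip(names, solve(xtx, xty)))


# Hypothetical data generated as y = 1 + 2*x + 0.5*(year == 2001), with one missing x.
rows = [
    {"country": "A", "year": 2000, "x": 1.0, "y": 3.0},
    {"country": "B", "year": 2000, "x": 2.0, "y": 5.0},
    {"country": "C", "year": 2000, "x": None, "y": 4.0},
    {"country": "A", "year": 2001, "x": 1.5, "y": 4.5},
    {"country": "B", "year": 2001, "x": 3.0, "y": 7.5},
    {"country": "C", "year": 2001, "x": 2.0, "y": 5.5},
]

coef = pooled_ols_year_dummies(rows, "y", ["x"])
print(coef)
```

Note that the country identifier is never used in the estimation: each complete country-year row is treated as a separate observation, which is exactly the assumption I am asking about.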

Honestly, I used pooled regression (-reg-) seven months ago, and the results have already been written up on that basis. I relied on two papers:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894156/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4927450/

They combine data across studies and state that pooling data drawn from separate investigations has many benefits, including:
A) Increasing the sample size.
B) Reducing variances.
C) Obtaining more precise confidence intervals for outcomes.
D) Increasing statistical power.
E) Making it possible to estimate a variety of models that could not be estimated within any single data set.

Thus, does it make sense to rely on the above, i.e. to say that the data were collected from different sources, to justify using pooled regression instead of GLS?

Any recommendation would be highly appreciated.

Many thanks in advance.