Cleaning data of Global Entrepreneurship Monitor

Bence Boross

Join Date: Mar 2020
Posts: 23

22 Mar 2020, 08:16

Dear All,

I am currently working with cross-sectional survey data from the Global Entrepreneurship Monitor (GEM), covering 2004 to 2016. GEM investigates entrepreneurial activity across countries based on individual-level surveys. I am using a multi-level research design: I take the individual-level data from GEM and merge them with country-level factors such as corruption, the competitiveness of a country, etc., in order to investigate the impact of country-level factors on certain individual-level factors. My question relates to the cleaning phase.
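For concreteness, the merge step I have in mind looks roughly like the sketch below. The file name country_factors.dta is only a placeholder, and I am assuming the country-level file has one row per country and year, with country and yrsurv coded the same way as in the GEM data.

```stata
* Sketch only: country_factors.dta is a hypothetical file name
* Assumes one row per country-year in the using file, keyed on country and yrsurv
merge m:1 country yrsurv using country_factors.dta

* Check how many respondents found a matching country-year
tabulate _merge

* Keep matched respondents only, then tidy up
keep if _merge == 3
drop _merge
```

Tabulating _merge before dropping anything shows how many respondents would be lost simply because a country-year has no country-level record.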

After selecting my variables from GEM and appending the separate yearly datasets from 2004 to 2016, I arrived at more than 2 million observations. My initial dataset of individual-level factors has around 16 variables, and most observations are incomplete, i.e. they have a missing value on at least one variable. If I keep only the complete observations, I am left with approximately 9,000 observations. However, as mentioned above, I would like to merge the individual-level data with country-level variables to estimate country-level effects. As a result of deleting the incomplete observations, some countries drop out in certain years. For example, Germany is not represented in 2006 and 2007. Indeed, only a few countries (approx. 5) have complete data in all the examined years (2004-2016). This raises the question of whether I can still investigate differences between countries even though not all countries are represented in every year. Do you have any suggestions on how to overcome this issue?
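To make the problem easier to see, the number of complete cases per country and year could be tabulated with something like the following. This is a minimal sketch using the variable names from my -dataex- listings; the names nmiss and complete are just illustrative.

```stata
* Sketch only: variable names are those in the -dataex- listings in this post
local xvars gemhhinc gemeduc omexport omnowjob gender age estbbuso knowenyy ///
    suskilyy frfailyy teayyopp teayynec eb_cust eb_tech eb_yytec eb_jobgr

* How many observations are missing on each variable?
* Note: codes such as -2 ("Refused" for age) are NOT counted as missing by Stata
misstable summarize `xvars'

* Count missing analysis variables per respondent and flag complete cases
egen nmiss = rowmiss(`xvars')
generate byte complete = (nmiss == 0)

* Complete cases per country-year; empty cells are the country-years
* that disappear under listwise deletion
tabulate country yrsurv if complete
```

The per-variable summary should show whether a handful of variables account for most of the lost observations, or whether the missingness is spread across all of them.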

This is the -dataex- output before dropping the observations with missing values:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(country yrsurv gemhhinc gemeduc omexport omnowjob gender age estbbuso knowenyy suskilyy frfailyy teayyopp teayynec eb_cust eb_tech eb_yytec eb_jobgr)
1 2016    33 1212 . . 2 63 0 0 1 1 0 0 . . -2 .
1 2016 68100 1316 . . 1 52 0 0 1 1 0 0 . . -2 .
1 2016  3467 1316 . . 2 64 0 0 0 0 0 0 . . -2 .
1 2016  3467 1316 6 2 2 70 1 0 0 0 0 0 2 3  0 0
1 2016  3467 1316 . . 1 -2 0 0 0 1 0 0 . . -2 .
end
label values country country
label def country 1 "United States", modify
label values gemhhinc GEMHHINC
label def GEMHHINC 33 "Lowest 33%tile", modify
label def GEMHHINC 3467 "Middle 33%tile", modify
label def GEMHHINC 68100 "Upper  33%tile", modify
label values gemeduc GEMEDUC
label def GEMEDUC 1212 "SECONDARY DEGREE", modify
label def GEMEDUC 1316 "POST SECONDARY", modify
label values omexport omexport
label def omexport 6 "10% or less", modify
label values omnowjob omnowjob
label values gender gender
label def gender 1 "Male", modify
label def gender 2 "Female", modify
label values age age
label def age -2 "Refused", modify
label values estbbuso ESTBBUSO
label def ESTBBUSO 0 "No", modify
label def ESTBBUSO 1 "Yes", modify
label values knowenyy KNOWENyy
label def KNOWENyy 0 "No", modify
label values suskilyy SUSKILyy
label def SUSKILyy 0 "No", modify
label def SUSKILyy 1 "Yes", modify
label values frfailyy FRFAILyy
label def FRFAILyy 0 "No", modify
label def FRFAILyy 1 "Yes", modify
label values teayyopp TEAyyOPP
label def TEAyyOPP 0 "No", modify
label values teayynec TEAyyNEC
label def TEAyyNEC 0 "No", modify
label values eb_cust EB_CUST
label def EB_CUST 2 "Some", modify
label values eb_tech EB_TECH
label def EB_TECH 3 "No new technology (more than 5 years)", modify
label values eb_yytec EB_yyTEC
label def EB_yyTEC 0 "No/low technology sector", modify

This is the -dataex- output after dropping the observations with missing values:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(country yrsurv gemhhinc gemeduc omexport omnowjob gender age estbbuso knowenyy suskilyy frfailyy teayyopp teayynec eb_cust eb_tech eb_yytec eb_jobgr)
1 2016 68100 1212 6  1 2 36 1 0 1 0 1 0 3 3 0 0
1 2016 68100 1316 6  3 2 64 1 0 1 0 1 0 3 3 0 0
1 2016 68100 1316 6  0 2 59 1 1 1 0 1 0 2 3 0 0
1 2016  3467 1720 5  0 1 55 1 1 1 1 1 0 3 3 0 0
1 2016 68100 1316 6 24 1 66 1 1 1 0 1 0 3 3 0 6
end
label values country country
label def country 1 "United States", modify
label values gemhhinc GEMHHINC
label def GEMHHINC 3467 "Middle 33%tile", modify
label def GEMHHINC 68100 "Upper  33%tile", modify
label values gemeduc GEMEDUC
label def GEMEDUC 1212 "SECONDARY DEGREE", modify
label def GEMEDUC 1316 "POST SECONDARY", modify
label def GEMEDUC 1720 "GRAD EXP", modify
label values omexport omexport
label def omexport 5 "11 to 25%", modify
label def omexport 6 "10% or less", modify
label values omnowjob omnowjob
label values gender gender
label def gender 1 "Male", modify
label def gender 2 "Female", modify
label values age age
label values estbbuso ESTBBUSO
label def ESTBBUSO 1 "Yes", modify
label values knowenyy KNOWENyy
label def KNOWENyy 0 "No", modify
label def KNOWENyy 1 "Yes", modify
label values suskilyy SUSKILyy
label def SUSKILyy 1 "Yes", modify
label values frfailyy FRFAILyy
label def FRFAILyy 0 "No", modify
label def FRFAILyy 1 "Yes", modify
label values teayyopp TEAyyOPP
label def TEAyyOPP 1 "Yes", modify
label values teayynec TEAyyNEC
label def TEAyyNEC 0 "No", modify
label values eb_cust EB_CUST
label def EB_CUST 2 "Some", modify
label def EB_CUST 3 "None", modify
label values eb_tech EB_TECH
label def EB_TECH 3 "No new technology (more than 5 years)", modify
label values eb_yytec EB_yyTEC
label def EB_yyTEC 0 "No/low technology sector", modify

Thank you so much for your help and time!


Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

23 Mar 2020, 13:50

You have a massive sample selection problem unless the data are missing at random. That will make any sample statistics almost meaningless. While there are procedures for handling missing data, I would be wary of relying on them when 95+% of your observations have missing values.
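A first look at whether completeness is related to things you do observe might go along the lines of the sketch below. In the five rows listed in #1, the only row with values for omexport, omnowjob, eb_cust, eb_tech and eb_jobgr is the one with estbbuso equal to 1, so that variable is a natural one to cross-tabulate. The names n_missing and is_complete are only illustrative. Note that this cannot establish that the data are missing at random; it only shows whether being a complete case depends on variables that are observed.

```stata
* Sketch only: variable names follow the -dataex- listings in #1
egen n_missing = rowmiss(gemhhinc gemeduc omexport omnowjob gender age ///
    estbbuso knowenyy suskilyy frfailyy teayyopp teayynec ///
    eb_cust eb_tech eb_yytec eb_jobgr)
generate byte is_complete = (n_missing == 0)

* Is being a complete case related to owning an established business?
tabulate estbbuso is_complete, row chi2

* Does the share of complete cases differ across survey years?
tabulate yrsurv is_complete, row
```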
Bence Boross

Join Date: Mar 2020

Posts: 23
#3

24 Mar 2020, 10:50

Thank you for your answer, Phil. I came to use this database by following other authors' work; many previous researchers have studied my topic. To give just one example: Estrin, S., Korosteleva, J., & Mickiewicz, T. M. (2019). Schumpeterian Entry: Innovation, Exporting, and Growth Aspirations of Entrepreneurs. In Academy of Management Proceedings (Vol. 2019, No. 1, p. 17308). Briarcliff Manor, NY 10510: Academy of Management. I believe this database can be used, so I might have done something wrong.