
  • Cleaning data from the Global Entrepreneurship Monitor

    Dear All,

    I am working with cross-sectional data from the Global Entrepreneurship Monitor (GEM) for 2004 to 2016. GEM measures entrepreneurial activity across countries through individual-level surveys. I use a multilevel research design: I take the individual-level data from GEM and merge them with country-level factors such as corruption and national competitiveness, and I investigate how the country-level factors affect certain individual-level factors. My question concerns the data-cleaning phase.

    After selecting my variables from GEM and appending the yearly datasets from 2004 to 2016, I have more than 2 million observations. My initial dataset of individual-level factors has around 16 variables, and most observations have a missing value on at least one of them. If I keep only the complete observations, I am left with approximately 9,000. However, as mentioned above, I would like to merge the individual-level data with country-level variables to estimate country-level effects. After deleting the incomplete observations, some countries drop out of certain years entirely. For example, Germany is absent in 2006 and 2007. Indeed, only a few countries (approx. 5) have complete data in all the examined years (2004-2016). This raises the question of whether I can still investigate differences between countries even though not all countries are fully represented in the dataset. Do you have any suggestions on how to overcome this issue?
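
    To make the coverage problem concrete, here is a minimal sketch of how complete cases per country-year cell could be counted. The toy values and the three-variable list are made up for illustration; the real list would be the roughly 16 GEM variables. Note that -egen, rowmiss()- counts only Stata missing values, so negative codes such as -2 are not treated as missing here.

    ```stata
    * Toy data standing in for the appended GEM file (all values are made up)
    clear
    input country yrsurv gemhhinc omexport eb_cust
    1 2006    33 . .
    1 2006  3467 6 2
    1 2007 68100 . .
    2 2006  3467 5 3
    2 2007    33 6 2
    2 2007 68100 . .
    end

    * Number of missing values per respondent across the analysis variables
    egen nmiss = rowmiss(gemhhinc omexport eb_cust)
    gen byte complete = (nmiss == 0)

    * Complete and total respondents in each country-year cell
    collapse (sum) n_complete = complete (count) n_total = complete, by(country yrsurv)
    list, sepby(country)

    * Cells that disappear entirely under listwise deletion
    list country yrsurv if n_complete == 0
    ```

    In the toy data, country 1 in 2007 is the only cell with no complete respondent, which mirrors how Germany vanishes from 2006 and 2007 in my data.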

    This is the -dataex- output before dropping the observations with missing values:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(country yrsurv gemhhinc gemeduc omexport omnowjob gender age estbbuso knowenyy suskilyy frfailyy teayyopp teayynec eb_cust eb_tech eb_yytec eb_jobgr)
    1 2016    33 1212 . . 2 63 0 0 1 1 0 0 . . -2 .
    1 2016 68100 1316 . . 1 52 0 0 1 1 0 0 . . -2 .
    1 2016  3467 1316 . . 2 64 0 0 0 0 0 0 . . -2 .
    1 2016  3467 1316 6 2 2 70 1 0 0 0 0 0 2 3  0 0
    1 2016  3467 1316 . . 1 -2 0 0 0 1 0 0 . . -2 .
    end
    label values country country
    label def country 1 "United States", modify
    label values gemhhinc GEMHHINC
    label def GEMHHINC 33 "Lowest 33%tile", modify
    label def GEMHHINC 3467 "Middle 33%tile", modify
    label def GEMHHINC 68100 "Upper  33%tile", modify
    label values gemeduc GEMEDUC
    label def GEMEDUC 1212 "SECONDARY DEGREE", modify
    label def GEMEDUC 1316 "POST SECONDARY", modify
    label values omexport omexport
    label def omexport 6 "10% or less", modify
    label values omnowjob omnowjob
    label values gender gender
    label def gender 1 "Male", modify
    label def gender 2 "Female", modify
    label values age age
    label def age -2 "Refused", modify
    label values estbbuso ESTBBUSO
    label def ESTBBUSO 0 "No", modify
    label def ESTBBUSO 1 "Yes", modify
    label values knowenyy KNOWENyy
    label def KNOWENyy 0 "No", modify
    label values suskilyy SUSKILyy
    label def SUSKILyy 0 "No", modify
    label def SUSKILyy 1 "Yes", modify
    label values frfailyy FRFAILyy
    label def FRFAILyy 0 "No", modify
    label def FRFAILyy 1 "Yes", modify
    label values teayyopp TEAyyOPP
    label def TEAyyOPP 0 "No", modify
    label values teayynec TEAyyNEC
    label def TEAyyNEC 0 "No", modify
    label values eb_cust EB_CUST
    label def EB_CUST 2 "Some", modify
    label values eb_tech EB_TECH
    label def EB_TECH 3 "No new technology (more than 5 years)", modify
    label values eb_yytec EB_yyTEC
    label def EB_yyTEC 0 "No/low technology sector", modify

    This is the -dataex- output after dropping the observations with missing values:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(country yrsurv gemhhinc gemeduc omexport omnowjob gender age estbbuso knowenyy suskilyy frfailyy teayyopp teayynec eb_cust eb_tech eb_yytec eb_jobgr)
    1 2016 68100 1212 6  1 2 36 1 0 1 0 1 0 3 3 0 0
    1 2016 68100 1316 6  3 2 64 1 0 1 0 1 0 3 3 0 0
    1 2016 68100 1316 6  0 2 59 1 1 1 0 1 0 2 3 0 0
    1 2016  3467 1720 5  0 1 55 1 1 1 1 1 0 3 3 0 0
    1 2016 68100 1316 6 24 1 66 1 1 1 0 1 0 3 3 0 6
    end
    label values country country
    label def country 1 "United States", modify
    label values gemhhinc GEMHHINC
    label def GEMHHINC 3467 "Middle 33%tile", modify
    label def GEMHHINC 68100 "Upper  33%tile", modify
    label values gemeduc GEMEDUC
    label def GEMEDUC 1212 "SECONDARY DEGREE", modify
    label def GEMEDUC 1316 "POST SECONDARY", modify
    label def GEMEDUC 1720 "GRAD EXP", modify
    label values omexport omexport
    label def omexport 5 "11 to 25%", modify
    label def omexport 6 "10% or less", modify
    label values omnowjob omnowjob
    label values gender gender
    label def gender 1 "Male", modify
    label def gender 2 "Female", modify
    label values age age
    label values estbbuso ESTBBUSO
    label def ESTBBUSO 1 "Yes", modify
    label values knowenyy KNOWENyy
    label def KNOWENyy 0 "No", modify
    label def KNOWENyy 1 "Yes", modify
    label values suskilyy SUSKILyy
    label def SUSKILyy 1 "Yes", modify
    label values frfailyy FRFAILyy
    label def FRFAILyy 0 "No", modify
    label def FRFAILyy 1 "Yes", modify
    label values teayyopp TEAyyOPP
    label def TEAyyOPP 1 "Yes", modify
    label values teayynec TEAyyNEC
    label def TEAyyNEC 0 "No", modify
    label values eb_cust EB_CUST
    label def EB_CUST 2 "Some", modify
    label def EB_CUST 3 "None", modify
    label values eb_tech EB_TECH
    label def EB_TECH 3 "No new technology (more than 5 years)", modify
    label values eb_yytec EB_yyTEC
    label def EB_yyTEC 0 "No/low technology sector", modify

    Thank you so much for your help and time!

  • #2
    You have a massive sample selection problem unless the data are missing at random. Otherwise, almost any sample statistic computed from the complete cases will be close to meaningless. There are procedures for handling missing data, but I would be wary of relying on them when 95+% of your observations have missing values.
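
    Before deciding anything, it may help to see which variables drive the loss. The sketch below re-enters the five rows from the first -dataex- listing in #1 (a subset of the variables) and tabulates missingness. It assumes that -2 is a non-response code everywhere it appears; the posted labels confirm this only for age ("Refused"). In these five rows the om* and eb_* items are filled in only where estbbuso == 1, which hints at a questionnaire skip pattern rather than random missingness, but five rows cannot establish that.

    ```stata
    * The five rows from the first -dataex- listing in #1 (subset of variables)
    clear
    input double(country yrsurv omexport omnowjob age estbbuso eb_cust eb_tech eb_yytec eb_jobgr)
    1 2016 . . 63 0 . . -2 .
    1 2016 . . 52 0 . . -2 .
    1 2016 . . 64 0 . . -2 .
    1 2016 6 2 70 1 2 3  0 0
    1 2016 . . -2 0 . . -2 .
    end

    * Assumption: -2 is a non-response code (labelled "Refused" for age only)
    mvdecode age eb_yytec, mv(-2 = .a)

    * Which variables have missing values, and how many
    misstable summarize omexport omnowjob age eb_cust eb_tech eb_yytec eb_jobgr

    * Is the missingness concentrated among respondents with estbbuso == 0?
    tabulate estbbuso if missing(omexport)
    count if missing(omexport) & estbbuso == 0
    ```

    If the same pattern holds in the full file, the complete cases would essentially be the respondents with estbbuso == 1, and that changes which population the analysis describes.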


    • #3
      Thank you for your answer, Phil. I chose this database because other authors have used it, and many previous researchers have worked on my topic. To give one example: Estrin, S., Korosteleva, J., & Mickiewicz, T. M. (2019). Schumpeterian Entry: Innovation, Exporting, and Growth Aspirations of Entrepreneurs. In Academy of Management Proceedings (Vol. 2019, No. 1, p. 17308). Briarcliff Manor, NY 10510: Academy of Management. I believe this database can be used, so I may have done something wrong.
