  • Recognize dataset

    Hello everyone,

    I am new here and happy to join the Stata community.
    I am currently writing my Master's thesis on the effect of internationalization on entrepreneurial success in developing countries.
    I have access to the World Bank and have just downloaded the all economies survey.
    How can I tell whether a dataset has a panel structure? Some country-specific datasets are explicitly named as panel data, but the all economies survey is not.
    Can you please tell me how to figure out whether I have panel data or not?
    And which type of data would you recommend in my case? I think panel data, right?

    Thank you very much

    Dominik

  • #2
    Please read the introduction material and xtreg material from the panel documentation provided with Stata. A panel dataset would look like this:

    country1 yr1
    country1 yr2
    country1 yr3
    country2 yr1
    country2 yr2
    country2 yr3

    etc.

    Even if it is not structured this way in the download, it is almost always worth reshaping the data into panel format for estimation if it is panel data.
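
    A minimal sketch of how the layout above could be checked in Stata. The toy data and the names country, year, y, n_years and country_id are made up for illustration; they are not from the World Bank files.

    Code:
    * Toy data: two countries observed in three years (values are invented)
    clear
    input str8 country int year double y
    "A" 2001 1.0
    "A" 2002 1.5
    "A" 2003 1.7
    "B" 2001 2.0
    "B" 2002 2.1
    "B" 2003 2.4
    end

    * Exactly one row per country-year? isid stops with an error if not
    isid country year

    * How many periods is each country observed for?
    bysort country: generate n_years = _N
    tabulate n_years

    * xtset needs a numeric identifier, so encode the string first
    encode country, generate(country_id)
    xtset country_id year
    xtdescribe

    If -isid- fails on the real data, the identifier and time variable do not yet define one row per unit and period, which is the first thing to sort out before estimation.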



    • #3
      Phil clarified the issue.

      Additionally, you may type - help xtset - or - help xtreg - in the Command Window and take a look at the examples.
      Best regards,

      Marcos



      • #4
        Hello everyone,

        First of all, thank you very much for your answers. I appreciate your support.
        Phil Bromiley The dataset I mentioned first does indeed look like your example ('all economies enterprise survey 2018' in the picture below). I therefore assume we have panel data, but only at the country level. Each value of 'idstd', which is supposed to be the enterprise ID, occurs only once, even though roughly 180,000 companies (enterprises) have been surveyed. My research question is:

        "What is the effect of internationalization on entrepreneurial success in developing countries?"

        I therefore assume I have to use panel data at the firm level, right?
        For the respective countries there are panel datasets available that contain the variable 'panelid' ('panel dataset, country specific' in the picture below). I suppose that identifies the firm. Each 'panelid' value appears twice, in two different years, with a different 'idstd' each time:

        idstd panelid year
        42846 47190 2007
        20357 47190 2014

        37819 67823 2007
        82036 67823 2014

        To answer my research question, I would now combine the panel datasets from the respective countries into one dataset I can work with. Do you agree with that?
        I would really appreciate feedback on this post.
        I am grateful for every hint or tip to make it better, especially because I have never worked with Stata at this level before and I am really lucky to be in contact with such an experienced community.
        And sorry for my English...

        I attached a drawing which visualizes what I explained above.



        Kind regards
        Dominik
        [Attached image: both combined.jpg]

        Last edited by Dominik Miksch; 12 Mar 2019, 09:17.
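
        Stacking country files that share the same variables is usually done with -append- (which adds observations) rather than -merge- (which adds variables). A minimal sketch under stated assumptions: the two temporary files stand in for hypothetical country-specific panel files, the rows reuse the idstd/panelid/year values quoted above, and the names source and firm_id are invented. It also assumes panelid may repeat across countries, which the thread does not confirm.

        Code:
        * Hypothetical country file A
        clear
        input long idstd long panelid int year
        42846 47190 2007
        20357 47190 2014
        end
        tempfile countryA
        save `countryA'

        * Hypothetical country file B
        clear
        input long idstd long panelid int year
        37819 67823 2007
        82036 67823 2014
        end
        tempfile countryB
        save `countryB'

        * Stack the files; source records which file each row came from
        use `countryA', clear
        append using `countryB', generate(source)

        * Firm identifier that stays unique even if panelid repeats across files
        egen firm_id = group(source panelid)

        isid firm_id year
        xtset firm_id year

        Run this from a do-file, since the temporary file names are local macros.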



        • #5
          Dominik:
          very artistic contribution indeed!
          However, for the future, please read the FAQ about the best way to post attachments in Stata format or, far better, post what you typed and what Stata gave you back (via CODE delimiters, please) and/or share an example/excerpt of your data via -dataex-.
          That said:
          - you have a panel dataset if the same sample (although you will probably lose some units as time elapses) is repeatedly measured at equally spaced intervals in time;
          - usually, surveys are not panel datasets, as the sample changes from year to year;
          - as far as I can get the gist of your previous posts, you might have countries (not firms) as -panelid- and years as -timevar-.
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Carlo Lazzaro Thank you for your fast reply.
            I read the FAQ before; it says I should not post .GPH files because they make the conversation difficult to follow. I didn't know that .JPG is also unsuitable for visualizing what I am talking about. Unfortunately, I haven't run any important regressions yet, as I am still reviewing the datasets I have. But yes, I will keep that in mind and do it the way you mentioned: CODE delimiters, and if I share examples I will use dataex.

            In the 'all economies enterprise survey' countries are measured over time, so yes, I also think I have countries as 'panelid' and years as 'timevar'.
            The problem I face is that Stata does not accept 'country' as my panel variable in the xtset command. It tells me 'string variables not allowed in varlist' because 'country is a string variable'.

            Kind regards
            Dominik



            • #7
              Dominik:
              it's easy to convert -Country- from -string- to numeric format and -xtset- accordingly, as you can see from the following toy example:
              Code:
              . set obs 2
              number of observations (_N) was 0, now 2
              
              . g Country="UK" in 1
              (1 missing value generated)
              
              . replace Country="USA" in 2
              variable Country was str2 now str3
              (1 real change made)
              
              . encode Country, gen (numeric_Country)
              
              . list
              
                   +--------------------+
                   | Country   numeri~y |
                   |--------------------|
                1. |      UK         UK |
                2. |     USA        USA |
                   +--------------------+
              
              . g year=_n
              
              . xtset Country year
              string variables not allowed in varlist;
              Country is a string variable
              r(109);
              
              . xtset numeric_Country year
                     panel variable:  numeric_Country (weakly balanced)
                      time variable:  year, 1 to 2
                              delta:  1 unit
              
              .
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                Carlo Lazzaro Thank you for your answer. I did what you suggested and it worked; I can now use 'numeric_country' as panelid.
                But as you can see in the code below, Stata does not accept the year (a14y) as the time variable.
                Is it because for some countries I have observations for only one year? Do you have any idea what I could do?


                Code:
                xtset country a14y
                
                string variables not allowed in varlist;
                country is a string variable
                
                
                . describe country
                
                              storage   display    value
                variable name   type    format     label      variable label
                ---------------------------------------------------------------------------------------------------------------------
                country         str26   %26s                  Country
                
                .
                end of do-file
                
                
                encode country, gen(numeric_country)
                describe numeric_country
                
                              storage   display    value
                variable name   type    format     label      variable label
                ---------------------------------------------------------------------------------------------------------------------
                numeric_country long    %26.0g     numeric_country
                                                              Country
                
                .
                end of do-file
                
                . xtset numeric_country a14y
                repeated time values within panel
                r(451);
                
                end of do-file
                
                duplicates list numeric_country a14y
                
                --> gives a long list
                Kind regards
                Dominik



                • #9
                  Dominik:
                  if your repeated time values are simply a matter of fact (ie, you have no duplicates due to mistaken data entry) and you do not plan to use time-series features, such as lags and leads, you can -xtset- your data with -panelid- only:
                  Code:
                  xtset numeric_country
                  Kind regards,
                  Carlo
                  (Stata 19.0)
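
                  The r(451) message appears whenever several rows share the same panel identifier and time value, which is expected when many firms are surveyed in the same country and year. A minimal sketch with invented values, reusing the variable names from the thread, of how the repeats could be counted and two ways of proceeding. Note that collapsing changes the unit of analysis from firms to country-years, which may not suit a firm-level research question.

                  Code:
                  * Toy data: several firms per country-year (values are made up)
                  clear
                  input byte numeric_country int a14y double l6
                  1 2013 10
                  1 2013 12
                  1 2014 11
                  2 2013 20
                  2 2013 22
                  end

                  * How many rows share each country-year combination?
                  duplicates report numeric_country a14y

                  * Option A: keep firm-level rows and declare the panel variable only
                  xtset numeric_country

                  * Option B: aggregate to one row per country-year, then declare both
                  collapse (mean) l6, by(numeric_country a14y)
                  xtset numeric_country a14y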



                  • #10
                    Carlo Lazzaro Thank you for your fast reply and your help. But by doing so I restrict myself, as I can't use time-series commands, right?
                    It could be that I need those commands later. Is there any way to fix that issue?
                    Using the isid command gives me

                    Code:
                    isid numeric_country a14y
                    variables numeric_country a14y should never be missing
                    Kind regards
                    Dominik



                    • #11
                      Dominik:
                      yes, you are correct about the possible future limitations of your research.
                      That said, Stata tells you that something is missing in the -timevar-.
                      Check it and see if you can fix it.
                      Kind regards,
                      Carlo
                      (Stata 19.0)
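
                      A minimal sketch, on invented rows, of how the missing values that -isid- complains about could be located. The variable names follow the thread; the data do not.

                      Code:
                      * Toy data with one missing year
                      clear
                      input byte numeric_country int a14y
                      1 2013
                      1 .
                      2 2014
                      end

                      * Rows where either key variable is missing
                      count if missing(numeric_country, a14y)
                      list if missing(numeric_country, a14y)

                      * missok makes isid treat missing like any other value
                      isid numeric_country a14y, missok

                      On the real data -isid- with -missok- would presumably still fail, because many rows share the same country and year; the option only removes the complaint about missing values.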



                      • #12
                        Carlo Lazzaro Thank you for your hints.
                        I am not sure if I am right, but do you think a possible solution would be to drop the observations listed by the following command?

                        (the output contains more, this is just part of it as an example)

                        Code:
                        duplicates list numeric_country a14y
                        
                          |    348   136769                   Zambia2013   2013 |
                          |    348   136770                   Zambia2013   2013 |
                          |    348   136771                   Zambia2013   2013 |
                          |    348   136772                   Zambia2013   2013 |
                          |    348   136773                   Zambia2013   2013 |
                          |-----------------------------------------------------|
                          |    348   136774                   Zambia2013   2013 |
                          |    348   136775                   Zambia2013   2013 |
                          |    348   136776                   Zambia2013   2013 |
                          |    348   136777                   Zambia2013   2013 |
                          |    348   136778                   Zambia2013   2013 |
                          |-----------------------------------------------------|
                          |    348   136779                   Zambia2013   2013 |
                          |    348   136780                   Zambia2013   2013 |
                          |    348   136781                   Zambia2013   2013 |
                          |    348   136782                   Zambia2013   2013 |
                          |    348   136783                   Zambia2013   2013 |
                          |-----------------------------------------------------|
                          |    348   136784                   Zambia2013   2013 |
                          |    348   136785                   Zambia2013   2013 |
                          |    348   136786                   Zambia2013   2013 |
                          |    348   136787                   Zambia2013   2013 |
                          |    348   136788                   Zambia2013   2013 |
                          |-----------------------------------------------------|
                          |    348   136789                   Zambia2013   2013 |
                          |    348   136790                   Zambia2013   2013 |
                          |    348   136791                   Zambia2013   2013 |
                          |    348   136792                   Zambia2013   2013 |
                          |    348   136793                   Zambia2013   2013 |
                        Furthermore, I have two variables in the dataset that are supposed to record the year, namely a14y and a15y, but I am not sure what exactly the difference between them is.
                        Do you have an explanation for it?

                        Code:
                        table a14y
                        
                        ----------------------
                             Year |      Freq.
                        ----------+-----------
                               -8 |          2
                             2005 |          2
                             2008 |     12,278
                             2009 |      8,975
                             2010 |      9,468
                             2011 |     11,023
                             2012 |      2,896
                             2013 |     22,725
                             2014 |     18,992
                             2015 |      5,988
                             2016 |      7,649
                             2017 |      6,022
                             2018 |      2,361
                             2019 |         73
                             2301 |          1
                        ----------------------
                        
                        . table a15y
                        
                        ----------------------
                             Year |      Freq.
                        ----------+-----------
                               -9 |          1
                               -8 |          2
                             2010 |      9,292
                             2011 |     11,016
                             2012 |      2,629
                             2013 |     22,503
                             2014 |     19,761
                             2015 |      5,430
                             2016 |      7,618
                             2017 |      6,056
                             2018 |      2,366
                             2019 |         73
                        ----------------------

                        I also have a different question for future regressions and hope you can help me as well.
                        If I want to estimate the difference between years, or between 2 years/2 observations, I create dummy variables, right?
                        That is, one dummy for each year for which I have data?

                        A lot of questions at the beginning, but I really appreciate your help.
                        Thanks a lot!

                        Kind regards
                        Dominik



                        • #13
                          Dominik:
                          before getting rid of anything, you should be 100% sure that you have genuine duplicates (ie, observations mistakenly entered >=2 times in your dataset). Is that the case with what you posted?
                          Years like -8, -9 and 2301 look suspicious; they might well be missing data placeholders (-8; -9) and mistaken data entry (2301). Only you can check the reason for the difference between -a14y- and -a15y-.
                          As far as your last question is concerned, you may add -i.timevar- among the predictors on the right-hand side of your regression equation.
                          Kind regards,
                          Carlo
                          (Stata 19.0)
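
                          A minimal sketch, on invented rows, of how the suspicious year codes could be handled without dropping anything. It assumes that -8 and -9 are missing-data placeholders and that 2005 to 2019 is the plausible range, both of which should be confirmed against the survey documentation. The variable bad_year is a made-up helper.

                          Code:
                          * Toy data reproducing the kinds of values seen in the tables
                          clear
                          input int a14y int a15y
                          -8 -9
                          2013 2013
                          2301 2014
                          2014 -8
                          end

                          * Assumption: negative codes are placeholders for missing
                          replace a14y = . if inlist(a14y, -8, -9)
                          replace a15y = . if inlist(a15y, -8, -9)

                          * Flag, rather than delete, years outside the assumed plausible window
                          generate byte bad_year = !inrange(a14y, 2005, 2019) & !missing(a14y)
                          list if bad_year

                          * How often do the two year variables disagree?
                          count if a14y != a15y & !missing(a14y, a15y)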



                          • #14
                            Carlo Lazzaro Thank you very much for your answer.
                            Based on the variable 'idstd', no value occurs >= 2 times. I couldn't find any other sign of duplicated values.
                            As you suggested, I added 'i.timevar' in front of the right-hand-side variables of a simple regression, but Stata gave me:

                            Code:
                             reg l6 i.a14y d3b d3c
                            a14y:  factor variables may not contain negative values
                            Is that because of the -8?
                            Just out of curiosity, would it also be possible to use dummy variables for the respective years?

                            Kind regards
                            Dominik

                            Edit:

                            After I got rid of the negative values it worked:

                            Code:
                            reg l6 i.a14y d3b d3c
                            
                                  Source |       SS           df       MS      Number of obs   =   108,076
                            -------------+----------------------------------   F(15, 108060)   =      6.28
                                   Model |  8.8203e+09        15   588022903   Prob > F        =    0.0000
                                Residual |  1.0123e+13   108,060  93680043.1   R-squared       =    0.0009
                            -------------+----------------------------------   Adj R-squared   =    0.0007
                                   Total |  1.0132e+13   108,075  93748654.2   Root MSE        =    9678.8
                            
                            ------------------------------------------------------------------------------
                                      l6 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                                    a14y |
                                   2008  |   2.339926   6844.545     0.00   1.000    -13412.87    13417.55
                                   2009  |    14.0007   6844.768     0.00   0.998    -13401.65    13429.65
                                   2010  |   15.58416   6844.708     0.00   0.998    -13399.95    13431.11
                                   2011  |   4.259579   6844.602     0.00   1.000    -13411.07    13419.58
                                   2012  |  -3.417121   6846.351    -0.00   1.000    -13422.17    13415.33
                                   2013  |  -3.422354   6844.291    -0.00   1.000    -13418.14    13411.29
                                   2014  |   3.106307   6844.348     0.00   1.000    -13411.72    13417.93
                                   2015  |   1241.734   6845.142     0.18   0.856    -12174.65    14658.12
                                   2016  |   6.352013   6844.883     0.00   0.999    -13409.52    13422.23
                                   2017  |   7.714403   6845.119     0.00   0.999    -13408.62    13424.05
                                   2018  |   5.482297   6846.885     0.00   0.999    -13414.32    13425.28
                                   2019  |  -16.51652   6937.132    -0.00   0.998     -13613.2    13580.17
                                   2301  |  -7.41e-08   11854.12    -0.00   1.000     -23233.9     23233.9
                                         |
                                     d3b |   2.944417    2.14347     1.37   0.170    -1.256754    7.145589
                                     d3c |  -.0846422   1.360979    -0.06   0.950    -2.752143    2.582858
                                   _cons |   7.41e-08   6843.977     0.00   1.000     -13414.1     13414.1
                            ------------------------------------------------------------------------------
                            Last edited by Dominik Miksch; 13 Mar 2019, 09:49.



                            • #15
                              Dominik:
                              I think you have to deal with -8 and -9 first (they cause the error Stata gave you back), as they are in all likelihood missing values.
                              In addition, 2301 is an apparent mistake.
                              You can use dummies, but -fvvarlist- notation is much more useful.
                              Finally, you have monster standard errors that produce very weird confidence intervals (most of them are almost perfectly symmetric around zero, which is far from what you usually find in empirical research).
                              In sum: scrutinize your data carefully, in search of the culprit behind those strange results.
                              Kind regards,
                              Carlo
                              (Stata 19.0)
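
                              One pattern in the output of #14 may help the search. The standard error of _cons (6843.977) is very close to Root MSE divided by the square root of 2 (9678.8 / 1.4142, about 6844), which is what one would expect if the omitted base year, 2005, contributes only two observations, so that every year coefficient is a contrast against that tiny group. A minimal sketch on simulated data, assuming the variable names from the thread, that compares the default base level with a well-populated one; the simulated values and the choice of 2010 as base are arbitrary.

                              Code:
                              * Simulated data: five well-populated years plus two rows in 2005
                              clear
                              set seed 12345
                              set obs 1000
                              generate int a14y = 2008 + mod(_n, 5)
                              replace a14y = 2005 in 1/2
                              generate double d3b = runiform()
                              generate double l6 = 5 + d3b + rnormal()

                              * Default base level: the lowest year (2005, two observations)
                              regress l6 i.a14y d3b

                              * Base switched to a well-populated year, sparse years excluded
                              regress l6 ib2010.a14y d3b if inrange(a14y, 2008, 2019)

                              * Explicit dummies are possible too, with more bookkeeping
                              tabulate a14y, generate(yr_)

                              In the first regression the year coefficients should carry much larger standard errors than in the second, although the data are otherwise the same.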

