Recognize dataset

Dominik Miksch

Join Date: Mar 2019

Posts: 37
#1

Recognize dataset

10 Mar 2019, 11:06

Hello everyone,

I am new here and happy to join the Stata community.
I am currently writing my Master's thesis on the effect of internationalization on entrepreneurial success in developing countries.
I have access to the World Bank data and have just downloaded the all economies survey.
How can I tell whether a dataset has a panel structure? Some country-specific datasets are explicitly named as panel data, but the all economies survey is not.
Could you please tell me how to figure out whether I have panel data?
And which type of data would you recommend in my case? I think panel data, right?

Thank you very much

Dominik
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

11 Mar 2019, 12:55

Please read the introductory material and the xtreg entry in the panel-data documentation provided with Stata. A panel dataset would look like this:

country1 yr1
country1 yr2
country1 yr3
country2 yr1
country2 yr2
country2 yr3

etc.

Even if it is not structured this way in the download, it is almost always worth reshaping the data into panel format for estimation if it is panel data.
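
As a rough illustration of the reshaping step mentioned above, here is a minimal sketch. The variable names (gdp2016 and so on) and all values are made up for the example; they are not taken from the World Bank files.

```stata
* Hypothetical wide layout: one row per country, one column per year
clear
input str8 country double(gdp2016 gdp2017 gdp2018)
"country1" 1.0 1.1 1.2
"country2" 2.0 2.1 2.2
end

* Reshape to long: one row per country-year, as in the layout above
reshape long gdp, i(country) j(year)

* xtset needs a numeric panel identifier, so encode the string first
encode country, generate(country_id)
xtset country_id year
list country year gdp, sepby(country)
```

After the reshape the data hold one observation per country and year, which is the layout the xt commands expect.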
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

11 Mar 2019, 13:50

Phil clarified the issue.

Additionally, you may type - help xtset - or - help xtreg - in the Command Window and take a look at the examples.

Best regards,

Marcos
Comment
Dominik Miksch

Join Date: Mar 2019

Posts: 37
#4

12 Mar 2019, 09:12

Hello everyone,

First of all, thank you very much for your answers. I appreciate your support.
Phil Bromiley The dataset I was talking about first does indeed look like your example ('all economies enterprise survey 2018' in the picture below). Therefore I assume we have panel data, but only at the country level. Each value of 'idstd', which is supposed to be the enterprise ID, occurs only once, even though ~180,000 companies (enterprises) have been surveyed. My research question is:

"What is the effect of internationalization on entrepreneurial success in developing countries?"

Therefore I assume I have to use panel data at the firm level, right?
For the respective countries there are panel datasets available which contain the variable 'panelid' ('panel dataset, country specific' in the picture below). I suppose that identifies the firm. Each 'panelid' value appears twice, with a different 'idstd' in each of 2 different years:

idstd panelid year
42846 47190 2007
20357 47190 2014

37819 67823 2007
82036 67823 2014

What I would do now, in order to answer my research question, is combine the panel datasets from the respective countries into one dataset I can work with. Do you agree with that?
I would really appreciate feedback on this post.
I appreciate every single hint or tip to make it better, especially because I have never worked with Stata at such a level before and I am really lucky to be in contact with such an experienced community.
And sorry for my English...

I attached a drawing which visualizes what I explained above.

Kind regards
Dominik
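
For readers following along, stacking country-specific panel files is usually done with -append- rather than -merge-. The sketch below is only an assumption about how that could look: the four rows are the ones quoted in this post, while the country names and the idea that the two pairs come from different country files are hypothetical.

```stata
* First hypothetical country file
clear
input long idstd long panelid int year
42846 47190 2007
20357 47190 2014
end
generate str20 country = "CountryA"
tempfile a
save `a'

* Second hypothetical country file
clear
input long idstd long panelid int year
37819 67823 2007
82036 67823 2014
end
generate str20 country = "CountryB"

* Stack the files, then build a firm identifier that stays unique
* even if two countries happen to reuse the same panelid
append using `a'
egen long firm = group(country panelid)
xtset firm year
```

Grouping on country and panelid together is a precaution; whether panelid is already unique across countries would have to be checked in the real files.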

Last edited by Dominik Miksch; 12 Mar 2019, 09:17.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17743
#5

12 Mar 2019, 09:29

Dominik:
very artistic contribution indeed!
However, in future, please read the FAQ about the best way to post attachments in Stata format or, better still, post what you typed and what Stata gave you back (via CODE delimiters, please) and/or share an example/excerpt of your data via -dataex-.
That said:
- you have a panel dataset if the same sample (although you will probably lose some units as time elapses) is repeatedly measured at equally spaced intervals in time;
- usually, surveys are not panel datasets, as the sample changes from year to year;
- as far as I can get the gist of your previous posts, you might have countries (not firms) as -panelid- and years as -timevar-.
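
One way to check the first point in practice is to count how often each candidate identifier occurs. This is a minimal sketch using the four rows quoted in post #4; the helper variable names n_idstd and n_panelid are made up for the example.

```stata
* Toy data copied from the excerpt in post #4
clear
input long idstd long panelid int year
42846 47190 2007
20357 47190 2014
37819 67823 2007
82036 67823 2014
end

* How many rows share each value of the candidate identifiers?
bysort idstd: generate n_idstd = _N
bysort panelid: generate n_panelid = _N
tabulate n_idstd
tabulate n_panelid

* isid exits with an error unless the variables uniquely identify rows
isid panelid year
```

An identifier that never repeats cannot serve as a panel variable; one that repeats across years, and is unique together with the time variable, can.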

Kind regards,
Carlo
(Stata 19.0)
Comment
Dominik Miksch

Join Date: Mar 2019

Posts: 37
#6

12 Mar 2019, 13:49

Carlo Lazzaro Thank you for your fast reply.
I read the FAQ beforehand; it says I should not post .GPH files, as that makes it difficult to follow the conversation. But I didn't know that .JPG is also unsuitable for supporting or visualizing what I am talking about. Unfortunately, I haven't run any important regressions yet, as I am still looking through the datasets I have. But yes, I will keep that in mind and do it the way you mentioned: CODE delimiters, and if I share examples I will do so with dataex.

In the 'all economies enterprise survey' there are countries measured over time, so yes, I also think I have countries as 'panelid' and years as 'timevar'.
The problem I face is that Stata doesn't accept 'country' as my panel variable in the xtset command. It tells me 'string variables not allowed in varlist' and 'country is a string variable'.

Kind regards
Dominik
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17743
#7

13 Mar 2019, 01:04

Dominik:
it's easy to convert -Country- from -string- to numeric format and -xtset- accordingly, as you can see from the following toy example:

Code:

. set obs 2
number of observations (_N) was 0, now 2

. g Country="UK" in 1
(1 missing value generated)

. replace Country="USA" in 2
variable Country was str2 now str3
(1 real change made)

. encode Country, gen (numeric_Country)

. list

     +--------------------+
     | Country   numeri~y |
     |--------------------|
  1. |      UK         UK |
  2. |     USA        USA |
     +--------------------+

. g year=_n

. xtset Country year
string variables not allowed in varlist;
Country is a string variable
r(109);

. xtset numeric_Country year
       panel variable:  numeric_Country (weakly balanced)
        time variable:  year, 1 to 2
                delta:  1 unit

.

Kind regards,
Carlo
(Stata 19.0)

Comment

Dominik Miksch

Join Date: Mar 2019
Posts: 37
#8

13 Mar 2019, 05:06

Carlo Lazzaro Thank you for your answer. I did what you suggested and it worked; I can use 'numeric_country' as panelid now.
But as you can see in the code below, Stata doesn't accept the year (a14y) as the time variable.
Is it because for some countries I have observations for only one year? Do you have any ideas what I could do?

Code:

xtset country a14y

string variables not allowed in varlist;
country is a string variable


. describe country

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------
country         str26   %26s                  Country

.
end of do-file


encode country, gen(numeric_country)
describe numeric_country

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------
numeric_country long    %26.0g     numeric_country
                                              Country

.
end of do-file

. xtset numeric_country a14y
repeated time values within panel
r(451);

end of do-file

duplicates list numeric_country a14y

--> gives a long list

Kind regards
Dominik

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17743
#9

13 Mar 2019, 05:10

Dominik:
if your repeated time values are simply a matter of fact (i.e., you have no duplicates due to mistaken data entry) and you do not plan to use time-series features, such as lags and leads, you can -xtset- your data with -panelid- only:

Code:

xtset numeric_country
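
To show where "repeated time values within panel" comes from, here is a small sketch with made-up data in which several firms share the same country and year. The last step, averaging firms to one row per country-year, is only one possible assumption about how a country-level panel could be built; it changes the unit of analysis and is not something recommended in this thread.

```stata
* Made-up firm-level data: several firms per country-year
clear
input str10 country int year double sales
"Zambia" 2013 10
"Zambia" 2013 20
"Zambia" 2007 5
"Kenya"  2013 7
"Kenya"  2013 9
end
encode country, generate(cid)

* Count how many rows share the same country-year combination
duplicates report cid year

* This fails with r(451) because country-year does not identify a row
capture noisily xtset cid year

* Option shown here: collapse to one row per country-year, then xtset
collapse (mean) sales, by(cid year)
xtset cid year
```

With one row per country and year, -xtset- accepts both the panel and the time variable, so lags and leads become available at the country level.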

Kind regards,
Carlo
(Stata 19.0)
Comment
Dominik Miksch

Join Date: Mar 2019

Posts: 37
#10

13 Mar 2019, 05:21

Carlo Lazzaro Thank you for your fast reply and your help. But by doing so I restrict myself, as I can't use time-series commands, right?
I mean, it is possible that I will need those commands later. Is there any chance to fix that issue?
Using the isid command gives me

Code:

. isid numeric_country a14y
variables numeric_country a14y should never be missing

Kind regards
Dominik
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17743
#11

13 Mar 2019, 05:25

Dominik:
yes, you are correct about the possible future limitations of your research.
That said, Stata tells you that something is missing in the -timevar-.
Check it and see if you can fix it.
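
A quick way to locate the missing values is sketched below on made-up data; idstd and a14y are the variable names used in this thread, the values are invented.

```stata
* Made-up excerpt with one missing year
clear
input long idstd int a14y
1 2013
2 .
3 2014
end

* How many observations lack a year, and which ones are they?
count if missing(a14y)
list idstd a14y if missing(a14y)

* Compact overview of missingness for the variable
misstable summarize a14y
```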

Kind regards,
Carlo
(Stata 19.0)
Comment

Dominik Miksch

Join Date: Mar 2019
Posts: 37

#12

13 Mar 2019, 05:49

Carlo Lazzaro Thank you for your hints.
I am not sure if I am right, but do you think a possible solution would be to get rid of the observations in the list I got with the following command?

(the output contains more, this is just part of it as an example)

Code:

duplicates list numeric_country a14y

  |    348   136769                   Zambia2013   2013 |
  |    348   136770                   Zambia2013   2013 |
  |    348   136771                   Zambia2013   2013 |
  |    348   136772                   Zambia2013   2013 |
  |    348   136773                   Zambia2013   2013 |
  |-----------------------------------------------------|
  |    348   136774                   Zambia2013   2013 |
  |    348   136775                   Zambia2013   2013 |
  |    348   136776                   Zambia2013   2013 |
  |    348   136777                   Zambia2013   2013 |
  |    348   136778                   Zambia2013   2013 |
  |-----------------------------------------------------|
  |    348   136779                   Zambia2013   2013 |
  |    348   136780                   Zambia2013   2013 |
  |    348   136781                   Zambia2013   2013 |
  |    348   136782                   Zambia2013   2013 |
  |    348   136783                   Zambia2013   2013 |
  |-----------------------------------------------------|
  |    348   136784                   Zambia2013   2013 |
  |    348   136785                   Zambia2013   2013 |
  |    348   136786                   Zambia2013   2013 |
  |    348   136787                   Zambia2013   2013 |
  |    348   136788                   Zambia2013   2013 |
  |-----------------------------------------------------|
  |    348   136789                   Zambia2013   2013 |
  |    348   136790                   Zambia2013   2013 |
  |    348   136791                   Zambia2013   2013 |
  |    348   136792                   Zambia2013   2013 |
  |    348   136793                   Zambia2013   2013 |

Furthermore, I have two variables in the dataset that are supposed to be the year, namely a14y and a15y. But I am not sure what exactly the difference is.
Do you have an explanation for it?

Code:

table a14y

----------------------
     Year |      Freq.
----------+-----------
       -8 |          2
     2005 |          2
     2008 |     12,278
     2009 |      8,975
     2010 |      9,468
     2011 |     11,023
     2012 |      2,896
     2013 |     22,725
     2014 |     18,992
     2015 |      5,988
     2016 |      7,649
     2017 |      6,022
     2018 |      2,361
     2019 |         73
     2301 |          1
----------------------

. table a15y

----------------------
     Year |      Freq.
----------+-----------
       -9 |          1
       -8 |          2
     2010 |      9,292
     2011 |     11,016
     2012 |      2,629
     2013 |     22,503
     2014 |     19,761
     2015 |      5,430
     2016 |      7,618
     2017 |      6,056
     2018 |      2,366
     2019 |         73
----------------------

I also have a different question about future regressions and hope you can help me with that as well.
If I want to estimate the difference between years (for example, between 2 years or 2 observations), I create dummy variables, right?
Like one dummy for each year for which I have data?

A lot of questions at the beginning, but I really appreciate your help.
Thanks a lot!

Kind regards
Dominik

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17743
#13

13 Mar 2019, 07:10

Dominik:
before getting rid of anything, you should be 100% sure that you have genuine duplicates (i.e., observations mistakenly entered >=2 times in your dataset). Is that the case with what you posted?
Years like -8, -9 and 2301 look suspicious; they might well be missing-data placeholders (-8; -9) and mistaken data entry (2301). Only you can check the reason for the difference between -a14y- and -a15y-.
As far as your last question is concerned, you may add -i.timevar- among the predictors on the right-hand side of your regression equation.
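
If the negative codes do turn out to be missing-data placeholders, which is an assumption to verify against the survey documentation, they could be recoded as sketched below. The data are made up; only the variable names a14y and a15y and the codes -8, -9 and 2301 come from this thread, and the cut-off 2019 is simply the largest plausible year in the posted tables.

```stata
* Made-up excerpt containing the suspicious codes
clear
input int(a14y a15y)
2013 2013
-8   2014
2010 -9
2301 2016
end

* Turn the assumed placeholder codes into Stata missing values
mvdecode a14y a15y, mv(-9 -8)

* Flag implausible years for manual inspection instead of dropping them
list if a14y > 2019 & !missing(a14y)

* See where the two year variables disagree
count if a14y != a15y & !missing(a14y, a15y)
```

Listing the implausible year rather than deleting it keeps the decision about 2301 with whoever can consult the original questionnaire.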

Kind regards,
Carlo
(Stata 19.0)
Comment

Dominik Miksch

Join Date: Mar 2019
Posts: 37

#14

13 Mar 2019, 09:45

Carlo Lazzaro Thank you very much for your answer.
Based on the variable 'idstd', no value occurs >= 2 times. I couldn't find any other hint of duplicated values.
As you suggested, I added 'i.timevar' to the right-hand-side variables of a simple regression, but Stata gave me:

Code:

 reg l6 i.a14y d3b d3c
a14y:  factor variables may not contain negative values

It is because of the -8, right?
Just because I'm curious, would it also be possible to use dummy variables for the respective years?

Kind regards
Dominik

Edit:

After I got rid of the negative values it worked:

Code:

reg l6 i.a14y d3b d3c

      Source |       SS           df       MS      Number of obs   =   108,076
-------------+----------------------------------   F(15, 108060)   =      6.28
       Model |  8.8203e+09        15   588022903   Prob > F        =    0.0000
    Residual |  1.0123e+13   108,060  93680043.1   R-squared       =    0.0009
-------------+----------------------------------   Adj R-squared   =    0.0007
       Total |  1.0132e+13   108,075  93748654.2   Root MSE        =    9678.8

------------------------------------------------------------------------------
          l6 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        a14y |
       2008  |   2.339926   6844.545     0.00   1.000    -13412.87    13417.55
       2009  |    14.0007   6844.768     0.00   0.998    -13401.65    13429.65
       2010  |   15.58416   6844.708     0.00   0.998    -13399.95    13431.11
       2011  |   4.259579   6844.602     0.00   1.000    -13411.07    13419.58
       2012  |  -3.417121   6846.351    -0.00   1.000    -13422.17    13415.33
       2013  |  -3.422354   6844.291    -0.00   1.000    -13418.14    13411.29
       2014  |   3.106307   6844.348     0.00   1.000    -13411.72    13417.93
       2015  |   1241.734   6845.142     0.18   0.856    -12174.65    14658.12
       2016  |   6.352013   6844.883     0.00   0.999    -13409.52    13422.23
       2017  |   7.714403   6845.119     0.00   0.999    -13408.62    13424.05
       2018  |   5.482297   6846.885     0.00   0.999    -13414.32    13425.28
       2019  |  -16.51652   6937.132    -0.00   0.998     -13613.2    13580.17
       2301  |  -7.41e-08   11854.12    -0.00   1.000     -23233.9     23233.9
             |
         d3b |   2.944417    2.14347     1.37   0.170    -1.256754    7.145589
         d3c |  -.0846422   1.360979    -0.06   0.950    -2.752143    2.582858
       _cons |   7.41e-08   6843.977     0.00   1.000     -13414.1     13414.1
------------------------------------------------------------------------------

Last edited by Dominik Miksch; 13 Mar 2019, 09:49.

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17743
#15

13 Mar 2019, 11:02

Dominik:
I think you have to deal with -8 and -9 first (they cause the error Stata gave you back), as they are in all likelihood missing values.
In addition, 2301 is an apparent mistake.
You can use dummies, but -fvvarlist- notation is much more useful.
Finally, you have monster standard errors that produce very weird confidence intervals (most of them are almost perfectly symmetric around zero, which is far from what you usually find in empirical research).
In sum: scrutinize your data carefully, in search of the culprit behind those strange results.
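
To illustrate the point about dummies versus -fvvarlist- notation, here is a minimal sketch on the auto dataset shipped with Stata, not the survey data from this thread. The stub name rep_ is made up for the example.

```stata
* Built-in example dataset; rep78 plays the role of the year variable
sysuse auto, clear

* Factor-variable notation: Stata creates the indicators on the fly
regress price i.rep78 weight

* Hand-made dummies: one indicator per category, first one left out
tabulate rep78, generate(rep_)
regress price rep_2 rep_3 rep_4 rep_5 weight
```

Both regressions should give the same coefficients; the factor-variable version avoids cluttering the dataset and works directly with postestimation commands such as -margins-.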

Kind regards,
Carlo
(Stata 19.0)
Comment
