Creating a dummy variable marking countries with more than 100 observations (or dropping)

Alejandro Torres

Join Date: Jan 2018

Posts: 152
#1

22 Sep 2020, 13:47

Dear Statalisters,

I have a dataset with 120 countries. Some countries have more than 100 observations and others have 100 or fewer (100 is my cutoff).
What I want is a dummy variable equal to 1 (one) if the country of that observation has more than 100 observations, and 0 (zero) if the country of that observation has 100 observations or fewer.

My idea is to run a regression restricted to countries that have more than 100 observations, something like:

Code:

regress IV DV control1 control2 if country == 1

where country == 1 means that the country has more than 100 observations in total.

I hope my explanation is clear.

I am using Stata 15.

Thank you very much for any help,

Alejandro
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#2

22 Sep 2020, 14:35

For example:

Code:

clear all
sysuse auto
sort rep78
by rep78: egen groupsize = count(rep78)
list rep78 groupsize, sepby(rep78)

But more likely you don't just want 100 records with that country code, but that many records usable in the regression (meaning non-missing values in ALL of the regression variables). In this case, see the help for markout.
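A minimal sketch of the markout idea, continuing with the auto data from the example above. The regression variables (price, mpg, weight), the new variable names (touse, nusable), and the threshold of 10 are illustrative assumptions only; substitute your own variables and 100.

```stata
sysuse auto, clear

* touse starts as 1 for every observation
mark touse

* set touse to 0 wherever any listed variable is missing
markout touse price mpg weight rep78

* per group, count the observations that are actually usable
bysort rep78: egen nusable = total(touse)

* regress only on usable observations from sufficiently large groups
regress price mpg weight if touse & nusable > 10
```

Counting touse rather than raw records means a group qualifies only if it has enough complete cases for this particular regression.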

Best, Sergiy Radyakin
Alejandro Torres

Join Date: Jan 2018

Posts: 152
#3

22 Sep 2020, 17:39

Thank you, Sergiy, for your answer.
I am not sure I am explaining myself well.

Let's say I have 3 countries: USA with 150 observations, China with 140 observations and Italy with 80 observations.
What I need is a dummy (let's call it "country100") that is 1 if the observation is from USA or China (because they have more than 100 observations) and 0 if the observation is from Italy, because Italy has fewer than 100 observations in total.

So my regression will be:

Code:

regress IV DV control1 control2 if country100 == 1

Then I expect the output to use only observations from China and USA, not Italy.

Thank you very much again.

Alejandro
Nick Cox

Join Date: Mar 2014

Posts: 35784
#4

23 Sep 2020, 03:13

Sergiy Radyakin pointed you in a good direction, but there are other ways to do it.

Code:

regress IV DV control1 control2
gen OK = e(sample)
egen nOK = total(OK), by(countryname)
regress IV DV control1 control2 if nOK >= 100

Note that I have to make extra guesses about your variable names, as you have not given a data example, despite the request at https://www.statalist.org/forums/help#stata
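If raw record counts per country are enough (ignoring missing values in the regression variables), the dummy described in #3 can be sketched directly with _N under bysort. The toy data below simply mimic the USA/China/Italy example; the variable names country and country100 are taken from the posts above, and everything else is an assumption for illustration.

```stata
clear
set obs 370

* toy data: 150 USA, 140 China, 80 Italy observations
gen str5 country = cond(_n <= 150, "USA", cond(_n <= 290, "China", "Italy"))

* within each country, _N is that country's number of observations
bysort country: gen byte country100 = _N > 100

* USA and China get 1, Italy gets 0
tabulate country country100
```

Note that this counts every record for the country, whereas the e(sample) approach counts only observations that actually entered the regression.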

Your proposed regression might benefit from some thought about error structure.

Last edited by Nick Cox; 23 Sep 2020, 03:50.
Alejandro Torres

Join Date: Jan 2018

Posts: 152
#5

29 Sep 2020, 07:30

Dear Nick,

Thank you for your answer. I solved my problem.
Now I would like to ask about your comment "Your proposed regression might benefit from some thought about error structure".
What do you mean by that, please? Is it because I wrote IV first and then DV?
Thank you,

Alejandro
Nick Cox

Join Date: Mar 2014

Posts: 35784
#6

29 Sep 2020, 09:23

No. I mean that you have clusters of observations pooled together.
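Purely to illustrate the syntax for acknowledging clustering, here is a minimal sketch on the auto data, where rep78 stands in for the grouping variable (country, in this thread). Whether cluster-robust standard errors, fixed effects, or some other model suits the actual data is a separate question that this sketch does not settle.

```stata
sysuse auto, clear

* conventional standard errors: treats all observations as independent
regress price mpg weight

* cluster-robust standard errors: allows correlated errors within each group
* (rep78 is only a stand-in; 5 groups are far too few for real inference,
*  and observations with missing rep78 are dropped)
regress price mpg weight, vce(cluster rep78)
```

The coefficient formulas are the same in both calls; the vce() option changes only how the standard errors are computed.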
Alejandro Torres

Join Date: Jan 2018

Posts: 152
#7

29 Sep 2020, 12:27

Dear Nick, could you tell me more about that, please? I think you are seeing something that I don't, and I am running into some problems.
I am struggling at the moment because I feel confused.

I have data for firms in different countries over a time span from 1990 to 2017.

My dependent variable is R&D intensity (I exclude values below 0, since investment can't be negative, and values greater than 1, because I assume a firm can't invest more than its sales in a period); my independent variable is a dummy.

Now, I have doubts about xtset. After reading the Stata manuals, I was tempted to use

Code:

xtset firm year

but when I used it, the regression was not concave, and I am not sure whether time matters, so I was thinking of using

Code:

xtset firm

However, since I would like to account for the variance across countries, because of the ecological fallacy, I am not sure whether I should instead use

Code:

xtset country

Because of the ecological fallacy, I was planning to use

Code:

xtset firm year

and then the option

Code:

vce(cluster country)

but then I realised that this is what you are drawing my attention to. Finally, a friend (also a PhD student) told me to use country as a fixed effect, but I already have industry and year fixed effects, and am now considering firm and country too.

As you can see, my thinking is a mess at the moment. I have read some of the Stata manuals and understand some of the usage, but I am not sure what I should use. The same goes for the regression itself: I see some research using tobit, fracreg logit or even OLS.

I know that my explanation is quite confusing, but if you have any advice for me I would really appreciate it. I am moving forward in my dissertation only because I receive feedback and help here.

(I am using Stata 15.)

Thank you so much again,

Alejandro
Nick Cox

Join Date: Mar 2014

Posts: 35784
#8

29 Sep 2020, 17:41

Good questions, but you'd be better advised now by people in econometrics and applied economics who work with these kinds of models.