
  • Creating a dummy variable marking countries with more than 100 observations (or dropping)

    Dear statalisters,

    I have a database with 120 countries. Some of those countries have more than 100 observations and others have 100 or fewer (100 is my cutoff).
    What I want to do is create a dummy variable that equals 1 (one) if the country in that observation has more than 100 observations, and 0 (zero) if the country in that observation has 100 or fewer observations.

    My idea is to run a regression restricted to countries that have more than 100 observations, something like:

    Code:
    regress IV DV control1 control2 if country == 1
    where country == 1 means that the country has more than 100 observations in total.
    I hope my explanation is clear.

    I have Stata 15

    Thank you very much for any help,

    Alejandro

  • #2
    For example:

    Code:
    clear all
    sysuse auto
    
    sort rep78
    by rep78: egen groupsize=count(rep78)
    list rep78 groupsize, sepby(rep78)
    But more likely you don't just want 100 records in which that country code occurs, but that many records usable in the regression (meaning non-missing values in ALL of the regression variables). In this case see help for markout.
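
    A minimal sketch of the markout idea, assuming the regression variables are named as in the question (IV, DV, control1, control2) and that the country identifier is a variable called country. All of these names are guesses, since no data example was given; touse, nusable and country100 are hypothetical names as well.

    Code:
    * start by marking every observation as usable
    gen byte touse = 1
    * set touse to 0 where any of the listed variables is missing
    markout touse IV DV control1 control2
    * count usable observations within each country
    egen nusable = total(touse), by(country)
    * 1 if the country has more than 100 usable observations, 0 otherwise
    gen byte country100 = nusable > 100
    The count here is of usable records rather than of all records, so a country with many missing values can fall below the cutoff even if it has more than 100 rows.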

    Best, Sergiy Radyakin



    • #3
      Thank you Sergiy for your answer.
      I am not sure I am explaining this well.

      Let's say I have 3 countries: USA with 150 observations, China with 140 observations and Italy with 80 observations.
      What I need to do is create a dummy (let's call it "country100") that equals 1 if the observation is from USA or China (because they have more than 100 observations) and 0 if the observation is from Italy, because Italy has fewer than 100 observations in total.

      So my regression will be:

      Code:
       
       regress IV DV control1 control2 if country100 == 1
      Then I expect an output that considers only observations from China and USA, not Italy.
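
      A minimal sketch of how such a dummy might be built, assuming the country identifier is a variable named country (a hypothetical name) and counting every record, whether or not its regression variables are missing. The variable list is copied as written in this thread.

      Code:
      * within each by-group, _N is the number of observations in that group
      bysort country: gen byte country100 = _N > 100
      * restrict the regression to countries above the cutoff
      regress IV DV control1 control2 if country100 == 1
      With the example above, USA (150) and China (140) would get country100 equal to 1 and Italy (80) would get 0.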

      Thank you very much again.

      Alejandro



      • #4
        Sergiy Radyakin pointed you in a good direction, but there are other ways to do it.

        Code:
        regress IV DV control1 control2
        gen OK = e(sample)  
        egen nOK = total(OK), by(countryname)  
        regress IV DV control1 control2 if nOK >= 100
        Note that I have to make extra guesses about your variable names, as you have not given a data example, despite the request at https://www.statalist.org/forums/help#stata

        Your proposed regression might benefit from some thought about error structure.
        Last edited by Nick Cox; 23 Sep 2020, 03:50.



        • #5
          Dear Nick,

          Thank you for your answer. I solved my problem.
          Now I would like to ask you about your comment "Your proposed regression might benefit from some thought about error structure".
          What do you mean by that, please? Is it because I wrote IV first and then DV?
          Thank you,

          Alejandro



          • #6
            No. I mean that you have clusters of observations pooled together.
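
            One common way to allow for that kind of clustering is cluster-robust standard errors. The sketch below is only an illustration under assumptions, not necessarily what is best for these data: it reuses the guessed names countryname and nOK from #4, and cid is a hypothetical helper variable.

            Code:
            * numeric identifier for each country (works for string or numeric countryname)
            egen long cid = group(countryname)
            * same regression as in #4, with standard errors clustered on country
            regress IV DV control1 control2 if nOK >= 100, vce(cluster cid)
            Whether country is the right level of clustering depends on the structure of the data.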



            • #7
              Dear Nick, could you tell me more about that, please? I think you are seeing something that I don't, and I am experiencing some problems.
              I am struggling at the moment because I feel confused.

              I have data for firms in different countries over a time span from 1990 to 2017.

              My dependent variable is R&D intensity. I exclude values below 0, since investment can't be negative, and values larger than 1, because I assume a firm can't invest more than its sales in a period. My independent variable is a dummy.

              Now, I have doubts about xtset. After reading the Stata manuals, I was tempted to use
              Code:
              xtset firm year
              but when I used it, the regression was not concave, and I am not sure whether time matters, so I was thinking of using
              Code:
              xtset firm
              instead. However, since I would like to account for the variance across countries, because of the ecological fallacy, I am not sure whether I should use
              Code:
              xtset country
              instead.

              Because of the ecological fallacy, I was planning to use
              Code:
              xtset firm year
              and then the option
              Code:
              vce(cluster country)
              but then I realised that you were drawing my attention to exactly that. Finally, a friend (also a PhD student) told me to use country as a fixed effect, but I already have industry and year fixed effects, and now I am considering firm and country too.

              As you can see, my thoughts are a mess at the moment. I have read some Stata manuals and understand some of the usage, but I am not sure what I should use. The same goes for the regression: I see some research using tobit, fracreg logit or even OLS.

              Having said that, and knowing that my explanation is quite confusing, I would really appreciate any advice you have for me. I am moving forward in my dissertation only because I receive feedback and help here.

              (I am using stata 15).

              Thank you so much again,

              Alejandro



              • #8
                Good questions, but you'd be better advised now by people in econometrics and applied economics who work with these kinds of models.

