  • Keeping top 100 observations in a variable

    Dear Stata users,

    I wanted to keep the top 100 observations in the lPCINC (log per capita income) variable to see if that improves the significance level of my explanatory variables.

    I sorted lPCINC first from highest to lowest:

    gsort -lPCINC

    And this is what the top 50 observations look like:
    1. | 11.32131 |
    2. | 11.30339 |
    3. | 11.27098 |
    4. | 11.23771 |
    5. | 11.20342 |
    |----------|
    6. | 11.17617 |
    7. | 11.10413 |
    8. | 11.10041 |
    9. | 11.09818 |
    10. | 11.08996 |
    |----------|
    11. | 11.08294 |
    12. | 11.04685 |
    13. | 10.96996 |
    14. | 10.96204 |
    15. | 10.95771 |
    |----------|
    16. | 10.94856 |
    17. | 10.94265 |
    18. | 10.89704 |
    19. | 10.88772 |
    20. | 10.86662 |
    |----------|
    21. | 10.85414 |
    22. | 10.84457 |
    23. | 10.83262 |
    24. | 10.83187 |
    25. | 10.81677 |
    |----------|
    26. | 10.7861 |
    27. | 10.78506 |
    28. | 10.78502 |
    29. | 10.78423 |
    30. | 10.77103 |
    |----------|
    31. | 10.76079 |
    32. | 10.7559 |
    33. | 10.75066 |
    34. | 10.75064 |
    35. | 10.74663 |
    |----------|
    36. | 10.74329 |
    37. | 10.73217 |
    38. | 10.72946 |
    39. | 10.72887 |
    40. | 10.72223 |
    |----------|
    41. | 10.71633 |
    42. | 10.71259 |
    43. | 10.70782 |
    44. | 10.70773 |
    45. | 10.70672 |
    |----------|
    46. | 10.70518 |
    47. | 10.70162 |
    48. | 10.68917 |
    49. | 10.68757 |
    50. | 10.68689 |

    I would like to create a new variable called toplPCINC that holds the top 100 observations from the sorted lPCINC. I haven't learned much about Stata yet, but I tried a few solutions from similar questions and none worked.

    I'm not sure whether my explanatory variables are needed, so I'll include them here just in case: (1) exp_pop_density2 and (2) Total. Both are numeric.

    I would really appreciate it if anyone could help!

  • #2
    Code:
    gen wanted = lPCINC in 1/100
    is an answer to your question, given that the data are already sorted with gsort -lPCINC. However, reducing the range of values in a relationship almost always reduces explanatory power, and in any case arbitrary selection of data to improve significance tests is (personal opinion) of limited scientific or statistical value.
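
    A minimal sketch of the same idea on simulated data, in case it helps to see the steps end to end. The variable name lPCINC comes from the question; the data values, the seed, and the indicator name top100 are assumptions made up for illustration. It relies on gsort placing missing values last in a descending sort unless the mfirst option is given.

```stata
* Simulated stand-in data: the name lPCINC mirrors the thread, the values are made up
clear
set seed 12345
set obs 500
generate lPCINC = rnormal(10, 0.5)
replace lPCINC = . in 1/5          // a few missing values, to show they sort last

* Descending sort; gsort puts missing values last unless mfirst is specified
gsort -lPCINC

* Option A (as in #2): copy the 100 largest values, missing everywhere else
generate toplPCINC = lPCINC in 1/100

* Option B: a 0/1 indicator (hypothetical name) usable with -if- in later commands
generate byte top100 = _n <= 100

* Check that exactly 100 nonmissing values were kept
count if !missing(toplPCINC)
summarize lPCINC if top100
```

    The indicator in Option B is often more convenient than a copy of the values, because later commands can be restricted with if top100 without creating a second income variable. If lPCINC has tied values around the 100th position, which of the tied observations land in the top 100 is arbitrary.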


    • #3
      Thank you so much, Mr. Cox, for your code and advice! I wanted to select the top 100 observations based on log per capita income because they are more representative of bigger cities (with greater population), which is the topic of my group study. We didn't state a range of cities, but this selection, based on the data we have, seems more suitable for our topic. Also, smaller cities would be less agglomerated and have less homogeneous traits in explaining variations in income (personal opinion based on the basics of urban economics that I've learned). We will include both the regression with the original data and the one with the selected observations to make a comparison, so please don't worry. I was wrong about improving the significance level, though, as you said. This is my first econometrics project and I still have a lot to learn. Thank you again!
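
      One way the comparison described above might be set up is sketched below on simulated data. It assumes lPCINC is the outcome and exp_pop_density2 and Total are the regressors, which the thread suggests but does not state outright. The data-generating values, the indicator top100, and the stored-estimate names full and top are all made up for illustration.

```stata
* Simulated stand-in data: names follow the thread, values and coefficients are made up
clear
set seed 2024
set obs 500
generate exp_pop_density2 = runiform()
generate Total = rnormal(50, 10)
generate lPCINC = 10 + 0.3*exp_pop_density2 + 0.01*Total + rnormal(0, 0.4)

* Flag the 100 observations with the highest lPCINC
gsort -lPCINC
generate byte top100 = _n <= 100

* Regression on the full sample
regress lPCINC exp_pop_density2 Total
estimates store full

* Same regression restricted to the flagged observations
regress lPCINC exp_pop_density2 Total if top100
estimates store top

* Coefficients, standard errors, N, and R-squared side by side
estimates table full top, b(%9.4f) se stats(N r2)
```

      Under these assumptions the second regression selects on the outcome itself, which is the situation the caution in #2 about a reduced range of values applies to.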
