Keeping top 100 observations in a variable

Yifei Zang

Join Date: Dec 2022

Posts: 2
#1

09 Dec 2022, 13:35

Dear Stata users,

I want to keep the 100 observations with the highest values of lPCINC (log per capita income) to see whether that improves the significance level of my explanatory variables.

I sorted lPCINC first from highest to lowest:

gsort -lPCINC

And this is what the top 50 observations look like:
1. | 11.32131 |
2. | 11.30339 |
3. | 11.27098 |
4. | 11.23771 |
5. | 11.20342 |
|----------|
6. | 11.17617 |
7. | 11.10413 |
8. | 11.10041 |
9. | 11.09818 |
10. | 11.08996 |
|----------|
11. | 11.08294 |
12. | 11.04685 |
13. | 10.96996 |
14. | 10.96204 |
15. | 10.95771 |
|----------|
16. | 10.94856 |
17. | 10.94265 |
18. | 10.89704 |
19. | 10.88772 |
20. | 10.86662 |
|----------|
21. | 10.85414 |
22. | 10.84457 |
23. | 10.83262 |
24. | 10.83187 |
25. | 10.81677 |
|----------|
26. | 10.7861 |
27. | 10.78506 |
28. | 10.78502 |
29. | 10.78423 |
30. | 10.77103 |
|----------|
31. | 10.76079 |
32. | 10.7559 |
33. | 10.75066 |
34. | 10.75064 |
35. | 10.74663 |
|----------|
36. | 10.74329 |
37. | 10.73217 |
38. | 10.72946 |
39. | 10.72887 |
40. | 10.72223 |
|----------|
41. | 10.71633 |
42. | 10.71259 |
43. | 10.70782 |
44. | 10.70773 |
45. | 10.70672 |
|----------|
46. | 10.70518 |
47. | 10.70162 |
48. | 10.68917 |
49. | 10.68757 |
50. | 10.68689 |

I would like to create a new variable called toplPCINC that holds the 100 highest values of the sorted lPCINC. I haven't learned much about Stata yet, but I tried a few solutions from similar questions and none of them worked.

I'm not sure whether my explanatory variables are needed, so I'll include them here just in case: (1) exp_pop_density2 and (2) Total. Both are numeric.

I would really appreciate it if anyone could help!
Nick Cox

Join Date: Mar 2014

Posts: 35651
#2

09 Dec 2022, 14:11

Code:

gen wanted = lPCINC in 1/100

is an answer to your question. However, reducing the range of values in a relationship almost always reduces explanatory power, and in any case arbitrary selection of data to improve significance tests is (personal opinion) of limited scientific or statistical value.
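
For completeness, a minimal sketch of the whole sequence, assuming the data are in memory and hold at least 100 observations with non-missing lPCINC (otherwise `in 1/100` fails or picks up missing values). The name toplPCINC is the one asked for in #1; the line above uses wanted instead. If values are tied at the cutoff, which of the tied observations lands in row 100 is arbitrary.

```stata
* Sort from highest to lowest; gsort places missing values last
* in a descending sort unless the mfirst option is specified
gsort -lPCINC

* Copy lPCINC into the new variable for the first 100 rows only;
* every other row gets a missing value
generate toplPCINC = lPCINC in 1/100

* Sanity check: this should report 100 under the assumption above
count if !missing(toplPCINC)
```

Nothing is dropped from the dataset here, so the full sample stays available for other analyses.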
Yifei Zang

Join Date: Dec 2022

Posts: 2
#3

09 Dec 2022, 14:57

Thank you so much, Mr. Cox, for your code and advice! I wanted to select the top 100 observations by log per capita income because they are more representative of bigger cities (those with greater population), which are the topic of my group study. We didn't state a range of cities, but this selection, based on the data we have, seems more suitable for our topic. Smaller cities would also be less agglomerated and have less homogeneous traits in explaining variations in income (a personal opinion based on the basics of urban economics that I've learned). We will include both the regression on the original data and the one on the selected observations to make a comparison, so please don't worry. I was wrong about improving the significance level, though, as you said. This is my first econometrics project and I still have a lot to learn. Thank you again!
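
The comparison described above could be sketched roughly as follows. This is only an illustration under assumptions not confirmed in the thread: that lPCINC is the outcome and that exp_pop_density2 and Total are the only regressors. The indicator top100 and the stored-estimate names full and top are hypothetical names chosen for this sketch.

```stata
* Assumption: lPCINC is the outcome; exp_pop_density2 and Total are the regressors
gsort -lPCINC

* Hypothetical indicator: 1 for the 100 highest non-missing values of lPCINC, 0 otherwise
generate byte top100 = _n <= 100 & !missing(lPCINC)

* Regression on the original data
regress lPCINC exp_pop_density2 Total
estimates store full

* Same regression restricted to the selected observations
regress lPCINC exp_pop_density2 Total if top100
estimates store top

* Coefficients, standard errors, sample sizes and R-squared side by side
estimates table full top, b se stats(N r2)
```

Using an indicator with `if` rather than dropping observations keeps both samples in one dataset, so the two regressions can be rerun in any order.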