Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop row where value is lower than another for a specific column

    Hi all,
    I am looking to clean a data set I have. I think what I want to achieve should only take a few lines of code, but I can't seem to figure out what they should be! My data set has population density buckets for every county in the US. However, most counties have at least two different population density's within one county boundary. For any county that has more than one row of population density buckets, I want to drop all rows equal to that county except for the highest one. For example, county Adair has a population density of Under 2 and 2 to 6- I want to drop the row where Adair is = Under 2. But, I have 2,000 observations so is there some sort of loop that can do this for me?

    The second factor that makes this data cleaning difficult is that there are county names that appear in multiple states. For example, the county name of Alleghany (shown at the bottom of the data example) appears in 2 different states (state number 37 and 51). I want to make sure that when lower population densities are being dropped, they are being dropped for only the specific state, and not deleting the observation of a separate county. i.e. dropping the observation where NAME=Alleghany for STATE=37 and Pop_density_string=Under 2 and where NAME=Alleghany for STATE=51 and Pop_density_string=Under 2.

    My data looks like this:
    STATEFP NAME Pop_density_string Population_Density
    45 Abbeville 6 to 18 5
    51 Accomack 18 to 45 17
    21 Adair Under 2 0
    21 Adair 2 to 6 1
    17 Adams Under 2 0
    18 Adams Under 2 0
    28 Adams Under 2 0
    39 Adams Under 2 0
    42 Adams 18 to 45 17
    55 Adams Under 2 0
    50 Addison 2 to 6 1
    50 Addison 6 to 18 5
    45 Aiken 6 to 18 5
    45 Aiken 2 to 6 1
    27 Aitkin Under 2 0
    37 Alamance 6 to 18 5
    37 Alleghany 6 to 18 5
    37 Alleghany Under 2 0
    51 Alleghany 6 to 18 5
    51 Alleghany Under 2 0



    STATEFP is a ID number for each state, NAME is the name of a specific county, Pop_density_string specifies the buckets of population density, and Population_Density are the numbers I generated to match the text buckets listed in Pop_density_string.
    Any advice on how to carry out this task? Thank you!




  • #2
    Can you use -dataex- to put your data in a form we can import?

    Otherwise, something like this should do the trick:

    Code:
    bysort STATEFP NAME (Pop_density_string): keep if _n==_N
    and we will have a problem with your Pop_density_string because it is an awkward variable and I do not know how it will sort.

    Comment


    • #3
      Thank you Joro! You were right, using Pop_density_string didn't sort correctly but when I used the variable Population_Density it worked perfectly. So the code ends up being: bysort STATEFP NAME (Population_Density): keep if _n==_N.

      Thanks again for your help!

      Comment


      • #4
        You are welcome, Hena! I am glad that you figured it out, and it was so easy, I myself looked at Population_Density and the only thing I could see was that it was always to the left of the bin of Pop_density_string, I did not know whether it would do the sorting we want or not.

        As a general rule when you work with numerical software like Stata: The less you use words (strings valued variables), the less headache you will have. E.g., if you have Yes/No variable, it is better to code it as 1/0, and then if you want to beautify the appearance add labels. In your case it would have been more convenient to give Pop_density_string some numerical values, say the left ends of the bins, and then if you want to beautify, put the labels spelling out the range of the bin. Or maybe have two numerical variables, one for the left end of the bin, one for the right.

        Originally posted by Hena Matthias View Post
        Thank you Joro! You were right, using Pop_density_string didn't sort correctly but when I used the variable Population_Density it worked perfectly. So the code ends up being: bysort STATEFP NAME (Population_Density): keep if _n==_N.

        Thanks again for your help!

        Comment

        Working...
        X