  • Remove duplicate observations

    Hi all,

    I use data from the American Housing Survey and I have the following issue with preparing the data set:

    As the survey always returns to the same homes, certain observations have to be deleted. For example, a house that was purchased in 2010 will be interviewed again in 2011 and 2013; the 2013 interview may then be removed because it is a duplicate observation. How can I write the code for this in Stata? Thanks!

  • #2
    Welcome to Statalist.

    Without understanding your data, the most that can be said is
    1. create a variable that takes the value 1 (think "true") for the observations you want to drop and 0 (think "false") otherwise; suppose you call it "doubleobs"
    2. use the command "drop if doubleobs==1"
    For more complete advice, please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

    The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
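
    As a minimal sketch of steps 1 and 2 above, on made-up data: the variable names house_id and year are hypothetical (not actual AHS names), and the rule "keep only the earliest interview of each house" is an assumption that would need to be replaced by whatever rule fits the real data.

    ```stata
    * Toy data; house_id and year are hypothetical names, not AHS variables
    clear
    input house_id year
    1 2011
    1 2013
    2 2011
    3 2013
    end

    * Step 1: indicator that is 1 for observations to drop, 0 otherwise.
    * Assumed rule: within a house, every interview after the first is a double.
    bysort house_id (year): generate byte doubleobs = (_n > 1)

    * Step 2: drop the flagged observations
    drop if doubleobs == 1

    list, sepby(house_id)
    ```

    Sorting on year inside the parentheses fixes the order within each house, so _n == 1 is always the earliest interview.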



    • #3
      Hi William,

      Thank you for your response.

      The problem is that I don't know which observations or lines in my data set need to be removed.

      I have a large set of data retrieved from the American Housing Survey (AHS), a survey that is held every 2 years.
      So I need to find a command to (1) detect duplicate observations (since the survey always returns to the same homes), to make sure that each house has been 'surveyed' in only one particular year (and not twice or more).

      In particular, I guess that I need some kind of command that scans all lines (read: observations) and detects lines with the same responses to the variables/questions.
      And then (2) remove these observations before running regressions on the remaining observations.

      Hope the attached screenshot can clarify my question.
      If not, let me know what you need in order to understand or formulate an answer.

      Thanks.




      • #4
        First, some general advice for making effective use of Statalist.

        Please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

        The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

        With that said, without knowing how you are planning on using this data, I cannot recommend starting by deleting observations. Some of the characteristics will change from survey to survey, even if the same householder is living in the housing unit.



        • #5
          Mat - take a look at "help duplicates tag" to follow William's advice in #2 (about creating a variable that you can use to tag observations for possible deletion). And, instead of deleting the observations, you might just want to create a variable called in_sample and set it to 1 for observations you want to include in your regressions. Then, if you later change your mind, the other observations are still there.

          I thought that the AHS had a variable like CONTROL or CONTROLM that acted like a household_id so you can track households across survey years. See https://www.census.gov/data-tools/de...s/ahsdict.html
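
          A rough sketch of that approach on made-up data. It assumes CONTROL really is the housing-unit identifier (please check the codebook), that a variable named year holds the survey year (a hypothetical name), and that the earliest interview is the one to keep (an assumed rule).

          ```stata
          * Toy data; CONTROL as the unit identifier and year are assumptions
          clear
          input str4 CONTROL year
          "A001" 2011
          "A001" 2013
          "B002" 2011
          "C003" 2013
          end

          * dup counts the other observations with the same CONTROL
          * (0 means the unit appears only once)
          duplicates tag CONTROL, generate(dup)

          * Flag one interview per unit instead of dropping the rest
          bysort CONTROL (year): generate byte in_sample = (_n == 1)

          tabulate dup in_sample
          ```

          Regressions can then be restricted with "if in_sample == 1", and nothing is lost if the rule changes later. Note that duplicates tag without a variable list compares observations on all variables, so it only flags rows that are identical in every response.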

