
  • Random sample by unique observations

    Hello Statlists,

    I'm currently struggling with my dataset. It is large: around 7 000 000 observations of Swedish companies covering 2004-2017.

    The observations will be divided into two periods: Period=1 is 2010-2017 and Period=0 is 2004-2009.
    They will also be divided by whether the company can opt out of an audit: Audit=1 means it cannot opt out, and Audit=0 means it can.

    The dataset contains yearly observations of a company's annual report, so one company is usually included in multiple observations if they have submitted an annual report for more than 1 year. (please see: http://prntscr.com/rh3vcc for an extract from my data)

    What I want to do is take a random sample of 1500 unique companies by period and audit. No company should be drawn into my sample more than once, but for each sampled company I still want to keep all of its yearly ("duplicate") observations. I need this because I will have to manually check each company and all of its annual reports in another database, so I need to know which year each annual report in my Stata dataset is from.

    Is this possible? Or do I need to rethink the whole thing...

    Kind regards,

    Thomas

  • #2
    For what it's worth, I think I need to create a new variable whose value is based on the uniqueness of the variable "orgnr" (company number). However, from what I can find, if one orgnr appears 2 times, both observations get the value 2; if it appears 1 time, it gets the value 1; and so on. What I think I need instead is a variable that takes the value 1 on a company's first observation, 2 on its second, and so on. After that I can draw my sample.

    Am I going about this correctly?
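
    The two kinds of counter described above can be sketched in Stata as follows. This is only an illustration: "orgnr" is the variable named in the post, while "year" (the report year) and the new variable names are assumptions.

    ```stata
    * Assumed variable names: orgnr (company number) and year (report year).

    * Running counter within company: 1 on the first report, 2 on the second, ...
    bysort orgnr (year): generate seq = _n

    * Total number of reports per company: the same value on every row of a
    * company (the behaviour described above as "both observations get 2")
    bysort orgnr: generate n_reports = _N

    * Flag exactly one row per company, e.g. to count the unique companies
    egen byte first = tag(orgnr)
    count if first
    ```

    Rows with seq == 1 (or first == 1) then give one observation per company, which could serve as the starting point for drawing a sample of companies rather than of annual reports.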


    • #3
      While there may be a more efficient way to do this, one way to do it would be to create a data set with one observation per company, do your random selection on this data set, and then merge it back with the original set.
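
      A minimal sketch of that approach, under some assumptions: the variables are named orgnr, period and audit (lowercase), the sample is 1500 companies within each period/audit combination, and the seed value is arbitrary. If a company can fall into more than one period/audit combination, it could be drawn in more than one of them, so the key would need adjusting in that case.

      ```stata
      * Assumed variable names: orgnr, period, audit. The seed is arbitrary.
      set seed 12345

      preserve
          * Reduce to one row per company within each period/audit cell
          keep orgnr period audit
          duplicates drop

          * Draw 1500 companies per cell (all of them if a cell has fewer)
          sample 1500, count by(period audit)

          tempfile chosen
          save `chosen'
      restore

      * Keep every yearly observation of the sampled companies
      merge m:1 orgnr period audit using `chosen', keep(match) nogenerate
      ```

      Setting the seed makes the draw reproducible, which matters when the sampled companies are later checked by hand against another database.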


      • #4
        Originally posted by Phil Bromiley:
        While there may be a more efficient way to do this, one way to do it would be to create a data set with one observation per company, do your random selection on this data set, and then merge it back with the original set.
        Thanks for your reply. Yes, this is exactly what I did.
