Random sample by unique observations

Thomas Engstrom

Join Date: Feb 2020

Posts: 5
#1

16 Mar 2020, 04:59

Hello Statalist,

I'm currently struggling with my dataset. I have a large dataset of around 7 000 000 observations of companies in Sweden from 2004 to 2017.

These companies will be divided into two periods: Period=1 is 2010-2017 and Period=0 is 2004-2009.
They will also be divided by whether they can opt out of an audit: Audit=1 means they can't opt out, and Audit=0 means they can.

The dataset contains yearly observations of each company's annual report, so one company usually appears in multiple observations if it has submitted an annual report for more than one year. (Please see http://prntscr.com/rh3vcc for an extract from my data.)

So, what I want to do is take a random sample of 1500 unique companies by period and audit. I do not want one company to be drawn into my sample several times, but I still want to keep the "duplicate" observations, i.e. all the yearly observations of each sampled company. I need this because I will have to manually check each company and all its annual reports in another database, so I need to know which year each annual report in my Stata dataset is from.

Is this possible? Or do I need to rethink the whole thing...

Kind regards,

Thomas
Thomas Engstrom

Join Date: Feb 2020

Posts: 5
#2

16 Mar 2020, 05:46

For what it's worth, I think I need to create a new variable that gives a value based on the uniqueness of the variable "orgnr" (company number). However, from what I can find, if one orgnr appears 2 times, both observations will get the value 2; if it appears 1 time, it will get the value 1, and so on. What I think I need instead is a variable that gives the value 1 on the first observation, 2 on the second, and so on. After that I can draw my sample.

Am I going about this correctly?
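
The difference between the two kinds of variable described above (a group count repeated on every row versus a running number within each company) can be sketched as follows. This is only an illustration in Python on made-up toy rows, not Stata code; the names `rows`, `n_total`, and `seq` are hypothetical. In Stata the same distinction usually corresponds to `_N` versus `_n` under `bysort orgnr:`.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one row per (orgnr, year) annual report.
rows = [
    {"orgnr": "A", "year": 2005},
    {"orgnr": "A", "year": 2006},
    {"orgnr": "B", "year": 2011},
    {"orgnr": "A", "year": 2012},
]

# Group size: every row of a company gets the same value.
group_size = Counter(r["orgnr"] for r in rows)

# Running counter: 1 on a company's first row, 2 on its second, and so on.
seen = defaultdict(int)
for r in rows:
    seen[r["orgnr"]] += 1
    r["n_total"] = group_size[r["orgnr"]]
    r["seq"] = seen[r["orgnr"]]

# Keeping only seq == 1 leaves exactly one row per company.
first_rows = [r for r in rows if r["seq"] == 1]
```

Rows with `seq == 1` form a one-row-per-company list that a sample can then be drawn from.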
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

17 Mar 2020, 14:23

While there may be a more efficient way to do this, one way to do it would be to create a data set with one observation per company, do your random selection on this data set, and then merge it back with the original set.
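
A minimal sketch of that three-step idea (one entry per company, random draw, merge back), written in Python on made-up toy rows rather than in Stata. The function name `sample_companies`, the seed, and the toy data are assumptions for illustration only, and the sketch does not stratify by period or audit.

```python
import random

def sample_companies(rows, k, seed=12345):
    """Return every row belonging to k randomly chosen companies."""
    # Step 1: one entry per company (sorted so the draw is reproducible).
    companies = sorted({r["orgnr"] for r in rows})
    # Step 2: draw k distinct companies without replacement.
    # random.sample raises ValueError if k exceeds the number of companies.
    rng = random.Random(seed)
    chosen = set(rng.sample(companies, k))
    # Step 3: "merge back" by keeping all yearly rows of the chosen companies.
    return [r for r in rows if r["orgnr"] in chosen]

# Hypothetical toy data: one row per (orgnr, year) annual report.
rows = [
    {"orgnr": "A", "year": 2005},
    {"orgnr": "A", "year": 2006},
    {"orgnr": "B", "year": 2011},
    {"orgnr": "C", "year": 2008},
    {"orgnr": "C", "year": 2012},
]

sampled = sample_companies(rows, k=2)
```

Each sampled company is drawn once, but all of its yearly rows (and therefore the year of every annual report) stay in the result.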
Thomas Engstrom

Join Date: Feb 2020

Posts: 5
#4

23 Mar 2020, 04:17

Originally posted by Phil Bromiley

While there may be a more efficient way to do this, one way to do it would be to create a data set with one observation per company, do your random selection on this data set, and then merge it back with the original set.

Thanks for your reply. Yes, this is exactly what I did.