Hello,
I have a huge unbalanced panel data file (around 30 million ID-year observations).
I want to create a random sample but keep all years' observations for a particular company ID.
The main data file includes companies from 13 countries (variable: country_code).
How can I create a random sample of 100,000 different IDs that keeps all years' observations for each sampled ID and represents all 13 countries? Do I need to sort the data first, for example with sort id year?
How would the code differ if I wanted a percentage-based sample, for example a sample of 10% of the full data, still representing all 13 countries?
Here is an example of the variables that are important for the sample:
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input str18 id float year str2 country_code
"IT00832220156" 1999 "IT"
"GB01180514" 2018 "GB"
"SE5560525692" 2020 "SE"
"FR826680084" 2007 "FR"
"IT00180310781" 2015 "IT"
"ESA38403028" 2019 "ES"
"FR418872081" 2006 "FR"
"ESB04105847" 2011 "ES"
"BE0819252201" 2010 "BE"
"SE5562265917" 2016 "SE"
"FR378281935" 2000 "FR"
"ESB81518623" 2009 "ES"
"ESB10343051" 2017 "ES"
"ESB75041228" 2015 "ES"
"IT02564770903" 2019 "IT"
"FI22585062" 2019 "FI"
"IT10607171005" 2016 "IT"
"IT01817990276" 2015 "IT"
"FR381004415" 2011 "FR"
"IT01861600342" 1998 "IT"
"FI17091885" 2012 "FI"
"IT01777250976" 2008 "IT"
"ESB31777808" 2009 "ES"
"SE5569792962" 2020 "SE"
"IT09727470966" 2021 "IT"
"IT03005730738" 2018 "IT"
"SE5563133395" 2011 "SE"
"DK31170427" 2021 "DK"
"FR338466295" 1998 "FR"
"ESB97281521" 2019 "ES"
"ESB15665748" 2005 "ES"
"SE5564657269" 2013 "SE"
"IT01930040561" 2013 "IT"
"ESB04251955" 2013 "ES"
"ESB14361810" 2003 "ES"
"ESB12742821" 2011 "ES"
end
[/CODE]
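One possible approach, given here only as an untested sketch, is to sample company IDs rather than observations and then merge the sampled IDs back onto the full panel, so that every year of a sampled company is kept. The sketch assumes that each id belongs to exactly one country_code; the seed value and the tempfile name picked are arbitrary placeholders. Note that a 10% sample of IDs will only be roughly 10% of the observations, because the panel is unbalanced. No prior sorting should be needed for these commands.
[CODE]
set seed 12345                       // arbitrary seed, only for reproducibility

* Build a list with one row per company
preserve
keep id country_code
duplicates drop
isid id                              // stops with an error if an id has more than one country

* Option A: 10% of the companies within each country
sample 10, by(country_code)

* Option B (use instead of Option A): roughly 100,000 companies in total,
* allocated to countries in proportion to their number of companies.
* Rounding within countries means the total may not be exactly 100,000.
* local pct = 100 * 100000 / _N
* sample `pct', by(country_code)

keep id
tempfile picked
save `picked'
restore

* Keep all years of every sampled company
merge m:1 id using `picked', keep(match) nogenerate
[/CODE]
Sampling within by(country_code) is what keeps every country represented. A plain draw of 100,000 IDs would very likely include all 13 countries as well, but would not guarantee it.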
Thank you for your help.