Hi everyone,
I have a repeated cross-sectional dataset, the French Base Tous Salariés, covering 2015 to 2022. Because I am a master's student and not a full researcher, I only have the anonymised version, in which individuals cannot be tracked over time. Hence, I want to turn it into a pseudo-panel dataset. The issue is that most of the variables in this dataset are categorical, so I can't take a cohort average of them.
I'm trying to assess the impact of mandatory gender pay gap reporting legislation on pay gaps. The legislation applies only to firms with more than 50 employees and was introduced in 2020, so a triple-difference (DiDiD) regression is my ideal specification.
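For concreteness, a triple-difference specification of this kind could be written along the following lines. This is only a sketch in illustrative notation, not taken from any particular paper: $w_{ift}$ is the salary of worker $i$ in firm $f$ in year $t$ (using the log is an assumption), $F_i = 1$ for women, $L_f = 1$ for firms above the 50-employee threshold, $P_t = 1$ for years from 2020 onwards, and $X_{ift}$ stands for whatever controls are available.

```latex
\[
\begin{aligned}
\log w_{ift} ={}& \beta_0 + \beta_1 F_i + \beta_2 L_f + \beta_3 P_t \\
&+ \beta_4 (F_i \times L_f) + \beta_5 (F_i \times P_t) + \beta_6 (L_f \times P_t) \\
&+ \beta_7 (F_i \times L_f \times P_t) + X_{ift}'\gamma + \varepsilon_{ift}
\end{aligned}
\]
```

Under this notation, $\beta_7$ is the coefficient of interest: the post-2020 change in the female-male pay difference in firms covered by the legislation, relative to the same change in firms below the threshold.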
Key variables of interest:
Continuous:
- Salary -- dependent variable
- Age -- used for constructing birth-year cohorts
- Number of hours worked
Categorical:
- Employer industry (a6/17/38 categorical classification)
- Job-type
- Department/region of residence
- Firm size, in tranches
- Sex -- binary (within the data at least) so I could create an "average sex per cohort" -- papers do use sex as a cohort category but
As I understand it, the criteria that define a cohort cannot vary over time, so birth year is suitable, as is sex. I've seen other papers also suggest region of residence, as it is mostly unvarying. This still leaves me with the issue that all variables not used to define the cohorts (job-type, employer industry) would be dropped from the data, as you can't take an average of a categorical variable. I've been trawling papers for suggestions on how to get around this, but my searches have mostly been in vain.
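To make the problem concrete, here is a minimal sketch of the cohort collapse, assuming made-up rows and hypothetical field names (none of them come from the actual dataset). It also shows one possible workaround that extends the "average sex per cohort" idea: a categorical variable has no mean, but the mean of each category's 0/1 indicator is simply the share of the cohort in that category, which is a continuous cell-level variable.

```python
from collections import defaultdict

# Hypothetical individual-level rows, invented for illustration only:
# (survey_year, birth_year, sex, salary, industry)
rows = [
    (2019, 1980, "F", 30000.0, "manufacturing"),
    (2019, 1980, "F", 34000.0, "services"),
    (2019, 1980, "M", 36000.0, "services"),
    (2019, 1980, "M", 40000.0, "services"),
    (2020, 1980, "F", 32000.0, "services"),
    (2020, 1980, "F", 33000.0, "services"),
]


def build_cohort_cells(rows):
    """Collapse individuals into (survey_year, birth_year, sex) cells.

    Returns a dict mapping each cell to its size, its mean salary, and
    the share of the cell falling in each industry category.
    """
    grouped = defaultdict(list)
    for year, birth_year, sex, salary, industry in rows:
        grouped[(year, birth_year, sex)].append((salary, industry))

    cells = {}
    for key, members in grouped.items():
        n = len(members)
        mean_salary = sum(salary for salary, _ in members) / n
        counts = defaultdict(int)
        for _, industry in members:
            counts[industry] += 1
        # Mean of each category's indicator = share of the cell in that category.
        shares = {industry: count / n for industry, count in counts.items()}
        cells[key] = {"n": n, "mean_salary": mean_salary, "industry_shares": shares}
    return cells


cells = build_cohort_cells(rows)
```

Whether cohort-level shares are an acceptable substitute for the individual-level categories in the regression is a separate question that this sketch does not settle.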
Does anyone here know how to deal with this issue? Any help would be much appreciated!!
I also have a secondary issue around cohort sizes. Obviously they need to be of a certain size to limit intra-cohort measurement error (papers tend to say >100 is preferable, with <30 being untenable). With certain cohort-creation criteria, the mean cohort size has been about 450 observations, but a full 1/3 of those cohorts are below 30. Papers tend to drop these small cohorts, but in all the papers I've read that do this, the cohorts below 30 make up 10-15% of observations, not 33%. What should I do in this case?
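One check that may be worth running here: the share of cohorts below 30 and the share of observations sitting in those cohorts are different quantities, and they can diverge a lot when cohort sizes are skewed. A minimal sketch, using invented cohort sizes and a hypothetical helper name:

```python
def cell_size_summary(cell_sizes, threshold=30):
    """Return (share of cells below threshold, share of observations in them)."""
    small = [n for n in cell_sizes if n < threshold]
    share_of_cells = len(small) / len(cell_sizes)
    share_of_observations = sum(small) / sum(cell_sizes)
    return share_of_cells, share_of_observations


# Invented sizes: two large cohorts and one tiny one (mean size 450).
sizes = [900, 435, 15]
share_cells, share_obs = cell_size_summary(sizes)
```

In this made-up example a third of the cohorts fall below the threshold, yet they hold only 15 of the 1350 observations (about 1.1%), so dropping them would cost far less data than the cohort count suggests. The real figures may of course look quite different.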
Again, many thanks to anyone who can respond!