  • Pseudo Panel with Categorical Variables

    Hi everyone,


    I have a repeated cross-sectional dataset -- the French Base Tous Salariés from 2015 to 2022. Because I am a master's student and not a full researcher, I only have access to the anonymised version, in which individuals cannot be tracked over time. Hence, I want to turn it into a pseudo-panel dataset. The issue I have is that most of the variables in this dataset are categorical, so I can't simply take an average.

    I'm trying to assess the impact of mandatory gender pay gap reporting legislation on pay gaps. The requirement only applies to firms with more than 50 employees and was introduced in 2020, so a DiDiD (triple difference) is my ideal regression.

    Key variables of interest:

    Continuous:
    • Salary -- dependent variable
    • Age -- used for constructing birth-year cohorts
    • Number of hours worked
    Categorical:
    • Employer industry (A6/A17/A38 classification)
    • Job-type
    • Department/region of residence
    • Firm size, in tranches
    • Sex -- binary (within the data at least), so I could create an "average sex per cohort"; papers also use sex as a cohort-defining category

    The dataset is very large (24 million observations at the moment), so I can afford to create reasonably detailed cohorts before I run into the problem of too few observations per cohort.

    As I understand it, the criteria that define a cohort must not vary over time, so birth year is suitable, as is sex. I've seen other papers also suggest region of residence, since it is mostly unvarying. This still leaves me with the issue that all variables not used in cohort creation (job type, employer industry) would be dropped from the data, since you can't take the average of a categorical variable. I've been trawling papers for suggestions on how to get around this, but my searches have mostly been in vain. A sketch of the cohort construction I have in mind is below.
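
    For concreteness, this is roughly what I mean -- a sketch only, where birth_year, sex and region stand in for the actual variable names in the base:
    Code:
    * sketch: build cohort cells from time-invariant characteristics
    gen birth_year = year - age
    egen cohort = group(birth_year sex region), label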

    Does anyone here know how to deal with this issue? Any help would be much appreciated!!




    I also have a secondary issue around cohort sizes. Obviously they need to be of a certain size to limit intra-cohort measurement error (papers tend to say >100 is preferable, with <30 being untenable). With certain cohort-creation criteria, the mean cohort size has been about 450 observations, but a full third of the cohorts are below 30. Papers tend to drop these small cohorts, but in the papers I've read that do this, the cohorts under 30 make up 10-15% of observations, not 33%. What should I do in this case? (My size check is sketched below.)
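
    For reference, I'm measuring the cell sizes roughly like this, again with the placeholder names from above:
    Code:
    * sketch: count individuals per cohort-year cell and flag small cells
    bysort cohort year: gen cell_size = _N
    egen one_per_cell = tag(cohort year)
    count if one_per_cell & cell_size < 30
    * papers typically then drop the small cells, e.g. drop if cell_size < 30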

    Again, many thanks to anyone who can respond!

  • #2
    Can you make the case that job type and employer industry tend to be relatively stable over time, in the way that region of residence is mostly unvarying? My intuition is that those two variables shouldn't have much variation over time either. Another thought is that you may be able to use the mode of those two variables instead of the mean. In the best case, the mode is a fairly large majority within each cohort; if the responses within the cohort are fairly uniformly distributed, the mode may not be representative of the cohort (edit: in the same way that the mean assumes normality). The mode is the first statistic that comes to mind for nominal data, but you could also use other statistics, like the total or the proportion in an especially salient category, for example. A sketch of both options is below.
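
    For what it's worth, something along these lines would give you the within-cohort mode or the proportion in a salient category -- a sketch only, with placeholder names (cohort, year, job_type):
    Code:
    * within-cohort mode of a categorical variable (ties broken towards the smallest code)
    bysort cohort year: egen job_type_mode = mode(job_type), minmode
    * or the proportion in one especially salient category, e.g. category 3
    gen byte in_cat3 = (job_type == 3) if !missing(job_type)
    bysort cohort year: egen prop_cat3 = mean(in_cat3)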

    Papers tend to drop these small cohorts, but in the papers I've read that do this, the cohorts under 30 make up 10-15% of observations, not 33%. What should I do in this case?
    There are two issues here: sample size and bias. If, after dropping the 33%, you don't have enough cohorts to draw good inferences (the same rules of thumb apply: more than 30, ideally more than 100), then you should be worried about sample size. You should also think about whether the small cohorts are effectively missing at random, or whether they have something important in common. Are the small within-cohort sample sizes due to random chance, or do those cohorts share something that causes them to be small? If it's the latter, you should be worried about bias.

    Two solutions come to mind for both the number-of-cohorts and the bias problems: (1) rework your cohort criteria to get as few cohorts below 30 as possible, or (2) find an imputation method that works well in this context. If you decide to go the second route, that's worth a conversation with your advisor and a literature review to find the best method in its own right. There are ways to impute aggregate values on clustered data, but it's possible MICE on the aggregate data alone will be enough for your master's thesis; a sketch follows.
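
    If it helps, a minimal MICE sketch on the collapsed cohort-level file might look something like this -- all the variable names here are placeholders, and the right specification is something to settle with your advisor:
    Code:
    * sketch: chained-equations imputation on the cohort-level aggregates
    mi set wide
    mi register imputed mean_wage prop_large_firm
    mi impute chained (regress) mean_wage prop_large_firm = i.birth_year i.sex i.region, add(20) rseed(2025)
    mi estimate: regress mean_wage prop_large_firm i.year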
    Last edited by Daniel Schaefer; 05 Aug 2025, 15:50.



    • #3
      Thank you for your response Daniel!

      I think I've found a way forward. E.g. with job_field I'm doing
      Code:
      tabulate job_field, generate(job_field_)
      so I can record the proportion of each cohort doing each job, and then include job_field_* in the regression (collapse step sketched below).
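
      Concretely, the collapse step looks roughly like this -- a sketch only, where everything other than job_field_* is a placeholder name:
      Code:
      * sketch: collapse individuals to cohort-year cells,
      * turning the job_field_* dummies into within-cell proportions
      collapse (mean) ln_wage job_field_* (count) cell_size = ln_wage, by(cohort year)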

      I think my 33% apparent drop rate of small cohorts last time was a result of my not restricting the dataset to individuals aged 19 to 62, as I now have -- the working-age population of France -- so I had cohorts of, e.g., a single 82-year-old who is still employed in an industry. With the restricted age band, I now only have 800 cohorts with fewer than 30 individuals, out of a total of 66,000 (1.2%), and my average cohort size is 398.


      What's causing me despair now is my firm_size variable. My ideal panel regression would be reg ln_wage female##post_2020##large_firm, where large_firm = 1 if the firm an individual works in has more than 50 employees. My firm_size variable is in tranches, so values 1 to 5 are firms below 50 employees and 6 and 7 are above. Because of my workaround above, I now have the proportion of each cohort that works at a firm of size i, rather than a strict cohort dummy. It's not safe to assume that individuals never change firm size, so I can't include it in cohort creation, but I'm not sure how else to run this regression.



      • #4
        Dummy encoding like that is a good idea, I think. Proportions will not meet the normality assumption, but that likely doesn't matter much in this context.

        Because of my workaround above, I now have the proportion of each cohort that works at a firm of size i, rather than a strict cohort dummy.
        You should be able to sum the proportions in tranches 6 and 7 within each cohort to get the proportion above 50 employees, right? (Or, equivalently, you could recode the categorical variable to combine 6 and 7 before you dummy-encode.) Then you have a single proportion representing the people in the over-50-employees category, which you can treat as a continuous variable in your three-way interaction. That makes the interpretation a bit more complicated, but it's not too bad, especially considering you only have one continuous variable. I'd rescale the proportion to a percent to get a one-percentage-point unit-change interpretation of the coefficient. Then you get an "as the percent of people in a large firm increases" interpretation instead of a large firm/not large firm interpretation. You could conceptualize it as the percent probability that the idealized person represented by your cohort works at a large firm, rather than a strict binary. A sketch is below.
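
        A rough sketch of what I mean -- variable names beyond your own are placeholders, and the exact specification is of course yours to decide:
        Code:
        * sketch: combine tranches 6 and 7 into a single "share in large firms"
        * measure, rescale to percent, and enter it as the continuous term
        gen pct_large = 100 * (firm_size_6 + firm_size_7)
        * assumes sex is a cohort-defining criterion, so female is binary at the cohort level
        reg ln_wage i.female##i.post_2020##c.pct_large i.year, vce(cluster cohort)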



        • #5
          Recoding is a good idea! And yup, I did decide that combining 6 and 7 into a single proportion is the way to go. Thanks for your help!

          The paper I'm loosely basing this on is Blundell (2021), which looks at the British policy; that policy applies to firms with more than 250 employees. With regard to the parallel trends assumption, he limits his analysis to firms with between 150 and 350 employees -- a bandwidth of 100. I would like to do something similar, but my tranches are 1-9, 10-19, 20-49, 50-99, 100-249, etc., so I can't reasonably have a symmetric bandwidth around 50, as it's not fair to say that a firm with 1-9 employees is similar to a firm with 50-99 employees.

          When I do this, would looking at tranches 3 and 4 (i.e. 20-49 vs 50-99) be fine, even though they're asymmetric in terms of the number of employees? The number of observations in the individual-level dataset for 20-49 employees is actually significantly larger than for 50-99 (3.2 million vs. 2.5 million), but I've not seen asymmetric bandwidths before, so I'm not really sure how to approach it. There have been a couple of papers that use them in the case of RDDs, but obviously that's a different design. (The restriction itself is sketched below.)
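
          In practice I'd just restrict the sample to those two tranches before collapsing, something like this (using the tranche numbering above):
          Code:
          * sketch: keep the tranches just below and just above the 50-employee cutoff
          keep if inlist(firm_size, 3, 4)   // 20-49 vs 50-99 employees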



          • #6
            I'm not aware of a reason why the asymmetry would be a problem, but I'm not a DiD expert. Someone else may know more.



            • #7
               Mia: Is it the size of the dataset that is leading you to create a pseudo panel? You could just use DiDiD methods on the repeated cross sections themselves if you have sufficient memory; see the sketch below.
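
               For example, something along these lines directly on the individual-level data -- a sketch only, with the large_firm coding and error structure left as placeholders:
               Code:
               * sketch: triple difference run on the repeated cross sections themselves
               gen byte large_firm = (firm_size >= 4) if !missing(firm_size)   // adjust to however the tranches are coded
               reg ln_wage i.female##i.post_2020##i.large_firm i.year, vce(robust)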

