Percentile Dummies generations in consideration of missing values

Tuan Tran

Join Date: Jun 2025

Posts: 1
#1

01 Jun 2025, 19:04

Greetings,

I have some questions about generating percentile dummies. Many academic papers use a variable called "High_Var", which is a dummy equal to 1 when the firm-year observation is greater than the sample median.

My questions concern how the sample median is computed:
1. Say I split my sample into two sub-groups, Developed (D1) and Developing (D0) economies. When generating the High_Var dummy, do I use only the full sample's median (F1), or the sub-group-specific medians, that is, one median per sub-group: Median_Income_D1 and Median_Income_D0?

2. The median, as I expect, is computed excluding missing values. These arise because the values are actually missing in the dataset I use. When forming the High_Var dummy, should High_Var be missing wherever Var is missing, or can it be 0?

3. Back to the missing-value issue: my dataset is aggregated patents granted at the country level, covering 2014 to 2023, yet some countries have missing values in years that vary between 2015 and 2017. What would be the best approach to deal with this? Do I accept these values simply as missing, or do I set them to zero, since patents granted can't be below zero? And what about aggregated investments in sustainable projects?

Thank you so much for your time and assistance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30170
#2

02 Jun 2025, 12:14

1. It depends entirely on your specific research questions. There is no generic answer to this question. The answer will depend on the nature of the variables you are working with and whether the relationships among them depend primarily on potentially differing within-group effects or primarily on effects that are the same across groups.

2. I'm not sure I understand the question. If the value of var is unknown, then, in general, there is also no knowledge of whether that unknown value exceeds the median or not. So the value of High_var should be missing when the value of var is missing. There are exceptions to this situation, such as censored observations, where it may be possible to know whether the unobserved value is actually above or below the median even though the value itself is unknown. You don't describe your data set in sufficient detail to determine whether something like this might apply in your instance.

3. This depends on the process that creates the missing values in the data. If the data reporting process simply leaves out any would-be observations where there are no patents, and missing values do not arise for any other reason, then, yes you should replace missing values by zero. But if the missing values can arise in situations where patents have been issued, but for whatever reason, the number is not known, or is withheld to preserve privacy or for other reasons, then, no you cannot infer that missing implies zero, and you should just leave missings as missing.
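
To make the three points above concrete, here is a minimal sketch in plain Python (standard library only). The rows, the field names `developed` and `var`, and the helpers `observed`, `high_dummy` and `fill_missing` are all made up for illustration and are not taken from the original poster's data. It assumes missing values are coded as `None`, and it only shows the mechanics: which median to use, and whether zero-filling is justified, remain substantive decisions.

```python
from statistics import median

# Hypothetical country-level rows; None marks a missing value of Var.
rows = [
    {"country": "A", "developed": 1, "var": 10.0},
    {"country": "B", "developed": 1, "var": 30.0},
    {"country": "C", "developed": 1, "var": None},
    {"country": "D", "developed": 0, "var": 2.0},
    {"country": "E", "developed": 0, "var": 6.0},
    {"country": "F", "developed": 0, "var": 8.0},
]

def observed(values):
    """Drop missing values so they do not enter the median."""
    return [v for v in values if v is not None]

def high_dummy(value, cutoff):
    """Return 1 if value > cutoff, 0 if not, and None when value is missing."""
    if value is None:
        return None
    return 1 if value > cutoff else 0

def fill_missing(value, missing_means_zero):
    """Replace a missing value by 0 only if missing is known to mean zero."""
    if value is None and missing_means_zero:
        return 0.0
    return value

# Point 1: full-sample median versus sub-group-specific medians.
full_median = median(observed(r["var"] for r in rows))
group_median = {
    g: median(observed(r["var"] for r in rows if r["developed"] == g))
    for g in (0, 1)
}

# Point 2: the dummy stays missing wherever Var is missing.
for r in rows:
    r["high_full"] = high_dummy(r["var"], full_median)
    r["high_group"] = high_dummy(r["var"], group_median[r["developed"]])

# Point 3: zero-filling changes the median itself, so it matters
# whether "missing" really means "zero" in the reporting process.
median_if_zero = median(fill_missing(r["var"], True) for r in rows)

print(full_median, group_median, median_if_zero)
for r in rows:
    print(r["country"], r["high_full"], r["high_group"])
```

In this toy data, countries A and F are classified differently under the two definitions, and treating the one missing value as zero moves the full-sample median, so the choices in points 1 and 3 are not innocuous.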

Finally, let me just note that while converting numeric variables to dichotomies is sometimes helpful for descriptive purposes, when used in analyses, it introduces noise and destroys information, unless the dichotomous split point has real-world meaning and implications. So unless you have a compelling explanation of why being above vs below the median is truly a qualitative difference in real-world terms, it is ill-advised to use categories when actual numeric data is available.
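
As a small illustration of the information loss described above, consider the following sketch with four made-up values (hypothetical, chosen only to make the point visible). A median split treats two nearly identical observations as different and two very different observations as the same.

```python
from statistics import median

# Hypothetical values, chosen so two of them sit just either side of the median.
values = [1.0, 49.0, 51.0, 100.0]

cutoff = median(values)
dummy = [1 if v > cutoff else 0 for v in values]

for v, d in zip(values, dummy):
    print(v, d)

# 49 and 51 differ by 2 yet receive different codes;
# 51 and 100 differ by 49 yet receive the same code.
```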