Hi all,
Maybe a bit of an odd question, but I run into this challenge often enough in my data cleaning that I wonder if there's some obvious solution I'm missing. For one reason or another, I often find myself trying to reverse engineer the logic of some calculation or categorization rule in my data. For example, imagine I have a variable that computes a total income value from a bunch of income subcomponents, but only some of the subcomponent variables are included in the total and I don't know which. Or imagine people in my data get placed into a category based on criteria that can be determined from other variables in the data (say, whether someone is eligible for a program), but I don't know the exact classification rules. My basic question: in situations like this, where I have the final answer (total income or program eligibility classification) and a bunch of candidate inputs, but I don't know how the inputs map to the final answer and want to reverse engineer the mapping so I can apply the same logic elsewhere or explore modifications of it, is there a way to reconstruct that mapping more or less automatically?
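For the total-income case, one brute-force approach is to enumerate subsets of candidate components and keep any subset whose sum matches the reported total in every row. A minimal sketch (the column names and toy data here are made up for illustration, not from any real dataset):

```python
from itertools import combinations

# Toy rows: candidate income subcomponents plus the observed total.
# All variable names are hypothetical.
rows = [
    {"wages": 100, "interest": 5, "gifts": 20, "benefits": 30, "total": 135},
    {"wages": 80,  "interest": 0, "gifts": 50, "benefits": 10, "total": 90},
    {"wages": 60,  "interest": 2, "gifts": 0,  "benefits": 25, "total": 87},
]
candidates = ["wages", "interest", "gifts", "benefits"]

# Try every nonempty subset of candidates; keep those whose sum matches
# the reported total in every row (small tolerance for rounding).
matches = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        if all(abs(sum(r[c] for c in subset) - r["total"]) < 1e-9 for r in rows):
            matches.append(subset)

print(matches)  # subsets consistent with the data, e.g. [('wages', 'interest', 'benefits')]
```

With enough rows, usually only one subset survives; if several do, the data can't distinguish them and you'd need more observations. With many candidate columns, a lasso-style regression (coefficients pushed toward 0 or 1) scales better than the 2^n enumeration.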
I usually do this sort of manually, using some background knowledge, educated guesses, and direct examination of samples of the data to test hypotheses such as "I bet receiving means-tested federal benefits is a sufficient condition for being program eligible" — then I go look at the data and check whether, in all cases, those with a variable indicating receipt of a means-tested benefit are also always listed as program eligible. The hope would be to be able to do this kind of searching more systematically. It feels sort of like a support vector machine problem? Or a decision tree thing? Anyway, just hoping to draw on the wisdom of this crowd to see if there's some better way of doing this than my current ad hoc, by-hand approach.
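The "is X a sufficient condition" check described above can be run mechanically over every candidate indicator at once. A minimal sketch with toy data and made-up variable names (not a definitive implementation — a fitted decision tree, e.g. scikit-learn's DecisionTreeClassifier with export_text, is the natural next step when the rule combines several variables):

```python
# Toy rows: hypothetical binary indicators plus the observed label "eligible".
people = [
    {"gets_snap": 1, "income_low": 1, "veteran": 0, "eligible": 1},
    {"gets_snap": 0, "income_low": 1, "veteran": 1, "eligible": 1},
    {"gets_snap": 0, "income_low": 0, "veteran": 1, "eligible": 0},
    {"gets_snap": 1, "income_low": 1, "veteran": 1, "eligible": 1},
    {"gets_snap": 0, "income_low": 0, "veteran": 0, "eligible": 0},
]
indicators = ["gets_snap", "income_low", "veteran"]

# A variable is a *sufficient* condition if everyone who has it set is
# eligible; it is *necessary* if everyone eligible has it set.
sufficient = [v for v in indicators
              if all(p["eligible"] for p in people if p[v])]
necessary = [v for v in indicators
             if all(p[v] for p in people if p["eligible"])]

print("sufficient:", sufficient)  # indicators that always imply eligibility
print("necessary:", necessary)    # indicators every eligible person has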
Thanks!
CJ