Hi all,
Maybe a bit of an odd question, but I run into this challenge often enough in my data cleaning that I wonder if there's some obvious solution I'm missing. For one reason or another, I often find myself trying to reverse engineer the logic of some calculation or categorization rule in my data. For example, imagine I have a variable that computes a total income value from a bunch of income subcomponents, but only some of the subcomponent variables are included in the total and I don't know which. Or imagine people in my data get placed into a category based on criteria that can be determined from other variables in the data (say, whether someone is eligible for a program), but I don't know the exact classification rules. My basic question: in situations like this, where I have the final answer (total income or program eligibility classification) and a bunch of candidate inputs, but I don't know how the inputs map to the final answer and want to reverse engineer the mapping so I can apply the same logic elsewhere or explore modifications of it, is there a way to reconstruct that mapping more or less automatically?
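For the total-income case, one brute-force approach is to enumerate subsets of candidate components and keep any subset whose sum matches the reported total in every row. A minimal sketch (the column names and toy data here are made up for illustration, not from any real dataset):

```python
from itertools import combinations

# Toy rows: candidate income subcomponents plus the observed total.
# All variable names are hypothetical.
rows = [
    {"wages": 100, "interest": 5, "gifts": 20, "benefits": 30, "total": 135},
    {"wages": 80,  "interest": 0, "gifts": 50, "benefits": 10, "total": 90},
    {"wages": 60,  "interest": 2, "gifts": 0,  "benefits": 25, "total": 87},
]
candidates = ["wages", "interest", "gifts", "benefits"]

# Try every nonempty subset of candidates; keep those whose sum matches
# the reported total in every row (small tolerance for rounding).
matches = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        if all(abs(sum(r[c] for c in subset) - r["total"]) < 1e-9 for r in rows):
            matches.append(subset)

print(matches)  # subsets consistent with the data, e.g. [('wages', 'interest', 'benefits')]
```

With enough rows, usually only one subset survives; if several do, the data can't distinguish them and you'd need more observations. With many candidate columns, a lasso-style regression (coefficients pushed toward 0 or 1) scales better than the 2^n enumeration.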
I usually do this sort of manually, using some background knowledge, educated guesses, and direct examination of samples of the data to test hypotheses such as "I bet receiving means-tested federal benefits is a sufficient condition for being program eligible" — then I go look at the data and check whether, in all cases, those with a variable indicating receipt of a means-tested benefit are also always listed as program eligible. The hope would be to be able to do this kind of searching more systematically. It feels sort of like a support vector machine problem? Or a decision tree thing? Anyway, just hoping to draw on the wisdom of this crowd to see if there's some better way of doing this than my current ad hoc, by-hand approach.
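The "is X a sufficient condition" check described above can be run mechanically over every candidate indicator at once. A minimal sketch with toy data and made-up variable names (not a definitive implementation — a fitted decision tree, e.g. scikit-learn's DecisionTreeClassifier with export_text, is the natural next step when the rule combines several variables):

```python
# Toy rows: hypothetical binary indicators plus the observed label "eligible".
people = [
    {"gets_snap": 1, "income_low": 1, "veteran": 0, "eligible": 1},
    {"gets_snap": 0, "income_low": 1, "veteran": 1, "eligible": 1},
    {"gets_snap": 0, "income_low": 0, "veteran": 1, "eligible": 0},
    {"gets_snap": 1, "income_low": 1, "veteran": 1, "eligible": 1},
    {"gets_snap": 0, "income_low": 0, "veteran": 0, "eligible": 0},
]
indicators = ["gets_snap", "income_low", "veteran"]

# A variable is a *sufficient* condition if everyone who has it set is
# eligible; it is *necessary* if everyone eligible has it set.
sufficient = [v for v in indicators
              if all(p["eligible"] for p in people if p[v])]
necessary = [v for v in indicators
             if all(p[v] for p in people if p["eligible"])]

print("sufficient:", sufficient)  # indicators that always imply eligibility
print("necessary:", necessary)    # indicators every eligible person has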
Thanks!
CJ