Mapping across 2 datasets

Maudrene Tan

Join Date: Feb 2015
Posts: 14

Mapping across 2 datasets

10 May 2019, 01:19

Dear All,

I am stuck trying to loop through every observation from one dataset and map the results across to another dataset. I'm really hoping someone can guide me through this:

I have 2 datasets. One contains the search terms for the various medications and their classification categories (dataset A) and the other contains the inpatient medications ordered by the doctors for each patient (dataset B). What I need to do is to map the medicine classification to the medications ordered for the patients and generate a new variables that capture the intermediateclassification categories (Final dataset)

Dataset A

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str46 searchterm1 str13 searchterm2 str7 searchterm3 str25 recodeas str27 intermediateclassification
"Metformin"                   "Glucophage"    "" "Metformin non-XR" "Metformin"    
"Metformin XR"                "Glucophage XR" "" "Metformin XR"     "Metformin"    
"Pioglitazone"                "Actos"         "" "Pioglitazone"     "TZD"          
"Rosiglitazone"               "Avandia"       "" "Rosiglitazone"    "TZD"          
"Rosiglitazone AND metformin" "Avandamet"     "" "Avandamet"        "TZD-metformin"
"MICU Intensive IV insulin"   ""              "" "MICU IVSI scale"  "Sliding scale"
"MICU subcutaneous insulin"   ""              "" "MICU SCSI scale"  "Sliding scale"
end

Dataset B

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long ïCaseNo str229 OrderMeditation
1514762004 "Iron Polymaltose (Elemental Iron 50mg/mL)  Drops"                                   
1514762004 "Lynae Vitamin D [Colecalciferol 1,000 unit, Calcium 149mg]  Tablet"                 
1514762004 "MICU Subcutaneous Insulin - Titrate per protocol Insulin Soluble 1,000unit/10mL Inj"
1514762004 "Metformin HCl  Tablet"                                                              
1514762004 "Metformin HCl XR 500mg Tablet"                                                      
1514762004 "Myotein [Protein Supplement]  Powder"                                               
end

Final dataset

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long ïCaseNo str229 OrderMeditation float(metformin slidingscale)
1514762004 "Iron Polymaltose (Elemental Iron 50mg/mL)  Drops"                                    0 0
1514762004 "Lynae Vitamin D [Colecalciferol 1,000 unit, Calcium 149mg]  Tablet"                  0 0
1514762004 "MICU Subcutaneous Insulin - Titrate per protocol Insulin Soluble 1,000unit/10mL Inj" 0 1
1514762004 "Metformin HCl  Tablet"                                                               1 0
1514762004 "Metformin HCl XR 500mg Tablet"                                                       1 0
1514762004 "Myotein [Protein Supplement]  Powder"                                                0 0
end

I know how to manually go through the list of the medications and their search term(s) using the following codes:

Code:

gen metformin = strpos(lower(OrderMeditation), "metformin") > 0
gen slidingscale = strpos(lower(OrderMeditation), "micu subcutaneous insulin") > 0

end

However, that will take really long since I have a total 104 observations in Dataset A to go through. Can someone help me please? Thanks.

Warmest regards,
Maudrene

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2425
#2

10 May 2019, 10:11

I'd have to clarify my understanding of your problem. I'm *guessing* that your problem is something like this:
1) You have observations on individuals with a string variable for a medication they received (dataset B). An individual may appear multiple times. The problem is that a) The terms used for a given medications in this file vary widely. "Metformin," for example, might be written with a number of varying terms, Also, you want a broader classification of medications.
2) You would like a variable in dataset B that codes each medication string into a variable indicating whether or not (0/1) that medication term falls in a broader category, a variable that you call "intermediateclassification."

Is this an accurate description? If so, I see a number of problems that would need to be resolved before giving you a helpful answer.

Here are two problems:

1) How many different values are there for "intermediateclassification?." I would think that it is a large number, and that you would not want a large list of indicator variables, one for each value. Rather, I would think that you would want one categorical variable with a large number of values. Is the number of potential values for "intermediateclassification" large? And, if so, why do you want a large list of indicator variables?
2) I can't believe that a short list of "search terms," as in dataset A, is likely to exhaust all the possible ways someone might have recorded (for example) "metformin." This is particularly true since if these come from handwritten records, spelling errors will be numerous, which would mean that your search terms might have to include "mettformin," "metforman," "metform," Can you clarify here? Depending on your answer here, different approaches might be necessary, including a "fuzzy match" of some kind. I'm not an expert on this, but -search fuzzy match- yields some useful things.

By the way: You have used "Meditation" where I suspect you intend "Medication." (Unless the doctor ordered Meditation rather than, say, an anti-anxiety agent for the patient's problems <grin>). This spelling variation might cause you a problem later. Also, in dataset B, searchterm3 is blank for all observations. Is this what you intended?
Comment
Maudrene Tan

Join Date: Feb 2015

Posts: 14
#3

10 May 2019, 19:58

Dear Sir,

Thanks for the response. Please allow me to clarify:

1) Yes, individuals are issued multiple medications during their inpatient stay. The reason for the limited classification of medications is because we are interested in focusing on the occurrence of hypoglycemia, particularly in patients with diabetes. As a result, there are limited resources to obtain the classification of all medications ordered.
2) Yes, exactly. If I can somehow input the "intermediateclassification" into dataset B, that will be ideal.

Problems:
1) There are a total of 17 different values for "intermediateclassification". Ideally, I will like just 1 variable, but I don't know how to do it.
2) The list of "orderedmedications" are manually input by clinicians. Hence, as you rightly said, there might be spelling errors. Thanks for the fuzzy match suggestion. I will look it up.

Thanks for pointing out the spelling error in the variable name. We had spotted that and informed the system administrator.

As for the missingness in searchterm3, it happened that for the observations that I'd selected, searchterm 3 was blank. It is available for others, such as:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str46 searchterm1 str13 searchterm2 str7 searchterm3 str25 recodeas str27 intermediateclassification "Isophane insulin" "Insulatard" "NPH" "NPH" "Intermediate-acting insulin" "Isophane insulin" "Insulatard" "NPH" "NPH" "Intermediate-acting insulin" "Mixtard 30" "" "" "Mixtard 30" "Pre-mixed human insulin" "Mixtard 30" "" "" "Mixtard 30" "Pre-mixed human insulin" end

However, having said that, there are not too many of such instances.

Thank you so much for the response. I will look up fuzzy match for a start. Hopefully it'll help me get closer to my goal. Thanks again =).

Warmest regards,
Maudrene
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2425
#4

11 May 2019, 07:55

Here's a thought: What if you prepared a data set with two variables:1) A string containing a key part of what might occur in the name of one of your medications; 2) A numeric variable that has a code (1, ..., 17) denoting the intermediate classification to which it belongs. This data set would have multiple observations for the key words that correspond to a given medication or classification. This would be like your search term data set, but in long format, one observation per search string. Then, I would suspect that one of the fuzzy match programs might be able to match search strings to medication descriptions in your data set B. I have not tried something like this myself; this is just a suggestion.
1 like
Comment
Maudrene Tan

Join Date: Feb 2015

Posts: 14
#5

13 May 2019, 02:46

Thank you so much for this suggestion. I will certainly try it.
Comment

Announcement

Mapping across 2 datasets

Comment

Comment

Comment

Comment