Writing a program to run code across multiple files

Scott Rick

Join Date: May 2021
Posts: 242

Writing a program to run code across multiple files

09 Jun 2021, 10:39

Hi. I have MEPS data files ranging from 1996-2016. I want to run the following code on the files for 2007-2015. The code involves cleaning the data, merging it with a file named “MMS_WAC_data”, performing a bunch of data manipulations and saving multiple subsets of the data along the way before the final cleaned dataset. The code below is only for the 2007 dataset. Instead of replicating it independently for each of the files from 2007-2015, I know that it is possible to write a program to have one code file that runs across each of these files. I have never written a program before and all my attempts so far have failed. What I want to do which needs to be changed in the code below is:

import the MEPS files from 2007-2015
some of the variable names in each of the files from 2007-2015 have the year as the suffix – eg rxsf07x rxmr07x rxmd07x. Ensure that the right variables in each year file is retained.
run the same code on each of them, and in each of the subsets of the data saved change “2007” to the relevant year name

Any guidance on how to do this would be much appreciated.

Code:

set more off
clear all
capture log close
set matsize 10000
set maxvar 10000

* set graph parameters
set scheme s1mono
graph set window fontface "Palatino Linotype"
global blue "75 156 211"
global orange "211 122 75"
global green "76 184 72"

*** MAIN DIRECTORY: EDIT THIS LINE TO RUN THE CODE AS A DIFFERENT USER ***
global db "C:\Users\ricks\Dropbox"


* global directories
global meps "${db}\MEPS"
global drug "${db}\UNC\Dissertation\List Price"
global output "${db}\UNC\Dissertation\List Price\Cleaned Data"



use "${meps}\MEPS Prescription Raw\P2007.dta" 


*keeping only the variables that matter - id, weight, payment vars & drug code and name
keep DUPERSID RXNDC PERWT07F RXSF07X RXMR07X RXMD07X RXPV07X RXXP07X RXNAME


 
*converting var names to lowercase
rename *, lower 

//Data is in long form by id and drug code - collapsing on drug code  condl on id
//calculating totals and averages by id and drug code

collapse (sum) rxsf07x rxmr07x rxmd07x rxpv07x rxxp07x (mean) rxsf07x_mean=rxsf07x ///
rxmr07x_mean=rxmr07x rxmd07x_mean=rxmd07x rxpv07x_mean=rxpv07x rxxp07x_mean=rxxp07x ///
(first) perwt07f rxname, by (dupersid rxndc)


gen year = 2007 //to merge with MMS data
//rename rxname Product

gen id2=_n

drop if rxname=="-9" //drop all missing product name obs

drop if real(substr(rxname,1,1))<. //drop all obs w product names starting with a number

save "${output}\P2007_dummy.dta", replace


*************Combinig with the MMS dataset***********
/*The code below identifies drug names that have similarities 
e.g. "ALLERX"/"ALLERX (AM/PM DOSE PACK 30)"
It provides a similarity score - similscore - ranging between 0-1 as a measure of
the overlap
*/


matchit id2 rxname using "${drug}\MMS_WAC_data.dta", idu(id1) txtu(Product)
//1- matchit first by the most relevant pair of columns - drug name

// 2- bring back the other columns
joinby id1 using "${drug}\MMS_WAC_data.dta"
joinby id2 using "${output}\P2007_dummy.dta"

save "${output}\2007+MMS.dta", replace

clear 

use "${output}\2007+MMS.dta", replace

//year1 is the year var in the MMS dataset. Restricting to 2007 meds
keep if year1==2007

gen keep1=1 if similscore==1
gen keep7=1 if similscore>=0.7 & similscore<=1

/*Match observations if:
1. rxname is contained in Product, OR
2. Product is contained in rxname (e.g. this would match TRICOR with TRICOR / TRILIPIX)
Match on the first word of rxname and Product (e.g. this would match DIOVAN W/HCTZ with DIOVAN / HCT
*/

//3 - rxname is contained in Product or Product is contained in rxname

gen byte either_contains = (strpos(Product, rxname) > 0) | (strpos(rxname, Product) > 0)

/*Alternate
gen overlap1=strpos(rxname, Product) > 0
gen overlap2=strpos(Product, rxname) > 0
//Product is contained in rxname
gen overlap=1 if overlap1==1 | overlap2==1
//single var combining the above 2
*/

//4- Match on the first word of rxname and Product

gen rxname1 = rxname
replace rxname1 = subinstr(rxname1, ",", " ",.)
replace rxname1 = subinstr(rxname1, "/", " ",.)
replace rxname1 = subinstr(rxname1, "(", " ",.)
replace rxname1 = subinstr(rxname1, ")", " ",.)
replace rxname1 = subinstr(rxname1, "-", " ",.)

gen Product1 = Product
replace Product1 = subinstr(Product1, ",", " ",.)
replace Product1 = subinstr(Product1, "/", " ",.)
replace Product1 = subinstr(Product1, "(", " ",.)
replace Product1 = subinstr(Product1, ")", " ",.)
replace Product1 = subinstr(Product1, "-", " ",.)
//removing special characters that mess up the fist name match

gen byte firstsame = lower(word(rxname1,1)) == lower(word(Product1, 1))

preserve

keep if keep7==1

export excel rxname rxname1 Product Product1 keep1 keep7 either_contains ///
firstsame similscore using "${drug}\EXported checks\MMS_MEPS_2007.xls", replace

restore


/*Next:
limit to >.7 pull into a csv and look to see. can delete dups.
Delete useless entries. Merge into main file. And keep only merged ones. */

keep if keep1==1 | keep7==1 | either_contains==1 | firstsame==1

save "${output}\2007_clean.dta", replace

Tags: None

Announcement

Writing a program to run code across multiple files