log using "d:/datacleaning_07112023.log",replace
import excel using "https://www.tizianobellini.com/_files/ugd/8e0f21_995ab7f61d2a48fbbb59131298df5ee7.xlsx?dn=ch ap2oneypd.xlsx", sheet("Foglio1") firstrow clear
split idvintage_yearmonthly_in, p(",")
drop idvintage_yearmonthly_in
gen id=idvintage_yearmonthly_in1+idvintage_yearmonthly _in2
drop idvintage_yearmonthly_in1 idvintage_yearmonthly_in2
ren id idvintage_yearmonthly_in2
order idvintage_yearmonthly_in2 idvintage_yearmonthly_in3-idvintage_yearmonthly_in45
ren (idvintage_yearmonthly_in2-idvintage_yearmonthly_in45)(id vintage_year monthly_installment ///
loan_balance bureau_score num_bankrupt_iva time_since_bankrupt num_ccj time_since_ccj ccj_amount ///
num_bankrupt num_iva min_months_since_bankrupt pl_flag region ltv arrears_months ///
origination_date maturity_date repayment_type arrears_status arrears_segment mob remaining_mat ///
loan_term live_status repaid_status month arrears_event ///
bankrupt_event term_expiry_event worst_arrears_status max_arrears_12m recent_arrears_date months_since_2mia ///
avg_mia_6m max_arrears_bal_6m max_mia_6m avg_bal_6m avg_bureau_score_6m cc_util annual_income emp_length ///
months_since_recent_cc_delinq)
destring *, ignore("NA") replace
ds, has(type string)
foreach var in `r(varlist)' {
replace `var' = subinstr(`var', `"""', `""', .)
replace `var' = "" if `var' == "NA"
if "`var'" == "repayment_type" {
label define repayment_type 0 "Non-IO" 1 "IO"
encode `var', gen(repaymenttype) label(repayment_type)
drop `var'
rename repaymenttype `var'
}
else if "`var'" == "region" {
replace region = "" if region == "NA"
label define region 1 "r_a" 2 "r_b" 3 "r_c" 4 "r_d" 5 "r_e" 6 "r_f" 7 "r_g" 8 "r_h" 9 "r_i" 10 "r_l" 11 "r_m"
encode `var', gen(region1) label(region)
drop `var'
rename region1 `var'
}
else {
gen `var'1 = date(`var', "YMD")
drop `var'
rename `var'1 `var'
}
}
order id-pl_flag ltv region origination_date maturity_date repayment_type
order id-max_arrears_12m recent_arrears_date months_since_2mia-months_since_recent_cc_delinq
log close
exit
The existing code operates without any issues. Beginning with ds, has(type string) up until the closing bracket, this segment primarily serves to further process the five character-type variables, indicated in red, which are the output of the preceding code segment. In this context, both 'region' and 'repayment_type' can be transformed into numeric types using the encode command, after which their label values are assigned, and the original character variables removed.
However, the remaining three character-type variables can be morphed into date-type variables. My challenge is, how might I modify the code so that Stata accurately discerns the nature of these five character-type variables? If they are the sort that can be converted into date-type variables, they should be transformed accordingly. If they can be encoded into numeric types, I'd like to use encode to automatically convert them into the corresponding numeric format based on their original subcategory quantity and content. Following this, label values would be assigned, and the original character-type variables removed.
I would greatly appreciate any assistance or suggestions on this matter.
Thank you.
import excel using "https://www.tizianobellini.com/_files/ugd/8e0f21_995ab7f61d2a48fbbb59131298df5ee7.xlsx?dn=ch ap2oneypd.xlsx", sheet("Foglio1") firstrow clear
split idvintage_yearmonthly_in, p(",")
drop idvintage_yearmonthly_in
gen id=idvintage_yearmonthly_in1+idvintage_yearmonthly _in2
drop idvintage_yearmonthly_in1 idvintage_yearmonthly_in2
ren id idvintage_yearmonthly_in2
order idvintage_yearmonthly_in2 idvintage_yearmonthly_in3-idvintage_yearmonthly_in45
ren (idvintage_yearmonthly_in2-idvintage_yearmonthly_in45)(id vintage_year monthly_installment ///
loan_balance bureau_score num_bankrupt_iva time_since_bankrupt num_ccj time_since_ccj ccj_amount ///
num_bankrupt num_iva min_months_since_bankrupt pl_flag region ltv arrears_months ///
origination_date maturity_date repayment_type arrears_status arrears_segment mob remaining_mat ///
loan_term live_status repaid_status month arrears_event ///
bankrupt_event term_expiry_event worst_arrears_status max_arrears_12m recent_arrears_date months_since_2mia ///
avg_mia_6m max_arrears_bal_6m max_mia_6m avg_bal_6m avg_bureau_score_6m cc_util annual_income emp_length ///
months_since_recent_cc_delinq)
destring *, ignore("NA") replace
ds, has(type string)
foreach var in `r(varlist)' {
replace `var' = subinstr(`var', `"""', `""', .)
replace `var' = "" if `var' == "NA"
if "`var'" == "repayment_type" {
label define repayment_type 0 "Non-IO" 1 "IO"
encode `var', gen(repaymenttype) label(repayment_type)
drop `var'
rename repaymenttype `var'
}
else if "`var'" == "region" {
replace region = "" if region == "NA"
label define region 1 "r_a" 2 "r_b" 3 "r_c" 4 "r_d" 5 "r_e" 6 "r_f" 7 "r_g" 8 "r_h" 9 "r_i" 10 "r_l" 11 "r_m"
encode `var', gen(region1) label(region)
drop `var'
rename region1 `var'
}
else {
gen `var'1 = date(`var', "YMD")
drop `var'
rename `var'1 `var'
}
}
order id-pl_flag ltv region origination_date maturity_date repayment_type
order id-max_arrears_12m recent_arrears_date months_since_2mia-months_since_recent_cc_delinq
log close
exit
The existing code operates without any issues. Beginning with ds, has(type string) up until the closing bracket, this segment primarily serves to further process the five character-type variables, indicated in red, which are the output of the preceding code segment. In this context, both 'region' and 'repayment_type' can be transformed into numeric types using the encode command, after which their label values are assigned, and the original character variables removed.
However, the remaining three character-type variables can be morphed into date-type variables. My challenge is, how might I modify the code so that Stata accurately discerns the nature of these five character-type variables? If they are the sort that can be converted into date-type variables, they should be transformed accordingly. If they can be encoded into numeric types, I'd like to use encode to automatically convert them into the corresponding numeric format based on their original subcategory quantity and content. Following this, label values would be assigned, and the original character-type variables removed.
I would greatly appreciate any assistance or suggestions on this matter.
Thank you.
Comment