Seeking Assistance: Adapting Stata Code to Efficiently Convert Character-Type Variables into Date or Numeric Types

smith Jason

Join Date: Sep 2020

Posts: 378
#1

Seeking Assistance: Adapting Stata Code to Efficiently Convert Character-Type Variables into Date or Numeric Types

11 Jul 2023, 11:43

log using "d:/datacleaning_07112023.log",replace
import excel using "https://www.tizianobellini.com/_files/ugd/8e0f21_995ab7f61d2a48fbbb59131298df5ee7.xlsx?dn=ch ap2oneypd.xlsx", sheet("Foglio1") firstrow clear
split idvintage_yearmonthly_in, p(",")
drop idvintage_yearmonthly_in
gen id=idvintage_yearmonthly_in1+idvintage_yearmonthly _in2
drop idvintage_yearmonthly_in1 idvintage_yearmonthly_in2
ren id idvintage_yearmonthly_in2
order idvintage_yearmonthly_in2 idvintage_yearmonthly_in3-idvintage_yearmonthly_in45
ren (idvintage_yearmonthly_in2-idvintage_yearmonthly_in45)(id vintage_year monthly_installment ///
loan_balance bureau_score num_bankrupt_iva time_since_bankrupt num_ccj time_since_ccj ccj_amount ///
num_bankrupt num_iva min_months_since_bankrupt pl_flag region ltv arrears_months ///
origination_date maturity_date repayment_type arrears_status arrears_segment mob remaining_mat ///
loan_term live_status repaid_status month arrears_event ///
bankrupt_event term_expiry_event worst_arrears_status max_arrears_12m recent_arrears_date months_since_2mia ///
avg_mia_6m max_arrears_bal_6m max_mia_6m avg_bal_6m avg_bureau_score_6m cc_util annual_income emp_length ///
months_since_recent_cc_delinq)
destring *, ignore("NA") replace
ds, has(type string)
foreach var in `r(varlist)' {
replace `var' = subinstr(`var', `"""', `""', .)
replace `var' = "" if `var' == "NA"
if "`var'" == "repayment_type" {
label define repayment_type 0 "Non-IO" 1 "IO"
encode `var', gen(repaymenttype) label(repayment_type)
drop `var'
rename repaymenttype `var'
}
else if "`var'" == "region" {
replace region = "" if region == "NA"
label define region 1 "r_a" 2 "r_b" 3 "r_c" 4 "r_d" 5 "r_e" 6 "r_f" 7 "r_g" 8 "r_h" 9 "r_i" 10 "r_l" 11 "r_m"
encode `var', gen(region1) label(region)
drop `var'
rename region1 `var'
}
else {
gen `var'1 = date(`var', "YMD")
drop `var'
rename `var'1 `var'
}
}
order id-pl_flag ltv region origination_date maturity_date repayment_type
order id-max_arrears_12m recent_arrears_date months_since_2mia-months_since_recent_cc_delinq
log close
exit

The existing code operates without any issues. Beginning with ds, has(type string) up until the closing bracket, this segment primarily serves to further process the five character-type variables, indicated in red, which are the output of the preceding code segment. In this context, both 'region' and 'repayment_type' can be transformed into numeric types using the encode command, after which their label values are assigned, and the original character variables removed.

However, the remaining three character-type variables can be morphed into date-type variables. My challenge is, how might I modify the code so that Stata accurately discerns the nature of these five character-type variables? If they are the sort that can be converted into date-type variables, they should be transformed accordingly. If they can be encoded into numeric types, I'd like to use encode to automatically convert them into the corresponding numeric format based on their original subcategory quantity and content. Following this, label values would be assigned, and the original character-type variables removed.

I would greatly appreciate any assistance or suggestions on this matter.
Thank you.

Last edited by smith Jason; 11 Jul 2023, 11:50.
Tags: None
smith Jason

Join Date: Sep 2020

Posts: 378
#2

11 Jul 2023, 11:52

If you cannot run the code above, please download the Stata code here.

Attached Files

data_cleaning.do (2.3 KB, 1 view)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#3

11 Jul 2023, 12:25

I'm not sure I understand what you are asking for. The code you show -destring-'s everything it can. The string variables that -destring- rejects are then dealt with. Two of them are identified by name, and the remainder are assumed to be date variables and you create a Stata daily date variable from them. So what precisely is the problem? Are you asking for some way to have Stata inspect a string variable and ascertain whether it is the string representation of a date? That is doable, but fairly complicated. And I question whether it is worth the trouble. In general you will have better luck identifying string variables that represent dates by their names, as most people when creating data sets will signal date variables by including something like date or dt, or more specific signals like birthday or dob in the name.

The main problem in recognizing string variables that represent dates by their content is that there are so many different ways a date variable can be represented as a string. For starters, the year, month, and day can occur in any order. Yes day-year-month is bizarre and I have never seen it in real life, but, in principle that is 6 possibilities. Next, the year can be either 2 digits or 4 digits. And if 2 digits, you have to know how to interpret the missing century. If we assume (and this will be wrong for some kinds of data) that the century is always the 21st, then that's 12 possibilities. You would have to set up a loop through them, trying to make the conversion, until you either found one that worked, or, reached the end of the list with all of them failing. The way to tell if the conversion was successful is to test whether missing(converted_date_variable) == missing(original_string_variable).

Something like this:

Code:

capture program drop is_it_a_date program define is_it_a_date, rclass syntax varname, gen(name) format(string) capture confirm string var `varname', exact if c(rc) != 0 { display as error "`varname' is not a string variable" return scalar is_a_date = 0 exit(9) } local formats DMY YDM YMD MDY DM20Y 20YDM 20YMD MD20Y M20YD DYM MYD D20YM gen int `gen' = . foreach f of local formats { replace `gen' = daily(`varname', "`f') capture assert missing(`gen') == missing(`varname') if c(rc) == 0 { // SUCCESSFUL DATE CONVERSION format `gen' `format' return scalar is_a_date = 1 exit } } // IF WE REACH THIS PART OF THE CODE, VARIABLE IS NOT A DATE drop `gen' return scalar is_a_date = 0 exit end

I have not tested this code, and I think it more likely than not contains errors. It's just intended to show the overall approach.

There are some serious limitations, however, to this approach, which I don't think are surmountable. If the variable is, in fact, supposed to be a date variable, but it contains some invalid dates, like 29feb2023, or a typo like 7/111/2023 the -assert- will fail and the variable will not be recognized as a date. Moreover, if the string variable contains a mixture of date layouts: some are like 7/11/2023 and others are like 11jul23, it will fail. Also, it is possible, if the values of the variable are sufficiently restricted, that the values will be interpretable in more than one order of D M and Y, and in that case, the code would just use the layout that appears first in the definition of local formats, possibly producing incorrect results.

I suppose a test based on regular expressions is possible, but the number of possibilities gets multiplied even more due to the variable use of separators between the components of a date: none, /, -, space, period (and perhaps others I haven't thought of) and the various ways of spelling out months as numbers, lower or upper or mixed case abbreviations, or full names.

Overall, I just wouldn't go this route. I would familiarize myself with each data set I confront, combined with the codebooks that come with them, to identify the date variables it contains and then write bespoke code for the data set.
Comment
smith Jason

Join Date: Sep 2020

Posts: 378
#4

12 Jul 2023, 12:51

Professor Clyde, Thank you your code.
Comment

Announcement

Seeking Assistance: Adapting Stata Code to Efficiently Convert Character-Type Variables into Date or Numeric Types

Comment

Comment

Comment