I am working with a cardiac ultrasound dataset that contains a long string variable holding the key findings. Here is one example:
The LV size and mass are within normal limits. LVEF 60-65%. No RWMA. Dilated RV. Moderate-to-severe TR. PASP is estimated at 54 mmHg.
I would like to remove the stopwords and substitute synonyms (as specified by me).
Additionally, I would like to specify qualifier words such as: normal, mild, moderate, severe, dilated, enlarged, ... and any digits.
Finally, rather than a simple bag-of-words representation, I would like to create new variables named after the non-qualifier words, with values equal to the qualifier words that surround them (either before or after). For example:
var_lv_size | var_lv_mass | var_lvef | var_rwma | var_rv_size | var_tr | var_pasp |
normal | normal | 65 | no | dilated | moderate-severe | 54 |
Is this feasible in Stata?
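To make the request concrete, the intended logic could be sketched roughly as follows. This is Python rather than Stata, and it only illustrates the transformation, not a way of doing it in Stata. The word lists, the function names (`is_qualifier`, `parse_findings`), the treatment of "no" as a qualifier, and the rule for splitting "size and mass" are all assumptions made for illustration.

```python
import re

# Illustrative word lists (assumptions); in practice the user would supply these.
STOPWORDS = {"the", "are", "is", "at", "within", "limits", "estimated", "mmhg"}
SYNONYMS = {"moderate-to-severe": "moderate-severe"}
# "no" is added here because the desired output uses it as a value.
QUALIFIERS = {"normal", "mild", "moderate", "severe", "dilated", "enlarged", "no"}

# Words (optionally hyphenated) or numbers (optionally a range such as 60-65).
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*|\d+(?:-\d+)?")


def is_qualifier(token):
    """A token is a qualifier if it is numeric or built only from qualifier words."""
    if token[0].isdigit():
        return True
    return all(part in QUALIFIERS for part in token.split("-"))


def parse_findings(text):
    """Map var_<concept> to the qualifier words found in the same sentence."""
    result = {}
    for sentence in re.split(r"\.(?:\s+|$)", text.lower()):
        tokens = [SYNONYMS.get(t, t) for t in TOKEN.findall(sentence)]
        tokens = [t for t in tokens if t not in STOPWORDS]
        value = " ".join(t for t in tokens if is_qualifier(t))
        concept = [t for t in tokens if not is_qualifier(t)]
        # Split the concept words at "and", e.g. [lv, size] and [mass].
        groups, current = [], []
        for t in concept:
            if t == "and":
                groups.append(current)
                current = []
            else:
                current.append(t)
        groups.append(current)
        groups = [g for g in groups if g]
        if not groups or not value:
            continue
        # Heuristic: shorter later groups inherit the leading words of the
        # first group, so "lv size and mass" yields lv_size and lv_mass.
        first = groups[0]
        for g in groups:
            prefix = first[: max(len(first) - len(g), 0)]
            result["var_" + "_".join(prefix + g)] = value
    return result


example = ("The LV size and mass are within normal limits. LVEF 60-65%. "
           "No RWMA. Dilated RV. Moderate-to-severe TR. "
           "PASP is estimated at 54 mmHg.")
print(parse_findings(example))
```

On the example string this sketch produces var_lv_size and var_lv_mass = normal, var_lvef = 60-65, var_rwma = no, var_rv = dilated, var_tr = moderate-severe, and var_pasp = 54. It does not reduce the range 60-65 to 65 or rename var_rv to var_rv_size; matching the desired table on those points would need additional user-defined rules.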
Thank you,
Jonathan