How to Clean String variable with errors in data entry

Mathew Toll

Join Date: Jun 2021

Posts: 2
#1

29 Jun 2021, 22:42

Hi Stata Forum,

I am a relatively new Stata user. I am working on a project that uses data scraped from a website where people have entered information manually. I have a string variable that is supposed to contain a simple one-phrase description, such as "آخر جلسة".

Quite a few entries contain information that should not be there, e.g. a date, or several copies of the entry:

المتابعة (إنشاء الملف) : [12008/2102/2020 آخر جلسة 2021-03-31 09:00:00] [42/2201/2020 آخر جلسة 2020-11-04 12:00:00]
المتابعة (إنشاء الملف) : [13/2114/2020 آخر جلسة 2021-02-02 13:00:00]

Most of the data has been entered correctly and the mistakes are not consistent, so I can't simply delete the first set of unneeded digits.

One of my ideas is to split the variable at the spaces, then drop the values that are incorrect, and then try to work all the correct values into a single column through if conditions and replace. Does this sound reasonable, and are there any commands that could make this easier?

Kind regards,
Mathew Toll
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

30 Jun 2021, 11:35

It is unclear what constitutes a wrong entry from your description. If this is only identifiable on an observation-by-observation basis, then no general approach can be efficient. You just have to manually go through all entries. Otherwise, regular expressions may be helpful if there is a pattern to the errors. For what you suggest, see

Code:

help split
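
If the errors do follow a pattern, the regular-expression route might look roughly like the sketch below. It is only an illustration under stated assumptions: the variable name messy_variable, the toy ASCII data standing in for the Arabic text, and the choice of what counts as "unwanted" (time stamps, slash-separated reference numbers, square brackets) are all hypothetical and would need to be adapted after inspecting the real data. The Unicode-aware functions used here require Stata 14 or later.

```stata
* Toy data standing in for the scraped variable (hypothetical values)
clear
input str80 messy_variable
"status : [13/2114/2020 last session 2021-02-02 13:00:00]"
"last session"
end

gen cleaned = messy_variable

* Assumption: time stamps such as 2021-02-02 13:00:00 are never wanted
replace cleaned = ustrregexra(cleaned, "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}", "")

* Assumption: reference numbers such as 13/2114/2020 are never wanted
replace cleaned = ustrregexra(cleaned, "[0-9]+/[0-9]+/[0-9]+", "")

* Remove the square brackets that wrapped the unwanted pieces
replace cleaned = subinstr(cleaned, "[", "", .)
replace cleaned = subinstr(cleaned, "]", "", .)

* Collapse repeated blanks and trim both ends
replace cleaned = strtrim(stritrim(cleaned))

list messy_variable cleaned
```

Working on a copy (cleaned) rather than on the original variable makes it easy to compare the two and check that nothing legitimate was removed.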
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#3

30 Jun 2021, 12:23

Andrew Musau gives good advice. I will supplement it with a different aspect of the process.

"and then try and work all the correct values into a single column through if conditions and replace."

This part is likely to be very difficult, and I suspect you will pull your hair out, if not entirely lose your mind, trying.

I recommend instead that after you have split the original variable, use -reshape long- to create a single variable ("column") containing all of the segments created by -split-. Then go ahead and drop the incorrect ones. If, in the end, there is only one correct segment for each original observation, then at that point you are done. If there can be more than one for each original observation, then renumber the segments, reshape back to wide, and concatenate. So something like this:

Code:

split messy_variable, gen(segments)
gen long obs_no = _n
reshape long segments, i(obs_no)

// INSPECT THE DATA AND DROP ANY SEGMENTS THAT ARE UNWANTED
// DO IT WITH CODE IF POSSIBLE; BY HAND IF NECESSARY

drop if missing(segments)
bysort obs_no (_j): replace _j = _n   // renumber the surviving segments
reshape wide segments, i(obs_no) j(_j)
egen wanted = concat(segments*), punct(" ")
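
For illustration, here is how the "drop with code" step in the middle of that sequence might be filled in, run end to end on toy data. This is a sketch, not a recipe: the ASCII example values are hypothetical stand-ins for the real text, and the rule used (any segment containing a Western digit 0-9 is unwanted) is an assumption that holds only if the legitimate phrases never contain digits.

```stata
* Toy data standing in for the scraped variable (hypothetical values)
clear
input str80 messy_variable
"status : [13/2114/2020 last session 2021-02-02 13:00:00]"
"last session"
end

* One segment per blank-separated word, then one row per segment
split messy_variable, gen(segments)
gen long obs_no = _n
reshape long segments, i(obs_no)
drop if missing(segments)

* Assumed rule: segments containing any digit (dates, times,
* reference numbers) are data-entry debris
drop if ustrregexm(segments, "[0-9]")

* Renumber what is left, go back to wide, and glue it together
bysort obs_no (_j): replace _j = _n
reshape wide segments, i(obs_no) j(_j)
egen wanted = concat(segments*), punct(" ")

* Tidy any stray blanks left by empty segments
replace wanted = strtrim(stritrim(wanted))

list messy_variable wanted
```

Note that an observation whose segments are all dropped disappears from the data entirely, so it is worth counting observations before and after.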
Mathew Toll

Join Date: Jun 2021

Posts: 2
#4

14 Jul 2021, 19:18

Thank you Andrew Musau & Clyde Schechter.

I waited to reply until I had got it to work. So, again, thank you both; this was extremely helpful.