String: Regexm for a specific word in a string (1st)

Igor Paploski

Join Date: Oct 2014

Posts: 174
#16

13 Jul 2018, 16:33

I really like looking at the code you guys come up with. It's great, and I really like understanding the rationale behind it.

The main issue I see here is how representative of the whole dataset this example is. It might be just a portion of the dataset, and the problems solved so far might just be a sample of what is still to come. I honestly hope that new data is not being entered into Marvin's dataset (troubleshooting while new issues keep arising is a nightmare). Additionally, I cannot stress enough how helpful a simple -tab city- would be here (assuming no more data is being entered). Being able to look at all values in city would make writing the cleaning code so much easier.

Marvin, please try the following:

Code:

contract city
dataex city

Please don't save the data after the -contract city- command, because it replaces the dataset in memory with one observation per distinct value of city. Running it followed by -dataex city- will generate code that contains a single observation for each distinct answer in city. Hopefully there won't be many.
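If accidentally saving the collapsed data is a worry, one option is to wrap the two commands in -preserve- / -restore-. This is only a minimal sketch: the toy city values and the variable name n_obs are made up for illustration, and it assumes -dataex- is installed (it is available from SSC).

```stata
* Toy data for illustration only; these are not Marvin's actual values
clear
input str20 city
"SUNNY SIDE"
"ELMHURST N YORK"
"SUNNY SIDE"
"SUNNY SIDE NEW Y"
end

preserve                        // keep a copy of the full data
contract city, freq(n_obs)      // one observation per distinct city value
gsort -n_obs city               // most frequent values first
dataex city n_obs               // shareable listing of the distinct values
restore                         // the full data is back in memory
```

After -restore- the original observations are back in memory, so nothing is lost even if the data is saved later.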

Best;
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#17

16 Jul 2018, 08:56

Thanks for the explanation, Clyde Schechter!

Romalpa Akzo This is pretty smart and more "elegant"! Thanks!

Igor Paploski
In fact, I am doing something similar to -contract city-. As I mentioned, I am using different techniques to clean my city variable. For example, since the zip code variable in the data is much cleaner, I am using zip codes to determine cities. If the zip codes don't match, I use the actual city value to clean my city variable, but first I keep only unique city values (bysort ID: keep if _n==1). Then I sort my city variable, look at 100 cases at a time, and find the best way to clean them (e.g. soundex or substr expressions). Then I plan to merge my new clean city variable back into my original dataset (merge m:1). I don't expect this merge to be problematic.
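Roughly, the crosswalk-and-merge step I have in mind looks like the sketch below. The toy data, the variable name city_clean, and the two -replace- rules are only placeholders for illustration, not my real cleaning code.

```stata
* Toy data for illustration only
clear
input long id str20 city
1 "SUNNY SIDE NEW Y"
2 "SUNNY SIDE"
3 "ELMHURST N YORK"
4 "SUNNY SIDE"
end
tempfile original
save `original'

* Crosswalk: one row per distinct raw city value, plus a cleaned version
contract city
drop _freq
gen city_clean = city
replace city_clean = "SUNNY SIDE" if city == "SUNNY SIDE NEW Y"   // placeholder rule
replace city_clean = "ELMHURST"   if city == "ELMHURST N YORK"    // placeholder rule
isid city                       // the merge key must be unique in the crosswalk
tempfile crosswalk
save `crosswalk'

* Bring the cleaned values back to the original observations
use `original', clear
merge m:1 city using `crosswalk', assert(match) nogenerate
list, noobs
```

The assert(match) option makes the merge stop with an error if any raw city value is missing from the crosswalk, which is a cheap way to confirm the merge is not problematic.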
Igor Paploski

Join Date: Oct 2014

Posts: 174
#18

16 Jul 2018, 09:36

Hi Marvin,

Your rationale makes sense to me, but keep in mind that looking at 100 problematic cases at a time (and not at the whole set of problems) can let issues slip through. Romalpa's code (cleverly) deals with situations in which New York (or some variant) appears, namely: " NY", ", NY" and "NEW YORK". It is possible that, lurking somewhere among the cases that are not in the 100 you are looking at, there is an observation whose city value is "SUNNY SIDE NEW Y" or "ELMHURST N YORK" or some other odd variant. This is a problem, but it can easily be dealt with by adding another line of code that follows the same rationale as Romalpa's code. My main concern is that a fix for one specific problem might change values it shouldn't, and because you are not looking at the whole dataset, it might take a while to notice. Take "SUNNY SIDE", for instance. Had it not been among the 100 problematic cases you happened to be looking at that moment, you would not have noticed that replacing "NY" with "" corrupts this observation. Post #9, from July 12, three days after the previous posts on the topic, suggests to me that this happened and that you were attentive enough to detect it. But this might not always be the case. Anyway, just my 2 cents.
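To make the "SUNNY SIDE" point concrete, here is a small sketch comparing a plain substring replacement with a word-boundary regular expression. It is not Romalpa's code; the toy values and the variable names naive and safer are made up for illustration, and it assumes Stata 14 or newer for ustrregexra().

```stata
* Toy data for illustration only
clear
input str20 city
"SUNNY SIDE"
"ELMHURST, NY"
"ELMHURST NY"
end

* Plain substring replacement also removes the "NY" inside "SUNNY"
gen naive = subinstr(city, "NY", "", .)

* \b restricts the match to "NY" as a whole word, optionally preceded by a comma and spaces
gen safer = strtrim(ustrregexra(city, ",?\s*\bNY\b", ""))

list, noobs
```

Comparing the two new variables side by side over all distinct values of city, rather than over 100 cases at a time, is one way to spot a fix that changes things it shouldn't.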

Best;