Replace command: Assistance in data cleaning

Chinmay Korgaonkar

Join Date: Dec 2022
Posts: 35

Replace command: Assistance in data cleaning

03 Oct 2023, 23:14

Hello Everyone! I have data for one year on the municipal division of 25 counties of one state. The example Excel data is in the following format.

Allegany County
Barton
Bel Air
Bowling Green-Roberts Place
Cresaptown
Cumberland
Total
Anne Arundel
LaVale
Lonaconing
Luke
McCoole
Midland
Mt. Savage
Potomac Park Addition
Total

I want to change it to the following format.

Barton	Allegany County
Bel Air	Allegany County
Bowling Green-Roberts Place	Allegany County
Cresaptown	Allegany County
Cumberland	Allegany County
Total	Allegany County

LaVale	Anne Arundel
Lonaconing	Anne Arundel
Luke	Anne Arundel
McCoole	Anne Arundel
Midland	Anne Arundel
Mt. Savage	Anne Arundel
Potomac Park Addition	Anne Arundel
Total	Anne Arundel

I want to do the above for each of the 25 counties in the data.

The code replace variable_name= "Allegany County" in 2/30 works. But doing it for all counties and then all years will have human errors.
I want to know how to do this efficiently.

Thanks in advance!

Last edited by Chinmay Korgaonkar; 03 Oct 2023, 23:21.

Tags: None

Chinmay Korgaonkar

Join Date: Dec 2022

Posts: 35
#2

03 Oct 2023, 23:31

The another issue I am facing is as below:

The data for each county is in the following format.

Caroline County

Denton

Federalsburg

Goldsboro

Greensboro

Henderson

Hillsboro

Marydel

Preston

Ridgely

Templeville

Total

Rural and Unincorporated

Total

I want to rename the first "Total" (appearing below Templeville) as "Total_incorporated." I want to do this for all counties in the data. Again, using code like replace variable_name= "Total_incorporated" in 24 for each county will have human errors.

Can I do it efficiently?
Thank you.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30163
#3

04 Oct 2023, 11:14

Showing us the data from Excel is suggestive but not truly helpful. If you have not yet imported this data into Stata, it is premature to ask for any Stata help other than with the process of importing itself. If you have, then it is far more helpful to show the example from Stata, using the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

The data appears to be messy. For example, some of the municipality names (or Total) have trailing blank spaces, and others do not. So all of that has to be made uniform first. After doing that, the code I have written here assumes that the variability in the arrangements in your entire data set is limited to what you have shown in the example. Ultimately, the reading of free-form data is a task for an AI program, not a statistical package. But if the variation is limited to a small number of possibilities, as in the example, then it can be handled:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str28 var1 "Allegany County " "Barton " "Bel Air " "Bowling Green-Roberts Place " "Cresaptown " "Cumberland " "Total " "Anne Arundel " "LaVale " "Lonaconing " "Luke " "McCoole " "Midland " "Mt. Savage " "Potomac Park Addition " "Total" "Caroline County" "Denton" "Federalsburg" "Goldsboro" "Greensboro" "Henderson" "Hillsboro" "Marydel" "Preston" "Ridgely" "Templeville" "Total" "Rural and Unincorporated" "Total" end replace var1 = trim(itrim(var1)) gen byte is_header = /// (var1[_n-1] == "Total" & !inlist(var1, "Total", "Rural and Unincorporated")) /// | _n == 1 gen county = var1 if is_header replace county = county[_n-1] if missing(county) drop if is_header drop is_header rename var1 municipal_division sort county, stable by county: gen n_totals = sum(municipal_division == "Total") replace municipal_division = "Total Incorporated" if municipal_division == "Total" /// & n_totals == 1

That said, please don't do this. While it is common for speadsheets to contain rows that are totals of rows above them, doing that in Stata is a recipe for serious problems. Unless you are meticulously thorough in always explicitly excluding the total rows from all calculations you will do, these total rows will end up double counting the rows they represent in all of your calculations. It is clear from your post that you are aware of the potential for human error, and having these total rows in there is a human-error trap waiting to spring. It is far too easy to forget to exclude them from some of the commands where they don't belong. It is like scattering throw rugs around the home of a person who can barely keep his balance. Please get rid of those total rows. Stata is a statistical program, not a spreadsheet. Using a spreadsheet-oriented data organization will only lead to tears.

If you will need to use the county totals themselves frequently, put them in a separate data set. If you will need to use them occasionally, you can always re-calculate them when the need arises using the -collapse- command, or creating county-total variables with -egen, total()-. Just don't have the totals and the disaggregated data as separate observations in the same data set.

Last edited by Clyde Schechter; 04 Oct 2023, 11:18.
2 likes
Comment

Chinmay Korgaonkar

Join Date: Dec 2022
Posts: 35

04 Oct 2023, 16:51

Thank you so much for the code and detailed advice. I will keep that in mind.

This is the sample data structure of two county groups and their municipal subdivisions. I want county names such as "Allegany County" and "Anne Arundel County" to appear beside all the municipal subdivisions below them as a separate column.
I also want to rename the first "Totals" in each county group as "Totals_incorporated"

I am working on how to modify and apply your code for these two tasks.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str27 municipal_division str17 Number_Returns_Filed
"Allegany County"             ""          
"Barton"                      "209"        
"Bel Air"                     "683"        
"Bowling Green-Roberts Place" "488"        
"Cresaptown"                  "711"        
"Cumberland"                  "8272"      
"Ellerslie"                   "233"        
"Frostburg"                   "2844"      
"LaVale"                      "2816"      
"Lonaconing"                  "422"        
"Luke"                        "45"        
"McCoole"                     "139"        
"Midland"                     "203"        
"Mt. Savage"                  "359"        
"Potomac Park Addition"       "396"        
"Westernport"                 "751"        
"Totals"                      "18571"      
"Rural and Unincorporated"    "10171"      
"Totals"                      "28742"      
""                            ""          
"Anne Arundel County"         ""          
"Annapolis"                   "21414"      
"Highland Beach"              "91"        
"Totals"                      "21505"      
"Rural and Unincorporated"    "263550"    
"Totals"                      "285055"    
""                            ""        
end

Thank you once again!

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30163
#5

04 Oct 2023, 17:43

It's just some very minor revisions of what I wrote in #3. For one, var1 has been replaced everywhere by municipal_division

Code:

gen byte is_header = _n == 1 | municipal_division[_n-1] == "" gen county = municipal_division if is_header replace county = county[_n-1] if missing(county) drop if is_header drop is_header sort county, stable by county: gen n_totals = sum(municipal_division == "Totals") replace municipal_division = "Total Incorporated" if municipal_division == "Totals" /// & n_totals == 1 drop n_totals drop if missing(municipal_division) destring Number_Returns_Filed, replace

I added the final -destring- command because the variable clearly is intended to be numeric, and I suspect you will want to do some calculations with it. But it was imported as a string variable (because its value in the very first row of the Excel data was missing).

Again, though, I warn you that using a data set that contains separate observations for items and totals is dangerous. You do it at your peril.
1 like
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30163

04 Oct 2023, 17:50

If you really need the totals to be in the data set, the following gives a safer way to organize the data. After the code shown in #5, you can do this:

Code:

drop if substr(municipal_division, 1, 5) == "Total"
by county (municipal_division), sort: egen grand_total = total(Number_Returns_Filed)
by county (municipal_division): egen total_incorporated = ///
    total(cond(municipal_division != "Rural and Unincorporated", Number_Returns_Filed, .))

Comment

Chinmay Korgaonkar

Join Date: Dec 2022

Posts: 35
#7

04 Oct 2023, 20:47

Thank you very much once again. The code worked perfectly. I am working on this data as a part of an RA assignment, and I was specifically asked not to delete any data. However, I will make a note in do file that the cleaned data contains totals, too. Thanks once again.
Comment

Chinmay Korgaonkar

Join Date: Dec 2022
Posts: 35

06 Oct 2023, 18:09

Hello, everyone and Clyde Schechter

I am facing two issues while cleaning the data for other years. These are due formatting of Excel files when I converted them from PDF files. This was the best PDF to Excel file conversion I got after trying multiple conversion options.

The cleaned sample data looks like the following as of now:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str29 Municipal_division str13 Number_Returns_Filed str17(Net_State_Tax Net_Local_Tax Net_Total_Tax) double G
"Calvert"                    ""       ""                  ""             ""                             .
"Chesapeake Beach"           "2571"   "6731199.41"        "3983224.57"   "10714423.98"                  .
"North Beach"                "1004"   "1944394.88"        "1171853.44"   "3116248.32"                   .
"Totals"                     "3575"   "8675594.289999999" "5155078.01"   "13830672.3"                   .
" Rural and Unincorporated " "37878"  "98603342.31999999" "58553325.56"  "157156667.88"                 .
"Totals"                     "41453"  "107278936.61"      "63708403.57"  "170987340.18"                 .
"Caroline"                   ""       ""                  ""             ""                             .
"Denton"                     "1687"   "2143260.67"        "1275887.03"   "3419147.7"                    .
"Federalsburg"               "1186"   "797918.3"          "496939.62"    "1294857.92"                   .
"Goldsboro"                  "124"    "99374.85000000001" "66705.64"     "166080.49"                    .
"Greensboro"                 "768"    "736489.26"         "450834.62"    "1187323.88"                   .
"Henderson"                  "56"     "30126"             "20624.84"     "50750.84"                     .
"Hillsboro"                  "89"     "64075.18"          "38285.98"     "102361.16"                    .
"Marydel"                    "88"     "53539.35"          "37393.83"     "90933.17999999999"            .
"Preston"                    "329"    "455103.3"          "263709.39"    "718812.6899999999"            .
"Ridgely"                    "696"    "901211.84"         "530353.13"    "1431564.97"                   .
"Templeville"                "12"     "3974"              "2929.06"      "6903.06"                      .
"Totals"                     "5035"   "5285072.75"        "3183663.14"   "8468735.890000001"            .
" Rural and Unincorporated " "9369"   "13495233.97"       "8070272.44"   "21565506.41"                  .
"Totals"                     "14404"  "18780306.72"       "11253935.58"  "30034242.3"                   .
""                           ""       ""                  ""             ""                             .
""                           ""       ""                  ""             ""                             .
""                           ""       ""                  ""             ""                             .
"Carroll"                    ""       ""                  ""             ""                             .
"Hampstead"                  ""       "3014"              "6087945.03"   "3977285.9"          10065230.93
"Manchester"                 ""       "2123"              "4242961.55"   "2776010.67"          7018972.22
"Mt. Airy"                   ""       "2305"              "7008405.53"   "4521873.2"          11530278.73
"New Windsor"                ""       "604"               "1104205.52"   "723073.86"           1827279.38
"Sykesville"                 ""       "2105"              "5372577.96"   "3507678.14"           8880256.1
"Taneytown"                  ""       "2989"              "4410801.97"   "2902180.52"          7312982.49
"Union Bridge"               ""       "381"               "487966.08"    "322711.6"             810677.68
"Westminster"                ""       "8777"              "15246812.22"  "9955907.35"         25202719.57
"Totals"                     "TOTALS" "22298"             "43961675.86"  "28686721.24"         72648397.1
" Rural and Unincorporated " ""       "57765"             "156351597.97" "100549786.85"      256901384.82
"Totals"                     "TOTALS" "80063"             "200313273.83" "129236508.09"      329549781.92
end

What to do when the header is not clearly identified as there is no space between the two observations. For example, "Caroline" and "Calvert" are two different counties. But as there is no space between them, Stata does not identify them as headers. If I use insobs 1, after (6) the issue is resolved. But I have to use this for each such case in data.
For "Carroll" County, the variables are shifted to the right. The G variable should not exit. Using replace to change the value of each cell is impossible.

Thanks in advance!

Last edited by Chinmay Korgaonkar; 06 Oct 2023, 18:14.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30163

06 Oct 2023, 19:22

Well, for the first difficulty, at least we still have the regularity that a new county appears after a sequence of Totals, Rural & Unncorporated, Totals (perhaps with some blank observations in between). The second is not that hard to fix.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str29 Municipal_division str13 Number_Returns_Filed str17(Net_State_Tax Net_Local_Tax Net_Total_Tax) double G
"Calvert"                    ""       ""                  ""             ""                             .
"Chesapeake Beach"           "2571"   "6731199.41"        "3983224.57"   "10714423.98"                  .
"North Beach"                "1004"   "1944394.88"        "1171853.44"   "3116248.32"                   .
"Totals"                     "3575"   "8675594.289999999" "5155078.01"   "13830672.3"                   .
" Rural and Unincorporated " "37878"  "98603342.31999999" "58553325.56"  "157156667.88"                 .
"Totals"                     "41453"  "107278936.61"      "63708403.57"  "170987340.18"                 .
"Caroline"                   ""       ""                  ""             ""                             .
"Denton"                     "1687"   "2143260.67"        "1275887.03"   "3419147.7"                    .
"Federalsburg"               "1186"   "797918.3"          "496939.62"    "1294857.92"                   .
"Goldsboro"                  "124"    "99374.85000000001" "66705.64"     "166080.49"                    .
"Greensboro"                 "768"    "736489.26"         "450834.62"    "1187323.88"                   .
"Henderson"                  "56"     "30126"             "20624.84"     "50750.84"                     .
"Hillsboro"                  "89"     "64075.18"          "38285.98"     "102361.16"                    .
"Marydel"                    "88"     "53539.35"          "37393.83"     "90933.17999999999"            .
"Preston"                    "329"    "455103.3"          "263709.39"    "718812.6899999999"            .
"Ridgely"                    "696"    "901211.84"         "530353.13"    "1431564.97"                   .
"Templeville"                "12"     "3974"              "2929.06"      "6903.06"                      .
"Totals"                     "5035"   "5285072.75"        "3183663.14"   "8468735.890000001"            .
" Rural and Unincorporated " "9369"   "13495233.97"       "8070272.44"   "21565506.41"                  .
"Totals"                     "14404"  "18780306.72"       "11253935.58"  "30034242.3"                   .
""                           ""       ""                  ""             ""                             .
""                           ""       ""                  ""             ""                             .
""                           ""       ""                  ""             ""                             .
"Carroll"                    ""       ""                  ""             ""                             .
"Hampstead"                  ""       "3014"              "6087945.03"   "3977285.9"          10065230.93
"Manchester"                 ""       "2123"              "4242961.55"   "2776010.67"          7018972.22
"Mt. Airy"                   ""       "2305"              "7008405.53"   "4521873.2"          11530278.73
"New Windsor"                ""       "604"               "1104205.52"   "723073.86"           1827279.38
"Sykesville"                 ""       "2105"              "5372577.96"   "3507678.14"           8880256.1
"Taneytown"                  ""       "2989"              "4410801.97"   "2902180.52"          7312982.49
"Union Bridge"               ""       "381"               "487966.08"    "322711.6"             810677.68
"Westminster"                ""       "8777"              "15246812.22"  "9955907.35"         25202719.57
"Totals"                     "TOTALS" "22298"             "43961675.86"  "28686721.24"         72648397.1
" Rural and Unincorporated " ""       "57765"             "156351597.97" "100549786.85"      256901384.82
"Totals"                     "TOTALS" "80063"             "200313273.83" "129236508.09"      329549781.92
end

drop if missing(Municipal_division)
replace Municipal_division = trim(itrim(Municipal_division))
gen byte is_header = (Municipal_division[_n-1] == "Totals" & Municipal_division[_n-3] == "Totals" ///
    & Municipal_division[_n-2] == "Rural and Unincorporated") | _n == 1
gen county = Municipal_division if is_header, before(Municipal_division)
replace county = county[_n-1] if missing(county)
drop if is_header
drop is_header

//    DEAL WITH SKIPPED COLUMN IN SOME COUNTIES
gen byte problem = missing(Number_Returns_Filed) | Number_Returns_Filed == "TOTALS"
unab vbles: Number_Returns_Filed-G
tostring G, replace force
display `"`vbles'"'
local n_vbles: word count `vbles'
forvalues i = 2/`n_vbles' {
    local v: word `i' of `vbles'
    local w: word `=`i'-1' of `vbles'
    replace `w' = `v' if problem
}
destring `vbles', replace
drop G problem

sort county, stable
by county: gen n_totals = sum(Municipal_division == "Totals")
foreach v of varlist Number_Returns_Filed-Net_Total_Tax {
    by county: egen total_inc_`v' = ///
        max(cond(n_totals == 1 & Municipal_division == "Totals", `v', .))
    by county: egen grand_total_`v' = ///
        max(cond(n_totals == 2 & Municipal_division == "Totals", `v', .))
}

Who did you antagonize to get the assignment of working with this hot mess of a data source?

Comment

Chinmay Korgaonkar

Join Date: Dec 2022

Posts: 35
#10

06 Oct 2023, 22:26

Thank you so much again. You have been a lifesaver!
Comment

Chinmay Korgaonkar

Join Date: Dec 2022
Posts: 35

#11

12 Oct 2023, 00:36

Hello everyone,

I want to generate a new variable if the string variable contains a specific word. For example, I want to generate a new var1 = 1 if any observation in the variable "Municipal_division" contains the word "County". I tried split, substr, and word functions but was not able to get what I wanted. Thanks in advance.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str35 Municipal_division str17 Real_property_tax_municipal
"Allegany County"           ""      
"Barton"                    ".225"  
"Cumberland"                "1.0595"
"Frostburg"                 ".7"    
"Lonaconing"                ".3084"
"Luke - Residential"        "1.5"  
"Luke - Commercial"         "2.5"  
"Luke - Rented Residential" "2.07"  
"Midland"                   ".28"  
"Westernport"               ".6"    
"Anne Arundel County"       ""      
"Annapolis"                 ".738"  
"Highland Beach"            ".1505"
"Baltimore City"            ""      
"Baltimore County"          ""      
end

Last edited by Chinmay Korgaonkar; 12 Oct 2023, 01:14.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30163
#12

12 Oct 2023, 10:05

Most likely you can do this satisfactorily with

Code:

gen byte wanted = strpos(Municipal_division, "County") > 0

There are some ways this can fail. The first is if the capitalization, which appears very regular in the example data, sometimes varies in the full data set. So, for example, this will fail to pick up a Muncicial_divison named, say, "Blahblah county". That's because, like everything in Stata, -strpos()- is case sensitive. If this might happen in your data, a safer approach would be

Code:

gen byte wanted = strpos(lower(Municipal_division), "county")) > 0

Most large data sets are likely to contain some variations like this, so unless you are very confident that the data set was very meticulously curated, go with this.

Another possible failure mode, in the opposite direction, is if there is some municipality (not a county) whose name itself contains "County". For example, if there is a village named "Countyseat Village", it will be misidentified by this code as a county entry. This could be avoided with a still more cautious -gen byte wanted = strpos(lower(" " + Municipal_division + " ", " county ")) > 0-. This will identify an observation as a county only if "county" appears as a whole word. It is unlikely that there are any municipalities with names like this, but I can't say that for sure. You probably understand this data better than I do and have a sense whether this level of caution is appropriate or excessive.

Even that could fail if there is a municipality whose name is more than one word and one of them is county. So, if there were a town called "Orange County City", this would be picked up even by my last proposed code. But at this point we are at the level of trying to identify strings based on semantics, not just string characteristics. That's an AI task, not something that can be done in Stata. To me, it seems far-fetched that there would be a municipality like this, but, again, I don't know. Again, I'm confident your intuitions about this would be better than mine.

My advice is to pick the approach that corresponds to the degree of regularity you expect the data set exhibits in municipal division names. Then check your results carefully. If you are unsure about the quality of the data, or it is from multiple sources, then go for my last approach as it is the safest. But, in a situation like this, no code can be 100% guaranteed.
Comment
Chinmay Korgaonkar

Join Date: Dec 2022

Posts: 35
#13

12 Oct 2023, 22:27

Thank you so much for the detailed advice. The second code was sufficient for the data. Thank you once again!
Comment

Announcement

Replace command: Assistance in data cleaning

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment