Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replace command: Assistance in data cleaning

    Hello Everyone! I have data for one year on the municipal division of 25 counties of one state. The example Excel data is in the following format.
    Allegany County
    Barton
    Bel Air
    Bowling Green-Roberts Place
    Cresaptown
    Cumberland
    Total
    Anne Arundel
    LaVale
    Lonaconing
    Luke
    McCoole
    Midland
    Mt. Savage
    Potomac Park Addition
    Total
    I want to change it to the following format.
    Barton Allegany County
    Bel Air Allegany County
    Bowling Green-Roberts Place Allegany County
    Cresaptown Allegany County
    Cumberland Allegany County
    Total Allegany County
    LaVale Anne Arundel
    Lonaconing Anne Arundel
    Luke Anne Arundel
    McCoole Anne Arundel
    Midland Anne Arundel
    Mt. Savage Anne Arundel
    Potomac Park Addition Anne Arundel
    Total Anne Arundel
    I want to do the above for each of the 25 counties in the data.

    The code replace variable_name= "Allegany County" in 2/30 works. But doing it for all counties and then all years will have human errors.
    I want to know how to do this efficiently.

    Thanks in advance!
    Last edited by Chinmay Korgaonkar; 03 Oct 2023, 23:21.

  • #2
    The another issue I am facing is as below:

    The data for each county is in the following format.
    Caroline County
    Denton
    Federalsburg
    Goldsboro
    Greensboro
    Henderson
    Hillsboro
    Marydel
    Preston
    Ridgely
    Templeville
    Total
    Rural and Unincorporated
    Total
    I want to rename the first "Total" (appearing below Templeville) as "Total_incorporated." I want to do this for all counties in the data. Again, using code like replace variable_name= "Total_incorporated" in 24 for each county will have human errors.

    Can I do it efficiently?
    Thank you.

    Comment


    • #3
      Showing us the data from Excel is suggestive but not truly helpful. If you have not yet imported this data into Stata, it is premature to ask for any Stata help other than with the process of importing itself. If you have, then it is far more helpful to show the example from Stata, using the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      The data appears to be messy. For example, some of the municipality names (or Total) have trailing blank spaces, and others do not. So all of that has to be made uniform first. After doing that, the code I have written here assumes that the variability in the arrangements in your entire data set is limited to what you have shown in the example. Ultimately, the reading of free-form data is a task for an AI program, not a statistical package. But if the variation is limited to a small number of possibilities, as in the example, then it can be handled:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str28 var1
      "Allegany County "            
      "Barton "                    
      "Bel Air "                    
      "Bowling Green-Roberts Place "
      "Cresaptown "                
      "Cumberland "                
      "Total "                      
      "Anne Arundel "              
      "LaVale "                    
      "Lonaconing "                
      "Luke "                      
      "McCoole "                    
      "Midland "                    
      "Mt. Savage "                
      "Potomac Park Addition "      
      "Total"                      
      "Caroline County"            
      "Denton"                      
      "Federalsburg"                
      "Goldsboro"                  
      "Greensboro"                  
      "Henderson"                  
      "Hillsboro"                  
      "Marydel"                    
      "Preston"                    
      "Ridgely"                    
      "Templeville"                
      "Total"                      
      "Rural and Unincorporated"    
      "Total"                      
      end
      
      replace var1 = trim(itrim(var1))
      gen byte is_header = ///
          (var1[_n-1] == "Total" & !inlist(var1, "Total", "Rural and Unincorporated")) ///
          | _n == 1
      gen county = var1 if is_header
      replace county = county[_n-1] if missing(county)
      
      drop if is_header
      drop is_header
      rename var1 municipal_division
      sort county, stable
      by county: gen n_totals = sum(municipal_division == "Total")
      replace municipal_division = "Total Incorporated" if municipal_division == "Total"    ///
          & n_totals == 1
      That said, please don't do this. While it is common for speadsheets to contain rows that are totals of rows above them, doing that in Stata is a recipe for serious problems. Unless you are meticulously thorough in always explicitly excluding the total rows from all calculations you will do, these total rows will end up double counting the rows they represent in all of your calculations. It is clear from your post that you are aware of the potential for human error, and having these total rows in there is a human-error trap waiting to spring. It is far too easy to forget to exclude them from some of the commands where they don't belong. It is like scattering throw rugs around the home of a person who can barely keep his balance. Please get rid of those total rows. Stata is a statistical program, not a spreadsheet. Using a spreadsheet-oriented data organization will only lead to tears.

      If you will need to use the county totals themselves frequently, put them in a separate data set. If you will need to use them occasionally, you can always re-calculate them when the need arises using the -collapse- command, or creating county-total variables with -egen, total()-. Just don't have the totals and the disaggregated data as separate observations in the same data set.
      Last edited by Clyde Schechter; 04 Oct 2023, 11:18.

      Comment


      • #4
        Thank you so much for the code and detailed advice. I will keep that in mind.

        This is the sample data structure of two county groups and their municipal subdivisions. I want county names such as "Allegany County" and "Anne Arundel County" to appear beside all the municipal subdivisions below them as a separate column.
        I also want to rename the first "Totals" in each county group as "Totals_incorporated"

        I am working on how to modify and apply your code for these two tasks.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str27 municipal_division str17 Number_Returns_Filed
        "Allegany County"             ""          
        "Barton"                      "209"        
        "Bel Air"                     "683"        
        "Bowling Green-Roberts Place" "488"        
        "Cresaptown"                  "711"        
        "Cumberland"                  "8272"      
        "Ellerslie"                   "233"        
        "Frostburg"                   "2844"      
        "LaVale"                      "2816"      
        "Lonaconing"                  "422"        
        "Luke"                        "45"        
        "McCoole"                     "139"        
        "Midland"                     "203"        
        "Mt. Savage"                  "359"        
        "Potomac Park Addition"       "396"        
        "Westernport"                 "751"        
        "Totals"                      "18571"      
        "Rural and Unincorporated"    "10171"      
        "Totals"                      "28742"      
        ""                            ""          
        "Anne Arundel County"         ""          
        "Annapolis"                   "21414"      
        "Highland Beach"              "91"        
        "Totals"                      "21505"      
        "Rural and Unincorporated"    "263550"    
        "Totals"                      "285055"    
        ""                            ""        
        end
        Thank you once again!

        Comment


        • #5
          It's just some very minor revisions of what I wrote in #3. For one, var1 has been replaced everywhere by municipal_division
          Code:
          gen byte is_header = _n == 1 | municipal_division[_n-1] == ""
          gen county = municipal_division if is_header
          replace county = county[_n-1] if missing(county)
          drop if is_header
          drop is_header
          
          sort county, stable
          by county: gen n_totals = sum(municipal_division == "Totals")
          replace municipal_division = "Total Incorporated" if municipal_division == "Totals"    ///
              & n_totals == 1
              
          drop n_totals
          drop if missing(municipal_division)
          
          destring Number_Returns_Filed, replace
          I added the final -destring- command because the variable clearly is intended to be numeric, and I suspect you will want to do some calculations with it. But it was imported as a string variable (because its value in the very first row of the Excel data was missing).

          Again, though, I warn you that using a data set that contains separate observations for items and totals is dangerous. You do it at your peril.

          Comment


          • #6
            If you really need the totals to be in the data set, the following gives a safer way to organize the data. After the code shown in #5, you can do this:
            Code:
            drop if substr(municipal_division, 1, 5) == "Total"
            by county (municipal_division), sort: egen grand_total = total(Number_Returns_Filed)
            by county (municipal_division): egen total_incorporated = ///
                total(cond(municipal_division != "Rural and Unincorporated", Number_Returns_Filed, .))

            Comment


            • #7
              Thank you very much once again. The code worked perfectly. I am working on this data as a part of an RA assignment, and I was specifically asked not to delete any data. However, I will make a note in do file that the cleaned data contains totals, too. Thanks once again.

              Comment


              • #8
                Hello, everyone and Clyde Schechter

                I am facing two issues while cleaning the data for other years. These are due formatting of Excel files when I converted them from PDF files. This was the best PDF to Excel file conversion I got after trying multiple conversion options.

                The cleaned sample data looks like the following as of now:


                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str29 Municipal_division str13 Number_Returns_Filed str17(Net_State_Tax Net_Local_Tax Net_Total_Tax) double G
                "Calvert"                    ""       ""                  ""             ""                             .
                "Chesapeake Beach"           "2571"   "6731199.41"        "3983224.57"   "10714423.98"                  .
                "North Beach"                "1004"   "1944394.88"        "1171853.44"   "3116248.32"                   .
                "Totals"                     "3575"   "8675594.289999999" "5155078.01"   "13830672.3"                   .
                " Rural and Unincorporated " "37878"  "98603342.31999999" "58553325.56"  "157156667.88"                 .
                "Totals"                     "41453"  "107278936.61"      "63708403.57"  "170987340.18"                 .
                "Caroline"                   ""       ""                  ""             ""                             .
                "Denton"                     "1687"   "2143260.67"        "1275887.03"   "3419147.7"                    .
                "Federalsburg"               "1186"   "797918.3"          "496939.62"    "1294857.92"                   .
                "Goldsboro"                  "124"    "99374.85000000001" "66705.64"     "166080.49"                    .
                "Greensboro"                 "768"    "736489.26"         "450834.62"    "1187323.88"                   .
                "Henderson"                  "56"     "30126"             "20624.84"     "50750.84"                     .
                "Hillsboro"                  "89"     "64075.18"          "38285.98"     "102361.16"                    .
                "Marydel"                    "88"     "53539.35"          "37393.83"     "90933.17999999999"            .
                "Preston"                    "329"    "455103.3"          "263709.39"    "718812.6899999999"            .
                "Ridgely"                    "696"    "901211.84"         "530353.13"    "1431564.97"                   .
                "Templeville"                "12"     "3974"              "2929.06"      "6903.06"                      .
                "Totals"                     "5035"   "5285072.75"        "3183663.14"   "8468735.890000001"            .
                " Rural and Unincorporated " "9369"   "13495233.97"       "8070272.44"   "21565506.41"                  .
                "Totals"                     "14404"  "18780306.72"       "11253935.58"  "30034242.3"                   .
                ""                           ""       ""                  ""             ""                             .
                ""                           ""       ""                  ""             ""                             .
                ""                           ""       ""                  ""             ""                             .
                "Carroll"                    ""       ""                  ""             ""                             .
                "Hampstead"                  ""       "3014"              "6087945.03"   "3977285.9"          10065230.93
                "Manchester"                 ""       "2123"              "4242961.55"   "2776010.67"          7018972.22
                "Mt. Airy"                   ""       "2305"              "7008405.53"   "4521873.2"          11530278.73
                "New Windsor"                ""       "604"               "1104205.52"   "723073.86"           1827279.38
                "Sykesville"                 ""       "2105"              "5372577.96"   "3507678.14"           8880256.1
                "Taneytown"                  ""       "2989"              "4410801.97"   "2902180.52"          7312982.49
                "Union Bridge"               ""       "381"               "487966.08"    "322711.6"             810677.68
                "Westminster"                ""       "8777"              "15246812.22"  "9955907.35"         25202719.57
                "Totals"                     "TOTALS" "22298"             "43961675.86"  "28686721.24"         72648397.1
                " Rural and Unincorporated " ""       "57765"             "156351597.97" "100549786.85"      256901384.82
                "Totals"                     "TOTALS" "80063"             "200313273.83" "129236508.09"      329549781.92
                end
                1. What to do when the header is not clearly identified as there is no space between the two observations. For example, "Caroline" and "Calvert" are two different counties. But as there is no space between them, Stata does not identify them as headers. If I use insobs 1, after (6) the issue is resolved. But I have to use this for each such case in data.
                2. For "Carroll" County, the variables are shifted to the right. The G variable should not exit. Using replace to change the value of each cell is impossible.
                Thanks in advance!
                Last edited by Chinmay Korgaonkar; 06 Oct 2023, 18:14.

                Comment


                • #9
                  Well, for the first difficulty, at least we still have the regularity that a new county appears after a sequence of Totals, Rural & Unncorporated, Totals (perhaps with some blank observations in between). The second is not that hard to fix.
                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input str29 Municipal_division str13 Number_Returns_Filed str17(Net_State_Tax Net_Local_Tax Net_Total_Tax) double G
                  "Calvert"                    ""       ""                  ""             ""                             .
                  "Chesapeake Beach"           "2571"   "6731199.41"        "3983224.57"   "10714423.98"                  .
                  "North Beach"                "1004"   "1944394.88"        "1171853.44"   "3116248.32"                   .
                  "Totals"                     "3575"   "8675594.289999999" "5155078.01"   "13830672.3"                   .
                  " Rural and Unincorporated " "37878"  "98603342.31999999" "58553325.56"  "157156667.88"                 .
                  "Totals"                     "41453"  "107278936.61"      "63708403.57"  "170987340.18"                 .
                  "Caroline"                   ""       ""                  ""             ""                             .
                  "Denton"                     "1687"   "2143260.67"        "1275887.03"   "3419147.7"                    .
                  "Federalsburg"               "1186"   "797918.3"          "496939.62"    "1294857.92"                   .
                  "Goldsboro"                  "124"    "99374.85000000001" "66705.64"     "166080.49"                    .
                  "Greensboro"                 "768"    "736489.26"         "450834.62"    "1187323.88"                   .
                  "Henderson"                  "56"     "30126"             "20624.84"     "50750.84"                     .
                  "Hillsboro"                  "89"     "64075.18"          "38285.98"     "102361.16"                    .
                  "Marydel"                    "88"     "53539.35"          "37393.83"     "90933.17999999999"            .
                  "Preston"                    "329"    "455103.3"          "263709.39"    "718812.6899999999"            .
                  "Ridgely"                    "696"    "901211.84"         "530353.13"    "1431564.97"                   .
                  "Templeville"                "12"     "3974"              "2929.06"      "6903.06"                      .
                  "Totals"                     "5035"   "5285072.75"        "3183663.14"   "8468735.890000001"            .
                  " Rural and Unincorporated " "9369"   "13495233.97"       "8070272.44"   "21565506.41"                  .
                  "Totals"                     "14404"  "18780306.72"       "11253935.58"  "30034242.3"                   .
                  ""                           ""       ""                  ""             ""                             .
                  ""                           ""       ""                  ""             ""                             .
                  ""                           ""       ""                  ""             ""                             .
                  "Carroll"                    ""       ""                  ""             ""                             .
                  "Hampstead"                  ""       "3014"              "6087945.03"   "3977285.9"          10065230.93
                  "Manchester"                 ""       "2123"              "4242961.55"   "2776010.67"          7018972.22
                  "Mt. Airy"                   ""       "2305"              "7008405.53"   "4521873.2"          11530278.73
                  "New Windsor"                ""       "604"               "1104205.52"   "723073.86"           1827279.38
                  "Sykesville"                 ""       "2105"              "5372577.96"   "3507678.14"           8880256.1
                  "Taneytown"                  ""       "2989"              "4410801.97"   "2902180.52"          7312982.49
                  "Union Bridge"               ""       "381"               "487966.08"    "322711.6"             810677.68
                  "Westminster"                ""       "8777"              "15246812.22"  "9955907.35"         25202719.57
                  "Totals"                     "TOTALS" "22298"             "43961675.86"  "28686721.24"         72648397.1
                  " Rural and Unincorporated " ""       "57765"             "156351597.97" "100549786.85"      256901384.82
                  "Totals"                     "TOTALS" "80063"             "200313273.83" "129236508.09"      329549781.92
                  end
                  
                  drop if missing(Municipal_division)
                  replace Municipal_division = trim(itrim(Municipal_division))
                  gen byte is_header = (Municipal_division[_n-1] == "Totals" & Municipal_division[_n-3] == "Totals" ///
                      & Municipal_division[_n-2] == "Rural and Unincorporated") | _n == 1
                  gen county = Municipal_division if is_header, before(Municipal_division)
                  replace county = county[_n-1] if missing(county)
                  drop if is_header
                  drop is_header
                  
                  //    DEAL WITH SKIPPED COLUMN IN SOME COUNTIES
                  gen byte problem = missing(Number_Returns_Filed) | Number_Returns_Filed == "TOTALS"
                  unab vbles: Number_Returns_Filed-G
                  tostring G, replace force
                  display `"`vbles'"'
                  local n_vbles: word count `vbles'
                  forvalues i = 2/`n_vbles' {
                      local v: word `i' of `vbles'
                      local w: word `=`i'-1' of `vbles'
                      replace `w' = `v' if problem
                  }
                  destring `vbles', replace
                  drop G problem
                  
                  sort county, stable
                  by county: gen n_totals = sum(Municipal_division == "Totals")
                  foreach v of varlist Number_Returns_Filed-Net_Total_Tax {
                      by county: egen total_inc_`v' = ///
                          max(cond(n_totals == 1 & Municipal_division == "Totals", `v', .))
                      by county: egen grand_total_`v' = ///
                          max(cond(n_totals == 2 & Municipal_division == "Totals", `v', .))
                  }
                  Who did you antagonize to get the assignment of working with this hot mess of a data source?

                  Comment


                  • #10
                    Thank you so much again. You have been a lifesaver!

                    Comment


                    • #11
                      Hello everyone,

                      I want to generate a new variable if the string variable contains a specific word. For example, I want to generate a new var1 = 1 if any observation in the variable "Municipal_division" contains the word "County". I tried split, substr, and word functions but was not able to get what I wanted. Thanks in advance.

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input str35 Municipal_division str17 Real_property_tax_municipal
                      "Allegany County"           ""      
                      "Barton"                    ".225"  
                      "Cumberland"                "1.0595"
                      "Frostburg"                 ".7"    
                      "Lonaconing"                ".3084"
                      "Luke - Residential"        "1.5"  
                      "Luke - Commercial"         "2.5"  
                      "Luke - Rented Residential" "2.07"  
                      "Midland"                   ".28"  
                      "Westernport"               ".6"    
                      "Anne Arundel County"       ""      
                      "Annapolis"                 ".738"  
                      "Highland Beach"            ".1505"
                      "Baltimore City"            ""      
                      "Baltimore County"          ""      
                      end
                      Last edited by Chinmay Korgaonkar; 12 Oct 2023, 01:14.

                      Comment


                      • #12
                        Most likely you can do this satisfactorily with
                        Code:
                        gen byte wanted = strpos(Municipal_division, "County") > 0
                        There are some ways this can fail. The first is if the capitalization, which appears very regular in the example data, sometimes varies in the full data set. So, for example, this will fail to pick up a Muncicial_divison named, say, "Blahblah county". That's because, like everything in Stata, -strpos()- is case sensitive. If this might happen in your data, a safer approach would be
                        Code:
                        gen byte wanted = strpos(lower(Municipal_division), "county")) > 0
                        Most large data sets are likely to contain some variations like this, so unless you are very confident that the data set was very meticulously curated, go with this.

                        Another possible failure mode, in the opposite direction, is if there is some municipality (not a county) whose name itself contains "County". For example, if there is a village named "Countyseat Village", it will be misidentified by this code as a county entry. This could be avoided with a still more cautious -gen byte wanted = strpos(lower(" " + Municipal_division + " ", " county ")) > 0-. This will identify an observation as a county only if "county" appears as a whole word. It is unlikely that there are any municipalities with names like this, but I can't say that for sure. You probably understand this data better than I do and have a sense whether this level of caution is appropriate or excessive.

                        Even that could fail if there is a municipality whose name is more than one word and one of them is county. So, if there were a town called "Orange County City", this would be picked up even by my last proposed code. But at this point we are at the level of trying to identify strings based on semantics, not just string characteristics. That's an AI task, not something that can be done in Stata. To me, it seems far-fetched that there would be a municipality like this, but, again, I don't know. Again, I'm confident your intuitions about this would be better than mine.

                        My advice is to pick the approach that corresponds to the degree of regularity you expect the data set exhibits in municipal division names. Then check your results carefully. If you are unsure about the quality of the data, or it is from multiple sources, then go for my last approach as it is the safest. But, in a situation like this, no code can be 100% guaranteed.

                        Comment


                        • #13
                          Thank you so much for the detailed advice. The second code was sufficient for the data. Thank you once again!

                          Comment

                          Working...
                          X