Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing Leading and Trailing Spaces

    Consider the following full worked example
    Code:
    import delimited "http://qed.econ.queensu.ca/jae/datasets/ke001/readme.kh.txt", ///
     rowrange(45:75) clear delim(")") bindquote(strict)
    
     replace v2 =ustrregexra(v2,"\p{Z}","")
    My goal is to remove all leading and trailing spaces in my string variable, since a command I've written doesn't play nice with trailing spaces.

    As current, my syntax here gets rid of leading spaces (and apparently the one for Inner Mongolia, but for now I can live with that). Optimally, I'd want all leading and trailing spaces gone. Equally, I've tried
    Code:
    import delimited "http://qed.econ.queensu.ca/jae/datasets/ke001/readme.kh.txt", ///
     rowrange(45:75) clear delim(")") bindquote(strict)
    
     replace v2 =strtrim(v2)
    And I still have trailing spaces. Why? Maybe it's a special character being interpreted by me as a space?

  • #2
    Hey Jared, I just used R to quickly print the string literal of the section of text you are working with:

    Code:
    "(1) Beijing\t\n(2) Tianjin\t\n(3) Hebei\t\n(4) Shanxi\t\n(5) Inner Mongolia\t\n(6) Liaoning\t\n(7) Jilin\t\n(8) Heilongjiang\t\n(9) Shanghai\t\n(10) Jiangsu\t\n(11) Zhejiang\t\n(12) Anhui\t\n(13) Fujian\t\n(14) Jiangxi\t\n(15) Shandong\t\n(16) Henan\t\n(17) Hubei\t\n(18) Hunan\t\n(19) Guangdong\t\n(20) Guangxi\t\n(21) Hainan\t\n(22) Chongqing\t\n(23) Sichuan\t\n(24) Guizhou\t\n(25) Yunnan\t\n(26) Tibet\t\n(27) Shaanxi\t\n(28) Gansu\t\n(29) Qinghai\t\n(30) Ningxia\t\n(31) Xinjiang"
    I'm thinking those tab characters (\t) are causing your problem.

    Comment


    • #3
      Your data has a nuisance trailing tab on each line.
      Code:
      28 31 29 20 42 65 69 6A 69 6E 67 09 0A
      (  1  )     B  e  i  j  i  n  g
      Not sure enough about \p{Z} to say why your regex didn't solve the problem.

      Comment


      • #4
        Code:
        replace v2 =ustrtrim(v2)
        See

        Code:
        help ustrtrim()

        Comment


        • #5
          Yeah I figured this was part of the answer.
          Code:
          import delimited "http://qed.econ.queensu.ca/jae/datasets/ke001/readme.kh.txt", ///
              delimiter(tab) varnames(1) ///
              rowrange(45:75) clear  ///
              bindquote(nobind)
              
          rename xi province
          
          split province, p(")")
          
          replace province1 = regexs(2) if regexm(province1, "^([^0-9]*)([0-9]+)([^0-9]*)$")
          
          drop province v2
          
          rename (province1 province2) (id province)
          
          replace province = subinstr(province, " ", "", .)
          This is more or less what I've done, which works. Presumably, a more efficient solution exists.

          Comment


          • #6
            This seems sufficient for your described problem. I also use Andrew's suggestion of ustrtrim() and use a different regex pattern to remove the province number.

            Code:
            gen want = ustrregexs(2) if ustrregexm(province, "^(\([0-9]+\)[\s]+)(.*)", .)
            replace want = ustrtrim(want) // for good measure, but not necessary with example data

            Comment


            • #7
              Yep all the solutions here work. Thank you all so much!

              Comment

              Working...
              X