Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data Cleaning Strings with Nested Quotes (Regex)

    I would like to clean a group of string variables to remove data inside some nested quotes. I would like to replace these strings, with the substring after the word "value" in the string below (Integreation in red in my example below).

    String: {"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Integration"}
    Following the "Id": There is a range of numbers-1 or 2 digits

    Code:
    *** 0.A  Initializing local macros for filepaths and files
    ********************************************
    
    *** 0.A.1 Folder Navigation
    
    local fldr_teamsync        "C:\Users\\`c(username)'\Regional"
    local fldr_auto            "`fldr_teamsync'\ Documents\4. Refreshes\Automated Extracts"
    local fldr_lists         "`fldr_auto'\zS_Microsoft-Lists"
    
    ****************************************************************************************************
    *** PART 1:    Calling scheduled do files
    ****************************************************************************************************
    
    import delimited    "`fldr_lists'\Inventory - Data Sources.csv", bindquote(strict)
    
    local cleanvars portfolio portfolio_topic_subtopic status program_client program_data_entry     ///
          software2 type_primary_secondary
    
        
    foreach var in `cleanvars' {
        gen `var'_clean = regex(4) if regexm(`var', `"{"@odata.type":"#Microsoft.Azure.Connectors.SharePoint.SPListExpandedReference","Id":[0-9],"Value":,[a-zA-Z]+)
        }
    Any ideas on how to best accomplish this task is much appreciated.
    Last edited by Shannon Meadows; 29 Aug 2024, 09:03.

  • #2
    Code:
    clear
    input strL text
     `"{"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Integration"}"'
    end
    
    gen wanted= ustrregexra(text, ".*@odata.type.*Value.*[^a-zA-Z]([a-zA-Z]+).*$","$1")
    Res.:

    Code:
    . l
    
         +------------------------------------------------------------------------------------------------------------------------------+
         |                                                                                                           text        wanted |
         |------------------------------------------------------------------------------------------------------------------------------|
      1. | {"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Integration"}   Integration |
         +------------------------------------------------------------------------------------------------------------------------------+

    Comment


    • #3
      That worked well. Thank you Andrew Musau for your assistance.

      Code:
      foreach var in `cleanvars' {
          replace `var' = ustrregexra(`var', ".*@odata.type.*Value.*[^a-zA-Z]([a-zA-Z]+).*$","$1")  
      }

      Comment


      • #4
        Andrew Musau I have some variables, where the there are multiple "Value" strings. How would I get the above to repeat, to extract all of the values available in that variable. Example provided below:

        [{"@odata.type":"#Microsoft.Azure.Connectors.ShareP oint.SPListExpandedReference","Id":7,"Value":"2017 "},{"@odata.type":"#Microsoft.Azure.Connectors.Sha rePoint.SPListExpandedReference","Id":6,"Value":"2 018"},{"@odata.type":"#Microsoft.Azure.Connectors. SharePoint.SPListExpandedReference","Id":5,"Value" :"2019"}

        In this example, I would want to return 2017, 2018, 2019

        Code:
         
         foreach var in `cleanvars' {     replace `var' = ustrregexra(`var', ".*@odata.type.*Value.*[^a-zA-Z0-9]([a-zA-Z0-9]+).*$","$1")   }

        Comment


        • #5
          There is some structure to the repetitions, so you can use split.

          Code:
          clear
          input strL text
           `"{"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Integration"}"'
           `"{"@odata.type":"#Microsoft.Azure.Connectors.ShareP oint.SPListExpandedReference","Id":7,"Value":"2017 "},{"@odata.type":"#Microsoft.Azure.Connectors.Sha rePoint.SPListExpandedReference","Id":6,"Value":"2 018"},{"@odata.type":"#Microsoft.Azure.Connectors. SharePoint.SPListExpandedReference","Id":5,"Value" :"2019"}"'
          end
          
          split text, p(},{) g(strings)
          foreach string of varlist strings*{
           replace `string'= ustrregexra(`string', ".*@odata.type.*Value.*[^a-zA-Z0-9\s]([a-zA-Z0-9\s]+).*$","$1,")
          }
          egen wanted= concat(strings*)
          replace text= trim(itrim(substr(wanted, 1,length(wanted)-1)))
          drop strings* wanted
          Res.:

          Code:
          . l
          
               +------------------+
               |             text |
               |------------------|
            1. |      Integration |
            2. | 2017 ,2 018,2019 |
               +------------------+

          Comment


          • #6
            Andrew Musau thank you for your reply. Using the code above, I received an error r(111), variable text not found.

            If I replace 'text' with my macro `cleanvars', then I recieve and error r(103), too many variables specified.

            Code:
            local cleanvars portfolio portfolio_topic_subtopic status program_client program_data_entry     ///
                  hart_lead hart_access software2 type_primary_secondary
            
            *** 1c.0    Clean text variables with one or multiple text entries per cell    
            
            split `cleanvars', p(},{) g(strings)
            foreach string of varlist `cleanvars'*{
             replace `string'= ustrregexra(`string', ".*@odata.type.*Value.*[^a-zA-Z0-9\s]([a-zA-Z0-9\s]+).*$","$1,")
            }
            egen wanted= concat(strings*)
            replace `cleanvars'= trim(itrim(substr(wanted, 1,length(wanted)-1)))
            drop `cleanvars'* wanted
            Last edited by Shannon Meadows; 03 Sep 2024, 12:18.

            Comment


            • #7
              Code:
              egen text= concat(`cleanvars')
              if you do not intend on keeping each of the individual variables. Otherwise:

              Code:
              foreach var in `cleanvars'{
                  split `var', p(},{) g(strings)
                  foreach string of varlist strings*{
                      replace `string'= ustrregexra(`string', ".*@odata.type.*Value.*[^a-zA-Z0-9\s]([a-zA-Z0-9\s]+).*$","$1,")
                  }
                  egen wanted= concat(strings*)
                  replace `var'= trim(itrim(substr(wanted, 1,length(wanted)-1)))
                  drop strings* wanted
              }

              Comment


              • #8
                Thank you Andrew Musau. That worked well. In the `clearnvars' if there are no strings in a cell, there is a square bracket character in place " [ ", but I can't quite figure out where it is coming from. Any thoughts?

                Comment


                • #9
                  Your strings (some) appear to start with an opening square bracket (see your example in #4). This suggests that some empty strings take the form (highlighted):

                  Code:
                  clear
                  input strL text
                   `"[{"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Integration"}"'
                  `"{"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Some text 123 4"}"'
                  `"[ ]"'
                   `"{"@odata.type":"#Microsoft.Azure.Connectors.ShareP oint.SPListExpandedReference","Id":7,"Value":"2017 "},{"@odata.type":"#Microsoft.Azure.Connectors.Sha rePoint.SPListExpandedReference","Id":6,"Value":"2 018"},{"@odata.type":"#Microsoft.Azure.Connectors. SharePoint.SPListExpandedReference","Id":5,"Value" :"2019"}"'
                  end
                  
                  local cleanvars text
                  
                  foreach var in `cleanvars'{
                      split `var', p(},{) g(strings)
                      foreach string of varlist strings*{
                          replace `string'= ustrregexra(`string', ".*@odata.type.*Value.*[^a-zA-Z0-9\s]([a-zA-Z0-9\s]+).*$","$1,")
                      }
                      egen wanted= concat(strings*)
                      replace `var'= trim(itrim(substr(wanted, 1,length(wanted)-1)))
                      drop strings* wanted
                  }
                  Res.:

                  Code:
                  . l
                  
                       +------------------+
                       |             text |
                       |------------------|
                    1. |      Integration |
                    2. |  Some text 123 4 |
                    3. |                [ |
                    4. | 2017 ,2 018,2019 |
                       +------------------+

                  You can add a replace command to get rid of these opening square brackets in the results.
                  Code:
                  clear
                  input strL text
                   `"[{"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Integration"}"'
                  `"{"@odata.type":"#Microsoft.Azure.Connectors.Sha reP oint.SPListExpandedReference","Id":9,"Value":Some text 123 4"}"'
                  `"[ ]"'
                   `"{"@odata.type":"#Microsoft.Azure.Connectors.ShareP oint.SPListExpandedReference","Id":7,"Value":"2017 "},{"@odata.type":"#Microsoft.Azure.Connectors.Sha rePoint.SPListExpandedReference","Id":6,"Value":"2 018"},{"@odata.type":"#Microsoft.Azure.Connectors. SharePoint.SPListExpandedReference","Id":5,"Value" :"2019"}"'
                  end
                  
                  local cleanvars text
                  
                  foreach var in `cleanvars'{
                      split `var', p(},{) g(strings)
                      foreach string of varlist strings*{
                          replace `string'= ustrregexra(`string', ".*@odata.type.*Value.*[^a-zA-Z0-9\s]([a-zA-Z0-9\s]+).*$","$1,")
                      }
                      egen wanted= concat(strings*)
                      replace `var'= trim(itrim(substr(wanted, 1,length(wanted)-1)))
                      replace `var'="" if `var'=="["
                      drop strings* wanted
                  }
                  Res.:

                  Code:
                  . l
                  
                       +------------------+
                       |             text |
                       |------------------|
                    1. |      Integration |
                    2. |  Some text 123 4 |
                    3. |                  |
                    4. | 2017 ,2 018,2019 |
                       +------------------+

                  Comment


                  • #10
                    Excellent, that worked. Thank you again Andrew Musau

                    Comment

                    Working...
                    X