  • Drop observations if a string contains a specific text string

    I would appreciate your advice on dropping observations.

    My data set contains a list of institution names (observations) in the variable "instnm" (see the attached screenshot).

    I want to drop all the institutions whose names contain the word "BEAUTY".

    What would be the best way to do so?

    Thank you!!!

  • #2
    Code:
    drop if strpos(instnm, "BEAUTY") > 0
    Auto-correct struck again: it kept changing your variable name to "instant", so the line above spells it out as instnm.
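
One caveat worth flagging: strpos() is case-sensitive, so a search for "BEAUTY" would miss a name spelled "Beauty". Below is a minimal sketch of a case-insensitive variant, assuming the variable is called instnm as in the question. The four institution names are invented toy data, not taken from the real file.

```stata
* Toy data standing in for the real file (names are made up for illustration).
clear
input str40 instnm
"ACME BEAUTY COLLEGE"
"State University"
"Bella Beauty School"
"Technical Institute"
end

* Upper-case the name before searching so that "Beauty" and "BEAUTY" both match.
drop if strpos(upper(instnm), "BEAUTY") > 0

* Only the two names without the word remain.
list
```

If the names in the real data are already all upper case, the plain strpos() call is enough.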



    • #3
      Mike, thanks for your help! To "clean" the data set, I used the following code:

      [CODE]
      clear
      * Assume "zzzzz" never occurs, so that each line is read as one string.
      import delimited using "C:\Users\Scholz.ECFS-SERV\Desktop\DSGF\sample.txt", delimiter("zzzzz", asstring)
      rename v1 s
      // Put an ID and line number on each line that belongs to the same transaction.
      gen int ID = .
      quietly replace ID = cond(_n == 1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
      gen long origorder = _n // new
      bysort ID (origorder) : gen int line = _n // new
      order ID line // shows the structure
      describe
      * Drop noise: keep only the lines of interest.
      keep if strpos(s, "REFERENZ") > 0 | strpos(s, "ERFASSUNG") > 0 | strpos(s, "FREIGABE") > 0
      [/CODE]

      Let's assume that I am only interested in the observations in s that contain the strings used in the "keep" command above. I would now like to structure the data set so that each reference ("REFERENZ-NUMMER") identifies an observation, with the variables of interest being the dates and times in s that follow the prefix "ERFASSUNG/BEARBEITUNG" or "FREIGABE". In s, the left-hand side contains the variable names (e.g., "REFERENZ-NUMMER", "ERFASSUNG", etc.) and the right-hand side contains the actual values I am interested in: the dates and times when the transactions were processed and approved, and the employee (here: anonymized) who processed them (e.g., VVVN).

      [CODE]
      ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11
      [/CODE]

      The line above therefore contains three variables: the employee who processed the transaction ("VVVN"), the date (25.10.2019), and the time (10:11).
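
As a rough illustration of how those three pieces could be pulled out of s, here is a minimal sketch. It assumes the fields are separated by blanks and always appear in the order shown in the sample line; the new variable names (employee, date_str, time_str, proc_date, proc_dt) are hypothetical and chosen only for this example.

```stata
* One-line toy data set containing only the sample line quoted above.
clear
input str80 s
"ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11"
end

* Assumption: blank-separated fields in a fixed order (3 = employee, 4 = date, 5 = time).
gen str20 employee = word(s, 3)
gen str10 date_str = word(s, 4)
gen str5  time_str = word(s, 5)

* Convert the strings to a Stata daily date and a datetime for later arithmetic.
gen int proc_date = daily(date_str, "DMY")
format proc_date %td
gen double proc_dt = clock(date_str + " " + time_str, "DMY hm")
format proc_dt %tc

list employee proc_date proc_dt
```

If some lines carry extra or missing fields, counting words from the left would misalign, so it may be worth checking wordcount(s) on the real data first.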

      Ideally, the data set would look like this, with each reference as an identifier and the other variables containing the processing date, time, and employee (if processing was not automated).
      Reference          Automated_processing_date  Automated_processing_time  Processing_1_employee  Processing_1_date  Processing_1_time
      191025022BB110025  25.10.2019                 09:53                      VVVN                   25.10.2019         10:11
      I hope this explains the desired structure of the data set. Any suggestions on how to accomplish this?

      Regards,
      Julian



      • #4
        Mike, thanks for your help! To "clean" the data set, I used the following code:

        [CODE]
        clear
        * Assume "zzzzz" never occurs, so that each line is read as one string.
        import delimited using "C:\Users\Scholz.ECFS-SERV\Desktop\DSGF\sample.txt", delimiter("zzzzz", asstring)
        rename v1 s
        // Put an ID and line number on each line that belongs to the same transaction.
        gen int ID = .
        quietly replace ID = cond(_n == 1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
        gen long origorder = _n // new
        bysort ID (origorder) : gen int line = _n // new
        order ID line // shows the structure
        describe
        * Drop noise: keep only the lines of interest.
        keep if strpos(s, "REFERENZ") > 0 | strpos(s, "ERFASSUNG") > 0 | strpos(s, "FREIGABE") > 0
        [/CODE]

        Let's assume that I am only interested in the observations in s that contain the strings used in the "keep" command above. I would now like to structure the data set so that each reference ("REFERENZ-NUMMER") identifies an observation, with the variables of interest being the dates and times in s that follow the prefix "ERFASSUNG/BEARBEITUNG" or "FREIGABE". In s, the left-hand side contains the variable names (e.g., "REFERENZ-NUMMER", "ERFASSUNG", etc.) and the right-hand side contains the actual values I am interested in: the dates and times when the transactions were processed and approved, and the employee (here: anonymized) who processed them (e.g., VVVN).

        [CODE]
        ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11
        [/CODE]

        The line above therefore contains three variables: the employee who processed the transaction ("VVVN"), the date (25.10.2019), and the time (10:11).

        Ideally, the data set would look like this, with each reference as an identifier and the other variables containing the processing date, time, and employee (if processing was not automated). For the analysis, I think the long format is the way to go.
        Reference          Automated_processing_date  Automated_processing_time  Processing_1_employee  Processing_1_date  Processing_1_time
        191025022BB110025  25.10.2019                 09:53                      VVVN                   25.10.2019         10:11
        I hope this explains the desired structure of the data set. Any suggestions on how to accomplish this?

        Regards,
        Julian
        Last edited by Julian Scholz; 15 Apr 2021, 02:53.



        • #5
          sorry, wrong thread

