Drop Observation if string contains specific text string

Gal deu

Join Date: Mar 2015

Posts: 6
#1

24 May 2015, 09:36

I would appreciate your advice on dropping observations.

My data set contains a list of institution names (observations) in the variable "instnm" (see the attached screenshot).

I want to drop all institutions whose name contains the word "BEAUTY".

What is the best way to do so?

Thank you!!!
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

24 May 2015, 10:44

Code:

drop if strpos(instnm,"BEAUTY")>0

auto-correct strikes again; it keeps changing your variable name "instnm" to "instant"
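
As a minimal sketch of the same idea on toy data (the institution names below are made up, and the use of upper() is an assumption in case the names are not all upper case), one could first count the matches and then drop them:

```stata
* Toy data: these institution names are invented for illustration only
clear
input str40 instnm
"ACME BEAUTY SCHOOL"
"State University"
"Paris Beauty Academy"
"BEAUTYREST COLLEGE"
end

* strpos() returns 0 when the substring is not found;
* upper() makes the comparison case-insensitive
count if strpos(upper(instnm), "BEAUTY") > 0

* Drop every observation whose name contains "BEAUTY" anywhere
drop if strpos(upper(instnm), "BEAUTY") > 0
list
```

Note that strpos() matches substrings, so a name such as "BEAUTYREST COLLEGE" is dropped as well; checking the count before dropping helps catch surprises of that kind.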
Julian Scholz

Join Date: May 2019

Posts: 22
#3

15 Apr 2021, 02:49

Mike, thanks for your help! In order to "clean" the data set, I used the following code:

[CODE]
clear
**Assume "zzzzz" never occurs so that each line is read as one string.
import delimited using "C:\Users\Scholz.ECFS-SERV\Desktop\DSGF\sample.txt", delimiter("zzzzz", asstring)
rename v1 s
// Put an ID and line number on each line that belongs to the same transaction.
gen int ID = .
quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
gen long origorder = _n // new
bysort ID (origorder) : gen int line = _n //new
order ID line // shows the structure
desc
**drop noise
keep if strpos(s,"REFERENZ")>0 | strpos(s, "ERFASSUNG") >0 | strpos(s, "FREIGABE")>0
[/CODE]

Let's assume that I am only interested in the observations of s that contain one of the strings used in the "keep" command above. I would now like to restructure the data set so that each reference ("REFERENZ-NUMMER") identifies an observation, with the variables of interest being the dates and times in s that follow the prefix "ERFASSUNG/BEARBEITUNG" or "FREIGABE". In s, the left-hand side contains what are effectively the variable names (e.g., "REFERENZ-NUMMER", "ERFASSUNG", etc.), and the right-hand side contains the actual values I am interested in: the dates and times when the transactions were processed and approved, and the employee (here anonymized) who processed them (e.g., VVVN).

[CODE]
ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11
[/CODE]

The line above, therefore, contains three variables: the name of the employee who processed it ("VVVN"), the date (25.10.2019) and time (10:11).

Ideally, the data set would look like this, with each reference as an identifier and the other variables containing the processing date, time, and employee (if processing was not automated).

Reference          Automated_processing_date  Automated_processing_time  Processing_1_employee  Processing_1_date  Processing_1_time
191025022BB110025  25.10.2019                 09:53                      VVVN                   25.10.2019         10:11

I hope I have explained the desired structure of the data set clearly. Any suggestions on how to accomplish this?

Regards,
Julian
Julian Scholz

Join Date: May 2019

Posts: 22
#4

15 Apr 2021, 02:50

Mike, thanks for your help! In order to "clean" the data set, I used the following code:

[CODE]
clear
**Assume "zzzzz" never occurs so that each line is read as one string.
import delimited using "C:\Users\Scholz.ECFS-SERV\Desktop\DSGF\sample.txt", delimiter("zzzzz", asstring)
rename v1 s
// Put an ID and line number on each line that belongs to the same transaction.
gen int ID = .
quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
gen long origorder = _n // new
bysort ID (origorder) : gen int line = _n //new
order ID line // shows the structure
desc
**drop noise
keep if strpos(s,"REFERENZ")>0 | strpos(s, "ERFASSUNG") >0 | strpos(s, "FREIGABE")>0
[/CODE]

Let's assume that I am only interested in the observations of s that contain one of the strings used in the "keep" command above. I would now like to restructure the data set so that each reference ("REFERENZ-NUMMER") identifies an observation, with the variables of interest being the dates and times in s that follow the prefix "ERFASSUNG/BEARBEITUNG" or "FREIGABE". In s, the left-hand side contains what are effectively the variable names (e.g., "REFERENZ-NUMMER", "ERFASSUNG", etc.), and the right-hand side contains the actual values I am interested in: the dates and times when the transactions were processed and approved, and the employee (here anonymized) who processed them (e.g., VVVN).

[CODE]
ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11
[/CODE]

The line above, therefore, contains three variables: the name of the employee who processed it ("VVVN"), the date (25.10.2019), and time (10:11).
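
One possible way to pull these three pieces out of such a line is sketched below. It assumes the fields are always separated by blanks and always appear in this order; the variable names employee, pdate, and ptime are hypothetical and chosen only for illustration.

```stata
* One example line, entered by hand for illustration
clear
input str80 s
"ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11"
end

* word(s, n) returns the n-th blank-delimited word of s
gen str10 employee = word(s, 3)

* date() with the mask "DMY" converts day.month.year text to a Stata daily date
gen long pdate = date(word(s, 4), "DMY")
format pdate %td

* The time is kept as a string here (hh:mm)
gen str5 ptime = word(s, 5)

list employee pdate ptime
```

If some lines carry fewer or more fields (for example, automated processing without an employee), the word positions would differ and this sketch would need a separate rule for those lines.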

Ideally, the data set would look like this, with each reference as an identifier and the other variables containing the processing date, time, and employee (if processing was not automated). For the analysis, I think the long format is the way to go.

Reference          Automated_processing_date  Automated_processing_time  Processing_1_employee  Processing_1_date  Processing_1_time
191025022BB110025  25.10.2019                 09:53                      VVVN                   25.10.2019         10:11

I hope I have explained the desired structure of the data set clearly. Any suggestions on how to accomplish this?

Regards,
Julian

Last edited by Julian Scholz; 15 Apr 2021, 02:53.
Julian Scholz

Join Date: May 2019

Posts: 22
#5

15 Apr 2021, 02:56

sorry, wrong thread