Searching distinct identifier in a folder

Marco Errico

Join Date: Apr 2020

Posts: 187
#1

02 Apr 2021, 07:05

Hi everyone,

I have the following issue. In one folder, I have stored several xlsx files. Each file corresponds to an identifier, and each identifier is repeated four times. Below is a screenshot to give you an idea:

As you can see, each identifier is repeated 4 times, with the suffixes _1 _2 _3 _4.

What I need to do with Stata is list each unique identifier that starts with the letters "GB"; I then have to copy and paste these identifiers for another piece of work. The critical point is that, to be included in my sample, each identifier must have all of the _1 _2 _3 _4 files. If, for instance, an identifier is missing _2 or _1, it must not be included in my list.

I'm wondering whether this is feasible and can be done efficiently with Stata.
Would you be so kind as to help me?

Many thanks in advance
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#2

02 Apr 2021, 07:24

I would use Robert Picard's filelist command (from SSC).

Something along the lines of

Code:

filelist, dir("YOUR_FOLDER_DIRECTORY") pattern("GB*.xlsx") norec
split filename, parse("_")
bys filename1: keep if _N == 4

The code is untested, of course, but should hopefully give you some direction. Just add in your folder's path to the dir() option.
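If it helps, here is a small sketch on made-up file names (GB001 and GB002 are invented, not your data) showing what split creates and how many files each stub has, so the counts can be checked before anything is dropped. It assumes the identifier itself contains no underscore. With real data, the filename variable would come from filelist instead of input.

```stata
* Toy stand-in for filelist output; these file names are invented
clear
input str20 filename
"GB001_1.xlsx"
"GB001_2.xlsx"
"GB001_3.xlsx"
"GB001_4.xlsx"
"GB002_1.xlsx"
"GB002_2.xlsx"
end

* split creates filename1 (text before "_") and filename2 (text after it)
split filename, parse("_")

* number of files found for each stub
bysort filename1: gen nfiles = _N

* inspect the result: one block of rows per stub, with its file count
list filename filename1 filename2 nfiles, sepby(filename1)
```

Here GB001 has four files and GB002 only two, so only GB001 would survive keep if _N == 4.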
Marco Errico

Join Date: Apr 2020

Posts: 187
#3

04 Apr 2021, 04:00

Hi Justin Niakamal

Many thanks for your code, which worked properly. Sorry for the late reply.

I only have one small (and perhaps silly) point.

I have to keep those identifiers (that start with GB) only if they have all 4 files. Some identifiers may have only "identifier_4".
I'm wondering how I can do that. I tried:

Code:

bys filename: keep if _N == 4 & _N==3 & _N==2 & _N==1

Nevertheless, it deletes all my observations. Do you have any suggestions? I guess the solution is very easy, but it doesn't come to mind right now.

Many thanks
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#4

04 Apr 2021, 07:18

Your observations are getting deleted because you're using and (&) where you should be using or (|). But that could be simplified to bys filename: keep if _N <= 4.
I'm not sure I understand the second question.
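To see why the & version drops everything, here is a minimal sketch on made-up values (the variable n is hypothetical and simply stands in for _N): no number can equal 4 and 3 and 2 and 1 at the same time, so that condition is never true.

```stata
* Five toy observations with n = 1, 2, 3, 4, 5 (n stands in for _N)
clear
set obs 5
gen n = _n

* "and": a single value would have to equal all four numbers at once
count if n == 4 & n == 3 & n == 2 & n == 1

* "or": true whenever n is any one of the four numbers
count if n == 4 | n == 3 | n == 2 | n == 1

* the same "or" condition written more compactly
count if inlist(n, 1, 2, 3, 4)
```

The first count matches no observations, while the other two match the four rows with n from 1 to 4.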
Marco Errico

Join Date: Apr 2020

Posts: 187
#5

04 Apr 2021, 12:01

Thanks Justin Niakamal
It reports 0 observations deleted.

I'm not questioning the command at all, but it is puzzling that it doesn't delete any observations.

I remember that some of the identifiers only had _1 and _2 (but not _3 and _4).

Would you be so kind as to confirm that with

Code:

bys filename: keep if _N <= 4

I'm keeping only the identifiers that have ALL 4 files?
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#6

04 Apr 2021, 12:05

Sorry, that should be filename1 (because when you split on "_", Stata creates new variables filename1, filename2, ... from the stub filename).

Code:

bys filename1: keep if _N <= 4
Marco Errico

Join Date: Apr 2020

Posts: 187
#7

04 Apr 2021, 12:21

Thanks Justin Niakamal.

It still doesn't delete any observations. I guess there is nothing wrong with the code, but it is puzzling to me why.
Thanks for your suggestions all the same. They have been really helpful and will save me a lot of time.
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#8

04 Apr 2021, 12:52

That code (with _N <= 4) won't delete any observations unless some identifiers have more than four files (_5, _6, etc.). The code in #2 should do what you originally asked.

Code:

bys filename1: keep if _N == 4

This tells Stata to keep a stub (the part of the filename before "_") only if it has exactly four files.
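Note that _N == 4 only counts files; it does not check that the four files are exactly _1 to _4. If that matters, a stricter variant might look like the sketch below. The file names are invented, the variables suffix and nvalid are hypothetical names chosen for illustration, and the sketch assumes every file ends in .xlsx and that the identifier contains no underscore.

```stata
* Toy stand-in for filelist output; GB002 has four files but the wrong suffixes
clear
input str20 filename
"GB001_1.xlsx"
"GB001_2.xlsx"
"GB001_3.xlsx"
"GB001_4.xlsx"
"GB002_1.xlsx"
"GB002_2.xlsx"
"GB002_5.xlsx"
"GB002_6.xlsx"
end

split filename, parse("_")

* filename2 looks like "1.xlsx"; strip the extension and convert to a number
gen suffix = real(subinstr(filename2, ".xlsx", "", .))

* per stub, count the files whose suffix is one of 1, 2, 3, 4
bysort filename1: egen nvalid = total(inlist(suffix, 1, 2, 3, 4))

* keep a stub only if all four expected files were found
keep if nvalid == 4

* reduce to one row per surviving identifier
keep filename1
duplicates drop
list
```

In this toy data GB002 would pass the _N == 4 check but is dropped here, leaving GB001 as the only identifier in the list.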
Marco Errico

Join Date: Apr 2020

Posts: 187
#9

04 Apr 2021, 14:17

Brilliant,

Now I get it, and it works properly, deleting the observations I wanted to drop.

Many thanks for your time, Justin Niakamal
