  • Searching distinct identifier in a folder

    Hi everyone,

    I have the following issue. In one folder, I have stored a number of xlsx files. Each file corresponds to an identifier, and each identifier is repeated four times. Below is a screenshot, just to give you a snapshot:

    [Screenshot: Screenshoot.png]

    As you can see, each identifier is repeated 4 times, with the suffixes _1 _2 _3 _4.

    What I need to do with Stata is to list each unique identifier that starts with the letters "GB"; I then have to copy and paste these identifiers for another type of work. The critical point is that, in order to be included in my sample, each identifier must have all of the _1 _2 _3 _4 files. If, for instance, one identifier is missing _2 or _1, it must not be included in my list.

    I'm wondering whether this is feasible and can be done efficiently with Stata.
    Would you be so kind as to help me?

    Many thanks in advance

  • #2
    I would use Robert Picard's filelist command (from SSC).

    Something along the lines of

    Code:
    filelist, dir("YOUR_FOLDER_DIRECTORY") pattern("GB*.xlsx") norec
    split filename, parse("_")
    bys filename1: keep if _N == 4
    The code is untested, of course, but should hopefully give you some direction. Just add in your folder's path to the dir() option.
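    To make the idea concrete without touching a real folder, here is a minimal sketch on toy data. The file names typed in with input are hypothetical stand-ins for what filelist would return in its filename variable, and the variable names stem, expected, and n_expected are made up for illustration. It assumes identifiers contain no "_" of their own and that every file ends in ".xlsx". Unlike a plain count, it also checks that the suffixes are among 1 to 4.

```stata
* Toy stand-in for the filelist output (hypothetical file names)
clear
input str20 filename
"GB001_1.xlsx"
"GB001_2.xlsx"
"GB001_3.xlsx"
"GB001_4.xlsx"
"GB002_1.xlsx"
"GB002_2.xlsx"
"GB003_4.xlsx"
end

* Drop the extension, then split "GB001_1" into identifier and suffix
generate stem = subinstr(filename, ".xlsx", "", .)
split stem, parse("_")

* Count how many of the expected suffixes each identifier has
generate byte expected = inlist(stem2, "1", "2", "3", "4")
bysort stem1: egen n_expected = total(expected)

* Keep complete identifiers only, then one row per identifier
keep if n_expected == 4
bysort stem1: keep if _n == 1
list stem1, noobs
```

    With these toy names only GB001 should survive, since GB002 and GB003 are missing files. On real data, the input block would be replaced by the filelist call above.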



    • #3
      Hi Justin Niakamal

      Many thanks for your code, which worked properly. Sorry for the late reply.

      I only have one small (and also silly) point.

      I have to keep those identifiers (that start with GB) as long as they have all 4 files. Some identifiers may have only the "identifier_4" file.
      I'm wondering how I can do that. I tried:

      Code:
      bys filename: keep if _N == 4  & _N==3 & _N==2 & _N==1
      Nevertheless, it deletes all my observations. Do you have any suggestions? I guess that the solution is very easy, but it doesn't come to mind right now.

      Many thanks



      • #4
        Your observations are getting deleted because you're using and (&) where you should be using or (|): _N cannot equal 4, 3, 2, and 1 at the same time, so the condition is never true. But the or version could be simplified to bys filename: keep if _N <= 4.
        I'm not sure I understand the second question.



        • #5
          Thanks Justin Niakamal
          It reports 0 observations deleted.

          I'm not questioning the command at all, but it is puzzling that it doesn't delete any observations.

          I remember that some of the files only had _1 and _2 (but not _3 and _4).

          Would you be so kind as to confirm that with
          Code:
          bys filename: keep if _N <= 4
          I'm keeping only the identifiers that have ALL 4 files?



          • #6
            Sorry, that should be filename1 (because when you split on "_", Stata creates new variables using filename as the stub: filename1, filename2, and so on).

            Code:
             bys filename1: keep if _N <= 4



            • #7
              Thanks Justin Niakamal.

              It still doesn't delete any observations. I guess that there is nothing wrong with the code, but it is puzzling to me why.
              But thanks for your suggestions. They have been really helpful and will save me a lot of time.



              • #8
                The code in #6 won't delete any observations unless there are _5, _6, etc. files, because every identifier with four or fewer files satisfies _N <= 4. The code in #2 should do what you originally asked.

                Code:
                 bys filename1: keep if _N == 4
                This tells Stata to keep the observations for a stub (the part of the file name before "_") only if that stub has exactly four files.
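                To see the three conditions discussed in this thread side by side, here is a small sketch on made-up data: one identifier with four files and one with only two. The values and the helper variable n_files are hypothetical and stand in for _N within each filename1 group.

```stata
* Toy data: GB001 has four files, GB002 has only two (hypothetical values)
clear
input str10 filename1
"GB001"
"GB001"
"GB001"
"GB001"
"GB002"
"GB002"
end

* Number of files per identifier, i.e. _N within each by-group
bysort filename1: generate n_files = _N

* "and" version from #3: never true, so keep would drop everything
count if n_files == 4 & n_files == 3 & n_files == 2 & n_files == 1

* "<= 4" version: true for every row here, so keep would drop nothing
count if n_files <= 4

* "== 4" version from #2: true only for the complete identifier
count if n_files == 4
```

                On this toy data the three counts should come out as 0, 6, and 4, which matches what was observed in #3, #5, and #9.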



                • #9
                  Brilliant,

                  Now I've got it, and it works properly, deleting the observations that I wanted to delete.

                  Many thanks for your time Justin Niakamal

