Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Append multiple excel files and capture specific cells F3, F4, T3, T4 as variable

    Hi Stata people,

    My several excel files are appended into a single DTA file from various locations. Right now, I'm having trouble figuring out how to add new variables to cells from F3, F4, T3, T4.
    Below is the working command that I used, excluding the commands to generate the following division = T[3, the schoolyear = T[4], the schoolid = F[3], and the school = F[4].


    clear
    set more off
    global route "C:\Users\bernie.seville\OneDrive - Palladium International, LLC\Stata command loop test"
    cd "$route"
    local allfiles : dir "." files "*.xls"
    tempfile schroom

    clear
    save `schroom', emptyok
    foreach f of local allfiles {
    import excel using "`f'", sheet("school_form_1_ver2014.2.1.1") cellrange(A2) firstrow case(lower) clear
    /*
    local schoolid= F[3] //not working
    gen schoolid=`hname //not working
    local school= F[4] //not working
    gen school= `school' //not working

    local division= T[3] //not working
    gen division= `division' //not working

    local schoolyear= T[4] //not working
    gen schoolyear= `division' //not working
    */
    gen source = "`f'"
    append using `schroom'
    save `schroom', replace
    }


    Thanks in advance for the help.







    CAPI Specialist
    Data Manager

  • #2
    Without seeing any example data, it is not possible to give a definitive answer to your question.

    What I can say is that having specified the -firstrow- option in your -import excel- command, it is unlikely that you will have variables named F and T in your Stata data set. With the -firstrow- option specified, Stata will assign the contents of the cell in the first row of a column as the name of the variable corresponding to that column. It will preserve the original alphabetic names only if the contents of that cell in the first row cannot be converted to a legal variable name. So there is a good chance that your imported data set has no F or T variable. Try removing -firstrow- and see if that resolves the problem. (Of course, that may introduce other problems if your code has elsewhere relied on the usual variable renaming that -firstrow- accomplishes.)

    Something else worth pointing out is that it seems like the data in these spreadsheets is not organized in a way that is conducive to importation as a well-organized Stata data set. If column F contains a school id in one cell and a school name in another, as well as a division in one cell of a column and a school year in another, then the columns of these spreadsheets do not correspond to coherent Stata variables. So the resulting imports will probably require extensive data management to wrestle them into a usable layout.

    Comment


    • #3
      Hi Clyde,

      Thank you for your prompt reply. My bad for not attaching any data.

      I attached 3 excel files, 2 standard forms with dummy data and 1 my desired output in DTA which I aimed to achieved. As mentioned earlier, may goal #1 append these several excel files into one dataset and #2, the division name, schoolID, schoolname, barangay and etc. be included as new variables in the dataset.

      Thanks again,
      Attached Files
      CAPI Specialist
      Data Manager

      Comment


      • #4
        Sorry, but like many others, I do not download files from people I don't know. I appreciate the effort to show example data. The most effective way to do that is to import each of these files into a Stata data set. I should have been clearer about that in #2. First, use -import excel- to do that, and don't use the -firstrow- option for any of them.

        Then you can show the contents of each of the imported .dta files by using the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        It is likely that there are more variables in these files than -dataex- can display. So I would say when using it pick the most important columns: those would be any variables that would jointly identify unique observations in the data set, and also columns T and F.

        Comment


        • #5
          Thanks Clyde,

          Again, I appreciate your advice and recommendations. I made an attempt to follow, but I failed. Here's my tables below.

          Table 1 is the form use in the schools with same format and the table 2 is the dataset i aimed to achieved.

          Table 1: This is the form used in the school.
          School ID 893060 Division
          School Name Holland ES School year
          LRN NAME Sex BIRTH DATE
          (mm/dd/yyyy)
          AGE as of 1st Friday June MOTHER TONGUE (Grade 1 to 3 Only) Ethnic Group
          1234561 xxx yyy zzz M 09-10-2017 5 Sama, Central Badjao
          1234572 qqq M 06-20-2017 6 Sama, Central Badjao
          1234583 www M 10-20-2013 9 Sama, Central Badjao
          1234593 rrr F 11-20-2015 7 Sama, Central Badjao
          Table 2: Desired stata format
          School ID School Name Sex LRN NAME BIRTH DATE
          (mm/dd/yyyy)
          AGE MOTHER TONGUE (Grade 1 to 3 Only) Ethnic Group
          893060 Holland ES
          893060 Holland ES M 1234561 xxx yyy zzz 09-10-2017 5 Sama, Central Badjao
          893060 Holland ES M 1234572 qqq 06-20-2017 6 Sama, Central Badjao
          893060 Holland ES M 1234583 www 10-20-2013 9 Sama, Central Badjao
          893060 Holland ES F 1234593 rrr 11-20-2015 7 Sama, Central Badjao
          Here's my code to append.

          clear
          set more off
          global route "C:\Users\bernie.seville\OneDrive - Palladium International, LLC\Stata command loop test"
          cd "$route"
          local allfiles : dir "." files "*.xls"
          tempfile schroom

          clear
          save `schroom', emptyok
          foreach f of local allfiles {
          import excel using "`f'", sheet("school_form_1_ver2014.2.1.1") cellrange(A2) firstrow case(lower) clear
          gen source = "`f'"
          append using `schroom'
          save `schroom', replace
          }

          Thanks,
          Bernie
          CAPI Specialist
          Data Manager

          Comment


          • #6
            OK. My first reaction is, don't try to create what you show in Table 2: it is not a good data organization for use with Stata. It perpetuates the spreadsheet concept of mix-and-matching heterogeneous information. What I'm specifically advising is not to have in Table 2 the observation that contains non-missing values only for the school ID and school name. Have observations only for students (but keep the school ID and school name variables).

            As for how to do this, it will depend on an aspect of the spreadsheet organization that I cannot discern from what is shown. Does each school have a separate worksheet in a big Excel file? If so, how are those worksheets named? Or does each school have its own Excel file? Or are multiple schools included on the same worksheet, stacked vertically? From the code you show, I suppose each school has its own Excel file, but I need to be certain.

            Comment


            • #7
              Hello Clyde,

              Yes, each school maintains an Excel file, and all forms follow the same structure. If the school has three grade levels, there will be three different excel files since one excel file has one set of data only. The excel file bears the name of the barangay in which the school is situated. FYI, the smallest geographic political division in our nation is called a "barangay."

              Thanks again Clyde.
              CAPI Specialist
              Data Manager

              Comment


              • #8
                So here is the gist of how to do this:
                Code:
                cd "C:\Users\bernie.seville\OneDrive - Palladium International, LLC\Stata command loop test"
                
                local allfiles : dir "." files "*.xls"
                tempfile schroom
                
                clear
                save `schroom', emptyok
                
                //    LOOP OVER FILES, READING THEM ONCE FOR SCHOOL-LEVEL INFORMATION
                //    AND A SECOND TIME FOR ALL STUDENT DATA
                
                foreach f of local allfiles {
                    import excel "`f'", cellrange(A3:AM4) clear
                    local school_id = F[1]
                    local school_name = F[2]
                    local division = T[1]
                    local school_year = T[2]
                    local grade_level = AE[2]
                    local district = AM[1]
                    local section = AM[2]
                    
                    import excel "`f'",  cellrange(A5) firstrow case(lower)
                    foreach x in school_id school_name division school_year grade_level ///
                        district section {
                            gen `x' = `"`x'"'
                        }
                    gen source = `"`f'"'
                    append using `schroom'
                    save `schroom', replace
                }
                Note that the informative "header" rows of the spreadsheet containing school attributes are read in first, and their values stored away in local macros. Then the rows containing the student information are read in separately, and the previously stored school variables are transferred in as new variables.

                Now, if you are very lucky, your trouble will be over. But, the probability of that is low. Because the first several rows of the spreadsheet contain exclusively text information, the student variables, some of which are numeric and should be treated accordingly, may be read in as strings instead, and will require -destring-ing later. The date variables probably also will not come in as dates but as strings and you will have to convert them.

                Even more to the point, in almost any large cache of spreadsheets, even the best curated ones, there is a likelihood that some of the variables will be incompatible across files. This arises from things like typographical errors in the variable name fields, or typos in the data values leading what is a nunmeric variable in one file to be string in another, etc. Stata will complain about all of this, and you will have to fix up all these incompatibilities before the loop of -append- commands will run to the finish. This is a generic problem we commonly encounter whenever we try to append large numbers of files. It is, I speculate, even more likely to happen when the file structure is as complex and idiosyncratic as the one you are dealing with.

                So something to consider is, rather than appending the files, consider saving each one as a separate .dta file. Then, install Mark Chatfield's -precombine- program from SSC, and follow the instructions in the help file to apply it to all of these new .dta files. It will alert you to all of the incompatibilities, so that you can fix them in these files. Then you can write another loop to append all these fixed-up files. (Actually, for that, you don't even need a loop. See -help append- for the -append using ...- syntax which will append a whole list of files together in one fell swoop.)

                It is, I think, far less tedious and frustrating to get a heads-up on all of the problems lurking in the files, rather than having Stata interrupt execution after each problem arises, forcing you to fix that one and then go back to the beginning. Also, by seeing the full list of incompatibilities up front, you won't make the mistake of fixing one of them in a way that makes it harder to fix another one that will pop up later!

                Comment


                • #9
                  Hi Clyde,

                  Perfect! I am so grateful for your assistance; it resolved my problem. Except to add "clear" in the second import (
                  import excel "`f'", cellrange(A5) firstrow case(lower) ).


                  Just one last question, what if appending 5 folders with 3 subfolders each containing forms and structure? Else, I can do this to each of the folder and append using your command.

                  Thanks so much.
                  CAPI Specialist
                  Data Manager

                  Comment


                  • #10
                    Well, if the names of the subfolders are systematically patterned and related to the names of the superordinate folder, then you can take the code for one folder and wrap it in nested loops.

                    Code:
                    local return_to `c(pwd)' // SAVE THE CURRENT WORKING DIRECTORY
                    
                    clear
                    tempfile schroom
                    save `schroom', emptyok
                    
                    local folders alpha beta gamma delta epislon
                    
                    foreach F of local folders {
                        local subfolders: dir `"`F"' dirs "pattern_for_subfolder_names"
                        foreach s of local subfolders {
                            cd `"`F'/`s'"'
                            // EMBED CODE TO PROCESS SINGLE FOLDER HERE
                            cd ..
                        }
                        cd `"`return_to'"'
                    }
                    Replace pattern_for_subfolder_names by something that expands to the names of the subfolders. What that is, evidently, depends on how those subfolders are named. (Or, if the relevant subfolders are the only subfolders in the folder, then * will do it.

                    Comment


                    • #11
                      Hi Clyde,

                      I am grateful for your help; I achieved the expected result. I'll try to figure out this second code as well, since it seems more advanced to me. Nevertheless, once more, thank you for your support.
                      CAPI Specialist
                      Data Manager

                      Comment


                      • #12
                        Hi Clyde,

                        I made multiple attempts to follow your code, but I was unsuccessful. It reads "directory Basilan not found"

                        Here's my code:

                        local return_to `c(pwd)' // SAVE THE CURRENT WORKING DIRECTORY

                        //this also the file folders directory pattern
                        cd "directory: C:\Users\School Forms
                        \2023-2024 SF1
                        \Basilan
                        \Basilan G1
                        brgy_name.xls
                        brgy_name.xls
                        brgy_name.xls
                        etc..."

                        clear
                        tempfile schroom
                        save `schroom', emptyok

                        local folders "Basilan G1" "Basilan G2" "Basilan G3" "Basilan Kinder"

                        foreach F of local folders {
                        local subfolders: dir `"`F'"' dirs "*.xls"
                        foreach s of local subfolders {
                        cd `"`F'/`s'"'
                        // EMBED CODE TO PROCESS SINGLE FOLDER HERE
                        foreach f of local allfiles {
                        import excel "`f'", cellrange(A3:AM4) clear
                        local school_id = F[1]
                        import excel "`f'", cellrange(A5) firstrow case(lower) clear
                        gen schoolid = `school_id'
                        gen source = subinstr(`"`f'"',".xls","",.)
                        append using `schroom'
                        save `schroom', replace

                        }
                        cd `"`return_to'"'
                        }

                        Thank you in advance.
                        CAPI Specialist
                        Data Manager

                        Comment


                        • #13
                          The problem is with -local folders "Basilan G1" "Basilan G2" "Basilan G3" "Basilan Kinder"-.

                          The way Stata handles quotes in local macros is confusing. You are correct in noting that you have to bind these directory names in quotes because they contain embedded blanks. But the -local- command is subverting you, because Stata automatically removes the opening and closing quote from a macro definition. Watch:
                          Code:
                          . local folders "Basilan G1" "Basilan G2" "Basilan G3" "Basilan Kinder"
                          
                          .
                          . macro list _folders
                          _folders:       Basilan G1" "Basilan G2" "Basilan G3" "Basilan Kinder
                          
                          .
                          . foreach F of local folders {
                            2.         display "`F'"
                            3. }
                          Basilan
                          G1
                          Basilan G2
                          Basilan G3
                          Basilan Kinder
                          You can see in this output that the opening quote of Basilan G1 is missing, as is the closing quote of Basilan Kinder. With that opening quote gone, Stata thinks the first "element" of local macro folder is Basilan, not Basilan G1, and it thinks G1 is a separate element--and since there is no directory named Basilan, your code breaks.

                          The solution to this problem is to put a sacrificial extra pair of quotes around the entire macro when you define it. Of course, since there are quotes internal to the macro itself, you have to use Stata's compound double quotes for this purpose. Watch:
                          Code:
                          . local folders `""Basilan G1" "Basilan G2" "Basilan G3" "Basilan Kinder""'
                          
                          .
                          . macro list _folders
                          _folders:       "Basilan G1" "Basilan G2" "Basilan G3" "Basilan Kinder"
                          
                          .
                          . foreach F of local folders {
                            2.         display "`F'"
                            3. }
                          Basilan G1
                          Basilan G2
                          Basilan G3
                          Basilan Kinder
                          
                          .
                          Notice that the `" "' that were added are gone. But they protected the needed opening quote around "Basilan G1" from being omitted.
                          Last edited by Clyde Schechter; 20 Apr 2024, 10:07.

                          Comment


                          • #14
                            Thank you Clyde,

                            All codes are now error-free. But none of my Excel files have been appended or found.
                            I also changed the directory file to see if there was a difference in the path, but the outcome remained the same.

                            Below is my code:

                            local return_to `c(pwd)' // SAVE THE CURRENT WORKING DIRECTORY
                            cd "C:\Users\2023-2024 SF1"
                            clear
                            tempfile schroom
                            save `schroom', emptyok
                            local folders `" "Tawi-Tawi" "Basilan" "Sulu" ""'
                            foreach F of local folders {
                            local subfolders: dir `"`F'"' dirs "*.xls"
                            foreach s of local subfolders {
                            cd `""`F'/`s'""'
                            // EMBED CODE TO PROCESS SINGLE FOLDER HERE
                            foreach f of local allfiles {
                            import excel "`f'", cellrange(A3:AM4) clear
                            local school_id = F[1]
                            import excel "`f'", cellrange(A5) firstrow case(lower) clear
                            gen schoolid = `school_id'
                            gen source = subinstr(`"`f'"',".xls","",.)
                            append using "`schroom'"
                            save "`schroom'", replace
                            }
                            cd `"`return_to'"'
                            }
                            }
                            Thanks you always for the help.
                            CAPI Specialist
                            Data Manager

                            Comment


                            • #15
                              You need to define local macro allfiles before you use it.

                              [code]
                              local return_to `c(pwd)' // SAVE THE CURRENT WORKING DIRECTORY
                              cd "C:\Users\2023-2024 SF1"
                              clear
                              tempfile schroom
                              save `schroom', emptyok
                              local folders `" "Tawi-Tawi" "Basilan" "Sulu" ""'
                              foreach F of local folders {
                              local subfolders: dir `"`F'"' dirs "*.xls"
                              foreach s of local subfolders {
                              cd `""`F'/`s'""'
                              // EMBED CODE TO PROCESS SINGLE FOLDER HERE
                              local allfiles: dir "." files "*.xls" // OR WHATEVER IS THE FILE NAME PATTERN
                              foreach f of local allfiles {
                              import excel "`f'", cellrange(A3:AM4) clear
                              local school_id = F[1]
                              import excel "`f'", cellrange(A5) firstrow case(lower) clear
                              gen schoolid = `school_id'
                              gen source = subinstr(`"`f'"',".xls","",.)
                              append using "`schroom'"
                              save "`schroom'", replace
                              }
                              cd `"`return_to'"'
                              }
                              }

                              Comment

                              Working...
                              X