  • HELP: Deleted Rows from Appending Datasets

    Hello

    I have appended several datasets together; the issue is that each dataset had rows labeling its respective variables (see row 3 of the attached image). Rows like these recur throughout my dataset because each individual dataset (pre-appending) had its own titles. Is there an effective way I can remove these rows from my dataset in Stata?
    [Attached screenshot: Screenshots 2.png]

  • #2
    The only really good way to do that is to not have those "rows" in the data sets before you append them. I take it these data are imported into Stata from Excel. Is that right?

    If so, re-import them, and use the -firstrow- option in your -import excel- command so that these names will not show up as if they were data observations but will actually be the variable names (instead of useless variable names like A, B, C, etc.)

    Aside: the second row of your tableau alarms me as much as the first, if not more. It looks like you have imported some bottom rows of the spreadsheet that include county totals. That will mess up your life working with this data in Stata in ways you can barely imagine. It will invalidate nearly every calculation you try to do. Since you have to go back and re-do the importing of your data from Excel to Stata, while you are doing that, also use the -cellrange()- option of -import excel- so that you don't bring in extraneous rows that have summary statistics in them: just bring in the raw data itself (and the first row as variable names, not as data observations).
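
    As a sketch of what that re-import might look like (the file name, sheet name, and cell range below are placeholders for illustration, not taken from your data):

    ```stata
    * Hypothetical example: adjust the file name, sheet, and range to your spreadsheet.
    * -firstrow- turns the first row of the range into variable names;
    * -cellrange()- limits the import to the raw data, excluding summary rows.
    import excel using "mydata.xlsx", sheet("Sheet1") cellrange(A1:H500) firstrow clear
    ```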
    Last edited by Clyde Schechter; 03 Feb 2020, 15:10.



    • #3
      Hi Clyde --

      Thank you for your help. Yes, you are correct, these files are imported from Excel.

      I'm familiar with the feature, but in total there are about 30 files appended together. Would I have to import each file individually and then use the appending feature? Is there a more efficient way of doing it?



      • #4
        Well, if you are 100% certain that all of these spreadsheets are clean and fully compatible with each other you can write a single loop to import and append them together. But I've been doing computer programming in various settings since 1962 and I don't think I've ever seen a set of 30 files that were all clean and fully compatible with each other. So that's why I recommend separately importing the files (in a loop if possible, though it may not be if your files are too different), then going through them individually to identify incompatibilities or other problems that need fixing before then appending them all together.

        Your situation makes it even less likely that all 30 files will be fully compatible. You need to filter out those summary statistics rows: but unless all the files have exactly the same number of rows of data, the location of those summary statistics rows will differ from one file to another, and you will need to write a different -cellrange()- option into the -import excel- command for each of the spreadsheets.

        As far as efficiency goes, importing and appending 30 spreadsheets of typical size will be, at worst, a matter of a few minutes, and taking shortcuts might help you shave seconds off of that. But if there are discrepancies among those 30 files, you will waste hours to days of time chasing that down and trying to code a fix inside a big loop. Worse, you might not even realize you have a problem until you have actually put the appended file to use and somewhere down the line become aware of it by stumbling on some inexplicable wrong results: in that case you might even waste weeks of your time.
        Last edited by Clyde Schechter; 03 Feb 2020, 18:54.



        • #5
          Originally posted by Clyde Schechter View Post
          Thank you for the clarification, Clyde. I agree, the best option for me now is to import each dataset separately. I will explore the guide in detail to import using the -cellrange()- option, but two questions remain on my end: (1) I'm not sure how to import multiple files at once (to later append), since Stata only handles one dataset at a time. Will I need to import each Excel file, save it as a .dta, and then append them all together? (2) Would appending these new .dta files just be the standard appending process?



          • #6
            Will I need to import each Excel file, save it as a .dta, and then append them all together?
            Probably yes. If the files are highly compatible, you can do a loop to import them all into separate files, then inspect the files for discrepancies that need fixing, and then do another loop to append them. But if the files have larger incompatibilities, such as different -cellrange()-s for the actual data, then you're not going to be able to write a loop for the importation: each one will have to be done on its own after you inspect the Excel file to determine the exact parameters of the import.

            Once you have the files imported and cleaned and you have adjusted any discrepancies between them, then the append is standard.



            • #7
              Originally posted by Clyde Schechter View Post
              Ok this makes sense, thanks. Do you have any suggestions on where I can learn about creating a loop to import the files?



              • #8
                So I tried looping using this code:

                use ENR_2016_01.dta, clear

                foreach num of numlist 2/44 {
                2. 2. append using ENR_2018_07`num'
                3. 3. }


                and got this error:

                2. is not a valid command name
                r(199);


                Any ideas on how I can move forward and create the loop?



                • #9
                  The message means exactly what it says. Those lines that begin with 2. 2. and 3. 3. should not have those numbers there. Get rid of them and it will run, assuming that .dta files with names like ENR_2018_072, ENR_2018_073, … ENR_2018_0710, ENR_2018_0711, … ENR_2018_0744 all exist.

                  I think the mistake you made here is that you copied that "code" from the Results window after the correct code had been run. When Stata encounters a loop, it numbers each line within the loop as it echoes the code to the Results window. But those numbers are not part of the code.
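
                  Stripped of those echoed numbers, the loop from #8 reads:

                  ```stata
                  use ENR_2016_01.dta, clear

                  foreach num of numlist 2/44 {
                      append using ENR_2018_07`num'
                  }
                  ```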



                  • #10
                    Originally posted by Clyde Schechter View Post
                    Hi Clyde. Thanks again for your help, but I'm not sure I'm following. I understood the portion on the code errors, but the .dta file names reflect ENR_year_month. Will I have to change this for my loop to function? Perhaps this has been my issue?



                    • #11
                      When you write:
                      Code:
                      foreach num of numlist 2/44 {
                          append using ENR_2018_07`num'  // etc.
                      }

                      You are telling Stata you want it to set num = 2, then 3, then 4, and so on up through 44, and then each time through the loop to append a file whose name is ENR_2018_07x, where x gets replaced by the current value of num. I expect that is not what you want, based on what you said earlier in the thread, hence my comment in #9.

                      What are the names of the files you are trying to append? You have to construct your loop so that what goes through the loop matches those names. If they are years and months, which years, and which months in each year? The loop has to express that information.

                      Or, perhaps the files are all, and only, the .dta files in the current directory whose names begin with ENR_? If so, there's an easy way to do that.
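
                      That easy way, sketched here on the assumption that every ENR_*.dta file in the current directory should be appended (and no others), uses Stata's -dir- extended macro function:

                      ```stata
                      * Sketch: collect all matching .dta file names from the current
                      * directory, then append them in whatever order -dir- returns.
                      clear
                      local files : dir . files "ENR_*.dta"
                      foreach f of local files {
                          append using "`f'"
                      }
                      ```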

                      Last edited by Clyde Schechter; 03 Feb 2020, 21:16.



                      • #12
                        Originally posted by Clyde Schechter View Post
                        The files I am trying to append are named ENR_YEAR_MONTH, so a sample file may be labeled ENR_2015_05 for May of 2015. Does that make sense? How would you suggest doing the looping?



                        • #13
                          You didn't answer the question completely. I asked which years and which months. For illustration, let's assume that the years are all the years between 2010 and 2019 and that all 12 months are represented in each year. I'll also assume that months January through September are represented as zero-padded two-digit numbers. This is pretty much the simplest scenario:

                          Code:
                          forvalues y = 2010/2019 {
                              forvalues m = 1/12 {
                                  local mm: display %02.0f `m'
                                  append using ENR_`y'_`mm'
                                  // etc.
                              }
                          }



                          • #14
                            Originally posted by Clyde Schechter View Post
                            Sorry, the files go from ENR_2014_01 to ENR_2019_12, covering January 2014 through December 2019. I tried the code with the respective numbers and got the following error:

                            . forvalues y = 2014/2019 {
                            2. forvalues m = 1/12 {
                            3. local mm: display %02.0f `m'
                            4. append using ENR_`y'_`mm'
                            5. // etc.
                            6. }
                            7. }


                            file ENR_2014_01 not found
                            r(601);

                            This is puzzling me because I do have the ENR_2014_01 file... my cd is set to the correct folder, so I don't know what could be going on. I also had another question: is there a way to incorporate into this loop code that I want a specific sheet imported (the sheet name is County; it is the third sheet in each file)?

                            Thank you for your help on this



                            • #15
                              OK, there's a fundamental misunderstanding here. You cannot append spreadsheets. -append- (and -use-, -merge-) works only with Stata .dta files. The error message is telling you that it cannot find a file named ENR_2014_01.dta. But from the rest of your post, I'm gleaning that you want to read in and append an Excel file named ENR_2014_01.xlsx (or perhaps .xls).

                              Now, between this and other threads I feel like we're going around in circles.

                              If what you want to do is read in a bunch of Excel files and append them all together, the simplest way to do that is to use Daniel Klein's -xls2dta- command, available from SSC. If you are 100% sure that all of those spreadsheets' data is clean and that everything is consistent across all 72 of them (same variable names, agreement on what is string and what is numeric) then go ahead and use that. The help file that comes with it gives you instructions.

                              But I recommend strongly against doing that because I think it is extremely unlikely that all 72 files will be compatible with each other. I would say this even if you were starting with 72 Stata data files, but Excel files are even looser: they impose no real structure on the data, so it is very easy for strings like "N/A" or "--" to slip into columns that should be holding numbers in one file but not in another. It is also extremely easy for the variable name to be ID in one file and id in another: to Stata these are different names. If there are these kinds of inconsistencies among the files, one of two things will happen. Either -xls2dta- will halt and give you an error message when it encounters a string vs. numeric clash on the same variable, or the resulting large appended file will be a mess, as what should be the same variable is spread out among several different variables whose names are almost, but not exactly, the same. In general, those variables won't even be located near each other in the file, so you will find out about this primarily when you see a lot of missing values where you do not expect the data to be missing!

                              My advice is to first import each Excel file into Stata separately, without appending them:

                              Code:
                              forvalues y = 2014/2019 {
                                  forvalues m = 1/12 {
                                      local mm: display %02.0f `m'
                                      import excel using ENR_`y'_`mm'.xlsx, firstrow case(lower) sheet(County) clear
                                      save ENR_`y'_`mm', replace
                                  }
                              }
                              This loop will give you 72 .dta files with the same names (except for extension) as the original spreadsheets.

                              Next, and I know you don't want to hear this, go through all 72 of the files to verify that they are all compatible. It is likely you will find some variations that need to be dealt with. Take notes as you check each file, and when you've identified all the problems, write a new Stata program to read in the offending .dta files and fix the problems in them. Then once you have cleaned up all the .dta files, you can do a mass append:

                              Code:
                              clear
                              forvalues y = 2014/2019 {
                                  forvalues m = 1/12 {
                                      local mm: display %02.0f `m'
                                      append using ENR_`y'_`mm'
                                  }
                              }
                              save ENR_all_months, replace
                              This loop reads in all the (fixed) .dta files and appends them together, saving them in a combined file which I have called ENR_all_months.dta.
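
                              The compatibility check described above can be sketched as follows (this is an illustration, not part of the original workflow): loop over the 72 files and -describe- each one, so variable-name mismatches and string-vs-numeric clashes show up in the Results window for note-taking.

                              ```stata
                              * Sketch: print each file's variable names and storage types
                              * so incompatibilities can be spotted before the mass append.
                              forvalues y = 2014/2019 {
                                  forvalues m = 1/12 {
                                      local mm : display %02.0f `m'
                                      display as text _newline "=== ENR_`y'_`mm' ==="
                                      use ENR_`y'_`mm', clear
                                      describe
                                  }
                              }
                              ```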
