
  • Append multiple worksheets into a single dataset

    Hi Everyone!

    I'm kinda new to Stata and I'm trying to import and append different Excel worksheets into a single .dta file. Reading the forum, I managed to create the different .dta files, one for each worksheet. The problem is that in column C of each worksheet (the ID of a company) the values are sometimes purely numeric (for instance 2131231) and sometimes a mix of numbers and letters (34728QM), depending on the ID of the company. To append all the .dta files I was trying something like:

    Code:
    use filename1, clear
    foreach num of numlist 2/1126 {
        append using filename`num'
    }
    But the problem is that some of those values in column C (like 34728QM above) need to be encoded. I don't know how to do this in a loop, and each time I get something like "variable C is long in master but str6 in using data".

    I hope you guys can help me =)

  • #2
    OK, the worst possible solution to this dilemma is to -encode- the company names. That will get you an unbelievable jumble that will take you an eternity to straighten out.

    In some of your files, company ID is a string variable and in others it's a long integer. So what you need is to use the data type that is able to hold both of these, namely string. You need to convert your company ID to a string in those files where it is currently a number. So my approach would be something like this:

    Code:
    clear
    tempfile building
    save `building', emptyok
    
    forvalues i = 1/1126 {
        use filename`i', clear
    
        // DETERMINE WHETHER company_id IS NUMERIC
        // AND IF SO MAKE IT A STRING
        capture confirm numeric variable company_id
        if c(rc) == 0 {
            tostring company_id, replace usedisplayformat
        }
    
        gen int source = `i' // IDENTIFY WHICH FILE OBSERVATION COMES FROM
        append using `building'
        save `"`building'"', replace
    }
    
    use `building', clear
    // FURTHER DATA CLEANING OPERATIONS, ETC.
    // ULTIMATELY SAVE AS A PERMANENT FILE
    NOTE: Not tested, but this is the logic. Beware of typos, unbalanced parentheses, etc.

    Added: Once you have put the files together in this way, if there is some good reason to -encode- the company_id, you can safely do that at this point. But separately -encode-ing them in each source file before putting them together would create one of the biggest messes you've ever seen.
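
    A minimal sketch of what that later step might look like, assuming the combined file is in memory and company_id is still a string at that point (the new variable name is just for illustration):

    Code:
    encode company_id, generate(company_code)   // numeric codes, with the original IDs kept as value labels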
    Last edited by Clyde Schechter; 04 Oct 2016, 09:56.



    • #3
      Something like this should do.

      Code:
      use filename1.dta ,clear
      tostring C ,replace
      save final.dta
      
      forvalues j = 2/1126 {
          use filename`j'.dta , clear
          tostring C , replace
          append using final.dta
          save final.dta , replace
      }
      Make sure you have a reasonable idea of why the sheets differ, and that putting them together like this makes sense as well.

      Best
      Daniel

      Edit:

      Crossed with Clyde's reply. Note that you do not need to test whether the variable is numeric; tostring will not produce an error if the variable is already a string. But do include the usedisplayformat option that Clyde used in his code.
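
      For instance, each tostring line in the loop above would then read something like this (not tested):

      Code:
      tostring C , replace usedisplayformat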
      Last edited by daniel klein; 04 Oct 2016, 10:08.



      • #4
        Thank you both for the rapid answers! However, at some point the loop throws an error. After running through several iterations, at around j = 349, this appears:
        "B contains nonnumeric characters; no replace
        C was byte now str1
        variable B is str4 in master but double in using data
        You could specify append's force option to ignore this string/numeric mismatch. The using variable
        would then be treated as if it contained "".
        r(106);"

        How do you think I should deal with this? The thing is that there can sometimes be errors in column B because stock prices are missing, but I don't know how many of these there are!

        Thank you in advance for your kindness =)



        • #5
          Well, this is not an uncommon problem when starting with Excel spreadsheets. Unlike Stata and other statistical packages, Excel is designed primarily for human eyes, and it allows the indiscriminate mixing of numeric and string data in the same columns. This is cumbersome when importing to Stata (or other statistical packages). It is especially cumbersome when there are a large number of such files being put together and they are not all consistent with each other.

          You don't show us exactly what you're trying to do with variable B that is causing Stata to balk. "B contains nonnumeric characters; no replace" looks a lot like the kind of error message you might get from -destring, replace- when the variable involved contains something that is not a string representation of a number, e.g. something like "N/A" or "*" or the like.
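
          If that is indeed where the message comes from, a rough sketch of how you might inspect the offending entries of B in a single file, assuming B is still a string variable at that point:

          Code:
          list B if missing(real(B)) & !missing(B)   // entries of B that cannot be interpreted as numbers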

          In my experience, it is not a good idea to try to do a mass -append- or -merge- of uncleaned data files. Even my suggestion in #2 is something that I, personally, wouldn't do in my practice, but if the data type variation in C were the only problem you were having, it wouldn't be unreasonable to approach it that way. In general, though, it is better to clean the individual files first and then do a quick and simple run to combine them.

          So if I were you, I would take a good hard look at one of the files and write down, for each variable, what it should look like in the final combined file: should it be numeric or string? If numeric, is there some limit on the range of acceptable values? If it's supposed to be numeric, what non-numeric values appear that will have to be either fixed or dealt with in some other way? If it's a string variable, should it take on only a certain small set of values? If so, are there any violations of that? Are there relationships among the variables that require some consistency, such as F must be less than G, or all rows having the same value of W must also have the same value of X, etc.?

          My first pass would be a do-file that loops through the files, checks all of these things, and creates a "punch list" of problems to be resolved. Then you can write another do-file that specifically works with the files that have problems and fixes them. Then you can finally append all of the cleaned-up files together to get a working file.

          Notes:
          You can identify which of the variables that are supposed to be numeric aren't, and see the offending material, with something like:

          Code:
          foreach v of varlist list_of_all_putatively_numeric_variables {
              capture confirm numeric variable `v'       // rc != 0 means `v' is not numeric
              if c(rc) != 0 {
                  display "`v' should be numeric but isn't"
                  list `v' if missing(real(`v'))         // show entries that cannot be read as numbers
                  charlist `v' if missing(real(`v'))     // show the distinct characters in those entries
                  return list
              }
          }
          (-charlist- was written by Nick Cox and you can get it from SSC.)

          I'm not saying you can't do it all in a single do-file. Some people do it that way. But the resulting file is usually very long and very messy, and it's just too easy to make mistakes along the way.

          One other word of advice: strongly resist the temptation to use -force- options (on commands like -destring-, or -append- itself). The -force- option sweeps problems under the rug: it doesn't solve them. They will eventually come back to bite you because your data set will likely be incorrect and any results you calculate from it will be wrong. At some point down the road the errors will probably become obvious, often at a very inconvenient time. So if there are observations that Stata is balking at, fix those observations: don't tell Stata to ignore them. -force- options should be used only if you are quite certain that all possible values that are causing problems should be dealt with by deleting them. Sometimes that is the case, but it is the exception. Usually some of these observations are not candidates for deletion, as when somebody puts 1/2 instead of 0.5 in an Excel file. So only use -force- if you are certain that there is no valid data lurking there.
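
          As a purely hypothetical illustration of fixing rather than forcing, suppose a few cells hold "1/2" where 0.5 was meant (the variable and values here are made up):

          Code:
          replace B = "0.5" if B == "1/2"   // correct the known bad entries explicitly
          destring B, replace               // now converts cleanly, without -force-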


