Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing data from excel

    Hello all,

    I need help importing data from excel into stata. The problem is that the data are not in a convenient format to do so currently. I have variable names across multiple rows which are not the first row, the data is from 1898-1973 and each year is in a separate sheet, and each year of data has a different number of variables and variable names. The attached pictures may help clarify. Is there any way to automate this?

  • #2
    You can loop the -import excel- command over multiple sheets using a -forvalues- loop (changing filename and the starting cell as necessary):

    Code:
    forvalues y = 1898/1973{
        import excel using filename, cellrange(A1) sheet("`y'") firstrow clear
        tempfile ds`y'
        save `ds`y''
    }
    You can then append your datasets using another -forvalues- loop:

    Code:
    use `ds1898' , clear
    forvalues y = 1899/1973 {
        append using `ds`y''
    }
    To deal with the column headers / variable names, it may be preferable to read in all rows of your Excel file as values using the allstring option and removing the firstrow option. You can then rename your variables automatically by adapting the following code (which I saved from a post elsewhere on Statalist or StackOverflow; credit is due to... someone else, whose name I am forgetting -- not me, at any rate):

    Code:
    forvalues y = 1898/1973{
        import excel using filename, cellrange(A1) sheet("`y'") allstring clear
    
        // Rename all vars with numbers & add Excel column headers to labels/chars
        local i = 1
        foreach x of varlist * {
            rename `x' v`i'
            label var v`i' `"`=v`i'[1]'"'
            local newname = strtoname("`=v`i'[1]'")
            rename v`i' `newname'
            local `++i'
        }
        drop in 1
    
        tempfile ds`y'
        save `ds`y''
    }

    Comment


    • #3
      Up to the appending this is automated in xls2dta (SSC). Details of what to do best depend on spelling out details about the data at hand. Best would be an example dataset.

      Best
      Daniel

      Comment


      • #4
        Firstly, thank you for your response. I tried to use this approach but the append command is giving me a very odd looking dataset. I'm not sure how best to show or explain what it looks like. Each year has a different number of variables, and thus I think I need to use the merge command. The variable to match would need to be "station", as it is contained in each "year" . Also, in the third part for renaming the variables I would need a way to apply the labels in the 2nd row to each of the variables below it in the 3rd row. Ex: "revenue" needs to apply to hatch fund, adams fund, state...etc and "equipment spending" needs to apply to buildings, library, machinery...etc.


        I tried to upload the excel file so it will be easier to see, but I am receiving an error "invalid file"
        Last edited by Steven Pawlowski; 02 Sep 2015, 13:50.

        Comment


        • #5
          I do not think I can be of much help without an example dataset I can try myself. My suggestion is this:

          Start with one sheet (year). Import it into Stata and figure out the commands you want to apply to this dataset so it looks like you want it to. Then think about whether code could be generalized to the other sheets. If so, you can start building the loop - or use xls2dta to do this for you.

          You may even be able to create an small example on basis of the first imported sheet, using dataex (SSC). If you manage to do so, I might find the time to play around with it a bit.

          Best
          Daniel

          Comment


          • #6
            Hi Daniel,

            Here is a dropbox link to the data from which you can download it.

            https://www.dropbox.com/s/rtwgpr4gfc...ials.xlsx?dl=0

            Comment


            • #7
              That is one place to start. However, you need a much more detailed picture of how the final dataset should look like. For example there are a lot of row and cell "Totals" in these sheets (well, that is the typical Excel approach) and I am almost sure you do not want that in Stata. So what do you think your data should look like? What are the observations? What are the variables? How should we organize the hole thing?

              To b honest, from what I see automating the process might be pretty difficult. But, again, that very much depends on what exactly you want and why exactly you want it like this.

              Best
              Daniel
              Last edited by daniel klein; 02 Sep 2015, 15:16.

              Comment


              • #8
                I can ignore the "totals" as they can be easily calculated in stata. The observations are each station. The variables of interest are all the heads of the columns. Ex: "equipment spending farm", "equipment spending livestock", etc. As far as organization I would like to have all the data from each year merged and appended into one dataset. Also, I agree; trying to automate this has been a huge pain, but I appreciate your help!

                Steven

                Comment


                • #9
                  I have set up a basic loop to create variable names by combining the first two rows of your sheets (well, I tired with 1921 only).

                  Code:
                  unab vars : *
                  
                  loc J : word count `vars'
                  forv j = 2/`J' {
                      loc cvar : word `j' of `vars'
                      if (`cvar'[1] == "") {
                          loc cvar = `pvar'[1] + " " + `cvar'[2]
                      }
                      else {
                          loc pvar `cvar'
                      }
                      
                      loc name = strtoname(`"`cvar'"')
                      loc newnames `newnames' `name'
                  }
                  
                  loc newnames Stattion `newnames'
                  
                  ren (*)(`newnames')
                  However, this will lead to names that exceed the 32 character limit by far, e.g. "Spending_Appropriated_by_Adams_Act_Seeds_Plan ts_S undry" will be an invalid name and thus be cut at "Spending_Appropriated_by_A". The problem then is that you cannot automate the rename-ing process, because Stata complains that "Spending_Appropriated_by_A" would be assigned to 17 old variables (in this example).

                  You need to figute out a way of dealing with this issue.

                  [Edit]
                  There may be the same "names" in a couple of sheets and not all of them might cause trouble. You could try to come up with a list of problem-names and new name suggestions. Code might look like this

                  Code:
                  if (`cvar'[1] == "Spending Appropriated by Adams Act") {
                          loc cvar "Adams_Act"
                  }
                  If you have typos or so in the sheets, you are lost here, I guess.

                  Similar situation if there are typos in the Station names or if Stations that are supposed to be same have (even slightly) different names. Such things will completely mess up the merging later.
                  [/Edit]

                  If you find a solution for the above, you can go on and get rid of the first two rows, the empty Station names at the end of the file and column totals

                  Code:
                  drop in 1/2
                  drop if inlist(Station, "", "Total")
                  At this point, ignoring the string nature of the dataset, you can try whether the above works for all the sheets. You can then work on the loop to get the hole thing running.

                  Best
                  Daniel
                  Last edited by daniel klein; 03 Sep 2015, 00:55.

                  Comment


                  • #10
                    Again, thank you for the help! I think this strategy will work, and I can adjust the code by just adding more "if" statements. In the code you posted, what is `pvar' doing? Here is the code I have:


                    unab vars : *

                    local J : word count `vars'
                    forvalues j = 2/`J' {
                    local cvar : word `j' of `vars'
                    if (`cvar'[1] == "") {
                    local cvar = `pvar'[1] + " " + `cvar'[2]
                    }
                    if (`cvar'[1] == "Spending Appropriated by Adams Act") {
                    local cvar "adams_act"
                    }
                    if (`cvar'[1] == "Spending Appropriated by Hatch Act") {
                    local cvar "hatch_act"
                    }
                    else {
                    local pvar `cvar'
                    }

                    local name = strtoname(`"`cvar'"')
                    local newnames `newnames' `name'
                    }

                    local newnames Station `newnames'

                    ren(*)(`newnames')


                    when running this code on year 1921 I get an error "RevenueAdamsFunding" not found

                    Comment


                    • #11
                      Update: I've got the code to run for an individual year! Now all I have to do is loop over all the sheets! Thanks Daniel!


                      -Steven

                      Comment

                      Working...
                      X