
  • #16
    Clyde, I wouldn't say that the example you point out breaks my code. The computed means will still be correct, because that observation will contain missing values in F, G, and I. One way or another, Roman needs rules to automatically identify the data observations. I could write a more sophisticated rule, but without eyes on the files it would only be speculation. Besides, I don't think that Roman has tried my code yet.
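
    For instance, a purely hypothetical stricter rule (the range below is only a guess, not something checked against the files) might require column A to hold a plausible integer bin number rather than merely being non-missing:

    Code:
    * hypothetical stricter filter, replacing -drop if mi(A)-; the 1-1440 upper
    * bound (minutes in a day) is only a guess
    drop if mi(A) | A != int(A) | !inrange(A, 1, 1440)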



    • #17
      Dear Clyde and Robert,

      I have no words to express my gratitude!! Clyde, many thanks for your sympathy, and I feel guilty for making your eyes blurry; mine were the same last night and I gave up. I have yet to apply your code to the *.DAT files.

      Robert, I don't know what to say, but many, many thanks. You two have literally just saved me from opening more than 7,000 files individually!!!

      I finally managed to run Robert's code!!! I just made a small change to the search strings passed to strpos(), which were messing things up. The good thing is that the extraction keeps the file name in the dataset, from which I can extract the ids of the individuals. Here is the code:

      Code:
      filelist, dir(" /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/") ///
      pattern("*.xls") save("xls_files.dta") replace
      local nfiles = r(N)
      
      * initialize a dataset to hold the results
      
      clear
      save "results.dta", replace emptyok
      
      * loop over each Excel file
      forvalues i=1/`nfiles' {
      
          use "xls_files.dta" in `i', clear
          local f = dirname + "/" + filename
          
          import excel using "`f'", clear
          
          // consistency checks
          count if strpos(A, "---") // This is where I made the change
          assert r(N) == 1
          count if strpos(F, "------")
          assert r(N) == 1
          count if strpos(G, "------")
          assert r(N) == 1
          count if strpos(I, "--------")
          assert r(N) == 1
          
          destring *, replace force
          
          // drop observations without a number in column A
          drop if mi(A)
          collapse F G I (count) nobs = A
          gen filename = "`f'"
          append using "results.dta"
          save "results.dta", replace
          
      }
      
      * Generate ID from the filenames:
      
      gen character=length(filename)
      
      gen id=substr(filename,61,3)
      
      list in 1/10, clean
      
                     F           G           I   nobs                                                                filename   charac~r    id  
        1.       180.6       141.8      1239.5     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/307RD.xls         69   307  
        2.       150.9         158      1097.9     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/306LN.xls         69   306  
        3.         235       176.9      1600.3     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/305JH.xls         69   305  
        4.       153.3       149.8      1097.9     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/304LT.xls         69   304  
        5.       137.4       126.4       972.3     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/303TN.xls         69   303  
        6.   154.46667         147   1100.0667     15   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/302JE.xls         69   302  
        7.   174.57143   163.57143        1239     16   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/301LD.xls         69   301  
        8.   143.23077   151.07692        1043     13   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/300ZJ.xls         69   300  
        9.       154.8       143.9      1097.2     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/299WW.xls         69   299  
       10.       152.1       154.6      1098.9     10   /Users/mm487/WORK/EXETER/EB/GEM_MOS/GEM_COMPILED/gem_00_mth/298MC.xls         69   298
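
      (A hedged aside, not from the original post: because the id is taken from a fixed character position (61), it will break if the directory path length ever changes. An alternative keyed to the last "/" in the stored path could look like this; strrpos() requires Stata 14.)

      Code:
      * hypothetical alternative to the fixed-position substr(): take the three
      * characters that follow the last "/" in the stored file path
      gen id2 = substr(filename, strrpos(filename, "/") + 1, 3)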

      Robert/Clyde, what I don't understand is how F, G, and I are being averaged correctly from the repeated measures, when all I can see in Robert's code are count functions. Which part of the code calculates the average?

      Again, thanks a lot for the time you spent here.

      Best wishes to both of you.
      Last edited by Roman Mostazir; 26 Jan 2016, 05:36.
      Roman



      • #18
        collapse defaults to producing means.
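
        For illustration, the collapse call in the code in #17 is therefore equivalent to writing the statistic explicitly:

        Code:
        * the line in #17
        collapse F G I (count) nobs = A
        * is equivalent to the explicit form
        * collapse (mean) F G I (count) nobs = A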



        • #19
          Robert Picard: The reason I thought the file in #12 could break your code is that, at least as far as I can make out from the screenshot, the cells in F, G, and I in that offending row will not necessarily be missing. There is a number, 0.156, in that row: I can't tell which column it ends up in, but it could be one of the crucial three.

          Evidently, though, Roman ran your code and it worked. So my concern did not materialize.

          Roman Mostazir: Since you haven't yet run my code for the .DAT files, here's some better code. This one includes checks to ensure that one, and only one, value of each of mean VO2, mean VCO2, and mean EE is extracted from each file. Rather than aborting on encountering a bad file, I set it up not to post anything for that file but to continue on. While that's not what I usually consider optimal design, since you have a large number of files and there is some reason to think that several of them will be bad, it's probably better in this case to process the files that are suitable and just report an inventory of the files that the code couldn't handle. So that's what this code does.

          Code:
          // THE LOOP STRUCTURE IS THE SAME AS BEFORE
          // THE CONTENTS OF THE INNER LOOP ARE DIFFERENT
          // TO REFLECT THE STRUCTURE OF .DAT FILES
          capture postutil clear
          postfile handle int year str32 name vo2 vco2 ee using combined_results_dat_files, replace
          
          forvalues y = 2000/2011 {
              cd YR`y'
              local fnames: dir "." files "*.dat"
              foreach f of local fnames {
                 local name: subinstr local f ".dat" ""
                 file open datfile using "`f'", read text
                 file read datfile line
                 local vo2
                 local vco2
                 local ee
                 local bad_file = 0
                 while !r(eof) {
                        if substr(`"`line'"', 1, 10) == "Mean VO2 :" {
                            capture assert `"`vo2'"' == ""
                            if c(rc) {
                              display `"Duplicate Mean VO2 in file `f'"'
                              local bad_file = 1
                            }
                            local vo2: word 4 of `line'
                        }
                        else if substr(`"`line'"', 1, 10) == "Mean VCO2:" {
                            capture assert `"`vco2'"' == ""
                            if c(rc) {
                              display `"Duplicate Mean VCO2 in file `f'"'
                              local bad_file = 1
                            }
                            local vco2: word 3 of `line'
                        }
                        else if substr(`"`line'"', 1, 25) == "Mean Energy Expenditure =" {
                            capture assert `"`ee'"' == ""
                            if c(rc) {
                              display `"Duplicate EE in file `f'"'
                              local bad_file = 1
                            }
                            local ee: word 5 of `line'
                       }
                       file read datfile line
                  }
                  capture assert !missing(`vo2', `vco2', `ee')
                  if c(rc) == 0 {
                      display `"Undefined VO2, VCO2, or EE in file `f'"'
                      local bad_file = 1
                  }
                  if !bad_file {
                      post handle (`y') (`"`name'"') (`vo2') (`vco2') (`ee')
                  }
                  file close datfile
              }
              cd ..
          }
          postclose handle
          Good luck!



          • #20
            Oops!! Many thanks Nick, how could I have missed that!! Just to let everyone know, I have pretty much finished processing almost 7,000 files with Robert's code, and the whole batch worked like magic!!! I have inspected the results. Now I have around 400 files left, for 400 patients, and they are *.DAT files. Clyde, I tried the code you provided for the *.DAT files, but it is not producing anything apart from the output below; no dataset is being produced:


            Code:
              "/var/folders/_v/x0t16tt96n9869cmzr8g32xm0000gp/T//SD10504.000000"
            
            capture postutil clear
            
            postfile handle int year str32 name vo2 vco2 ee using combined_results_dat_files, replace
            The full code is not displayed in the Stata results window and I am not sure why, but you can check your code in my post from today at 2.44!! (Why am I not seeing any post numbers to refer to!!)

            I tried one dataset with import delimited:

            Code:
            import delimited using 155HL.dat, delimiter(" ")
            This reads the whole text of the data file and is completely unworkable. For everyone's convenience, I am pasting the contents of a *.DAT file:

            Code:
            Europa Scientific Ltd.
            Indirect Calorimetry Report   09-22-2010  09:06:31
            
            PATIENT DATA
            Name: 159HV ?
            D.O.B.:  0/ 0/ 0
            Record Number: ?
            Sex: F
            Height: ?cm
            Weight: ?kg
            Blood Pressure: ?
            Heart Rate: ?
            Start Body Temp: ?∞C
            End Body Temp: ?∞C
            
            SkinFold Measurements:-
            Biceps: ? ; Triceps:? ; Subscapular:? ; Suprailiac:?
            
            NOTES:
            ?
            
            USER DEFINED EE EQUATION = 1.44 * (3.4VO2 + 1.1VCO2)
            
            RESULTS OF RUN
            GEM User ID: ?
            Start Time: 08:48:06
            End Time: 09:03:32
            Initial inspired O2 = 20.95%
            Initial inspired CO2 = 0.075%
            
            Sample binning in groups of 60
            Artifact bin exclusion is OFF
            
            Bin  Time   FeO2   FeCO2 FSTP   VO2   VCO2    RQ       EE     EE  
            No.                      l/min ml/min ml/min        kcal/day kJ/day
            --- ------ ------ ------ ----- ------ ------ ------ -------- ------
            001  08:49 20.52% 0.506% 44.48    191    192  1.002     1241   5197
            002  08:50 20.55% 0.476% 44.31    176    178  1.012     1143   4784
            003  08:51 20.58% 0.441% 43.98    160    161  1.006     1039   4349
            004  08:52 20.58% 0.442% 43.98    159    161  1.013     1036   4335
            005  08:53 20.55% 0.482% 43.79    174    178  1.023     1134   4749
            006  08:54 20.55% 0.485% 43.65    173    179  1.034     1131   4737
            007  08:55 20.57% 0.455% 43.53    165    166  1.003     1071   4483
            008  08:56 20.56% 0.468% 43.45    168    171  1.018     1091   4569
            009  08:57 20.54% 0.489% 43.23    173    179  1.033     1132   4738
            010  08:58 20.56% 0.481% 43.18    165    175  1.062     1087   4550
            011  08:59 20.47% 0.583% 43.08    202    219  1.081     1337   5599
            012  09:00 20.59% 0.474% 42.96    150    172  1.142     1008   4220
            013  09:01 20.52% 0.520% 42.76    182    191  1.048     1192   4989
            014  09:02 20.51% 0.525% 42.59    182    192  1.051     1197   5012
            015  09:03 20.55% 0.500% 42.46    166    181  1.086     1101   4612
            
            Mean VO2 : 173 ml/min
            Mean VCO2: 180 ml/min
            Mean RQ  : 1.041
            
            Mean Energy Expenditure = 1129 kcal/Day ; 4728 kJ/Day

            I could have attached a *.DAT file for your inspection by changing its extension, but I wanted to stick to the rule that the forum does not support *.DAT uploads. See the output. I am mainly interested in the three highlighted figures, either extracting them from their respective cells or averaging them from the respective columns (VO2, VCO2, and EE in kcal/day).
            Last edited by Roman Mostazir; 26 Jan 2016, 09:48.
            Roman



            • #21
              Apologies Clyde, I did not see your new code; my last feedback was based on your previous code. I will let you know how I get on. Many thanks again.
              Last edited by Roman Mostazir; 26 Jan 2016, 09:51.
              Roman



              • #22
                Hi Clyde,

                I took two folders, each with 8-9 datasets, and ran your latest code. The output is:

                - It produces a dataset with the following variables but no data:

                Code:
                des
                
                Contains data from /Users/mm487/test/COMPILED/combined_results_dat_files.dta
                  obs:             0                          
                 vars:             5                          26 Jan 2016 16:55
                 size:             0                          
                -------------------------------------------------------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                -------------------------------------------------------------------------------------------------------------
                year            int     %8.0g                
                name            str32   %32s                  
                vo2             float   %9.0g                
                vco2            float   %9.0g                
                ee              float   %9.0g                
                ------------------------------------------------
                
                 list in 1/10
                Obs. nos. out of range
                I am wondering whether we are referring to the correct lines in the code, but that is just speculation. I have not understood how the code ties specific lines of the files to the cell values for the means.

                Many thanks.
                Last edited by Roman Mostazir; 26 Jan 2016, 10:14.
                Roman



                • #23
                  I made a few changes here to the code in #19--there were some typos. I tested the following by setting up a YR2000 directory and copying what you showed in #20 as test1.dat. Then I made a test2.dat by putting in a duplicate line for VCO2 and deleting the VO2 line to lightly test the error trapping. It worked for me. Give it a try. Sorry for the earlier errors.

                  Code:
                  clear*
                  capture postutil clear
                  postfile handle int year str32 name vo2 vco2 ee using combined_results_dat_files, replace
                  forvalues y = 2000/2000 {
                      cd YR`y'
                      local fnames: dir "." files "*.dat"
                      foreach f of local fnames {
                         local name: subinstr local f ".dat" ""
                         file open datfile using "`f'", read text
                         local vo2
                         local vco2
                         local ee
                         local bad_file = 0
                         file read datfile line
                         while !r(eof) {
                                if substr(`"`line'"', 1, 10) == "Mean VO2 :" {
                                    capture assert `"`vo2'"' == ""
                                    if c(rc) {
                                      display in red `"Duplicate Mean VO2 in year `y' file `f'"'
                                      local bad_file = 1
                                    }
                                    local vo2: word 4 of `line'
                                }
                                else if substr(`"`line'"', 1, 10) == "Mean VCO2:" {
                                    capture assert `"`vco2'"' == ""
                                    if c(rc) {
                                      display in red `"Duplicate Mean VCO2 in year `y' file `f'"'
                                      local bad_file = 1
                                    }
                                    local vco2: word 3 of `line'
                                }
                                else if substr(`"`line'"', 1, 25) == "Mean Energy Expenditure =" {
                                    capture assert `"`ee'"' == ""
                                    if c(rc) {
                                      display in red `"Duplicate EE in year `y' file `f'"'
                                      local bad_file = 1
                                    }
                                    local ee: word 5 of `line'
                               }
                               file read datfile line
                          }
                          capture assert !missing(`"`vo2'"', `"`vco2'"', `"`ee'"')
                          if c(rc) != 0 {
                              display in red `"Undefined VO2, VCO2, or EE in year `y' file `f'"'
                              display in red `"`vo2', `vco2', `ee'"'
                              local bad_file = 1
                          }
                          if !`bad_file' {
                              post handle (`y') (`"`name'"') (`vo2') (`vco2') (`ee')
                          }
                          file close datfile
                      }
                      cd ..
                  }
                  
                  postclose handle
                  Do not expect to see output as you go along: it will be silent until it hits a file that is missing one of the parameters or defines it more than once. Then you will get an error message in red about that file. When it finishes, results for the valid files will have been saved in the file combined_results_dat_files.dta.
                  Last edited by Clyde Schechter; 26 Jan 2016, 10:23.



                  • #24
                    Hi Roman,

                    Have you tried using Stat/Transfer? You can convert the Excel files to Stata files with it.



                    • #25
                      Here's how I would handle the .DAT files. It's the same setup as before but I use infix to read the raw text into a single string variable called line. I keep only lines where the first word is either "Bin" or a number and then split the string into separate variables. Then, it's only a matter of targeting the desired variables (with checks to make sure that the file conforms to expectations).

                      Code:
                      * use -filelist- from SSC (to install, type "ssc install filelist" in Stata's command window)
                      * to create a dataset with the filenames of all the .DAT files in the "roman" directory
                      filelist, dir("roman") pattern("*.DAT") save("DAT_files.dta") replace
                      local nfiles = r(N)
                      
                      * initialize a dataset to hold the results
                      clear
                      save "results_DAT.dta", replace emptyok
                      
                      * loop over each DAT file
                      forvalues i=1/`nfiles' {
                      
                          use "DAT_files.dta" in `i', clear
                          local f = dirname + "/" + filename
                          
                          infix str line 1-120 using "`f'", clear
                          gen word1 = word(line,1)
                          keep if !mi(real(word1)) | word1 == "Bin"
                          split line
                          
                          // consistency checks
                          count if strpos(line1, "Bin")
                          assert r(N) == 1
                          count if strpos(line6, "VO2")
                          assert r(N) == 1
                          count if strpos(line7, "VCO2")
                          assert r(N) == 1
                          count if strpos(line9, "EE")
                          assert r(N) == 1
                          
                          destring *, replace force
                          
                          collapse (mean) line6 line7 line9 (count) nobs = line1
                          gen filename = "`f'"
                          append using "results_DAT.dta"
                          save "results_DAT.dta", replace
                          
                      }
                      Last edited by Robert Picard; 26 Jan 2016, 11:00. Reason: change filename for results to avoid conflicts with previous solution



                      • #26
                        Again looking at the screenshot in #12, that nonconforming row looks to me like it ends up with that 0.156 number in line9 using the code in #25. It's a little hard to see at that size, and it may be that it will not. But it's a bit chancy and depends crucially on the presence and absence of spaces in places where they are optional. So while I'm not saying it won't work, as I can't see enough detail in the screenshot, it might not. If that 0.156 does creep into line9 (or line6 or line7 in some other file, given the irregularities), you won't even know it happened--or you'll only find out after getting odd results later on, and figuring out why will be a very difficult task at that point.

                        Of course, the screenshot in #12 is from an Excel spreadsheet, not one of the .dat files. But given the overall variability in these files, I remain wary.

                        I think the approach taken in #23 is safer. It will never pick up invalid data; and if it misses data that is there but not laid out in the way that the code expects, you will know about that up front. If those cases are few, they can be handled manually. If they are many, there might be a second pattern applying in those cases that can be dealt with by changing the code accordingly.
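
                        (For what it's worth, one hypothetical extra guard that could be added to the consistency checks in #25 is a plausibility range for the averaged columns before the collapse; the cutoffs below are only my guesses based on the sample file in #20, not validated values.)

                        Code:
                        * hypothetical range checks to run after destring and before collapse;
                        * cutoffs are guesses based on the sample file (VO2/VCO2 around
                        * 150-250 ml/min, EE around 1000+ kcal/day)
                        assert inrange(line6, 50, 1000) if !mi(line6)
                        assert inrange(line7, 50, 1000) if !mi(line7)
                        assert inrange(line9, 300, 10000) if !mi(line9)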


                        Last edited by Clyde Schechter; 26 Jan 2016, 11:50.



                        • #27
                          It looks to me like "0.156%", which would translate to missing because of the percent sign. There are plenty of other ways the data could diverge from the few observed examples we have seen. There's no "safe" way to scrape data from a large number of files and be absolutely sure that there are no mistakes. As I said, without eyes on the data, it's kind of a pointless exercise to add rules. That's something that Roman can decide, including when he reaches diminishing returns.
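
                          (A quick illustration of that point, with made-up values: destring with the force option turns any entry that is not a pure number, such as "0.156%", into missing, and a later collapse simply ignores the missing value when averaging.)

                          Code:
                          * illustration with made-up values: "0.156%" becomes missing under
                          * -destring, force- and is then ignored by -collapse (mean)-
                          clear
                          input str10 s
                          "180.6"
                          "0.156%"
                          end
                          destring s, replace force
                          list, clean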

                          Note that my solution simply offers another approach to the problem. Personally, I would not collapse the data on the fly; instead, I would build a dataset that combines all the measurements. That creates another opportunity to inspect the appended data before calculating the averages (using collapse).
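
                          A rough sketch of that alternative, reusing the variable names from #25 (the file name all_rows.dta is made up, and it would need to be initialized empty before the loop, just as results_DAT.dta is):

                          Code:
                          * inside the loop in #25, instead of collapsing each file on the fly,
                          * keep the row-level measurements and tag them with their source file
                          keep line1 line6 line7 line9
                          gen filename = "`f'"
                          append using "all_rows.dta"
                          save "all_rows.dta", replace

                          * after the loop: inspect the combined rows, then compute the averages
                          use "all_rows.dta", clear
                          collapse (mean) line6 line7 line9 (count) nobs = line1, by(filename)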



                          • #28
                            Originally posted by Robert Picard: "Personally, I would not collapse the data on the fly and I would build a dataset that combines all the measurements. That creates another opportunity to inspect the appended data before calculating the averages (using collapse)."
                            Yes, I agree that would be much better. And I also agree that there is no completely safe way to deal with the loosely structured data at hand.



                            • #29
                              Dear Clyde and Robert,

                              Before moving on to your safety-check comments, I just want to say a big thank you to both of you again for your time and valuable, clever input. Something in Clyde's code regarding the DAT extension is not functioning: it creates the dataset, but the dataset remains empty. However, I strongly believe that this is due to the random nature of the files rather than Clyde's code. The important thing is that the problem is solved, and Robert's code again worked perfectly!!! (When I started this thread, I thought I was asking for something impossible!!) ...

                              Regarding the safety check, I completely agree that the process could have been risky. But I was lucky in the sense that there are no whimsical values hovering in the three cells I am interested in. The 0.156%, as rightly pointed out by Clyde, is a text entry by some erudite data management team, whoever they were, and it is beyond my imagination that they continued like this for 12 years without being spotted. As Robert mentioned, the text is turned into missing and skipped, so collapse calculates the mean correctly.

                              A big hug and all the best,

                              Roman



                              • #30
                                I feel obliged to update this thread regarding Clyde's code, which I had failed to implement correctly, entirely through my own ignorance. Apparently, Clyde's code also worked perfectly!!! The reason I failed to implement it in the first place is that my files all have the extension "*.DAT" while Clyde's code is written with the "*.dat" extension. The Mac handles this case-sensitively while Windows does not. While Clyde was coding and successfully running it on a Windows platform with sample datasets, I was running it on a Mac, and obviously nothing was being posted to the Stata dataset on my Mac, even though the whole command ran smoothly without any error message. Apologies Clyde, and again hats off to both of you.
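
                                (A small sketch, my own addition rather than anything from Clyde's post, of how the file list could be made extension-case-agnostic so the same code runs on both platforms:)

                                Code:
                                * collect both lower- and upper-case extensions; taking the union of the
                                * two macro lists avoids duplicates on case-insensitive file systems
                                local fnames_lc : dir "." files "*.dat"
                                local fnames_uc : dir "." files "*.DAT"
                                local fnames : list fnames_lc | fnames_uc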

                                And thanks to Stata tech support for sorting it out; it is a shame that I failed to pick it up in the first place.

                                Best wishes,

                                Clyde's code after changing the extension to *.DAT:


                                Code:
                                clear*
                                capture postutil clear
                                postfile handle int year str32 name vo2 vco2 ee using combined_results_dat_files, replace
                                forvalues y = 2011/2012 {
                                    cd YR`y'
                                    local fnames: dir "." files "*.DAT"
                                    foreach f of local fnames {
                                       local name: subinstr local f ".DAT" ""
                                       file open datfile using "`f'", read text
                                       local vo2
                                       local vco2
                                       local ee
                                       local bad_file = 0
                                       file read datfile line
                                       while !r(eof) {
                                              if substr(`"`line'"', 1, 10) == "Mean VO2 :" {
                                                  capture assert `"`vo2'"' == ""
                                                  if c(rc) {
                                                    display in red `"Duplicate Mean VO2 in year `y' file `f'"'
                                                    local bad_file = 1
                                                  }
                                                  local vo2: word 4 of `line'
                                              }
                                              else if substr(`"`line'"', 1, 10) == "Mean VCO2:" {
                                                  capture assert `"`vco2'"' == ""
                                                  if c(rc) {
                                                    display in red `"Duplicate Mean VCO2 in year `y' file `f'"'
                                                    local bad_file = 1
                                                  }
                                                  local vco2: word 3 of `line'
                                              }
                                              else if substr(`"`line'"', 1, 25) == "Mean Energy Expenditure =" {
                                                  capture assert `"`ee'"' == ""
                                                  if c(rc) {
                                                    display in red `"Duplicate EE in year `y' file `f'"'
                                                    local bad_file = 1
                                                  }
                                                  local ee: word 5 of `line'
                                             }
                                             file read datfile line
                                        }
                                        capture assert !missing(`"`vo2'"', `"`vco2'"', `"`ee'"')
                                        if c(rc) != 0 {
                                            display in red `"Undefined VO2, VCO2, or EE in year `y' file `f'"'
                                            display in red `"`vo2', `vco2', `ee'"'
                                            local bad_file = 1
                                        }
                                        if !`bad_file' {
                                            post handle (`y') (`"`name'"') (`vo2') (`vco2') (`ee')
                                        }
                                        file close datfile
                                    }
                                    cd ..
                                }
                                
                                postclose handle
                                
                                use combined_results_dat_files.dta,clear
                                
                                list in 1/10, clean
                                
                                      year    name   vo2   vco2     ee  
                                  1.   2011   001LB   212    166   1299  
                                  2.   2011   002AM   215    185   1346  
                                  3.   2011   003MO   134    121    846  
                                  4.   2011   004TM   232    210   1468  
                                  5.   2011   005BH   140    148    920  
                                  6.   2011   006HB   250    218   1570  
                                  7.   2011   007RS   239    184   1461  
                                  8.   2011   008AN   190    164   1192  
                                  9.   2011   009KC   232    203   1458  
                                 10.   2011   010DD   184    159   1155


                                Last edited by Roman Mostazir; 28 Jan 2016, 17:15. Reason: Added the code and sample of exported data
                                Roman

