Help importing data from Excel to Stata

Roman Mostazir

Join Date: Apr 2014
Posts: 874

Help importing data from Excel to Stata

25 Jan 2016, 09:30

Dear Experts,

I am not sure if I will make sense here. My problem is regarding importing data in Stata (Version 13.1) from excel files which are auto generated by some kind of respiratory machines. I will try simplifying the problem.

For example:

Code:

I have the following three Excel files in a directory:


File-1:   247LAG.xls
Fil-2  :   042DM.xls
File-3:   046KFR.xls


Task-1: I need to extract particular cell values from each of these 3 files:
************************************************************************************

From each file:

cell: C40
cell: D39
cell: E43

Cell values do not have any variable name/header.
That is how the software of the machine generates the output.


Task-2: Each three cell values should be imported in three columns in a
Stata dataset and will have three different names as variables:
***********************************************************************************

data from Cell: C40 to be saved as in a column/variable name "VCO2"
data from Cell: D39 to be saved as in a column/variable name "O2"
data from Cell: E43 to be saved as in a column/variable name "REE"


Task-3:  Each row will identify the participant with an ID that corresponds
to the initial part of the file name (without .xls extension).
**************************************************************************************

The number part in the alpha-numeric file name can vary from 001 to 597 and
the Alphabets can vary from any combination of A-Z !!.

Apparently the machine produces one file for each participant. Thus, I have 12 years of data in 12 folders for 597 participants for each year. If an automation process is possible, I can carryout the task for each year, folder wise and then append them together in one dataset as they need to be analyzed longitudinally. But some kind of loop help is needed to initiate the process.

Currently it looks impossible to me. But as always, I want to shelter here as my last resort and if not possible, I will bring a big jar of coffee and start doing it manually. Any suggestions or advice is highly appreciated. Please do let me know if anything is not clear, I will try to clarify.

Best wishes,

Roman

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

25 Jan 2016, 10:24

Sounds like a job for -postfile-. I'm going to assume that your folders are named YR2000, YR2001, ...YR2011, all located in the current working directory, and that there is no systematic pattern to the names of the spreadsheets in each folder. I also assume the contents of C40, D39 and E43 are numeric.

Code:

capture postutil clear postfile handle int year str32 name c40 d39 e43 using combined_results, replace forvalues y = 2000/2011 { cd YR`y' local fnames: dir "." files "*.xls" foreach f of local fnames { import excel using `f', clear local name: subinstr local f ".xls" "" post handle (`y') (`"`name'"') (C[40]) (D[39]) (E[43]) } cd .. } postclose handle

should leave you with all your results in combined_results.dta in the current working directory
1 like
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#3

25 Jan 2016, 10:36

Dear Clyde,

Needless to say how grateful I am for this. Seems like a solution exists !!. I have not tried it yet but will do so and will update how it is going. Your assumption is correct. The cell values are numeric.

All the best,

Roman
Comment

Roman Mostazir

Join Date: Apr 2014
Posts: 874

25 Jan 2016, 11:42

Aplogies Clyde. I think I will have to bear with the pain of a manual process. The reference cell for the calculated figure of interest changes from patient to patient. Some patient had 15 continuous measurements (every minute) some had 20 which is shifting down the reference cell where the mean values are calculated and that is what I am interested in. See one patient for example:

Code:

  A
B
C
D
E
F
G
H
I
J



  Bin
Time
FeO2
FeCO2
FSTP
VO2
VCO2
RQ
EE
EE


No.



l/min
ml/min
ml/min

kcal/day
kJ/day


---
------
------
------
-----
------
------
------
--------
------


1
20:21
20.70%
0.31%
50.70
143
114
0.802
877
3673


2
20:22
20.68%
0.33%
50.58
154
127
0.826
951
3982


3
20:23
20.70%
0.32%
50.58
141
121
0.857
883
3697


4
20:24
20.67%
0.34%
50.46
159
129
0.812
981
4107


5
20:25
20.68%
0.32%
50.18
151
121
0.806
927
3883


6
20:26
20.71%
0.30%
49.94
133
109
0.819
820
3434


7
20:27
20.69%
0.33%
49.82
146
124
0.847
908
3799


8
20:28
20.66%
0.34%
49.82
158
127
0.800
974
4077


9
20:29
20.70%
0.30%
49.59
137
109
0.801
840
3518


10
20:30
20.70%
0.31%
49.67
138
113
0.822
853
3570


11
20:31
20.70%
0.30%
49.65
138
107
0.777
843
3530


12
20:32
20.73%
0.28%
49.48
122
97
0.789
750
3140


13
20:34
20.74%
0.15%
49.54
118
101
0.859
734
3071


14
20:35
20.71%
0.30%
49.41
134
106
0.788
824
3448


15
20:36
20.70%
0.31%
49.30
137
113
0.822
848
3551


16
20:37
20.69%
0.31%
48.96
140
114
0.815
861
3603


17
20:38
20.69%
0.32%
49.10
140
115
0.823
866
3624


18
20:39
20.72%
0.29%
48.93
125
103
0.832
773
3236


19
20:40
20.72%
0.29%
48.78
125
103
0.822
774
3240


20
20:41
20.68%
0.32%
48.62
145
118
0.813
896
3752


























Mean
VO2
:
139
ml/min







Mean
VCO2:
113
ml/min








Mean
RQ
:
0.816




















Mean
Energy
Expenditure
=
859
kcal/Day
;
3596
kJ/Day

Sorry for the mis-alignments. Couldn't do better pasting from excel. The cell reference numbers are not shown.

I am interested in mean of VO2 (variable name refers to cell: F19) mean of VCO2 (cell: G19) and mean of EE kcal/day (cell: I19) which are calculated below the measurements (VO2=139, VCO2=113, EE=859). The number 1 measurement starts from cell: A22 and same for all dataset. But since the number of measurements vary patient to patient (in this case 20 measures, some may have more or less), reference cells for the mean figures shift down.The columns also corresponds to the same variable for each patient. Is there any way to refer to the column and take the mean instead extracting the particular cell value where mean is calculated. Any thought is highly appreciated and many thanks again for reading with patience.

Regards,

Last edited by Roman Mostazir; 25 Jan 2016, 12:39.

Roman

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30101

25 Jan 2016, 13:09

Try this:

Code:

capture postutil clear
postfile handle int year str32 name vo2 vco2 ee using combined_results, replace

forvalues y = 2000/2011 {
    cd YR`y'
    local fnames: dir "." files "*.xls"
    foreach f of local fnames {
        import excel using `f', clear cellrange(A22)
        local name: subinstr local f ".xls" ""
        drop in -4/l // GET RID OF SUMMARY ROWS THAT WILL MESS THINGS UP
        summ F, meanonly
        local mean_vo2 = r(mean)
        summ G, meanonly
        local mean_vco2 = r(mean)
        summ I, meanonly
        local mean_ee = r(mean)
        post handle (`y') (`"`name'"') (`mean_vo2') (`mean_vco2') (`mean_ee')
    }
    cd ..
}
postclose handle

Note: This code assumes that the Mean VO2 through Mean Energy Expenditure rows at the bottom of your example are always the last four rows of the spreadsheet. We get rid of them in case the numbers they contain happen to fall into columns F, G, or I and would be spuriously included in the calculations.

Also, looking at what you have shown in the example, I think that when you -import excel- you are going to get all string variables. If I'm write about that, then you will need to -destring F G I, replace- after the -drop in -4/l- command.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

25 Jan 2016, 13:21

Here's another approach that uses filelist (from SSC). It's more flexible if the directory substructure does not follow a predictable pattern (e.g. years for the current example).

The solution assumes that column A contains a number for rows that need to be averaged.

Code:

* use -filelist- from SSC (to install, type "ssc install filelist" in Stata's command window)
* to create a dataset with the filenames of all the Excel docs in the "roman" directory
filelist, dir("roman") pattern("*.xls") save("xls_files.dta") replace
local nfiles = r(N)

* initialize a dataset to hold the results
clear
save "results.dta", replace emptyok

* loop over each Excel file
forvalues i=1/`nfiles' {

    use "xls_files.data" in `i', clear
    local f = dirname + "/" + filename
    
    import excel using "`f'", clear
    destring *, replace force
    keep if !mi(A)
    
    collapse F G I (count) nobs = A
    gen filename = "`f'"
    append using "results.dta"
    save "results.dta", replace
    
}

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#7

25 Jan 2016, 14:30

Owao, I just wonder which planet you guys do come from. I can now work out on these. My gratitudes to both you Clyde and Robert.
Best wishes,

Roman
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#8

25 Jan 2016, 16:06

Dear Robert/Clyde,

Will any of these codes provided could be customised with *.DAT files ? Last few years are with *.DAT extensions.

Many thanks again.

Roman
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#9

25 Jan 2016, 16:22

The .DAT extension does not define a specific file type. Files with this extension may be text-based, or they may be proprietary to whatever application created them. Where do your .DAT files come from? Are they text files? Are they SAS files? Something else?

If they are text files, then replacing the -import excel- command with some other command that reads text files (e.g. -import delimited-, or -infile-, or -infix-), and possibly modifying the variable names from A B C, etc. to whatever default variable names apply, should do the trick.

If these files are not text then you will need to find a program such as StatTransfer that translates them into something that Stata can read directly. (It would also be worth using Stata's -search- to see if there is a user-written command that can load .DAT files from that application.)
Comment

Roman Mostazir

Join Date: Apr 2014
Posts: 874

#10

25 Jan 2016, 17:13

Hi Clyde,
Not sure really. I have been requested to undertake a project to explore specific hypothesis of these data which were measured for 12 years on some same patients. My pain is to bring them all in one dataset. I guess the *.DAT file was the later development of the software output !! I opened a *DAT file with my text edit program (in my Mac) and that is what it looks like inside (please see below). Basically it is the same output that is coming with the excel files.

P:S: If I ever become the director of a Stat program in any educational institute, I will ensure that beside the stat course student also undertake a programming course !!

Code:

Europa Scientific Ltd.
Indirect Calorimetry Report   12-17-2009  08:38:02

PATIENT DATA
Name: 013HC ?
D.O.B.:  0/ 0/ 0
Record Number: ?
Sex: F
Height: ?cm
Weight: ?kg
Blood Pressure: ?
Heart Rate: ?
Start Body Temp: ?∞C
End Body Temp: ?∞C

SkinFold Measurements:-
Biceps: ? ; Triceps:? ; Subscapular:? ; Suprailiac:?

NOTES:
?

USER DEFINED EE EQUATION = 1.44 * (3.4VO2 + 1.1VCO2)

RESULTS OF RUN
GEM User ID: ?
Start Time: 08:24:16
End Time: 08:35:39
Initial inspired O2 = 20.97%
Initial inspired CO2 = 0.093%

Sample binning in groups of 60
Artifact bin exclusion is OFF

Bin  Time   FeO2   FeCO2 FSTP   VO2   VCO2    RQ       EE     EE   
No.                      l/min ml/min ml/min        kcal/day kJ/day
--- ------ ------ ------ ----- ------ ------ ------ -------- ------
001  08:25 20.53% 0.491% 45.29    208    180  0.867     1302   5453
002  08:26 20.46% 0.515% 45.25    242    191  0.790     1485   6218
003  08:27 20.44% 0.536% 45.00    253    199  0.786     1556   6513
004  08:28 20.43% 0.540% 44.87    253    200  0.791     1556   6516
005  08:29 20.43% 0.571% 44.67    253    213  0.844     1575   6594
006  08:30 20.42% 0.554% 44.39    255    204  0.803     1571   6578
007  08:31 20.43% 0.556% 44.33    251    205  0.817     1555   6511
008  08:32 20.37% 0.566% 44.21    282    209  0.741     1711   7162
009  08:33 20.39% 0.553% 44.00    270    202  0.748     1644   6883
010  08:34 20.42% 0.560% 43.83    252    204  0.812     1557   6519
011  08:35 20.41% 0.555% 43.67    260    202  0.777     1591   6660

Mean VO2 : 253 ml/min
Mean VCO2: 201 ml/min
Mean RQ  : 0.798

Mean Energy Expenditure = 1555 kcal/Day ; 6510 kJ/Day

Roman

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#11

25 Jan 2016, 17:23

One more thing Clyde, for the *.xls files, before I run your codes, if I want to drop the first 21 rows, as they contain other irrelevant texts and can cause problem in averaging, will the following work:

Code:

drop in 21/1

I have not tried Robert's code yet as not sure if that will work given the row1-row21 contain other texts.

Roman
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#12

25 Jan 2016, 18:32

Sorry Clyde and Robert. No luck yet for me and this is because I think I got a bunch of datasets that just do not fit into any criteria. Nothing wrong with your codes. Here is the example of one of them as screenshot. As I said, I am only interested for the average values of column F, G, I and want to be able to identify whom the values correspon to which can be found in each file name.
The rows for averaging the values may vary from 5 to 20 depends how many times each patient was measured for.

I highly really appreciate your help. Even nothing works, the codes you provided will be very helpful for me in future.

All the best.

Roman
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#13

25 Jan 2016, 19:44

In response to #11, if you note my code in #5, it specifies the option -cellrange(A22)- in the -import excel- command. So if you use that, the first 21 rows are skipped and will never come into Stata in the first place, so you won't need to drop them. If, in other circumstances, you did need to drop them, it would be -drop in 1/22-.

With regard to #12, it looks like they have indeed written these files for visual inspection with no thought for computers. Row 42 really breaks my code. I don't see any reasonable fix for this.

As for #10, these .DAT files actually might be more tractable than the spreadsheets. I don't have time right now to fully flesh out the code, but here's an outline of the approach, which relies on -file read-. The idea is to read the .DAT file one line at a time, checking for the lines that begin "Mean VO2 ", "Mean VCO2:", and "Mean Energy Expenditure =" and then extracting the numbers that follow.

Code:

// THE LOOP STRUCTURE IS THE SAME AS BEFORE // THE CONTENTS OF THE INNER LOOP ARE DIFFERENT // TO REFLECT THE STRUCTURE OF .DAT FILES capture postutil clear postfile handle int year str32 name vo2 vco2 ee using combined_results_dat_files, replace forvalues y = 2000/2011 { cd YR`y' local fnames: dir "." files "*.dat" foreach f of local fnames { local name: subinstr local f ".dat" "" file open datfile using "`f'", read text file read datfile line while !r(eof) { if substr(`"`line'"', 1, 10) == "Mean VO2 :" { local vo2: word 4 of `line' } else if substr(`"`line'"', 1, 10) == "Mean VCO2:" { local vco2: word 3 of `line' } else if substr(`"`line'"', 1, 25) == "Mean Energy Expenditure =" { local ee: word 5 of `line' } file read datfile line } post handle (`y') (`"`name'"') (`vo2') (`vco2') (`ee') file close datfile } cd .. } postclose handle

This code is incomplete because it does not actually verify that vo2, vco2 and ee have all been found in each file, and does nothing about it if they have not. So it needs some refinements, but perhaps you can work on that. Also, I don't guarantee that I have correctly counted the number of characters for each of those three key phrases--it's getting late and my eyes are getting blurry.

Needless to say, if some of the .DAT files have variants on these key text strings, varying in capitalization or placement of the colon or spaces, then this code will break. But if they at least adhere consistently to the use of those key strings, this should work for the .DAT files.

Finally, you have my deep sympathies for having to do this project. The clinical world is generally oblivious to the world of automated data processing (and, in my opinion, it hasn't improved much with the spreading adoption of EHRs) and we who work with clinical data are often left to doing tedious, error-prone manual work like this.
1 like
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#14

25 Jan 2016, 19:56

If the format of these Excel files can change, then you should think about adding some checks to make sure that your assumptions about the data are valid. My solution can handle an arbitrary number of rows with measurement data since it keeps only rows with a number in column A. Here's an expanded example with some checks. I also rewrote the condition to show that I'm dropping rows that do not start with a number in column A.

Copy and paste this example in a do-file and then save it in the same directory that includes the home directory for this data. Change Stata's current directory (help cd) to the one that contains the do-file. You'll need to adjust the directory name in the filelist command to match the name of the data directory. All you need to do then is to run the do-file.

Code:

* use -filelist- from SSC (to install, type "ssc install filelist" in Stata's command window)
* to create a dataset with the filenames of all the Excel docs in the "roman" directory
filelist, dir("roman") pattern("*.xls") save("xls_files.dta") replace
local nfiles = r(N)

* initialize a dataset to hold the results
clear
save "results.dta", replace emptyok

* loop over each Excel file
forvalues i=1/`nfiles' {

    use "xls_files.data" in `i', clear
    local f = dirname + "/" + filename
    
    import excel using "`f'", clear
    
    // consistency checks
    count if strpos(A, "Bin")
    assert r(N) == 1
    count if strpos(F, "VO2")
    assert r(N) == 1
    count if strpos(G, "VCO2")
    assert r(N) == 1
    count if strpos(I, "EE")
    assert r(N) == 1
    
    destring *, replace force
    
    // drop observations without a number in column A
    drop if mi(A)
    
    collapse F G I (count) nobs = A
    gen filename = "`f'"
    append using "results.dta"
    save "results.dta", replace
    
}

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#15

25 Jan 2016, 21:52

Robert Picard In #12, the screenshot of the spreadsheet shows that cell A42 contains a number, but the rest of that row does not conform to the layout needed for this to work. It isn't clear from the screen shot exactly what is in columns F, G, and I in that row, but they aren't the values of VO2 VCO2 and EE. So I think that example breaks your code.in #14.
1 like
Comment

Announcement

Help importing data from Excel to Stata

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment