Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help importing data from Excel to Stata

    Dear Experts,

    I am not sure if I will make sense here. My problem is regarding importing data in Stata (Version 13.1) from excel files which are auto generated by some kind of respiratory machines. I will try simplifying the problem.


    For example:


    Code:
    I have the following three Excel files in a directory:
    
    
    File-1:   247LAG.xls
    Fil-2  :   042DM.xls
    File-3:   046KFR.xls
    
    
    Task-1: I need to extract particular cell values from each of these 3 files:
    ************************************************************************************
    
    From each file:
    
    cell: C40
    cell: D39
    cell: E43
    
    Cell values do not have any variable name/header.
    That is how the software of the machine generates the output.
    
    
    Task-2: Each three cell values should be imported in three columns in a
    Stata dataset and will have three different names as variables:
    ***********************************************************************************
    
    data from Cell: C40 to be saved as in a column/variable name "VCO2"
    data from Cell: D39 to be saved as in a column/variable name "O2"
    data from Cell: E43 to be saved as in a column/variable name "REE"
    
    
    Task-3:  Each row will identify the participant with an ID that corresponds
    to the initial part of the file name (without .xls extension).
    **************************************************************************************
    
    The number part in the alpha-numeric file name can vary from 001 to 597 and
    the Alphabets can vary from any combination of A-Z !!.
    Apparently the machine produces one file for each participant. Thus, I have 12 years of data in 12 folders for 597 participants for each year. If an automation process is possible, I can carryout the task for each year, folder wise and then append them together in one dataset as they need to be analyzed longitudinally. But some kind of loop help is needed to initiate the process.

    Currently it looks impossible to me. But as always, I want to shelter here as my last resort and if not possible, I will bring a big jar of coffee and start doing it manually. Any suggestions or advice is highly appreciated. Please do let me know if anything is not clear, I will try to clarify.

    Best wishes,
    Roman

  • #2
    Sounds like a job for -postfile-. I'm going to assume that your folders are named YR2000, YR2001, ...YR2011, all located in the current working directory, and that there is no systematic pattern to the names of the spreadsheets in each folder. I also assume the contents of C40, D39 and E43 are numeric.

    Code:
    capture postutil clear
    postfile handle int year str32 name c40 d39 e43 using combined_results, replace
    
    forvalues y = 2000/2011 {
        cd YR`y'
        local fnames: dir "." files "*.xls"
        foreach f of local fnames {
            import excel using `f', clear
            local name: subinstr local f ".xls" ""
            post handle (`y') (`"`name'"') (C[40]) (D[39]) (E[43])
        }
        cd ..
    }
    postclose handle
    should leave you with all your results in combined_results.dta in the current working directory

    Comment


    • #3
      Dear Clyde,

      Needless to say how grateful I am for this. Seems like a solution exists !!. I have not tried it yet but will do so and will update how it is going. Your assumption is correct. The cell values are numeric.


      All the best,
      Roman

      Comment


      • #4
        Aplogies Clyde. I think I will have to bear with the pain of a manual process. The reference cell for the calculated figure of interest changes from patient to patient. Some patient had 15 continuous measurements (every minute) some had 20 which is shifting down the reference cell where the mean values are calculated and that is what I am interested in. See one patient for example:


        Code:
          
        A B C D E F G H I J
        Bin Time FeO2 FeCO2 FSTP VO2 VCO2 RQ EE EE
        No. l/min ml/min ml/min kcal/day kJ/day
        --- ------ ------ ------ ----- ------ ------ ------ -------- ------
        1 20:21 20.70% 0.31% 50.70 143 114 0.802 877 3673
        2 20:22 20.68% 0.33% 50.58 154 127 0.826 951 3982
        3 20:23 20.70% 0.32% 50.58 141 121 0.857 883 3697
        4 20:24 20.67% 0.34% 50.46 159 129 0.812 981 4107
        5 20:25 20.68% 0.32% 50.18 151 121 0.806 927 3883
        6 20:26 20.71% 0.30% 49.94 133 109 0.819 820 3434
        7 20:27 20.69% 0.33% 49.82 146 124 0.847 908 3799
        8 20:28 20.66% 0.34% 49.82 158 127 0.800 974 4077
        9 20:29 20.70% 0.30% 49.59 137 109 0.801 840 3518
        10 20:30 20.70% 0.31% 49.67 138 113 0.822 853 3570
        11 20:31 20.70% 0.30% 49.65 138 107 0.777 843 3530
        12 20:32 20.73% 0.28% 49.48 122 97 0.789 750 3140
        13 20:34 20.74% 0.15% 49.54 118 101 0.859 734 3071
        14 20:35 20.71% 0.30% 49.41 134 106 0.788 824 3448
        15 20:36 20.70% 0.31% 49.30 137 113 0.822 848 3551
        16 20:37 20.69% 0.31% 48.96 140 114 0.815 861 3603
        17 20:38 20.69% 0.32% 49.10 140 115 0.823 866 3624
        18 20:39 20.72% 0.29% 48.93 125 103 0.832 773 3236
        19 20:40 20.72% 0.29% 48.78 125 103 0.822 774 3240
        20 20:41 20.68% 0.32% 48.62 145 118 0.813 896 3752
        Mean VO2 : 139 ml/min
        Mean VCO2: 113 ml/min
        Mean RQ : 0.816
        Mean Energy Expenditure = 859 kcal/Day ; 3596 kJ/Day
        Sorry for the mis-alignments. Couldn't do better pasting from excel. The cell reference numbers are not shown.

        I am interested in mean of VO2 (variable name refers to cell: F19) mean of VCO2 (cell: G19) and mean of EE kcal/day (cell: I19) which are calculated below the measurements (VO2=139, VCO2=113, EE=859). The number 1 measurement starts from cell: A22 and same for all dataset. But since the number of measurements vary patient to patient (in this case 20 measures, some may have more or less), reference cells for the mean figures shift down.The columns also corresponds to the same variable for each patient. Is there any way to refer to the column and take the mean instead extracting the particular cell value where mean is calculated. Any thought is highly appreciated and many thanks again for reading with patience.


        Regards,
        Last edited by Roman Mostazir; 25 Jan 2016, 12:39.
        Roman

        Comment


        • #5
          Try this:
          Code:
          capture postutil clear
          postfile handle int year str32 name vo2 vco2 ee using combined_results, replace
          
          forvalues y = 2000/2011 {
              cd YR`y'
              local fnames: dir "." files "*.xls"
              foreach f of local fnames {
                  import excel using `f', clear cellrange(A22)
                  local name: subinstr local f ".xls" ""
                  drop in -4/l // GET RID OF SUMMARY ROWS THAT WILL MESS THINGS UP
                  summ F, meanonly
                  local mean_vo2 = r(mean)
                  summ G, meanonly
                  local mean_vco2 = r(mean)
                  summ I, meanonly
                  local mean_ee = r(mean)
                  post handle (`y') (`"`name'"') (`mean_vo2') (`mean_vco2') (`mean_ee')
              }
              cd ..
          }
          postclose handle
          Note: This code assumes that the Mean VO2 through Mean Energy Expenditure rows at the bottom of your example are always the last four rows of the spreadsheet. We get rid of them in case the numbers they contain happen to fall into columns F, G, or I and would be spuriously included in the calculations.

          Also, looking at what you have shown in the example, I think that when you -import excel- you are going to get all string variables. If I'm write about that, then you will need to -destring F G I, replace- after the -drop in -4/l- command.

          Comment


          • #6
            Here's another approach that uses filelist (from SSC). It's more flexible if the directory substructure does not follow a predictable pattern (e.g. years for the current example).

            The solution assumes that column A contains a number for rows that need to be averaged.

            Code:
            * use -filelist- from SSC (to install, type "ssc install filelist" in Stata's command window)
            * to create a dataset with the filenames of all the Excel docs in the "roman" directory
            filelist, dir("roman") pattern("*.xls") save("xls_files.dta") replace
            local nfiles = r(N)
            
            * initialize a dataset to hold the results
            clear
            save "results.dta", replace emptyok
            
            * loop over each Excel file
            forvalues i=1/`nfiles' {
            
                use "xls_files.data" in `i', clear
                local f = dirname + "/" + filename
                
                import excel using "`f'", clear
                destring *, replace force
                keep if !mi(A)
                
                collapse F G I (count) nobs = A
                gen filename = "`f'"
                append using "results.dta"
                save "results.dta", replace
                
            }

            Comment


            • #7
              Owao, I just wonder which planet you guys do come from. I can now work out on these. My gratitudes to both you Clyde and Robert.
              Best wishes,
              Roman

              Comment


              • #8
                Dear Robert/Clyde,

                Will any of these codes provided could be customised with *.DAT files ? Last few years are with *.DAT extensions.

                Many thanks again.
                Roman

                Comment


                • #9
                  The .DAT extension does not define a specific file type. Files with this extension may be text-based, or they may be proprietary to whatever application created them. Where do your .DAT files come from? Are they text files? Are they SAS files? Something else?

                  If they are text files, then replacing the -import excel- command with some other command that reads text files (e.g. -import delimited-, or -infile-, or -infix-), and possibly modifying the variable names from A B C, etc. to whatever default variable names apply, should do the trick.

                  If these files are not text then you will need to find a program ​such as StatTransfer that translates them into something that Stata can read directly. (It would also be worth using Stata's -search- to see if there is a user-written command that can load .DAT files from that application.)

                  Comment


                  • #10
                    Hi Clyde,
                    Not sure really. I have been requested to undertake a project to explore specific hypothesis of these data which were measured for 12 years on some same patients. My pain is to bring them all in one dataset. I guess the *.DAT file was the later development of the software output !! I opened a *DAT file with my text edit program (in my Mac) and that is what it looks like inside (please see below). Basically it is the same output that is coming with the excel files.

                    P:S: If I ever become the director of a Stat program in any educational institute, I will ensure that beside the stat course student also undertake a programming course !!

                    Code:
                    Europa Scientific Ltd.
                    Indirect Calorimetry Report   12-17-2009  08:38:02
                    
                    PATIENT DATA
                    Name: 013HC ?
                    D.O.B.:  0/ 0/ 0
                    Record Number: ?
                    Sex: F
                    Height: ?cm
                    Weight: ?kg
                    Blood Pressure: ?
                    Heart Rate: ?
                    Start Body Temp: ?∞C
                    End Body Temp: ?∞C
                    
                    SkinFold Measurements:-
                    Biceps: ? ; Triceps:? ; Subscapular:? ; Suprailiac:?
                    
                    NOTES:
                    ?
                    
                    USER DEFINED EE EQUATION = 1.44 * (3.4VO2 + 1.1VCO2)
                    
                    RESULTS OF RUN
                    GEM User ID: ?
                    Start Time: 08:24:16
                    End Time: 08:35:39
                    Initial inspired O2 = 20.97%
                    Initial inspired CO2 = 0.093%
                    
                    Sample binning in groups of 60
                    Artifact bin exclusion is OFF
                    
                    Bin  Time   FeO2   FeCO2 FSTP   VO2   VCO2    RQ       EE     EE   
                    No.                      l/min ml/min ml/min        kcal/day kJ/day
                    --- ------ ------ ------ ----- ------ ------ ------ -------- ------
                    001  08:25 20.53% 0.491% 45.29    208    180  0.867     1302   5453
                    002  08:26 20.46% 0.515% 45.25    242    191  0.790     1485   6218
                    003  08:27 20.44% 0.536% 45.00    253    199  0.786     1556   6513
                    004  08:28 20.43% 0.540% 44.87    253    200  0.791     1556   6516
                    005  08:29 20.43% 0.571% 44.67    253    213  0.844     1575   6594
                    006  08:30 20.42% 0.554% 44.39    255    204  0.803     1571   6578
                    007  08:31 20.43% 0.556% 44.33    251    205  0.817     1555   6511
                    008  08:32 20.37% 0.566% 44.21    282    209  0.741     1711   7162
                    009  08:33 20.39% 0.553% 44.00    270    202  0.748     1644   6883
                    010  08:34 20.42% 0.560% 43.83    252    204  0.812     1557   6519
                    011  08:35 20.41% 0.555% 43.67    260    202  0.777     1591   6660
                    
                    Mean VO2 : 253 ml/min
                    Mean VCO2: 201 ml/min
                    Mean RQ  : 0.798
                    
                    Mean Energy Expenditure = 1555 kcal/Day ; 6510 kJ/Day
                    Roman

                    Comment


                    • #11
                      One more thing Clyde, for the *.xls files, before I run your codes, if I want to drop the first 21 rows, as they contain other irrelevant texts and can cause problem in averaging, will the following work:

                      Code:
                      drop in 21/1

                      I have not tried Robert's code yet as not sure if that will work given the row1-row21 contain other texts.
                      Roman

                      Comment


                      • #12
                        Sorry Clyde and Robert. No luck yet for me and this is because I think I got a bunch of datasets that just do not fit into any criteria. Nothing wrong with your codes. Here is the example of one of them as screenshot. As I said, I am only interested for the average values of column F, G, I and want to be able to identify whom the values correspon to which can be found in each file name.
                        The rows for averaging the values may vary from 5 to 20 depends how many times each patient was measured for.

                        I highly really appreciate your help. Even nothing works, the codes you provided will be very helpful for me in future.

                        All the best.


                        Click image for larger version

Name:	Screen Shot 2016-01-26 at 01.21.53.png
Views:	1
Size:	435.6 KB
ID:	1324196

                        Roman

                        Comment


                        • #13
                          In response to #11, if you note my code in #5, it specifies the option -cellrange(A22)- in the -import excel- command. So if you use that, the first 21 rows are skipped and will never come into Stata in the first place, so you won't need to drop them. If, in other circumstances, you did need to drop them, it would be -drop in 1/22-.

                          With regard to #12, it looks like they have indeed written these files for visual inspection with no thought for computers. Row 42 really breaks my code. I don't see any reasonable fix for this.

                          As for #10, these .DAT files actually might be more tractable than the spreadsheets. I don't have time right now to fully flesh out the code, but here's an outline of the approach, which relies on -file read-. The idea is to read the .DAT file one line at a time, checking for the lines that begin "Mean VO2 ", "Mean VCO2:", and "Mean Energy Expenditure =" and then extracting the numbers that follow.

                          Code:
                          // THE LOOP STRUCTURE IS THE SAME AS BEFORE
                          // THE CONTENTS OF THE INNER LOOP ARE DIFFERENT
                          // TO REFLECT THE STRUCTURE OF .DAT FILES
                          capture postutil clear
                          postfile handle int year str32 name vo2 vco2 ee using combined_results_dat_files, replace
                          
                          forvalues y = 2000/2011 {
                              cd YR`y'
                              local fnames: dir "." files "*.dat"
                              foreach f of local fnames {
                                 local name: subinstr local f ".dat" ""
                                 file open datfile using "`f'", read text
                                 file read datfile line
                                 while !r(eof) {
                                        if substr(`"`line'"', 1, 10) == "Mean VO2 :" {
                                            local vo2: word 4 of `line'
                                        }
                                        else if substr(`"`line'"', 1, 10) == "Mean VCO2:" {
                                            local vco2: word 3 of `line'
                                        }
                                        else if substr(`"`line'"', 1, 25) == "Mean Energy Expenditure =" {
                                            local ee: word 5 of `line'
                                       }
                                       file read datfile line
                                  }
                                  post handle (`y') (`"`name'"') (`vo2') (`vco2') (`ee')
                          ​        file close datfile
                              }
                              cd ..
                          }
                          postclose handle
                          This code is incomplete because it does not actually verify that vo2, vco2 and ee have all been found in each file, and does nothing about it if they have not. So it needs some refinements, but perhaps you can work on that. Also, I don't guarantee that I have correctly counted the number of characters for each of those three key phrases--it's getting late and my eyes are getting blurry.

                          Needless to say, if some of the .DAT files have variants on these key text strings, varying in capitalization or placement of the colon or spaces, then this code will break. But if they at least adhere consistently to the use of those key strings, this should work for the .DAT files.

                          Finally, you have my deep sympathies for having to do this project. The clinical world is generally oblivious to the world of automated data processing (and, in my opinion, it hasn't improved much with the spreading adoption of EHRs) and we who work with clinical data are often left to doing tedious, error-prone manual work like this.

                          Comment


                          • #14
                            If the format of these Excel files can change, then you should think about adding some checks to make sure that your assumptions about the data are valid. My solution can handle an arbitrary number of rows with measurement data since it keeps only rows with a number in column A. Here's an expanded example with some checks. I also rewrote the condition to show that I'm dropping rows that do not start with a number in column A.

                            Copy and paste this example in a do-file and then save it in the same directory that includes the home directory for this data. Change Stata's current directory (help cd) to the one that contains the do-file. You'll need to adjust the directory name in the filelist command to match the name of the data directory. All you need to do then is to run the do-file.

                            Code:
                            * use -filelist- from SSC (to install, type "ssc install filelist" in Stata's command window)
                            * to create a dataset with the filenames of all the Excel docs in the "roman" directory
                            filelist, dir("roman") pattern("*.xls") save("xls_files.dta") replace
                            local nfiles = r(N)
                            
                            * initialize a dataset to hold the results
                            clear
                            save "results.dta", replace emptyok
                            
                            * loop over each Excel file
                            forvalues i=1/`nfiles' {
                            
                                use "xls_files.data" in `i', clear
                                local f = dirname + "/" + filename
                                
                                import excel using "`f'", clear
                                
                                // consistency checks
                                count if strpos(A, "Bin")
                                assert r(N) == 1
                                count if strpos(F, "VO2")
                                assert r(N) == 1
                                count if strpos(G, "VCO2")
                                assert r(N) == 1
                                count if strpos(I, "EE")
                                assert r(N) == 1
                                
                                destring *, replace force
                                
                                // drop observations without a number in column A
                                drop if mi(A)
                                
                                collapse F G I (count) nobs = A
                                gen filename = "`f'"
                                append using "results.dta"
                                save "results.dta", replace
                                
                            }

                            Comment


                            • #15
                              Robert Picard In #12, the screenshot of the spreadsheet shows that cell A42 contains a number, but the rest of that row does not conform to the layout needed for this to work. It isn't clear from the screen shot exactly what is in columns F, G, and I in that row, but they aren't the values of VO2 VCO2 and EE. So I think that example breaks your code.in #14.

                              Comment

                              Working...
                              X