Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing several excel files: matching columns

    Hi everyone,

    I imported over 250 excel files and appended them all together to 1 dta file, by using this code:


    clear all
    ssc install xls2dta
    xls2dta, save("/Users/abc/DATA/SRRPT") : import excel "/Users/abc/DATA/SRRPT"

    xls2dta, save ("/Users/abc/DATA/SRRPT/SRRPT2.dta") : append, force
    use SRRPT2
    However, now I am facing the problem that the variables I want to use are not in the same column in the excel files. For example: the variable -Price per Share- is in Column E in some excel files, while it is in Column F in other excel files. Is there a possibility to get them all in the same column?

    Thank you so much.

  • #2
    Is it the case that the first row of data in each Excel worksheet (not necessarily the first row of the worksheet, there could be blank rows before) contains column titles such as "Price per Share"? In that case, using the
    Code:
    importopts(firstrow)
    option in the appropriate location should use the column titles (suitably altered) as variable names, and append will then match the variable names when appending. But since I don't use xls2dta I'm not quite sure where the appropriate location for that option is in your code.

    Comment


    • #3
      xls2dta is (most likely) from SSC. The importopts() option is only required to combine Excel files on-the-fly. Using the two-step approach in #1, the firstrow option can simply be added to the import excel call.

      Code:
      cd "/Users/abc/DATA/SRRPT"
      xls2dta : import excel . , firstrow
      xls2dta : append
      Best
      Daniel
      Last edited by daniel klein; 13 Jun 2019, 11:52.

      Comment


      • #4
        Thank you very much for your responses!

        Unfortunately, in my case the first row of data in each excel worksheet (not necessarily the first row of the worksheet, there are blank rows before) contains the title of the worksheet. The second row of data contains column titles such as "price per share" that I want to use. Therefore, as expected, when I try the code of Daniel Klein, the appended file uses the title of the worksheet as a variable.

        Is there a possibility to drop the titles of the worksheets before starting to append based on the column titles? Or is there a possibility to append based on "second row"? I couldn't find these options.

        Thank you in advance.

        Comment


        • #5
          It would be a lot easier to assist if you gave us a data example. Use dataex to show us an excerpt of one of the imported files.

          In general, you can apply any (series of) commands to the imported files. You could, for example, drop the first line, which contains the worksheet name, like this

          Code:
          xls2dta : import excel .
          xls2dta : xeq drop in 1
          If the original question implies data like this

          Code:
          . clear
          
          . input E F
          
          E F
          1. 42 .
          2. . 73
          3. end
          
          . list
          
          +---------+
          | E F |
          |---------|
          1. | 42 . |
          2. | . 73 |
          +---------+
          then one way of getting what I believe is wanted is this

          Code:
          generate EF = cond(mi(E), F,E)
          which creates

          Code:
          . generate EF = cond(mi(E), F, E)
          
          . list
          
               +--------------+
               |  E    F   EF |
               |--------------|
            1. | 42    .   42 |
            2. |  .   73   73 |
               +--------------+
          Best
          Daniel
          Last edited by daniel klein; 14 Jun 2019, 02:51.

          Comment


          • #6
            Thank you very much for your response.

            Below, I have two data examples. These are both excel files imported to Stata.

            I want to append these excel files based on variable names "Company", "Stock Code", "Sec. Type", "Trading Date (yyyy/mm/dd)", "Shares purchased", "Unit or Highest Price ($)", "Lowest Unit Price ($)", "Purchase Method" and "Total No. of Shares Purchased on the Exchange in the Year to Date (since Ord. Resolution)".

            As you can see below, the columns/rows where these variable names are placed differ.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str266 A str5 B str4 C str12 D str9 E str10 F str23 G str17 H str19 I str89 J
            "Date Printed : 12/12/2007" ""      ""     ""             ""          ""          "Share Repurchase Report" ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         "Total No. of Shares Purchased on the Exchange in the Year to Date (since Ord. Resolution)"
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          "Unit Or"   "Lowest"                  ""                  ""         ""                                                                                        
            "Company"                   "Stock" "Sec." "Trading Date" "Shares"    "Highest"   "Unit"                    "Total Paid"        "Purchase" ""                                                                                        
            ""                          "Code"  "Type" "(yyyy/mm/dd)" "Purchased" "Price ($)" "Price ($)"               "($)"               "Method"   ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         "(a)"                                                                                      
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            ""                          ""      ""     ""             ""          ""          ""                        ""                  ""         ""                                                                                        
            "CH GRAND FOREST"           "910"   "ORD"  "11dec2007"    "11452000"  "HKD 1.92"  "HKD 1.88"                "HKD 21,759,760.00" "Exchange" "24132000"                                                                                
            "CHINA METAL"               "319"   "ORD"  "11dec2007"    "272000"    "HKD 2.48"  "HKD 2.35"                "HKD 646,200.00"    "Exchange" "6356000"                                                                                  
            "DAWNRAYS PHARMA"           "2348"  "ORD"  "11dec2007"    "248000"    "HKD 0.74"  "HKD 0.73"                "HKD 181,520.00"    "Exchange" "7632000"                                                                                  
            "I-CABLE COMM"              "1097"  "ORD"  "11dec2007"    "223000"    "HKD 1.59"  "HKD 1.57"                "HKD 352,270.00"    "Exchange" "2193000"                                                                                  
            "IPE GROUP"                 "929"   "ORD"  "11dec2007"    "15000"     "HKD 1.02"  "  -  "                   "HKD 15,300.00"     "Exchange" "3275000"                                                                                  
            "PERFECTECH INTL"           "765"   "ORD"  "11dec2007"    "362000"    "HKD 0.90"  "HKD 0.88"                "HKD 320,560.00"    "Exchange" "1470000"                                                                                  
            "REGAL INT'L"               "78"    "ORD"  "11dec2007"    "7246000"   "HKD 0.61"  "HKD 0.60"                "HKD 4,386,460.00"  "Exchange" "109498000"                                                                                
            end
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str266 A str10 B str9 C byte D str25 E str16 F str23 G str9 H str25 I byte J str21 K str17 L str14 M byte(N O)
            ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          . ""                      ""                  ""               . .
            "Date Printed : 20/09/2011" ""           ""          . ""                          ""                 "Share Repurchase Report" ""          ""                          . ""                      ""                  ""               . .
            ""                          ""           ""          . "Trading Date (yyyy/mm/dd)" ""                 ""                        ""          ""                          . ""                      ""                  ""               . .
            "Company"                   "Stock
            Code" "Sec.
            Type" . ""                          "Shares
            Purchased" ""                        ""          "Unit or
            Highest
            Price ($)" . "Lowest
            Unit
            Price ($)" ""                  "Total Paid
            ($)" . .
            ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          . ""                      ""                  ""               . .
            ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          . ""                      ""                  ""               . .
            ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          . ""                      ""                  ""               . .
            "ANXIN-CHINA"               "1149"       "ORD"       . "2011/09/19"                "6,752,000"        ""                        "HKD 1.42"  ""                          . "HKD 1.37"              "HKD 9,412,520.00"  ""               . .
            "C C LAND"                  "1224"       "ORD"       . "2011/09/19"                "1,289,000"        ""                        "HKD 1.61"  ""                          . "HKD 1.60"              "HKD 2,070,290.00"  ""               . .
            "CASH FIN SER GP"           "510"        "ORD"       . "2011/09/19"                "1,590,000"        ""                        "HKD 0.118" ""                          . "HKD 0.112"             "HKD 181,062.00"    ""               . .
            "CHIHO-TIANDE"              "976"        "ORD"       . "2011/09/19"                "1,298,000"        ""                        "HKD 4.40"  ""                          . "HKD 4.32"              "HKD 5,705,540.00"  ""               . .
            "CHINA WATER"               "855"        "ORD"       . "2011/09/19"                "1,660,000"        ""                        "HKD 2.64"  ""                          . "HKD 2.59"              "HKD 4,345,140.00"  ""               . .
            "CST MINING"                "985"        "ORD"       . "2011/09/19"                "11,776,000"       ""                        "HKD 0.142" ""                          . "HKD 0.139"             "HKD 1,661,593.60"  ""               . .
            "FIRST PACIFIC"             "142"        "ORD"       . "2011/09/19"                "500,000"          ""                        "HKD 7.02"  ""                          . "HKD 6.85"              "HKD 3,465,900.00"  ""               . .
            "GCL-POLY ENERGY"           "3800"       "ORD"       . "2011/09/19"                "5,000,000"        ""                        "HKD 2.79"  ""                          . "-"                     "HKD 13,950,000.00" ""               . .
            "HIGH FASHION"              "608"        "ORD"       . "2011/09/19"                "20,000"           ""                        "HKD 3.03"  ""                          . "-"                     "HKD 60,600.00"     ""               . .
            "HSBCDRAGON FUND"           "820"        "FUND"      . "2011/09/19"                "5,000"            ""                        "HKD 6.668" ""                          . "-"                     "-"                 ""               . .
            "JOHNSON ELEC H"            "179"        "ORD"       . "2011/09/19"                "921,500"          ""                        "HKD 4.20"  ""                          . "HKD 4.17"              "HKD 3,858,375.00"  ""               . .
            "KINGMAKER"                 "1170"       "ORD"       . "2011/09/19"                "94,000"           ""                        "HKD 1.34"  ""                          . "HKD 1.32"              "HKD 125,400.00"    ""               . .
            "NET2GATHER"                "1049"       "ORD"       . "2011/09/19"                "180,000"          ""                        "HKD 0.113" ""                          . "HKD 0.112"             "HKD 20,172.00"     ""               . .
            end

            Thank you so much.
            Last edited by MIchael Jefferson; 14 Jun 2019, 03:35.

            Comment


            • #7
              Your second example does not work: I get several warnings from the input command. There is something wrong with the quotation marks.

              However, even if this was not the case, I think this will take much more time than I can afford right now. Sorry.

              Repost your second example and hope that someone on the list has the time to play around with this.

              Best
              Daniel

              Comment


              • #8
                Thank you anyway:

                Here's the second example again:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str266 A str10 B str9 C byte D str25 E str16 F str23 G str9 H str25 I byte J
                ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          .
                "Date Printed : 20/09/2011" ""           ""          . ""                          ""                 "Share Repurchase Report" ""          ""                          .
                ""                          ""           ""          . "Trading Date (yyyy/mm/dd)" ""                 ""                        ""          ""                          .
                "Company"                   "Stock
                Code" "Sec.
                Type" . ""                          "Shares
                Purchased" ""                        ""          "Unit or
                Highest
                Price ($)" .
                ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          .
                ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          .
                ""                          ""           ""          . ""                          ""                 ""                        ""          ""                          .
                "ANXIN-CHINA"               "1149"       "ORD"       . "2011/09/19"                "6,752,000"        ""                        "HKD 1.42"  ""                          .
                "C C LAND"                  "1224"       "ORD"       . "2011/09/19"                "1,289,000"        ""                        "HKD 1.61"  ""                          .
                "CASH FIN SER GP"           "510"        "ORD"       . "2011/09/19"                "1,590,000"        ""                        "HKD 0.118" ""                          .
                "CHIHO-TIANDE"              "976"        "ORD"       . "2011/09/19"                "1,298,000"        ""                        "HKD 4.40"  ""                          .
                "CHINA WATER"               "855"        "ORD"       . "2011/09/19"                "1,660,000"        ""                        "HKD 2.64"  ""                          .
                "CST MINING"                "985"        "ORD"       . "2011/09/19"                "11,776,000"       ""                        "HKD 0.142" ""                          .
                "FIRST PACIFIC"             "142"        "ORD"       . "2011/09/19"                "500,000"          ""                        "HKD 7.02"  ""                          .
                "GCL-POLY ENERGY"           "3800"       "ORD"       . "2011/09/19"                "5,000,000"        ""                        "HKD 2.79"  ""                          .
                "HIGH FASHION"              "608"        "ORD"       . "2011/09/19"                "20,000"           ""                        "HKD 3.03"  ""                          .
                "HSBCDRAGON FUND"           "820"        "FUND"      . "2011/09/19"                "5,000"            ""                        "HKD 6.668" ""                          .
                "JOHNSON ELEC H"            "179"        "ORD"       . "2011/09/19"                "921,500"          ""                        "HKD 4.20"  ""                          .
                "KINGMAKER"                 "1170"       "ORD"       . "2011/09/19"                "94,000"           ""                        "HKD 1.34"  ""                          .
                "NET2GATHER"                "1049"       "ORD"       . "2011/09/19"                "180,000"          ""                        "HKD 0.113" ""                          .
                end
                Hope that someone can help me solve this issue!

                Comment


                • #9
                  In the first dataset we see column headers split over three rows of the spreadsheet.
                  Code:
                  ""                          ""      ""     ""             ""          "Unit Or"   "Lowest"                  ""                  ""         ""                                                                                        
                  "Company"                   "Stock" "Sec." "Trading Date" "Shares"    "Highest"   "Unit"                    "Total Paid"        "Purchase" ""                                                                                        
                  ""                          "Code"  "Type" "(yyyy/mm/dd)" "Purchased" "Price ($)" "Price ($)"               "($)"               "Method"   ""
                  In the second dataset we see that the column headings appear in a single row, but have embedded newline characters which cause problems for the dataex command.
                  Code:
                  ""                          ""           ""          . "Trading Date (yyyy/mm/dd)" ""                 ""                        ""          ""                          .
                  "Company"                   "Stock
                  Code" "Sec.
                  Type" . ""                          "Shares
                  Purchased" ""                        ""          "Unit or
                  Highest
                  Price ($)" .
                  which, if you were to replace the newlines with underscores would look like the following.
                  Code:
                  ""                          ""           ""          . "Trading Date (yyyy/mm/dd)" ""                 ""                        ""          ""                          .
                  "Company"                   "Stock_Code" "Sec._Type" . ""                          "Shares_Purchased" ""                        ""          "Unit or_Highest_Price ($)" .
                  I regret that I do not see any reasonable way to automate the reading and cleanup of these spreadsheets.

                  Comment


                  • #10
                    Thank you for your answer.

                    Does anyone else have an idea to solve this issue?

                    Comment


                    • #11
                      It seems your data start at row 7. If there is a similar scheme for all file, then just use the -cellrange- option of -import excel-. You can give only the top left corner of the table, e.g. A7. You can then rename variables. With a foreach or a forvalue loop, you should be able to import all Excel files, then you simply append them.

                      It's the method I use to import several Excel files that follow the same model. Sometimes there are variations between years, and I add an -if- in the loop to deal with the (few) cases. Here are the problems I typically have to deal with:

                      * The table does not start on the same row for all files: you have to first find out where to begin.
                      * The variables change, or the contents are the same, but the names change: if it's possible to identify the changes, it can be dealt with in Stata with -if- conditions.
                      * The variable format change: sometimes numeric, sometimes string (often because of some error code, or unrecognized date): it /may/ be enough to -destring- or apply a date function.

                      If there are many files, and an irregular structure, you are out of luck. But I would still try to do some automated guesswork with VBA (or Python with either pywin32 for COM connection or xlrd/openpyxl to read the dataset), and let the program write a do file to import all Excel files with the correct parameters.

                      I can't provide a complete solution, since I don't know the level of complexity your are really dealing with (are the variables always the same, for instance? Do datasets always start at A7? Do filenames follow a pattern? Does the file structure follow a pattern similar to the filename pattern? etc.)
                      Last edited by Jean-Claude Arbaut; 17 Jun 2019, 17:02.

                      Comment


                      • #12
                        I'd also ask where these excel files came from, and if there is another source. I'm surprised that data on seemingly the same info would be so horribly differently written to file. Did these excel files come from some manipulation and export from an original source? Can you access that original source?

                        Comment


                        • #13
                          Thank you for your replies!

                          Originally posted by Jorrit Gosens View Post
                          I'd also ask where these excel files came from, and if there is another source. I'm surprised that data on seemingly the same info would be so horribly differently written to file. Did these excel files come from some manipulation and export from an original source? Can you access that original source?
                          I need daily data of share repurchases in Hong Kong in the past 10 years. The website of the Hong Kong stock exchange provides this data: https://www.hkexnews.hk/reports/sharerepur/sbn.asp.
                          I downloaded all daily data excel files manually, and as stated before, I now try to append these excel files via Stata.

                          Comment


                          • #14
                            So it does look like its the original source. Then I'd say the best you can hope for is that they have kept their reporting mostly standardized, with some changes to the standard at clear points in time.
                            That would allow at least to create a script to manipulate the data into something standardized for each group, and you can try to identify at what dates they changed the reporting standard.
                            So then it is a matter of coming up with a way you want to have your data organized, figuring out what logic each group of xls sheets uses, and writing code for each group to get it into something standard. After you standardize you can append all of it.

                            Comment


                            • #15
                              If you downloaded 10 years of daily Excel files manually, there is a problem. The download links are so simple that it could be done with 3 nested loops (year, month, day).

                              When trying to download with the default Stata install I get an error because of a https certificate issue: I had the same problem with the french Insee web site, and you have to add the root certificate (seems to be Sectigo, you can check with Firefox) to the Java certificate store (the Java installed with Stata).

                              Detailed steps (on Windows, but it should be similar on another OS):

                              First, import the needed certificate. In Firefox, click on the green padlock on the left of the URL bar, then on the arrow, then "more information", then "show the certificate", then details.
                              Have a look at the hierarchy: "USERTrust RSA Certification Authority", and "Sectigo RSA Organization Validation Secure Server CA".
                              A google lookup of "Sectigo RSA Organization Validation Secure Server CA" leads to the page https://support.sectigo.com/Com_Know...A01N000000rfBO.
                              There, download the bundle in the "Organization Validation" section. After a few clicks, you get a pem file. It's this file we need to add to the certificate store.

                              In a command line, type (and change the path if needed, as well as the JRE version as it has changed with Stata updates):

                              Code:
                              set JRE=C:\APPLI\Stata15\utilities\java\windows-x64\jre1.8.0_202
                              %JRE%\bin\keytool -keystore "%JRE%\lib\security\cacerts" -importcert -alias sectigo -storepass changeit -file SectigoRSAOVBundle.pem
                              Depending on your installation, this might require administrator priviledges (and then an administrator console).

                              Then you are done. Restart Stata, and you can copy a file with:

                              Code:
                              copy https://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20190603.xls .
                              To copy several years:

                              Code:
                              loc mdays 31 28 31 30 31 30 31 31 30 31 30 31
                              forv y=2010/2018 {
                                  loc leap=mod(`y',400)==0 | (mod(`y',4)==0 & mod(`y',100)~=0)
                                  forv m=1/12 {
                                      loc maxd :word `m' of `mdays'
                                      if `m'==2 & `leap' loc ++maxd
                                      loc mm :display %02.0f `m'
                                      forv d=1/`maxd' {
                                          loc dd :display %02.0f `d'
                                          *di "https://www.hkexnews.hk/reports/sharerepur/documents/SRRPT`y'`mm'`dd'.xls"
                                          copy "https://www.hkexnews.hk/reports/sharerepur/documents/SRRPT`y'`mm'`dd'.xls" .
                                      }
                                  }
                              }
                              A note: it's possible to import directly in Stata without downloading the file on disk (import excel with the URL).
                              Ususally, I don't do that: when I create the program, as there are often bugs, I have to run the program many times, and download again, and again, and it's not kind to servers when there are many files. Even when the program runs fine, I prefer to keep a copy, in case I have something to check, or in case the server is down.

                              I'll have a look at the file structure shortly.
                              Last edited by Jean-Claude Arbaut; 20 Jun 2019, 03:00.

                              Comment

                              Working...
                              X