Importing several excel files: matching columns

Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#16

21 Jun 2019, 00:05

I'm not yet done, but a few remarks, from what I have seen in the downloaded files.

And one remark to begin with: you need -capture copy- instead of simply -copy- because there can be missing files.

The files:
-There is no Excel data before 2003 (there were PDF files before, but I have no simple way to import that). There is never data on week-ends, so 5 files per week. But there are many holes before 2008.
-The years 2008 to 2018 are complete (one file per week day monday-friday, 2870 files), but during the download, the server mysteriously failed on a few files. A second attempt on the missing files succeeded. It's not an uncommon problem when downloading many files.

The data:
- Apparently the variables do not change, there are always 11 columns. But not always A-K: it starts at column A, but there are sometimes merged columns, and not always the same. This will likely be quite annoying.
- The title section has randomly merged cells, and is not really usable. This isn't a problem for the import, as we are only looking at the data.
- The data section starts on a row that changes, and I have yet to check all files (by program), but there seems to be a pattern (all recent years start at 6). It can be at least 6, 8, 14.
- The sheet name changes too.
- The end of data has a marker, and there is a disclaimer below (both in merged cells).
- The data section may be empty, and there is a NIL marker.
- The data types are going to be a problem: there are numeric columns with (very often) some numbers stored as string. Stata will likely import as string and we will have to destring. There is also a the currency in some columns (HKD), so we will have to deal with this when using destring.
- The purchase method is almost always "Exchange". But for instance on 2012-12-21 I get a row with a long description instead: "On request from shareholders, the Company, by private arrangement, buys back shares of less than 1 unit at the closing price quoted at the Tokyo Stock Exchange on 14 December 2012. The repurchase of shares was related to the adoption of Share Unit System implemented on 1 October 2012, according to the Company’s announcement dated 14 September 2012."
This can still be stored in a Stata string, but it will not be pretty. So maybe we should replace this by a label or a note number, especially if there are not too many cases.

Overall, the data format is rather simple, with the usual quirks. My next step (tonight I think): a Python program to at least find the first and last data row, and check the sheet name. I may as well do all the conversion in Python and write a csv file that will be easier to import. The reason is: it's simple in Python to read the file cell by cell and convert string to numeric on the fly, and find long strings where they are. In Stata I would import whole columns, hence problems are also spread on whole columns.

Last edited by Jean-Claude Arbaut; 21 Jun 2019, 00:15.
1 like
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#17

23 Jun 2019, 08:31

See the program share_repurchases_reports.do at https://github.com/jcarbaut/import_hkexnews_data.

Checked on the range 2008-01-01 to 2018-12-31.

I urge you to read all comments and the structure of the program before running it. Also, you must have first downloaded the files in a directory, and cd to this directory in Stata (to download you can use the code above).
There is still some manual work to do, but I can't do it for you: for instance, there are some notes added by hand in the files, and removing them may be a problem. There are also revised entries, which may lead to duplicates.

To ease checks, in the final data file I leave some original variables ("letter" variables, which follow the column names in Excel).
The notes which are alone in a row should be all in the srrpt_notes file, together with the file date, so you can find where they come from, and match them with the srrpt_data file.
Some notes are appended to data cells, and I didn't attempt to correct this by program: the transformed variables should have missing values, you have to check with the original ones. Use -browse if mi(...)- to view the problems.

Structure of the program:

It's mainly a Mata program. All Excel files are read as string, cleaned and the data are appended to a Mata string array. In the end, the string array is pushed to a Stata dataset, where conversions to numeric/date are made. The noted are appended to another string array.
The main function is readfiles, and you pass 4 arguments: two numeric (td) dates, and two variables which are overwritten with the data and the notes.

Data cleansing:
* First, look for the data rows, using the first column. The nonempty cells are 1/ a date 2/ 'Company' 3/ data (company names 4/ notes 5/ disclamer. This is checked, and the "end of report" tag is also found. (function findrows)
* Remove all non data rows. Remove empty rows and empty columns. If there are still to many rows, it means there were merged cells, so move everything to the left (overwriting empty cells).

Merged cells: in Excel, when you merge cells, the underlying cells are still here, and all but one are empty (or possibly all are empty). Stata imports the underlying cells, so you end up with to many columns, and empty cells all over the place. There is an extra trick: in some files, you get two columns with all pairs of cells merged, on each row. But the data is not necessarily in the same (say, all in the left one or all in the right one). Quite often the data end up unaligned, so the step to remove empty columns is not enough, you still have to "push to the left" to compress the columns to 11 columns as expected. There is a problem with this approach: sometimes there are empty cells in the data, and this is not due to merged cells. Usually missing data is replaced with "-" or "N/A", but sometimes the cell is really empty. Luckily, it does not seem to happen when there are merged cells (there is a check in the code), at least not in the range 2008-2018.

After this is done, you have a rectangular array, with 11 columns, and as many rows as there are companies. Append this to an array the holds the data from all files, together with a 12th column: the file date. The notes are in another array.

Feel free to ask for clarification if there is any problem, and keep in mind that there is still some work.

Hope this helps,

Jean-Claude Arbaut

PS: Here is a copy of the exact error message when downloading a file fails. This might help other users googling for an answer.
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderE xception: unable to find valid certification path to requested target
r(5100);
As explained in my previous answer on 20/06, it's due to a missing root certificate in the Java repository inside the Stata install directory.

Last edited by Jean-Claude Arbaut; 23 Jun 2019, 08:36.
1 like
Comment
MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#18

02 Jul 2019, 05:33

Dear Jean-Claude, thank you so much for your help! Really appreciate it.
Comment
MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#19

15 Jul 2019, 06:50

Dear Jean-Claude, the last couple of weeks I was busy working on another project. The coming weeks I will, again, be working on this Hong Kong project.

When trying to replicate your steps, I failed to add the root certificate to Java certificate store.

By looking via Firefox, I, indeed, found that the certificate needed is "Sectigo RSA Organization Validation Secure Server CA", after downloading the link you provided, and typing this command line:

Code:

set JRE=/Applications/Stata/utilities/java/macrosx-x64/jre1.8.0_202.jre %JRE%/bin/keytool -keystore "%JRE%/lib/security/cacerts" importcert -alias sectigo -storepass changeit -file SectigoRSAOVBundle.pem

this showed up: -set JRE- not allowed; 'JRE' not recognized

I already updated and restarted my Stata. Any suggestions? Thank you in advance.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#20

15 Jul 2019, 07:34

Let me take a try at modifying Jean-Claude Arbaut's Windows-based advice from post #15 above to work for macOS. In my case, I'm running Stata 15.1 on macOS Mojave 10.14.5.

First of all, I think you typed the commands into Stata. That is not what's needed. They are shell commands and need to be typed into the window opened by the macOS Terminal app (found in the "Other" group in Launchpad).

One problem is that macOS offers a choice of shells, and I no longer know what the current default is for the Terninal app, having frozen myself in time with tcsh aeons ago. But tchs was never a macOS installation default, so I'm going to assume your shell is bash. You can confirm this by typing the ps command at the command line and seeing what it tells you is running. If it is not bash, then type the command bash to start bash as a subshell. Then issue the following two commands as shown (no spaces around the "=" in particular).

Code:

JRE=/Applications/Stata/utilities/java/macosx-x64/jre1.8.0_202.jre/Contents/Home $JRE/bin/keytool -keystore "$JRE/lib/security/cacerts" -importcert -alias sectigo -storepass changeit -file SectigoRSAOVBundle.pem

That should do what's needed. I can't confirm it, because I don't want to actually install the certificate, but it gets me to the point where keytool tells me it cannot find the SectigoRSAOVBundle.pem file. Once you've done this, you can quit from the Terminal app, and if Stata is running, quit it and relaunch it.

Last edited by William Lisowski; 15 Jul 2019, 07:41.
1 like
Comment

MIchael Jefferson

Join Date: Feb 2019
Posts: 36

#21

15 Jul 2019, 14:23

Thank you William! I indeed typed the commands into Stata. Typing the following code in the Terminal app solved the issue:

Code:

bash

JRE=/Applications/Stata/utilities/java/macosx-x64/jre1.8.0_202.jre/Contents/Home

$JRE/bin/keytool -keystore "$JRE/lib/security/cacerts" -importcert -alias sectigo -storepass changeit -file /Users/x/Documents/SectigoRSAOVBundle.pem

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#22

15 Jul 2019, 14:30

Thank you for closing the loop, and for pointing out that the full path to the downloaded certificate file is required - the one part of the code I didn't test!
Comment
MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#23

17 Jul 2019, 03:52

As suggested by Jean-Claude in post #15, I used the following code (capture copy instead of copy since there are missing files):

Code:

loc mdays 31 28 31 30 31 30 31 31 30 31 30 31 forv y=2010/2018 { loc leap=mod(`y',400)==0 | (mod(`y',4)==0 & mod(`y',100)~=0) forv m=1/12 { loc maxd :word `m' of `mdays' if `m'==2 & `leap' loc ++maxd loc mm :display %02.0f `m' forv d=1/`maxd' { loc dd :display %02.0f `d' di "https://www.hkexnews.hk/reports/sharerepur/documents/SRRPT`y'`mm'`dd'.xls" capture copy "https://www.hkexnews.hk/reports/sharerepur/documents/SRRPT`y'`mm'`dd'.xls" . } } }

My question is: where are these copied files now? And how do I open them in Stata?

This is was Jean-Claude suggested in post #17:

Originally posted by Jean-Claude Arbaut View Post

Also, you must have first downloaded the files in a directory, and cd to this directory in Stata (to download you can use the code above).

But I am not sure how to do this. Any suggestions?
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#24

17 Jul 2019, 05:17

The files are downloaded in the directory ".", which is the current directory. You can change the directory before downloading files by typing "cd /a/path/you/choose", and you can check the current directory with "cd" alone.

You may open the files with Stata (with the -import excel- command), but the program I put on GitHub does this for you (remember you have to change the start and end dates in the line "readfiles(17532,21549,alldata=.,allnotes=.)" of this program - the numbers are %td dates). You will also have to check the downloading program to match the same dates. Actually, currently the download starts at 2010, while the import starts at 2008 (17532 = 01jan2008). My bad.
However, if there is a missing file, the import program will report this and continue without throwing an error.

Last edited by Jean-Claude Arbaut; 17 Jul 2019, 05:34.
1 like
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment