I'm not yet done, but a few remarks, from what I have seen in the downloaded files.
And one remark to begin with: you need -capture copy- instead of simply -copy- because there can be missing files.
The files:
-There is no Excel data before 2003 (there were PDF files before, but I have no simple way to import that). There is never data on week-ends, so 5 files per week. But there are many holes before 2008.
-The years 2008 to 2018 are complete (one file per week day monday-friday, 2870 files), but during the download, the server mysteriously failed on a few files. A second attempt on the missing files succeeded. It's not an uncommon problem when downloading many files.
The data:
- Apparently the variables do not change, there are always 11 columns. But not always A-K: it starts at column A, but there are sometimes merged columns, and not always the same. This will likely be quite annoying.
- The title section has randomly merged cells, and is not really usable. This isn't a problem for the import, as we are only looking at the data.
- The data section starts on a row that changes, and I have yet to check all files (by program), but there seems to be a pattern (all recent years start at 6). It can be at least 6, 8, 14.
- The sheet name changes too.
- The end of data has a marker, and there is a disclaimer below (both in merged cells).
- The data section may be empty, and there is a NIL marker.
- The data types are going to be a problem: there are numeric columns with (very often) some numbers stored as string. Stata will likely import as string and we will have to destring. There is also a the currency in some columns (HKD), so we will have to deal with this when using destring.
- The purchase method is almost always "Exchange". But for instance on 2012-12-21 I get a row with a long description instead: "On request from shareholders, the Company, by private arrangement, buys back shares of less than 1 unit at the closing price quoted at the Tokyo Stock Exchange on 14 December 2012. The repurchase of shares was related to the adoption of Share Unit System implemented on 1 October 2012, according to the Company’s announcement dated 14 September 2012."
This can still be stored in a Stata string, but it will not be pretty. So maybe we should replace this by a label or a note number, especially if there are not too many cases.
Overall, the data format is rather simple, with the usual quirks. My next step (tonight I think): a Python program to at least find the first and last data row, and check the sheet name. I may as well do all the conversion in Python and write a csv file that will be easier to import. The reason is: it's simple in Python to read the file cell by cell and convert string to numeric on the fly, and find long strings where they are. In Stata I would import whole columns, hence problems are also spread on whole columns.
And one remark to begin with: you need -capture copy- instead of simply -copy- because there can be missing files.
The files:
-There is no Excel data before 2003 (there were PDF files before, but I have no simple way to import that). There is never data on week-ends, so 5 files per week. But there are many holes before 2008.
-The years 2008 to 2018 are complete (one file per week day monday-friday, 2870 files), but during the download, the server mysteriously failed on a few files. A second attempt on the missing files succeeded. It's not an uncommon problem when downloading many files.
The data:
- Apparently the variables do not change, there are always 11 columns. But not always A-K: it starts at column A, but there are sometimes merged columns, and not always the same. This will likely be quite annoying.
- The title section has randomly merged cells, and is not really usable. This isn't a problem for the import, as we are only looking at the data.
- The data section starts on a row that changes, and I have yet to check all files (by program), but there seems to be a pattern (all recent years start at 6). It can be at least 6, 8, 14.
- The sheet name changes too.
- The end of data has a marker, and there is a disclaimer below (both in merged cells).
- The data section may be empty, and there is a NIL marker.
- The data types are going to be a problem: there are numeric columns with (very often) some numbers stored as string. Stata will likely import as string and we will have to destring. There is also a the currency in some columns (HKD), so we will have to deal with this when using destring.
- The purchase method is almost always "Exchange". But for instance on 2012-12-21 I get a row with a long description instead: "On request from shareholders, the Company, by private arrangement, buys back shares of less than 1 unit at the closing price quoted at the Tokyo Stock Exchange on 14 December 2012. The repurchase of shares was related to the adoption of Share Unit System implemented on 1 October 2012, according to the Company’s announcement dated 14 September 2012."
This can still be stored in a Stata string, but it will not be pretty. So maybe we should replace this by a label or a note number, especially if there are not too many cases.
Overall, the data format is rather simple, with the usual quirks. My next step (tonight I think): a Python program to at least find the first and last data row, and check the sheet name. I may as well do all the conversion in Python and write a csv file that will be easier to import. The reason is: it's simple in Python to read the file cell by cell and convert string to numeric on the fly, and find long strings where they are. In Stata I would import whole columns, hence problems are also spread on whole columns.
Comment