Error importing xml files

Cait Chamberlain

Join Date: Apr 2017

Posts: 2
#1

Error importing xml files

28 Apr 2017, 11:47

I am having trouble importing a series of xml files to Stata. I have pasted the information about my Stata version below. I also pasted the code I attempted to use to import the files and the error messages I received. Finally, I have pasted a link to the page with the links to download the files I am trying to import. (I tried 2015 and 2016 so far)

Stata/SE 14.2 for Windows (64-bit x86-64)
Revision 09 Jan 2017

. xmluse "D:\Google Drive\Work\McCourt\Gordon\Caitlin+Nora\OIRA xml files\EO_RULE_COMPLETED_2015.xml"
unrecognizable XML doctype
r(198);

. xmluse "D:\Google Drive\Work\McCourt\Gordon\Caitlin+Nora\OIRA xml files\EO_RULE_COMPLETED_2015.xml", doctype(dta)
unrecognizable XML doctype
r(198);

. xmluse "D:\Google Drive\Work\McCourt\Gordon\Caitlin+Nora\OIRA xml files\EO_RULE_COMPLETED_2015.xml", doctype(excel)
unrecognizable XML doctype
r(198);

https://www.reginfo.gov/public/do/XMLReportList
Tags: None
Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

28 Apr 2017, 13:31

The xmluse command can only read Excel-format XML or Stata-format XML. Your files are neither so you have to find another solution. There are several online XML to csv converters out there but I prefer the sport of parsing such files in Stata. The important thing to know is that XML files are plain old text files that can be opened using Stata's editor or any other text file editor.

The following reads in your "EO_RULE_COMPLETED_2015.xml" into Stata using the infix command. I read columns 1 to 500 into a single variable called line. This is more than needed to read the longest line in the data. The leftalign command is available from SSC and makes it easier to read string data like this. You can install it by typing in Stata's Command window:

Code:

ssc install leftalign

The information you are looking for is grouped by <REGACT> tags. There's some tweaking needed to combine content split over more than one observation. Then you generate the tag variable to hold the name of the XML tag and content to hold the stored content. This is easier to do using regular expressions.

Finally, you reshape to a wide layout with tags for the variable names.

Code:

* read the whole file, each line is an observation clear infix str line 1-500 using "EO_RULE_COMPLETED_2015.xml" compress leftalign * group observations based on <REGACT> tags, drop header and footer tags gen regact = sum(line == "<REGACT>") drop if regact == 0 drop if line == "<REGACT>" | line == "</REGACT>" drop if line == "</OIRA_DATA>" * combine content broken over several lines gen obsid = _n gen id = sum(regexm(line,"^<[^/]")) bysort id (obsid): gen N = _N list line if N > 1, sepby(id) replace line = line[_n-1] + " " + line if substr(line,1,1) != "<" | substr(line,1,2) == "</" by id: keep if _n == _N list line if N > 1, sepby(id) * extract the tag and content gen tag = lower(regexs(1)) if regexm(line,"^<([^>]+)") gen content = trim(regexs(1)) if regexm(line,">([^<]+)") leftalign * convert to a wide layout with tags converted to variable names drop line obsid id N reshape wide content, i(regact) j(tag) string compress rename content* *
1 like
Comment
Cait Chamberlain

Join Date: Apr 2017

Posts: 2
#3

28 Apr 2017, 14:46

Thank you, Robert. I will try this.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

28 Apr 2017, 15:03

Just to help those who have not downloaded the XML file understand the issues, here's the first two groups of data in the XML file, with the REGACT tags in bold.

Code:

<REGACT>
<AGENCY_CODE>0412</AGENCY_CODE>
<RIN>0412-AA71</RIN>
<TITLE>Partner Vetting in USAID Assistance</TITLE>
<STAGE>Final Rule</STAGE>
<ECONOMICALLY_SIGNIFICANT>No</ECONOMICALLY_SIGNIFICANT>
<DATE_RECEIVED>2014-09-11</DATE_RECEIVED>
<LEGAL_DEADLINE>None</LEGAL_DEADLINE>
<DATE_COMPLETED>2015-04-24</DATE_COMPLETED>
<DECISION>Consistent with Change</DECISION>
<DATE_PUBLISHED>2015-06-26</DATE_PUBLISHED>
<HEALTH_CARE_ACT>No</HEALTH_CARE_ACT>
<DODD_FRANK_ACT>No</DODD_FRANK_ACT>
<INTERNATIONAL_IMPACTS>Yes</INTERNATIONAL_IMPACTS>
</REGACT>
<REGACT>
<AGENCY_CODE>0420</AGENCY_CODE>
<RIN>0420-AA26</RIN>
<TITLE>Eligibility and Standards for Peace Corps Volunteer Service Regarding Procedures</TITLE>
<STAGE>Proposed Rule</STAGE>
<ECONOMICALLY_SIGNIFICANT>No</ECONOMICALLY_SIGNIFICANT>
<DATE_RECEIVED>2014-07-01</DATE_RECEIVED>
<LEGAL_DEADLINE>None</LEGAL_DEADLINE>
<DATE_COMPLETED>2015-07-24</DATE_COMPLETED>
<DECISION>Consistent with Change</DECISION>
<DATE_PUBLISHED>2015-07-31</DATE_PUBLISHED>
<HEALTH_CARE_ACT>No</HEALTH_CARE_ACT>
<DODD_FRANK_ACT>No</DODD_FRANK_ACT>
</REGACT>

and here is how my code converted these:

Code:

. list in 1/2

     +-------------------------------------------------------------------------+
  1. |  regact   |  agency~e   |  date_com~d   |  date_pub~d   |  date_rec~d   |
     |       1   |  0412       |  2015-04-24   |  2015-06-26   |  2014-09-11   |
     |-------------------------+-----------------------------------------------|
     | decision                | dodd_f~t  | econom~t  | health~t  | intern~s  |
     | Consistent with Change  | No        | No        | No        | Yes       |
     |-------------------------------------------------------------------------|
     |      legal_~e       |       rin             |       stage               |
     |      None           |       0412-AA71       |       Final Rule          |
     |-------------------------------------------------------------------------|
     | title                                                                   |
     | Partner Vetting in USAID Assistance                                     |
     +-------------------------------------------------------------------------+

     +-------------------------------------------------------------------------+
  2. |  regact   |  agency~e   |  date_com~d   |  date_pub~d   |  date_rec~d   |
     |       2   |  0420       |  2015-07-24   |  2015-07-31   |  2014-07-01   |
     |-------------------------+-----------------------------------------------|
     | decision                | dodd_f~t  | econom~t  | health~t  | intern~s  |
     | Consistent with Change  | No        | No        | No        |           |
     |-------------------------------------------------------------------------|
     |      legal_~e       |       rin             |       stage               |
     |      None           |       0420-AA26       |       Proposed Rule       |
     |-------------------------------------------------------------------------|
     | title                                                                   |
     | Eligibility and Standards for Peace Corps Volunteer Service Regarding.. |
     +-------------------------------------------------------------------------+

Comment

Michael Costello

Join Date: Dec 2015

Posts: 30
#5

30 Jan 2021, 05:30

Robert, thanks for this great response and code. I was able to adapt it nicely to analyze some of my own personal training data from my Garmin sport watch in order to troubleshoot some problems I'm having with my performance metrics and show it to the company. Thank you kindly.
Comment

Announcement

Error importing xml files

Comment

Comment

Comment

Comment