Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Error importing xml files

    I am having trouble importing a series of xml files to Stata. I have pasted the information about my Stata version below. I also pasted the code I attempted to use to import the files and the error messages I received. Finally, I have pasted a link to the page with the links to download the files I am trying to import. (I tried 2015 and 2016 so far)


    Stata/SE 14.2 for Windows (64-bit x86-64)
    Revision 09 Jan 2017


    . xmluse "D:\Google Drive\Work\McCourt\Gordon\Caitlin+Nora\OIRA xml files\EO_RULE_COMPLETED_2015.xml"
    unrecognizable XML doctype
    r(198);

    . xmluse "D:\Google Drive\Work\McCourt\Gordon\Caitlin+Nora\OIRA xml files\EO_RULE_COMPLETED_2015.xml", doctype(dta)
    unrecognizable XML doctype
    r(198);

    . xmluse "D:\Google Drive\Work\McCourt\Gordon\Caitlin+Nora\OIRA xml files\EO_RULE_COMPLETED_2015.xml", doctype(excel)
    unrecognizable XML doctype
    r(198);


    https://www.reginfo.gov/public/do/XMLReportList

  • #2
    The xmluse command can only read Excel-format XML or Stata-format XML. Your files are neither so you have to find another solution. There are several online XML to csv converters out there but I prefer the sport of parsing such files in Stata. The important thing to know is that XML files are plain old text files that can be opened using Stata's editor or any other text file editor.

    The following reads in your "EO_RULE_COMPLETED_2015.xml" into Stata using the infix command. I read columns 1 to 500 into a single variable called line. This is more than needed to read the longest line in the data. The leftalign command is available from SSC and makes it easier to read string data like this. You can install it by typing in Stata's Command window:
    Code:
    ssc install leftalign
    The information you are looking for is grouped by <REGACT> tags. There's some tweaking needed to combine content split over more than one observation. Then you generate the tag variable to hold the name of the XML tag and content to hold the stored content. This is easier to do using regular expressions.

    Finally, you reshape to a wide layout with tags for the variable names.

    Code:
    * read the whole file, each line is an observation
    clear
    infix str line 1-500 using "EO_RULE_COMPLETED_2015.xml"
    compress
    leftalign
    
    * group observations based on <REGACT> tags, drop header and footer tags
    gen regact = sum(line == "<REGACT>")
    drop if regact == 0
    drop if line == "<REGACT>" | line == "</REGACT>"
    drop if line == "</OIRA_DATA>"
    
    * combine content broken over several lines
    gen obsid = _n
    gen id = sum(regexm(line,"^<[^/]"))
    bysort id (obsid): gen N = _N
    list line if N > 1, sepby(id)
    replace line = line[_n-1] + " " + line if substr(line,1,1) != "<" | substr(line,1,2) == "</"
    by id: keep if _n == _N
    list line if N > 1, sepby(id)
    
    * extract the tag and content
    gen tag = lower(regexs(1)) if regexm(line,"^<([^>]+)")
    gen content = trim(regexs(1)) if regexm(line,">([^<]+)")
    leftalign
    
    * convert to a wide layout with tags converted to variable names
    drop line obsid id N
    reshape wide content, i(regact) j(tag) string
    compress
    rename content* *

    Comment


    • #3
      Thank you, Robert. I will try this.

      Comment


      • #4
        Just to help those who have not downloaded the XML file understand the issues, here's the first two groups of data in the XML file, with the REGACT tags in bold.
        Code:
        <REGACT>
        <AGENCY_CODE>0412</AGENCY_CODE>
        <RIN>0412-AA71</RIN>
        <TITLE>Partner Vetting in USAID Assistance</TITLE>
        <STAGE>Final Rule</STAGE>
        <ECONOMICALLY_SIGNIFICANT>No</ECONOMICALLY_SIGNIFICANT>
        <DATE_RECEIVED>2014-09-11</DATE_RECEIVED>
        <LEGAL_DEADLINE>None</LEGAL_DEADLINE>
        <DATE_COMPLETED>2015-04-24</DATE_COMPLETED>
        <DECISION>Consistent with Change</DECISION>
        <DATE_PUBLISHED>2015-06-26</DATE_PUBLISHED>
        <HEALTH_CARE_ACT>No</HEALTH_CARE_ACT>
        <DODD_FRANK_ACT>No</DODD_FRANK_ACT>
        <INTERNATIONAL_IMPACTS>Yes</INTERNATIONAL_IMPACTS>
        </REGACT>
        <REGACT>
        <AGENCY_CODE>0420</AGENCY_CODE>
        <RIN>0420-AA26</RIN>
        <TITLE>Eligibility and Standards for Peace Corps Volunteer Service Regarding Procedures</TITLE>
        <STAGE>Proposed Rule</STAGE>
        <ECONOMICALLY_SIGNIFICANT>No</ECONOMICALLY_SIGNIFICANT>
        <DATE_RECEIVED>2014-07-01</DATE_RECEIVED>
        <LEGAL_DEADLINE>None</LEGAL_DEADLINE>
        <DATE_COMPLETED>2015-07-24</DATE_COMPLETED>
        <DECISION>Consistent with Change</DECISION>
        <DATE_PUBLISHED>2015-07-31</DATE_PUBLISHED>
        <HEALTH_CARE_ACT>No</HEALTH_CARE_ACT>
        <DODD_FRANK_ACT>No</DODD_FRANK_ACT>
        </REGACT>
        and here is how my code converted these:
        Code:
        . list in 1/2
        
             +-------------------------------------------------------------------------+
          1. |  regact   |  agency~e   |  date_com~d   |  date_pub~d   |  date_rec~d   |
             |       1   |  0412       |  2015-04-24   |  2015-06-26   |  2014-09-11   |
             |-------------------------+-----------------------------------------------|
             | decision                | dodd_f~t  | econom~t  | health~t  | intern~s  |
             | Consistent with Change  | No        | No        | No        | Yes       |
             |-------------------------------------------------------------------------|
             |      legal_~e       |       rin             |       stage               |
             |      None           |       0412-AA71       |       Final Rule          |
             |-------------------------------------------------------------------------|
             | title                                                                   |
             | Partner Vetting in USAID Assistance                                     |
             +-------------------------------------------------------------------------+
        
             +-------------------------------------------------------------------------+
          2. |  regact   |  agency~e   |  date_com~d   |  date_pub~d   |  date_rec~d   |
             |       2   |  0420       |  2015-07-24   |  2015-07-31   |  2014-07-01   |
             |-------------------------+-----------------------------------------------|
             | decision                | dodd_f~t  | econom~t  | health~t  | intern~s  |
             | Consistent with Change  | No        | No        | No        |           |
             |-------------------------------------------------------------------------|
             |      legal_~e       |       rin             |       stage               |
             |      None           |       0420-AA26       |       Proposed Rule       |
             |-------------------------------------------------------------------------|
             | title                                                                   |
             | Eligibility and Standards for Peace Corps Volunteer Service Regarding.. |
             +-------------------------------------------------------------------------+

        Comment


        • #5
          Robert, thanks for this great response and code. I was able to adapt it nicely to analyze some of my own personal training data from my Garmin sport watch in order to troubleshoot some problems I'm having with my performance metrics and show it to the company. Thank you kindly.

          Comment

          Working...
          X