
  • XML file error: unrecognizable XML doctype

    Hello all,

    I have downloaded some data files in .xml format. On average, each is close to 1 gigabyte.

    I ran the command xmluse, but I got the error: unrecognizable XML doctype.

    Can this problem be fixed somehow?

    Thank you in advance.

  • #2
    Pantelis,
    have you tried:
    - xmluse filename, doctype(dta) - ?

    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Originally posted by Carlo Lazzaro
      Pantelis,
      have you tried:
      - xmluse filename, doctype(dta) - ?

      Kind regards,
      Carlo

      Hi Carlo,

      Yes, this was the first thing I tried, but to no avail. The problem persists and I don't know how to open the file. I suspect this may be related to the file being relatively large.



      • #4
        I fear the data I downloaded are in neither Excel XML nor Stata XML format.

        Is there any way to make these data readable by Stata?



        • #5
          That's like saying how do I read a data file of unspecified type into Stata.



          • #6
            Originally posted by Nick Cox
            That's like saying how do I read a data file of unspecified type into Stata.

            Dear Nick,
            the file has an .xml extension and, as far as I understand, it is an XML file. Stata can import XML (i.e. with xmluse). I have tried to make it work but, as I said before, I did not succeed.



            • #7
              We want to help you but I can't see that you have asked a question that can be answered unless you tell us more (than nothing) about the internal structure of these files. Use of a particular file extension is not itself material. Perhaps you would be better discussing this with StataCorp technical support.



              • #8
                So, I found out that the doc type is: us-patent-assignments

                I guess that xmluse cannot be used here unfortunately.
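
                One way to find the doctype of a very large file without opening it in an editor is to read only its first few hundred bytes. The following is a minimal Mata sketch, not a tested recipe: "myfile.xml" is a placeholder file name, and it assumes the XML prolog and DOCTYPE declaration fit within the first 500 bytes.

                ```stata
                mata:
                // "myfile.xml" is a placeholder; point it at one of the downloaded files
                fh = fopen("myfile.xml", "r")
                // read only the first 500 bytes (an assumed size) rather than the whole file
                head = fread(fh, 500)
                fclose(fh)
                // display what was read: the prolog, the DOCTYPE, and the start of the root element
                head
                end
                ```

                xmluse accepts only doctype(dta) and doctype(excel), so a file whose header declares anything else needs a different import route.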



                • #9
                  Perhaps usexmlex? Sergiy



                  • #10
                    Originally posted by Sergiy Radyakin
                    Perhaps usexmlex? Sergiy

                    Sergiy, I got this error:


                    t_ate-produced CDATA #IMPLIE invalid name
                    st_addvar(): 3300 argument out of range
                    usexmlex(): - function returned error
                    <istmt>: - function returned error



                    • #11
                      Pantelis:
                      -search 3300- gives back what follows:
                      error . . . . . . . . . . . . . . . . . . . . . . . . Return code 3300
                      argument out of range
                      The eltype and orgtype of the argument are correct, but
                      the argument contains an invalid value, such as if you had
                      asked for the 20th row of a 4 x 5 matrix.
                      See eltype and orgtype in help [M-6] glossary.
                      Could it be that the files you are trying to read in Stata are corrupted?

                      Kind regards,
                      Carlo
                      (Stata 18.0 SE)



                      • #12
                        The meaning of the error is literal: Stata variable names can't contain a # sign, so the import procedure couldn't create a variable with such a name. Why it tries to do so is a whole other matter. Mostly it is because the file is not compatible with the program's expectations, but perhaps in part because the code relies on the st_isname() function to judge what is permissible, which is not without its own problems. I described them a while ago in the following post:
                        http://www.statalist.org/forums/foru...variable-names

                        There was no feedback.
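
                        To illustrate the naming rule, here is a small sketch using strtoname() in Stata and st_isname() in Mata. The strings "#abc" and "my-var" are hypothetical examples chosen for illustration, not names taken from the file.

                        ```stata
                        * strtoname() replaces characters that are not allowed in a name with underscores
                        display strtoname("#abc")
                        display strtoname("my-var")

                        * st_isname() returns 1 if the string is a valid Stata name and 0 otherwise
                        mata: st_isname("#abc")
                        mata: st_isname("my_var")
                        ```

                        An import routine could run candidate names through strtoname() before calling st_addvar(), at the cost of possibly mapping two different source names to the same variable name.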

                        However, the quoted text doesn't make much sense to me: forget the # sign, the rest doesn't look like a variable name either. Perhaps you have encountered yet another XML format.
                        Perhaps check it against the specification at http://www.triple-s.org/sssxml1.htm.
                        We could keep guessing forever without seeing the file or at least a portion of it. From the limited information we have, you are likely dealing with a DTD:
                        see here: http://www.w3schools.com/dtd/default.asp
                        and here: http://en.wikipedia.org/wiki/Document_type_definition

                        Carlo's explanation is also valid. Any XML can get corrupted when mishandled.

                        "I have downloaded some data files"
                        This is not really helpful. Where did you get the files from? Are they publicly accessible? Does the source also provide documentation? Can it be approached with questions? Why are we guessing?

                        Best, Sergiy



                        • #13
                          Forgive me for not being very clear.

                          I got the data from the following site: http://patents.reedtech.com/assignment.php

                          (I downloaded the ones from 1980-2013).



                          • #14
                            Originally posted by Pantelis Kazakis
                            Forgive me for not being very clear.

                            I got the data from the following site: http://patents.reedtech.com/assignment.php

                            (I downloaded the ones from 1980-2013).

                            Well, as expected, the data provider says up front: "The file format is eXtensible Markup Language (XML) in accordance with the Patent Assignment Daily XML (PADX) Version 0.3 Document Type Definition (DTD)."

                            Probably the easiest approach would be to remove the DTD and open the files one by one in Excel, then save them in a format more accessible to Stata. My concern is that the data files have a DTD version variable, which means the version might change across files, but hopefully never within the same file. It could even be that the same version was used for all files, if it is still at 1 for 2014.

                            The beginning of the file ad20140929.xml should look like this:

                            [Image: patents.png]

                            -usexmlex- handles only much simpler XML files. It will not be able to import this one, sorry.

                            Best regards, Sergiy



                            • #15
                              These are indeed very big xml files. I downloaded "ad20131231-01.zip" (from the very bottom of http://patents.reedtech.com/assignment.php). The zip archive unzips to a single 876.2 MB file called "ad20131231-01.xml".

                              You can quickly get a sense of the content by first using filefilter to insert a line break before every "<" and then importing a subset of the observations. This worked for me (you can install leftalign by typing ssc install leftalign):

                              Code:
                              filefilter ad20131231-01.xml ad20131231-01.txt, from(<) to(\M<)
                              import delimited s using ad20131231-01.txt, clear rowrange(1:1000)
                              * leftalign is from SSC
                              leftalign
                              You can import the whole file by omitting the rowrange(1:1000) option, but that generates a very large dataset with more than 50 million observations, and the string variable must be 473 characters long to hold the longest value. Another way to load the whole file into memory quickly with less overhead is to truncate the longer strings using

                              Code:
                              infix str s 1-50 using ad20131231-01.txt, clear
                              leftalign
                              You quickly see the structure of the data, with "<patent-assignment>" identifying the start of a block of information. You can easily group observations using

                              Code:
                              gen pa = sum(s == "<patent-assignment>")
                              sum pa
                              Of course what to do next depends a lot on what you are looking for but at least this gets you started.
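
                              As one possible next step, each line can be split into the tag name and the text that follows the tag. This is only a sketch under assumptions: it presumes the variable s and the grouping variable pa created above, that every line starts with "<" after leftalign, and that the new variable names gt, tag, and value are free. With the truncated infix import, long tags or values may be cut off at 50 characters.

                              ```stata
                              * position of the first ">" on each line (0 if the tag was truncated)
                              gen gt = strpos(s, ">")
                              * tag name: the text between "<" and ">"
                              gen tag = substr(s, 2, gt - 2) if substr(s, 1, 1) == "<" & gt > 0
                              * keep only the first word, dropping any attributes inside the tag
                              replace tag = word(tag, 1)
                              * value: whatever follows the ">" on the same line
                              gen value = substr(s, gt + 1, .) if gt > 0
                              * closing tags such as </patent-assignment> carry no data of their own
                              drop if substr(tag, 1, 1) == "/"
                              * list the remaining tags by frequency
                              tab tag, sort
                              ```

                              From there, the lines for the tags of interest can be kept and reshaped by pa into one observation per assignment.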

