XML file error: unrecognizable XML doctype

Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#1

30 Sep 2014, 21:52

Hello all,

I have downloaded some data files in .xml format. On average, each is close to 1 gigabyte.

I ran the command xmluse, but I got the error: unrecognizable XML doctype.

Can this problem be fixed somehow?

Thank you in advance.
Tags: data import
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17084
#2

01 Oct 2014, 00:52

Pantelis,
have you tried:
- xmluse filename, doctype(dta) - ?

Kind regards,
Carlo

(Stata 18.0 SE)
Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#3

01 Oct 2014, 06:44

Originally posted by Carlo Lazzaro:

Pantelis,
have you tried:
- xmluse filename, doctype(dta) - ?

Kind regards,
Carlo

Hi Carlo,

Yes, this was the first thing I tried, but to no avail. The problem persists and I don't know how to open the file. I suspect this may be related to the file being relatively large.
Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#4

01 Oct 2014, 07:03

I fear that the data I downloaded are in neither Excel XML nor Stata XML format.

Is there any way to make these data readable by Stata?
Nick Cox

Join Date: Mar 2014

Posts: 33642
#5

01 Oct 2014, 07:09

That's like saying how do I read a data file of unspecified type into Stata.
Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#6

01 Oct 2014, 08:24

Originally posted by Nick Cox:

That's like saying how do I read a data file of unspecified type into Stata.

Dear Nick,
the file has an .xml extension and, as far as I understand, it is an XML file. Stata can import XML (i.e. with xmluse). I have tried to make it work but, as I said before, I did not succeed.
Nick Cox

Join Date: Mar 2014

Posts: 33642
#7

01 Oct 2014, 08:35

We want to help you but I can't see that you have asked a question that can be answered unless you tell us more (than nothing) about the internal structure of these files. Use of a particular file extension is not itself material. Perhaps you would be better discussing this with StataCorp technical support.
Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#8

01 Oct 2014, 08:36

So, I found out that the doc type is: us-patent-assignments

I guess that xmluse cannot be used here unfortunately.
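
For anyone hitting the same error, here is a minimal sketch of one way to check what a file declares before trying to import it. In XML, the doctype (if any) is declared in a `<!DOCTYPE ...>` line near the top of the file, so looking at the first few hundred bytes is usually enough. The filename myfile.xml and the choice of 500 bytes are assumptions for illustration only.

```stata
mata:
// open the file read-only; "myfile.xml" is a hypothetical filename
fh = fopen("myfile.xml", "r")
// display the first 500 bytes, which should include any <!DOCTYPE ...> line
fread(fh, 500)
fclose(fh)
end
```

Reading a fixed number of bytes rather than whole lines avoids trouble when a large XML file contains few or no line breaks.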
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1831
#9

01 Oct 2014, 09:18

Perhaps usexmlex? Sergiy
Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#10

01 Oct 2014, 10:38

Originally posted by Sergiy Radyakin:

Perhaps usexmlex? Sergiy

Sergiy, I got this error:

t_ate-produced CDATA #IMPLIE invalid name
st_addvar(): 3300 argument out of range
usexmlex(): - function returned error
<istmt>: - function returned error
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17084
#11

01 Oct 2014, 10:54

Pantelis:
-search 3300- gives back what follows:

error . . . . . . . . . . . . . . . . . . . . . . . . Return code 3300
argument out of range
The eltype and orgtype of the argument are correct, but
the argument contains an invalid value, such as if you had
asked for the 20th row of a 4 x 5 matrix.
See eltype and orgtype in help [M-6] glossary.

Could it be that the files you are trying to read in Stata are corrupted?

Kind regards,
Carlo

(Stata 18.0 SE)
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1831
#12

01 Oct 2014, 13:32

The meaning of the error is literal: Stata variable names can't contain a # sign, so the import procedure couldn't create a variable with such a name. Why it is doing so is a whole other matter: mostly because the file is not compatible with the program's expectations, but perhaps in part because the code relies on the st_isname() function to judge what is permissible, which is not without its own problems. I described the problem a while ago in the following post:
http://www.statalist.org/forums/foru...variable-names

There was no feedback.
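
As a rough illustration of the naming rule, the sketch below checks two candidate strings by hand. "#IMPLIE" is taken from the error message quoted above; "date-produced" is only a hypothetical example of a hyphenated XML attribute name, not something confirmed from the file. st_isname() is the Mata function mentioned above, and strtoname() is a built-in Stata function that converts an arbitrary string into a legal name.

```stata
* st_isname() returns 1 if the string is acceptable as a Stata name, 0 otherwise
mata: st_isname("#IMPLIE")
mata: st_isname("date-produced")

* strtoname() replaces characters that are not allowed in names
display strtoname("#IMPLIE")
display strtoname("date-produced")
```

An importer that ran every candidate name through something like strtoname() would not stop on this error, though it would not by itself make the resulting variables meaningful.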

However, the quoted text doesn't make much sense to me - forget the # sign, the rest doesn't look like a variable name. Perhaps you have encountered yet another XML format.
Perhaps check it against this specification: http://www.triple-s.org/sssxml1.htm
Guessing will be infinite without seeing the file or at least a portion of it. From the limited information we have, you are likely dealing with a DTD:
see here: http://www.w3schools.com/dtd/default.asp
and here: http://en.wikipedia.org/wiki/Document_type_definition

Carlo's explanation is also valid. Any XML can get corrupted when mishandled.

"I have downloaded some data files"

This is not really helpful. Where did you get the files from? Are they publicly accessible? Does the source also provide documentation? Can it be approached with questions? Why are we guessing?

Best, Sergiy
Pantelis Kazakis

Join Date: Aug 2014

Posts: 121
#13

01 Oct 2014, 17:02

Forgive me for not being very clear.

I got the data from the following site: http://patents.reedtech.com/assignment.php

(I downloaded the ones from 1980-2013).
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1831
#14

01 Oct 2014, 18:42

Originally posted by Pantelis Kazakis:

Forgive me for not being very clear.

I got the data from the following site: http://patents.reedtech.com/assignment.php

(I downloaded the ones from 1980-2013).

Well, as expected, the data provider is saying upfront: "The file format is eXtensible Markup Language (XML) in accordance with the Patent Assignment Daily XML (PADX) Version 0.3 Document Type Definition (DTD)."

The easiest approach would probably be to remove the DTD and open the files one by one in Excel, then save them in a format more accessible to Stata. My concern is that the data files carry a DTD version field, which means the DTD might change between files, though hopefully never within the same file. It could even be that the same version was used for all files, if it is still at 1 for 2014.

The beginning of the file ad20140929.xml should look like this:

-usexmlex- works with a much simpler version of XML files. It will not manage the import of this, sorry.

Best regards, Sergiy
Robert Picard

Join Date: Mar 2014

Posts: 1536
#15

01 Oct 2014, 21:46

These are indeed very big xml files. I downloaded "ad20131231-01.zip" (from the very bottom of http://patents.reedtech.com/assignment.php). The zip archive unzips to a single 876.2 MB file called "ad20131231-01.xml".

You can quickly get a sense of the content by first using filefilter to insert line breaks and then importing a subset of the observations. This worked for me (you can install leftalign by typing ssc install leftalign):

Code:

filefilter ad20131231-01.xml ad20131231-01.txt, from(<) to(\M<)
import delimited s using ad20131231-01.txt, clear rowrange(1:1000)
* leftalign is from SSC
leftalign

You can import the whole file by omitting the rowrange(1:1000) option, but that generates a very big dataset with more than 50 million observations and a string variable of length 473 to hold the longest value. Another way to quickly load the whole file into memory with less overhead is to truncate the longer strings using

Code:

infix str s 1-50 using ad20131231-01.txt, clear
leftalign

You quickly see the structure of the data, with "<patent-assignment>" identifying the start of a block of information. You can easily group observations using

Code:

gen pa = sum(s == "<patent-assignment>")
sum pa

Of course what to do next depends a lot on what you are looking for but at least this gets you started.
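
As one possible next step, here is a minimal sketch that splits each line into a tag name and the text that follows it. It assumes the string variable s and the block counter pa from above, and that every line starts with "<" because of the filefilter step. The variable names tag and value are hypothetical, and the patterns are a guess at what is useful rather than a full XML parser: opening tags that carry attributes, and strings truncated by infix, will need extra care.

```stata
* tag: name of an opening tag, e.g. "patent-assignment"; empty for closing tags
gen tag = regexs(1) if regexm(s, "^<([^/>][^> ]*)")

* value: whatever text follows the first ">" on the line
gen value = regexs(1) if regexm(s, "^<[^>]*>(.*)$")

* inspect the first few parsed lines, grouped by assignment block
list pa tag value in 1/20
```

From there, lines with a non-empty value can be kept and, depending on which fields you need, reshaped so that each value of pa becomes one observation.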