I'm trying to import several large JSON files (each between 2.5 and 4 GB) into Stata. The files are named 2008.json to 2012.json, one for each year. (They contain patent application information, downloaded via the 'download entire data set' option at https://ped.uspto.gov/peds/.)
I don't need every field in each of these files, but the file structures seem relatively complex. I initially thought I'd use insheetjson, but two problems arise. First, I tried -insheetjson using 2009.json, showresponse- and Stata returned the following error:
Code:
fread(): 691 I/O error
libjson::getrawcontents(): - function returned error [17]
injson_sheet(): - function returned error
<istmt>: - function returned error
Second, one of the features of my dataset is that once it's flattened, there is more than one item of the same name (that I want to use). For instance, there is a field called "value" within the node "applicationNumberText", and also a field called "value" within the node "groupArtUnitNumber". I wasn't sure how to operationalize this within insheetjson.
I also tried William Buchanan's jsonio package. Specifically, I tried -jsonio kv, file("2009.json") nourl-. But that returned a long error that begins with this text:
Code:
java.lang.OutOfMemoryError: Java heap space
    at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
    at java.util.HashMap.putVal(HashMap.java:630)
    at java.util.HashMap.put(HashMap.java:611)
    at com.fasterxml.jackson.databind.node.ObjectNode.replace(ObjectNode.java:397)
    at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:250)
I'm aware that JSON files are basically text and can be parsed using regular expressions, but I'm quite a novice with regex and not sure where to begin, especially given the note above about more than one item with the same name ("value"). I've attached a sample with a few records (the type is .txt, but you can replace that with .json if you prefer), and I'm hoping that someone can offer some suggestions. The fields I'd like to pull off are:
"applicationNumberText":{"value"
"applicationNumberText":{"electronicText"
"filingDate"
"applicationTypeCategory"
"groupArtUnitNumber":{"value"
"groupArtUnitNumber":{"electronicText"
"nationalClass"
"nationalSubclass"
"publicationNumber"
"publicationDate"
"patentNumber"
"grantDate"
I believe these are all 1:1 within a record (a patentRecordBag). This is the case with the sample data, though I admit I'm not certain about the full files. (I'm also not sure how to find out: I was able to discern elements of the object structure using the online JSON formatter at jsonformatter.curiousconcept.com, but I only fed it my sample records, not the full multi-GB files. I assume that -insheetjson, showresponse- or -jsonio kv- would help, but those didn't work in this case.)
Any help is very much appreciated – thanks!
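To make the "same name under different parents" problem concrete, here is a minimal Python sketch of one way the fields listed above could be flattened into one row per record, with names such as applicationNumberText_value and groupArtUnitNumber_value. It is only an illustration under assumptions, not a tested solution for the full files: the names WANTED, find_key, flatten_record and rows_to_csv are made up for this example, the tiny sample record and its "example_wrapper" nesting are hypothetical, and it assumes each record has already been parsed into a Python dict.

```python
import csv
import io
import json

# Fields wanted from each record. A two-part tuple means "child key inside
# parent key", which is how the two different "value" fields are told apart.
WANTED = [
    ("applicationNumberText", "value"),
    ("applicationNumberText", "electronicText"),
    ("filingDate",),
    ("applicationTypeCategory",),
    ("groupArtUnitNumber", "value"),
    ("groupArtUnitNumber", "electronicText"),
    ("nationalClass",),
    ("nationalSubclass",),
    ("publicationNumber",),
    ("publicationDate",),
    ("patentNumber",),
    ("grantDate",),
]


def find_key(node, key):
    """Return the first non-None value stored under `key` at any depth, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def flatten_record(record):
    """Map one parsed record to a flat dict keyed by parent_child names."""
    row = {}
    for path in WANTED:
        node = find_key(record, path[0])
        for key in path[1:]:
            node = node.get(key) if isinstance(node, dict) else None
        row["_".join(path)] = node
    return row


def rows_to_csv(records, out):
    """Write one CSV row per record; missing fields become empty cells."""
    writer = csv.DictWriter(out, fieldnames=["_".join(p) for p in WANTED])
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record))


# Hypothetical record: the wrapper name and all values are invented.
sample = json.loads("""
{"example_wrapper": {
    "applicationNumberText": {"value": "12345678", "electronicText": "12/345,678"},
    "filingDate": "2009-01-02",
    "groupArtUnitNumber": {"value": "2121", "electronicText": "2121"}
}}
""")

buffer = io.StringIO()
rows_to_csv([sample], buffer)
print(buffer.getvalue())
```

Two caveats apply. First, find_key returns the first match in a depth-first search, so if a name such as "filingDate" also occurs elsewhere in a record, it may pick up the wrong occurrence; the 1:1 assumption would need checking against the full files. Second, this does not address the file size: reading a 2.5 to 4 GB file with json.load would likely hit the same memory limits, so the records would have to be fed in one at a time by a streaming parser (the third-party ijson package is one candidate). The resulting CSV could then be read into Stata with import delimited.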