  • Trouble importing very large JSON file

    I'm trying to import several large JSON files (ranging from 2.5 to 4 GB each) into Stata. The files are named 2008.json to 2012.json, one for each year. (They contain patent application information, downloaded via the 'download entire data set' option at https://ped.uspto.gov/peds/.)

    I don't need every field in these files, but the file structures seem relatively complex. I initially thought I'd use insheetjson, but two problems arose. First, I tried -insheetjson using 2009.json, showresponse- and Stata returned the following error:

    Code:
    fread():   691  I/O error
    libjson::getrawcontents():     -  function returned error [17]
    injson_sheet():     -  function returned error
    <istmt>:     -  function returned error
    Second, one feature of my dataset is that once it's flattened, there is more than one item of the same name (that I want to use). For instance, there is a field called "value" within the node "applicationNumberText", and also a field called "value" within the node "groupArtUnitNumber". I wasn't sure how to handle this within insheetjson.

    I also tried William Buchanan's jsonio package. Specifically, I tried -jsonio kv, file("2009.json") nourl-. But that returned a long error that begins with this text:

    Code:
    java.lang.OutOfMemoryError: Java heap space
            at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
            at java.util.HashMap.putVal(HashMap.java:630)
            at java.util.HashMap.put(HashMap.java:611)
            at com.fasterxml.jackson.databind.node.ObjectNode.replace(ObjectNode.java:397)
            at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:250)
    I'm aware that JSON files are basically text and can be parsed using regular expressions, but I'm quite a novice with regex and not sure where to begin, especially given the note above about more than one item with the same name ("value"). I've attached a sample with a few records (the extension is .txt, but if it makes you happy you can replace that with .json), and I'm hoping that someone can offer some suggestions. The fields I'd like to pull out are:
    "applicationNumberText":{"value"
    "applicationNumberText":{"electronicText"
    "filingDate"
    "applicationTypeCategory"
    "groupArtUnitNumber":{"value"
    "groupArtUnitNumber":{"electronicText"
    "nationalClass"
    "nationalSubclass"
    "publicationNumber"
    "publicationDate"
    "patentNumber"
    "grantDate"

    And I believe these are all 1:1 within a record (a patentRecordBag) – this is the case in the sample data, though I admit I'm not certain about the full files. (And I'm not sure how to find out – I was able to discern elements of the object structure using the online JSON formatter tool at jsonformatter.curiousconcept.com, but I only fed it my sample records, not the full multi-GB files. I assume that insheetjson, showresponse or jsonio kv would help, but those didn't work in this case.)
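    As an aside (outside Stata, and not part of what I tried): one rough way to check the 1:1 assumption on the full files without loading them into memory is to stream through the file in chunks and count key occurrences. The function below is a hypothetical Python helper, not anything from this thread; if the count for a field matches the count of "patentRecordBag", the 1:1 assumption is at least plausible.

    ```python
    def count_keys(f, keys, chunk_size=1 << 20):
        """Count occurrences of each key string in a (possibly huge) text stream."""
        counts = {k: 0 for k in keys}
        overlap = max(len(k) for k in keys) - 1  # longest key could straddle a chunk boundary
        tail = ""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            for k in keys:
                # subtract matches wholly inside the carried-over tail,
                # which were already counted in the previous iteration
                counts[k] += buf.count(k) - tail.count(k)
            tail = buf[-overlap:] if overlap else ""
        return counts
    ```

    Usage would be something like `with open("2009.json") as f: count_keys(f, ['"patentRecordBag"', '"applicationNumberText"'])`, comparing the two counts.
    
    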

    Any help is very much appreciated – thanks!
    Attached Files

  • #2
    Robert,

    I know this is not the answer to your question, but perhaps you could look for the same data in an alternative format?
    Patent data in XML format has been mentioned more than once on this forum; see e.g. this thread:
    https://www.statalist.org/forums/for...le-xml-doctype

    And if XML is not possible, perhaps you could recover the data from the pages yourself using scraping techniques:
    https://www.stata.com/meeting/oceani...-oceania16.pdf

    After visiting the link you included, I see that there is an API available, and downloads are offered in both JSON and XML. So both approaches might work.

    Best, Sergiy



    • #3
      I downloaded the entire dataset from the link provided in #1 and then unzipped the content. The JSON files were saved in a subdirectory called "pairbulk-full-20180102-json" and I count 86 files using 58.84 GB. The following shows how to extract the desired information from the 2009.json file. The first step is to insert some line breaks using filefilter, writing the result to 2009.txt.
      Code:
      filefilter pairbulk-full-20180102-json/2009.json 2009.txt, from({) to(\M{)
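      In filefilter's syntax, \M stands for a classic-Mac end-of-line character (a carriage return), so this call rewrites every "{" as a line break followed by "{", putting each JSON object on its own line. A rough Python analogue (an illustration only, not part of the thread) of the same byte-level rewrite:

      ```python
      def break_before_braces(fin, fout, chunk_size=1 << 20):
          """Copy a binary stream, inserting a carriage return before every '{'.

          Safe to do chunk-by-chunk because the pattern is a single byte.
          """
          while True:
              chunk = fin.read(chunk_size)
              if not chunk:
                  break
              fout.write(chunk.replace(b"{", b"\r{"))
      ```
      
      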
      You can then import the text file into Stata and group observations per patent (using patentRecordBag as a flag for a new patent):
      Code:
      * input up to 2000 characters per line, check that we are not truncating
      infix strL s 1-2000 using "2009.txt", clear
      format %-120s s
      assert length(s) < 2000
      
      * group observations per patentRecordBag
      gen long obs = _n
      gen long pid = sum(strpos(s,"patentRecordBag") > 0)
      bysort pid (obs): gen N = _N
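      The pid variable is a running sum of a 0/1 flag, so every line containing "patentRecordBag" starts a new group and all following lines inherit that group id. The same idea in Python, as a minimal sketch (hypothetical helper, not from the thread):

      ```python
      def group_by_flag(lines, marker="patentRecordBag"):
          """Assign lines to groups; a line containing the marker starts a new group."""
          groups, pid = {}, 0
          for line in lines:
              if marker in line:
                  pid += 1  # new record begins here, like sum(strpos(s, marker) > 0)
              groups.setdefault(pid, []).append(line)
          return groups
      ```
      
      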
      Since this is a very large file, the extraction will go faster if you first reduce the data to observations that contain the desired fields. This step is optional.
      Code:
      gen tokeep = ///
          strpos(s, "applicationNumberText") | ///
          strpos(s[_n-1], "applicationNumberText") | ///
          strpos(s, "groupArtUnitNumber") | ///
          strpos(s[_n-1], "groupArtUnitNumber") | ///
          strpos(s, "nationalClass") | ///
          strpos(s, "publicationNumber") | ///
          strpos(s, "patentNumber")
          
      keep if tokeep
      You can then proceed to extract the desired fields using regular expression matching. This will look pretty complicated to novice users, but all commands follow the same pattern. The code also looks busy because each field is enclosed in double quotes, so you need Stata's compound double quotes (see help quotes).

      With regular expressions, the first subexpression that matches the pattern is returned in regexs(1). A subexpression is enclosed in parentheses in the regexm() function call. In all cases, I use "([^"]+)" to define the subexpression. The square brackets define a character class, and because the first character after the left square bracket is a caret (^), the class is negated. The only other character in the class is a double quote, so the subexpression matches any character that is not a double quote. The + modifier then matches one or more such characters. Since the subexpression is itself enclosed in double quotes, the full pattern matches a double quote, followed by one or more non-quote characters, followed by a closing double quote. Here's code that will extract the fields mentioned in #1:
      Code:
      gen app_no_value = regexs(1) if regexm(s,`""value":"([^"]+)""') & strpos(s[_n-1], "applicationNumberText")
      gen app_no_text = regexs(1) if regexm(s,`""electronicText":"([^"]+)""') & strpos(s[_n-1], "applicationNumberText")
      gen app_no_fdate = regexs(1) if regexm(s,`""filingDate":"([^"]+)""') & strpos(s[_n-1], "applicationNumberText")
      gen app_no_cat = regexs(1) if regexm(s,`""applicationTypeCategory":"([^"]+)""') & strpos(s[_n-1], "applicationNumberText")
      
      gen gart_value = regexs(1) if regexm(s,`""value":"([^"]+)""') & strpos(s[_n-1], "groupArtUnitNumber")
      gen gart_no_text = regexs(1) if regexm(s,`""electronicText":"([^"]+)""') & strpos(s[_n-1], "groupArtUnitNumber")
      
      gen natclass    = regexs(1) if regexm(s,`""nationalClass":"([^"]+)""')
      gen natsubclass = regexs(1) if regexm(s,`""nationalSubclass":"([^"]+)""')
      
      gen pubnum    = regexs(1) if regexm(s,`""publicationNumber":"([^"]+)""')
      gen pubdate    = regexs(1) if regexm(s,`""publicationDate":"([^"]+)""')
      
      gen patentnum     = regexs(1) if regexm(s,`""patentNumber":"([^"]+)""')
      gen patentdate    = regexs(1) if regexm(s,`""grantDate":"([^"]+)""')
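      The negated-character-class idea transfers directly to other regex engines. Here it is checked in Python (an aside for readers less familiar with regex; Python needs no compound quotes because the pattern can sit in a raw string), using an application number from the listing below:

      ```python
      import re

      # "([^"]+)" captures the run of non-quote characters between two literal quotes
      line = '"applicationNumberText":{"value":"12280216","electronicText":"12280216"}'
      value = re.search(r'"value":"([^"]+)"', line).group(1)
      print(value)  # → 12280216
      ```
      
      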
      You can then reduce to one observation per patent using a single call to collapse:
      Code:
      collapse (firstnm) app_no_value-patentdate, by(pid)
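      collapse (firstnm) keeps, for each variable, the first non-missing value within each by-group. For readers outside Stata, the same operation can be sketched in Python over a list of dicts, with None standing in for missing (a hypothetical helper, not from the thread):

      ```python
      def collapse_firstnm(rows, by):
          """One output record per group, keeping each field's first non-missing value."""
          out = {}
          for row in rows:
              rec = out.setdefault(row[by], {})
              for k, v in row.items():
                  if k != by and rec.get(k) is None and v is not None:
                      rec[k] = v  # only fill fields still missing in this group
          return out
      ```
      
      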
      And here are the results for the first 10 patents found in the file:
      Code:
      . list pid-gart_no_text in 1/10
      
           +-------------------------------------------------------------------------+
           | pid   app_n~ue   app_n~xt   app_no_f~e   app_n~at   gart_v~e   gart_n~t |
           |-------------------------------------------------------------------------|
        1. |   1   12280216   12280216   2009-01-01    Utility       2433       2433 |
        2. |   2   12307268   12307268   2009-01-01    Utility       3738       3738 |
        3. |   3   12307269   12307269   2009-01-01    Utility       1655       1655 |
        4. |   4   12307271   12307271   2009-01-01    Utility       3779       3779 |
        5. |   5   12307272   12307272   2009-01-01    Utility       2612       2612 |
           |-------------------------------------------------------------------------|
        6. |   6   12307273   12307273   2009-01-01    Utility       2612       2612 |
        7. |   7   12347999   12347999   2009-01-01    Utility       2179       2179 |
        8. |   8   12348001   12348001   2009-01-01    Utility       2875       2875 |
        9. |   9   12348002   12348002   2009-01-01    Utility       2836       2836 |
       10. |  10   12348003   12348003   2009-01-01    Utility       2461       2461 |
           +-------------------------------------------------------------------------+
      
      . list pid natclass-patentdate in 1/10
      
           +------------------------------------------------------------------------------------+
           | pid   natclass   natsubcl~s            pubnum      pubdate   patent~m   patentdate |
           |------------------------------------------------------------------------------------|
        1. |   1        726   726/005000   US20090300710A1   2009-12-03                         |
        2. |   2        606   606/246000   US20100234901A1   2010-09-16                         |
        3. |   3        424   424/770000   US20090317501A1   2009-12-24                         |
        4. |   4        600   600/172000   US20090287058A1   2009-11-19    8529440   2013-09-10 |
        5. |   5        340   340/426190   US20090231189A1   2009-09-17                         |
           |------------------------------------------------------------------------------------|
        6. |   6        340   340/539180   US20090315699A1   2009-12-24                         |
        7. |   7        715   715/727000   US20100169781A1   2010-07-01    9591118   2017-03-07 |
        8. |   8        362   362/473000                                   8075168   2011-12-13 |
        9. |   9        307   307/130000   US20090184746A1   2009-07-23                         |
       10. |  10        370   370/281000   US20090316610A1   2009-12-24    8018897   2011-09-13 |
           +------------------------------------------------------------------------------------+



      • #4
        Robert Werth
        The issue you ran into with jsonio has to do with a setting in Stata. If you run:

        Code:
        . query java
        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            Java system information
              Java initialized     yes
              heap usage           247m
              system classpath     /Applications/Stata/utilities/jar/libstata-core.jar:/Applications/Stata/utilities/jar/libdeps-core.jar:/Applications/Stata/utilities/jar/sfi-api.jar:
              javacall classpath   /Applications/Stata/ado/base/jar/libstata-plugin.jar
        
            Advanced Java settings
              set java_heapmax     8192m
              set java_vmpath      /Applications/Stata/utilities/java/macosx-x64/jre1.8.0_121.jre/Contents/Home/lib/jli/libjli.dylib
        You'll see the setting "set java_heapmax". Raising it makes Stata allocate a larger amount of memory to the JVM when it spins up. For example, I have my machine set up to allow 8 GB for the heap.
