
  • JSON from a variable?

    Hi, I have quite a few JSON-formatted strings stored in a CSV, such as the following:

    Code:
    {"rt":1378,"stimulus":"<p>Are you wearing headphones?</p>","button_pressed":"0","trial_type":"html-button-response","trial_index":5,"time_elapsed":42668,"internal_node_id":"0.0-3.0-0.0"}
    {"rt":4086,"responses":"{\"Q0\":\"25\"}","trial_type":"survey-text-number","trial_index":2,"time_elapsed":22128,"internal_node_id":"0.0-1.0-1.0"}
    {"rt":1604,"responses":"{\"Q0\":\"\"}","trial_type":"survey-text-longer","trial_index":201,"time_elapsed":289924,"internal_node_id":"0.0-16.0"}
    and so on, for several million lines of data. Not all lines have the same categories, so it's not straightforward to use split to parse all of this into columns. Is there a good way to use one of the various JSON features in Stata to convert this very long set of strings into more manageable columns?

  • #2
    I had a similar problem - have a look at:
    https://www.statalist.org/forums/for...lar-expression

    I could not really solve it well, but for the moment I have found a workflow that works; I would be happy to learn better ways to handle JSON files in Stata.



    • #3
      At the moment I'm doing something pretty rudimentary, but it works: splitting the variable with the embedded JSON at the only consistent break point,
      Code:
      split string, parse(`",""')
      and then using strpos to create new variables from the resulting mixed-content variables. Eventually I'm going to need something that works at scale, though, so this may not be an ideal solution, as it involves a lot of searching through strings.

      I wonder if it would be faster or more memory-efficient to use moss? (It could be worse, as it would involve doubling the size of an already enormous dataset...)
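
      For concreteness, here is a minimal sketch of the strpos step on the unsplit variable (the variable name string is an assumption); it grabs everything between "rt": and the next comma:
      Code:
      * hedged sketch: locate the key with strpos(), then cut the value out with substr()
      gen pos  = strpos(string, `""rt":"')
      gen tail = substr(string, pos + 5, .) if pos > 0    // 5 = length of the key text "rt":
      * assumes a comma follows the value, as it does for "rt" in these data
      gen rt   = substr(tail, 1, strpos(tail, ",") - 1) if pos > 0
      drop pos tail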



      • #4
        You can use regular expressions to extract values from such text. For example:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str186 var1
        `"{"rt":1378,"stimulus":"<p>Are you wearing headphones?</p>","button_pressed":"0","trial_type":"html-button-response","trial_index":5,"time_elapsed":42668,"internal_node_id":"0.0-3.0-0.0"}"'
        `"{"rt":4086,"responses":"{\"Q0\":\"25\"}","trial_type":"survey-text-number","trial_index":2,"time_elapsed":22128,"internal_node_id":"0.0-1.0-1.0"}"'                                         
        `"{"rt":1604,"responses":"{\"Q0\":\"\"}","trial_type":"survey-text-longer","trial_index":201,"time_elapsed":289924,"internal_node_id":"0.0-16.0"}"'                                           
        end
        
        // "rt":1378,
        gen rt = regexs(1) if regexm(var1,`""rt":([^,]+),"')
        
        // "time_elapsed":22128,
        gen te = regexs(1) if regexm(var1,`""time_elapsed":([^,]+),"')
        
        list rt te
        and the results:
        Code:
        . list rt te
        
             +---------------+
             |   rt       te |
             |---------------|
          1. | 1378    42668 |
          2. | 4086    22128 |
          3. | 1604   289924 |
             +---------------+
        
        .
        You can use sites like https://regex101.com to check the pattern and see what's going on.
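
        For what it's worth, the same pattern extends to string-valued fields, and the captured text can then be converted; a small untested sketch along the same lines (the trial_type capture is my own addition):
        Code:
        // string-valued field, e.g. "trial_type":"html-button-response"
        gen tt = regexs(1) if regexm(var1,`""trial_type":"([^"]+)""')
        
        // regexs() always returns strings, so convert the numeric fields
        destring rt te, replace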



        • #5
          I followed the link given by Marc Kaulisch in post #2 to his previous topic and was able to learn enough JSON to use insheetjson to read the single line of complex JSON he provided. I've posted the results there, and modified that example to solve this simpler problem. If I do this enough, eventually I'll figure out what I'm doing.
          Code:
          cls
          clear
          set more off
          type "3linesof.json"
          insheetjson using "3linesof.json", showresponse
          insheetjson using "3linesof.json", showresponse flatten
          
          local selectors rt stimulus button_pressed trial_type trial_index time_elapsed internal_node_id
          
          foreach s of local selectors {
              generate str200 `s' = ""
          }
          
          insheetjson `selectors' using "3linesof.json", columns(`selectors')
          list, noobs abbreviate(20)
          Code:
          . list, noobs abbreviate(20)
          
            +-----------------------------------------------------------------------------------+
            |   rt |                           stimulus | button_pressed |           trial_type |
            | 1378 | <p>Are you wearing headphones?</p> |              0 | html-button-response |
            |-----------------------------------------------------------------------------------|
            |       trial_index       |       time_elapsed       |       internal_node_id       |
            |                 5       |              42668       |            0.0-3.0-0.0       |
            +-----------------------------------------------------------------------------------+
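
          One follow-up step, assuming the import above: insheetjson fills every variable as str200 text, so the numeric fields still need converting and the over-wide strings can be trimmed:
          Code:
          * convert the numeric fields and shrink the str200 storage types
          destring rt trial_index time_elapsed, replace
          compress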



          • #6
            Whoops, I had replied to Robert's message but apparently didn't post it...

            re Robert: Yeah, I was thinking that my next step was to convert my strpos statements to regex. Do you know if regex matching runs faster than strpos? What I've got now runs nicely on 7M rows, but eventually we're going to have closer to 100M+ rows to process at a time, so a simple way to speed things up would be nice.

            re William: Do you know if insheetjson can be applied to fields within existing data? Stata seems to have real trouble importing the full JSON of the 7M lines of data I have so far, whereas when each nested line of JSON is one cell of a very long CSV, Stata can read in the full dataset in about 45 seconds (on a rather fast machine). So I wonder whether it's even worth switching over to an admittedly more elegant direct import of JSON, or better to stick with string parsing.
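
            One way to settle the speed question empirically would be Stata's timer; a minimal sketch, assuming the JSON sits in a variable called string:
            Code:
            * hedged sketch: rough timing comparison of strpos() vs. regexm()
            * on the same extraction
            timer clear
            timer on 1
            gen pos = strpos(string, `""rt":"')
            gen rt1 = substr(string, pos + 5, strpos(substr(string, pos + 5, .), ",") - 1) if pos > 0
            timer off 1
            timer on 2
            gen rt2 = regexs(1) if regexm(string, `""rt":([0-9]+)"')
            timer off 2
            timer list
            assert rt1 == rt2
            drop pos rt1 rt2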



            • #7
              All I know about insheetjson is what I have learned by trial and error. The documentation is somewhat opaque to me because of my weak understanding of JSON. Your example data was trivially easy to read. Have you followed the link in post #2 since I added my example on its complex data to the end of that topic?

              Also, I do not understand what you mean by "fields within existing data", and I do not know what you mean by "in a CSV". Your sample data is three lines of JSON in a plain text file.

              The trick I see to reading JSON objects with varying categories is figuring out beforehand an exhaustive list of categories - or at least the list of categories you care to import.

              Sorry to be so vague - I'm still learning by doing. Let me know if any of this helps, or if my larger example helps you understand better, or if you can provide other example data or clarification.
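
              On building that exhaustive list of categories beforehand, one untested idea, borrowing the moss command (from SSC) mentioned in #3, is to harvest every "key": token that appears and tabulate the distinct names:
              Code:
              * hedged sketch: pull out all "key": tokens, then list the distinct keys
              moss var1, match(`""([a-zA-Z_0-9]+)":"') regex prefix(k_)
              preserve
              keep k_match*
              gen long obs = _n
              reshape long k_match, i(obs) j(j)
              tab k_match if k_match != ""
              restore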



              • #8
                Sorry, I wasn't clear. The data are coming from an enormous SQL database and I can export JSON (where the field "data_string" contains a set of JSON-nested fields) or a CSV where each line of the JSON text I posted corresponds to a cell of one column.



                • #9
                  I'd say if you can get what you want using Robert's elegant regex solution from your data, that's likely to be substantially faster than using insheetjson. Coding for regular expressions is a mature field, and you don't need all the overhead of insheetjson. Realistically, I think insheetjson will prove most helpful and elegant with more complex JSON data structures like the one in Marc Kaulisch's example.

                  Thanks for your response; I'm trying to learn something about JSON because it's apparently an increasingly important data format. The examples from you and Marc have been helpful, and it was satisfying to work through Marc's example, since the selector notation had not been obvious to me.



                  • #10
                    Originally posted by Samuel Mehr View Post
                    The data are coming from an enormous SQL database and I can export JSON . . . or a CSV
                    Pardon if this is a naive question, but does your database management system have functions that can parse the JSON and deliver the elemental values that you're seeking via a conventional query?

                    It might be faster than exporting the entire JSON cell with all of its text overhead and then reading it all into Stata over the wire only to have to parse it locally.

                    If a 100-million-row JSON-parsing query is going to cause complaints, then perhaps you can negotiate with the administrators to allow it to be scheduled in batch mode during relative lulls in database, server and network usage.
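
                    For instance, if the database happens to support a function like JSON_VALUE (SQL Server and MySQL 8+ do), something along these lines could pull only the parsed columns into Stata via odbc load. The column name data_string comes from #8; the table and DSN names are invented:
                    Code:
                    * hedged sketch: the database parses the JSON; \$ stops Stata from
                    * reading the JSON path as a global macro
                    odbc load, exec(`"SELECT JSON_VALUE(data_string, '\$.rt') AS rt, JSON_VALUE(data_string, '\$.time_elapsed') AS time_elapsed FROM trials"') dsn("mydb") clear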

