Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • JSON

    Hello, I am trying to import bulk data in JSON format. I have downloaded the data from the web in zip file. I used the insheetjson command

    but recieved an invalid file specification error.

    Ultimately I dont want all of the data just certain columns, but I dont know how to specify that when I import. I have attached the link to the data that I am trying to get.

    Any help will be much appreciated

    Best regards, Michael
    congress - Public domain data collectors for the work of Congress, including legislation, amendments, and votes.

  • #2
    I have never used this user written program, but in the help file (type help insheetjson) there are specific instructions and examples for importing it to Stata. Please read the FAQ to able other users of this user written program to help you next time. Especially, what exactly you type and what Stata gave you back.

    Comment


    • #3
      Thank you Oded,
      I have uploaded an attachment that shows exactly what command I gave to Stata and the exact error that I received back.

      Thank you
      Best regards,
      Michael
      Attached Files

      Comment


      • #4
        I'm not familiar with the JSON format and have never used insheetjson (from SSC) but I gave it a try. I downloaded the following zip file:

        Code:
        http://unitedstates.sunlightfoundation.com/congress/data/113.zip
        The archive is big, it contains 16,674 files (about 120MB) in JSON format. In the few tests I did, I was able to extract some data using insheetjson. For example, the following appears to work:

        Code:
        clear
        gen str80 id = ""
        gen str80 state = ""
        gen str80 display_name = ""
        gen str80 party = ""
        
        insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
            table(votes Present) col("id" "state" "display_name" "party")
        compress
        However, I could not figure out a syntax that worked to get data with the following structure

        Code:
          "votes": {
            "Not Voting": [
              {
                "display_name": "Amodei", 
                "id": "A000369", 
                "party": "R", 
                "state": "NV"
              },
        which should be something like

        Code:
        clear
        gen str80 id = ""
        gen str80 state = ""
        gen str80 display_name = ""
        gen str80 party = ""
        
        insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
            table(votes "Not Voting") col("id" "state" "display_name" "party")
        compress
        I might be wrong but my guess is that using insheetjson to extract data like this may be pushing the design capabilities of insheetjson.

        However, the most challenging problem will be to deal with the sheer number of files. Perhaps it will be enough to just be able to look for the information within the raw JSON text. These raw FSON files can be input in Stata and easily combined using filelist, from SSC. Here's an example.

        Code:
        * based on files contained in
        * http://unitedstates.sunlightfoundation.com/congress/data/113.zip
        * -filelist- if from SSC; to install: ssc install filelist
        filelist , dir("113/amendments/hamdt") pattern("*.json")
        tempfile flist
        save "`flist'"
        local nfiles = _N
        
        * import each file into Stata
        forvalues i=1/`nfiles' {
        
            use "`flist'" in `i', clear
            local filepath = dirname + "/" + filename
            dis "reading `filepath'"
            
            qui import delimited "`filepath'", clear
            gen source = "`filepath'"
            gen line_id = _n
            gen strL json_line = v1
            drop v1
            
            tempfile json`i'
            qui save "`json`i''"
            
        }
        
        * combine all datasets
        clear
        forvalues i=1/`nfiles' {
            append using "`json`i''"
        }
        
        * -leftalign- is from SSC; to install: ssc install leftalign
        format %80s json_line
        leftalign

        Comment


        • #5
          Thank you Robert,
          I was not even able to download the data using the first line of code that you provided due to the attached error.

          Perhaps I need to move the data to a different environment to make it more manageable before importing into Stata.

          Best regards,

          Michael
          Attached Files

          Comment


          • #6
            Start by downloading the zip archive and expanding it on your computer. Then change Stata's current directory to the directory that contains the expanded archive (help cd). The directory for the expanded archive is called "113". Then, you can run the example I gave, which is

            Code:
            clear
            gen str80 id = ""
            gen str80 state = ""
            gen str80 display_name = ""
            gen str80 party = ""
            
            insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
                table(votes Present) col("id" "state" "display_name" "party")
            compress
            not just the insheetjson command as you did in your screen shot.

            Comment


            • #7
              Robert, thank you again for your response! I downloaded the data and expanded it on my desktop. I then set my cd

              Code:
              cd "C:\Users\JayaRam\Desktop\New folder (3)\113"
              Then I ran the code you gave (11:48), but I received an error:

              Code:
              insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
              option / not allowed
              I'm not sure what I am doing wrong; I'm trying to fiddle with it a bit, but no results yet.
              Can you just take a quick look to make sure that I am on the right track?

              Thanks again,
              Michael

              Comment


              • #8
                The "///" characters at the end of the line indicate that the command continues on the next line (see help comments). The indicators are not supported by Stata's command window. You must either remove them and put the whole command on a single line or put the whole thing in a do-file and run the code from there.

                Comment


                • #9
                  Thanks Robert,

                  I got it to work! Now I have a lot of fiddling to do but at least I am off to a good start! Thanks again

                  Comment


                  • #10
                    Originally posted by Robert Picard View Post
                    Start by downloading the zip archive and expanding it on your computer. Then change Stata's current directory to the directory that contains the expanded archive (help cd). The directory for the expanded archive is called "113". Then, you can run the example I gave, which is

                    Code:
                    clear
                    gen str80 id = ""
                    gen str80 state = ""
                    gen str80 display_name = ""
                    gen str80 party = ""
                    
                    insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
                    table(votes Present) col("id" "state" "display_name" "party")
                    compress
                    not just the insheetjson command as you did in your screen shot.

                    Robert Picard I am facing insheeting json. Can you please guide? My .json file looks like this

                    Code:
                    {
                      "snapshot_list": [
                        {
                          "beat_time": "2010-01-01 00:02:22",
                          "node_list": [
                            
                            {
                              "power": 0,
                              "energy": 0.0,
                              "flag_list": [],
                              "current": 0.0,
                              "credits": 0,
                              "voltage": 0.0,
                              "timeout": true,
                              "address": "1.1.1.1",
                              "price": 0.0
                            },
                            {
                              "power": 0,
                              "energy": 0.0,
                              "flag_list": [],
                              "current": 0.0,
                              "credits": 0,
                              "voltage": 0.0,
                              "timeout": true,
                              "address": "2.2.2.2",
                              "price": 0.0
                            }
                          ]
                        }
                      ],
                      "grid_name": "test-grid"
                    }
                    I tried using the insheetjson command in the format as you mentioned in earlier replies but it isn't working out!

                    My code looks as below

                    Code:
                    clear
                    gen str power = ""
                    gen str energy = ""
                    gen str flat_list = ""
                    gen str current = ""
                    gen str credits = ""
                    gen str voltage = ""
                    gen str timeout = ""
                    gen str address = ""
                    gen str price = ""
                    
                    
                    *** tried various combinations of variables to written inside table()***
                    
                    insheetjson power energy flat_list current credits voltage timeout address price using "test.json", ///
                    table (snaptshot_list beat_time node_list) col("power" "energy" "flat_list" "current" "credits" "voltage" "timeout" "address" "price")
                    Last edited by Karthik Dinne; 04 May 2016, 08:57.

                    Comment


                    • #11
                      I'm not a user of insheetjson and since JSON files are simple text files, I think it's easier to just extract the information the good old fashion way (using standard data management techniques). Here's one way to get there:

                      Code:
                      clear
                      import delimited using "test.json"
                      drop v2
                      
                      * to install, type in Stata's Command window: ssc install leftalign
                      leftalign
                      
                      * extract the components
                      gen item =  ustrregexs(1) if ustrregexm(v1,`""(.*?)": "?([^"]*)"?"')
                      gen value = ustrregexs(2) if ustrregexm(v1,`""(.*?)": "?([^"]*)"?"')
                      
                      * since there are more than one item of the same name, use an increment
                      keep if !mi(item)
                      gen level = sum(item == "power")
                      
                      leftalign
                      list item value level, sepby(level)
                      and here's the results
                      Code:
                      . list item value level, sepby(level)
                      
                           +---------------------------------------------+
                           | item            value                 level |
                           |---------------------------------------------|
                        1. | snapshot_list   [                         0 |
                        2. | beat_time       2010-01-01 00:02:22       0 |
                        3. | node_list       [                         0 |
                           |---------------------------------------------|
                        4. | power           0                         1 |
                        5. | energy          0.0                       1 |
                        6. | flag_list       []                        1 |
                        7. | current         0.0                       1 |
                        8. | credits         0                         1 |
                        9. | voltage         0.0                       1 |
                       10. | timeout         true                      1 |
                       11. | address         1.1.1.1                   1 |
                       12. | price           0.0                       1 |
                           |---------------------------------------------|
                       13. | power           0                         2 |
                       14. | energy          0.0                       2 |
                       15. | flag_list       []                        2 |
                       16. | current         0.0                       2 |
                       17. | credits         0                         2 |
                       18. | voltage         0.0                       2 |
                       19. | timeout         true                      2 |
                       20. | address         2.2.2.2                   2 |
                       21. | price           0.0                       2 |
                       22. | grid_name       test-grid                 2 |
                           +---------------------------------------------+

                      Comment


                      • #12
                        Thanks Robert Picard. This is helpful. But I want the data in the following format. Can you also please suggest a way to do this transformation?
                        beat_time power energy flag_list current credits voltage timeout address price grid_name
                        01-01-10 2:22 0 0 0 0 0 true 1.1.1.1 0 test-grid
                        01-01-10 2:22 0 0 0 0 0 true 2.2.2.2 0 test-grid
                        Note: All rows in the the extracted data should have same "beat_time" and "grid_name"
                        Last edited by Karthik Dinne; 05 May 2016, 05:02.

                        Comment


                        • #13
                          The following works with your sample data:

                          Code:
                          clear
                          import delimited using "test.json"
                          drop v2
                          
                          * extract the components
                          gen item =  regexs(1) if regexm(v1,`""([^"]*)": "?([^"]*)"?"')
                          gen value = regexs(2) if regexm(v1,`""([^"]*)": "?([^"]*)"?"')
                          
                          * since there are more than one item of the same name, use an increment
                          keep if !mi(item)
                          gen level = sum(item == "power")
                          
                          * reshape to wide form
                          drop if value == "["
                          drop v1
                          reshape wide value, i(level) j(item) string
                          rename value* *
                          
                          replace beat_time = beat_time[1]
                          replace grid_name = grid_name[_N]
                          
                          drop if level == 0
                          order beat_time power energy flag_list current credits voltage timeout address price grid_name
                          list , noobs compress
                          and the results:
                          Code:
                          . list , noobs compress
                          
                            +-------------------------------------------------------------------------------------------------------------------+
                            |           beat_time   power   ene~y   fla~t   cur~t   cre~s   vol~e   tim~t   address   price   grid_name   level |
                            |-------------------------------------------------------------------------------------------------------------------|
                            | 2010-01-01 00:02:22       0     0.0      []     0.0       0     0.0    true   1.1.1.1     0.0   test-grid       1 |
                            | 2010-01-01 00:02:22       0     0.0      []     0.0       0     0.0    true   2.2.2.2     0.0   test-grid       2 |
                            +-------------------------------------------------------------------------------------------------------------------+

                          Comment


                          • #14
                            Thanks a lot Robert Picard. It worked.

                            One final favour! I know a bit of regular expressions but not much. Can you kindly elaborate on what the following means or how to read/understand it? It would help me for similar purposes in future.

                            Code:
                            `""([^"]*)": "?([^"]*)"?"'​

                            Comment


                            • #15
                              Sure. In Stata, to define a string that contains one or more double quotes, you must use compound double quotes (see help quotes). The left delimiter is `" and the right is "'. For example:

                              Code:
                              count if store name == `"TOYS "R" US"'
                              So the regex pattern is delimited with `" and "'. Without these, the pattern is
                              Code:
                              "([^"]*)": "?([^"]*)"?
                              and here are some of the strings you are trying to match:
                              Code:
                              "beat_time": "2010-01-01 00:02:22",
                              "energy": 0.0,
                              The pattern is defining two sub-expressions, delimited with parentheses. The first is for the item and the second for the value. Both sub-expression are defined using the same construct:
                              Code:
                              [^"]*
                              The above is, in regex parlance, a negated character class, i.e. square brackets with the first character within being a ^ (caret). The next character is a double quote. So [^"] means any character except a double quote. The * is a repetition operator that means zero or more. So [^"]* will match zero or more characters that are not double quotes.

                              Since the value is sometimes delimited by double quotes and sometimes not, the ? repetition operator is used. The question mark means zero or one, making the double quote optional around the second sub-expression.

                              The first sub-expression is returned in regexs(1) and the second in regexs(2). Because of how StataCorp has chosen to implement regular expressions, you need to fully match the expression twice to extract both sub-expressions. It would be nice if they developed a command that would, à la tsrevar, generate temporary variables with all the sub-expressions matched.

                              Comment

                              Working...
                              X