JSON

John Michael Perdue

Join Date: Aug 2015

Posts: 28
#1

JSON

10 Aug 2015, 03:26

Hello, I am trying to import bulk data in JSON format. I have downloaded the data from the web in zip file. I used the insheetjson command

but recieved an invalid file specification error.

Ultimately I dont want all of the data just certain columns, but I dont know how to specify that when I import. I have attached the link to the data that I am trying to get.

Any help will be much appreciated

Best regards, Michael

Home · unitedstates/congress Wiki · GitHub

https://github.com

congress - Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
Tags: None
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#2

10 Aug 2015, 07:45

I have never used this user written program, but in the help file (type help insheetjson) there are specific instructions and examples for importing it to Stata. Please read the FAQ to able other users of this user written program to help you next time. Especially, what exactly you type and what Stata gave you back.
Comment
John Michael Perdue

Join Date: Aug 2015

Posts: 28
#3

10 Aug 2015, 08:25

Thank you Oded,
I have uploaded an attachment that shows exactly what command I gave to Stata and the exact error that I received back.

Thank you
Best regards,
Michael

Attached Files
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

10 Aug 2015, 10:34

I'm not familiar with the JSON format and have never used insheetjson (from SSC) but I gave it a try. I downloaded the following zip file:

Code:

http://unitedstates.sunlightfoundation.com/congress/data/113.zip

The archive is big, it contains 16,674 files (about 120MB) in JSON format. In the few tests I did, I was able to extract some data using insheetjson. For example, the following appears to work:

Code:

clear
gen str80 id = ""
gen str80 state = ""
gen str80 display_name = ""
gen str80 party = ""

insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
    table(votes Present) col("id" "state" "display_name" "party")
compress

However, I could not figure out a syntax that worked to get data with the following structure

Code:

  "votes": {
    "Not Voting": [
      {
        "display_name": "Amodei", 
        "id": "A000369", 
        "party": "R", 
        "state": "NV"
      },

which should be something like

Code:

clear
gen str80 id = ""
gen str80 state = ""
gen str80 display_name = ""
gen str80 party = ""

insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
    table(votes "Not Voting") col("id" "state" "display_name" "party")
compress

I might be wrong but my guess is that using insheetjson to extract data like this may be pushing the design capabilities of insheetjson.

However, the most challenging problem will be to deal with the sheer number of files. Perhaps it will be enough to just be able to look for the information within the raw JSON text. These raw FSON files can be input in Stata and easily combined using filelist, from SSC. Here's an example.

Code:

* based on files contained in
* http://unitedstates.sunlightfoundation.com/congress/data/113.zip
* -filelist- if from SSC; to install: ssc install filelist
filelist , dir("113/amendments/hamdt") pattern("*.json")
tempfile flist
save "`flist'"
local nfiles = _N

* import each file into Stata
forvalues i=1/`nfiles' {

    use "`flist'" in `i', clear
    local filepath = dirname + "/" + filename
    dis "reading `filepath'"
    
    qui import delimited "`filepath'", clear
    gen source = "`filepath'"
    gen line_id = _n
    gen strL json_line = v1
    drop v1
    
    tempfile json`i'
    qui save "`json`i''"
    
}

* combine all datasets
clear
forvalues i=1/`nfiles' {
    append using "`json`i''"
}

* -leftalign- is from SSC; to install: ssc install leftalign
format %80s json_line
leftalign

Comment

John Michael Perdue

Join Date: Aug 2015

Posts: 28
#5

10 Aug 2015, 11:37

Thank you Robert,
I was not even able to download the data using the first line of code that you provided due to the attached error.

Perhaps I need to move the data to a different environment to make it more manageable before importing into Stata.

Best regards,

Michael
Attached Files
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#6

10 Aug 2015, 11:48

Start by downloading the zip archive and expanding it on your computer. Then change Stata's current directory to the directory that contains the expanded archive (help cd). The directory for the expanded archive is called "113". Then, you can run the example I gave, which is

Code:

clear gen str80 id = "" gen str80 state = "" gen str80 display_name = "" gen str80 party = "" insheetjson id state display_name party using "113/votes/2014/h1/data.json", /// table(votes Present) col("id" "state" "display_name" "party") compress

not just the insheetjson command as you did in your screen shot.
Comment
John Michael Perdue

Join Date: Aug 2015

Posts: 28
#7

10 Aug 2015, 12:13

Robert, thank you again for your response! I downloaded the data and expanded it on my desktop. I then set my cd

Code:

cd "C:\Users\JayaRam\Desktop\New folder (3)\113"

Then I ran the code you gave (11:48), but I received an error:

Code:

insheetjson id state display_name party using "113/votes/2014/h1/data.json", /// option / not allowed

I'm not sure what I am doing wrong; I'm trying to fiddle with it a bit, but no results yet.
Can you just take a quick look to make sure that I am on the right track?

Thanks again,
Michael
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#8

10 Aug 2015, 12:24

The "///" characters at the end of the line indicate that the command continues on the next line (see help comments). The indicators are not supported by Stata's command window. You must either remove them and put the whole command on a single line or put the whole thing in a do-file and run the code from there.
Comment
John Michael Perdue

Join Date: Aug 2015

Posts: 28
#9

10 Aug 2015, 13:15

Thanks Robert,

I got it to work! Now I have a lot of fiddling to do but at least I am off to a good start! Thanks again
Comment

Karthik Dinne

Join Date: May 2016
Posts: 6

#10

04 May 2016, 08:51

Originally posted by Robert Picard View Post

Start by downloading the zip archive and expanding it on your computer. Then change Stata's current directory to the directory that contains the expanded archive (help cd). The directory for the expanded archive is called "113". Then, you can run the example I gave, which is

Code:

clear
gen str80 id = ""
gen str80 state = ""
gen str80 display_name = ""
gen str80 party = ""

insheetjson id state display_name party using "113/votes/2014/h1/data.json", ///
table(votes Present) col("id" "state" "display_name" "party")
compress

not just the insheetjson command as you did in your screen shot.

Robert Picard I am facing insheeting json. Can you please guide? My .json file looks like this

Code:

{
  "snapshot_list": [
    {
      "beat_time": "2010-01-01 00:02:22",
      "node_list": [
        
        {
          "power": 0,
          "energy": 0.0,
          "flag_list": [],
          "current": 0.0,
          "credits": 0,
          "voltage": 0.0,
          "timeout": true,
          "address": "1.1.1.1",
          "price": 0.0
        },
        {
          "power": 0,
          "energy": 0.0,
          "flag_list": [],
          "current": 0.0,
          "credits": 0,
          "voltage": 0.0,
          "timeout": true,
          "address": "2.2.2.2",
          "price": 0.0
        }
      ]
    }
  ],
  "grid_name": "test-grid"
}

I tried using the insheetjson command in the format as you mentioned in earlier replies but it isn't working out!

My code looks as below

Code:

clear
gen str power = ""
gen str energy = ""
gen str flat_list = ""
gen str current = ""
gen str credits = ""
gen str voltage = ""
gen str timeout = ""
gen str address = ""
gen str price = ""


*** tried various combinations of variables to written inside table()***

insheetjson power energy flat_list current credits voltage timeout address price using "test.json", ///
table (snaptshot_list beat_time node_list) col("power" "energy" "flat_list" "current" "credits" "voltage" "timeout" "address" "price")

Last edited by Karthik Dinne; 04 May 2016, 08:57.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#11

04 May 2016, 10:05

I'm not a user of insheetjson and since JSON files are simple text files, I think it's easier to just extract the information the good old fashion way (using standard data management techniques). Here's one way to get there:

Code:

clear
import delimited using "test.json"
drop v2

* to install, type in Stata's Command window: ssc install leftalign
leftalign

* extract the components
gen item =  ustrregexs(1) if ustrregexm(v1,`""(.*?)": "?([^"]*)"?"')
gen value = ustrregexs(2) if ustrregexm(v1,`""(.*?)": "?([^"]*)"?"')

* since there are more than one item of the same name, use an increment
keep if !mi(item)
gen level = sum(item == "power")

leftalign
list item value level, sepby(level)

and here's the results

Code:

. list item value level, sepby(level)

     +---------------------------------------------+
     | item            value                 level |
     |---------------------------------------------|
  1. | snapshot_list   [                         0 |
  2. | beat_time       2010-01-01 00:02:22       0 |
  3. | node_list       [                         0 |
     |---------------------------------------------|
  4. | power           0                         1 |
  5. | energy          0.0                       1 |
  6. | flag_list       []                        1 |
  7. | current         0.0                       1 |
  8. | credits         0                         1 |
  9. | voltage         0.0                       1 |
 10. | timeout         true                      1 |
 11. | address         1.1.1.1                   1 |
 12. | price           0.0                       1 |
     |---------------------------------------------|
 13. | power           0                         2 |
 14. | energy          0.0                       2 |
 15. | flag_list       []                        2 |
 16. | current         0.0                       2 |
 17. | credits         0                         2 |
 18. | voltage         0.0                       2 |
 19. | timeout         true                      2 |
 20. | address         2.2.2.2                   2 |
 21. | price           0.0                       2 |
 22. | grid_name       test-grid                 2 |
     +---------------------------------------------+

Comment

Karthik Dinne

Join Date: May 2016

Posts: 6
#12

05 May 2016, 04:56

Thanks Robert Picard. This is helpful. But I want the data in the following format. Can you also please suggest a way to do this transformation?
beat_time power energy flag_list current credits voltage timeout address price grid_name

01-01-10 2:22 0 0 0 0 0 true 1.1.1.1 0 test-grid

01-01-10 2:22 0 0 0 0 0 true 2.2.2.2 0 test-grid

Note: All rows in the the extracted data should have same "beat_time" and "grid_name"

Last edited by Karthik Dinne; 05 May 2016, 05:02.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#13

05 May 2016, 07:46

The following works with your sample data:

Code:

clear
import delimited using "test.json"
drop v2

* extract the components
gen item =  regexs(1) if regexm(v1,`""([^"]*)": "?([^"]*)"?"')
gen value = regexs(2) if regexm(v1,`""([^"]*)": "?([^"]*)"?"')

* since there are more than one item of the same name, use an increment
keep if !mi(item)
gen level = sum(item == "power")

* reshape to wide form
drop if value == "["
drop v1
reshape wide value, i(level) j(item) string
rename value* *

replace beat_time = beat_time[1]
replace grid_name = grid_name[_N]

drop if level == 0
order beat_time power energy flag_list current credits voltage timeout address price grid_name
list , noobs compress

and the results:

Code:

. list , noobs compress

  +-------------------------------------------------------------------------------------------------------------------+
  |           beat_time   power   ene~y   fla~t   cur~t   cre~s   vol~e   tim~t   address   price   grid_name   level |
  |-------------------------------------------------------------------------------------------------------------------|
  | 2010-01-01 00:02:22       0     0.0      []     0.0       0     0.0    true   1.1.1.1     0.0   test-grid       1 |
  | 2010-01-01 00:02:22       0     0.0      []     0.0       0     0.0    true   2.2.2.2     0.0   test-grid       2 |
  +-------------------------------------------------------------------------------------------------------------------+

Comment

Karthik Dinne

Join Date: May 2016

Posts: 6
#14

05 May 2016, 14:49

Thanks a lot Robert Picard. It worked.

One final favour! I know a bit of regular expressions but not much. Can you kindly elaborate on what the following means or how to read/understand it? It would help me for similar purposes in future.

Code:

`""([^"]*)": "?([^"]*)"?"'
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#15

06 May 2016, 08:52

Sure. In Stata, to define a string that contains one or more double quotes, you must use compound double quotes (see help quotes). The left delimiter is `" and the right is "'. For example:

Code:

count if store name == `"TOYS "R" US"'

So the regex pattern is delimited with `" and "'. Without these, the pattern is

Code:

"([^"]*)": "?([^"]*)"?

and here are some of the strings you are trying to match:

Code:

"beat_time": "2010-01-01 00:02:22", "energy": 0.0,

The pattern is defining two sub-expressions, delimited with parentheses. The first is for the item and the second for the value. Both sub-expression are defined using the same construct:

Code:

[^"]*

The above is, in regex parlance, a negated character class, i.e. square brackets with the first character within being a ^ (caret). The next character is a double quote. So [^"] means any character except a double quote. The * is a repetition operator that means zero or more. So [^"]* will match zero or more characters that are not double quotes.

Since the value is sometimes delimited by double quotes and sometimes not, the ? repetition operator is used. The question mark means zero or one, making the double quote optional around the second sub-expression.

The first sub-expression is returned in regexs(1) and the second in regexs(2). Because of how StataCorp has chosen to implement regular expressions, you need to fully match the expression twice to extract both sub-expressions. It would be nice if they developed a command that would, à la tsrevar, generate temporary variables with all the sub-expressions matched.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment