Directly importing from dropbox.com

Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#1

Directly importing from dropbox.com

23 Jun 2022, 08:38

Hi all,

I use Dropbox Plus both through the desktop app and online. I have very large datasets to handle and cannot keep them synced locally through the app, because they would take up too much disk space. Instead, I can keep them online only, stored on the Dropbox servers.
Now, the problem is that I need to use those .dta files directly from dropbox.com, which apparently is not as easy as I thought. My naive approach was simply to copy and paste the link provided by dropbox.com:

Code:

use "https://www.dropbox.com/s/cxmbo2gsw8yuoic/pcs.dta",clear

However, it did not work. Can someone help me out with this, please?
Tags: data, panel data, Suggestion
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#2

23 Jun 2022, 09:02

Hi Federico,

What error does this line produce?
Comment
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#3

23 Jun 2022, 09:05

Hi,

the error is the following:

Code:

file https://www.dropbox.com/s/cxmbo2gsw8yuoic/pcs.dta not Stata format
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#4

23 Jun 2022, 09:21

Okay, so my guess is that the link is serving you the HTML for the webpage and not the file itself. Check out this piece of Dropbox documentation: https://help.dropbox.com/files-folde...force-download

Does this work?

Code:

use "https://www.dropbox.com/s/cxmbo2gsw8yuoic/pcs.dta?dl=1",clear
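
If you build these links in a script rather than by hand, the query flag can be set programmatically. This is a minimal Python sketch using only the standard library; the function name `with_query_flag` is made up for illustration, and it assumes the share link behaves as described in the Dropbox documentation linked above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_query_flag(share_url: str, flag: str = "dl", value: str = "1") -> str:
    """Return share_url with ?flag=value set, replacing any existing dl/raw flag."""
    parts = urlsplit(share_url)
    # Keep unrelated query parameters, drop any existing dl= or raw= setting.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in ("dl", "raw")
    ]
    query.append((flag, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Both calls print the same link ending in ?dl=1.
print(with_query_flag("https://www.dropbox.com/s/cxmbo2gsw8yuoic/pcs.dta"))
print(with_query_flag("https://www.dropbox.com/s/cxmbo2gsw8yuoic/pcs.dta?dl=0"))
```

Passing `flag="raw"` gives the `?raw=1` variant instead.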
Comment
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#5

23 Jun 2022, 09:43

Actually, it is taking a long time, so I forced a break. This happens even with a small .dta file.

Last edited by Federico Nutarelli; 23 Jun 2022, 09:50.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#6

23 Jun 2022, 10:16

Okay, are we talking about something on the order of 5 minutes, or 20 to 30 minutes of wait time before you force a break? Keep in mind that in this setup you will have to download all of the content each time you invoke the -use- command. Can you please try this instead?

Code:

use "https://www.dropbox.com/s/cxmbo2gsw8yuoic/pcs.dta?raw=1",clear
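
If the repeated download is the concern, one option is to fetch the file once and reuse the local copy afterwards. A rough Python sketch, assuming the URL really returns the file itself; `fetch_once` and the `.part` suffix are names invented here, not part of Stata or Dropbox.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fetch_once(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    target = Path(dest)
    if target.exists():
        return False
    partial = target.with_name(target.name + ".part")
    # Stream to a temporary name so an interrupted download never
    # leaves a truncated file under the final name.
    with urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return True
```

Calling something like `fetch_once(url, "pcs.dta")` before `use "pcs.dta", clear` would then hit the network only on the first run. The file still lands on your disk, of course.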
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35651
#7

23 Jun 2022, 10:35

Cross-posted and answered at https://stackoverflow.com/questions/...y-from-dropbox

Please note our policy on cross-posting, which is that you should tell us about it.
1 like
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#8

23 Jun 2022, 11:15

Let me be the one to tell you that this is a Python problem. I know, I know, you may not know Python, but assuming you have Stata 17, Python will literally be your best friend in this situation, specifically the Selenium web driver library.

I haven't looked at the post Nick mentioned, but if this were my problem I'd likely use Python to grab it from online.

Suppose we wanna download CDC data on vaccinations for COVID-19.

Code:

python:
import time, os
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

options = webdriver.ChromeOptions()
preferences = {"download.default_directory": os.getcwd(), "directory_upgrade": True}
options.add_experimental_option("prefs", preferences)
#options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])

url = "https://tinyurl.com/ygxx9ede"

# Path of my WebDriver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
wait = WebDriverWait(driver, 20)

# to maximize the browser window
driver.maximize_window()

# get method to launch the URL
driver.get(url)

paths = ["#app > div > div:nth-child(2) > div > div > div.entry-header > div > div.entry-actions > div > div:nth-child(3) > button",
         "#export-flannel > section > ul > li:nth-child(1) > a"]

for x in paths:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, x))).click()
end

My Python code is not perfect, and there exist Pythonistas who can run many circles around me. But this works from a Stata terminal. It grabs the same data from the same place pretty much every single time. It is efficient, and allows you to fully automate your data collection process. You'll likely need to learn to fill in boxes and forms to log into your Dropbox, and all the other relevant stuff. But even though I'm an atheist, I swear to God you wanna learn Python, particularly if you're a young researcher like me who uses a variety of datasets from a wide variety of different places and doesn't wanna manually recollect data each time you need to do a paper.

Last edited by Jared Greathouse; 23 Jun 2022, 11:33.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#9

23 Jun 2022, 11:46

From the Stack Overflow thread:

To be clear, you can pass a URL in use, but then you need a URL that returns a Stata dataset and not instructions to a browser, the way Dropbox does. And the dataset is nevertheless downloaded to your computer when you do that, as in order for Stata to read a dataset it first needs to be on your computer. Whether you download it manually yourself first or let Stata do it to a temporary folder first makes no difference to your disk space requirements.
– TheIceBear
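
A quick way to tell which of the two a link returns is to look at the first bytes of the response. The sketch below is only a heuristic with an invented function name (`classify_head`): it relies on the fact that .dta files in the newer formats (117 and later) begin with the literal tag `<stata_dta>`, and it does not try to recognize older binary-header .dta formats.

```python
def classify_head(head: bytes) -> str:
    """Guess what a response body is from its first bytes.

    Returns "dta" for the newer XML-like .dta header, "html" for a web
    page, and "unknown" for anything else (including older .dta formats).
    """
    if head.startswith(b"<stata_dta>"):
        return "dta"
    start = head.lstrip().lower()
    if start.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


print(classify_head(b"<stata_dta><header><release>118</release>"))  # dta
print(classify_head(b"<!DOCTYPE html><html lang='en'>"))             # html
```

An "html" result for the plain share link would be consistent with the "not Stata format" error reported earlier in the thread.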

This is exactly the problem that ?dl=1 or ?raw=1 are supposed to solve. These are two slightly different implementations of a way to get the file directly rather than the HTML, and in general this is how one should programmatically download files from Dropbox. Of course, I have no idea whether or not Stata's -use- command can handle a redirect (as with ?raw=1), and OP has clearly concluded that ?dl=1 doesn't work. One downside of a high-level language like Ado is that you don't usually have low-level control of things like this. TheIceBear also makes an excellent point in the other thread when he says the data needs to be downloaded anyway. Your 7 gig file will almost certainly not fit in memory (RAM), and will likely have to be written to the disk when you download it regardless.

EDIT: ?dl=1 might trigger a browser command of some kind, which is why I think ?raw=1 might be better. It renders the file in the browser, which is a bit of a red herring, actually. When a server provides a web page or other file for rendering, it is really just allowing a client to download the object directly. The trick, of course, is handling the redirect and getting the direct URI for the file.
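
For what it's worth, outside of Stata the redirect can be inspected in a few lines. This is a sketch in Python, not something I have run against Dropbox: `inspect_url` is a made-up name, and it leans on the fact that `urllib.request.urlopen` follows HTTP redirects by default and exposes the final address through `geturl()`.

```python
from urllib.request import urlopen


def inspect_url(url: str, timeout: float = 30.0) -> dict:
    """Open url, follow any redirects, and report what is finally served.

    Only the headers and the first 16 bytes are read, so the full
    body is not downloaded.
    """
    with urlopen(url, timeout=timeout) as response:
        return {
            "final_url": response.geturl(),
            "content_type": response.headers.get("Content-Type"),
            "content_length": response.headers.get("Content-Length"),
            "head": response.read(16),
        }
```

If `final_url` differs from the link you passed in, a redirect happened, and `content_type` and `head` give a hint as to whether the end of the chain is the file or a web page.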

Last edited by Daniel Schaefer; 23 Jun 2022, 11:59.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#10

23 Jun 2022, 12:09

Jared Greathouse, it looks like your Python script loads the webpage, goes through every clickable <div> on the page, and then clicks each one? I think this is probably overkill, and it isn't really what OP wants, since it will ultimately just download the file to the filesystem anyway, right?
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#11

23 Jun 2022, 15:54

My code loads the webpage and clicks two buttons. It doesn't go through all the <div> items, as this indeed would be overkill.

And once we've clicked those two buttons, the file begins to download in the user's current working directory. Isn't that about what OP wanted? Daniel Schaefer Perhaps I've misunderstood?
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#12

23 Jun 2022, 16:07

Jared Greathouse, I could be wrong, but I believe OP is looking for a way to load a .dta file hosted on Dropbox directly into his local RAM so that he doesn't need to store it on his hard drive. Not that it particularly matters: my guess is that OP has discovered that this isn't practical for a few reasons.

It's a neat script regardless. I also prefer Python for crawling websites.
Comment
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#13

24 Jun 2022, 02:15

Nick Cox I am sorry about cross-posting. I did not know the rule.
Thank you all for the replies. Actually, I do use Python and Selenium, but the analyses that we want to do are better performed in Stata.
Daniel Schaefer is right, actually. I found out that this is not possible, so in the end I am trying to keep locally only the .dta files that are strictly needed, compressing them, and to store the other ones online. However, if this also turns out to be infeasible, I will have to move to Python.

Thanks all for the kind replies and sorry again for cross-posting
1 like
Comment
