How can I download all files at once from a data request?

When you request a dataset for download from the Data Portal, there are many ways to work with the results. Sometimes, rather than accessing the data through THREDDS (for example via .ncml or the subset service), you just want to download all of the files and work with them on your own machine.

There are several methods you can use to download your delivered files from the server en masse, including:

  • shell – curl or wget
  • python – urllib2 (urllib.request in Python 3)
  • java – java.net.URL

Below, we detail how you can use wget or Python to do this.

It’s important to note that the email notification you receive from the system will contain two different web links. They look very similar, but the directories they point to differ slightly.

First Link: https://opendap.oceanobservatories.org/thredds/catalog/ooi/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument/catalog.html

The first link (which includes thredds/catalog/ooi) points to your dataset on a THREDDS server. THREDDS provides additional capabilities to aggregate or subset the data files if you use a THREDDS- or OPeNDAP-compatible client, like ncread in MATLAB or pydap in Python.

Second Link: https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument

The second link points to a traditional Apache web directory. From here, you can download files directly to your machine by simply clicking on them.
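
The two example links above differ only in their path, so one can be derived from the other. The sketch below uses a hypothetical helper, thredds_to_apache (it is not part of any OOI tool), and assumes that every request follows the same pattern as the example pair: thredds/catalog/ooi/ becomes async_results/ and the trailing catalog page is dropped.

```python
def thredds_to_apache(thredds_url):
    """Convert a THREDDS catalog link into the matching Apache directory link.

    Assumes the URL layout of the two example links shown above.
    """
    marker = '/thredds/catalog/ooi/'
    if marker not in thredds_url:
        raise ValueError('Not a THREDDS catalog link: ' + thredds_url)
    apache_url = thredds_url.replace(marker, '/async_results/', 1)
    # Drop the trailing catalog page, if present
    for suffix in ('/catalog.html', '/catalog.xml'):
        if apache_url.endswith(suffix):
            apache_url = apache_url[:-len(suffix)]
    return apache_url
```

If a link does not match the assumed layout, the helper raises a ValueError rather than guessing.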

Using wget

First, make sure you have wget installed on your machine. If you are on a Mac and have the Homebrew package manager installed, you can type the following in the terminal:

brew install wget

Alternatively, you can get wget from GitHub: https://github.com/jay/wget

Once wget is installed, you can recursively download an entire directory of data using the following command. Replace URL with the second (Apache) web link provided by the system:

wget -r -l1 -nd -nc -np -e robots=off -A.nc --no-check-certificate URL 

This simpler version may also work, although it does not restrict the file type or the recursion depth:

wget -r -nd -np -e robots=off URL

Here is an explanation of the specified flags.

  • -r signifies that wget should recursively download data in any subdirectories it finds.
  • -l1 sets the maximum recursion to 1 level of subfolders.
  • -nd saves all matching files to the current directory instead of recreating the server's directory tree. If two files have identical names, wget appends a numeric extension to the later one.
  • -nc does not download a file if it already exists.
  • -np prevents files from parent directories from being downloaded.
  • -e robots=off tells wget to ignore the robots.txt file. If this option is left out, the robots.txt file tells wget that the server does not allow web crawlers, and this will prevent wget from working.
  • -A.nc restricts downloading to the specified file types (with the .nc suffix in this case).
  • --no-check-certificate disregards the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.
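
If you would rather launch the download from a Python script, the flags explained above can be assembled into an argument list and handed to wget through the standard subprocess module. This is only a sketch: build_wget_command and download_directory are hypothetical helper names, and it assumes wget is installed and available on your PATH. The certificate check is left on unless you explicitly turn it off.

```python
import subprocess


def build_wget_command(url, suffix='.nc', check_certificate=True):
    """Assemble the wget command described above as a list of arguments."""
    command = [
        'wget',
        '-r',                # recurse into subdirectories
        '-l1',               # but only one level deep
        '-nd',               # save everything into the current directory
        '-nc',               # skip files that already exist locally
        '-np',               # never ascend to the parent directory
        '-e', 'robots=off',  # ignore robots.txt
        '-A', suffix,        # accept only files with this suffix
    ]
    if not check_certificate:
        command.append('--no-check-certificate')
    command.append(url)
    return command


def download_directory(url, **kwargs):
    """Run wget on the Apache directory link and return wget's exit code."""
    return subprocess.call(build_wget_command(url, **kwargs))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems if the URL contains unusual characters.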

Using Python

wget is rather blunt: it downloads every file it finds in a directory, though as noted above you can restrict it to a particular file extension.

If you want to be more granular about which files you download, you can use Python to parse through the data file links it finds and have it download only the files you really want. This is especially useful when your download request results in a lot of large data files, or if the request includes files from many different instruments that you may not need.
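
As a rough illustration of this kind of filtering, the sketch below keeps only the file names that mention a given instrument and overlap a given time window. It assumes that each file name ends with a start and end timestamp, as in ..._20150403T190000-20150504T235900.nc; that naming pattern, the select_files helper, and any file names you try it with are assumptions made for illustration, so check them against the names in your own request.

```python
import re
from datetime import datetime

# Assumed naming pattern: ..._<start>-<end>.nc, with timestamps such as
# 20171012T172409 that may be followed by fractional seconds
TIME_RANGE = re.compile(
    r'_(\d{8}T\d{6})(?:\.\d+)?-(\d{8}T\d{6})(?:\.\d+)?\.nc$')


def select_files(names, instrument=None, start=None, end=None):
    """Return names that contain `instrument` and overlap [start, end].

    Names without a recognizable time range are kept, not dropped.
    """
    selected = []
    for name in names:
        if instrument and instrument not in name:
            continue
        match = TIME_RANGE.search(name)
        if match and (start or end):
            file_start = datetime.strptime(match.group(1), '%Y%m%dT%H%M%S')
            file_end = datetime.strptime(match.group(2), '%Y%m%dT%H%M%S')
            if start and file_end < start:
                continue  # file ends before the window opens
            if end and file_start > end:
                continue  # file starts after the window closes
        selected.append(name)
    return selected
```

For example, select_files(names, instrument='METBKA', start=datetime(2015, 10, 1)) would keep only METBKA files that contain data from October 2015 onward.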

Here is an example script that uses the THREDDS service to find all .nc files included in the download request. Under the hood, THREDDS provides a catalog.xml file, which we can use to extract the links to the available data files. This XML file is easier to parse than raw HTML.

The first part of the main() function builds a list of all of the files we would like to download (in this case, only ones ending in .nc), and the second part actually downloads them using urllib.request.urlretrieve(). The script is written for Python 3. If you want to download only files from particular instruments, or within specific date ranges, you can customize the code to filter out just the files you want (e.g. using regex).

#!/usr/bin/env python3
# Script to download all .nc files from a THREDDS catalog directory
# Written by Sage 4/5/16

from xml.dom import minidom
from urllib.request import urlopen, urlretrieve

# Divide the url you get from the data portal into two parts
# Everything before "catalog/"
server_url = 'http://opendap-devel.ooi.rutgers.edu:8090/thredds/'
# Everything after "catalog/"
request_url = 'ooi/sage-rutgers-edu/96bb5f8b-07c6-44c6-bc6c-6982f5d4d238/'


def get_elements(url, tag_name, attribute_name):
    """Get the named attribute from every matching element in an XML file"""
    usock = urlopen(url)
    xmldoc = minidom.parse(usock)
    usock.close()
    tags = xmldoc.getElementsByTagName(tag_name)
    return [tag.getAttribute(attribute_name) for tag in tags]


def main():
    # Part 1: walk the catalog and collect the paths of all .nc files
    url = server_url + 'catalog/' + request_url + 'catalog.xml'
    catalog = get_elements(url, 'catalogRef', 'xlink:href')
    files = []
    for c in catalog:
        dataset_url = server_url + 'catalog/' + request_url + c
        datasets = get_elements(dataset_url, 'dataset', 'urlPath')
        for d in datasets:
            if d.endswith('.nc'):
                files.append(d)
    # Part 2: download each file, numbering the local names so none collide
    for count, f in enumerate(files, start=1):
        file_url = server_url + 'fileServer/' + f
        file_prefix = file_url.split('/')[-1][:-3]
        file_name = file_prefix + '_' + str(count) + '.nc'
        print('Downloading ' + str(count) + ' of ' + str(len(files)) + ' ' + f)
        urlretrieve(file_url, file_name)


# Run main function when in command line mode
if __name__ == '__main__':
    main()

Don’t forget to update the server_url and request_url variables before running the code. The script uses only modules from the Python standard library, so there is nothing extra to install.
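
To see what the script is reading, it can help to run the same minidom calls on a small catalog by hand. The XML below is a simplified, made-up stand-in for a THREDDS catalog.xml (real catalogs carry more elements and attributes, and the paths here are invented), but it shows the two things the script looks for: catalogRef elements with an xlink:href attribute and dataset elements with a urlPath attribute. The get_attributes helper is a hypothetical variant of get_elements that parses a string instead of a URL.

```python
from xml.dom import minidom

# Simplified, invented example of a THREDDS catalog.xml document
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <dataset name="example-request" ID="ooi/example-user/example-request">
    <dataset name="data_file_1.nc"
             urlPath="ooi/example-user/example-request/data_file_1.nc"/>
    <dataset name="data_file_1.ncml"
             urlPath="ooi/example-user/example-request/data_file_1.ncml"/>
    <catalogRef xlink:href="subfolder/catalog.xml" xlink:title="subfolder" name=""/>
  </dataset>
</catalog>
""".strip()


def get_attributes(xml_text, tag_name, attribute_name):
    """Return the named attribute of every element with the given tag."""
    xmldoc = minidom.parseString(xml_text)
    return [tag.getAttribute(attribute_name)
            for tag in xmldoc.getElementsByTagName(tag_name)]


# Elements without a urlPath yield an empty string, which the filter drops
nc_files = [path
            for path in get_attributes(SAMPLE_CATALOG, 'dataset', 'urlPath')
            if path.endswith('.nc')]
sub_catalogs = get_attributes(SAMPLE_CATALOG, 'catalogRef', 'xlink:href')
```

Here nc_files holds only the path that ends in .nc, and sub_catalogs holds the relative link to the nested catalog that the script would open next.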

— Last revised on November 16, 2017 —