OOI Raw Data are currently accessible through the Raw Data Archive
OOI Raw Data Archive
- What is the raw data archive, and what is it for?
- How is “raw” data defined in OOI?
- What is stored in the raw data archive?
- What is NOT in the raw data archive, and how do I get any data I can’t find there?
- How is the archive organized?
- How do I find a specific instrument/platform I’m interested in?
- What do the vocabulary terms mean, e.g. DCL, etc.?
- Where is the metadata?
- How do I know what data are from a valid time range, not including test data or bad data?
- How do I use these raw data formats?
- How do I download an entire raw data directory?
What is the raw data archive, and what is it for?
The OOI system has a budget-defined amount of data storage capacity that can be devoted to files that cannot be delivered via uFrame (i.e. plotting or synchronous download). These data are stored in a ~200 TB Network Attached Storage (NAS) system at Rutgers University that provides secure access to a limited number of files.
The raw data partition of the NAS is an Apache server with a capacity of ~86 TB, organized by platform reference designator. The hardware architecture is scalable to allow the NAS to grow over time and allow larger amounts of data over a longer time range to be stored and served to users in the future (with certain constraints).
This server provides access to the raw data so that subject matter experts (SMEs) can assist with quality control of specialized instrumentation, users can perform their own analyses using their own scripts or software, and the Marine Implementing Organizations (MIOs) can confirm operation of deployed instrumentation.
How is “raw” data defined in OOI?
“Raw” indicates data as they are received directly from the instrument, in instrument-specific format. Depending on the instrument, they may contain data for multiple sensors (interleaved), be in native sensor units (e.g., counts, volts), or have processing steps already performed within the instrument (e.g., primary calibration). This also includes large-format audio/visual data (HD video, still images, and hydrophone data). Data may not be in familiar vendor formats (see “How do I use these raw data formats?” below).
What is stored in the raw data archive?
All raw data from all platforms and instruments are stored by OOI, including Antelope data (hydrophones and seismometers). However, only a subset of the total stored raw data is delivered via the NAS raw data archive.
Based on its current size, the raw data partition of the NAS should hold:
- All uncabled raw data for an initial period of 10 years
- All cabled raw data (minus Antelope and HD Video) for an initial period of 10 years
- An initial period of 6 months of broadband hydrophone (HYDBB) data
- An initial period of 6 months of full-resolution HD Video data (.mov files)
- An initial period of 10 years of compressed HD Video data (.mp4 files)
Currently in the system, uncabled raw data total ~1.3 TB, cabled non-A/V raw data total ~1.1 TB, and large-format A/V data (HYDBB and HD Video) total ~50 TB.
If the partition utilization rises above a given threshold (e.g., 80% full, as an initial value), all active users who signed up for email notifications will receive a message that a purge will commence soon. Data are purged from the partition after the initial periods of time stated above, or when utilization hits the threshold value, starting with the oldest data.
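As a rough illustration of the purge rule, the sketch below adds up the approximate volumes quoted above and compares them to the ~86 TB partition and the 80% initial threshold. The function and variable names are hypothetical, and the figures are the approximate values from this page, not live measurements.

```python
# Approximate figures quoted on this page, in TB (not live measurements).
PARTITION_CAPACITY_TB = 86.0
PURGE_THRESHOLD = 0.80  # initial threshold value; may change

stored_tb = {
    "uncabled": 1.3,
    "cabled_non_av": 1.1,
    "large_format_av": 50.0,  # HYDBB and HD video
}


def utilization(stored, capacity_tb):
    """Return the fraction of the partition that is in use."""
    return sum(stored.values()) / capacity_tb


def purge_due(stored, capacity_tb, threshold):
    """Return True once utilization reaches the purge threshold."""
    return utilization(stored, capacity_tb) >= threshold


used_fraction = utilization(stored_tb, PARTITION_CAPACITY_TB)
print(f"{used_fraction:.0%} of the partition in use")
print("purge due:", purge_due(stored_tb, PARTITION_CAPACITY_TB, PURGE_THRESHOLD))
```

With these numbers the partition is roughly 61% full, which is below the initial threshold.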
What is NOT in the raw data archive, and how do I get any data I can’t find there?
OOI low-frequency hydrophones (HYDLF) and seismometer data (OBSBB, OBSSP) are currently only hosted by IRIS (Incorporated Research Institutions for Seismology: http://ds.iris.edu/ds/). Shipboard data collected during OOI deployment and recovery cruises are stored on the Alfresco document management system (Cruise Data Archive).
When a user submits a help desk request for raw data that are not available on the NAS, the data manager will work with the Rutgers CI to provide an alternate data delivery method:
- If the total request size is smaller than a threshold (e.g., 500 GB, as an initial value), the files will be loaded onto a dedicated staging area on the NAS. The system should support at least 10 concurrent requests for < 500 GB of raw data that are not on the NAS. Staged data will be purged 7 days after the user is notified of delivery (as an initial value).
- If the request is larger than 500 GB (as an initial value), users will be asked to deliver physical media of sufficient capacity to the Rutgers CI (care of the data manager), to be returned no later than 1 month after the request (unless other arrangements are made).
How is the archive organized?
The archive is a mirror of the data repository where all raw data enters the system. To view a file-tree structure of the archive, follow this link: https://rawdata.oceanobservatories.org/files/.tree.html
Uncabled mooring data (from any platform not attached to the electro-optical cable on the west coast) are organized by deployment/recovery number for each uncabled platform (e.g., D00001/R00001). Each deployment folder has a subfolder for each node (the control computer to which the instruments are attached), and each node folder has a subfolder for each attached instrument, which contains all data for that instrument from that deployment.
e.g. https://rawdata.oceanobservatories.org/files > CE01ISSM > D00001 > dcl16 > flort
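The platform > deployment > node > instrument layout can be turned into a URL mechanically. The following is a minimal sketch, assuming the four-level layout described above; the function name is hypothetical, and some directories contain extra levels (the CE09OSSM example later on this page includes a cg_data folder), so the result is a starting point for browsing, not a guaranteed path.

```python
from urllib.parse import quote

BASE_URL = "https://rawdata.oceanobservatories.org/files"


def uncabled_mooring_url(platform, deployment, node, instrument, base=BASE_URL):
    """Build a directory URL from the four folder levels described above.

    Assumes the simple platform/deployment/node/instrument layout;
    real directories may contain additional levels.
    """
    parts = [platform, deployment, node, instrument]
    return base + "/" + "/".join(quote(part) for part in parts) + "/"


print(uncabled_mooring_url("CE01ISSM", "D00001", "dcl16", "flort"))
```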
Uncabled mobile assets (i.e., gliders and AUVs) are organized similarly, but the subfolders below the deployment/recovery number are organized based on the glider’s internal science computer file structure. The telemetered data folders are titled D0000*, and contain subfolders called archive, from-glider, logs, and merged-from-glider. The recovered data folders are titled R0000*, and contain subfolders called cache, merged, dvl, flight, and science. In both cases, the “merged” folders contain most of the science instrument and engineering data of interest, although the glider ADCP data are contained in the “dvl” subfolder.
Cabled data are currently being pulled from two archives. Most cabled data are organized by node, which refers to the alphanumeric ID of the junction box attached directly to the undersea cable that aggregates, time-stamps, and routes the data from all instruments connected to that node. Certain high-bandwidth instruments (like hydrophones, HD video, and sonar systems) are organized by date (yyyy/mm). Eventually, all cabled data will be pulled from the same port agent log archive, and will be organized by date.
How do I find a specific instrument/platform I’m interested in?
Unless you already know the route to the specific platform and instrument you’re looking for, you should begin on the OOI website or data portal. Start on the landing page of the OOI Data Portal and navigate to the platform, instrument, or data that you are interested in, using the table of contents. Make sure that “Reference Designator” is toggled on (button at the bottom of the table of contents), so that you can see the full name of the instrument and node number. Use that as a reference when navigating within the raw data archive.
What do the vocabulary terms mean, e.g. DCL, etc.?
|BEP||Benthic Experiment Package, a cabled seafloor platform containing a Low-Power Junction Box and multiple instruments inside a trawl-resistant frame|
|Cabled||Any node or instrument attached to the electro-optical cable on the west coast|
|CPM||Platform Controller (engineering data only)|
|D0000*||Deployment folder, aka telemetered data (sent to shore while platform is deployed)|
|DCL||Data Concentrator Logger, the controller computer that concentrates the data from multiple instruments on uncabled moorings, packages the data, and telemeters to shore|
|DVL||Glider ADCP, Teledyne RDI 600 kHz Explorer DVL|
|LJ0**||Low-Power Junction Box, a smaller cabled node that powers seafloor instruments, occasionally located inside a seafloor platform|
|LV0**||Low-Voltage Node, connected to the Primary Node|
|MFN||A seafloor instrument frame at the base of coastal moorings, containing the anchor and acoustic release package, battery packs, and multiple instruments|
|MJ0**||Medium-Power Junction Box, a cabled node that powers seafloor instruments|
|PN0**||Primary Node, connected directly to the electro-optical cable, usually not instrumented|
|Profiler||A mooring installation with an instrumented platform that moves up and down vertically in the water column. One type of profiler on the OOI is a wire-following modified McLane profiler (WFP) that crawls along an inductive cable using a traction motor. These are indicated by reference designator platform codes ending in “PM” for “Profiler Mooring”. The other type of profiler is a coastal surface piercing profiler (CSPP) that uses a winch to spool out line from a buoyant instrument platform, raising and lowering it in the water. This type of profiler has a platform code that ends in “SP” for “Surface Piercing”.|
|R0000*||Recovery folder, aka recovered data (downloaded after platform is recovered)|
|Recovered||Data offloaded directly from an instrument or data logger; usually by connecting the instrument to a computer after the instrument has been recovered and writing to files, often onboard the recovery vessel.|
|Reference Designators||The machine-readable codes used to refer to arrays, sites, platforms, and instruments on the OOI. See reference sheet for site codes and the 5-letter instrument codes.|
|Streamed||Data received via transmission over electro-optical cable. Streaming data are provided at full temporal resolution and near-real time.|
|Telemetered||Data received through a transmission media over distance. Examples are: surface buoy to satellite, glider to satellite, acoustic modem. Data received through satellite relay or other mechanisms results in “batch” receipt and may be decimated in time. These data have greater latencies than the streaming data.|
|Uncabled||Any platform (mooring, glider, or profiler) not attached to the electro-optical cable on the west coast|
|X0000*||Test data, collected on deck or during integration and burn-in testing. Use at your own risk.|
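The folder-name and platform-code conventions in the table above can be checked programmatically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes folder names consist of one letter followed by five digits, as in D00001.

```python
import re

# Folder prefixes from the vocabulary table above.
FOLDER_KINDS = {"D": "telemetered", "R": "recovered", "X": "test"}


def folder_kind(name):
    """Return (kind, number) for names like D00001, or None if no match.

    Assumes one prefix letter followed by exactly five digits.
    """
    match = re.fullmatch(r"([DRX])(\d{5})", name)
    if match is None:
        return None
    return FOLDER_KINDS[match.group(1)], int(match.group(2))


def profiler_type(platform_code):
    """Classify a platform code by its profiler suffix, if it has one."""
    if platform_code.endswith("PM"):
        return "wire-following profiler mooring"
    if platform_code.endswith("SP"):
        return "coastal surface piercing profiler"
    return None


print(folder_kind("R00001"))
print(profiler_type("CE01ISSP"))
```

A check like this makes it easy to skip X0000* test folders when walking a platform directory.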
Where is the metadata?
OOI platform and instrument metadata are currently provided when users request processed data, or via the asset management page of the OOI Data Portal. A new “raw data repository” option will be added to all “Download” links in the GUI, and as a direct link on the main OOI website. Users can use the reference designator to find the instrument they’re looking for.
We are working on delivering access to vendor-provided calibration or instrument setting sheets that may be required for data analysis. Currently the marine operators enter the information from those sheets directly into the system, and the actual vendor sheets are being stored in local repositories rather than the raw data archive. The calibration values can be found here (https://github.com/ooi-integration/asset-management/tree/master/deployment), and copies of the sheets that accompany new or refurbished instruments can be provided upon request by contacting the Help Desk.
How do I know what data are from a valid time range, not including test data or bad data?
To find the valid science data range, refer to the Deployment start and Recovery end dates from the asset management information loaded into the system. You can find a graphical representation here (http://marine.rutgers.edu/cool/ooi/sensors-calendar/) or get a list of deployment and recovery dates from the data catalog in the data portal (https://ooinet.oceanobservatories.org/streams/).
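Once the deployment start and recovery end dates are known, a record can be kept or discarded with a simple comparison. The sketch below uses made-up dates purely for illustration; look up the real values in asset management. The function name is hypothetical.

```python
from datetime import datetime, timezone


def in_valid_range(timestamp, deploy_start, recover_end=None):
    """Return True if timestamp falls within the deployment window.

    recover_end may be None for a platform that is still deployed.
    """
    if timestamp < deploy_start:
        return False
    return recover_end is None or timestamp <= recover_end


# Hypothetical dates for illustration only.
deploy_start = datetime(2015, 4, 10, tzinfo=timezone.utc)
recover_end = datetime(2015, 10, 20, tzinfo=timezone.utc)

samples = [
    datetime(2015, 4, 1, tzinfo=timezone.utc),    # before deployment (e.g., deck test)
    datetime(2015, 6, 15, tzinfo=timezone.utc),   # during deployment
    datetime(2015, 11, 1, tzinfo=timezone.utc),   # after recovery
]
valid = [t for t in samples if in_valid_range(t, deploy_start, recover_end)]
print(valid)
```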
How do I use these raw data formats?
- Broadband Hydrophone
- Cabled Bioacoustic Sonar Data
- Cabled Deep Profiler (MMP)
- Cabled HD video data
- Glider Data
- Streamed Cabled Data
- Uncabled Mooring Data
- Uncabled Surface Piercing Profiler
Broadband Hydrophone
Broadband Hydrophone (HYDBB) files are currently in .mseed format. Tools for opening those files can be found on the IRIS site (http://ds.iris.edu/ds). Eventually these files will be converted into FLAC and .wav formats, which can be opened and played with most audio player software.
Cabled Bioacoustic Sonar Data
Cabled bioacoustic sonar data (ZPLSC-B; modified Kongsberg EK-60 echosounders) are in vendor format .raw files. These files can be opened using EchoView or vendor software.
Download mi-dataset from the oceanobservatories GitHub page (https://github.com/oceanobservatories/mi-dataset). There are two options for downloading the parsers: 1) download the repository from the site directly and unzip it into a folder, or 2) use git to clone the repository (i.e., enter “git clone https://github.com/oceanobservatories/mi-dataset.git” in a terminal window).
To install mi-dataset, first go to the working directory where you want to save the repository. On both Unix and Windows systems, use the cd command to change directories.
$ cd /Users/michaesm/Documents/
$ git clone https://github.com/oceanobservatories/mi-dataset.git
$ cd mi-dataset/
$ pip install -r requirements.txt
$ pip install msgpack-python
$ pip install .
This will install the mi-dataset package in Python.
To use mi-dataset, run the utility included in the package. It takes a driver and a list of raw data files as inputs and parses the files into one of several formats: csv, json, pd-pickle (a pandas DataFrame in Python), or xr-pickle (an xarray Dataset in Python). The --fmt flag selects the output file format. The --out flag selects the directory in which the parsed files are saved.
$ python utils/parse_file.py --help
Usage: parse_file.py [OPTIONS] DRIVER [FILES]...
Options:
  --fmt [csv|json|pd-pickle|xr-pickle]
  --out PATH
  --help  Show this message and exit.
A good way to decide which driver is needed to parse a specific raw data file is to check the ooi-integration csv files. Once you find the raw data directory that you would like to parse in that ingestion CSV, you will need to grab the ‘uframe_route’ on the left. A lookup table exists at https://github.com/ooi-data-review/parse_spring_files/blob/master/uframe_routes.csv which maps the ‘uframe_route’ to the proper driver that you will need to run in mi-dataset.
Here is an example of how to parse the following Ingestion CSV: https://github.com/ooi-integration/ingestion-csvs/blob/master/CE09OSSM/CE09OSSM_R00001_ingest.csv#L17
On line 17 of the above csv file, you can see the following row
uframe_route, filename_mask, reference_designator, data_source
Ingest.ctdbp-cdef-dcl_recovered, /omc_data/whoi/OMC/CE09OSSM/R00001/cg_data/dcl27/ctdbp1/*.ctdbp1.log, CE09OSSM-RID27-03-CTDBPC000, recovered_host
The raw data for the reference designator CE09OSSM-RID27-03-CTDBPC000 from recovered deployment #1 of the Washington Offshore Surface Mooring (CE09OSSM) are located at:
/omc_data/whoi/OMC/CE09OSSM/R00001/cg_data/dcl27/ctdbp1/*.ctdbp1.log
This directory corresponds to the web directory at https://rawdata.oceanobservatories.org/files/CE09OSSM/R00001/
Download the raw data files that you want to look at from that link. For this example, I downloaded the file at https://rawdata.oceanobservatories.org/files/CE09OSSM/R00001/cg_data/dcl27/ctdbp1/20150412.ctdbp1.log
Next, go to the lookup table at https://github.com/ooi-data-review/parse_spring_files/blob/master/uframe_routes.csv and search for ‘Ingest.ctdbp-cdef-dcl_recovered’. You will find that the mi-dataset driver that maps to this uframe_route is ‘mi.dataset.driver.ctdbp_cdef.dcl.ctdbp_cdef_dcl_recovered_driver’.
$ cd /Users/michaesm/Documents/mi-dataset
$ python utils/parse_file.py --fmt csv --out ./parsed mi.dataset.driver.ctdbp_cdef.dcl.ctdbp_cdef_dcl_recovered_driver /Users/michaesm/Downloads/20150412.ctdbp1.log
The parse_file.py script will parse the raw data file ‘20150412.ctdbp1.log’ with the driver ‘mi.dataset.driver.ctdbp_cdef.dcl.ctdbp_cdef_dcl_recovered_driver’ and save the parsed data as a csv in the /Users/michaesm/Documents/mi-dataset/parsed directory.
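The step from an ingestion CSV row to a web directory can be scripted. The sketch below reads the example row with Python's csv module and converts its filename_mask into a directory URL. The prefix mapping (/omc_data/whoi/OMC/ to the /files/ web root) is an assumption inferred from this one example and may not hold for every platform; the function name is hypothetical.

```python
import csv
import io

# Assumed mapping, inferred from the single example on this page.
SERVER_PREFIX = "/omc_data/whoi/OMC/"
WEB_PREFIX = "https://rawdata.oceanobservatories.org/files/"

# The example row from the ingestion CSV discussed above.
INGEST_CSV = (
    "uframe_route,filename_mask,reference_designator,data_source\n"
    "Ingest.ctdbp-cdef-dcl_recovered,"
    "/omc_data/whoi/OMC/CE09OSSM/R00001/cg_data/dcl27/ctdbp1/*.ctdbp1.log,"
    "CE09OSSM-RID27-03-CTDBPC000,recovered_host\n"
)


def web_directory(filename_mask):
    """Convert a server-side filename mask into a web directory URL."""
    if not filename_mask.startswith(SERVER_PREFIX):
        raise ValueError(f"unexpected path prefix: {filename_mask}")
    relative = filename_mask[len(SERVER_PREFIX):]
    directory = relative.rsplit("/", 1)[0]  # drop the '*.ctdbp1.log' mask
    return WEB_PREFIX + directory + "/"


rows = list(csv.DictReader(io.StringIO(INGEST_CSV)))
for row in rows:
    print(row["uframe_route"], "->", web_directory(row["filename_mask"]))
```

The uframe_route value printed for each row is the key to search for in the lookup table when choosing a driver.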
Cabled HD video data
Cabled HD video data (CAMHD) are in two formats, both of which can be played using most video player software (VLC, QuickTime, Windows Media Player, etc.). Uncompressed full-HD video files are in .mov format; these are very large and may take a long time to download. Smaller files created using lossless compression are in .mp4 format; these are still large but are more easily downloaded and played.
Glider Data
Uncabled glider data are in vendor-formatted files: .tbd, .sbd, .dbd, and .pd0 (for the DVL ADCP files). They can be opened using the TWR Slocum glider software, or various other tools. Some of the tools can be found via the TWR forum (https://datahost.webbresearch.com/; requires registration and login), and some open access code has been made available by Rutgers University (http://marine.rutgers.edu/~kerfoot/slocum/gliders.php). If you have more specific questions, please contact the Help Desk.
Streamed Cabled Data
For streamed cabled data (.dat files), the Digi computer that handles the streaming data adds a timestamp to every file, which slightly changes the format. To strip out that timestamp and return the file to vendor or original format, follow the steps below:
- Step 1: Install bbe (bbe-.sourceforge.net). For example, if you have brew installed, in a bash shell type: brew install bbe
- Step 2: Create an output directory in your working directory: mkdir -p output
- Step 3: Download your desired raw binary data files from https://rawdata.oceanobservatories.org/files/ and put them in your working directory
- Step 4: Iterate through raw binary data files in your working directory. In this example they end in .dat
for file in ./*.dat; do
  # Step 4.1: Name the output file
  convertedfile="$(basename "$file" .dat)_converted.dat"
  # Step 4.2: Strip out the timestamps
  bbe -b "/\xa3\x9d\x7a/:16" -e "D" -o "./output/$convertedfile" "$file"
done
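If bbe is not available, the same stripping can be approximated in Python. The sketch below mirrors the bbe expression above: it deletes every 16-byte block that begins with the marker bytes a3 9d 7a. It is an illustrative equivalent rather than an OOI-provided tool; the function names are hypothetical, and a marker with fewer than 13 bytes after it at the very end of a file is left untouched, which may differ from bbe's behavior.

```python
import re
from pathlib import Path

TIMESTAMP_MARKER = b"\xa3\x9d\x7a"  # block start, as in the bbe expression
BLOCK_LENGTH = 16                   # total block length in bytes

# Marker followed by the remaining 13 bytes of the block (any byte values).
_BLOCK = re.compile(
    re.escape(TIMESTAMP_MARKER) + b".{%d}" % (BLOCK_LENGTH - len(TIMESTAMP_MARKER)),
    re.DOTALL,
)


def strip_timestamps(data):
    """Remove every 16-byte block that starts with the timestamp marker."""
    return _BLOCK.sub(b"", data)


def convert_file(source, output_dir="output"):
    """Write <name>_converted.dat into output_dir and return its path."""
    source = Path(source)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{source.stem}_converted.dat"
    target.write_bytes(strip_timestamps(source.read_bytes()))
    return target


# Synthetic example: two payload chunks, each followed by a timestamp block.
sample = b"AB" + TIMESTAMP_MARKER + bytes(13) + b"CD" + TIMESTAMP_MARKER + b"\n" * 13
print(strip_timestamps(sample))
```

To process a whole working directory, call convert_file on each path returned by Path(".").glob("*.dat").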
Uncabled Mooring Data
Uncabled mooring data, including wire-following profiler moorings (aka MMP, which stands for Modified McLane Profiler), are in the following formats (see OMC_Data_Format_2016-05-25.txt). Not all instruments are currently included here. More are being added, but if you notice a missing instrument type or have questions about the data you find on the archive, contact the Help Desk.
Uncabled Surface Piercing Profiler
Raw uncabled surface piercing profiler mooring data (any platform with the suffix “SP”, for Surface Piercing) can be saved into easily accessible .mat files via a MATLAB routine developed by WET Labs called ‘rdWETLabs_2015_04.m’. This script reads in a directory of profiler text files (found in the “extract” subdirectory, e.g., https://rawdata.oceanobservatories.org/files/CE01ISSP/D00001/extract/) that contain all of the data extracted from the binary files output directly from the profiler, and aggregates them into one MAT file. The MAT file contains the following variables:
- ACS: WETLabs ac-s
- ADCP: VELPT Aquadopp
- OCR: SPKIR – Satlantic OCR507
- OPT: DOSTA
- PARS: PARAD – WETLabs PARS
- TRIP: FLORT – WETLabs ECO Triplet
- SNA: SUNA Nitrate
- HMR: Heading, pitch, roll
- SBE: Fast pressure
- WM: Winch status
- SUM: Wave summary
- WB: Wave Burst
If you have any issues running the MATLAB routine or run into versioning issues, please contact the Help Desk.
How do I download an entire raw data directory?
To recursively download an entire raw data directory from rawdata.oceanobservatories.org, you can use the following command, replacing URL with the address of the directory:
wget -r --no-check-certificate -e robots=off -nd URL
- The -r option makes the download recursive.
- The -nd option does not create directories, but will append an extension if two files have identical names.
- The --no-check-certificate option tells wget not to check the server certificate against the available certificate authorities.
- The -e robots=off option tells wget to ignore the robots.txt file. If this option is left out, the robots.txt file tells wget that the server does not allow web crawlers, which makes wget stop.
- You may also want to add -np (--no-parent), which keeps wget from ascending into the parent directory during a recursive download.
For example, to download the first telemetered deployment folder for CE01ISSP:
wget -r --no-check-certificate -e robots=off -nd https://rawdata.oceanobservatories.org/files/CE01ISSP/D00001/
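Where wget is not available, a recursive download can be built on the Python standard library by reading each directory index page and following its links. The sketch below shows only the link-sorting step. It assumes the server returns an Apache-style index page with relative href links, which has not been verified against the live archive, and the sample HTML and file names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_entries(index_html, base_url):
    """Split the links on an index page into (subdirectories, files).

    Skips column-sort links ('?...'), absolute paths such as the parent
    directory link, and links to other sites. base_url must end in '/'.
    """
    collector = _LinkCollector()
    collector.feed(index_html)
    directories, files = [], []
    for href in collector.hrefs:
        if href.startswith(("?", "/", "#", "..")) or "://" in href:
            continue
        url = urljoin(base_url, href)
        (directories if href.endswith("/") else files).append(url)
    return directories, files


# Made-up index page for illustration.
SAMPLE_INDEX = """
<a href="?C=N;O=D">Name</a>
<a href="/files/CE01ISSP/">Parent Directory</a>
<a href="extract/">extract/</a>
<a href="example.txt">example.txt</a>
"""
base = "https://rawdata.oceanobservatories.org/files/CE01ISSP/D00001/"
dirs, files = list_entries(SAMPLE_INDEX, base)
print(dirs)
print(files)
```

A full downloader would fetch each file URL (for example with urllib.request.urlretrieve) and call list_entries again on each subdirectory.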