Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

The Sentinel-2 mission is
a land monitoring constellation of two satellites that provide high resolution
optical imagery and provide continuity for the current SPOT and Landsat missions.
The mission provides a global coverage of the Earth's land surface every 5 days,
making the data of great use in on-going studies. L1C data are available from
June 2015 globally. L2A data are available from April 2017 over the wider Europe
region and globally from December 2018.

This project creates an S3 repository with imagery acquired
by the China-Brazil Earth Resources Satellite (CBERS). The
image files are recorded and processed by Instituto Nacional de Pesquisas
Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF
format in order to optimize their use for cloud based applications.
The repository contains all CBERS-4 MUX, AWFI, PAN5M and
PAN10M scenes acquired since
the start of the satellite mission and is updated daily with
new scenes.

This project monitors the world's broadcast, print,
and web news from nearly every corner of every country in
over 100 languages and identifies the people, locations,
organizations, counts, themes, sources, emotions,
quotes, images and events driving our global society every
second of every day.

Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response to environmental disasters, and economic applications. GRD data are available globally from January 2017.

The Deutsche Börse Public Data Set consists of trade data aggregated to one minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product here. Also, be sure to check out our developer's portal.
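As a rough illustration of the one-minute aggregation described above, the sketch below groups individual trades into per-minute bars holding the initial, lowest, highest and final price plus the volume. The trade tuples, the dictionary keys, the "XYZ" security and the `aggregate_minute_bars` helper are all hypothetical; they are not the dataset's actual schema.

```python
from datetime import datetime


def aggregate_minute_bars(trades):
    """Aggregate (timestamp, security, price, volume) trades into one-minute bars.

    Assumes trades are already sorted by timestamp. Field names are
    illustrative only and do not reflect the real dataset columns.
    """
    bars = {}
    for ts, security, price, volume in trades:
        # Truncate the timestamp to the start of its minute.
        minute = ts.replace(second=0, microsecond=0)
        key = (security, minute)
        bar = bars.get(key)
        if bar is None:
            # First trade of the minute sets every price field.
            bars[key] = {"start": price, "min": price, "max": price,
                         "end": price, "volume": volume}
        else:
            bar["min"] = min(bar["min"], price)
            bar["max"] = max(bar["max"], price)
            bar["end"] = price  # last trade seen so far in this minute
            bar["volume"] += volume
    return bars


# Made-up trades for a made-up security.
trades = [
    (datetime(2018, 1, 2, 9, 0, 5), "XYZ", 10.0, 100),
    (datetime(2018, 1, 2, 9, 0, 40), "XYZ", 10.5, 50),
    (datetime(2018, 1, 2, 9, 0, 59), "XYZ", 9.8, 25),
    (datetime(2018, 1, 2, 9, 1, 10), "XYZ", 9.9, 10),
]
bars = aggregate_minute_bars(trades)
for (security, minute), bar in bars.items():
    print(security, minute.isoformat(), bar)
```

Minutes with no trades produce no bar in this sketch, which is why untraded price movements would need a higher resolution source.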

SILO is a database of Australian climate data from 1889 to the present. It provides continuous, daily time-step data products in ready-to-use formats for research and operational applications.
Gridded SILO data in annual NetCDF format are on AWS. Point data are available from the SILO website.

The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.

Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. More than 130 million customer reviews are available to researchers as part of this dataset.

ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service. It utilizes the best available observation data from satellites and in-situ stations, which are assimilated and processed using ECMWF's Integrated Forecast System (IFS) Cycle 41r2.
The dataset provides all essential atmospheric meteorological parameters like, but not limited to, air temperature, pressure and wind at different altitudes, along with surface parameters like rainfall, soil moisture content and sea parameters like sea-surface temperature and wave height.
ERA5 provides data at a considerably higher spatial and temporal resolution than its legacy counterpart ERA-Interim. ERA5 consists of a high resolution version with 31 km horizontal resolution, and a reduced resolution ensemble version with 10 members. Data are currently available from 2008 onward and will be continuously extended backwards, first to 1979 and then to 1950.
Learn more about ERA5 in Jon Olauson's paper, "ERA5: The new champion of wind power modelling?"

The Hubble Space Telescope (HST) is one of the most productive scientific instruments ever created. This dataset contains calibrated and raw data for all of the currently active instruments on HST: ACS, COS, STIS and WFC3.

The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This "leaf-on" imagery typically ranges from 60 centimeters to 100 centimeters in resolution and is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed raw GeoTIFF format, and from the naip-visualization Amazon S3 bucket as 3-band (RGB) imagery in Cloud Optimized GeoTIFF format. NAIP data is delivered at the state level; every year, a number of states receive updates, with an overall update cycle of two or three years.

Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines.

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.

Earth & Atmospheric Sciences at Cornell University has created a public data lake of climate data. The data is stored in a columnar storage format (ORC) to make it straightforward to query using standard tools like Amazon Athena or Apache Spark. The data was originally intended to be used for building decision support tools for farmers and digital agriculture. The first dataset is the historical NDFD / NDGD data distributed by NCEP / NOAA / NWS. The NDFD (National Digital Forecast Database) and NDGD (National Digital Guidance Database) contain gridded forecasts and observations at 2.5km resolution for the Contiguous United States (CONUS). There are also 5km grids for several smaller US regions and non-contiguous territories, such as Hawaii, Guam, Puerto Rico and Alaska. NOAA distributes archives of the NDFD/NDGD via its NOAA Operational Model Archive and Distribution System (NOMADS) in Grib2 format. The data has been converted to ORC to optimize storage space and, more importantly, to simplify data access via standard data analytics tools.

GOES satellites (GOES-16 & GOES-17) provide continuous weather imagery and
monitoring of meteorological and space environment data across North America.
GOES satellites provide the kind of continuous monitoring necessary for
intensive data analysis. They hover continuously over one position on the surface.
The satellites orbit high enough to allow for a full-disc view of the Earth. Because
they stay above a fixed spot on the surface, they provide a constant vigil for the
atmospheric "triggers" for severe weather conditions such as tornadoes, flash floods,
hailstorms, and hurricanes. When these conditions develop, the GOES satellites are able
to monitor storm development and track storm movement.

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The entire globe is covered by the GFS at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which is used by the operational forecasters who predict weather out to 16 days in the future. Horizontal resolution drops to 44 miles (70 kilometers) between grid points for forecasts between one week and two weeks.

The Cancer Genome Atlas (TCGA) is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) to accelerate our understanding of the molecular basis of cancer. TCGA-funded researchers across the United States have produced a corpus of raw and processed genomic, transcriptomic, and epigenomic data from thousands of cancer patients.

The Transiting Exoplanet Survey Satellite (TESS) is a two-year survey that will discover exoplanets in orbit around bright stars. More information about TESS is available at MAST and the TESS Science Support Center.

VOiCES is a speech corpus recorded in acoustically challenging settings,
using distant microphone recording. Speech was recorded in real rooms with various
acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise,
either television, music, or babble, was concurrently played with clean speech.
Data was recorded using multiple microphones strategically placed
throughout the room. The corpus includes audio recordings, orthographic transcriptions,
and speaker labels.

This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines, and the victim organization has 5 departments and includes 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3.
For more information on the creation of this dataset, see this paper by researchers at the Canadian Institute for Cybersecurity (CIC) and the University of New Brunswick (UNB): Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization.

High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.

COCO is a large-scale object detection, segmentation, and captioning dataset.
This is part of the fast.ai datasets collection hosted by AWS for convenience
of fast.ai students. If you use this dataset in your research please cite
arXiv:1405.0312 [cs.CV].

ICON global numerical weather prediction model; average resolution of 13km with 90 vertical levels; updated at 00UTC and every following 6h with a forecast range of 120h (180h for 00UTC and 12UTC); selection of commonly used parameters

ICON global EPS ensemble prediction model; 40 ensemble members; average resolution of 40km; updated at 00UTC and every following 6h with a forecast range of 120h (extended to 180h for 00UTC and 12UTC); selection of commonly used parameters; ensemble members are bundled in joint grib files

ICON-EU regional numerical weather prediction model; European nesting region with increased resolution of approximately 6.5km with 60 vertical levels; updated at 00UTC and every following 3h with 120h forecast range; selection of commonly used parameters

ICON-EU EPS regional ensemble weather prediction model; 40 ensemble members; European nesting region with increased resolution of approximately 20km; updated at 00UTC and every following 3h with 120h forecast range; selection of commonly used parameters; ensemble members are bundled in joint grib files

LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3.
This dataset, managed by the Office of the Chief Technology Officer (OCTO), through the
direction of the District of Columbia GIS program, contains tiled point cloud data for
the entire District along with associated metadata.

This dataset contains historical and projected dynamically downscaled climate data for the State of Alaska and surrounding regions at 20km spatial resolution and hourly temporal resolution. This data was produced using the Weather Research and Forecasting (WRF) model (Version 3.5). We downscaled ERA-Interim historical reanalysis data (1979-2015) as well as historical and projected runs from two GCMs from the Coupled Model Inter-comparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1970-2005 and RCP 8.5: 2006-2100).

N-grams are fixed size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
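The sliding-window idea can be sketched in a few lines. This is a minimal illustration only: it assumes whitespace tokenization, which is not necessarily how the Google Books corpus was tokenized, and the `sliding_ngrams` helper and sample sentence are hypothetical.

```python
def sliding_ngrams(tokens, n):
    """Yield every window of n consecutive tokens, advancing one token at a time."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


# Assumption: simple whitespace tokenization of a made-up sentence.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

five_grams = list(sliding_ngrams(tokens, 5))
for gram in five_grams:
    print(" ".join(gram))
```

A text with fewer than n tokens yields no n-grams at all, because the window never fits.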

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Some of the most important datasets for image classification research, including
CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets,
and Stanford-Cars. This is part of the fast.ai datasets collection hosted by
AWS for convenience of fast.ai students. See documentation link for citation and
license details for each dataset.

Some of the most important datasets for image localization research, including
Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets
collection hosted by AWS for convenience of fast.ai students. See
documentation link for citation and license details for each dataset.

Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth prediction, depth map completion, 2D and 3D object detection and object tracking. In addition, several raw data recordings are provided. The datasets are captured by driving around the mid-size city of Karlsruhe, in rural areas and on highways. Up to 15 cars and 30 pedestrians are visible per image.

The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, so it is free for anyone to use for any purpose.

Some of the most important datasets for NLP, with a focus on classification, including
IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and
full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext
103, and ACL-2010 French-English 10^9 corpus. This is part of the
fast.ai datasets collection hosted by AWS for convenience of fast.ai
students. See documentation link for citation and license details for each
dataset.

The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced four times a day with weather forecasts going out to 16 days.
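To give a feel for how an ensemble of perturbed forecasts exposes uncertainty, here is a toy sketch. It has nothing to do with the actual GEFS code: the logistic map stands in for a forecast model, and the perturbation size, step count, unperturbed first member and helper names are all assumptions made for illustration.

```python
import random
import statistics


def toy_model(state, steps, r=3.9):
    """Iterate the logistic map, a simple chaotic stand-in for a forecast model."""
    trajectory = [state]
    for _ in range(steps):
        state = r * state * (1.0 - state)
        trajectory.append(state)
    return trajectory


def run_ensemble(observed, members=21, steps=40, noise=1e-6, seed=0):
    """Run one forecast per member, each from a slightly different start."""
    rng = random.Random(seed)
    # Assumption: member 0 uses the observation as-is; the rest are perturbed.
    starts = [observed] + [
        observed + rng.uniform(-noise, noise) for _ in range(members - 1)
    ]
    return [toy_model(start, steps) for start in starts]


ensemble = run_ensemble(observed=0.4)

# Spread = standard deviation across members at each forecast step.
spread = [statistics.pstdev(values) for values in zip(*ensemble)]
print(f"spread at step 0:  {spread[0]:.2e}")
print(f"spread at step 40: {spread[-1]:.2e}")
```

The spread across members grows as the forecast runs further out, which is the sense in which an ensemble quantifies forecast uncertainty.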

Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such.

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

The NOAA National Water Model Reanalysis dataset contains output from a 25-year retrospective simulation (January 1993 through December 2017) of version 1.2 of the National Water Model. This simulation used observed rainfall as input and ingested other required meteorological input fields from a weather Reanalysis dataset. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time forecast model. One application of this dataset is to provide historical context to current real-time streamflow, soil moisture and snowpack NWM conditions. The Reanalysis data can be used to infer flow frequencies and perform temporal analyses with hourly streamflow output and 3-hourly land surface output. The long-term dataset can also be used in the development of end user applications which require a long baseline of data for system training or verification purposes.

The National Water Model (NWM) is a water resources model that simulates and forecasts water
budget variables, including snowpack, evapotranspiration, soil moisture and streamflow, over
the entire continental United States (CONUS). The model, launched in August 2016, is designed
to improve the ability of NOAA to meet the needs of its stakeholders (forecasters, emergency
managers, reservoir operators, first responders, recreationists, farmers, barge operators, and
ecosystem and floodplain managers) by providing expanded accuracy, detail, and frequency of water
information. It is operated by NOAA’s Office of Water Prediction. This bucket contains a four-week
rollover of the Short Range Forecast model output and the corresponding forcing data for the
model. The model is forced with meteorological data from the High Resolution Rapid Refresh (HRRR)
and the Rapid Refresh (RAP) models. The Short Range Forecast configuration cycles hourly and produces
hourly deterministic forecasts of streamflow and hydrologic states out to 18 hours.

The Operational Forecast System (OFS) has been developed to serve the maritime user community. OFS was developed in a joint project of the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/NOS/Center for Operational Oceanographic Products and Services (CO-OPS), and the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO). OFS generates water level, water current, water temperature, water salinity (except for the Great Lakes) and wind conditions nowcast and forecast guidance four times per day.

OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenfMRI resource has been funded by the National Science Foundation, National Institute on Drug Abuse, and the Laura and John Arnold Foundation.

OSMLR is a linear referencing system built on top of OpenStreetMap. OSM has great information about roads around the world and their interconnections, but it lacks the means to give a stable identifier to a stretch of roadway. OSMLR provides a stable set of numerical IDs for every 1 kilometer stretch of roadway around the world. In urban areas, OSMLR IDs are attached to each block of roadways between significant intersections.

Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the survey of thousands of cells at relatively low coverage, while the other, FACS-based full length transcript analysis, enabled characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. See: https://www.nature.com/articles/s41586-018-0590-4

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

The Human Connectome Project aims to provide an unparalleled compilation of neural data, an interface to graphically navigate this data and the opportunity to achieve never before realized conclusions about the living human brain.

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.

Meteorological data reusers now have an exciting opportunity to sample, experiment with and evaluate
Met Office atmospheric model data, whilst also experiencing a transformative method of requesting
data via RESTful APIs on AWS, all ahead of the Met Office’s own operationally supported API platform
that will be launched in late 2019. For information about the data see the Met Office website.
For examples of using the data check out the examples repository.
If you need help and support using the data please raise an issue on the examples repository.

This dataset contains paired wet and dry chemistry measurements for
georeferenced soil samples that were collected through the Africa Soil
Information Service (AfSIS) project, which lasted from 2009 through 2018.
In this release, we include data collected during Phase I (2009-2013).
Georeferenced samples were collected from many Sub-Saharan African
countries, and their soil properties were analyzed using both wet and
dry chemistry. The two types of data can be paired to form a training
dataset for machine learning, such that certain soil properties can be
well-predicted through less expensive dry chemistry techniques.

Open City Model is an initiative to provide CityGML data for all the buildings in the United States.
By using other open datasets in conjunction with our own code and algorithms it is our goal to provide 3D geometries for every US building.

Obstacle history of American Ninja Warrior seasons 1-9
This dataset includes every obstacle in the history of American Ninja Warrior from season 1 to 9. This includes the obstacles at Sasuke (also known as the original Ninja Warrior in Japan) during seasons 1-3, when American Ninja Warrior (ANW) was on G4 and the top 10 competitors from the semi-finals round of ANW were sent to Sasuke to compete. Season 4 of ANW began what is known as the "NBC era", when the show took on the regional/city formats for both qualifying and semi-final rounds, with the finalists from each region competing at the National Finals of ANW in Las Vegas.

The Cell Painting Image Collection is a collection of freely
downloadable microscopy image sets. Cell Painting is an
unbiased high throughput imaging assay used to analyze
perturbations in cell models. In addition to the images
themselves, each set includes a description of the biological
application and some type of "ground truth" (expected results).
Researchers are encouraged to use these image sets as reference
points when developing, testing, and publishing new image
analysis algorithms for the life sciences. We hope that
this data set will lead to a better understanding of which
methods are best for various biological image analysis
applications.

This project is set to pull the latest daily coin data from Coin Metrics using the data.world sync applet on IFTTT.
Daily on-chain transaction volume is calculated as the sum of all transaction outputs belonging to the blocks mined on the given day. "Change" outputs are not included.
The transaction count figure does not include coinbase transactions.
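The two definitions above can be sketched as follows. The block and transaction structures, the field names, and the `is_change` flag are hypothetical: the text does not say how change outputs are identified, so the sketch assumes they are already labeled. It also reads the volume definition literally and applies no special case to coinbase outputs.

```python
from collections import defaultdict
from datetime import date


def daily_metrics(blocks):
    """Return {day: {"volume": ..., "tx_count": ...}} for a list of blocks."""
    metrics = defaultdict(lambda: {"volume": 0, "tx_count": 0})
    for block in blocks:
        day = metrics[block["mined_on"]]
        for tx in block["transactions"]:
            # Volume: sum every output except those flagged as change.
            day["volume"] += sum(
                out["value"] for out in tx["outputs"] if not out["is_change"]
            )
            # Count: coinbase transactions are skipped.
            if not tx["is_coinbase"]:
                day["tx_count"] += 1
    return dict(metrics)


# Made-up blocks mined on the same day.
blocks = [
    {
        "mined_on": date(2018, 1, 1),
        "transactions": [
            {"is_coinbase": True,
             "outputs": [{"value": 50, "is_change": False}]},
            {"is_coinbase": False,
             "outputs": [{"value": 3, "is_change": False},
                         {"value": 7, "is_change": True}]},
        ],
    },
    {
        "mined_on": date(2018, 1, 1),
        "transactions": [
            {"is_coinbase": False,
             "outputs": [{"value": 2, "is_change": False}]},
        ],
    },
]
result = daily_metrics(blocks)
print(result)
```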

Our National Footprint Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2013.
The calculations in the National Footprint Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency.

The basic geo-data set for public transport stops comprises public transport stops in Switzerland and additional selected geo-referenced public transport locations that are of operational or structural importance (operating points).

The IChangeMyCity project provides insight into the complaints raised by citizens from different cities of India about issues in their neighbourhoods and the resolution of those complaints by the civic bodies.