Top open source datasets from Amazon

2022-05-14 02:03:32 By : Mr. Laughing Wang

Amazon has open-sourced Multilingual Amazon SLURP for Slot Filling, Intent Classification, and Virtual-assistant Evaluation (MASSIVE), a dataset covering 51 languages, to encourage developers to build more third-party apps and tools for its voice assistant, Alexa. It contains one million labelled utterances, along with open-source code for training multilingual AI models. It was compiled by having translators localise an English-only dataset into languages spoken across Africa, Latin America, Europe, and Asia. Most samples are questions or common commands, such as asking a device to play a song or to check the weather.
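To make "intent classification" and "slot filling" concrete, here is a minimal Python sketch of how one labelled utterance might be read. The record below is hypothetical and written only for illustration; the field names (`locale`, `intent`, `annot_utt`) and the `[slot : value]` markup are assumptions about the layout, not details taken from this article.

```python
import re

# Hypothetical record for illustration only. The field names and the
# "[slot : value]" markup are assumptions, not taken from the article.
record = {
    "locale": "en-US",
    "intent": "alarm_set",
    "annot_utt": "wake me up at [time : nine am] on [date : friday]",
}

# Matches one "[slot_name : slot_value]" span inside an annotated utterance.
SLOT_PATTERN = re.compile(r"\[\s*([^:\[\]]+?)\s*:\s*([^\[\]]+?)\s*\]")


def extract_slots(annotated):
    """Return (slot_name, slot_value) pairs found in an annotated utterance."""
    return [(name, value) for name, value in SLOT_PATTERN.findall(annotated)]


def plain_text(annotated):
    """Strip the slot markup and keep only the words of the utterance."""
    return SLOT_PATTERN.sub(lambda match: match.group(2), annotated)


slots = extract_slots(record["annot_utt"])
utterance = plain_text(record["annot_utt"])

print("intent:   ", record["intent"])
print("slots:    ", slots)
print("utterance:", utterance)
```

In this framing, the intent is a single label for the whole utterance, while the slots mark the spans a model would need to extract in order to act on the request.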

Over the years, Amazon and AWS have contributed massively to the open-source community by releasing their comprehensive datasets to the public. We will have a look at a few of them in this article.

Amazon Customer Reviews is a collection of product reviews gathered over more than two decades. It contains over a hundred million reviews in which customers describe their experience with products bought from the website. This makes the data a rich source for academic research in NLP, information retrieval, and machine learning, among other fields. The dataset was created to represent a sample of customer evaluations and opinions, and it also reflects how perception of the same product varies across geographical regions.

Last year, Amazon and the University of California, Berkeley, jointly released the Amazon Berkeley Objects dataset. It is a large dataset of product images and associated metadata intended to support research on product information management, visual understanding, and information retrieval. The aim is for researchers to develop more powerful AI models for image-based shopping and for expanding retailers’ product graphs. The dataset covers close to 150,000 products, each annotated with metadata such as multilingual title, model, brand, product type, and dimensions. It also includes close to 400,000 static catalogue images, over 8,000 image sequences that show a product rotated through 360 degrees in the plane at 5-degree intervals, and over 7,000 3D product models that can be rotated along any axis and rendered in any 3D environment under different lighting conditions.

Launched in 2016, SpaceNet is an open innovation project that offers a repository of freely available imagery with co-registered map features. SpaceNet hosts datasets developed by its own team along with datasets from projects such as IARPA’s Functional Map of the World (fMoW). Before SpaceNet, researchers had far fewer options for obtaining free, precisely labelled, high-resolution satellite imagery.

The Cancer Genome Atlas is the result of a collaboration between the National Cancer Institute and the National Human Genome Research Institute. By analysing matched tumour and normal tissue samples from 11,000 patients, the group set out to generate comprehensive, multi-dimensional maps of the key genomic changes in major types of cancer. It has produced a comprehensive characterisation of 33 cancer types and subtypes, including ten rare cancers. The dataset contains Clinical Supplement, miRNA-Seq Isoform Expression Quantification, Genotyping Array Masked Copy Number Segment, Genotyping Array Gene Level Copy Number Scores, and WXS Masked Somatic Mutation data from the Genomic Data Commons (GDC), along with Whole Exome Sequencing (WXS), RNA-Seq, miRNA-Seq, and WXS Aggregated Somatic Mutation data.

The Genome Aggregation Database (gnomAD) is developed jointly by an international coalition of investigators who aggregate both exome and genome data from a range of large-scale human sequencing projects. The v2 data set (GRCh37) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals. The v3 data set (GRCh38) contains 71,702 genomes selected in the same way as in v2.

Folding@home is a major distributed computing project that uses biomolecular simulations to find the molecular origins of disease and accelerate the discovery of new treatments. During the COVID-19 pandemic, Folding@home partnered with several experimental collaborators to speed up progress toward effective therapies for COVID-19. One outcome of these efforts was the creation of the world’s first exascale distributed computing resource, which has been used to generate scientific datasets of massive size.


© Analytics India Magazine Pvt Ltd 2022