11. Undocumented: Various datasets without documentation (feel free to explore!)

12. MSNBC.com Anonymous Web Data: This data describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category (see description) and are recorded in time order.

13. UNIX User Data: This file contains 9 sets of sanitized user data drawn from the command histories of 8 UNIX computer users at Purdue over the course of up to 2 years.

14. Volcanoes on Venus - JARtool experiment: The JARtool project was a pioneering effort to develop an automatic system for cataloging small volcanoes in the large set of Venus images returned by the Magellan spacecraft.

37. Syskill and Webert Web Page Ratings: This database contains HTML source of web pages plus the ratings of a single user on these web pages. Web pages are on four separate subjects (Bands - recording artists; Goats; Sheep; and BioMedical).

38. Pseudo Periodic Synthetic Time Series: This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself.

39. Robot Execution Failures: This dataset contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals.

48. Reuters Transcribed Subset: This dataset is created by reading out 200 files from the 10 largest Reuters classes and using an Automatic Speech Recognition system to create corresponding transcriptions.

49. Hill-Valley: Each record represents 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a bump in the terrain) or a Valley (a dip in the terrain).

50. Abscisic Acid Signaling Network: The objective is to determine the set of boolean rules that describe the interactions of the nodes within this plant signaling network. The dataset includes 300 separate boolean pseudodynamic simulations using an asynchronous update scheme.

55. Steel Plates Faults: A dataset of steel plates' faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.

56. Localization Data for Person Activity: Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.

57. Online Handwritten Assamese Characters Dataset: This is a dataset of 8235 online handwritten Assamese characters. The "online" process involves capturing data as text is written on a digitizing tablet with an electronic pen.

58. Amazon Commerce reviews set: The dataset is used for authorship identification in online Writeprint, which is a new research field of pattern recognition.

59. Relative location of CT slices on axial axis: The dataset consists of 384 features extracted from CT images. The class variable is numeric and denotes the relative location of the CT slice on the axial axis of the human body.

61. Farm Ads: This data was collected from text ads found on twelve websites that deal with various farm animal related topics. The binary labels are based on whether or not the content owner approves of the ad.

62. YouTube Comedy Slam Preference Data: This dataset provides user vote data, collected on YouTube Comedy Slam, on which video from a pair of videos is funnier. The task is to automatically predict this preference based on video metadata.

64. SMS Spam Collection: The SMS Spam Collection is a public set of SMS labeled messages that have been collected for mobile phone spam research.

65. Northix: Northix is designed to be a schema matching benchmark problem for data integration of two entity relationship databases.

66. Individual household electric power consumption: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

71. Gas sensor arrays in open sampling settings: The dataset contains 18000 time-series recordings from a chemical detection platform at six different locations in a wind tunnel facility in response to ten high-priority chemical gaseous substances.

74. NYSK: NYSK (New York v. Strauss-Kahn) is a collection of English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn (May 2011).

77. Thoracic Surgery Data: The data is dedicated to the classification problem of post-operative life expectancy in lung cancer patients: class 1 - death within one year after surgery, class 2 - survival.

79. Bike Sharing Dataset: This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.

80. EMG dataset in Lower Limb: 3 different exercises (sitting, standing and walking) in the muscles biceps femoris, vastus medialis, rectus femoris and semitendinosus, in addition to goniometry in the exercises.

81. Dresses_Attribute_Sales: This dataset contains attributes of dresses and their recommendations according to their sales. Sales are monitored on alternate days.

91. microblogPCU: MicroblogPCU data is crawled from the Sina Weibo microblog (http://weibo.com/). This data can be used to study machine learning methods as well as to do some social network research.

92. Gas sensor array under dynamic gas mixtures: The data set contains the recordings of 16 chemical sensors exposed to two dynamic gas mixtures at varying concentrations. For each mixture, signals were acquired continuously during 12 hours.

94. Greenhouse Gas Observing Network: Design an observing network to monitor emissions of a greenhouse gas (GHG) in California given time series of synthetic observations and tracers from weather model simulations.

103. Indoor User Movement Prediction from RSS data: This dataset contains temporal data from a Wireless Sensor Network deployed in real-world office environments. The task is intended as a real-life benchmark in the area of Ambient Assisted Living.

104. GPS Trajectories: The dataset has been fed by an Android app called Go!Track. It is available at the Google Play Store (https://play.google.com/store/apps/details?id=com.go.router).

107. Crowdsourced Mapping: Crowdsourced data from OpenStreetMap is used to automate the classification of satellite images into different land cover classes (impervious, farm, forest, grass, orchard, water).

109. Air quality: Contains the responses of a gas multisensor device deployed in the field in an Italian city.

110. Devanagari Handwritten Character Dataset: This is an image database of handwritten Devanagari characters. There are 46 classes of characters with 2000 examples each. The dataset is split into a training set (85%) and a testing set (15%).

118. YouTube Spam Collection: It is a public set of comments collected for spam research. It has five datasets composed of 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.

119. KASANDR: KASANDR is a novel, publicly available collection for recommendation systems that records the behavior of customers of the European leader in e-Commerce advertising, Kelkoo.

120. chestnut - LARVIC: The research project presents this database, which shows images of chestnuts that will be processed to determine the presence or absence of defects.

124. Paper Reviews: This sentiment analysis data set contains scientific paper reviews from an international conference on computing and informatics. The task is to predict the orientation or the evaluation of a review.

126. APS Failure at Scania Trucks: The dataset's positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS.

127. DSRC Vehicle Communications: This set provides data regarding wireless communications between vehicles and road side units. Two separate data sets are provided: one for the normal scenario and one in the presence of an attacker (jammer).

135. Victorian Era Authorship Attribution: To create the largest authorship attribution dataset, we extracted works of 50 well-known authors. To have non-exhaustive learning, the training set contains 45 authors, whereas the testing set contains 50.

139. PMU-UD: The handwritten dataset was collected from 170 participants with a total of 5,180 numeral patterns. The dataset is named Prince Mohammad Bin Fahd University - Urdu/Arabic Database (PMU-UD).

140. Student Academics Performance: The dataset is intended for predicting the end-semester percentage based on different social, economic and academic attributes.

144. Parking Birmingham: Data collected from car parks in Birmingham that are operated by NCP from Birmingham City Council. UK Open Government Licence (OGL). https://data.birmingham.gov.uk/dataset/birmingham-parking