mnists

downloads and prepares various mnist-compatible datasets.

files are downloaded to ~/.mnist and checked for integrity by SHA-256 hashes.
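as an illustration of the integrity check described above, here is a minimal sketch of verifying a file against a known SHA-256 digest using only the standard library. the helper names `sha256_of_file` and `verify_file` are hypothetical and are not part of mnists; the library's actual implementation may differ.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read fixed-size chunks so large archives need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file's digest matches the expected hex string.

    hypothetical helper: the comparison is case-insensitive because
    hexdigest() always returns lowercase.
    """
    return sha256_of_file(path) == expected_hex.lower()
```

a downloader would typically call something like `verify_file` once after fetching each file and re-download or raise an error on a mismatch.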

dependencies

python 3.5 (or later), numpy.

install

pip install --upgrade 'https://github.com/notwa/mnists/tarball/master#egg=mnists'

I recommend adding --upgrade-strategy only-if-needed to the command so that you don't accidentally "upgrade" numpy to a version not compiled specifically for your environment. This can happen when using e.g. Anaconda.

usage

import mnists

dataset = "emnist_balanced"
train_images, train_labels, test_images, test_labels = mnists.prepare(dataset)

by default, images have the shape (n, 1, 28, 28) and are scaled to the range [0, 1]. labels are one-hot encoded.

prepare arguments

pass flatten=True to get a flattened (n, 784) image shape.

pass return_floats=False to get the raw [0, 255] integer range of images.

pass return_onehot=False to get the raw [0, M-1] integer encoding of labels, where M is the number of classes.
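to make the relationship between these representations concrete, here is a small numpy sketch that performs the equivalent conversions by hand on synthetic data. it does not call mnists; the division by 255 and the float32 dtype are assumptions, since the exact scaling and dtypes used by the library are not stated here.

```python
import numpy as np

n, num_classes = 4, 10
rng = np.random.default_rng(0)

# synthetic stand-ins for raw data: uint8 pixels and integer class labels
raw_images = rng.integers(0, 256, size=(n, 1, 28, 28), dtype=np.uint8)
raw_labels = np.array([3, 0, 9, 3])

# [0, 255] integers -> [0, 1] floats (assumed scaling: divide by 255)
float_images = raw_images.astype(np.float32) / 255.0

# (n, 1, 28, 28) -> (n, 784), as with flatten=True
flat_images = float_images.reshape(n, -1)

# integer labels -> one-hot rows, and back again via argmax
onehot_labels = np.eye(num_classes, dtype=np.float32)[raw_labels]
recovered_labels = onehot_labels.argmax(axis=1)

print(flat_images.shape)    # (4, 784)
print(onehot_labels.shape)  # (4, 10)
```

the round trip through `argmax` shows that one-hot and integer labels carry the same information, so choosing between them is a matter of what the downstream code expects.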

why the extra dimension?

you will notice that, by default, the shape of images contains a dimension of size 1: (n, 1, 28, 28). this exists for compatibility with programs that expect the number of color channels in that position. since mnist-like datasets are (as of writing) all grayscale, there is only one color channel, and thus the size of this dimension is 1.
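if the code consuming the images expects a different layout, the channel dimension can be dropped or moved with plain numpy. the following is a sketch on a zero-filled placeholder array rather than real data; the variable names are illustrative only.

```python
import numpy as np

# placeholder with the default (n, channels, height, width) layout
images = np.zeros((5, 1, 28, 28), dtype=np.float32)

# drop the channel dimension entirely: (n, 28, 28)
no_channel = images.squeeze(axis=1)

# move channels to the end for channels-last consumers: (n, 28, 28, 1)
channels_last = images.transpose(0, 2, 3, 1)

print(no_channel.shape)     # (5, 28, 28)
print(channels_last.shape)  # (5, 28, 28, 1)
```

passing `axis=1` to `squeeze` is deliberate: it only removes the channel dimension, so a batch containing a single image keeps its leading dimension.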

datasets

in alphabetical order, using default mnists.prepare arguments:

| subdirectory | dataset | train images shape | train labels shape | test images shape | test labels shape |
| --- | --- | --- | --- | --- | --- |
| emnist | emnist_balanced | (112800, 1, 28, 28) | (112800, 47) | (18800, 1, 28, 28) | (18800, 47) |
| emnist | emnist_byclass | (697932, 1, 28, 28) | (697932, 62) | (116323, 1, 28, 28) | (116323, 62) |
| emnist | emnist_bymerge | (697932, 1, 28, 28) | (697932, 47) | (116323, 1, 28, 28) | (116323, 47) |
| emnist | emnist_digits | (240000, 1, 28, 28) | (240000, 10) | (40000, 1, 28, 28) | (40000, 10) |
| emnist | emnist_letters | (124800, 1, 28, 28) | (124800, 26) | (20800, 1, 28, 28) | (20800, 26) |
| emnist | emnist_mnist | (60000, 1, 28, 28) | (60000, 10) | (10000, 1, 28, 28) | (10000, 10) |
| fashion-mnist | fashion_mnist | (60000, 1, 28, 28) | (60000, 10) | (10000, 1, 28, 28) | (10000, 10) |
| mnist | mnist | (60000, 1, 28, 28) | (60000, 10) | (10000, 1, 28, 28) | (10000, 10) |
| qmnist | qmnist | (60000, 1, 28, 28) | (60000, 10) | (60000, 1, 28, 28) | (60000, 10) |