Python Machine Learning Datasets

Written: 03/17/15

Last Updated: 03/22/15

I just started a project for working with various datasets for machine learning applications. So far, I have only added MNIST, whose data can be used in 1D or 2D form, depending on your needs. As time goes on, I’ll add support for other datasets that I encounter in my research.
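To illustrate the 1D versus 2D distinction, here is a minimal sketch in plain Python. It assumes an image is stored as a flat list of pixels. The helper names `to_2d` and `to_1d` are hypothetical and are not part of the project's API. The only dataset fact it relies on is that MNIST images are 28x28 pixels.

```python
def to_2d(flat, width):
    """Split a flat pixel list into rows of `width` pixels each."""
    if len(flat) % width != 0:
        raise ValueError("length of the flat list is not a multiple of width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def to_1d(rows):
    """Flatten a list of rows back into a single flat list."""
    return [pixel for row in rows for pixel in row]


# Stand-in for one MNIST image: 28 * 28 = 784 placeholder values.
image_1d = list(range(28 * 28))
image_2d = to_2d(image_1d, 28)
```

The flat form suits models that take a feature vector, while the row form keeps the spatial layout of the image.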

The project supports a few simple manipulations of the datasets. The current features are:

  1. Randomization of the data
  2. Reduction of the desired number of elements
  3. Normalized reduction (equal number of instances of each category) of the data
  4. Export of the data to a pickled data file
  5. Export of the data to a CSV
  6. Compatibility with Windows and *NIX
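To make the features above concrete, here is a rough sketch of how randomization, normalized reduction, and the two exports could look using only the standard library. This is not the project's actual code. Every function name, the `(data, labels)` layout, and the "label first" CSV row format are assumptions made for illustration.

```python
import csv
import os
import pickle
import random
import tempfile
from collections import defaultdict


def shuffle_together(data, labels, seed=None):
    """Shuffle samples and labels with the same permutation."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    return [data[i] for i in order], [labels[i] for i in order]


def normalized_reduce(data, labels, per_class):
    """Keep at most `per_class` samples of each label, in input order."""
    kept = defaultdict(int)
    out_data, out_labels = [], []
    for sample, label in zip(data, labels):
        if kept[label] < per_class:
            kept[label] += 1
            out_data.append(sample)
            out_labels.append(label)
    return out_data, out_labels


def export_pickle(path, data, labels):
    """Write the (data, labels) pair to a pickled data file."""
    with open(path, "wb") as f:
        pickle.dump((data, labels), f)


def export_csv(path, data, labels):
    """Write one row per sample: the label first, then the features."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for sample, label in zip(data, labels):
            writer.writerow([label] + list(sample))


# Toy stand-in for a dataset: 12 two-feature samples in 3 categories.
data = [[i, i + 1] for i in range(12)]
labels = [i % 3 for i in range(12)]

data, labels = shuffle_together(data, labels, seed=0)
small_data, small_labels = normalized_reduce(data, labels, per_class=2)

# Round-trip both export formats through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "subset.pkl")
    export_pickle(pkl_path, small_data, small_labels)
    with open(pkl_path, "rb") as f:
        restored_data, restored_labels = pickle.load(f)

    csv_path = os.path.join(tmp, "subset.csv")
    export_csv(csv_path, small_data, small_labels)
    with open(csv_path, newline="") as f:
        csv_rows = list(csv.reader(f))
```

A plain (non-normalized) reduction is just a slice after shuffling, for example `data[:100]` and `labels[:100]`. Note that values read back from the CSV are strings.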

The planned features are:

  1. Division of training data into training and validation
    1. Simple random split
    2. k-fold cross-validation
  2. Division of a single input into training and test sets, with and without normalizing the data
  3. A way for users to plug in their own dataset without having to write their own class
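Since the training/validation division is only planned, the following is a sketch of the two standard techniques and not a preview of the project's eventual API. The names `random_split` and `k_fold` and their parameters are hypothetical.

```python
import random


def random_split(samples, validation_fraction=0.2, seed=None):
    """Shuffle a copy of `samples` and split it into (training, validation)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def k_fold(samples, k):
    """Yield k (training, validation) pairs; each sample validates exactly once."""
    if not 2 <= k <= len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    # Fold i takes every k-th sample starting at index i.
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, folds[i]


# Toy stand-in for a training set.
samples = list(range(10))
train, val = random_split(samples, validation_fraction=0.2, seed=1)
folds = list(k_fold(samples, 5))
```

The k-fold sketch does not shuffle, so in practice the data would be randomized first. Otherwise any ordering in the input carries over into the folds.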

As a side note – I don’t currently use any third-party tools; however, I may decide to use NumPy or other modules in the future.

The code is licensed under the MIT license, so feel free to use it for whatever purpose you may have.

DISCLAIMER – I am not responsible for what you do with this code. If you want to use this for a school project, I strongly advise that you consult with your professor before doing so.

You can grab the full source code here. Additionally, I have generated API docs. Please refer to those docs and the relevant READMEs in my repo for more info on getting started. To quickly see an example, first install the code:

python setup.py install

and then execute the MNIST example:

python -m mldata.vision.mnist.mnist

If you run into any problems or have any questions, please let me know in the comments!
