I find it so aggravating that nearly every ML framework documents its CNN libraries in terms of a canned MNIST dataset imported from the library in preprocessed form.
It's always left as a useless exercise for the reader to divine how to generate such a dataset from his/her own data.
Examples should be way more general. The starting point shouldn't be:

    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

It should start with "here is a directory of images and their classes" and end with a CNN model.
EDIT: Should anyone have any insight as to where I might get such a tutorial (or have the desire to write one), I know a herd of ML pre-initiates that would be grateful.
That's typically going to be problem-specific. A simple cat/dog tutorial would be reasonably easy with a lot of the existing libraries like scikit-image/PIL.
The other problem here, though, is corpus layout. ImageNet and the academic datasets typically require special readers, but even for real-world datasets there are a few ways to lay out a corpus. In our experience, two layouts have worked well: a folder per label, or labels encoded in the filename.
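The labels-in-the-name layout is trivial to parse if the convention is consistent. A small sketch, assuming Kaggle-cats-vs-dogs-style names like `cat.0001.png` (the helper name and the regex are my own illustration):

```python
import os
import re

def label_from_filename(path, pattern=r"^([a-z]+)\."):
    """Extract the label prefix from a filename like 'cat.0001.png'.

    Assumes the label-in-name layout: the class name is everything
    before the first dot. Raises if the name doesn't match.
    """
    fname = os.path.basename(path)
    m = re.match(pattern, fname)
    if m is None:
        raise ValueError("no label found in %r" % fname)
    return m.group(1)
```

Either layout works; the point is just to pick one convention and write the five-line reader for it once.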
Then there's still balanced minibatches: images segmented into a folder per label don't give you balanced minibatches out of the box.
Then there's disk to think about: do I really want to re-run the same preprocessing every time I train? And then: how do I explain all that to a beginner?
So I'm probably going to want a corpus generator that ends up with pre-saved, balanced minibatches for training... which leads us back to what you see now.
A good middle ground here might be a corpus generator that takes all the minor stuff like that into consideration... but still, data is messy.
I would suggest looking at the wealth of imaging libraries out there in python and building something based on a "from scratch" image corpus.
You may not use Java, but the idea of "vectorization" is still a good one that I think any ML practitioner who's touched pandas could appreciate. We built an abstraction called a DataSetIterator which automagically returns batches so people don't have to think about the details while still having access to "real" data. I'm not sure what the Python equivalent of this would be, though.
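A rough Python analogue of that iterator idea would just be a class wrapping arrays and yielding `(features, labels)` minibatches so callers never index arrays themselves. A minimal sketch (class name borrowed from the comment above; nothing here is a real library API):

```python
import numpy as np

class DataSetIterator:
    """Yield (features, labels) minibatches from in-memory arrays,
    optionally reshuffled on every pass."""

    def __init__(self, X, y, batch_size, shuffle=True, rng=None):
        assert len(X) == len(y)
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = rng or np.random.default_rng()

    def __iter__(self):
        order = np.arange(len(self.X))
        if self.shuffle:
            self.rng.shuffle(order)
        for start in range(0, len(order), self.batch_size):
            idx = order[start:start + self.batch_size]
            yield self.X[idx], self.y[idx]
```

Usage is just `for xb, yb in DataSetIterator(X, y, 32): ...`, which keeps the batching details out of the training loop the same way the Java abstraction does.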
A good exercise would be to figure out how to extract the data and put it into a numpy array. Then you can test on most - if not all - of the frameworks.
I recently parsed that in C#. It's simple enough to do, but the format is weird. It would be much simpler if the digits were stored as individual raw files in folders corresponding to each label.
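The format (IDX) is weird but small: two zero bytes, a type code (0x08 = unsigned byte for MNIST), a dimension count, one big-endian uint32 per dimension, then the raw data. A Python sketch of a reader (the function name `read_idx` is my own):

```python
import struct
import numpy as np

def read_idx(path_or_bytes):
    """Parse an IDX file (the format the raw MNIST files ship in).

    Header: b'\\x00\\x00', a type code (0x08 = uint8, the only type
    handled here), a dimension count, then one big-endian uint32 per
    dimension, followed by the raw uint8 data.
    """
    data = (
        path_or_bytes
        if isinstance(path_or_bytes, (bytes, bytearray))
        else open(path_or_bytes, "rb").read()
    )
    zeros, type_code, ndim = data[0:2], data[2], data[3]
    assert zeros == b"\x00\x00" and type_code == 0x08, "unexpected IDX header"
    dims = struct.unpack(">%dI" % ndim, data[4:4 + 4 * ndim])
    array = np.frombuffer(data[4 + 4 * ndim:], dtype=np.uint8)
    return array.reshape(dims)
```

For the image file this gives a `(60000, 28, 28)` uint8 array, and from there you could trivially dump the digits out as individual files in per-label folders if you prefer that layout.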
Larger problem: MNIST, images, and speech are probably low-hanging fruit, with rich features and large, balanced datasets, for which off-the-shelf packages work very well today.