I find it so aggravating that nearly every ML framework documents its CNN libraries in terms of a canned MNIST dataset imported from the library in preprocessed form.
It's always left as a useless exercise for the reader to divine how to generate such a dataset from his/her own data.
Examples should be way more general. The starting point shouldn't be:

    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

It should start with "here is a directory of images and their classes" and end with a CNN model.
EDIT: Should anyone have any insight as to where I might get such a tutorial (or have the desire to write one), I know a herd of ML pre-initiates that would be grateful.
That's typically going to be problem-specific. A simple cat/dog tutorial would be reasonably easy with a lot of the existing libraries like scikit-image/PIL.
The other problem here, though, is corpus layout. ImageNet and the academic datasets typically require special readers, but even for real-world datasets there are a few ways to lay out a corpus. In our experience, two layouts have worked well: a folder per label, or labels encoded in the filename.
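The labels-in-the-name layout is trivial to parse if the convention is consistent. A small sketch, assuming Kaggle-cats-vs-dogs-style names like `cat.0001.png` (the helper name and the regex are my own illustration):

```python
import os
import re

def label_from_filename(path, pattern=r"^([a-z]+)\."):
    """Extract the label prefix from a filename like 'cat.0001.png'.

    Assumes the label-in-name layout: the class name is everything
    before the first dot. Raises if the name doesn't match.
    """
    fname = os.path.basename(path)
    m = re.match(pattern, fname)
    if m is None:
        raise ValueError("no label found in %r" % fname)
    return m.group(1)
```

Either layout works; the point is just to pick one convention and write the five-line reader for it once.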
Then there's still balanced minibatches: images segmented into a folder per label don't give you balanced minibatches out of the box.
Then there's disk to think about: do I really want to re-run the same preprocessing every time I train? And then: how do I explain all that to a beginner?
So I'm probably going to want a corpus generator that ends up with pre-saved, balanced minibatches for training... which leads us back to what you see now.
A good middle ground here might be a corpus generator that takes all the minor stuff like that into consideration... but still, data is messy.
I would suggest looking at the wealth of imaging libraries out there in python and building something based on a "from scratch" image corpus.
You may not use Java, but the idea of "vectorization" is still a good one that I think any ML practitioner who's touched pandas could appreciate. We built an abstraction called a DataSetIterator which automagically returns batches so people don't have to think about the details while still having access to "real" data. I'm not sure what the Python equivalent of this would be, though.
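A rough Python analogue of that iterator idea would just be a class wrapping arrays and yielding `(features, labels)` minibatches so callers never index arrays themselves. A minimal sketch (class name borrowed from the comment above; nothing here is a real library API):

```python
import numpy as np

class DataSetIterator:
    """Yield (features, labels) minibatches from in-memory arrays,
    optionally reshuffled on every pass."""

    def __init__(self, X, y, batch_size, shuffle=True, rng=None):
        assert len(X) == len(y)
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = rng or np.random.default_rng()

    def __iter__(self):
        order = np.arange(len(self.X))
        if self.shuffle:
            self.rng.shuffle(order)
        for start in range(0, len(order), self.batch_size):
            idx = order[start:start + self.batch_size]
            yield self.X[idx], self.y[idx]
```

Usage is just `for xb, yb in DataSetIterator(X, y, 32): ...`, which keeps the batching details out of the training loop the same way the Java abstraction does.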
A good exercise would be to figure out how to extract the data and put it into a numpy array. Then you can test on most - if not all - of the frameworks.
I recently parsed that in C#. It's simple enough to do, but the format is weird. It would be much simpler if the digits were stored as individual raw files in folders corresponding to each label.
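The format (IDX) is weird but small: two zero bytes, a type code (0x08 = unsigned byte for MNIST), a dimension count, one big-endian uint32 per dimension, then the raw data. A Python sketch of a reader (the function name `read_idx` is my own):

```python
import struct
import numpy as np

def read_idx(path_or_bytes):
    """Parse an IDX file (the format the raw MNIST files ship in).

    Header: b'\\x00\\x00', a type code (0x08 = uint8, the only type
    handled here), a dimension count, then one big-endian uint32 per
    dimension, followed by the raw uint8 data.
    """
    data = (
        path_or_bytes
        if isinstance(path_or_bytes, (bytes, bytearray))
        else open(path_or_bytes, "rb").read()
    )
    zeros, type_code, ndim = data[0:2], data[2], data[3]
    assert zeros == b"\x00\x00" and type_code == 0x08, "unexpected IDX header"
    dims = struct.unpack(">%dI" % ndim, data[4:4 + 4 * ndim])
    array = np.frombuffer(data[4 + 4 * ndim:], dtype=np.uint8)
    return array.reshape(dims)
```

For the image file this gives a `(60000, 28, 28)` uint8 array, and from there you could trivially dump the digits out as individual files in per-label folders if you prefer that layout.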
Larger problem: MNIST, images, and speech are probably low-hanging fruit, with rich features and large, balanced datasets, for which off-the-shelf packages work very well today.