> by filtering any "books" (rather, files) that are larger than 30 MiB we can reduce the total size of the collection from 51.50 TB to 18.91 TB, shaving a whopping 32.59 TB
Books greater than 30 MiB are all the textbooks.
You are killing the knowledge.
Also killing a lot of rare things.
If you want to do something amazing and small, OCR them.
As an example of a file greater than 30 MB: the other day I grabbed a Greg Bear short story that isn't available digitally; it was in a 90 MB copy of a 1983 issue of Analog Science Fiction and Fact.
Side note: de-duping is an incredibly hard project. How will you diff a mobi and an epub and then make a decision? Or decide between a mobi and a mobi?
Books also change with time. Even in the 90's, kids' books from the 60's had been 'edited'. These can be hidden gems to collectors. Cover art, too.
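To make the de-duping difficulty concrete: one plausible approach (not anything the archive actually does) is to compare books by their *extracted text* rather than by file bytes, since a mobi and an epub of the same book will never match byte-for-byte. The sketch below assumes the format-specific extraction (mobi/epub to plain text) has already happened elsewhere; the `normalize` and `likely_duplicates` names and the 0.95 threshold are purely illustrative.

```python
# Content-based duplicate heuristic: normalize extracted text and
# compare with a fuzzy similarity ratio. Format parsing is out of
# scope; inputs here are already plain text.
import difflib
import re

def normalize(text: str) -> str:
    # Collapse case, punctuation, and whitespace differences that
    # format conversion typically introduces, keeping only the words.
    return re.sub(r"\W+", " ", text.lower()).strip()

def likely_duplicates(a: str, b: str, threshold: float = 0.95) -> bool:
    # difflib.SequenceMatcher ratio is 1.0 for identical strings.
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

# Same story, different conversion artifacts:
epub_text = "Chapter 1. It was a dark and stormy night..."
mobi_text = "CHAPTER 1\nIt was a dark and stormy night"
print(likely_duplicates(epub_text, mobi_text))  # True
```

Even this only flags *candidates*; it says nothing about which copy to keep, and it would happily merge a revised edition with the original, which is exactly the "edited over time" problem raised above.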