large unstructured blobs and large files are among the things not well suited to...

bertday · on June 8, 2021

Can you elaborate on what requirements these blob systems should have?

My understanding is that object stores are typically “flat” by design to scale well (in contrast to the tree structure found in filenames).

For content addressing, are people using the keys in sophisticated ways or are the values being indexed? Any reason to push this complexity into the storage layer as opposed to composing the functionality?

jFriedensreich · on June 8, 2021

Well, there is often no hierarchy like the folders on an old-school filesystem, but if the system uses chunking, how the blob chunks are organized is very important for its performance and scaling characteristics. The chunking algorithm needs to be fast, but it also has to produce sensible chunk sizes and counts, and it can additionally use data-based boundaries so that chunks can be reused even if the blob changes near its start. The right choice depends on your specific application (e.g. the read/write ratio and average file sizes) and on what makes optimal use of the underlying filesystem, which is one reason why no de facto standard chunker has been established so far.

There are many tradeoffs for key organization too. Do you need more sophisticated range queries or only single-key lookups? How do you balance growing and shrinking your data structure against performance? What is the clustering story? How do you handle rebalancing, cleanup and pruning? Is your primary key organization content hashes, as in ipfs, or more arbitrary strings, as in s3/minio? Is your metadata / secondary key system completely integrated or more independent?
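
To make the "data based boundaries" point concrete, here is a minimal sketch of content-defined chunking with a gear-style rolling hash. It is only an illustration, not the chunker of any system mentioned in this thread: the names `chunk` and `chunk_ids` and the values of `MASK`, `MIN_SIZE` and `MAX_SIZE` are made up for the example, and a real chunker would tune them for its workload.

```python
import hashlib
import random
from typing import List

# Hypothetical parameters, chosen only for illustration.
MASK = (1 << 11) - 1   # a cut candidate roughly every 2048 bytes
MIN_SIZE = 256         # avoid tiny chunks
MAX_SIZE = 8192        # force a cut if no candidate shows up

# Fixed pseudo-random table so the chunker is deterministic across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> List[bytes]:
    """Split data at positions picked by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 32-bit state, so the masked
        # bits depend only on the most recent bytes, not on the offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> List[str]:
    """Content hashes of the chunks, usable as keys in a blob store."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Comparing `set(chunk_ids(blob))` with `set(chunk_ids(b"new header" + blob))` shows the effect: after the first chunk or so, the boundaries fall on the same content again, so most chunk hashes are shared. With fixed-size chunks, every chunk after the inserted bytes would change.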

That's exactly what your last question points to. If you are, let's say, dropbox and probably have a super sophisticated key-value store setup, I can imagine you would want your content-addressable layer to be as simple and narrowly optimized as possible, and you would develop and optimize the indexing, metadata and key query system almost completely separately. If you are working on a system that should also scale down to run on individual machines, like ipfs, git annex, or minio before their focus on kubernetes, you want a system that can run as a single daemon but that users can also reason about as one integrated concept.
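
As a rough sketch of the "compose it" option: a deliberately narrow content-addressed layer with a separate naming/metadata layer on top. The class names `BlobStore` and `NameIndex` are hypothetical, both layers are plain in-memory dicts, and this says nothing about how dropbox or any other system mentioned here is actually built.

```python
import hashlib
from typing import Dict, List


class BlobStore:
    """Narrow storage layer: the key is always the SHA-256 of the value."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


class NameIndex:
    """Separate metadata layer: maps arbitrary names to content hashes."""

    def __init__(self, store: BlobStore) -> None:
        self._store = store
        self._names: Dict[str, str] = {}

    def write(self, name: str, data: bytes) -> None:
        self._names[name] = self._store.put(data)

    def read(self, name: str) -> bytes:
        return self._store.get(self._names[name])

    def list_prefix(self, prefix: str) -> List[str]:
        # Prefix/range queries live here, not in the blob layer.
        return sorted(n for n in self._names if n.startswith(prefix))
```

The blob layer never sees names or queries, so it can be optimized on its own, while the index can be swapped for whatever key-value store you already run. An integrated system would instead ship both pieces behind one daemon and one mental model.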

bertday · on June 8, 2021

I see. Some of these, like ipfs, are more general purpose systems (basically communication protocols) than I was thinking of.