Why dump CSVs when you can outright store in them?

laumars · on Jan 11, 2022

Sorry, I don’t understand your question. But just in case this answers it:

S3 is cloud storage on AWS. Athena can work directly off the CSVs stored on S3.

Where I said “dump” it was just a colourful way of saying “transfer your files to…”. I appreciate “dump” can also mean different things with databases, so maybe that wasn’t the best choice of word on my part. Sorry for any confusion there.

cgio · on Jan 12, 2022

Dump is the proper term in this case, though, given S3’s limitations (i.e. no appending to a file), which mean you need to either create a new file for each insert (very expensive) or append to a copy and replace the whole file on each update. So in practice it’s a workable read-only replica fed by dumps of updates. For reasonably small datasets it can potentially work; otherwise you’re looking at rebalancing, partitions, etc., and you’re probably better off with Parquet, Avro, etc., given that by then you’re usually at the stage of introducing Spark and similar tools.
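
To make the two options concrete, here is a rough sketch in Python. It does not talk to S3 at all: `ObjectStore` is a made-up in-memory stand-in that only supports whole-object put/get, which is the constraint being described. The helper names, key layout, and sample rows are all assumptions for illustration.

```python
import csv
import io


class ObjectStore:
    """Toy in-memory stand-in for an object store (hypothetical, not an S3 client).

    It only supports whole-object put/get, i.e. there is no append.
    """

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        # A put always replaces the entire object stored under this key.
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def rows_to_csv(rows):
    """Serialise a list of rows to CSV text with \n line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def insert_as_new_object(store, prefix, seq, rows):
    """Option 1: each batch of inserts becomes its own object under a prefix."""
    key = f"{prefix}part-{seq:06d}.csv"
    store.put(key, rows_to_csv(rows))
    return key


def insert_by_rewrite(store, key, rows):
    """Option 2: read the whole object, add the rows, write everything back."""
    try:
        existing = store.get(key)
    except KeyError:
        existing = ""
    store.put(key, existing + rows_to_csv(rows))


store = ObjectStore()

# Option 1: two inserts end up as two separate objects.
insert_as_new_object(store, "events/", 1, [["1", "a"]])
insert_as_new_object(store, "events/", 2, [["2", "b"]])

# Option 2: two inserts end up in one object that was rewritten twice.
insert_by_rewrite(store, "snapshot.csv", [["1", "a"]])
insert_by_rewrite(store, "snapshot.csv", [["2", "b"]])
```

With option 1 the number of objects grows with every insert; with option 2 each update re-reads and re-writes the full file, so the cost grows with file size. Either way you end up batching changes into periodic dumps rather than writing row by row.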