This seems to have very little on Data Quality, and what there is sits in Chapter 16. How much practical industry experience do the authors have? In practice, 90% of your time will be spent on Data Quality and Data Cleansing.


Arguably that's a separate (though obviously critical) concern. I think it's worth abstracting it away as a step in the pipeline with its own concerns, challenges, and methods, one that requires its own deeper study to do well.

For instance, my ML work is almost entirely in engineering simulation regression/surrogate development, where data quality/cleaning is almost a non-issue: all of the work is on the dataset generation side and on the model selection/training/deployment side.

Every job is different!


Agreed, Data Quality in the wild is a huge concern. I've led efforts to establish Lineage/Quality in large orgs, and doing this after the fact is a massive undertaking. Having this in place up front, before all the data pipelines (origination, transformation, pre-processing) calcify, saves a lot of headaches down the road.



