Machine learning workloads generally process huge amounts of data, which drives up not only storage cost but also pressure on vCores, memory, and I/O.
In our talk, we will present how we reduce cost and resource pressure in the Uber Data Lake through several initiatives at the Apache Parquet layer, e.g., advanced encoding, higher-ratio compression, precision reduction, and re-ordering. We will also discuss the challenges of using Spark to carry out these initiatives at Uber scale, and the innovations we made to speed up the Spark jobs and reach our goal.
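
To give a rough intuition for why re-ordering and precision reduction can shrink data, here is a minimal Python sketch. It is not Parquet and not the pipeline described in the talk: it uses the standard-library `zlib` as a stand-in compressor, treats "re-ordering" as sorting rows (an assumption), and the column names `city_ids` and `scores` and the helper `compressed_size` are hypothetical.

```python
import random
import struct
import zlib


def compressed_size(values, fmt):
    """Pack values with a struct format code and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    return len(zlib.compress(raw, 9))


random.seed(0)

# Hypothetical columns: a low-cardinality integer key and a floating-point metric.
city_ids = [random.randrange(50) for _ in range(10_000)]
scores = [random.random() for _ in range(10_000)]

# Re-ordering (assumed here to mean sorting): equal values become adjacent,
# so the compressor sees long repeated runs.
unsorted_size = compressed_size(city_ids, "i")
sorted_size = compressed_size(sorted(city_ids), "i")

# Precision reduction: store the metric as 32-bit floats instead of 64-bit doubles.
# This is lossy, so it only suits columns that tolerate the reduced precision.
double_size = compressed_size(scores, "d")
float_size = compressed_size(scores, "f")

print(f"int column:   unsorted={unsorted_size} B, sorted={sorted_size} B")
print(f"float column: float64={double_size} B, float32={float_size} B")
```

The exact byte counts depend on the data and the compressor; the point is only the direction of the effect, which a columnar format can exploit further through its own encodings.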