Parsing 30k+ Parquet files overloads the cluster leader

Description

The ParseSetup phase of a Parquet parse can cause an out-of-memory (OOM) error on the leader node. During the setup phase, memory is consumed mainly on the leader while the rest of the nodes are underutilized. This behavior was observed on a 50-node cluster when parsing 30k+ (almost 40k) small Parquet files.
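
To make the imbalance concrete, here is a rough back-of-the-envelope model. It is not H2O code: `setup_load` is a hypothetical helper, the file count is rounded to 40,000, node 0 stands in for the leader, and the round-robin split is shown only for comparison, not as the fix for this ticket.

```python
# Illustrative model only, not H2O code. It counts how many per-file
# setup records each node would hold if setup ran on the leader alone
# versus if the files were spread round-robin over the cluster.

def setup_load(num_files: int, num_nodes: int, distributed: bool) -> list:
    """Return the number of files whose setup metadata each node holds."""
    load = [0] * num_nodes
    for i in range(num_files):
        # Node 0 stands in for the leader (an assumption of this model).
        node = i % num_nodes if distributed else 0
        load[node] += 1
    return load


# Assumed round numbers based on the report: ~40k files, 50 nodes.
leader_only = setup_load(40_000, 50, distributed=False)
spread = setup_load(40_000, 50, distributed=True)

print(max(leader_only), min(leader_only))  # 40000 0
print(max(spread), min(spread))            # 800 800
```

Under these assumptions the leader holds setup state for every file while the other 49 nodes hold none, which matches the observed pattern of leader memory pressure alongside idle workers.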

Assignee

Michal Kurka

Fix versions

Reporter

Michal Kurka

Support ticket URL

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

No

Priority

Major