May 10-12, 2023
Vancouver, British Columbia, Canada + Virtual
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit North America 2023 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in Pacific Daylight Time (UTC-7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Thursday, May 11 • 2:55pm - 3:35pm
How to Eliminate the I/O Bottleneck and Continuously Feed the GPU While Training in the Cloud - Lu Qiu, Alluxio

Model training is a time-consuming, data-intensive, and resource-hungry phase of machine learning that makes heavy use of storage, CPUs, and GPUs. The data access pattern in training requires frequent I/O over a massive number of small files, such as images and audio files. As distributed training in the cloud advances, it becomes challenging to sustain enough I/O throughput to keep expensive GPUs highly utilized rather than waiting on data. Because these access patterns and I/O challenges differ from those of traditional data analytics, they call for a change in the architecture of your data platform. In this talk, Lu Qiu will introduce a new architecture that optimizes I/O across the entire data pipeline and maintains the throughput the GPU requires. She will also share how to implement this architecture on Kubernetes for PyTorch workloads in the public cloud.

Speakers

Lu Qiu

Machine Learning Engineer, Alluxio
Lu Qiu is a machine learning engineer at Alluxio and a PMC maintainer of the open source project Alluxio. Lu develops big data solutions for AI/ML training. Before that, Lu was responsible for core Alluxio components including leader election, journal management, and metrics management...


Thursday May 11, 2023 2:55pm - 3:35pm PDT
205 (Level 2)