NVIDIA Merlin HugeCTR Recommender System Framework
Resources

Related Sessions at GTC 2022
- Merlin HugeCTR: GPU-Accelerated Recommender System Training and Inference [S41352], Minseok Lee, Senior Manager, NVIDIA Merlin Recommender Systems Team
- Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache [S41126], Yingcan Wei, Fan Yu, Matthias Langer, NVIDIA
- Building and Deploying Recommender Systems Quickly and Easily with NVIDIA Merlin [S41119], Even Oldridge, Senior Manager, Merlin Recommender Systems Team, NVIDIA

Getting Started
- HugeCTR on NVIDIA.com: https:/
- GitHub: https:/

Stories
- Leading Design and Development of the Advertising Recommender System at Tencent: An Interview with Xiangting Kong
- Meituan: Optimizing Meituan's Machine Learning Platform: An Interview with Jun Huang

Learn More About HugeCTR
- Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin
- HugeCTR Series Part 1: Scaling and Accelerating Large Deep Learning Recommender Systems (Chinese version available)
- HugeCTR Series Part 2: Training Large Deep Learning Recommender Models with Merlin HugeCTR's Python APIs (Chinese version available)
- HugeCTR Parameter Server Series Part 1: Introduction to Hierarchical Parameter Server

We are hiring (full time and intern): C++ Engineer, CUDA Engineer, Recommendation System Algorithm Researcher. Please email your resume to: sh-HugeCTR

MERLIN HUGECTR: GPU-ACCELERATED RECOMMENDER SYSTEM TRAINING AND INFERENCE
Jerry Shi

RECOMMENDERS: THE PERSONALIZATION ENGINE OF THE INTERNET
Social media, digital advertising, e-commerce, digital content
- 4.7B internet users; 4.3B active social media users; 4.3B watch videos online; 3.7B shop online
- "Already, 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix come from product recommendations based on such algorithms." (Source: McKinsey)

NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
NVIDIA Merlin is an open-source library to deploy recommender systems end to end.
- ETL: pipelines are slow and complex. Solution (NVTabular): GPU-accelerated, easy-to-use ETL pipelines prepare datasets in minutes.
- Data loading: common item-by-item loading can be slow. Solution: asynchronous, GPU-accelerated data loaders for PyTorch and TensorFlow/Keras.
- Training: embedding tables of large deep learning recommender systems can exceed memory. Solution (HugeCTR): easy data-parallel and model-parallel training that scales to TB-size embeddings.
- Inference: achieving high throughput to rank more items while maintaining low latency is difficult. Solution (Triton): high-throughput, low-latency production deployment.

AGENDA
- HugeCTR Overview
- HugeCTR Inference
- HugeCTR Sparse Operation Kit

HUGECTR OVERVIEW

HUGECTR: SCALABLE, ACCELERATED RECSYS TRAINING
https:/
- An open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs
- Written in CUDA C++; heavily exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL
- Provides a high-level, Keras-like Python API
- Continues to power NVIDIA's MLPerf DLRM submissions
- Designed for training recommender models at scale, where gigantic embedding tables are included
- Supports common models and their variants, such as the Deep Learning Recommendation Model (DLRM), Wide & Deep, Deep & Cross Network, and DeepFM
- Supports GPU-accelerated, concurrent model inference as a Triton backend, based on an efficient embedding cache and a hierarchical parameter server (HPS)

CHALLENGES OF THE EMBEDDING LAYER
Memory Demands and Communication Overhead
- Challenge 1: embedding tables may not fit in a single GPU's memory (e.g., 100 GB, 1 TB). Solution: store the tables across multiple GPUs (and multiple nodes), i.e., model parallelism.
- Challenge 2: embedding primitives such as table lookup and the pooling/reduction operators are memory bound. Solution: fuse these CUDA kernels.
- Challenge 3: model parallelism requires inter-GPU/inter-node communication. Solution: use NCCL and/or exploit NVLink (or NVSwitch).
- Common solution for Challenges 1, 2, and 3: use lower precision such as FP16 (see the Python API sketch below).
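The Keras-like Python API and the FP16 solution mentioned above can be illustrated with a short training script. The sketch below is not taken from the deck: the dataset paths, layer names, and sizes are assumptions, and it only follows the general shape of HugeCTR's Python API (CreateSolver, DataReaderParams, Model.add, compile, fit), including the mixed-precision solver options that correspond to the FP16 point above.

```python
# Hedged sketch of a HugeCTR training script using the Keras-like Python API.
# Dataset paths, layer sizes, and slot counts are illustrative assumptions.
import hugectr

solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=2048,
    batchsize=2048,
    lr=0.001,
    vvgpu=[[0, 1]],               # two GPUs on one node
    use_mixed_precision=True,     # FP16 compute with FP32 master weights
    scaler=1024,                  # loss scaling for mixed precision
    repeat_dataset=True)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./criteo_data/file_list.txt"],
    eval_source="./criteo_data/file_list_test.txt",
    check_type=hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam)

model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(
    label_dim=1, label_name="label",
    dense_dim=13, dense_name="dense",
    data_reader_sparse_param_array=[
        hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
model.add(hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=75,
    embedding_vec_size=16, combiner="sum",
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1", optimizer=optimizer))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Reshape,
    bottom_names=["sparse_embedding1"], top_names=["reshape1"],
    leading_dim=416))             # 26 slots x 16-dim embedding vectors
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
    bottom_names=["reshape1", "dense"], top_names=["concat1"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
    bottom_names=["concat1"], top_names=["fc1"], num_output=1024))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU,
    bottom_names=["fc1"], top_names=["relu1"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
    bottom_names=["relu1"], top_names=["fc2"], num_output=1))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
    bottom_names=["fc2", "label"], top_names=["loss"]))
model.compile()
model.summary()
model.fit(max_iter=2000, display=200, eval_interval=500,
          snapshot=2000, snapshot_prefix="model")
```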
RECOMMENDER MODEL AT SCALE
The HugeCTR embedding table can span multiple nodes, beyond multiple GPUs.
[Figure: two nodes with eight GPUs each (GPU0-GPU7); every GPU holds a copy of the dense neural network, while the embeddings span all GPUs of both nodes.]

COMMON EMBEDDING PRIMITIVES
All Memory Bound
[Figure: categorical features fea0, fea1 (multi-hot), fea2, and fea3 are processed against the embedding tables in three steps: (1) table lookup, (2) pooling/reduction, (3) concatenation.]

HUGECTR EMBEDDING LAYER
GPU Hash Table + CUDA Kernel Fusion
[Figure: the same categorical features are processed with (1) a GPU hash table lookup and (2) in-register reduction and concatenation, fused into one kernel.]

WEIGHT CONVERSION FOR MIXED PRECISION TRAINING
From FP32 Master Weights to FP16 Weights
- In each training iteration, the optimizer computes the master weights in FP32.
- To support mixed precision training, the FP32 weights must be cast to FP16.
- Currently, the optimizer weight update and the weight conversion happen in two different kernels.
- Memory bandwidth utilization can be further improved.
[Figure: N fprop and N bprop kernels consume FP16 weights; the optimizer produces new FP32 weights, which a weight conversion step (FP32 to FP16) turns back into FP16 weights.]

DATA PARALLEL NEURAL NETWORK
Multithreaded Kernel Launch to Keep GPUs Busy
- One CPU thread per GPU (threads 0-3 for GPU0-GPU3) launches the per-layer fprop and bprop kernels (L0.K0, L1.K0, L1.K1, L2.K0, ...).
- Accumulated kernel launch latency may not be trivial!
[Figure: per-GPU timelines of individually launched kernels for forward and backward propagation.]

DATA PARALLEL NEURAL NETWORK
CUDA Graph to Minimize CUDA Kernel Launch Latency
- The same per-layer kernels are issued with a single CUDA Graph launch per GPU, reducing the per-kernel launch overhead.
[Figure: the same timelines, with each GPU's kernel sequence covered by one CUDA Graph launch.]

HUGECTR TRAINING PERFORMANCE
Wide & Deep on DGX A100
[Figure: performance charts (one where higher is better, one where lower is better).]

HUGECTR MLPERF DLRM TRAINING PERFORMANCE
In the MLPerf Training v1.0/v1.1 Benchmark
- Model: DLRM; dataset: Criteo 1TB.
- HugeCTR is a key driver behind the recent NVIDIA MLPerf wins.
- In the v1.1 edition, optimization over the entire hardware and software stack shows continuing improvement.
[Figure: benchmark results; higher is better.]
NVIDIA MLPerf DLRM submission details: 4x4S CPU (v1.0): 1.0-1045 | 1x DGX A100 (v1.0): 1.0-1058 | 14x DGX A100 (v1.0): 1.0-1067 | 14x DGX A100 (v1.1): 1.1-2073. For more information, see www.mlperf.org.

EMBEDDING TRAINING CACHE (ETC)
- Train models (TB+) larger than the memory capacity of your GPUs by iteratively prefetching portions of the embedding table into GPU memory at the granularity of a pass. This is especially useful for continuous training.
- Workflow (a toy sketch follows below): Step 1) load embeddings from storage into GPU memory; Step 2) run the training pass; Step 3) update the model on storage; repeat.
[Figure: during pass N the embeddings for key set A reside in GPU memory (100s of GBs), while storage (CPU memory or SSD, TBs) holds the full table; pass N+1 repeats the cycle with key set B.]
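To make the pass-level ETC workflow concrete, here is a small, self-contained Python toy. It is illustrative only, not the HugeCTR API and not code from the deck: a dictionary stands in for the CPU-memory/SSD copy of the embedding table, and only the rows a pass touches are staged, trained on, and written back.

```python
# Toy illustration of the ETC workflow above; plain Python, not the HugeCTR API.
import numpy as np

EMB_DIM = 8
# Stand-in for the full embedding table kept in CPU memory or on SSD.
host_table = {key: np.zeros(EMB_DIM, dtype=np.float32) for key in range(1000)}

def train_pass(gpu_cache, batches, lr=0.1):
    """Stand-in training pass: apply a dummy gradient to each looked-up row."""
    for batch in batches:
        for key in batch:
            gpu_cache[key] -= lr * np.ones(EMB_DIM, dtype=np.float32)

# Each pass consists of the key set it will touch plus its mini-batches.
rng = np.random.default_rng(0)
passes = []
for _ in range(3):
    batches = [rng.integers(0, 1000, size=16) for _ in range(4)]
    passes.append(({int(k) for b in batches for k in b}, batches))

for keyset, batches in passes:
    gpu_cache = {k: host_table[k].copy() for k in keyset}  # Step 1) prefetch to GPU memory
    train_pass(gpu_cache, batches)                          # Step 2) training pass
    host_table.update(gpu_cache)                            # Step 3) update model on storage
```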
INTEROPERABILITY WITH OPEN NEURAL NETWORK EXCHANGE (ONNX)
Compatible with HugeCTR Training APIs
- Open Neural Network Exchange (ONNX) is a common open-source format for AI models.
- hugectr2onnx is a Python package that converts HugeCTR models to the ONNX format.
- Input: the graph configuration JSON, the dense model, and, optionally, the sparse models.
- Output: an ONNX model, which can then be used with ONNX Runtime or another framework's native inference engine, hence better interoperability.
- Example: converting a trained Wide & Deep model (wdl_model: wdl0_sparse_2000.model, wdl1_sparse_2000.model, wdl_dense_2000.model, wdl.json) with the HugeCTR-to-ONNX converter:

    hugectr2onnx.converter.convert(
        onnx_model_path="wdl.onnx",
        graph_config="wdl.json",
        dense_model="wdl_dense_2000.model",
        convert_embedding=True,
        sparse_models=["wdl0_sparse_2000.model", "wdl1_sparse_2000.model"])

https:/

MORE OPTIMIZATIONS & FEATURES
https:/
- MLPerf training v1.0/v1.1 optimizations (v1.0 optimizations: https:/ ). We are actively working on their full generalization. Stay tuned!
- HugeCTR release notes: https:/
- HugeCTR samples and notebooks. Model samples: https:/ Notebooks: https:/

HUGECTR INFERENCE

CHALLENGES OF INFERENCE
- Challenge 1: embedding tables may not fit in a single GPU's memory (e.g., 100 GB, 1 TB).
- Challenge 2: the target latency must be met whilst maintaining high throughput/concurrency.
- Challenge 3: loading an updated incremental model from training into inference must be streamlined (fast deployment).
- Challenge 4: GPU compute and memory resources must be well isolated.

GPU-CENTRIC MEMORY HIERARCHY
High GPU Memory Bandwidth and a Fast GPU Processor for Low-Latency Inference
- GPU streaming multiprocessors (SMs) provide much better computational throughput than the CPU.
- GPU memory bandwidth is about 10x CPU memory bandwidth.
- GPU memory is large enough to hold the "hot" embedding vectors of a gigantic embedding table; only a few embedding vectors are required for the majority of requests.
- Inference can therefore be optimized by leveraging the GPU in production.
[Figure: memory hierarchy with GPU SMs over GPU memory (10s of GB, 100s of GB/s to TB/s), CPU memory (100s of GB to TBs, 10s of GB/s), and storage/RAID (10s to 100s of TB, GB/s to 10s of GB/s).]

HUGECTR INFERENCE ARCHITECTURE
At a High Level
- Designed to use GPU memory effectively to accelerate inference.
- Supports concurrent model inference execution on the same GPU or across multiple GPUs.
- Embedding tables are cached hierarchically: the parameter server manages the whole embedding tables of the different models in CPU memory and on SSDs, while each embedding cache manages the hot portions of the embedding tables and provides GPU-accelerated lookups.
- Instances of the same model share an embedding cache for their predictions.
[Figure: on the GPU, Model1's embedding cache (table cache 1) and Model2's embedding cache (table caches 1 and 2) serve their model instances; on the CPU, the parameter server holds the full embedding tables of Model 1 and Model 2.]

DATABASE BACKEND
Persistent vs. Volatile
- Persistent database: HugeCTR deploys, "persists", and maintains a full copy of all parameters in this database. Persistent databases are regarded as having virtually unlimited storage space. E.g., RocksDB.
- Volatile database: fast, residing in RAM (either remote or local); can be considered a cache for the persistent database. It is assumed to operate with limited space, so additional handlers are implemented to keep it functioning when it runs out of storage space. E.g., Redis.
- The embedding cache (EC), persistent database, and volatile database are EVENTUALLY CONSISTENT!

HUGECTR HIERARCHICAL PARAMETER SERVER
Multi-Level Caching Across Different Types of Memory
[Figure: Level 1 is the embedding cache in GPU memory, which handles query and insertion parameters via a parameter buffer and local data; Level 2 holds local parameters, parameter shards, and meta parameters in CPU memory and across the network, with a parameter buffer pool and temporary memory for the network; Level 3 is parameter replication on local SSDs; incremental parameters arrive via an event sink.]
A simplified sketch of the lookup path through these levels follows below.
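Below is a minimal, self-contained Python sketch of the lookup order described by the hierarchy above: GPU embedding cache first, then the volatile database, then the persistent database. It is illustrative only; the function and the dictionary-based stores are stand-ins, not HugeCTR internals or its API. The cache_missed_embeddings flag mirrors the configuration option of the same name that appears in the Triton configuration further below.

```python
# Illustrative lookup path through the three HPS levels; plain Python stand-ins,
# not HugeCTR code.
def lookup(keys, gpu_cache, volatile_db, persistent_db, cache_missed_embeddings=True):
    vectors = {}
    for key in keys:
        vec = gpu_cache.get(key)            # Level 1: embedding cache on the GPU
        if vec is None:
            vec = volatile_db.get(key)      # Level 2: RAM-resident store (e.g., Redis)
        if vec is None:
            vec = persistent_db[key]        # Level 3: full copy (e.g., RocksDB)
            if cache_missed_embeddings:
                volatile_db[key] = vec      # optionally promote the missed vector
        gpu_cache[key] = vec                # keep hot vectors close to the model
        vectors[key] = vec
    return vectors

# Tiny usage example with dictionaries standing in for the three levels.
persistent = {k: [float(k)] * 4 for k in range(100)}
volatile, gpu = {}, {}
print(lookup([3, 7, 3], gpu, volatile, persistent))
```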
INCREMENTAL UPDATES
- Basically, a lazy queue: the optimizer publishes updates via a sink to Kafka, and each HugeCTR instance monitors the queues of the models it serves.
- Highly automated process: automatic workload distribution; can handle a certain degree of node failure; asynchronous injection of updates in the background.
[Figure: training (e.g., HugeCTR native, TensorFlow SOK) writes to a message sink; a message buffer (e.g., Kafka, RabbitMQ) feeds the message source on the inference side.]

HPS E2E SETUP EXAMPLE
From Training to Deployment

CONFIGURING HUGECTR
E.g., Inference with Triton
Just add to your Triton config:

    "persistent_db": {
        "type": "rocks_db",
        "path": "/models/wide_and_deep",
        "num_threads": 16,
        "read_only": false,
        "max_get_batch_size": 8192,
        "max_set_batch_size": 8192
    },
    "volatile_db": {
        "type": "redis_cluster",
        "address": "host_name:port,...",
        "user_name": "default",
        "password": "123456",
        "num_partitions": 16,
        "overflow_margin": 1000000,
        "overflow_policy": "evict_oldest",
        "overflow_resolution_target": 800000,
        "max_get_batch_size": 8192,
        "max_set_batch_size": 8192,
        "cache_missed_embeddings": true
    }

STREAM MODEL UPDATES LIVE INTO YOUR INFERENCE SYSTEM?
Fast, reliable, automatic, via Apache Kafka!
- HugeCTR will automatically discover and ingest model updates.
- If multiple HugeCTR instances connect to shared databases, the related workload is distributed roughly evenly among them.

In your model training code:

    import hugectr
    solver = hugectr.CreateSolver(
        max_eval_batches=300,
        batchsize_eval=2048,
        batchsize=2048,
        lr=0.001,
        vvgpu=[[2]],
        repeat_dataset=True,
        i64_input_key=True,
        kafka_brokers="host_name:port,...")

Just add to your Triton config:

    "update_source": {
        "type": "kafka_message_queue",
        "brokers": "host_name:port,...",
        "poll_timeout_ms": 50,
        "max_receive_buffer_size": 4096,
        "max_batch_size": 1024,
        "failure_backoff_ms": 300
    }

HPS PERFORMANCE
We Continue to Actively Improve It!
[Figure: average query latency improvement of HugeCTR 3.4 over HugeCTR 3.1 (100% baseline), higher is better: 1.3x for the RocksDB backend, 4.2x for the HashMap backend, and 80.0x for the Redis backend.]
Model: Wide & Deep; dataset: Criteo; evaluation performed across 2000 consecutive random queries. Cluster node configuration: DGX-1V (https:/ ); network configuration: EDR InfiniBand (https:/ ).

MORE RESOURCES ON HUGECTR INFERENCE
Documentation & Deep Dives
- Documentation: HugeCTR Inference Architecture: https:/ HugeCTR Hierarchical Parameter Server: https:/ HugeCTR Triton Backend README: https:/
- Notebooks: Hierarchical Deployment: https:/ Continuous Training and Inference, Part 1: https:/ Part 2: https:/
- GTC sessions: GPU Embedding Cache Performance Deep Dive at GTC 2020: https:/ Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache [S41126] at GTC 2022

HUGECTR SPARSE OPERATION KIT

SPARSE OPERATION KIT (SOK): TENSORFLOW EMBEDDING PLUGIN
https:/
- Motivation: enable HugeCTR features and optimizations, e.g., the HugeCTR embedding operators, for general DL frameworks such as TensorFlow.
- Sparse Operation Kit (SOK) is a Python package that provides GPU-accelerated operations for sparse training and inference (e.g., recommenders) in popular frameworks such as TensorFlow.
- Its major operators are extracted from HugeCTR.

FEATURES
Good Compatibility with TensorFlow
- Compatible with TensorFlow 1.15 and 2.4+
- No need to recompile or reinstall TensorFlow
- Compatible with distributed training frameworks such as Horovod
- Works with MirroredStrategy and MultiWorkerMirroredStrategy

FEATURES
Model-Parallel GPU Embedding Layer for RecSys Training at Scale
(DP stands for data-parallel; MP stands for model-parallel.)
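Before the Horovod training loops on the next slide can run, Horovod and SOK have to be initialized and each process pinned to one GPU. The snippet below is a hedged setup sketch: the sparse_operation_kit import and the sok.Init() call correspond to what the following slide shows, but the global_batch_size argument, the optimizer, the loss function, and the device-pinning details are assumptions based on typical Horovod usage rather than code from the deck.

```python
# Hedged setup sketch for the Horovod + SOK training loops shown below.
# Items marked "assumed" are not from the deck.
import tensorflow as tf
import horovod.tensorflow as hvd
import sparse_operation_kit as sok        # assumed package import name

hvd.init()                                # one process per GPU, as is usual with Horovod
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

sok.Init(global_batch_size=16384)         # assumed argument; sok.Init() itself appears below
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # assumed optimizer

def _loss_fn(labels, logits):             # assumed loss used by train_step below
    return tf.keras.losses.BinaryCrossentropy(from_logits=True)(labels, logits)
```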
SPARSE OPERATION KIT USAGE (HOROVOD)
Define the Training Loop with Horovod

Plain TensorFlow version (every gradient is all-reduced):

    model = TFModel()

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = _loss_fn(labels, logits)
        trainable_variables = model.trainable_variables
        grads = tape.gradient(loss, trainable_variables)
        grads = [hvd.allreduce(grad) for grad in grads]
        optimizer.apply_gradients(zip(grads, trainable_variables))
        return loss

SOK version (only the dense gradients are all-reduced; the model-parallel embedding variables are updated directly):

    sok.Init()
    model = SOKModel()

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = _loss_fn(labels, logits)
        emb_variables, other_variables = sok.split_embedding_variable_from_others(
            model.trainable_variables)
        emb_grads, other_grads = tape.gradient(loss, [emb_variables, other_variables])
        other_grads = [hvd.allreduce(grad) for grad in other_grads]
        optimizer.apply_gradients(zip(emb_grads, emb_variables))
        optimizer.apply_gradients(zip(other_grads, other_variables))
        return loss

MORE RESOURCES ON SPARSE OPERATION KIT
Documentation & Deep Dives
- Documentation: https:/ https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/index.html
- Notebooks: SOK
