Applications of Large-Scale Pre-trained Language Models in Baidu Search
Shuaiqiang WANG (王帅强)
Pre-trained Language Model for Web-Scale Retrieval & Ranking in Baidu Search
Shuaiqiang WANG
http:/

Outline
1. Background
2. Retrieval
3. Ranking
4. Summary

Background
- Retrieval and ranking are two crucial stages in a web-scale search engine.
- Pipeline: Query -> Database -> Retrieval (over web-scale documents) -> Ranking (over a few hundred or thousand candidates) -> Results.

Background: Baidu ERNIE
1. Sun, Y. et al., 2019. ERNIE: Enhanced Representation through Knowledge Integration. arXiv:1904.09223.
2. Sun, Y. et al., 2020. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding. In AAAI.

Background: Beyond Text Matching (Semantic Retrieval & Ranking)
- Representation-based methods: represent document semantics as latent vectors; retrieval becomes nearest-neighbor search in the latent space.
- Interaction-based models: ranking via matching over the local interactions between query and document.
- (Figure: a query and its semantically related candidates in embedding space. Picture from: Dai, A. M., Olah, C., and Le, Q. V. Document Embedding with Paragraph Vectors. arXiv:1507.07998, 2015.)

Background: Challenges
- Semantic retrieval: effectively understanding the semantics of queries and documents, including a large number of low-frequency queries, inside a web-scale retrieval system.
- Semantic ranking: expensive computations, and pre-training that is ranking-agnostic.
- (Figure: ERNIE masked language modeling over [CLS]- and [SEP]-delimited Masked Sentence A and Masked Sentence B.)
- Our contribution: one of the largest applications of PLMs for web-scale retrieval & ranking.
  1. Zou, L. et al. Pre-trained Language Model based Ranking in Baidu Search. In KDD 2021.
  2. Liu, Y. et al. Pre-trained Language Model for Web-scale Retrieval in Baidu Search. In KDD 2021.

Outline: 2. Retrieval

Methodology: Retrieval Model
- Goal: learning query-document semantic relatedness.
- Backbone: a bi-encoder (i.e., two-tower) architecture* with transformer query and doc encoders.
- The tokenized query and doc pass through their respective encoders; CLS-pooling yields a query embedding and a doc embedding, from which the retrieval score is computed.
* Chang, W.-C. et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. arXiv:2002.03932, 2020.

Methodology: Retrieval Model (Poly-attention)
- Poly-attention: bi-encoders with more query-document interaction.*
- m attention codes produce multiple embeddings $p_1, \dots, p_m$ on the query side; each $p_i$ is scored against the doc embedding to give a score $s_i$, and the final retrieval score is $\text{score} = \max_{1 \le i \le m} s_i$.
* Humeau, S. et al. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. arXiv:1905.01969, 2019.

Methodology: Positive & Negative Data Mining
- Mining from different data sources:
  - Search log: positives are user-clicked documents; negatives are non-clicked documents.
  - Manually labeled data: positives are high-scored documents; negatives are low-scored documents.
- In-batch negative mining introduces random negatives: for queries $q_1, \dots, q_n$ with relevant docs $d_1, \dots, d_n$ and strong negatives $d_1', \dots, d_n'$, each $(q_i, d_i)$ is a relevant pair, $(q_i, d_i')$ an irrelevant pair from the strong negative, and each $(q_i, d_j)$ with $j \ne i$ an irrelevant pair from a random in-batch negative.
- Benefits: more aligned with the retrieval task, and an efficient way to scale up the number of negatives (see the sketch below).
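To make the two-tower scoring and the in-batch negative scheme concrete, here is a minimal sketch in PyTorch. It is an illustration under assumptions, not the production system: `BiEncoder`, `in_batch_negative_loss`, the temperature value, and the generic encoder modules are hypothetical stand-ins for the ERNIE towers described above.

```python
# Minimal sketch (hypothetical, not Baidu's code) of a two-tower retriever
# trained with in-batch random negatives.
import torch
import torch.nn.functional as F

class BiEncoder(torch.nn.Module):
    def __init__(self, query_encoder, doc_encoder):
        super().__init__()
        # Any modules mapping token ids [B, L] to hidden states [B, L, H].
        self.query_encoder = query_encoder
        self.doc_encoder = doc_encoder

    @staticmethod
    def cls_pool(hidden_states):
        # CLS-pooling: keep the hidden state of the first ([CLS]) token.
        return hidden_states[:, 0]

    def forward(self, query_tokens, doc_tokens):
        q = self.cls_pool(self.query_encoder(query_tokens))  # [B, H]
        d = self.cls_pool(self.doc_encoder(doc_tokens))      # [B, H]
        return q, d

def in_batch_negative_loss(q, d, temperature=0.05):
    """Softmax loss where every other doc in the batch is a random negative.

    q, d: [B, H] embeddings; (q[i], d[i]) is the relevant pair, so the
    correct "class" for query i is column i of the score matrix.
    """
    scores = q @ d.t() / temperature                   # [B, B] query-doc scores
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(scores, labels)
```

In the deck's setup, the strong negatives mined from the search log and from labeled data would be encoded as well and appended as extra columns of the score matrix, so each query is scored against its positive, its strong negative, and the in-batch random negatives at once.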
Methodology: Training Paradigm
- Multi-stage training: from unsupervised to supervised, and from a general corpus to task-specific data.

Methodology: Embedding Compression
- Model deployment calls for compression and quantization.
- Pipeline: doc embedding -> compression with an additional FC layer -> quantization -> compact doc embedding.

Methodology: System Workflow
- Deployment integrates term-based (text-matching) retrieval with ERNIE-based retrieval and unifies their results with post-retrieval filtering.

Evaluation: Online Evaluation
- Metrics: ΔDCG & ΔGSB, where #Good = the number of queries on which the new system performs better.
- (Figure: online evaluation results.)

Outline: 3. Ranking

Content-aware Pre-trained Language Model: Pyramid-ERNIE

  Method           Time Complexity
  Original ERNIE   O(L * h * (N_q + N_t + N_s)^2)
  Pyramid-ERNIE    O(L_low * h * (N_q + N_t)^2 + L_low * h * N_s^2 + L_high * h * (N_q + N_t + N_s)^2)

  (L: number of transformer layers, split into L_low lower and L_high upper layers in the pyramid; h: per-layer attention cost factor; N_q, N_t, N_s: lengths of the query, title, and summary.)

QUery-WeIghted Summary ExTraction (QUIET)
- STEP-1: each query term Term_i carries a weight W_i; every candidate sentence is scored by the sum of the weights of the query terms it contains (e.g., a sentence containing Term_1 and Term_3 scores W_1 + W_3; one containing only Term_1 scores W_1).
- STEP-2: choose the candidate sentence with the maximum score.
- STEP-3: remove the selected sentence from the candidate set and repeat from STEP-1 (a minimal sketch of this greedy loop is given at the end of the deck).

Finetune with Web-Scale Calibrated Clicks
- Raw clicks are noisy and inconsistent with relevance; calibrated clicks align clicks with human labels (a label generator trained on human labels converts raw clicks into calibrated clicks).
- Training pipeline: pretrain with general data (general ERNIE) -> post-pretrain with the search log -> finetune with calibrated clicks -> finetune with human labels (ERNIE for Search).

Finetune with Human Labels
We manually labeled millions of query-document pairs and train Pyramid-ERNIE with a mixture of pairwise and pointwise losses:

$$\mathcal{L}(Y, F(q, D)) = \sum_{y_i \prec y_j} \Big[ \max\big(0,\; f(q, d_i) - f(q, d_j) + \tau\big) + \lambda \big( \delta(f(q, d_i), y_i) + \delta(f(q, d_j), y_j) \big) \Big]$$

where the hinge term enforces a margin τ between pairs whose labels satisfy y_i ≺ y_j, and the pointwise term δ anchors each predicted score to its human label with weight λ.

Evaluation: Compared Systems
- Base: a basic ERNIE-based ranking policy, fine-tuned with a pairwise loss using human-labeled query-document pairs.
- Content-aware Pyramid-ERNIE (CAP): a Pyramid-ERNIE architecture incorporating the query-dependent document summary into the deep contextualization, to better capture the relevance between the query and document.
- Relevance-oriented Pre-training (REP): pre-training the Pyramid-ERNIE model with refined large-scale user-behavioral data before fine-tuning it on the task data.
- Human-anchored Fine-tuning (HINT): anchoring the ranking model with human-preferred relevance scores.

Evaluation: Results

  Model           ΔDCG_2   ΔDCG_4   ΔAB (Random)   ΔAB (Long-Tail)   ΔGSB (Random)   ΔGSB (Long-Tail)
  Base            -        -        -              -                 -               -
  +CAP            0.65%    0.76%    0.15%          0.35%             3.50%           6.00%
  +CAP+REP        2.78%    1.37%    0.58%          0.41%             5.50%           7.00%
  +CAP+REP+HINT   2.85%    1.58%    0.14%          0.45%             6.00%           7.50%

  Improvements are evaluated for statistical significance with a t-test (p < 0.05) against the baseline.

Outline: 4. Summary

Conclusion
- PLM-based retrieval and ranking models: ERNIE-based models with a multi-stage training paradigm, fully deployed online.
- A simple search pipeline: Query -> Database -> Retrieval (over web-scale documents) -> Ranking (over hundreds or thousands of candidates) -> Results.

We are hiring! Please drop a message if interested.

Thank You
Shuaiqiang WANG
http:/
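As a companion to the QUery-WeIghted Summary ExTraction (QUIET) slide above, here is a minimal sketch of the greedy STEP-1/2/3 loop in Python. It is a simplified illustration, not the paper's implementation: `extract_summary` and its arguments are hypothetical, and real term weights would come from a learned weighting model rather than being passed in directly.

```python
# Minimal sketch of QUIET-style greedy summary extraction as described on
# the slide above (hypothetical, simplified stand-in).
def extract_summary(query_terms, term_weights, sentences, max_sentences=2):
    """Greedily pick the sentences that best cover the weighted query terms.

    query_terms:  list of query terms
    term_weights: dict mapping each query term to its weight (STEP-1 input)
    sentences:    candidate sentences, each given as a list of tokens
    """
    candidates = list(sentences)
    summary = []
    while candidates and len(summary) < max_sentences:
        # STEP-1: score each candidate by the summed weights of the
        # query terms it contains.
        def score(sentence):
            return sum(term_weights[t] for t in set(query_terms) if t in sentence)
        # STEP-2: choose the sentence with the maximum score.
        best = max(candidates, key=score)
        summary.append(best)
        # STEP-3: remove the selected sentence from the candidates and repeat.
        candidates.remove(best)
    return summary

# Example: two query terms with weights 0.7 and 0.3.
print(extract_summary(
    ["baidu", "retrieval"],
    {"baidu": 0.7, "retrieval": 0.3},
    [["baidu", "is", "a", "search", "engine"],
     ["retrieval", "finds", "candidates"],
     ["unrelated", "sentence"]],
))
```

Running the example selects the sentence containing the heavier term "baidu" first, then the one containing "retrieval", mirroring the max-score-then-remove loop on the slide.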