Latest Advances of Neural Language Models
Yang Zhilin (杨植麟), Recurrent AI (循环智能)
Part I: About Me

Self-introduction
- Co-founder of Recurrent AI; previously worked at Google Brain and Facebook AI Research
- Co-authored papers with multiple Turing Award laureates
- Achieved state-of-the-art results on more than 30 datasets in natural language understanding, semi-supervised learning, and other areas
- B.S. from Tsinghua University (2015); Ph.D. from Carnegie Mellon University (2019), advised by Ruslan Salakhutdinov, head of AI at Apple

Main research results
- XLNet (NeurIPS 2019): outperforms Google BERT on 20 datasets; the best pretrained model to date given standard FLOPs; NeurIPS oral presentation (top 0.5%); covered by dozens of AI media outlets
- Transformer-XL (ACL 2019): set new records on all mainstream language modeling benchmarks; the first attention model to outperform LSTMs at both word level and character level; can generate coherent text of several thousand words
- HotpotQA (EMNLP 2018): a multi-hop reasoning dataset; used for model evaluation by Stanford, the University of Washington, UT Austin, Tsinghua University, ByteDance, JD, Microsoft, and other institutions
- Semi-supervised graph learning (ICML 2016): 400+ citations; popularized standard datasets for the graph learning field; adopted as a standard baseline by hundreds of follow-up works

Part II: XLNet

Learning from Unlabeled Data
- Unlabeled data: abundant (1000x more), accessible
- Labeled data: scarce, expensive

Unsupervised Pretraining
- Combine unlabeled and labeled data through new algorithms and models to improve over purely supervised learning

Related Work
- RBMs (Salakhutdinov et al. 2007), autoencoders (Vincent et al. 2008), Jigsaw (Noroozi and Favaro 2016), GANs (Donahue and Simonyan 2019)
- word2vec (Mikolov et al. 2013), GloVe (Pennington et al. 2014)
- Semi-supervised sequence learning (Dai and Le 2015), ELMo (Peters et al. 2018), CoVe (McCann et al. 2017), GPT (Radford et al. 2018), BERT (Devlin et al. 2018)

Two Objectives for Pretraining
- Auto-regressive (AR) language modeling: a unidirectional Transformer predicts "New York is a city" token by token from the left context. Limitation: not able to model bidirectional context.
- (Denoising) auto-encoding (AE): a bidirectional Transformer sees "[mask] [mask] is a city" and reconstructs "New" and "York". Limitations: the predicted tokens are independent of each other, and [mask] is never used during finetuning.

New Objective: Permutation Language Modeling
- Sample a factorization order
- Determine the attention masks based on the order
- Optimize a standard language modeling objective
- Benefits: autoregressive, so it avoids the disadvantages of AE, while still able to model bidirectional context

Examples
- Factorization order "New York is a city":
  P(New York is a city) = P(New) * P(York | New) * P(is | New York) * P(a | New York is) * P(city | New York is a)
- Factorization order "city a is New York":
  P(New York is a city) = P(city) * P(a | city) * P(is | city a) * P(New | city a is) * P(York | city a is New)
- The sequence order is not shuffled; only the attention masks are changed to reflect the factorization order.

(Figure: hidden states h_i^(l) and memory mem^(l) with the attention masks induced by factorization orders 3 2 4 1, 1 4 2 3, 2 4 3 1, and 4 3 1 2.)

Comparing XLNet and BERT objectives
- BERT objective (auto-encoding): "New" and "York" are predicted independently of each other.
- XLNet objective (auto-regressive): able to model the dependency between "New" and "York", able to model bidirectional context, and factorizes the joint probability with a product rule that holds universally.

Reparameterization
- Standard parameterization: the context representation h does not contain the position of the target, so the objective is reduced to predicting a bag of words.
- Naively "standing at" the target position would let the model see and predict itself trivially.
- Solution: condition the distribution on the position of the target, but not on its content.

How to Formulate Features
- Let h_i^(l) denote the feature of the i-th token on layer l, and suppose the factorization order is 3 2 4 1.
- (Figure: the layer-1 and layer-2 features h_1^(1) through h_4^(2) of a 4-token sequence under this order.)
- A position cannot see itself when it is being predicted, otherwise the task is trivial.
- The same feature is asked to do two conflicting things: h_2^(1) should not encode x_2 (it is used to predict x_2), yet it should encode x_2 (h_4^(2) only has access to x_2 through h_2^(1) in the first layer). A single stream of features cannot satisfy both requirements.

Two-Stream Attention
- Factorization order: 3, 2, 4, 1
- Maintain two streams: a content stream h and a query stream g, each updated with its own attention mask.
- (Figure: content-stream and query-stream attention (Q, K, V) for the order 3, 2, 4, 1.)
- At the first layer, h is the word embedding and g is a trainable parameter.
- Only h is used during finetuning; the last-layer g is used for optimizing the LM loss.
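The following is a minimal NumPy sketch, not the official XLNet implementation, of the two mechanisms above: turning a sampled factorization order into attention masks (the sequence itself is never shuffled), and one heavily simplified two-stream attention step in which the content stream may see its own token while the query stream may not. All names (perm_masks, two_stream_layer, content_mask, query_mask) are illustrative, and the attention is single-head with no learned projections, position encodings, or memory.

```python
# Minimal sketch of permutation-order attention masks and a two-stream step.
import numpy as np

def perm_masks(order):
    """Given a factorization order (a permutation of positions), build:
       - content_mask[i, j] = 1 if the content stream at position i may attend
         to position j (tokens earlier in the order, plus position i itself);
       - query_mask[i, j]   = 1 if the query stream at position i may attend
         to position j (tokens strictly earlier in the order, never itself)."""
    n = len(order)
    rank = np.empty(n, dtype=int)
    rank[list(order)] = np.arange(n)        # rank[pos] = step at which pos is predicted
    content_mask = (rank[None, :] <= rank[:, None]).astype(float)
    query_mask = (rank[None, :] < rank[:, None]).astype(float)
    return content_mask, query_mask

def masked_attention(q, kv, mask):
    """Single-head scaled dot-product attention with a binary mask
       (no learned projection matrices, for brevity)."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    scores = np.where(mask > 0, scores, -1e9)   # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def two_stream_layer(h, g, content_mask, query_mask):
    """One simplified two-stream step: the content stream h attends with the
       content mask (it can see its own token), while the query stream g
       attends to the content stream with the query mask (context only)."""
    h_new = masked_attention(h, h, content_mask)
    g_new = masked_attention(g, h, query_mask)
    return h_new, g_new

# Example: 4 tokens, factorization order 3 2 4 1 (0-indexed: 2 1 3 0).
order = [2, 1, 3, 0]
content_mask, query_mask = perm_masks(order)
print(content_mask)  # position 4 may attend to positions 3, 2, and itself
print(query_mask)    # position 4 may attend to positions 3 and 2, not itself
# Note: the first token in the order has an empty query context; a real
# implementation handles this with memory or special tokens.

d = 8
rng = np.random.default_rng(0)
h = rng.normal(size=(4, d))                   # layer-0 content stream: word embeddings
g = np.tile(rng.normal(size=(1, d)), (4, 1))  # layer-0 query stream: one shared (trainable) vector
h, g = two_stream_layer(h, g, content_mask, query_mask)
```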
Summarizing XLNet
- Challenge: independence assumption and pretrain-finetune distribution discrepancy in BERT. Solution: permutation language modeling.
- Challenge: the standard parameterization is reduced to bag-of-words prediction. Solution: reparameterization with positions.
- Challenge: contradiction between predicting a token itself and serving as context for others. Solution: two-stream attention.

Experiment 1: Comparison with BERT
- Same training data as BERT: Wikipedia + BooksCorpus
- Same pretraining hyperparameters as BERT: model size L=24, H=1024, A=16; batch size 256; 1M steps
- Same hyperparameter search space for finetuning as BERT
- Result: XLNet outperforms BERT on 20 tasks, reporting the best of 3 BERT variants, with almost identical training recipes.

Experiment 2: Comparison with RoBERTa
- Less training data for XLNet: 126GB vs. 160GB
- Same pretraining hyperparameters as RoBERTa: model size L=24, H=1024, A=16; batch size 8192; 500K steps
- Same hyperparameter search space for finetuning as RoBERTa
- Result: XLNet outperforms RoBERTa on all considered tasks, with almost identical training recipes.

XLNet is the best pretrained model today given standard FLOPs.
(Figure: accuracy vs. pretraining FLOPs for BERT-Large, RoBERTa, XLNet, ALBERT, and T5, at roughly 1x, 4x, and 16x FLOPs.)

Part III: Research Plan

Research Proposal
- Challenge: XLNet and similar methods still rely on a large amount of labeled data for target tasks.
- Goal: improve the data efficiency of the pretraining-finetuning paradigm.
- Directions: pretraining + meta learning; pretraining + multi-view integration.

Meta Learning: Background (Chen et al. 2019)

Pretraining + Meta Learning
- Main idea: a meta learning paradigm for finetuning.
- Why it might work: learning to compare a novel instance against memory.
- Goal: reduce sample complexity and improve data efficiency.
- Technical novelties and challenges: a meta learning algorithm that works with dozens to hundreds of examples and a pretrained model.

Data Often Have Multiple Views (Features)
- Example 1: XLNet can learn to classify sales calls and texts, while structured data about the same interactions is stored in databases. Question: how to combine the two views?
- Example 2: XLNet can learn to classify medical texts, while another black-box model is trained on medical imaging data. Question: how to combine the two views?

Naive Approach: Shallow Mixing
- The extra views do not change the text representations.
- Not expressive enough to capture the dependency between views.

Proposed Approach: Deep Integration
- Deep integration better models the dependency among views.
- Challenge: a pretrained model normally only takes text as input.
- Solution 1: turn the extra views into text-like representations.
- Solution 2: add additional structures to XLNet to incorporate the extra views (a minimal code sketch of these two solutions follows the closing slide below).

Research Plan Highlight
- Goal: improve the data efficiency of XLNet-like methods.
- Proposed methods: pretraining + meta learning; pretraining + multi-view learning (turn extra views into text-like representations; add additional structures to XLNet to incorporate extra views).
- Datasets and experimental settings: in-house text classification datasets extracted from sales calls; billions of unlabeled sentences; a few thousand labeled sentences per class; multiple domains; evaluation metric: F1 score.

Thanks!
Yang Zhilin (杨植麟), Recurrent AI
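The slides do not specify an implementation of "deep integration", so the PyTorch sketch below is one possible reading of the contrast drawn in Part III: shallow mixing concatenates the extra view to a pooled text vector at the classifier, while deep integration projects the extra view into token-like embeddings that the encoder attends over jointly with the text. All names here (ShallowMixingClassifier, DeepIntegrationClassifier, view_projector) are hypothetical, and a generic nn.TransformerEncoder stands in for a pretrained model such as XLNet.

```python
# Hypothetical sketch: shallow mixing vs. deep integration of an extra view.
import torch
import torch.nn as nn

class ShallowMixingClassifier(nn.Module):
    """Naive approach: the extra view is concatenated to a fixed pooled text
    vector at the classifier, so it cannot change the text representation."""
    def __init__(self, text_encoder, hidden, view_dim, num_classes):
        super().__init__()
        self.text_encoder = text_encoder
        self.head = nn.Linear(hidden + view_dim, num_classes)

    def forward(self, token_embeddings, view_features):
        text_vec = self.text_encoder(token_embeddings).mean(dim=1)  # pooled text representation
        return self.head(torch.cat([text_vec, view_features], dim=-1))

class DeepIntegrationClassifier(nn.Module):
    """Proposed direction: turn the extra view into "text-like" token
    embeddings and let the encoder attend over text and view tokens jointly."""
    def __init__(self, text_encoder, hidden, view_dim, num_classes, num_view_tokens=4):
        super().__init__()
        self.text_encoder = text_encoder
        self.num_view_tokens = num_view_tokens
        self.view_projector = nn.Linear(view_dim, num_view_tokens * hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, token_embeddings, view_features):
        b, _, h = token_embeddings.shape
        view_tokens = self.view_projector(view_features).view(b, self.num_view_tokens, h)
        joint = torch.cat([view_tokens, token_embeddings], dim=1)   # prepend view "tokens"
        encoded = self.text_encoder(joint)                          # attention models cross-view dependencies
        return self.head(encoded.mean(dim=1))

# Toy usage with a generic Transformer encoder standing in for a pretrained model.
hidden, view_dim, num_classes = 64, 10, 3
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)
texts = torch.randn(2, 16, hidden)   # batch of 2 "sentences" of 16 token embeddings
views = torch.randn(2, view_dim)     # structured features from a database, per example
baseline = ShallowMixingClassifier(encoder, hidden, view_dim, num_classes)
model = DeepIntegrationClassifier(encoder, hidden, view_dim, num_classes)
print(baseline(texts, views).shape)  # torch.Size([2, 3])
print(model(texts, views).shape)     # torch.Size([2, 3])
```

The design point of the sketch is that prepending projected view tokens lets every self-attention layer condition the text representation on the extra view, which shallow mixing by construction cannot do.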
