GPT-4 Technical Report

OpenAI*

Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

1 Introduction

This technical report presents GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs. Such models are an important area of study as they have the potential to be used in a wide range of applications, such as dialogue systems, text summarization, and machine translation. As such, they have been the subject of substantial interest and progress in recent years [1-34].

One of the main goals of developing such models is to improve their ability to understand and generate natural language text, particularly in more complex and nuanced scenarios. To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well and often outscores the vast majority of human test takers. For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers. This contrasts with GPT-3.5, which scores in the bottom 10%.

On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering). On the MMLU benchmark [35, 36], an English-language suite of multiple-choice questions covering 57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4 surpasses the English-language state of the art in 24 of 26 languages considered. We discuss these model capability results, as well as model safety improvements and results, in more detail in later sections.

This report also discusses a key challenge of the project: developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to make predictions about the expected performance of GPT-4 (based on small runs trained in similar ways) that were tested against the final run to increase confidence in our training.

Despite its capabilities, GPT-4 has similar limitations to earlier GPT models [1, 37, 38]: it is not fully reliable (e.g. can suffer from "hallucinations"), has a limited context window, and does not learn from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.

* Please cite this work as "OpenAI (2023)". Full authorship contribution statements appear at the end of the document. Correspondence regarding this technical report can be sent to gpt4-…

arXiv:2303.08774v3 [cs.CL] 27 Mar 2023
GPT-4's capabilities and limitations create significant and novel safety challenges, and we believe careful study of these challenges is an important area of research given the potential societal impact. This report includes an extensive system card (after the Appendix) describing some of the risks we foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more. It also describes interventions we made to mitigate potential harms from the deployment of GPT-4, including adversarial testing with domain experts, and a model-assisted safety pipeline.

2 Scope and Limitations of this Technical Report

This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model [39] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [40]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

We are committed to independent auditing of our technologies, and shared some initial steps and ideas in this area in the system card accompanying this release.² We plan to make further technical details available to additional third parties who can advise us on how to weigh the competitive and safety considerations above against the scientific value of further transparency.

² In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social and economic implications of AI systems, including the need for effective regulation.

3 Predictable Scaling

A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000× to 10,000× less compute.

3.1 Loss Prediction

The final loss of properly-trained large language models is thought to be well approximated by power laws in the amount of compute used to train the model [41, 42, 2, 14, 15].

To verify the scalability of our optimization infrastructure, we predicted GPT-4's final loss on our internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term (as in Henighan et al. [15]), L(C) = aC^b + c, from models trained using the same methodology but using at most 10,000× less compute than GPT-4. This prediction was made shortly after the run started, without use of any partial results. The fitted scaling law predicted GPT-4's final loss with high accuracy (Figure 1).

[Figure 1 (plot omitted): "OpenAI codebase next word prediction"; x-axis: Compute, y-axis: Bits per word; series: Observed, Prediction, gpt-4.]
Figure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived from our internal codebase. This is a convenient, large dataset of code tokens which is not contained in the training set. We chose to look at loss because it tends to be less noisy than other measures across different amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted line; this fit accurately predicts GPT-4's final loss. The x-axis is training compute normalized so that GPT-4 is 1.
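To make the fit concrete, the sketch below fits L(C) = aC^b + c with scipy. Only the functional form comes from the text; the (compute, loss) pairs and the initial guess are hypothetical, since the underlying data points are not published.

    # Illustrative fit of a scaling law with an irreducible loss term.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(C, a, b, c):
        # L(C) = a * C**b + c, with b < 0 so loss falls as compute grows,
        # and c the irreducible loss remaining in the infinite-compute limit.
        return a * np.power(C, b) + c

    # Hypothetical final losses for small runs; compute normalized so GPT-4 = 1.
    compute = np.array([1e-8, 1e-7, 1e-6, 1e-5, 1e-4])
    loss = np.array([5.6, 4.6, 3.8, 3.2, 2.8])

    (a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(1.0, -0.1, 1.0), maxfev=10000)
    # Extrapolate the fitted curve to the full-compute run (C = 1).
    print(f"predicted loss at C = 1: {scaling_law(1.0, a, b, c):.2f} bits per word")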
3.2 Scaling of Capabilities on HumanEval

Having a sense of the capabilities of a model before training can improve decisions around alignment, safety, and deployment. In addition to predicting final loss, we developed methodology to predict more interpretable metrics of capability. One such metric is pass rate on the HumanEval dataset [43], which measures the ability to synthesize Python functions of varying complexity. We successfully predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained with at most 1,000× less compute (Figure 2).

For an individual problem in HumanEval, performance may occasionally worsen with scale. Despite these challenges, we find an approximate power law relationship −E_P[log(pass_rate(C))] = α·C^(−k), where k and α are positive constants, and P is a subset of problems in the dataset. We hypothesize that this relationship holds for all problems in this dataset. In practice, very low pass rates are difficult or impossible to estimate, so we restrict to problems P and models M such that given some large sample budget, every problem is solved at least once by every model.

[Figure 2 (plot omitted): "Capability prediction on 23 coding problems"; x-axis: Compute, y-axis: Mean Log Pass Rate; series: Observed, Prediction, gpt-4.]
Figure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted line; this fit accurately predicts GPT-4's performance. The x-axis is training compute normalized so that GPT-4 is 1.

We registered predictions for GPT-4's performance on HumanEval before training completed, using only information available prior to training. All but the 15 hardest HumanEval problems were split into 6 difficulty buckets based on the performance of smaller models. The results on the 3rd easiest bucket are shown in Figure 2, showing that the resulting predictions were very accurate for this subset of HumanEval problems where we can accurately estimate log(pass_rate) for several smaller models. Predictions on the other five buckets performed almost as well, the main exception being GPT-4 underperforming our predictions on the easiest bucket.
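The metric and fit can be illustrated in a few lines. In the sketch below, only the functional form −E_P[log(pass_rate(C))] = α·C^(−k) comes from the text; the per-model pass rates and compute values are invented for illustration.

    import numpy as np

    def mean_log_pass_rate(pass_rates):
        # -E_P[log(pass_rate)] over a problem subset P chosen so that every
        # problem is solved at least once by every model (pass_rate > 0).
        rates = np.asarray(pass_rates, dtype=float)
        assert np.all(rates > 0), "restrict P so all pass rates are positive"
        return -np.mean(np.log(rates))

    # Hypothetical pass rates on a fixed problem subset for three small models,
    # with training compute normalized so that the target model is 1.
    compute = np.array([1e-3, 1e-2, 1e-1])
    metric = np.array([
        mean_log_pass_rate([0.02, 0.10, 0.05]),
        mean_log_pass_rate([0.10, 0.30, 0.20]),
        mean_log_pass_rate([0.35, 0.60, 0.50]),
    ])

    # Fit -E_P[log(pass_rate(C))] = alpha * C**(-k): linear in log-log space.
    slope, intercept = np.polyfit(np.log(compute), np.log(metric), 1)
    alpha, k = np.exp(intercept), -slope
    print(f"extrapolated metric at C = 1: {alpha:.3f}")  # alpha * 1**(-k) == alpha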
Certain capabilities remain hard to predict. For example, the Inverse Scaling Prize [44] proposed several tasks for which model performance decreases as a function of scale. Similarly to a recent result by Wei et al. [45], we find that GPT-4 reverses this trend, as shown on one of the tasks called Hindsight Neglect [46] in Figure 3.

[Figure 3 (plot omitted): "Inverse scaling prize, hindsight neglect"; x-axis: Model (ada, babbage, curie, gpt-3.5, gpt-4), y-axis: Accuracy (0-100).]
Figure 3. Performance of GPT-4 and smaller models on the Hindsight Neglect task. Accuracy is shown on the y-axis; higher is better. ada, babbage, and curie refer to models available via the OpenAI API [47].

We believe that accurately predicting future capabilities is important for safety. Going forward we plan to refine these methods and register performance predictions across various capabilities before large model training begins, and we hope this becomes a common goal in the field.

4 Capabilities

We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans.⁴ We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.

Exams were sourced from publicly-available materials. Exam questions included both multiple-choice and free-response questions; we designed separate prompts for each format, and images were included in the input for questions which required it. The evaluation setup was designed based on performance on a validation set of exams, and we report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam. We estimate and report the percentile each overall score corresponds to. See Appendix A for further details on the exam evaluation methodology.

³ For the AMC 10 and AMC 12 2022 exams, the human percentiles are not yet published, so the reported numbers are extrapolated and likely have wide uncertainty. See Appendix A.5.
⁴ We used the post-trained RLHF model for these exams.
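The contamination-handling rule above is mechanical enough to state as code. A minimal sketch, with placeholder scores rather than actual exam results:

    def reported_score(full_variant: float, decontaminated_variant: float) -> float:
        # Each exam is run twice: once on the full question set and once with
        # questions seen during training removed; the lower score is reported.
        return min(full_variant, decontaminated_variant)

    # Hypothetical exam scored out of 400:
    print(reported_score(305.0, 298.0))  # -> 298.0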
Exam | GPT-4 | GPT-4 (no vision) | GPT-3.5
Uniform Bar Exam (MBE+MEE+MPT) | 298/400 (90th) | 298/400 (90th) | 213/400 (10th)
LSAT | 163 (88th) | 161 (83rd) | 149 (40th)
SAT Evidence-Based Reading & Writing | 710/800 (93rd) | 710/800 (93rd) | 670/800 (87th)
SAT Math | 700/800 (89th) | 690/800 (89th) | 590/800 (70th)
Graduate Record Examination (GRE) Quantitative | 163/170 (80th) | 157/170 (62nd) | 147/170 (25th)
Graduate Record Examination (GRE) Verbal | 169/170 (99th) | 165/170 (96th) | 154/170 (63rd)
Graduate Record Examination (GRE) Writing | 4/6 (54th) | 4/6 (54th) | 4/6 (54th)
USABO Semifinal Exam 2020 | 87/150 (99th-100th) | 87/150 (99th-100th) | 43/150 (31st-33rd)
USNCO Local Section Exam 2022 | 36/60 | 38/60 | 24/60
Medical Knowledge Self-Assessment Program | 75% | 75% | 53%
Codeforces Rating | 392 (below 5th) | 392 (below 5th) | 260 (below 5th)
AP Art History | 5 (86th-100th) | 5 (86th-100th) | 5 (86th-100th)
AP Biology | 5 (85th-100th) | 5 (85th-100th) | 4 (62nd-85th)
AP Calculus BC | 4 (43rd-59th) | 4 (43rd-59th) | 1 (0th-7th)
AP Chemistry | 4 (71st-88th) | 4 (71st-88th) | 2 (22nd-46th)
AP English Language and Composition | 2 (14th-44th) | 2 (14th-44th) | 2 (14th-44th)
AP English Literature and Composition | 2 (8th-22nd) | 2 (8th-22nd) | 2 (8th-22nd)
AP Environmental Science | 5 (91st-100th) | 5 (91st-100th) | 5 (91st-100th)
AP Macroeconomics | 5 (84th-100th) | 5 (84th-100th) | 2 (33rd-48th)
AP Microeconomics | 5 (82nd-100th) | 4 (60th-82nd) | 4 (60th-82nd)
AP Physics 2 | 4 (66th-84th) | 4 (66th-84th) | 3 (30th-66th)
AP Psychology | 5 (83rd-100th) | 5 (83rd-100th) | 5 (83rd-100th)
AP Statistics | 5 (85th-100th) | 5 (85th-100th) | 3 (40th-63rd)
AP US Government | 5 (88th-100th) | 5 (88th-100th) | 4 (77th-88th)
AP US History | 5 (89th-100th) | 4 (74th-89th) | 4 (74th-89th)
AP World History | 4 (65th-87th) | 4 (65th-87th) | 4 (65th-87th)
AMC 10³ | 30/150 (6th-12th) | 36/150 (10th-19th) | 36/150 (10th-19th)
AMC 12³ | 60/150 (45th-66th) | 48/150 (19th-40th) | 30/150 (4th-8th)
Introductory Sommelier (theory knowledge) | 92% | 92% | 80%
Certified Sommelier (theory knowledge) | 86% | 86% | 58%
Advanced Sommelier (theory knowledge) | 77% | 77% | 46%
Leetcode (easy) | 31/41 | 31/41 | 12/41
Leetcode (medium) | 21/80 | 21/80 | 8/80
Leetcode (hard) | 3/45 | 3/45 | 0/45

Table 1. GPT performance on academic and professional exams. In each case, we simulate the conditions and scoring of the real exam. We report GPT-4's final score graded according to exam-specific rubrics, as well as the percentile of test-takers achieving GPT-4's score.

[Figure 4 (plot omitted): "Exam results (ordered by GPT-3.5 performance)"; x-axis: Exam, from AP Calculus BC (lowest GPT-3.5 performance) to AP Environmental Science (highest); y-axis: Estimated percentile lower bound (among test takers), 0%-100%; series: gpt-4, gpt-4 (no vision), gpt-3.5.]
Figure 4. GPT performance on academic and professional exams. In each case, we simulate the conditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5 performance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the lower end of the range of percentiles, but this creates some artifacts on the AP exams, which have very wide scoring bins. For example, although GPT-4 attains the highest possible score on AP Biology (5/5), this is only shown in the plot as 85th percentile because 15 percent of test-takers achieve a 5.
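The lower-bound artifact described in the caption follows directly from how the percentile is computed. A minimal sketch, with a hypothetical AP score distribution; only the 15% share for a score of 5 comes from the caption.

    def percentile_lower_bound(score, score_shares):
        # Conservative percentile: the fraction of test takers scoring strictly
        # below the model's score. With wide scoring bins, even a top score
        # maps to a modest lower bound.
        return 100.0 * sum(share for s, share in score_shares.items() if s < score)

    # Hypothetical AP Biology distribution; 15% of test takers achieve a 5.
    ap_biology_shares = {1: 0.10, 2: 0.20, 3: 0.25, 4: 0.30, 5: 0.15}
    print(f"{percentile_lower_bound(5, ap_biology_shares):.1f}")  # -> 85.0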