Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time. Weichen Wang 1, Jiequn Han 2, Zhuoran Yang 3, Zhaoran Wang 4. Abstract: more realistic real-world problems, such as robotic control (...
FOP: Factorizing Optimal Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning. Tianhao Zhang 1, Yueheng Li 1, Chen Wang 1, Guangming Xie 1, Zongqing Lu 1. Abstract: value-based and actor-critic MARL methods, where global in...
Decoupling Value and Policy for Generalization in Reinforcement Learning. Roberta Raileanu 1, Rob Fergus 1. Abstract: ization (Farebrother et al., 2018; Zhang et al., 2018a; Cobbe et al., 2018; Igl et al., 2019), data augmentation (Cobb...
Characterizing the Gap Between Actor-Critic and Policy Gradient. Junfeng Wen 1, Saurabh Kumar 2, Ramki Gummadi 3, Dale Schuurmans 1,3. Abstract: on a range of challenging tasks. Despite the success of AC methods, AC and PG have subtle differe...
Average-Reward Off-Policy Policy Evaluation with Function Approximation. Shangtong Zhang 1, Yi Wan 2, Richard S. Sutton 2, Shimon Whiteson 1. Abstract: which aim to generate a policy that maximizes the reward rate by iteratively improvin...
Adversarial Policy Learning in Two-player Competitive Games. Wenbo Guo 1, Xian Wu 1, Sui Huang 2, Xinyu Xing 1. Abstract: 2020), we argue that attacks developed under this assumption are not practical. For example, given a master agent In at...
Adaptive Sampling for Best Policy Identification in Markov Decision Processes. Aymen Al Marjani 1, Alexandre Proutiere 2. Abstract: certainty. This paper, as most related work in this field, focuses on systems and control objectives t...
A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning. Dong-Ki Kim 1,2, Miao Liu 2,3, Matthew Riemer 2,3, Chuangchuang Sun 1,2, Marwa Abdulhai 1,2, Golnaz Habibi 1,2, Sebastian Lopez-Cot 1,2, Gerald Tesauro 2,3, Jonat...
Taylor Expansion Policy Optimization. Yunhao Tang 1, Michal Valko 2, Rémi Munos 2. Abstract: gorithmic ideas have contributed significantly to stabilizing policy optimization. In this work, we investigate the application of Taylor e...
Structured Policy Iteration for Linear Quadratic Regulator. Youngsuk Park 1, Ryan A. Rossi 2, Zheng Wen 3, Gang Wu 2, Handong Zhao 2. Abstract: son & Moore, 2007) spanning several decades. Linear quadratic regulator (LQR) is one of the This sto...
Statistically Efficient Off-Policy Policy Gradients. Nathan Kallus 1, Masatoshi Uehara 2. Abstract: Table 1. Comparison of off-policy policy gradient estimators. Here, f = Θ(g) means 0 < lim inf f/g ≤ lim sup f/g < ∞ (not to policy gradi...
Ready Policy One: World Building Through Active Learning. Philip J. Ball 1, Jack Parker-Holder 1, Aldo Pacchiano 2, Krzysztof Choromanski 3, Stephen Roberts 1. Abstract: environment) that can be leveraged across many different tasks (tran...
Provably Efficient Model-based Policy Adaptation. Yuda Song 1, Aditi Mavalankar 1, Wen Sun 2, Sicun Gao 1. Abstract: Mordatch et al., 2015), or meta-learn policies or models that can be quickly adapted to in-distribution environments (Fin...
Provably Efficient Exploration in Policy Optimization. Qi Cai 1, Zhuoran Yang 2, Chi Jin 3, Zhaoran Wang 1. Abstract: of iterations, even given infinite data. Meanwhile, from the statistical perspective, it remains unclear how to attain Wh...
Policy Teaching via Environment Poisoning: Training-time Adversarial Attacks against Reinforcement Learning. Amin Rakhsha 1, Goran Radanovic 1, Rati Devidze 1, Xiaojin Zhu 2, Adish Singla 1. Abstract: cisions, poisoning attacks manipu...
Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination. Shauharda Khadka 1, Somdeb Majumdar 1, Santiago Miret 1, Stephen McAleer 2, Kagan Tumer 3. Abstract: toward maximizing a global objective. Cooperative...
Optimistic Policy Optimization with Bandit Feedback. Yonathan Efroni 1, Lior Shani 1, Aviv Rosenberg 2, Shie Mannor 1. Abstract: Due to their popularity, there is a rich literature that provides different types of theoretical guarantees...
Neural Network Control Policy Verification with Persistent Adversarial Perturbations. Yuh-Shyang Wang 1, Tsui-Wei Weng 2, Luca Daniel 2. Abstract: neural networks are surprisingly vulnerable to adversarial examples and attacks (Hua...
Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs. Aditya Rajagopal 1, Diederik Adriaan Vink 1, Stylianos I. Venieris 2, Christos-Savvas Bouganis 1. Abstra...
Monte-Carlo tree search as regularized policy optimization. Jean-Bastien Grill 1, Florent Altché 1, Yunhao Tang 1,2, Thomas Hubert 3, Michal Valko 1, Ioannis Antonoglou 3, Rémi Munos 1. Abstract: AlphaZero employs an alternative handcra...