Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees. Kishan Panaganti¹, Dileep Kalathil¹. Abstract: This mismatch between the training and testing environment parameters can si...
Robust Policy Gradient against Strong Data Corruption. Xuezhou Zhang¹, Yiding Chen¹, Jerry Zhu¹, Wen Sun². Abstract: highly noisy data, such as autonomous driving, quantitative trading, or medical diagnosis. We study the problem of robu...
Re-understanding Finite-State Representations of Recurrent Policy Networks. Mohamad H. Danesh¹, Anurag Koul¹, Alan Fern¹, Saeed Khorram¹. Abstract: tive human interpretation of the underlying “strategic role” of the attended-to el...
Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions. Shuang Qiu¹, Xiaohan Wei², Jieping Ye¹, Zhaoran Wang³, Zhuoran Yang⁴. Abstract: understanding of multi-agent Policy optimi...
Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods. Chris Nota¹, Bruno Castro da Silva¹, Philip S. Thomas¹. Abstract: cases, such information can be useful for assessing which outcomes were likely to have occurr...
Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning. Hiroki Furuta¹, Tatsuya Matsushima¹, Tadashi Kozuno², Yutaka Matsuo¹, Sergey Levine³, Ofir Nachum³, Shixiang Shane Gu³. A...
Policy Caches with Successor Features. Mark Nemecek¹, Ronald Parr¹. Abstract: tasks which vary in their reward functions, but where the dynamics remain the same. Although limited in scope, this Transfer in reinforcement learning is bas...
Policy Gradient Bayesian Robust Optimization for Imitation Learning. Zaynah Javed¹, Daniel S. Brown¹, Satvik Sharma¹, Jerry Zhu¹, Ashwin Balakrishna¹, Marek Petrik², Anca D. Dragan¹, Ken Goldberg¹. Abstract: human-designed reward functi...
Policy Analysis using Synthetic Controls in Continuous-Time. Alexis Bellot¹ ², Mihaela van der Schaar¹ ² ³. Abstract: or average in a neighbourhood of controls) often provides a more informative comparison for treatment effect estimat...
PODS: Policy Optimization via Differentiable Simulation. Miguel Zamora¹, Momchil Peychev¹, Sehoon Ha², Martin Vechev¹, Stelian Coros¹. Abstract: potentially unsafe. Fortunately, recent years have seen exciting progress in simulati...
Phasic Policy Gradient. Karl Cobbe¹, Jacob Hilton¹, Oleg Klimov¹, John Schulman¹. Abstract: can be used to better optimize the other. We introduce Phasic Policy Gradient (PPG), a re- However, there are also disadvantages to sharing networ...
PC-MLP: Model-based Reinforcement Learning with Policy Cover Guided Exploration. Yuda Song¹, Wen Sun². Abstract: Model-based Reinforcement Learning (RL) is a popular learning parad...
OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. Jongmin Lee¹, Wonseok Jeon² ³, Byung-Jun Lee⁴, Joelle Pineau² ³ ⁵, Kee-Eung Kim¹ ⁶. Abstract: and then to deploy the model with its parameter fixed w...
Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with √T Regret. Asaf Cassel¹, Tomer Koren¹ ². Abstract: Model-based methods, which perform planning based on a system identification procedure that estima...
On the Optimality of Batch Policy Optimization Algorithms. Chenjun Xiao¹ ², Yifan Wu³, Tor Lattimore⁴, Bo Dai², Jincheng Mei¹ ², Lihong Li† ⁵, Csaba Szepesvari¹ ⁴, Dale Schuurmans¹ ². Abstract: a fixed dataset of previously collected experie...
On Proximal Policy Optimization’s Heavy-tailed Gradients. Saurabh Garg¹, Joshua Zhanson², Emilio Parisotto¹, Adarsh Prasad¹, J. Zico Kolter², Zachary C. Lipton¹, Sivaraman Balakrishnan³, Ruslan Salakhutdinov¹, Pradeep Ravikumar¹. Ab...
Muesli: Combining Improvements in Policy Optimization. Matteo Hessel¹, Ivo Danihelka¹ ², Fabio Viola¹, Arthur Guez¹, Simon Schmitt¹, Laurent Sifre¹, Theophane Weber¹, David Silver¹ ², Hado van Hasselt¹. Abstract: Median human-normalized s...
Monotonic Robust Policy Optimization with Model Discrepancy. Yuankun Jiang¹, Chenglin Li², Wenrui Dai¹, Junni Zou¹, Hongkai Xiong². Abstract: control tasks, e.g., playing computer games with human-level performance (Mnih et al., 2013;...
Model-Free and Model-Based Policy Evaluation when Causality is Uncertain. David Bruns-Smith¹. Abstract: unobserved shocks are often assumed to be drawn iid every period. Consider the Federal Reserve Board adjusting When decision-...
Guided Exploration with Proximal Policy Optimization using a Single Demonstration. Gabriele Libardi¹, Sebastian Dittert¹, Gianni De Fabritiis¹ ². Abstract: Learning from demonstrations allows to directly bypass this problem but it o...