State Relevance for Off-Policy Evaluation Simon P. Shen 1, Yecheng Jason Ma 2, Omer Gottesman 3, Finale Doshi-Velez 1 Abstract important as many domains have trajectories with different lengths: in health settings, patients’ length o...
Optimal Off-Policy Evaluation from Multiple Logging Policies Nathan Kallus 1, Yuta Saito 1, Masatoshi Uehara 1 Abstract In most of the above studies, the observations used to evaluate a new policy are assumed generated by a single logg...
Off-Policy Confidence Sequences Nikos Karampatziakis 1, Paul Mineiro 2, Aaditya Ramdas 3 Abstract that the probability that they ever exclude the true value is bounded by a prespecified quantity. In other words, they We develop confid...
Learning Routines for Effective Off-Policy Reinforcement Learning Edoardo Cetin 1, Oya Celiktutan 1 Abstract engineering and are often quite influential on the performance (Mahmood et al., 2018). Algorithms that learn also The pe...
Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm Sajad Khodadadian ∗1, Zaiwei Chen ∗2, Siva Theja Maguluri 1 Abstract An AC algorithm can be thought as a generalized policy iteration (Puterman, 1995), and cons...
Deeply-Debiased Off-Policy Interval Estimation Chengchun Shi 1, Runzhe Wan 2, Victor Chernozhukov 3, Rui Song 2 Abstract value, it is crucial to construct a confidence interval (CI) that quantifies the uncertainty of the value estimat...
Data-efficient Hindsight Off-Policy Option Learning Markus Wulfmeier 1, Dushyant Rao 1, Roland Hafner 1, Thomas Lampe 1, Abbas Abdolmaleki 1, Tim Hertweck 1, Michael Neunert 1, Dhruva Tirumala 1, Noah Siegel 1, Nicolas Heess 1, Martin Riedmill...
Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality Tengyu Xu 1, Zhuoran Yang 2, Zhaoran Wang 3, Yingbin Liang 1 Abstract (Haarnoja et al., 2018), etc. However, these successes usually rely on the access to on-policy sam...
Average-Reward Off-Policy Policy Evaluation with Function Approximation Shangtong Zhang 1, Yi Wan 2, Richard S. Sutton 2, Shimon Whiteson 1 Abstract which aim to generate a policy that maximizes the reward rate by iteratively improvin...
Bootstrapping Fitted Q-Evaluation for Off-Policy Inference Botao Hao 1, Xiang Ji 2, Yaqi Duan 2, Hao Lu 2, Csaba Szepesvári 1 3, Mengdi Wang 1 2 Abstract et al., 2013; Munos & Szepesvári, 2008; Le et al., 2019). In practice, FQE has demonst...
Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling Yao Liu 1, Pierre-Luc Bacon 2, Emma Brunskill 1 Abstract increasing interest in developing accurate and efficient algorithms for off-...
Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling Che Wang 1 2, Yanqiu Wu 1 2, Quan Vuong 3, Keith Ross 1 2 Abstract (Lillicrap et al., 2015; Fujimoto et al., 2018). TD3, which introdu...
Statistically Efficient Off-Policy Policy Gradients Nathan Kallus 1, Masatoshi Uehara 2 Abstract Table 1. Comparison of Off-Policy policy gradient estimators. Here, f = Θ(g) means 0 < lim inf f/g ≤ lim sup f/g < ∞ (not to Policy gradi...
Representations for Stable Off-Policy Reinforcement Learning Dibya Ghosh 1, Marc Bellemare 1 Abstract 1995; Tsitsiklis & Roy, 1996). Despite this potential for failure, Q-learning and other temporal-difference algorithms Reinf...
Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation Shangtong Zhang 1, Bo Liu 2, Hengshuai Yao 3, Shimon Whiteson 1 Abstract a two-timescale convergent analysis under function approximation (Ko...
Off-Policy Actor-Critic with Shared Experience Replay Simon Schmitt 1, Matteo Hessel 1, Karen Simonyan 1 Abstract Table 1. Comparison of model-free state-of-the-art agents on 57 Atari games in the standard regime: Here no experience...
Minimax Weight and Q-Function Learning for Off-Policy Evaluation Masatoshi Uehara 1, Jiawei Huang 2, Nan Jiang 2 Abstract from the community (Liu et al., 2018; Xie et al., 2019), as they overcome the curse of horizon with relatively mild...
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation Yaqi Duan 1, Zeyu Jia 2, Mengdi Wang 3 4 Abstract value) to be earned by a new policy based on logged history. This paper studies the statistical theory of off-Int...
Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions Omer Gottesman 1, Joseph Futoma 1, Yao Liu 2, Sonali Parbhoo 1, Leo Anthony Celi 3, Emma Brunskill 2, Finale Doshi-Velez 1 Abstract an...
Doubly robust Off-Policy evaluation with shrinkage Yi Su 1, Maria Dimakopoulou 2, Akshay Krishnamurthy 3, Miroslav Dudík 3 Abstract subroutines for optimizing a policy (Dudík et al., 2011). We propose a new framework for designi...