Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer ¹  Mitchell Stern ¹ ²

Abstract

vector summarizing the history of squared gradients, usually obtained through summation as in Adagrad (Duchi et al.,

In severa...