ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Wonjae Kim 1†, Bokyung Son 1, Ildoo Kim 2. Abstract: Vision-and-Language Pre-training (VLP) has im- [Figure: Visual Embedding Schema; labels: Region Feature, Image, CNN, R...]
Unifying Vision-and-Language Tasks via Text Generation. Jaemin Cho 1, Jie Lei, Hao Tan, Mohit Bansal. UNC Chapel Hill. {jmincho,jielei,haotan,mbansal}@cs.unc.edu. Abstract: Existing methods for Vision-and-Language learning typi...