Disentangling visual and written concepts in CLIP