unified embedding space | text+vision convergence