Synthetically corrupting data with a taxonomy of OCR errors lets you train LLMs to fix real-world OCR mistakes and dramatically improve document understanding.
Multilingual retrievers often prioritize irrelevant English documents over relevant foreign-language documents, even when the query is in that foreign language.
Low-resource languages can get a 15% boost in cross-lingual retrieval accuracy by using English as a Rosetta Stone during training.
Forget just mining hard negatives: the secret to better knowledge distillation for retrieval lies in matching the *entire* score distribution of your teacher model.
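As a rough illustration of what "matching the entire score distribution" can mean in the last item: one common formulation (assumed here, since the summary does not state the paper's actual loss) turns the teacher's and student's relevance scores over the same candidate list into softmax distributions and minimizes the KL divergence between them. The function names, the `temperature` parameter, and the example scores below are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Convert raw relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate list.

    Every candidate contributes, not only the top hit or the hard
    negatives, so the student is pushed toward the teacher's full ranking
    distribution. This is an assumed loss, not necessarily the paper's.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical scores for four candidate documents of a single query.
teacher = [9.1, 4.2, 3.8, 0.5]
student = [2.0, 1.5, 1.7, 0.1]
loss = kl_distillation_loss(teacher, student)
```

The loss is zero when the student's softmax distribution equals the teacher's, and it is unaffected by adding a constant to all of the student's scores, so only the relative shape of the score list matters.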