Learning Notebook - David Rostcheck
learning_event details
Learning Event ID
Subject
Topic
Program
Length
Institution
Presenter
Format
Recorded Date
Completed Date
Notes
Using window attention with attention sink tokens allows pretrained chat-style LLMs (e.g. Llama, Mistral, MPT, Falcon, and GPT-NeoX/Pythia models) to stay fluent across hundreds of subsequent prompts, unlike the same models loaded with plain transformers. It also keeps memory usage constant, whereas most LLMs loaded with transformers have linear space complexity and eventually run into memory issues. Using this form of attention is as simple as importing your model class from attention_sinks rather than transformers:

from attention_sinks import AutoModel

model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", device_map="auto")
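Rough multi-prompt sketch (my own, not from the note): it assumes attention_sinks also exposes AutoModelForCausalLM as a drop-in replacement and accepts attention_sink_size / attention_sink_window_size keyword arguments; everything beyond the import swap above is an assumption.

# Hedged sketch, not taken from the note: assumed attention_sinks class
# and keyword arguments, standard transformers tokenizer and generate().
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM  # assumed drop-in class

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # assumed kwargs: number of "sink" tokens pinned at the start of the
    # KV cache, plus the sliding-window size used for everything else
    attention_sink_size=4,
    attention_sink_window_size=1020,
)

# Feed many prompts in sequence; the sink + window cache is what keeps
# memory constant instead of growing with each new prompt.
prompts = ["What are attention sinks?", "Summarize that in one sentence."]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))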
Personal Notes
Link
Review