Transformers: Attention, Architecture, Training, and Scaling