Yazan Agha-Schrader PRO

phi0112358

AI & ML interests

Brain, EEG, BCI, consciousness, autism, octopus, automation, a.i., etymology, numbers, spirituality, astronomy

phi0112358's activity

liked a model 1 day ago

Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • Updated Nov 18 • 376k • 1.36k

liked a model 9 days ago

meta-llama/Meta-Llama-3-8B-Instruct

Text Generation • Updated Sep 27 • 1.78M • 3.71k

liked 2 models 27 days ago

unsloth/QwQ-32B-Preview-GGUF

Text Generation • Updated 27 days ago • 1.67k • 7

nanowell/QwQ-32B-Preview-Q4_K_M-GGUF

Text Generation • Updated 27 days ago • 3.11k • 6

liked 2 models 29 days ago

Mozilla/whisperfile

Updated Oct 2 • 737 • 238

Mozilla/gemma-2-27b-it-llamafile

Updated Oct 31 • 2.4k • 29

liked a Space about 2 months ago

Edge TTS

Microsoft Edge's Text To Speech

liked a model 2 months ago

nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

Text Generation • Updated Oct 25 • 169k • 1.93k

upvoted a collection 2 months ago

Qwen2.5

Collection

Qwen2.5 language models, including pretrained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 45 items • Updated 27 days ago • 444

liked a model 2 months ago

Qwen/Qwen2.5-72B-Instruct

Text Generation • Updated Sep 25 • 229k • 628

liked a dataset 2 months ago

Thorsten-Voice/TV-44kHz-Full

Viewer • Updated Oct 20 • 78.5k • 196 • 6

reacted to m-ric's post with 👍 3 months ago

📜 𝐎𝐥𝐝-𝐬𝐜𝐡𝐨𝐨𝐥 𝐑𝐍𝐍𝐬 𝐜𝐚𝐧 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐫𝐢𝐯𝐚𝐥 𝐟𝐚𝐧𝐜𝐲 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬!

Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
❶ Removed dependencies on previous hidden states in the gates
❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)
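
To make changes ❶ and ❷ concrete, here is a minimal sketch of a single minGRU step in plain Python. It assumes the update commonly quoted from the paper: the gate is a sigmoid of a linear map of the input only, the candidate state is a linear map of the input with no tanh, and the new state is a convex mix of the two. The helper names (`sigmoid`, `matvec`, `min_gru_step`) and the toy weights are made up for illustration and are not the authors' code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def min_gru_step(x_t, h_prev, W_z, W_h):
    """One minGRU step: gate and candidate depend on x_t only."""
    z = [sigmoid(v) for v in matvec(W_z, x_t)]  # update gate, no h_prev inside
    h_tilde = matvec(W_h, x_t)                  # candidate state, no tanh
    # Convex combination of the old state and the candidate.
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

# Toy weights (hypothetical values): 2 inputs, 2 hidden units.
W_z = [[0.0, 0.0], [0.0, 0.0]]  # zero weights, so the gate is exactly 0.5
W_h = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the candidate equals the input
h = min_gru_step([2.0, 4.0], [0.0, 0.0], W_z, W_h)
print(h)  # [1.0, 2.0]
```

Because `z` and `h_tilde` never look at `h_prev`, they can be computed for every time step up front; only the final mixing line is sequential.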

⚡️ As a result, these new, minimal RNNs can be trained in parallel with a “parallel scan” algorithm: this takes 88% more memory, but makes them 200x faster than their traditional counterparts for long sequences
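
For intuition on why a scan applies here: once the gates no longer depend on the hidden state, the recurrence has the linear form h_t = a_t * h_{t-1} + b_t, and composing two such steps is associative. The sketch below is a generic Hillis-Steele style scan simulated in plain Python, checked against the ordinary loop. It is an illustration under that assumption, not the paper's implementation, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps of the form h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def scan_recurrence(a, b, h0=0.0):
    """Inclusive scan over (a_t, b_t) pairs in about log2(n) passes."""
    elems = list(zip(a, b))
    n = len(elems)
    offset = 1
    while offset < n:
        # Each position reads only values from the previous pass, so on
        # parallel hardware a whole pass could be computed at once.
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(n)
        ]
        offset *= 2
    # elems[t] now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]

def sequential_recurrence(a, b, h0=0.0):
    """Reference implementation: the usual step-by-step loop."""
    out, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

a = [0.5, 0.5, 0.5, 0.5, 0.5]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
scan_out = scan_recurrence(a, b)
loop_out = sequential_recurrence(a, b)
print(scan_out == loop_out)  # True
```

The scan does more total work and keeps more intermediate values than the loop, which is consistent with the memory-for-speed trade-off described above.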

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! 🚀

🤔 Why does this matter?

By showing that simpler models can perform on par with transformers, this paper challenges the narrative that we need advanced architectures for better performance!

💬 François Chollet wrote in a tweet about this paper:

“The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)”

“Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape.”

It’s the Bitter Lesson by Rich Sutton striking again: you don’t need fancy architectures, just scale up your model and data!

Read the paper 👉 Were RNNs All We Needed? (2410.01201)