What LLMs, image models, and agents really do — what they don't — and how to read past the hype.
Hosted by Bob (Jue) Guo — CS PhD candidate & adjunct faculty, University at Buffalo
There's a huge gap between what AI can do, what people say it does, and what's actually useful to you. Office Hours is where we close that gap — calmly, with real examples, and your questions.
Relaxed but structured. I walk through a mental model; you interrupt with questions whenever.
Beginners leave with a real model. If you already build with these tools, you'll get sharper angles.
No prerequisites. No slides to memorize. We close with open Q&A — so bring a question.
One ask: when something doesn't land, say so. The confusing parts are the whole point.
Office Hours runs as a series — each session stands on its own, so you can drop into any one. Here's the arc.
The model family tree, LLMs, image models, agents — and how to read past the hype.
A focused deep dive into one area — shaped by the questions you bring today.
Hands-on building, prompting, agents in practice — and whatever you keep asking about.
Today is the foundation — the shared mental model the rest of the series builds on.
A 70-year-old idea that, for most of those years, didn't work.
Three researchers kept betting on neural networks through decades when almost no one else did. They shared the 2018 Turing Award for it.
Geoffrey Hinton: Championed backpropagation, the way nets learn from mistakes. Mentored the team behind the 2012 breakthrough. Nobel Prize in Physics, 2024.
Yann LeCun: Built convolutional nets that could read handwriting in the '90s, the foundation of modern computer vision. Later became chief AI scientist at Meta.
Yoshua Bengio: Pushed deep learning for language & sequences and early attention ideas, laying groundwork for today's LLMs.
One artificial neuron. Big hype → a decades-long "AI winter."
Nets can finally learn many layers — but still too slow to matter.
Crushes the ImageNet contest. Big data + GPUs make deep learning suddenly work.
"Attention Is All You Need." Google's architecture — the engine of everything since.
More data + compute → emergent abilities. ChatGPT takes AI mainstream overnight.
Multimodal models & agents — what we'll unpack today.
Those six were the turning points. Zoom out and the full landscape looks like this — four eras, every model that mattered. Skim it now; then we'll rewind and walk the key ideas in depth.
The mathematical building blocks of neural networks.
Perceptron: Rosenblatt's single artificial neuron. Learns to separate linearly separable classes.
Perceptrons (book): Minsky & Papert prove the XOR limit. Helps trigger the first AI winter.
Neocognitron: Fukushima's hierarchical visual model, the direct ancestor of CNNs.
Backpropagation: Rumelhart, Hinton & Williams. Multi-layer training finally tractable.
LeNet: LeCun's first CNN. Handwritten digit recognition for the US Postal Service.
LSTM: Hochreiter & Schmidhuber. Gated cells ease the vanishing-gradient problem.
LeNet-5: Refined CNN evaluated on MNIST, the benchmark that defined the era.
Deep belief nets: Hinton's layer-wise pretraining revives the field and popularizes the term "deep learning."
Data + GPUs + depth, in that order.
ImageNet: Fei-Fei Li's 14M-image, 21k-category dataset. The benchmark everyone had to beat.
AlexNet: Krizhevsky, Sutskever & Hinton. Deep CNN on GPUs nearly halves ImageNet error.
Word2Vec: Mikolov. Dense word vectors that capture meaning algebraically.
GANs: Goodfellow. Two networks playing a min-max game. The generative unlock.
VGG and GoogLeNet: Oxford and Google. Much deeper nets; Inception modules; ensemble tricks.
Seq2Seq: Sutskever. Encoder-decoder framework. Makes neural translation work.
ResNet: He et al. Skip connections let networks scale to 152 layers without degrading.
Batch normalization: Ioffe & Szegedy. Normalizes activations layer by layer. Becomes a default.
AlphaGo: DeepMind. Deep RL + Monte Carlo tree search beats Lee Sedol 4–1.
One architecture starts eating every domain.
Transformer: Vaswani et al. Attention replaces recurrence. The new dominant architecture.
BERT: Devlin et al. Bidirectional pretraining → top scores across NLP tasks.
GPT-1: OpenAI. Decoder-only autoregressive pretraining on books.
GPT-2: OpenAI. 1.5B params; generation quality high enough that OpenAI delayed the full release.
GPT-3: OpenAI. 175B params. In-context learning, no fine-tuning required.
Vision Transformer (ViT): Dosovitskiy et al. Split an image into patches, feed them to a transformer. Works.
DDPM: Ho et al. Image generation by learning to reverse a noise process.
CLIP: OpenAI. Contrastive image-text alignment; one model, many vision tasks.
DALL·E: OpenAI. Text-to-image generation via transformer + VAE-style decoder.
AlphaFold 2: DeepMind. Protein structure prediction at near-experimental accuracy.
AI leaves the lab. Chat, image, video, voice, code, agents.
ChatGPT: OpenAI. RLHF + chat interface make LLMs broadly usable. An estimated 100M users in about 2 months.
Stable Diffusion: Stability AI. Open-weight text-to-image. Spawns an ecosystem overnight.
GPT-4: OpenAI. Multimodal frontier model. Bar exam, code, vision, the works.
Llama: Meta. Open-weight LLMs that anyone can run, fine-tune, and build on.
Claude: Anthropic. Constitutional AI; safety-focused frontier model.
Sora: OpenAI. Coherent minute-long video from text. The video-model moment.
GPT-4o: OpenAI. Single model handling voice, vision, and text in real time.
o1: OpenAI. Chain-of-thought at inference time; the reasoning-model unlock.
DeepSeek-R1: DeepSeek. Open-weight reasoning model reportedly trained at a fraction of frontier cost.
Agents: Tool-using LLMs running multi-step tasks autonomously. The agent era begins.
That's the whole landscape. Now let's rewind to 1958 and watch how a handful of these ideas actually work.
Frank Rosenblatt's bet: build a machine that learns from examples instead of following hand-coded rules. One simple unit, loosely inspired by a brain cell.
Inputs: the features it sees
Weights: each input has a weight
Threshold: over the line → 1
Output: its decision
learning rule: nudge the weights whenever it's wrong
That one idea — adjustable weights tuned by error — is still the beating heart of every neural network in 2026.
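The learning rule on this slide can be sketched in a few lines. This is a minimal illustration, not Rosenblatt's original setup: the learning rate, the epoch count, the zero initialization, and the choice of logical OR as training data are all assumptions made here.

```python
def predict(weights, bias, x):
    """Weighted sum of the inputs; output 1 if it is over the line, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, lr=0.1, epochs=20):
    """Perceptron learning rule: nudge the weights whenever the guess is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # 0 when the guess is right
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Logical OR can be split by one straight line, so the rule settles on it.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(OR_DATA)
or_outputs = [predict(weights, bias, x) for x, _ in OR_DATA]
print(or_outputs)  # [0, 1, 1, 1]
```

Note that nothing changes when the guess is correct: the error is 0, so the nudge is 0.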
The press called it an electronic brain that would soon walk, talk, see, and reproduce itself. Funding poured in.
Minsky & Papert proved a single perceptron can only draw a straight line — it can't even learn XOR. Optimism collapsed.
Keep this cycle in mind — we'll use it as a lens on today's hype in Part 06.
XOR ("exclusive or") outputs 1 only when the two inputs differ — a tiny logic function:
| x₁ | x₂ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
A perceptron's decision is one line: \(\mathbf{w}\cdot\mathbf{x} + b = 0\). Try any line on the plot → at least one dot always lands on the wrong side.
[Plot: the four XOR points, marked by output (1 or 0); the dashed line is one failed attempt at separating them.]
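A short argument for why no single line can work. It assumes the perceptron outputs 1 exactly when \(w_1 x_1 + w_2 x_2 + b > 0\); the strict inequality is a convention chosen here, not something the slide fixes. The four rows of the XOR table then demand:

\[
\begin{aligned}
(0,0)\mapsto 0 &: \quad b \le 0 \\
(0,1)\mapsto 1 &: \quad w_2 + b > 0 \\
(1,0)\mapsto 1 &: \quad w_1 + b > 0 \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}
\]

Adding the two middle lines gives \(w_1 + w_2 + 2b > 0\), so \(w_1 + w_2 + b > -b \ge 0\). That contradicts the last line, so no choice of \(w_1, w_2, b\) satisfies all four.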
The fix: combine two lines and fold the space — exactly what adding a hidden layer does. That's the next slide.
One neuron draws one line. Add a hidden layer with enough neurons and the network can carve space into almost any shape, XOR included.
Sounds abstract — so let's build XOR by hand with exactly two hidden neurons and see it work. Next slide.
Give the net two hidden neurons. Each draws its own line; the output neuron combines them. XOR = "at least one input on, but not both."
| x₁ | x₂ | h₁ | h₂ | y |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 |
two lines carve a stripe holding exactly the two "output 1" points
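The table above can be reproduced with three threshold units. The specific weights and thresholds below are one valid hand-picked choice assumed for illustration; many others work equally well.

```python
def step(z):
    """Threshold unit: fires (1) when its weighted sum is positive."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Two hidden neurons draw two lines; the output neuron combines them."""
    h1 = step(x1 + x2 - 0.5)  # line 1: at least one input on (OR)
    h2 = step(x1 + x2 - 1.5)  # line 2: both inputs on (AND)
    y = step(h1 - h2 - 0.5)   # output: h1 on, but not h2
    return h1, h2, y


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, *xor_net(x1, x2))
```

Each printed row is x₁, x₂, h₁, h₂, y, in the same order as the table.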
Here we chose the weights by hand. For real networks nobody can, so how does a network find them itself? That's backprop.
Rumelhart, Hinton & Williams gave the recipe still used today — repeated millions of times:
forward pass
how wrong?
push error backward
a tiny step better
millions of times
Under the hood it's just the chain rule from calculus, at scale. The math was right — but with little data and slow computers, it still didn't take off. The world wasn't ready yet.
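The loop above (forward pass, measure the error, push it backward, take a small step) can be written out for a tiny network. This is a sketch under assumptions made here: a 2-2-1 sigmoid network, squared-error loss, XOR as data, a fixed random seed, and a learning rate of 0.1. With only two hidden neurons, training on XOR can stall, so the point is the mechanics, not the final accuracy.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass through a tiny 2-2-1 network."""
    W1, b1, W2, b2 = params
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    y_hat = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, y_hat


def loss(params, data):
    """How wrong? Mean of 0.5 * (prediction - target)^2."""
    return sum(0.5 * (forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def gradients(params, data):
    """Push the error backward with the chain rule; returns dLoss/dParam."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0, 0.0], [0.0, 0.0]]
    gb1 = [0.0, 0.0]
    gW2 = [0.0, 0.0]
    gb2 = 0.0
    n = len(data)
    for x, y in data:
        h, y_hat = forward(params, x)
        # Blame at the output: error times the sigmoid's slope.
        delta_out = (y_hat - y) * y_hat * (1.0 - y_hat) / n
        gb2 += delta_out
        for j in range(2):
            gW2[j] += delta_out * h[j]
            # Blame passed back through W2 to hidden neuron j.
            delta_h = delta_out * W2[j] * h[j] * (1.0 - h[j])
            gb1[j] += delta_h
            for i in range(2):
                gW1[j][i] += delta_h * x[i]
    return gW1, gb1, gW2, gb2


def update(params, grads, lr):
    """A tiny step downhill for every weight and bias."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return (
        [[W1[j][i] - lr * gW1[j][i] for i in range(2)] for j in range(2)],
        [b1[j] - lr * gb1[j] for j in range(2)],
        [W2[j] - lr * gW2[j] for j in range(2)],
        b2 - lr * gb2,
    )


XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
rng = random.Random(0)  # fixed seed so runs are repeatable
initial_params = (
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    [rng.uniform(-1, 1) for _ in range(2)],
    [rng.uniform(-1, 1) for _ in range(2)],
    rng.uniform(-1, 1),
)

params = initial_params
for _ in range(1000):  # real training repeats this millions of times
    params = update(params, gradients(params, XOR_DATA), lr=0.1)

print("loss before:", loss(initial_params, XOR_DATA))
print("loss after: ", loss(params, XOR_DATA))
```

A useful sanity check on any hand-written backward pass is to compare its gradients against a finite-difference estimate of the loss.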
Both rules "tune by error." The difference is how the error reaches each weight — and that's the whole leap.
Perceptron rule: the error \(y - \hat{y}\) is read straight off the output, and the output neuron's weights are the only weights. One layer → blame is trivial. \(\eta\) = learning rate.
Backprop: a hidden neuron has no target of its own. The chain rule propagates the blame \(\delta\) backward to reach it.
Same spirit — step downhill along the error. Backprop's trick is credit assignment: computing ∂L/∂w for weights with no target of their own.
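Written out, the two rules look like this. The notation is standard textbook notation assumed here, not taken from the slides: \(x_i\) is an input, \(a_i\) the activation feeding weight \(w_{ij}\), \(z_j\) a neuron's weighted sum, \(\sigma\) its activation function, \(\delta_j = \partial L / \partial z_j\) its share of the blame, and \(k\) runs over the neurons in the next layer.

\[
\text{Perceptron:} \qquad w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i
\]

\[
\text{Backprop:} \qquad \delta_j = \sigma'(z_j) \sum_k w_{jk}\,\delta_k, \qquad \frac{\partial L}{\partial w_{ij}} = a_i\,\delta_j, \qquad w_{ij} \leftarrow w_{ij} - \eta\,\frac{\partial L}{\partial w_{ij}}
\]

For a squared-error loss \(L = \tfrac{1}{2}(\hat{y} - y)^2\), the output neuron's blame is \(\delta = (\hat{y} - y)\,\sigma'(z)\), so subtracting \(\eta\, a_i\, \delta\) moves the weight in the same direction as the perceptron's \(+\,\eta\,(y - \hat{y})\,x_i\).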
A plain network treats every pixel as unrelated. Yann LeCun asked a sharper question: in an image, the same shape can appear anywhere — so why not teach the net to scan for features instead of memorizing fixed positions?
Slide small filters across the image to detect edges, then corners, then shapes, reusing the same weights everywhere. Far fewer parameters, and built-in tolerance to where a feature appears.
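A minimal sketch of the sliding-filter idea, assuming a toy 4×5 image and a hand-written 2×2 edge filter (a real CNN learns its filter values during training). Strictly speaking this computes cross-correlation, which is what deep learning libraries usually call convolution.

```python
def slide_filter(image, kernel):
    """Slide a small filter over the image, reusing the same weights at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[a][b] * image[r + a][c + b]
                for a in range(kh)
                for b in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 4x5 image: dark (0) on the left, bright (1) on the right.
IMAGE = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# A 2x2 filter that responds to a dark-to-bright vertical edge.
EDGE_FILTER = [
    [-1, 1],
    [-1, 1],
]
feature_map = slide_filter(IMAGE, EDGE_FILTER)
for row in feature_map:
    print(row)  # each row is [0, 2, 0, 0]: the filter fires only at the edge
```

The filter has four weights no matter how large the image is, and moving the edge in the image simply moves the response in the feature map.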
His LeNet read handwritten digits on real bank checks — one of the first genuinely useful, deployed neural nets. Vision had its blueprint.
The idea sat ahead of its time for ~15 years — waiting for the data and GPUs to catch up. Then 2012 happened.
Data: ~14M hand-labeled images (ImageNet, Fei-Fei Li). Finally enough examples to learn from.
GPUs: Gaming chips, repurposed for the massively parallel math neural nets need.
Training tricks: Small tricks that let much deeper networks actually train without falling apart.
None of these was new on its own. Together, they crossed a threshold.
Krizhevsky, Sutskever & Hinton entered the ImageNet contest with a deep convolutional net — LeCun's idea, scaled up on two GPUs — and didn't just win, they crushed it.
error gap to the field — an unheard-of leap in one year
Overnight, the whole field switched to deep learning. The modern boom starts here. (Sutskever later co-founds OpenAI — keep that thread.)
Before transformers, the best language models (RNNs / LSTMs) walked through a sentence left to right, one word at a time. Two problems:
Each word waits for the one before it — you can't use a GPU's parallelism. Training crawls.
Over long distances the early words fade. Connecting things far apart is hard.
The core move: for every word, the model asks "which other words matter to me?" and weighs them all — in parallel.
Each word scores how relevant every other word is to it — a full grid of similarities.
Scale for stable gradients, then normalize the scores into weights that sum to 1.
Each word becomes a weighted blend of the others' values. All words at once.
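The three steps above (score, scale and normalize, blend) can be sketched as scaled dot-product attention. This is a simplified illustration in plain Python: it skips the learned query, key, and value projections of a real transformer, and it loops over words one at a time, where real implementations use matrix multiplications that run in parallel. The example vectors are made up.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # 1. Score: how relevant is every word to this one? Scale by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # 2. Normalize the scores into weights that sum to 1.
        weights = softmax(scores)
        # 3. Blend: a weighted average of the value vectors.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up "word" vectors attending to each other (self-attention).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one word's attention weights over all three words.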
It can weigh every part of the input against every other part — capturing context far better than older models.
It processes a whole sequence at once, so it scales beautifully on GPUs. More data + compute kept making it better.
The same architecture handles text, images, audio, code. One idea behind LLMs, image models, and agents.
Everything we talk about today runs on this one 2017 idea — just bigger.
Take one transformer. Train it on a huge slice of the internet to do one thing — predict the next word. Then scale it up. It kept getting better, predictably.
GPT-1: proof the recipe works
GPT-2: surprisingly coherent text
GPT-3: a different kind of thing
Emergence: past a certain size, abilities nobody trained for start to appear — translating, rough arithmetic, and learning a new task from a couple of examples in the prompt.
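The training objective on this slide, predicting the next word, can be shown at toy scale. The sketch below is a count-based bigram model, not a transformer: it only looks at the single previous word, and the tiny corpus is made up. It illustrates the task, not the architecture.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count which word follows which: a toy stand-in for next-word training."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent continuation seen in training, or None."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


CORPUS = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(CORPUS)
print(predict_next(model, "the"))  # cat
print(predict_next(model, "sat"))  # on
print(predict_next(model, "dog"))  # None (never seen in training)
```

An LLM replaces the lookup table with a transformer that conditions on the whole preceding context, but the question it is trained to answer is the same.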
GPT-3 was powerful but awkward to talk to. Two extra steps turned a text-predictor into a helpful assistant:
Pretraining: predict the next word on the internet
Instruction tuning: learn to follow requests
RLHF: humans rank answers; it learns what's helpful
Chat interface: anyone can just chat
100M users in about two months, after launching in Nov 2022
The model barely changed — the packaging did. A chat box put it in everyone's hands overnight. That's the moment that put us all in this room.
Where AI actually is right now — past the headlines.
Today's AI is astonishingly capable at some things and confidently wrong at others, and the trick is knowing which is which.
Everything else today is in service of that one skill.
They all learn patterns from enormous data, then predict what fits. Different inputs, same engine.
LLMs: Predict the next chunk of language. Chat, code, summaries, reasoning-ish.
Image models: Turn noise into an image that matches a description. Also edit & restyle.
Agents: An LLM that can take actions, check results, and try again. Not a new brain, but a new wrapper.
The thing most people mean when they say "AI."
Hold this in your head and almost every quirk below stops being surprising.
Rule of thumb: trust it for shape, verify the specifics.
"It thinks. It reasons. It's basically a junior employee / a brain."
Newer models do show real gains by "thinking out loud" before answering. But it's pattern-completion that looks like reasoning — powerful, yet brittle in unfamiliar territory.
TAKEAWAY › Treat it as a brilliant, fast, slightly unreliable intern — not an oracle and not a search engine. Useful framing, not an insult.
Diffusion, in one honest paragraph.
Same core trick as LLMs: learn the patterns, then generate what fits.
Great for exploration & assets; needs a human for precision & final polish.
Where most of the 2026 hype lives.
You give it a task
LLM decides a step
Calls a tool / API
Reads the result
Until done
No new intelligence here — it's the same model given hands (tools) and a feedback loop. That's powerful and fragile for the same reason.
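The loop above can be sketched in a few lines. Everything named here is hypothetical: `fake_model` is a scripted stand-in for a real LLM call, `calculator` is a toy tool, and the dictionary format for decisions is an assumption, not any particular framework's API.

```python
def calculator(expression):
    """Hypothetical tool: handles 'a + b' style sums only."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


TOOLS = {"calculator": calculator}


def fake_model(task, history):
    """Stand-in for an LLM call (scripted here; a real agent would query a model)."""
    if not history:
        return {"action": "calculator", "input": "19 + 23"}
    last_result = history[-1][1]
    return {"action": "finish", "input": f"The answer is {last_result}."}


def run_agent(task, model, tools, max_steps=5):
    """You give it a task; it loops until done or until the step cap is hit."""
    history = []
    for _ in range(max_steps):
        decision = model(task, history)      # LLM decides a step
        if decision["action"] == "finish":
            return decision["input"]
        tool = tools[decision["action"]]     # calls a tool / API
        result = tool(decision["input"])
        history.append((decision, result))   # reads the result
    return "Stopped: step limit reached."


answer = run_agent("What is 19 + 23?", fake_model, TOOLS)
print(answer)  # The answer is 42.
```

The step cap matters: a model that never decides to finish would otherwise loop forever, which is one concrete form of the fragility described above.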
The gap between a great demo and a dependable system is the whole game right now.
A few filters you can reuse forever.
HEURISTIC › The more general and effortless something sounds, the harder you should look for the catch.
LLMs, image models, agents — all predict "what fits" from patterns in data. Brilliant, not magic.
It optimizes for what sounds/looks right. You supply the judgment about what is right.
Best results come from a human steering, checking, and deciding. It's leverage, not a replacement.
Trust it for shape & speed. Own the specifics & the call.
Happy to go deeper on any of these in Q&A.
Beginner or advanced — anything goes. What's confusing you, what you're trying to build, or what you read this week and didn't believe.
The goal was a real mental model you can reuse — not a list of tools that'll be stale next month. Mission accomplished if you now read AI news a little more skeptically.
Session 02 coming soon; topic TBD. Bring a question.