AI Office Hours · Session 01 · via Zoom

Where AI Actually Is
in 2026

What LLMs, image models, and agents really do — what they don't — and how to read past the hype.

Relaxed but structured · No prerequisites · Bring a question · Open Q&A

Hosted by Bob (Jue) Guo — CS PhD candidate & adjunct faculty, University at Buffalo

Your host

Hi, I'm Bob (Jue) Guo

  • CS PhD candidate at the University at Buffalo
  • Adjunct faculty — I teach deep learning, machine learning, and AI
  • I spend my days both building these systems and explaining them
WHY THESE SESSIONS

There's a huge gap between what AI can do, what people say it does, and what's actually useful to you. Office Hours is where we close that gap — calmly, with real examples, and your questions.

How today works

Run it like a class

FORMAT

Relaxed but structured. I walk through a mental model, you interrupt with questions whenever.

WHO IT'S FOR

Beginners leave with a real model. If you already build with these tools, you'll get sharper angles.

THE DEAL

No prerequisites. No slides to memorize. We close with open Q&A — so bring a question.

One ask: when something doesn't land, say so. The confusing parts are the whole point.

The plan

What we'll cover

  • 01  How we got here — a short history
  • 02  The big picture — where AI is in 2026
  • 03  LLMs — what they really do
  • 04  Image models — what they really do
  • 05  Agents — the word everyone overloads
  • 06  How to read past the hype
  • 07  A mental model to take home + Q&A

An ongoing series

This is Session 01

Office Hours runs as a series — each session stands on its own, so you can drop into any one. Here's the arc.

SESSION 01 · TODAY

Where AI actually is in 2026

The model family tree, LLMs, image models, agents — and how to read past the hype.

SESSION 02 · NEXT

Topic to be set

A focused deep dive into one area — shaped by the questions you bring today.

SESSION 03+

More to come

Hands-on building, prompting, agents in practice — and whatever you keep asking about.

Today is the foundation — the shared mental model the rest of the series builds on.

Part 01

How we got here

A 70-year-old idea that, for most of those years, didn't work.

The pioneers

The "godfathers" of deep learning

Three researchers kept betting on neural networks through decades when almost no one else did. They shared the 2018 Turing Award for it.

GEOFFREY HINTON

Championed backpropagation — how nets learn from mistakes. Mentored the team behind the 2012 breakthrough. Nobel Prize in Physics, 2024.

YANN LeCUN

Built convolutional nets that could read handwriting in the '90s — the foundation of modern computer vision. Went on to serve as chief AI scientist at Meta.

YOSHUA BENGIO

Pushed deep learning for language & sequences and early attention ideas — groundwork for today's LLMs.

The short version

From a lab curiosity to ChatGPT

1958

The perceptron

One artificial neuron. Big hype → a decades-long "AI winter."

1986

Backpropagation

Nets can finally learn many layers — but still too slow to matter.

2012

AlexNet + GPUs

Crushes the ImageNet contest. Big data + GPUs make deep learning suddenly work.

2017 ⚡

The Transformer

"Attention Is All You Need." Google's architecture — the engine of everything since.

2018–22

GPT scales up → ChatGPT

More data + compute → emergent abilities. ChatGPT takes AI mainstream overnight.

2026

You are here

Multimodal models & agents — what we'll unpack today.

Part 01 · The full map

The whole map: four eras of models

Those six were the turning points. Zoom out and the full landscape looks like this — four eras, every model that mattered. Skim it now; then we'll rewind and walk the key ideas in depth.

FOUNDATIONS · '58–'06
  • '58 Perceptron
  • '69 Perceptrons (book)
  • '80 Neocognitron
  • '86 Backprop
  • '89 LeNet
  • '97 LSTM
  • '98 LeNet-5
  • '06 Deep Belief Nets
DL REVOLUTION · '09–'16
  • '09 ImageNet
  • '12 AlexNet
  • '13 Word2Vec
  • '14 GAN
  • '14 VGG, GoogLeNet
  • '14 Seq2Seq
  • '15 ResNet
  • '15 BatchNorm
  • '16 AlphaGo
TRANSFORMER ERA · '17–'21
  • '17 Transformer
  • '18 BERT
  • '18 GPT-1
  • '19 GPT-2
  • '20 GPT-3
  • '20 ViT
  • '20 DDPM
  • '21 CLIP
  • '21 DALL·E
  • '21 AlphaFold 2
GENERATIVE AI · '22–
  • '22 ChatGPT
  • '22 Stable Diffusion
  • '23 GPT-4
  • '23 LLaMA
  • '23 Claude
  • '24 Sora
  • '24 GPT-4o
  • '24 o1
  • '25 DeepSeek R1
  • '25 Agentic systems

Part 01 · The tour · Era 1 of 4

Foundations 1958 – 2006

The mathematical building blocks of neural networks.

1958

Perceptron

Rosenblatt's single artificial neuron — learns to separate linearly separable classes.

1969

Perceptrons (book)

Minsky & Papert prove the XOR limit. Triggers the first AI winter.

1980

Neocognitron

Fukushima's hierarchical visual model — the direct ancestor of CNNs.

1986

Backpropagation

Rumelhart, Hinton & Williams — multi-layer training finally tractable.

1989

LeNet

LeCun's first CNN — handwritten digit recognition for the US Postal Service.

1997

LSTM

Hochreiter & Schmidhuber — gated cells mitigate the vanishing-gradient problem.

1998

LeNet-5

Refined CNN evaluated on MNIST — the benchmark that defined the era.

2006

Deep Belief Networks

Hinton's layer-wise pretraining revives the field. Popularizes the term "deep learning."

Part 01 · The tour · Era 2 of 4

Deep learning revolution 2009 – 2016

Data + GPUs + depth, in that order.

2009

ImageNet

Fei-Fei Li's 14M-image, 21k-category dataset. The benchmark everyone had to beat.

2012

AlexNet

Krizhevsky, Sutskever & Hinton — deep CNN on GPUs cuts ImageNet top-5 error from 26.2% to 15.3%.

2013

Word2Vec

Mikolov — dense word vectors that capture meaning algebraically.

2014

GAN

Goodfellow — two networks playing a min-max game. The generative unlock.

2014

VGG & GoogLeNet

Oxford and Google — much deeper nets; Inception modules; ensemble tricks.

2014

Seq2Seq

Sutskever — encoder-decoder framework. Makes neural translation work.

2015

ResNet

He et al. — skip connections let networks scale to 152 layers without degrading.

2015

Batch Normalization

Ioffe & Szegedy — normalises activations layer by layer. Becomes a default.

2016

AlphaGo

DeepMind — deep RL + Monte Carlo tree search beats Lee Sedol 4–1.

Part 01 · The tour · Era 3 of 4

Transformer era 2017 – 2021

One architecture starts eating every domain.

2017

Transformer

Vaswani et al. — attention replaces recurrence. The new dominant architecture.

2018

BERT

Devlin et al. — bidirectional pretraining → top scores across NLP tasks.

2018

GPT-1

OpenAI — decoder-only autoregressive pretraining on books.

2019

GPT-2

OpenAI — 1.5B params; generation quality so good they delayed release.

2020

GPT-3

OpenAI — 175B params. In-context learning, no fine-tuning required.

2020

ViT

Dosovitskiy et al. — split image into patches, feed to a transformer. Works.

2020

DDPM

Ho et al. — image generation by learning to reverse a noise process.

2021

CLIP

OpenAI — contrastive image-text alignment; one model, many vision tasks.

2021

DALL·E

OpenAI — text-to-image generation via transformer + VAE-style decoder.

2021

AlphaFold 2

DeepMind — protein structure prediction at near-experimental accuracy.

Part 01 · The tour · Era 4 of 4

Generative & multimodal 2022 –

AI leaves the lab. Chat, image, video, voice, code, agents.

2022

ChatGPT

OpenAI — RLHF + chat interface make LLMs broadly usable. 100M users in 2 months.

2022

Stable Diffusion

Stability AI — open-weight text-to-image. Spawns an ecosystem overnight.

2023

GPT-4

OpenAI — multimodal frontier model. Bar exam, code, vision, the works.

2023

LLaMA

Meta — open-weight LLMs that anyone can run, fine-tune, and build on.

2023

Claude

Anthropic — Constitutional AI; safety-focused frontier model.

2024

Sora

OpenAI — coherent minute-long video from text. The video-model moment.

2024

GPT-4o

OpenAI — single model handling voice, vision, and text in real time.

2024

o1

OpenAI — chain-of-thought at inference time; the reasoning-model unlock.

2025

DeepSeek R1

DeepSeek — open-weight reasoning model trained at a fraction of frontier cost.

2025

Agentic systems

Tool-using LLMs running multi-step tasks autonomously. The agent era begins.

That's the whole landscape. Now let's rewind to 1958 and watch how a handful of these ideas actually work.

Deep dive · 1958

The perceptron: one artificial neuron

Frank Rosenblatt's bet: build a machine that learns from examples instead of following hand-coded rules. One simple unit, loosely inspired by a brain cell.

INPUTS

x₁ x₂ x₃

the features it sees

WEIGH + SUM

Σ wᵢxᵢ + b

each input has a weight

THRESHOLD

fire?

over the line → 1

OUTPUT

0 or 1

its decision

\[ \hat{y} = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + \dots + b > 0 \\ 0 & \text{otherwise} \end{cases} \]

learning rule: nudge the weights whenever it's wrong

That one idea — adjustable weights tuned by error — is still the beating heart of every neural network in 2026.
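To make the weigh, sum, threshold, and nudge steps concrete, here is a minimal sketch in plain Python. The function names, the choice of logical AND as training data, and the learning rate of 1.0 are illustrative assumptions, not part of Rosenblatt's original setup.

```python
def step(z):
    """Threshold: fire (1) only when the weighted sum is above 0."""
    return 1 if z > 0 else 0


def predict(weights, bias, x):
    """Weigh each input, add the bias, then threshold."""
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_perceptron(samples, lr=1.0, epochs=20):
    """Perceptron learning rule: weights move only when the guess is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Logical AND can be split by one straight line, so the rule settles on it.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
and_predictions = [predict(weights, bias, x) for x, _ in AND_DATA]
print(and_predictions)  # [0, 0, 0, 1]
```

When the guess is right the error is 0 and nothing changes, which is why training goes quiet once every example is classified correctly.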

Deep dive · 1958 → the first freeze

Big promise, then a hard wall

THE HYPE (1958)

The press called it an electronic brain that would soon walk, talk, see, and reproduce itself. Funding poured in.

THE WALL (1969)

Minsky & Papert proved a single perceptron can only draw a straight line — it can't even learn XOR. Optimism collapsed.

promise ↑↑↑ → limit discovered → funding ↓↓↓   // the first "AI winter," 1970s

Keep this cycle in mind — we'll use it as a lens on today's hype in Part 06.

Deep dive · the wall, up close

Why one straight line can't do XOR

XOR ("exclusive or") outputs 1 only when the two inputs differ — a tiny logic function:

x₁ x₂ XOR
0 0 0
0 1 1
1 0 1
1 1 0

A perceptron's decision is one line: \(\mathbf{w}\cdot\mathbf{x} + b = 0\). Try any line on the plot → one dot always lands on the wrong side.

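[Figure: the four XOR points plotted with x₁ on the horizontal axis and x₂ on the vertical axis. Dots are marked by output (1 or 0); the dashed line is one failed attempt to separate them.]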

The fix: combine two lines and fold the space — exactly what adding a hidden layer does. That's the next slide.
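A quick way to feel the wall is to try a lot of lines. The sketch below sweeps a grid of candidate weights and biases (the grid range and step are arbitrary assumptions) and counts how many lines classify all four points. A grid search is not a proof. The real argument is short: XOR would need b ≤ 0, w₁ + b > 0, w₂ + b > 0 and w₁ + w₂ + b ≤ 0, and adding the two middle inequalities contradicts the other two.

```python
import itertools

XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def line_classifies(w1, w2, b, data):
    """True if the line w1*x1 + w2*x2 + b = 0 puts every point on its correct side."""
    return all(
        (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == target
        for (x1, x2), target in data
    )


def count_separating_lines(data, grid):
    """Count the (w1, w2, b) combinations on the grid that get all points right."""
    return sum(
        line_classifies(w1, w2, b, data)
        for w1, w2, b in itertools.product(grid, repeat=3)
    )


GRID = [i / 4 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25

xor_hits = count_separating_lines(XOR_DATA, GRID)
and_hits = count_separating_lines(AND_DATA, GRID)
print("lines that solve XOR:", xor_hits)
print("lines that solve AND:", and_hits)
```

AND is included only as a control: plenty of lines solve it, none solve XOR.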

Deep dive · 1986

Stacking neurons breaks the limit

One neuron draws one line. Add a hidden layer and, given enough hidden neurons, the network can carve space into far more intricate regions — XOR included.

[Figure: a small network with an input layer, a hidden layer, and an output.]

Sounds abstract — so let's build XOR by hand with exactly two hidden neurons and see it work. Next slide.

Deep dive · the fix, concretely

How a hidden layer builds XOR

Give the net two hidden neurons. Each draws its own line; the output neuron combines them. XOR = "at least one input on, but not both."

\[ h_1 = \operatorname{step}(x_1 + x_2 - 0.5) \quad (\text{OR}) \]
\[ h_2 = \operatorname{step}(x_1 + x_2 - 1.5) \quad (\text{AND}) \]
\[ y = \operatorname{step}(h_1 - h_2 - 0.5) \quad (h_1 \,\wedge\, \neg h_2) \]
x₁ x₂ h₁ h₂ y
0 0 0 0 0
0 1 1 0 1
1 0 1 0 1
1 1 1 1 0

[Figure: the four input points on x₁ and x₂ axes running from 0 to 1.]

two lines carve a stripe holding exactly the two "output 1" points
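The hand-built network above is small enough to run directly. This sketch copies the three step equations as written; only the function names are assumptions.

```python
def step(z):
    """Threshold: 1 when z is above 0, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Two hidden neurons (OR and AND) feeding one output neuron."""
    h1 = step(x1 + x2 - 0.5)   # OR: at least one input on
    h2 = step(x1 + x2 - 1.5)   # AND: both inputs on
    y = step(h1 - h2 - 0.5)    # h1 and not h2
    return h1, h2, y


xor_rows = [(x1, x2, *xor_net(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
for row in xor_rows:
    print(*row)
```

Each printed row is x₁, x₂, h₁, h₂, y, so the output can be checked line by line against the table.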

Here we chose the weights by hand. For real networks nobody can — so how does a network find them itself? That's backprop.

Deep dive · 1986

Backpropagation: learning from mistakes

Rumelhart, Hinton & Williams gave the recipe still used today — repeated millions of times:

01

Guess

forward pass

02

Measure error

how wrong?

03

Assign blame

push error backward

04

Nudge weights

a tiny step better

05

Repeat

millions of times

Under the hood it's just the chain rule from calculus, at scale. The math was right — but with little data and slow computers, it still didn't take off. The world wasn't ready yet.

Deep dive · the math

How a weight actually moves

Both rules "tune by error." The difference is how the error reaches each weight — and that's the whole leap.

PERCEPTRON · 1958
\[ \hat{y} = \operatorname{step}(\mathbf{w}\cdot\mathbf{x} + b) \] \[ w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i \]

The error \(y - \hat{y}\) is read straight off the output, and the output neuron's weights are the only weights there are. One layer → blame is trivial. \(\eta\) = learning rate.

BACKPROP · 1986
\[ L = \tfrac{1}{2}(y - \hat{y})^2 \] \[ w_i \leftarrow w_i - \eta\,\frac{\partial L}{\partial w_i} \] \[ \delta_j = \sigma'(z_j)\sum_k \delta_k\,w_{kj} \]

A hidden neuron has no target of its own. The chain rule propagates the blame \(\delta\) backward to reach it.

Same spirit — step downhill along the error. Backprop's trick is credit assignment: computing ∂L/∂w for weights with no target of their own.
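One way to trust the blame-assignment formula is to check it against brute force. The sketch below assumes a tiny 2-2-1 sigmoid network with made-up weights, computes the gradients with the δ rule, and compares one of them to a finite-difference estimate. All names and numbers are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """2 inputs -> 2 sigmoid hidden neurons -> 1 sigmoid output."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y_hat


def loss(params, x, y):
    """L = 1/2 (y - y_hat)^2, as on the slide."""
    _, y_hat = forward(params, x)
    return 0.5 * (y - y_hat) ** 2


def backprop(params, x, y):
    """Gradients of L for every weight, via the chain rule."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    # Output blame: dL/dz_out = (y_hat - y) * sigma'(z_out), with sigma' = s * (1 - s).
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    grad_W2 = [delta_out * hj for hj in h]
    # Hidden blame: delta_j = sigma'(z_j) * sum_k delta_k * w_kj (only one k here).
    delta_hidden = [hj * (1 - hj) * delta_out * w for hj, w in zip(h, W2)]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def numeric_grad_w1(params, x, y, j, i, eps=1e-6):
    """Finite-difference estimate of dL/dW1[j][i], used only as a cross-check."""
    W1, b1, W2, b2 = params

    def loss_with(value):
        W1_mod = [row[:] for row in W1]
        W1_mod[j][i] = value
        return loss((W1_mod, b1, W2, b2), x, y)

    w = W1[j][i]
    return (loss_with(w + eps) - loss_with(w - eps)) / (2 * eps)


# Arbitrary illustrative weights and one training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, y = (1.0, 0.0), 1.0

grad_W1, _, _, _ = backprop(params, x, y)
analytic = grad_W1[0][0]
numeric = numeric_grad_w1(params, x, y, 0, 0)
print(analytic, numeric)
```

The two numbers should agree to many decimal places. The difference is cost: the finite-difference route needs two forward passes per weight, while backprop gets every gradient from one backward sweep.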

Deep dive · 1989–1998

LeCun's big idea: the convolutional net

A plain network treats every pixel as unrelated. Yann LeCun asked a sharper question: in an image, the same shape can appear anywhere — so why not teach the net to scan for features instead of memorizing fixed positions?

THE TRICK

Slide small filters across the image to detect edges, then corners, then shapes — reusing the same weights everywhere. Far fewer parameters, with tolerance to position built in.
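The sliding-filter trick fits in a few lines of plain Python. This is a toy sketch: the 2×2 edge filter, the 6×6 image, and the helper names are assumptions chosen for illustration, and a real CNN learns its filters rather than having them written by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over every position (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Responds where the image goes from dark (0) to bright (1), left to right.
EDGE_KERNEL = [[-1, 1], [-1, 1]]


def image_with_edge(col, size=6):
    """size x size image: dark left of `col`, bright from `col` onward."""
    return [[1 if c >= col else 0 for c in range(size)] for _ in range(size)]


def strongest_column(feature_map):
    """Column where the filter fires hardest (checked on the top row)."""
    top_row = feature_map[0]
    return top_row.index(max(top_row))


left = strongest_column(conv2d(image_with_edge(2), EDGE_KERNEL))
right = strongest_column(conv2d(image_with_edge(4), EDGE_KERNEL))
print(left, right)  # 1 3

conv_weights = len(EDGE_KERNEL) * len(EDGE_KERNEL[0])  # 4 shared weights
dense_weights = (6 * 6) * (5 * 5)                      # 900 if every pixel fed every output
print(conv_weights, dense_weights)
```

The same four weights find the edge wherever it sits, which is the "reuse the same weights everywhere" point. A fully connected layer producing the same 5×5 output would need 900 separate weights.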

LeNet · 1998

His LeNet read handwritten digits on real bank checks — one of the first genuinely useful, deployed neural nets. Vision had its blueprint.

The idea sat ahead of its time for ~15 years — waiting for the data and GPUs to catch up. Then 2012 happened.

Deep dive · 2012

Three things finally lined up

DATA

ImageNet

~14M hand-labeled images (Fei-Fei Li). Finally enough examples to learn from.

COMPUTE

GPUs

Gaming chips, repurposed for the massively parallel math neural nets need.

METHODS

ReLU, dropout

Small tricks that let much deeper networks actually train without falling apart.

None of these was new on its own. Together, they crossed a threshold.

Deep dive · 2012

AlexNet: the spark

Krizhevsky, Sutskever & Hinton entered the ImageNet contest with a deep convolutional net — LeCun's idea, scaled up on two GPUs — and didn't just win, they crushed it.

top-5 error  runner-up 26.2% → AlexNet 15.3%
~11 pts

error gap to the field — an unheard-of leap in one year

Overnight, the whole field switched to deep learning. The modern boom starts here. (Sutskever later co-founds OpenAI — keep that thread.)

Deep dive · 2017

Before: reading one word at a time

The best language models (RNNs / LSTMs) walked through a sentence left to right, one word at a time. Two problems:

SLOW

Each word waits for the one before it — you can't use a GPU's parallelism. Training crawls.

FORGETFUL

Over long distances the early words fade. Connecting things far apart is hard.

"The cat that the dog chased across the yard … was fast." — by the time you reach "was," the subject is far behind.
Deep dive · 2017

Self-attention: look at everything at once

The core move: for every word, the model asks "which other words matter to me?" and weighs them all — in parallel.

"The animal didn't cross the street because it was too tired."
Attention links "it" → "animal" (not "street") — that's how it keeps meaning straight.
\[ \operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left( \frac{QK^{\top}}{\sqrt{d_k}} \right) V \]
\( Q K^{\top} \)

Each word scores how relevant every other word is to it — a full grid of similarities.

\( \div \sqrt{d_k},\ \operatorname{softmax} \)

Scale for stable gradients, then normalize the scores into weights that sum to 1.

\( \cdot\, V \)

Each word becomes a weighted blend of the others' values. All words at once.
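The three pieces of the formula map onto three lines of code. Below is a dependency-free sketch with tiny made-up vectors. Real models use learned projections to produce Q, K, and V, plus many attention heads, none of which is shown here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with each matrix given as a list of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Q K^T: score this query against every key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # . V: each output row is a weighted blend of the value vectors.
    output = [
        [sum(w * v[d] for w, v in zip(row, V)) for d in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Illustrative toy inputs: 2 queries attending over 3 positions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Every query row is computed independently of the others, which is why the whole grid can be evaluated in parallel on a GPU.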

The turning point

Why the Transformer changed everything

ATTENTION

It can weigh every part of the input against every other part — capturing context far better than older models.

PARALLEL

It processes a whole sequence at once, so it scales beautifully on GPUs. More data + compute kept making it better.

UNIVERSAL

The same architecture handles text, images, audio, code. One idea behind LLMs, image models, and agents.

Everything we talk about today runs on this one 2017 idea — just bigger.

Deep dive · 2018–2020

"Just make it bigger"

Take one transformer. Train it on a huge slice of the internet to do one thing — predict the next word. Then scale it up. It kept getting better, predictably.

GPT-1 · 2018
117M

proof the recipe works

GPT-2 · 2019
1.5B

surprisingly coherent text

GPT-3 · 2020
175B

a different kind of thing

\( L(N) \approx (N_c / N)^{\alpha} \) test loss falls as a power law in model size \(N\) — not luck, a measured trend (scaling laws, 2020)
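To get a feel for what a power law means here, the sketch below plugs the three GPT sizes into the formula. The constants are assumptions: N_c ≈ 8.8e13 and α ≈ 0.076 are roughly the fitted values reported in the 2020 scaling-laws work, which counts non-embedding parameters, so treat the outputs as illustrative rather than as measured losses for these models.

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """L(N) ~ (N_c / N) ** alpha. The constants are illustrative assumptions."""
    return (n_c / n_params) ** alpha


MODEL_SIZES = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}
losses = {name: predicted_loss(n) for name, n in MODEL_SIZES.items()}

for name, value in losses.items():
    print(f"{name}: {value:.2f}")

# The power-law signature: every 10x in size multiplies loss by the same factor.
ratio_per_10x = predicted_loss(1e10) / predicted_loss(1e9)
print(f"loss multiplier per 10x parameters: {ratio_per_10x:.3f}")
```

With these constants each 10× in parameters trims the loss by about 16%, no matter where you start. That steady, predictable return is what made "just make it bigger" a plan rather than a gamble.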

Emergence: past a certain size, abilities nobody trained for start to appear — translating, rough arithmetic, and learning a new task from a couple of examples in the prompt.

Deep dive · 2022

From raw model to ChatGPT

GPT-3 was powerful but awkward to talk to. Two extra steps turned a text-predictor into a helpful assistant:

01

Pretrain

predict the next word on the internet

02

Instruction-tune

learn to follow requests

03

RLHF

humans rank answers; it learns what's helpful

04

Assistant

anyone can just chat

~100M users

in about two months after the Nov 2022 launch

The model barely changed — the packaging did. A chat box put it in everyone's hands overnight. That's the moment that put us all in this room.

Part 02

The big picture

Where AI actually is right now — past the headlines.

Today's AI is astonishingly capable at some things and confidently wrong at others —
and the trick is knowing which is which.

Everything else today is in service of that one skill.

The landscape

Three families, one idea

They all learn patterns from enormous data, then predict what fits. Different inputs, same engine.

LLMs

Text in → text out

Predict the next chunk of language. Chat, code, summaries, reasoning-ish.

IMAGE MODELS

Prompt → pixels

Turn noise into an image that matches a description. Also edit & restyle.

AGENTS

Model + tools + loop

An LLM that can take actions, check results, and try again. Not a new brain — a new wrapper.

Part 03

Large Language Models

The thing most people mean when they say "AI."

What it really does

A very good next-word guesser

  • It reads everything so far and predicts the most plausible next piece of text — over and over.
  • That's it. No database lookup, no understanding of "true." Plausibility ≠ accuracy.
  • The magic: at huge scale, "what sounds right" turns out to capture a lot of real structure — grammar, code, reasoning patterns, style.

Hold this in your head and almost every quirk below stops being surprising.
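A toy version makes "plausible, not true" easy to see. The sketch below is a bigram counter, not a neural network: it only looks at the previous word, and the three-sentence corpus is invented for illustration. It shares one thing with an LLM, which is that it picks what usually comes next.

```python
from collections import Counter, defaultdict

# Invented toy corpus, purely for illustration.
CORPUS = (
    "the cat sat on the mat . "
    "the cat sat on the rug . "
    "the dog ate the fish ."
).split()


def train_bigrams(tokens):
    """Count which word follows which."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows, start, length):
    """Repeatedly append the most frequent next word."""
    words = [start]
    for _ in range(length):
        candidates = follows.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return words


model = train_bigrams(CORPUS)
generated = generate(model, "the", 5)
print(" ".join(generated))  # the cat sat on the cat
```

"the cat sat on the cat" is fluent and appears nowhere in the corpus. Nothing was looked up and nothing was checked. The model only followed the most likely path, which is a miniature of how a confident hallucination arises.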

Calibrate

What LLMs are & aren't good at

GREAT AT
  • Drafting, rewriting, summarizing
  • Translating between formats & languages
  • Code: scaffolding, explaining, debugging
  • Brainstorming & "rubber-duck" thinking
  • Turning messy input into structure
SHAKY AT
  • Facts it half-remembers (→ confident hallucinations)
  • Exact arithmetic & counting
  • Knowing what it doesn't know
  • Recent events past its training cutoff
  • Staying consistent over very long contexts

Rule of thumb: trust it for shape, verify the specifics.

Hype vs. reality

The "reasoning" question

THE HYPE

"It thinks. It reasons. It's basically a junior employee / a brain."

THE REALITY

Newer models do show real gains by "thinking out loud" before answering. But it's pattern-completion that looks like reasoning — powerful, yet brittle in unfamiliar territory.

TAKEAWAY ›  Treat it as a brilliant, fast, slightly unreliable intern — not an oracle and not a search engine. Useful framing, not an insult.

Part 04

Image Models

Diffusion, in one honest paragraph.

What it really does

Sculpting an image out of noise

  • Start with pure static. Step by step, the model removes noise toward "something that matches the prompt."
  • It learned this by watching billions of image + caption pairs — so it knows what words tend to look like.
  • It's painting plausible pixels, not retrieving a real photo or understanding the scene.

Same core trick as LLMs: learn the patterns, then generate what fits.
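The noise-removal loop can be sketched without a neural network by cheating. Below, an "oracle" that already knows the clean signal stands in for the trained noise predictor, so this shows only the mechanics of the loop, not learning. The noise schedule, the four-number "image", and the deterministic DDIM-style update (rather than the original stochastic DDPM sampler) are all assumptions made for illustration.

```python
import math
import random

T = 50
# Assumed linear noise schedule; ALPHA_BAR[t] is how much clean signal survives at step t.
BETAS = [1e-4 + (0.2 - 1e-4) * t / (T - 1) for t in range(T)]
ALPHA_BAR = []
running = 1.0
for beta in BETAS:
    running *= 1.0 - beta
    ALPHA_BAR.append(running)


def add_noise(x0, t, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    a = ALPHA_BAR[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    return [math.sqrt(a) * x + math.sqrt(1 - a) * n for x, n in zip(x0, noise)]


def oracle_noise(xt, t, x0):
    """Stand-in for the trained network. It cheats by using the clean signal x0;
    a real model has to predict this noise from xt, t, and the prompt alone."""
    a = ALPHA_BAR[t]
    return [(v - math.sqrt(a) * x) / math.sqrt(1 - a) for v, x in zip(xt, x0)]


def denoise(xt, x0_for_oracle):
    """Walk from the noisiest step back to step 0, removing a little noise each time."""
    x = list(xt)
    for t in range(T - 1, -1, -1):
        a = ALPHA_BAR[t]
        a_prev = ALPHA_BAR[t - 1] if t > 0 else 1.0
        eps = oracle_noise(x, t, x0_for_oracle)
        # Current best guess of the clean signal, given the predicted noise.
        x0_hat = [(v - math.sqrt(1 - a) * e) / math.sqrt(a) for v, e in zip(x, eps)]
        # Move to the slightly less noisy step t-1.
        x = [math.sqrt(a_prev) * c + math.sqrt(1 - a_prev) * e for c, e in zip(x0_hat, eps)]
    return x


CLEAN = [1.0, -1.0, 0.5, 0.0]  # a four-"pixel" toy image
rng = random.Random(0)
noisy = add_noise(CLEAN, T - 1, rng)
recovered = denoise(noisy, CLEAN)
print([round(v, 3) for v in noisy])
print([round(v, 3) for v in recovered])
```

With a perfect noise predictor the loop lands back on the clean signal. A real image model only has a learned guess of the noise, steered by the prompt, which is why it produces something plausible rather than any particular real photo.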

Calibrate

Where image models shine & struggle

SHINE
  • Concept art, mood, mockups, style
  • Backgrounds, textures, variations at speed
  • Editing: inpaint, remove, restyle, upscale
  • "Show me 20 directions" exploration
STRUGGLE
  • Reliable text inside images (improving, still iffy)
  • Precise counts, hands, fine spatial logic
  • Exact consistency of a character across shots
  • "Pixel-perfect to spec" without iteration

Great for exploration & assets; needs a human for precision & final polish.

Part 05

Agents — the overloaded word

Where most of the 2026 hype lives.

What it really is

An LLM in a loop, with tools

01

Goal

You give it a task

02

Plan

LLM decides a step

03

Act

Calls a tool / API

04

Observe

Reads the result

05

Repeat

Until done

No new intelligence here — it's the same model given hands (tools) and a feedback loop. That's powerful and fragile for the same reason.
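The five-step loop is short enough to write out. In the sketch below a hard-coded `scripted_policy` stands in for the LLM, and the two tools and the price table are invented, so everything here is hypothetical scaffolding. What carries over to real agents is the shape of the loop.

```python
# Hypothetical tools; a real agent would call APIs, search, a code runner, and so on.
TOOLS = {
    "lookup_price": lambda item: {"apple": 0.5, "pear": 0.75}[item],
    "multiply": lambda a, b: a * b,
}


def scripted_policy(goal, history):
    """Stand-in for the LLM: chooses the next action from what it has observed so far."""
    if not history:
        return {"tool": "lookup_price", "args": ["apple"]}
    if len(history) == 1:
        price = history[0]["result"]
        return {"tool": "multiply", "args": [price, 12]}
    return {"final": history[-1]["result"]}


def run_agent(goal, policy, tools, max_steps=5):
    """Goal -> plan -> act -> observe -> repeat, with a hard cap on steps."""
    history = []
    for _ in range(max_steps):
        action = policy(goal, history)                          # Plan
        if "final" in action:
            return action["final"], history                     # Done
        result = tools[action["tool"]](*action["args"])         # Act
        history.append({"action": action, "result": result})    # Observe
    raise RuntimeError("step budget exhausted before the goal was reached")


answer, trace = run_agent("cost of 12 apples", scripted_policy, TOOLS)
print(answer)  # 6.0
for entry in trace:
    print(entry["action"]["tool"], "->", entry["result"])
```

Two design choices matter in practice: the step cap, so a confused model cannot loop forever, and the trace, so a human can see what it did. Swap the scripted policy for a model call and every plan becomes a prediction that can be wrong.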

Calibrate

Agents today: promise vs. practice

WORKS WELL
  • Bounded, well-defined, repeatable tasks
  • Coding assistants in a tight feedback loop
  • Retrieval + tools + a human checkpoint
  • Automating the boring 80%
FALLS DOWN
  • Long chains — small errors compound
  • Open-ended "go run my business" tasks
  • No human in the loop on anything risky
  • Reliability ≠ a flashy demo

The gap between a great demo and a dependable system is the whole game right now.
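"Small errors compound" is easy to put numbers on. The sketch below assumes each step succeeds independently with the same probability, which is a simplification, since real agents can sometimes notice and repair a mistake. The 95% figure is an arbitrary example.

```python
def chain_success(per_step, steps):
    """Chance that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


for steps in (1, 5, 20, 50):
    print(f"{steps:>2} steps at 95% each -> {chain_success(0.95, steps):.1%} end to end")
```

A step that works 95% of the time looks great in a demo. Chain twenty of them and the whole task succeeds only about 36% of the time, which is why bounded tasks and human checkpoints matter.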

Part 06

Reading past the hype

A few filters you can reuse forever.

Your BS detector

Five questions to ask any claim

  • Demo or product? Cherry-picked once, or reliable at scale?
  • What's the failure mode? If they won't show it, be suspicious.
  • Who's measuring? A benchmark score and your actual job are different things.
  • What does it replace? A task, or a whole human? Usually a task.
  • Capability or packaging? New model — or an old one with a nicer wrapper?

HEURISTIC ›  The more general and effortless something sounds, the harder you should look for the catch.

Part 07

The mental model
to take home

If you remember one slide

This one

IT'S A PATTERN ENGINE

LLMs, image models, agents — all predict "what fits" from patterns in data. Brilliant, not magic.

PLAUSIBLE ≠ TRUE

It optimizes for what sounds/looks right. You supply the judgment about what is right.

YOU + AI > AI

Best results come from a human steering, checking, and deciding. It's leverage, not a replacement.

Trust it for shape & speed. Own the specifics & the call.

For the folks already building

Sharper angles

  • Context > cleverness. Most "prompt engineering" wins are really retrieval & context design.
  • Evals are the moat. If you can't measure quality, you can't improve it — vibes don't scale.
  • Keep the human checkpoint exactly where errors are expensive; automate around it.
  • Smaller, scoped, reliable beats big-and-flashy in production almost every time.
  • Watch cost & latency as first-class features, not afterthoughts.

Happy to go deeper on any of these in Q&A.

Open floor

Your questions.

Beginner or advanced — anything goes. What's confusing you, what you're trying to build, or what you read this week and didn't believe.

No bad questions · Unmute or drop it in chat

Thanks for coming

That's Session 01

The goal was a real mental model you can reuse — not a list of tools that'll be stale next month. Mission accomplished if you now read AI news a little more skeptically.

STAY IN TOUCH

Bob (Jue) Guo
CS PhD candidate & adjunct faculty
University at Buffalo
Session 02 coming soon — topic TBD. Bring a question →