By Lynn Räbsamen, CFA | COO, Global Swiss Learning | Advisory Board Member, CFA Institute | Author, Artificial Stupelligence
A webinar experiment, a surprising CEO result, and what it means for anyone using AI in finance
I started my recent CFA Institute webinar with a confession: no regression equations in the first five minutes. I know. Scandalous.
Instead, I asked everyone to open Google Images on their phone and search three words: housekeeper, driver, CEO.
Then I asked them to vote on what they saw.
The first two results were about as surprising as a bond yield moving when the Fed talks. Overwhelmingly gendered, in the directions you’d expect. But CEO? That’s where it got interesting.

35% of the audience saw predominantly male results. 65% saw a roughly balanced mix of men and women.
Same search term. Same moment in time. Same algorithm. But: a global audience, different geographies — and demonstrably different results.
That gap is not a glitch. It’s a window into how AI systems learn. And it has direct implications for every finance professional who thinks they’re working with a neutral tool.
The Training Data Is the Problem — And It’s Also All of Us
When we train large language models, we don’t feed them carefully curated, bias-free data. We feed them the internet — images, captions, articles, resumes, job postings, all of it. And the internet, bless its heart, is not neutral.
Here’s the linguistic detail that stopped the room cold.
When my Google Image search returned a male housekeeper, the caption read: “male housekeeper.” When it returned a female housekeeper, the caption just said: “housekeeper.” No qualifier. No label.
The woman is the default. The man is the aberration requiring explanation.
With drivers, the same logic applies in reverse. “Female driver” gets labeled. The man behind the wheel is simply… a driver. Of course he is.

“The model isn’t making things up. It’s learning from us. Which is either reassuring or deeply unsettling depending on how much faith you have in us.”
This isn’t just a quirky internet habit. Research shows these visual and textual cues actively reinforce occupational stereotypes — and those stereotypes get baked into every model trained on that data.
The CEO Slide: Where It Gets Complicated
Now, back to CEO — because this is where the story stops being straightforward.
Female CEOs still represent a depressingly small percentage of actual C-suites globally. Yet the visual narrative online has been pushed hard toward balance, at least in some parts of the world: board-quota coverage, DEI campaigns, feature stories on female executives, stock photo libraries quietly updating their defaults. All of it shows up in the content that ultimately trains the models.
So the images look more equitable than reality. Considerably more.

“Google Images doesn’t show you the truth. It shows you the narrative we’ve pushed hardest.”
And the geographic split in my webinar made that visible in real time. Audiences in regions where that narrative has been amplified most loudly saw more balance. Audiences elsewhere saw something closer to the statistical reality of who actually runs companies. Same algorithm, same query — different cultural context, different output.
That’s not a flaw to patch. That’s the architecture.
Your AI doesn’t have a God’s-eye view of reality. It has a weighted average of what the internet believed when the training data was collected, filtered through whatever corrections the model’s developers layered on top. And as we’re about to see, those corrections have their own problems.
When the Correction Becomes the New Scandal
Companies noticed the bias problem. So they tried to fix it — by nudging models toward more diverse outputs.
Noble intention. Occasionally spectacular execution.
Gemini’s early-2024 image generation episode became the case study. Asked to generate “a 1943 German soldier,” it produced racially diverse soldiers in Wehrmacht uniforms. Asked for “the Founding Fathers of America,” it returned a racially diverse group. Asked for “a pope,” it offered women and men of color. Asked for “Vikings,” I genuinely cannot tell you what emerged, but Khal Drogo’s tribe from Game of Thrones filed a complaint about the resemblance.

The backlash was swift and, importantly, pointed at the right target. The criticism wasn’t “how dare you show diversity.” It was: “you are now rewriting history in the name of diversity” — which is a different problem entirely.
Google admitted it had “missed the mark,” producing “inaccurate historical depictions,” and paused people-image generation while it recalibrated. Which is a very polite way of saying: the fix broke something else.
So here’s where we land. You have two failure modes, both carrying the AI stamp of approval:
Underrepresentation and stereotyping — housekeepers coded female, drivers coded male, leadership roles defaulting to men with a glossy diversity filter.
Overcorrection — diversity injected into contexts where historical accuracy matters more than balance, producing what I have started calling “confidently wrong outputs with excellent intentions.”
“The models are in a permanent tug-of-war between reflecting reality, improving it, and occasionally rewriting it entirely. None of that appears in the product marketing.”
The Real-World Cost When This Hits Finance
Let me move from the abstract to the specific, because CFA-adjacents deal in specifics.
In 2019, Apple Card, issued through Goldman Sachs, drew viral complaints that husbands were being granted credit limits 10 to 20 times higher than their wives, even when the wives had equal or better credit scores. Steve Wozniak reported that his wife received one-tenth of his limit despite their fully shared finances. Basecamp co-founder David Heinemeier Hansson went viral reporting a 20x gap with his wife.
The New York Department of Financial Services investigated for 16 months, reviewed over 400,000 applications, and found no evidence of unlawful gender discrimination. Its reasoning: the model relied on legitimate factors such as income, not gender.
Which is, of course, precisely the point.
“AI doesn’t need to use gender to discriminate by gender. It just needs to use variables that correlate with gender and let the mathematics handle the rest.”
Proxy discrimination is harder to see, harder to prove, and — as this case demonstrated — harder to prosecute. The model can be entirely gender-blind and still produce a gender-skewed outcome, and your compliance team will find this very difficult to explain to a journalist.
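If you want to see the mechanics without waiting for a regulator, here is a minimal sketch in Python. Every number in it is invented: the income distributions, the limit formula, all of it. Nothing below reflects the actual Apple Card model. It exists only to show how a model with no access to gender can still produce a gendered outcome.

```python
# A minimal sketch of proxy discrimination on synthetic data.
# All distributions and coefficients are invented placeholders.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical world: gender never enters the model, but reported
# individual income correlates with gender (e.g., household income
# booked under one spouse's name).
gender = rng.choice(["F", "M"], size=n)
income = np.where(gender == "M",
                  rng.lognormal(mean=11.2, sigma=0.5, size=n),
                  rng.lognormal(mean=10.8, sigma=0.5, size=n))
credit_score = rng.normal(700, 50, size=n)  # identical distribution for everyone

# A "gender-blind" limit formula: only income and score go in.
limit = 0.2 * income + 40 * (credit_score - 600)

for g in ("F", "M"):
    print(g, round(limit[gender == g].mean()))
# Average credit scores are identical, yet mean limits diverge:
# income is doing the work gender is not allowed to do.
```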
Amazon’s résumé-screening tool, reported by Reuters in 2018, offers an equally instructive example. Trained on a decade of résumés from a predominantly male engineering workforce, the model systematically downgraded candidates associated with women’s colleges and penalized résumés containing the word “women’s,” as in “women’s chess club captain.” Amazon scrapped the tool when the flaw was discovered.
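The mechanism is embarrassingly easy to reproduce on toy data. The sketch below uses invented résumés, invented historical labels, and a plain-vanilla scikit-learn classifier. It is emphatically not Amazon’s system, just the shape of the failure.

```python
# Hypothetical illustration: a screener trained on skewed historical
# outcomes learns a proxy penalty. Résumés and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

resumes = [
    "captain of chess club, python, ml research",           # historically hired
    "python, ml research, hackathon winner",                # historically hired
    "captain of women's chess club, python, ml research",   # equally strong,
    "women's coding society lead, python, ml research",     # but not hired
] * 50  # replicate so the model has something to fit
labels = [1, 1, 0, 0] * 50  # 1 = hired in the historical data

vec = CountVectorizer()
X = vec.fit_transform(resumes)
clf = LogisticRegression().fit(X, labels)

# Inspect what the model learned about the token "women".
weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))
print(round(weights["women"], 2))  # strongly negative: the proxy penalty
```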
The unsettling question is how many similar tools are still quietly running at other companies, narrowing candidate pools by learning from historically skewed datasets — and calling it objectivity.
So Where Is Everyone? (Spoiler: Not Where You’d Hope)
I ended the webinar with one more poll. I showed the audience this framework — four quadrants mapping AI governance approaches by two axes: how much control you actually have over your AI’s behavior, and when you intervene.
[Webinar slide: the four-quadrant governance map. Its tool labels (Data Fixes, Model Rebuilds, Feature-Level Control, Policy-as-Code, Real-Time Intervention, Bias Detection, Model Monitoring, Audit Logs, Prompt Rules, Output Filters) are sorted into their quadrants below.]
The quadrants, for reference:
Bottom left — Monitoring & Reporting: Low control, reactive. Bias detection, audit logs, model monitoring. The tagline: “We tell you what went wrong.” The AI equivalent of checking your bank statement to find out you’ve been robbed.
Bottom right — Soft Guardrails: Real-time, but low control. Prompt rules and output filters. “We try to steer outcomes.” Like putting a polite sign on a cliff edge.
Top left — Traditional Retraining: High control, but slow and expensive. Data fixes and model rebuilds. “We fix the model later.” By which point the biased decisions have already gone out the door. And as Gemini demonstrated: sometimes retraining produces female popes.
Top right — Mechanistic Policy Enforcement: High control, real-time. Feature-level control, policy-as-code, real-time intervention. “We control every decision as it happens.” A toy sketch of what that can look like follows this list.
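Because “policy-as-code” is the least self-explanatory label on that chart, here is a deliberately toy Python sketch of the idea: rules evaluated on every decision before it leaves the system. The feature names, the forbidden list, and the review logic are invented placeholders, not a recommended policy.

```python
# Toy sketch of policy-as-code: rules enforced on every decision,
# in real time, before it ships. All names below are invented.
from dataclasses import dataclass

@dataclass
class Decision:
    applicant_id: str
    credit_limit: float
    features_used: frozenset  # which inputs the model actually touched

FORBIDDEN = {"gender", "marital_status"}  # hard ban: never in the decision path
WATCHLIST = {"shopping_category_mix"}     # known proxy risks: flag for review

def enforce(decision: Decision) -> Decision:
    if decision.features_used & FORBIDDEN:
        # Rule 1: a protected attribute in the path blocks the decision outright.
        raise PermissionError(f"blocked {decision.applicant_id}: protected attribute used")
    if decision.features_used & WATCHLIST:
        # Rule 2: proxy-risk features pass, but leave an auditable trail.
        print(f"flagged for review: {decision.applicant_id}")
    return decision

enforce(Decision("A-1", 12_000.0, frozenset({"income", "credit_score"})))
try:
    enforce(Decision("A-2", 60_000.0, frozenset({"income", "gender"})))
except PermissionError as err:
    print(err)
```

The architectural point: the check fires before the decision exists anywhere downstream. A monitoring dashboard, by contrast, meets the same decision only after it has shipped.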
Then I asked: which of these does your company actually use?
The results were, let’s say, clarifying.

82% of respondents selected “None / Not Aware.”
18% had soft guardrails. 6% traditional retraining. Zero percent — not a rounding error, a literal zero — reported active monitoring and reporting or mechanistic policy enforcement.
This is a room of CFA-adjacent finance professionals. People who stress-test models for a living. People who know what happens when you trust an output without understanding its inputs. And 82% of their companies are flying entirely without AI governance instrumentation.
“If that number doesn’t make you slightly uncomfortable, I’d gently suggest re-reading it.”
The Question Worth Asking on Monday Morning
I’m not listing five steps here, because the problem doesn’t fit in a checklist and you deserve better than the illusion of tidiness.
But here is what CFA charterholders are actually positioned to do — and it matters precisely because of who we are.
We are trained to interrogate models. To ask what assumptions are baked in, what the training data looked like, what happens when the regime changes. That’s not a technology skill. That’s an analytical discipline that most AI vendors are not accustomed to encountering from their clients.
So use it.
Ask your vendors what their training data actually was. Ask your responsible AI officers — or your CISO, or whoever owns this conversation at your firm — what quadrant your company is operating in. Not as a compliance exercise. As a risk management question. Because a model that discriminates by proxy is a model that produces liability, and liability is something we know how to price.
The governance gap in that poll wasn’t a technology problem. It was a conversation that hadn’t happened yet.
You are exactly the right people to start it.
Enjoyed this? Subscribe for weekly insights at the intersection of AI, finance, and the questions the hype machine forgot to ask.
Lynn Räbsamen, CFA is COO of Global Swiss Learning and author of Artificial Stupelligence: The Hilarious Truth About AI. She serves on the CFA Institute Advisory Board and speaks at CFA Institute LIVE conferences.
