Does GPT-4o suck?
INSIDE: OpenAI ChatGPT, GPT-4o Multimodality, AI Hackathon
I'm diving into my next hackathon with an ambitious plan: creating a custom GPT for Dovetail. Totally winnable, right? Well, I've got a talented team backing me up, but as the sole engineer, the pressure is on.
I pitched an incredibly outrageous idea for our GPT's capabilities, but now I'm stuck. Can a GPT even do what I envision? Today, I'm running an experiment to test the absolute limits of ChatGPT using the GPT-4o model.
Spoiler alert: It didn't go exactly as planned.
Long Videos
GPT-4o has multimodal capabilities, including the ability to process video and audio as part of the prompt.
But can it handle a long video? Can it summarize the video, identify key themes, and highlight important points under each theme?
Of course, I had to test it with an Alex Hormozi video. It's 85 minutes long, and (as always) packed with valuable content.
I started out with a classic role-then-task prompt. The task is simple: extract all useful information from the video and organize it properly.
At first, ChatGPT struggled to extract the audio from the entire video. It then decided to break the video into smaller parts, which worked!
But here's the strange part. It tried to transcribe the audio segments. Since GPT-4o is multimodal, it should have been able to process the audio directly without needing transcription.
What ChatGPT did instead was write and execute some code to transcribe the audio. The problem was that this approach wasn't fully supported in ChatGPT yet. Transcription requires an external library, and without proper integration, the process failed.
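For the curious, the fallback it kept reaching for looked roughly like this. This is just a sketch of the idea, assuming pydub for chunking and the Whisper API for transcription; ChatGPT's sandbox has no internet access and no speech-to-text library installed, which is exactly why this kind of approach falls over there.

```python
# Sketch of the chunk-then-transcribe fallback ChatGPT attempted.
# Assumes pydub (needs ffmpeg) and the OpenAI Whisper API; neither is
# usable from ChatGPT's sandbox, so this only works when run locally.
from pydub import AudioSegment
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks to stay under upload limits

audio = AudioSegment.from_file("hormozi_85min.mp4")  # hypothetical filename
transcript_parts = []

for start in range(0, len(audio), CHUNK_MS):
    chunk = audio[start:start + CHUNK_MS]
    chunk.export("chunk.mp3", format="mp3")
    with open("chunk.mp3", "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    transcript_parts.append(result.text)

full_transcript = "\n".join(transcript_parts)
```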
In the end, ChatGPT gave up. It asked me to transcribe the video manually and upload the transcription directly.
Let’s try something else. What if I upload an audio-only version? Maybe it’ll have an easier time processing that.
I know, I know. Wishful thinking, but we all know what’s going to happen.
As usual, the upload itself wasn't the issue. It gave me the same response: it tried to chunk the audio into smaller segments and transcribe each one. Which, of course, failed.
This made me wonder. Did OpenAI lie about GPT-4o’s capabilities? Or was it just a fluke because of the content length? It must’ve been the length, right? So, I uploaded a shorter video.
Short Videos
I decided to use another Alex Hormozi video, but this time it’s only 15 minutes long.
With the exact same prompt, this is how it went.
What’s cool is that the GPT skipped the part about extracting the audio and transcribing it. It straight up told me this ain’t gonna work.
Uhm, yeah, that’s fair.
Naturally, I asked: Why? Why can’t you analyze this short video?
The reasons were the same as before. It failed to process “large” files, so it chunked the audio and attempted transcription. But none of this explained why GPT-4o’s multimodality never kicked in.
Then it hit me. This Reddit post reminded me of OpenAI’s history: great go-to-market strategies BUT painfully slow rollouts.
GPT-4o’s multimodal feature isn’t out yet! It might be available via API, but not in ChatGPT. It’s disappointing. I expected a powerful model ready for Custom GPTs right out of the box. But, oh well.
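For what it’s worth, if the audio path really is API-only, my best guess at what it looks like is below. Treat it as a sketch under assumptions: an audio-capable preview model (gpt-4o-audio-preview) and the input_audio content part, and availability depends on your account and the current API docs. None of this is something a Custom GPT inside ChatGPT can use.

```python
# Sketch: passing audio to GPT-4o through the API instead of ChatGPT.
# Assumes the gpt-4o-audio-preview model and the input_audio content
# type; check the current API docs before relying on either.
import base64
from openai import OpenAI

client = OpenAI()

with open("hormozi_15min.mp3", "rb") as f:  # hypothetical filename
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this talk: key themes, then bullet points under each."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }],
)
print(response.choices[0].message.content)
```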
Okay, so half of the features I want for my GPT aren’t feasible anymore. But let’s try one more thing.
Survey Responses
I wanted to test how ChatGPT handles spreadsheet data. I found an old survey from when I organized a bootcamp back in uni. It was about gauging student interest in a university-wide programming bootcamp.
The spreadsheet has three key data points I want to extract and analyze:
Students' experience in Software Development,
How they plan to use the knowledge gained, and
Their availability
So, I gave ChatGPT a simple task: figure out the best way to run this bootcamp.
It started by describing what data points are in the spreadsheet.
It then moved into analysis mode. It was awesome how it described its plan to me first before executing it.
It gave me a summary of the potential attendees' experience levels.
As you can see, despite most of them being from Computer Science, only a few had actual experience in software development.
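If you wanted to reproduce that summary yourself, it’s a couple of lines of pandas. A sketch, with hypothetical column names; the real survey headers were messier.

```python
# Sketch: the same experience-level summary, done by hand in pandas.
# "Major" and "Experience" are hypothetical column names for illustration.
import pandas as pd

df = pd.read_excel("bootcamp_survey.xlsx")  # hypothetical filename

# How many respondents fall into each experience bucket, split by major
summary = pd.crosstab(df["Major"], df["Experience"])
print(summary)
```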
It also gave me a few recommendations for the bootcamp. They were spot on except for one. Since we had some attendees who weren’t CS majors, tailoring the content just for CS majors didn’t make sense.
It even went above and beyond and asked if it could break down how the students plan to apply their newly gained skills. That breakdown was helpful for figuring out what kinds of exercises to give them.
I also asked a few more questions about structuring the bootcamp curriculum. Not bad, honestly. But this was expected from ChatGPT.
The real jaw-dropping moment was how it handled the student availability data. One of my biggest regrets when organizing this bootcamp was the messy survey. Each student listed a series of dates they were available, making it a nightmare to consolidate manually.
But ChatGPT did all that for me, giving me a detailed breakdown of each student's availability.
The first part was a bit odd, but it counted how many respondents were available on certain dates. The useful part was that it highlighted the most common availability and suggested the top three days for the bootcamp.
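That consolidation is also easy to script once you know it’s the right move. Here’s a rough sketch, again with a hypothetical Availability column holding comma-separated dates per respondent.

```python
# Sketch: consolidate per-student availability into a date-level count
# and pick the three most common days. "Availability" is a hypothetical
# column of comma-separated dates, one cell per respondent.
import pandas as pd

df = pd.read_excel("bootcamp_survey.xlsx")  # hypothetical filename

dates = (
    df["Availability"]
    .str.split(",")          # one list of dates per respondent
    .explode()               # one row per (respondent, date)
    .str.strip()
)

counts = dates.value_counts()
print(counts)                          # how many students are free on each date
print(counts.head(3).index.tolist())   # top three candidate days for the bootcamp
```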
Overall, I was extremely satisfied. I just wish this existed six years ago—it would’ve made my life as an organizer so much easier.
Final Thoughts
I have to admit, I’m extremely disappointed with the lack of GPT-4o's multimodal capabilities in ChatGPT. This limitation has made the development of my custom GPT much harder than expected. Processing videos and audio was a key part of my plan, and without it, my idea feels weak af.
But all hope is not lost. ChatGPT still excels at deep thematic analysis of spreadsheet data, which is still crucial for my idea. So, while I’m facing setbacks, it’s absolutely not the end of the road.
So yeah, wish me luck!
If you’re enjoying this, can you do me a favor and forward it to a friend? Thanks.