While building out Video To Blog, I have spent quite a lot of time wrestling with AI in an attempt to get a prompt to output a desired result consistently. I figured it would be good for me to document some of the challenges I have encountered thus far and how I have solved (or more accurately, worked around) them.
For Video To Blog I needed the ability to prompt AI with a given blog and video transcript and have the AI tell me which portion of the transcript corresponds to each particular blog section.
This was needed for two reasons...
When using AI to choose a relevant screenshot to add to a particular blog section, it is not economical to evaluate every possible frame in the video. It is much better to know which timeframe in the video the blog section refers to, which greatly narrows down the candidate screenshots the AI has to evaluate before choosing the best one.
When generating our medium and long blogs, we use multiple prompts where each prompt writes a particular section (or sections) of the blog. To write a section, the prompt needs the video transcript, and including the entire transcript (especially for videos over an hour long) becomes very inefficient and expensive.
Rather, it is best to only include the portion of the transcript that corresponds to the section the AI is about to write (which requires knowing which portion of the transcript corresponds to said section).
So it turns out this is actually a hard thing for AI to do accurately and consistently.
I believe this is for two reasons:
On my first go-around, my prompts looked something like this...
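(The sketch below is a paraphrase of the general shape rather than the verbatim prompt; the tag names and wording are placeholders.)

```
Here is a video transcript. Each line is prefixed with its timestamp:

<transcript>
[00:00:05] Hey everyone, in today's video we're going to look at...
[00:00:21] ...
</transcript>

Generate an outline for a blog post based on this video. For each section
of the outline, include the start and end timestamps of the portion of the
transcript that the section corresponds to.
```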
However, this gave poor results. The given timestamps were often completely wrong (i.e. not relevant to the section they were paired with), the outline would often not cover the entire transcript (for example, it would cover only 15 minutes of an hour-long transcript), and the output was very non-deterministic.
It is worth noting that for this particular task I was using Claude Haiku. Claude was the only option here (as opposed to GPT) because the entire transcript was needed and GPT's context window was not large enough for long videos. Haiku was also the only option in the Claude family that kept costs reasonable for the end consumer.
After spending many hours trying different prompts, transcript formats, timestamp formats, different ordering of inputs/instructions, and everything else I could think of, I was finally able to solve / work around this issue by doing two things...
It turns out that adding the transcript to the prompt twice (once with timestamps and once without) increased the accuracy substantially.
Specifically, it drastically helped with the issue where the given start and stop timestamps did not correspond to the given section.
My (unfounded) theory is that the timestamps hinder the AI from forming a coherent understanding of the transcript, so including the transcript without timestamps fixes (or drastically reduces) that issue.
This also helped (although did not completely fix) the other issue of the AI generating outlines that did not cover the entire transcript.
Here is what the prompt looked like after adding the transcript an additional time...
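(Again, a paraphrase of the shape rather than the exact wording; the key change is simply that the transcript now appears twice.)

```
Here is the transcript of a video:

<transcript>
Hey everyone, in today's video we're going to look at...
...
</transcript>

Here is the same transcript again, this time with a timestamp at the start
of each line:

<timestamped_transcript>
[00:00:05] Hey everyone, in today's video we're going to look at...
[00:00:21] ...
</timestamped_transcript>

Generate an outline for a blog post based on this video. For each section
of the outline, include the start and end timestamps of the portion of the
transcript it covers. The outline must cover the entire transcript, from
the first timestamp through to the last.
```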
While the previous solution greatly helped, it did not prevent the AI from generating outlines that did not cover the entire transcript.
Thankfully, this is easy to validate, so the workaround here was to simply check the AI output and, if the outline was too short, try again.
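In code terms, the check is nothing fancier than a retry loop around the outline call. (A rough sketch only; `generate_outline`, the section shape, and the 0.8 coverage threshold are illustrative placeholders, not the production code.)

```python
def outline_with_retry(generate_outline, transcript, video_length_seconds,
                       min_coverage=0.8, max_attempts=3):
    """Retry the outline prompt until the outline covers (most of) the video.

    `generate_outline` is assumed to wrap the Claude call described above and
    return a list of sections like {"title": ..., "start": ..., "end": ...}
    with times in seconds.
    """
    outline = []
    for _ in range(max_attempts):
        outline = generate_outline(transcript)
        covered = max((section["end"] for section in outline), default=0)
        # Accept the outline only if it reaches (close to) the end of the video.
        if covered >= min_coverage * video_length_seconds:
            break
    return outline
```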
Another issue I have run into is the unexpected difficulty of getting AI to reference a particular image when it is given multiple images.
For Video To Blog we have a feature that will automatically include screenshots into a generated blog.
To build this feature we split a blog into sections and for each blog section we give AI 20 possible screenshots (along with the blog section) and ask it to choose which screenshot would be best (based on relevance and quality) to include in the blog section.
The only model under consideration here was Claude Haiku. The reason is simply that GPT-4 (and the other Claude models) are too expensive for this task (and GPT-3.5 does not support images), which requires multiple prompts (up to 20), each consisting of 20 images. Furthermore, since our testing showed that no model would get this 100% correct, for this feature to provide practical value it would need to be cheap enough to run multiple times on a particular blog without breaking the bank.
It turns out that if you give AI a bunch of images and ask it to choose one based on some condition(s), it is usually pretty good at choosing an appropriate image, but getting the AI to tell you which of the 20 images it chose is surprisingly difficult.
The question here is how best to label the images so that, when you ask AI to choose the best image (out of the 20 given) based on some condition(s), it can accurately and consistently output "Image X" where X is a number from 1 to 20.
At first we tried using the format suggested in the Anthropic docs, but this did not bear much fruit. Given our prompt and the same set of images (and a temperature of 0), the AI would consistently give the same output (yay). However, simply changing the order of the images (and nothing else) would yield different results (boo). To dig in further, the prompt was adjusted to also give a description of the chosen image, and it became clear that the AI was choosing the right image in both cases but not outputting the correct reference (i.e. outputting "Image 1" when the description was actually of another image).
Furthermore, with more testing it became clear that even slight modifications to the prompt or image format (for example adding an extra blank line between the images) would completely change the result (again, with a temperature of 0).
This is what our initial prompt looked like...
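(Trimmed to 2 of the 20 images and paraphrased; the variable names are placeholders. Structurally it followed the interleaved "Image N:" labelling suggested in the Anthropic docs.)

```python
# Content blocks for a single Claude message (base64 data omitted).
content = [
    {"type": "text", "text": "Image 1:"},
    {"type": "image", "source": {"type": "base64",
                                 "media_type": "image/jpeg",
                                 "data": screenshot_1_base64}},
    {"type": "text", "text": "Image 2:"},
    {"type": "image", "source": {"type": "base64",
                                 "media_type": "image/jpeg",
                                 "data": screenshot_2_base64}},
    # ... images 3 through 20 ...
    {"type": "text", "text": (
        "Here is a section of a blog post:\n\n" + blog_section + "\n\n"
        "Choose the image that is most relevant to this blog section and of "
        "the best quality. Respond with the image number, e.g. \"Image 4\"."
    )},
]
```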
I was hoping this problem could be solved with some prompt engineering magic, so I tried everything from adding various system prompts, to many different image formats (XML tags, different labelling, etc.), drawing the image number on the image itself (which helped, but not enough to justify the extra wait time of editing all the images in real time), changing the order of the instructions / inputs, different image resolutions, getting the AI to describe the images first (works great for the first 5 images, but not the rest), and more.
This problem is still under active investigation (see below for the actual solution), but after much trial and error we have our prompt working fairly well and have included a workaround that seems sufficient for now.
This is less of a solution and more of just "this is the best prompt we have got so far and it works pretty okay".
After lots of testing our prompt looks something like this...
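(Paraphrased rather than quoted; the tag names and exact verbiage below are illustrative.)

```
You have been given 20 images. Each image is numbered and labelled with
"Image N:" directly above it.

<blog_section>
...
</blog_section>

Choose the image that is most relevant to the blog section above and of
the best visual quality. Also choose two backup images in case the first
choice is not suitable.

Respond in exactly this format:

<reason>One sentence explaining why the chosen image fits this section</reason>
<image>number of the chosen image</image>
<backup_images>numbers of the two backup images</backup_images>
```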
Basically, some XML tags were added, some verbiage was adjusted, we asked the AI to generate a reason (which is not used, but helped increase accuracy), explicitly told the AI there were 20 images, asked the AI to choose backup images (see Work Around 2 below), and made some other minor changes.
Also, the following system prompt was added...
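(Again paraphrased; treat the wording as illustrative of the intent rather than the exact system prompt.)

```
You are an assistant that selects the most relevant, highest quality
screenshot for a section of a blog post. Always refer to an image by the
exact number it is labelled with, and never mix up the numbers of
different images.
```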
Knowing that the AI was still going to sometimes select screenshots that were not relevant, we adjusted the prompt to have the AI select some additional backup choices. We show these backups in our UI and make it really easy for users to see the other options and select one if it is more appropriate.
Although it is not a viable option for Video To Blog at this time due to cost, I did some testing with Claude Sonnet and got much better results.
I think this issue might just be a Haiku limitation at this time (I really hope I am wrong though).
I have not tried GPT-4 or Opus yet, but I would expect better results with those models.
AI is great, but it struggles with some simple things sometimes. Hopefully someone finds this useful. If so, I will try to continue to document more of my struggles / findings. Thanks for reading!