Turning AWS Documentation into Gold: AI-Assisted Security Research

Published On: October 17, 2024

In security research, documentation is often the first stop for uncovering useful information. The key to finding security bugs usually comes from really understanding the platform, and the documentation holds a lot of that critical knowledge. With AWS, however, there’s an overwhelming amount of it—well over 150,000 pages.

At first, I relied on ChatGPT to help with the workload, but I quickly ran into problems. It was full of hallucinations and misleading information, sending me down unproductive paths. While it could speed up certain tasks, the outdated data and frequent inaccuracies made it frustrating to rely on. After spending a week manually scraping the AWS docs and uncovering a few security issues, I decided to try something different: leveraging AWS Bedrock to accelerate my security research.

I found AWS Bedrock's hosted Claude 3.5 Sonnet model, paired with knowledge bases, to be significantly more accurate than both OpenAI's ChatGPT 4o and Anthropic's own Claude 3.5 Sonnet, leading to the discovery of even more interesting security findings!

Introduction to Using Bedrock

AWS Bedrock is a relatively simple way to build and scale generative AI applications. It was much simpler than I had imagined, and if you are interested, I strongly recommend taking fifteen minutes out of your day for a test drive. After requesting access to the models, you can go right into the playgrounds for chat, text, or image generation.

The prompt was "Block an IP in AWS using a security group," a common trick question that, while technically possible, is far from ideal.

Bring Your Own Documentation (BYOD)

To BYOD with AWS Bedrock Knowledge Bases, all you need to do is upload your files to S3 and perform queries against them using foundation models. As of the time of writing, you can upload files in formats such as .txt, .md, .html, .doc/.docx, .csv, .xls/.xlsx, and .pdf. This process creates embeddings that your foundation model can leverage to provide more accurate answers. Essentially, this is what Retrieval-Augmented Generation (RAG) refers to: providing your prompt along with enhanced context from your knowledge base to the foundation model. Once your files are uploaded, simply navigate to AWS Bedrock, click on "Knowledge Bases," create a new knowledge base, specify your bucket, and sync your documentation with your new knowledge base. If you need a walkthrough, I recommend this helpful video I came across.

Uploading a markdown file to S3 containing "The secret to a happy marriage is a happy wife with lots of ice cream."

Once you have synced your knowledge base, you can use this data in your prompts and receive answers based on your files.
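
As a minimal sketch of what that query looks like in code, the snippet below uses boto3's bedrock-agent-runtime client; the knowledge base ID and model ARN are placeholders you would swap for your own.

    import boto3

    # Query a Bedrock knowledge base with RAG: retrieve relevant chunks,
    # then generate an answer with the chosen foundation model.
    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    response = client.retrieve_and_generate(
        input={"text": "What is the secret to a happy marriage?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "XXXXXXXXXX",  # placeholder knowledge base ID
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                            "anthropic.claude-3-5-sonnet-20240620-v1:0",
            },
        },
    )
    print(response["output"]["text"])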

Documentation Landscape

Great, now we can just upload all the documentation we like and get more accurate prompts! Not so fast. Unfortunately, many documentation sites, AWS included, are not open source and are difficult to get at; you will have to scrape them to retrieve HTML files to use for search. Some projects, such as azure-docs, terraform-aws-provider, steampipe, and many more, do provide access to the raw markdown files. Please consult your friendly legal team for more information regarding applicable laws in this area.

Scraping the AWS Documentation

What got me started down this path was accidentally stumbling across a misconfigured asset within the documentation, which made me wonder: if that was sitting in plain sight for so long, what else is lying around in the documentation? Remembering a talk at fwd:cloudsec on mining AWS documentation for security research, I had a blueprint for getting started: pull down all the sitemaps and recursively retrieve every URL within them (a minimal sketch follows the list below). After about a week of effort and lots of setbacks, I finally managed to get a workable local copy to use for my research. The setbacks included:

  • Sitemap parsing errors leading to an incomplete list of URLs
  • Sitemaps containing deprecated URLs and non-HTTPS links
  • A massive amount of automatically generated SDK documentation
  • Rate limiting myself to be polite; sincere apologies to the WAF team if you are reading this
  • Waiting three-plus hours between attempts to determine completeness
  • Deciding between the WARC and HTML formats
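
To give a flavor of the approach, here is a minimal sitemap-walking sketch; it assumes the top-level index lives at docs.aws.amazon.com/sitemap_index.xml and leaves out the retries, politeness delays, and URL filtering a real run needs.

    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def fetch(url: str) -> ET.Element:
        with urllib.request.urlopen(url) as resp:
            return ET.fromstring(resp.read())

    def walk(url: str, urls: set) -> None:
        root = fetch(url)
        # A <sitemapindex> points at more sitemaps; a <urlset> holds page URLs.
        for sitemap in root.findall("sm:sitemap/sm:loc", NS):
            walk(sitemap.text.strip(), urls)
        for page in root.findall("sm:url/sm:loc", NS):
            urls.add(page.text.strip())

    all_urls: set = set()
    walk("https://docs.aws.amazon.com/sitemap_index.xml", all_urls)
    print(f"Collected {len(all_urls)} URLs")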

Once I was done, I immediately started finding interesting skeletons in the closet. I have open sourced the work at SecurityRunners/awsdocs, but please note that I am not responsible for how it is used; be as courteous as possible or you will ruin it for everyone else. If you are attempting this yourself, please don't hesitate to reach out. You can find me on LinkedIn, the Cloud Security Forum Slack, or through our contact form.

Searching the AWS Documentation Locally

Now that I had a repeatable way of downloading a local copy of the AWS documentation, I had to search through nearly 4GB of files to get any use out of it, which is when I stumbled across ripgrep. Plain grep took around 35 seconds on average to go through all the documentation, whereas ripgrep completed the same query in roughly 0.5 seconds. The massive difference allowed me to perform any query I wanted against the documentation with little overhead.

HTML vs WARC

At first, I went with the WARC file format because it includes the URL within the file itself, giving me some ability to debug down the road. I found this particularly useful when performing local searches, since it provides the full URI of the source in question. However, many AI tools require HTML files for proper indexing, which led me to support both formats. It is best to use the HTML format for AI use cases and WARC for searching locally; you will see why further on in the research.

WARC Headers Overview
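
As a quick illustration of why those headers are handy, the sketch below uses the warcio library (my choice of reader, not a requirement) to pull the original URL out of each record.

    from warcio.archiveiterator import ArchiveIterator

    # Each response record carries the original URL in its WARC headers,
    # which a plain HTML snapshot would lose.
    with open("page.warc", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))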

ripgrep (rg)

One interesting thing I identified as part of the research was that bucket names prefixed with amzn-s3-demo- are reserved and cannot be created by anyone outside of AWS. I figured I would use this as a fun example to show off what you can do with the local copy. As you can see below, you can retrieve the files containing that specific string and the matching lines within them.

ripgrep query overview

Interesting Queries

  • Return a list of S3 buckets matching the service-accountid-region pattern, inspired by the Bucket Monopoly research: rg -o "[a-z0-9.-]+-\d{10,12}-[a-z]{2}-[a-z]+-\d" .
  • Return a list of documentation URLs containing an important message: rg "awsdocs-important" . -l | xargs -I {} rg "Warc-Target-Uri" {} | awk '{print $2}' | sort | uniq
  • Return a list of documentation URLs containing a warning message: rg "status-warning" . -l | xargs -I {} rg "Warc-Target-Uri" {} | awk '{print $2}' | sort | uniq

Building the Knowledge Base with Embeddings

Now that you have a local copy of the AWS documentation to use for security research, why not turbocharge your research by sprinkling a little AI on top? All you need to do is upload the local HTML files, located in aws_html/YYYY/MM/DD/, to an S3 bucket and create your knowledge base.
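
A minimal upload sketch, assuming a placeholder bucket name and the dated layout above:

    from pathlib import Path

    import boto3

    # Upload one dated HTML snapshot to S3 so Bedrock can index it.
    s3 = boto3.client("s3")
    bucket = "my-awsdocs-knowledge-base"  # placeholder bucket name
    snapshot = Path("aws_html/2024/09/26")

    for html_file in snapshot.rglob("*.html"):
        # Keep the dated path as the object key so snapshots stay distinct.
        s3.upload_file(str(html_file), bucket, html_file.as_posix())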

You might be asking: why take the effort when ChatGPT will scrape documentation sites for you with the right prompt? Well, I found it all too common for ChatGPT's knowledge to be out of date and for it to hallucinate based on partial knowledge of AWS services. While it's constantly getting better, embeddings increase the accuracy of your prompt's responses and help you identify where to look. Below, I'll demonstrate just how effective it is compared to other foundation models you may use on a regular cadence.

Enhancing Searches with Bedrock

Okay, so you made it all the way here; congratulations on having the patience to read this far! As a reward, you now get to see the juicy findings I was able to uncover as part of this research. Let's compare each foundation model against the same model backed by our embeddings.

Can You Create a Public DynamoDB Table?

The correct answer to this question is no, you cannot, because block public access is forcibly enabled across all AWS accounts. While you can write a resource-based policy that would make a table public, AWS will refuse to let you apply it thanks to automated reasoning (a sketch for verifying this yourself follows the comparison below). Let's see the comparison between OpenAI's ChatGPT 4o, Anthropic's Claude 3.5 Sonnet, and AWS Bedrock's Claude 3.5 Sonnet with AWS documentation embeddings.

ChatGPT 4o vs. Anthropic Claude 3.5 Sonnet vs. Bedrock with embeddings on Anthropic Claude 3.5 Sonnet
  • ChatGPT 4o advises of its knowledge cut-off and is unaware of DynamoDB's support for resource-based policies. In other answers it hallucinates based on partial knowledge, aware of the resource-based policy support but not of the forced block public access.
  • Anthropic Claude 3.5 Sonnet has legacy information claiming that you cannot apply resource-based policies to DynamoDB tables.
  • AWS Bedrock with embeddings on the foundation model Anthropic Claude 3.5 Sonnet provides the correct answer, if not a complete one, advising that AWS prevents this from occurring due to IAM Access Analyzer's block public access.
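
If you want to verify the behavior yourself, the hedged sketch below attempts to attach a wildcard-principal policy to a table; the ARN is a placeholder, and the expectation is that PutResourcePolicy rejects it.

    import json

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.client("dynamodb")
    table_arn = "arn:aws:dynamodb:us-east-1:111122223333:table/MyTable"  # placeholder

    # A wildcard principal is exactly what block public access exists to stop.
    public_policy = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "dynamodb:GetItem",
            "Resource": table_arn,
        }],
    })

    try:
        dynamodb.put_resource_policy(ResourceArn=table_arn, Policy=public_policy)
    except ClientError as err:
        # Expected: the policy is rejected before it is ever attached.
        print(err.response["Error"]["Message"])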

Interesting Security Bugs in AWS Documentation

If you thought this was just a clickbaity article with AI hype, you would be dead wrong. I wouldn't leave my readers without a good security finding or two. I'll just dive into the most interesting ones I stumbled across.

Do as I Say, Not as I Do

During my research I stumbled across over a hundred publicly listable buckets, which have since been reported; the ones with impact have been fixed. In this day and age, having a publicly listable bucket with unintended information disclosure is just inexcusable. Perhaps the team should use IAM Access Analyzer to review all Amazon-owned buckets and take appropriate inventory. While this effort is difficult and arduous, it goes to show that we as an industry need better practices in inventory management, and that you aren't the only one struggling with public buckets.
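
For your own estate, a sketch like the one below can surface public buckets through IAM Access Analyzer; the analyzer ARN is a placeholder, and it assumes an analyzer already exists in the account.

    import boto3

    # List public S3 bucket findings from an existing Access Analyzer.
    analyzer = boto3.client("accessanalyzer")
    paginator = analyzer.get_paginator("list_findings")

    pages = paginator.paginate(
        analyzerArn="arn:aws:access-analyzer:us-east-1:111122223333:analyzer/my-analyzer",  # placeholder
        filter={
            "resourceType": {"eq": ["AWS::S3::Bucket"]},
            "isPublic": {"eq": ["true"]},
        },
    )
    for page in pages:
        for finding in page["findings"]:
            print(finding.get("resource"))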

NoSuchBucket Containing Scripts

Having a local copy of the documentation also allows me to diff the documentation over time. This one is a classic: documentation referencing scripts hosted in buckets that can be taken over, which could lead to arbitrary code execution for anyone who runs them.

Screenshot of diff-ing documentation between September 19th and September 26th
Showing ownership of s3://aws-parallelcluster-pcluster/ bucket
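
A minimal sketch of the snapshot diffing, using Python's standard filecmp against two dated directories (the dates mirror the screenshot above):

    import filecmp

    # Compare two dated snapshots and report changed, added, and removed pages.
    old = "aws_html/2024/09/19"
    new = "aws_html/2024/09/26"

    def report(node: filecmp.dircmp, prefix: str = "") -> None:
        for name in node.diff_files:
            print(f"changed: {prefix}{name}")
        for name in node.right_only:
            print(f"added:   {prefix}{name}")
        for name in node.left_only:
            print(f"removed: {prefix}{name}")
        for sub_name, sub in node.subdirs.items():
            report(sub, f"{prefix}{sub_name}/")

    report(filecmp.dircmp(old, new))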

Reserved S3 Prefixes

One interesting thing I identified as part of my research was reserved bucket prefixes. You would not know about these unless you had tried creating a bucket referenced in the AWS documentation before. While only a few documentation references use this reserved prefix today, I would imagine it is the standard moving forward.

Showing that bucket names starting with amzn-s3-demo- are reserved, and buckets cannot be created with that name as their prefix

Cloud Security Historian

No matter how interesting my other findings were, finding historical videos and screen captures from the early AWS days in publicly listable buckets was by far the most interesting. Seeing how much AWS has evolved over the years, you forget just how terrible the console used to look. These findings were mostly located within publicly listable buckets I found at the beginning of this effort. Funnily enough, I also now own the s3://docs.aws.amazon.com bucket.

AWS Console Dating back to 2011
Security Credentials in 2011 allowing you to use x.509 certs
EMR Console Job flows in 2011

Instance Metadata History 2006-2023

Graphed the EC2 instance metadata endpoints from 2006-2023, extracted from WSDL files, as mermaid diagrams. Unfortunately, I lost the mermaid source locally.

History of instance metadata endpoints from 2006-2023

EC2 API Operation History 2006-2014

Graphed the EC2 API operations from 2006-2014, extracted from WSDL files, using mermaid.

History of EC2 API operations
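
A small sketch of the idea: pull the operation names out of a WSDL file with the standard library. The file name is a placeholder for whichever API version you are inspecting; diffing the output across versions gives the history.

    import xml.etree.ElementTree as ET

    WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

    # Each EC2 WSDL declares one <wsdl:operation> per API call it supports.
    tree = ET.parse("ec2-2014-02-01.wsdl")  # placeholder file name
    operations = sorted(
        op.get("name")
        for op in tree.getroot().findall(".//wsdl:portType/wsdl:operation", WSDL_NS)
    )
    print(f"{len(operations)} operations")
    for name in operations:
        print(name)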

Summary

Spending time going through the documentation is always a worthwhile effort, and AI can help you answer some of the hard questions you may have on a regular basis. In this article we went over using embeddings in AWS Bedrock, scraping the AWS documentation, leveraging ripgrep for fast searches on local disk, and some interesting security research along the way.

While it was great to ask some hard questions using Bedrock, it was an expensive endeavor, costing a few hundred dollars a month at least. While that may be worth it if you aren't the one paying the bills, I'll be deleting it until I need it again. Still, it was certainly a fun and worthwhile effort for the more accurate documentation queries I had during my research. I also ended up getting a free T-shirt from AWS's new VDP program; though my research started before the program was announced, I still managed to get at least one report in.

Thanks for reading and don't hesitate to reach out if you have any questions or concerns. Please subscribe to our newsletter if you are interested in this type of content.

Jonathan Walker
Founder and CEO, Security Runners