<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Jordan Choo</title><description>Jordan Choo builds AI and SEO tools, writes about growth systems, and helps brands improve analytics, automation, and organic performance.</description><link>https://jordanchoo.com/</link><language>en-us</language><atom:link href="https://jordanchoo.com/rss.xml" rel="self" type="application/rss+xml"/><item><title>Building a GraphRAG Agent From The Ground Up</title><link>https://jordanchoo.com/blog/building-a-graphrag-agent-from-the-ground-up/</link><guid isPermaLink="true">https://jordanchoo.com/blog/building-a-graphrag-agent-from-the-ground-up/</guid><description>The more I dive into Claude Code and AI, the more I realize how much I have to learn about the systems and infrastructure that power them.</description><pubDate>Wed, 04 Feb 2026 17:38:08 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/seo/graph-visualization-1024x464.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The more I dive into Claude Code and AI, the more I realize how much I have to learn about the systems and infrastructure that power them.&lt;/p&gt;
&lt;p&gt;One piece of infrastructure that has always piqued my interest is &lt;a href=&quot;https://neo4j.com/docs/getting-started/graph-database/&quot;&gt;graph databases&lt;/a&gt; and using them for RAG applications (AKA &lt;a href=&quot;https://neo4j.com/blog/genai/what-is-graphrag/&quot;&gt;GraphRAG&lt;/a&gt;). The reason is that not only do they allow for more advanced querying but, as an SEO, it’s a concept that has had quite the buzz since 2012 thanks to &lt;a href=&quot;https://blog.google/products-and-platforms/products/search/introducing-knowledge-graph-things-not/&quot;&gt;Google&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;claude-code-graph-ccgraph&quot;&gt;Claude Code Graph (CCGraph)&lt;/h2&gt;
&lt;p&gt;So what did I build?&lt;/p&gt;
&lt;p&gt;Well, it all started thanks to a conversation with &lt;a href=&quot;https://www.linkedin.com/in/noahlearner&quot;&gt;Noah Learner&lt;/a&gt; about discovering new and useful Claude Code repos, a task that can be both overwhelming (thanks to the 5,000+ repos on GitHub that use the Claude Code topic) and highly personalized based on your workflow.&lt;/p&gt;
&lt;p&gt;After staring helplessly into the void for about 5 minutes on how to do this, I started poking around &lt;a href=&quot;https://api.github.com/search/repositories?q=topic:claude-code&amp;#x26;sort=stars&amp;#x26;order=desc&amp;#x26;per_page=100&amp;#x26;page=1&quot;&gt;GitHub’s public API for repos with Claude Code as the topic&lt;/a&gt; and realized this was the perfect pet project to learn to build an agentic GraphRAG app end to end.&lt;/p&gt;
&lt;p&gt;And 6 days of intermittent work later, &lt;a href=&quot;https://ccgraph.jordanchoo.com/&quot;&gt;Claude Code Graph (CCGraph)&lt;/a&gt; was born!&lt;/p&gt;
&lt;h2 id=&quot;the-journey&quot;&gt;The Journey&lt;/h2&gt;
&lt;p&gt;Now that we know the what, let’s talk about the how.&lt;/p&gt;
&lt;p&gt;Prior to this, my only experience working with Claude Code was 2 tiny test projects (a personal finance landing page quiz and a proof-of-concept transcription pipeline). Thankfully, because of those two projects I already had a bit of a feel for how to approach development and had previously created &lt;a href=&quot;https://github.com/JordanChoo/claude-code-starter/releases/tag/v1.0.0&quot;&gt;v1.0 of Claude Code Starter&lt;/a&gt;, which I used as the starting point for &lt;a href=&quot;https://ccgraph.jordanchoo.com/&quot;&gt;CCGraph&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;milestone-0-understanding-graph-databases-and-graphrag&quot;&gt;Milestone 0: Understanding Graph Databases and GraphRAG&lt;/h3&gt;
&lt;p&gt;Before even starting development I did a deep dive into really understanding what graph databases and GraphRAG are and how they work.&lt;/p&gt;
&lt;p&gt;This consisted of having long in-depth multi-day conversations with both Gemini and Claude, watching countless YouTube videos, and going through articles and documentation.&lt;/p&gt;
&lt;p&gt;Why do this rather than jumping in blindly?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To prevent brain rot by guiding strategy while Claude led the execution&lt;/li&gt;
&lt;li&gt;I actually want to understand the underlying technology (not enough people do this!)&lt;/li&gt;
&lt;li&gt;Claude does some really stupid shit and will gaslight you into allowing it to do dumb things&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’re looking to learn more about graph databases and GraphRAG, these resources came in handy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Neo4j’s &lt;a href=&quot;https://graphacademy.neo4j.com/&quot;&gt;graph academy&lt;/a&gt;, &lt;a href=&quot;https://neo4j.com/blog/&quot;&gt;blog&lt;/a&gt;, &lt;a href=&quot;https://neo4j.com/resources/&quot;&gt;resources&lt;/a&gt;, and &lt;a href=&quot;https://www.youtube.com/neo4j&quot;&gt;YouTube channel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/@aiDotEngineer&quot;&gt;AI Engineer’s YouTube channel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/@ColeMedin&quot;&gt;Cole Medin’s YouTube channel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;deciding-on-graph-structure&quot;&gt;Deciding on Graph Structure&lt;/h4&gt;
&lt;p&gt;Based on the learnings from my deep dive and brainstorming with Claude on how users will interact with the app, I decided on the following structure:&lt;/p&gt;
&lt;p&gt;Four types of nodes would be used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Author - The person who owns the repository&lt;/li&gt;
&lt;li&gt;Repository - The repository itself&lt;/li&gt;
&lt;li&gt;Topic - Any tagged topic that a repository may have&lt;/li&gt;
&lt;li&gt;Section - A chunked, vector-embedded version of the README.md and Claude.md&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The edges (the way the nodes are connected to each other) would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;References - Relationship between a repo and another repo when it is mentioned by name&lt;/li&gt;
&lt;li&gt;Owned By - Relationship between the author and repo&lt;/li&gt;
&lt;li&gt;Has Topic - Relationship between a topic and repo&lt;/li&gt;
&lt;li&gt;Has Section - Relationship between the repo and the README.md/Claude.md&lt;/li&gt;
&lt;li&gt;Similar To - Relationship between two README.md/Claude.md chunks&lt;/li&gt;
&lt;li&gt;Related Topic - Relationship between two topics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/graph-database.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;With both the edges and nodes now mapped out, the LLM can easily traverse the graph, allowing it to find repos and topics that are connected or similar and present them with deep context for the user.&lt;/p&gt;
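&lt;p&gt;To make the structure concrete, here’s a minimal sketch of the schema as plain data along with a hypothetical Cypher statement builder; the actual labels and relationship names in CCGraph may differ:&lt;/p&gt;

```javascript
// A minimal sketch of the CCGraph schema as plain data. The labels and
// relationship names are my paraphrase of the post, not the real ones.
const NODE_TYPES = ["Author", "Repository", "Topic", "Section"];

const EDGE_TYPES = {
  REFERENCES: { from: "Repository", to: "Repository" },
  OWNED_BY: { from: "Repository", to: "Author" },
  HAS_TOPIC: { from: "Repository", to: "Topic" },
  HAS_SECTION: { from: "Repository", to: "Section" },
  SIMILAR_TO: { from: "Section", to: "Section" },
  RELATED_TOPIC: { from: "Topic", to: "Topic" },
};

// Build a (hypothetical) Cypher MERGE statement for one edge type.
function edgeToCypher(type) {
  const def = EDGE_TYPES[type];
  return `MATCH (a:${def.from}), (b:${def.to}) MERGE (a)-[:${type}]-&gt;(b)`;
}
```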
&lt;h3 id=&quot;milestone-1-initial-app--data-ingestions&quot;&gt;Milestone 1: Initial App &amp;#x26; Data Ingestion&lt;/h3&gt;
&lt;p&gt;The first step was collaborating with Claude on the initial PRD.&lt;/p&gt;
&lt;p&gt;I personally like taking a collaborative approach with the PRD development by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Explaining in detail what I want to build, the technology that should be used for it, and what I want the users to learn/get value from&lt;/li&gt;
&lt;li&gt;Going back and forth with Claude, asking and answering questions about the backend logic, user experience, and any other items&lt;/li&gt;
&lt;li&gt;Having Claude write an initial PRD and then creating a 2nd session within the same folder and asking it to &lt;em&gt;“unapologetically and ruthlessly to find edge cases, potential bugs, and items that lack clarity”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Ruthless Claude typically finds 15-20 items that need to be improved, which I then go through item by item with it to collaborate on a solution&lt;/li&gt;
&lt;li&gt;&lt;em&gt;You can take this a step further by having multiple sessions of Ruthless Claude re-review the updated PRD or even have a separate LLM (ChatGPT or Gemini) review it&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Once I’m happy with the solutions, Ruthless Claude updates the PRD and I have the original Claude start the build-out process, turning the refined PRD into an epic and individual issues&lt;/li&gt;
&lt;li&gt;After everything’s been pushed to GitHub, I let Claude Code start development in parallel, launching new agents whenever a new issue becomes unblocked&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/prd-example.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The initial PRD for &lt;a href=&quot;https://ccgraph.jordanchoo.com/&quot;&gt;CCGraph&lt;/a&gt; was focused on creating the foundational elements of the app which included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Backend server settings and UI&lt;/li&gt;
&lt;li&gt;Data ingestion pipeline&lt;/li&gt;
&lt;li&gt;Loading the data into Neo4j&lt;/li&gt;
&lt;li&gt;Outlining the chat proxy&lt;/li&gt;
&lt;li&gt;Developing error handling, security, and testing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The initial infrastructure used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://firebase.google.com/&quot;&gt;Firebase&lt;/a&gt; for hosting&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://neo4j.com/&quot;&gt;Neo4j&lt;/a&gt; as the graph database&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://vuejs.org/&quot;&gt;Vue&lt;/a&gt; for the framework&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://tailwindcss.com/&quot;&gt;Tailwind CSS&lt;/a&gt; to make it look pretty&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now that the easy part was done, it was time to start refining the chat responses.&lt;/p&gt;
&lt;h3 id=&quot;milestone-2-graphrag---prompt-workflows&quot;&gt;Milestone 2: GraphRAG - Prompt Workflows&lt;/h3&gt;
&lt;p&gt;The first iteration of &lt;a href=&quot;https://ccgraph.jordanchoo.com/&quot;&gt;CCGraph&lt;/a&gt; used a prompt workflow which looked like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User sends a query&lt;/li&gt;
&lt;li&gt;Neo4j is queried based on user’s query&lt;/li&gt;
&lt;li&gt;Response is then put into a system prompt&lt;/li&gt;
&lt;li&gt;LLM responds back to the user’s query&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/second-response.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Though it wasn’t perfect it kinda got the job done… not really…&lt;/p&gt;
&lt;p&gt;What I found is that the responses lacked relevancy and context for what the user was actually trying to learn from their query. The response was just spewing out repos willy-nilly.&lt;/p&gt;
&lt;p&gt;Three big things had to be changed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The user’s query had to be categorized so the app could figure out how to use Neo4j&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Initially Claude did this with regex but, we quickly migrated to using an LLM (re: Claude doing dumb things)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Structure for the Neo4j requests had to be implemented (e.g. sorting, filtering, aggregating)&lt;/li&gt;
&lt;li&gt;A relevancy score was developed for repos based on what the user was asking, blending section-level and repo-level cosine similarity&lt;/li&gt;
&lt;/ul&gt;
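&lt;p&gt;To illustrate the blended relevancy score, here’s a minimal sketch; the 0.6/0.4 weighting is purely illustrative, not the values CCGraph actually uses:&lt;/p&gt;

```javascript
// Sketch of a blended relevancy score: combine the best section-level
// cosine similarity with the repo-level similarity. The 0.6/0.4 split
// is an illustrative assumption, not CCGraph's real weighting.
function blendedScore(sectionSims, repoSim, sectionWeight = 0.6) {
  const bestSection = Math.max(...sectionSims);
  return sectionWeight * bestSection + (1 - sectionWeight) * repoSim;
}
```

A repo whose best section matches the query strongly can then outrank a repo that is only loosely similar overall.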
&lt;p&gt;After a lot of back and forth and prompt testing, the workflow moved to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User sends a query&lt;/li&gt;
&lt;li&gt;LLM categorizes the query&lt;/li&gt;
&lt;li&gt;Based on the category, Neo4j is queried&lt;/li&gt;
&lt;li&gt;Neo4j response is put into a system prompt&lt;/li&gt;
&lt;li&gt;LLM responds back to the user’s query&lt;/li&gt;
&lt;/ul&gt;
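&lt;p&gt;The updated workflow can be sketched roughly like this; the categories, keyword matching, and Cypher snippets are illustrative stand-ins for the real LLM categorizer and queries:&lt;/p&gt;

```javascript
// Sketch of categorize-then-query routing. categorizeQuery stands in
// for the LLM call; categories and Cypher are illustrative only.
function categorizeQuery(query) {
  const q = query.toLowerCase();
  if (q.includes("most starred") || q.includes("top")) return "ranking";
  if (q.includes("similar")) return "similarity";
  return "conceptual";
}

const CATEGORY_TO_QUERY = {
  ranking: "MATCH (r:Repository) RETURN r ORDER BY r.stars DESC LIMIT 10",
  similarity: "MATCH (s:Section)-[:SIMILAR_TO]-(t:Section) RETURN t",
  conceptual: "CALL db.index.vector.queryNodes(...)", // vector search fallback
};

function routeQuery(userQuery) {
  const category = categorizeQuery(userQuery);
  return { category, cypher: CATEGORY_TO_QUERY[category] };
}
```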
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/first-response.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now we’re starting to get somewhere!&lt;/p&gt;
&lt;p&gt;Responses started getting a heck of a lot closer to what you’d expect as a user but, I really wanted to understand what was happening behind the scenes so that I could continue testing and tweaking prompts.&lt;/p&gt;
&lt;p&gt;And so came along…&lt;/p&gt;
&lt;h3 id=&quot;milestone-3-tracing--observability&quot;&gt;Milestone 3: Tracing &amp;#x26; Observability&lt;/h3&gt;
&lt;p&gt;Not only was adding observability and tracing important from an ops and refinement perspective but, knowing that down the road I plan on building production-level AI apps, it was something I knew would be key to understand.&lt;/p&gt;
&lt;p&gt;Eventually, I came across &lt;a href=&quot;https://langfuse.com/docs/observability/overview&quot;&gt;Langfuse&lt;/a&gt;: not only has it been heavily tested by others and packed with other key features (prompt management, A/B testing, evals, and a lot more) but, it’s open source and can easily be deployed on &lt;a href=&quot;https://elest.io/open-source/langfuse&quot;&gt;Elestio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/prompt-session-scaled.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Implementation-wise it was a super easy one-prompt ask, simply telling Claude Code:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“I want to implement Langfuse’s observability and tracing (&lt;a href=&quot;https://langfuse.com/docs/observability/overview&quot;&gt;https://langfuse.com/docs/observability/overview&lt;/a&gt;) into the app for all prompts using the JS SDK (&lt;a href=&quot;https://github.com/langfuse/langfuse-js&quot;&gt;https://github.com/langfuse/langfuse-js&lt;/a&gt;)”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;And off it went, quickly implementing Langfuse in a couple of short minutes, only asking me for the API key along the way.&lt;/p&gt;
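&lt;p&gt;Conceptually, the tracing layer records something like the following per request; this is just the shape of a trace with nested spans, not the actual Langfuse SDK API:&lt;/p&gt;

```javascript
// Conceptual sketch of what tracing captures per request. This is NOT
// the Langfuse SDK API, just the idea: one trace per request, one span
// per workflow step, each with its input and output for later debugging.
function createTrace(name) {
  const spans = [];
  return {
    name,
    spans,
    span(spanName, input, output) {
      spans.push({ spanName, input, output, at: Date.now() });
    },
  };
}

// Usage: wrap each step of the workflow so failures can be inspected later.
const trace = createTrace("chat-request");
trace.span("categorize", "top repos?", "ranking");
trace.span("neo4j-query", "MATCH ...", "[10 rows]");
```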
&lt;p&gt;As I continued to test the prompts more and more the responses still felt a bit off and seemed to lack deep context and understanding of what was being asked.&lt;/p&gt;
&lt;p&gt;Digging more and more into the issue and going through resources I then naively thought “Let’s move this over to an agent, I’m sure it’ll be super easy…”&lt;/p&gt;
&lt;p&gt;…It was not&lt;/p&gt;
&lt;h4 id=&quot;prompt-workflows-vs-agents&quot;&gt;Prompt Workflows vs Agents&lt;/h4&gt;
&lt;p&gt;In case you aren’t aware, there’s a major difference between prompt workflows and agents.&lt;/p&gt;
&lt;p&gt;On one hand, prompt workflows are like assembly lines: quite linear, taking an input (the user query) through a pre-determined sequence of steps until a response comes back to the user.&lt;/p&gt;
&lt;p&gt;Agents, on the other hand, take a non-linear problem-solving approach where you feed them the user’s query and provide them with tools (e.g. pre-built database queries or 3rd party API endpoints) to respond back to the user in a dynamic way.&lt;/p&gt;
&lt;p&gt;If you want to learn more about the pros and cons of both, &lt;a href=&quot;https://www.confluent.io/learn/prompts-vs-workflows-vs-agents/&quot;&gt;Confluent has a great quick guide on them&lt;/a&gt;.&lt;/p&gt;
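&lt;p&gt;The difference can be sketched in a few lines; here &lt;em&gt;chooseAction&lt;/em&gt; is a hypothetical stand-in for the LLM deciding, at every step, whether to call a tool or answer:&lt;/p&gt;

```javascript
// Minimal sketch of an agent loop vs. a linear workflow: the model picks
// tools until it decides to answer. chooseAction stands in for the LLM.
function runAgent(query, tools, chooseAction, maxSteps = 5) {
  const observations = [];
  for (const step of Array(maxSteps).keys()) {
    const action = chooseAction(query, observations, step);
    if (action.type === "answer") return action.text;
    // Otherwise call the chosen tool and feed the result back in.
    observations.push(tools[action.tool](action.input));
  }
  return "No answer within the step budget";
}
```

A workflow would hard-code the sequence of steps; here the sequence emerges from the model's choices at runtime.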
&lt;h3 id=&quot;milestone-4-agent-overhaul&quot;&gt;Milestone 4: Agent Overhaul&lt;/h3&gt;
&lt;p&gt;Though, ultimately, it was the right decision (not just from a response quality standpoint but for learning), it required an entire architectural overhaul and a long time spent refining the prompt and tools used by the agent.&lt;/p&gt;
&lt;p&gt;Eventually, I landed on using &lt;a href=&quot;https://github.com/langchain-ai/langgraph&quot;&gt;LangGraph&lt;/a&gt; for the agent’s framework which (not so) coincidentally has a &lt;a href=&quot;https://langfuse.com/integrations/frameworks/langchain&quot;&gt;very tidy integration with Langfuse&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After a lot of going back and forth with Claude Code and some hand-holding, the migration over to an agent was complete but…&lt;/p&gt;
&lt;p&gt;…The responses from it were hot flaming trash&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/bad-agent-output.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Digging into the traces from Langfuse I realized two major things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I wasn’t providing enough tools for the agent to properly traverse the graph&lt;/li&gt;
&lt;li&gt;The agent prompt was garbage&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;building-out-a-toolset&quot;&gt;Building Out a Toolset&lt;/h4&gt;
&lt;p&gt;Initially the agent had two tools available to it:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vector search&lt;/strong&gt; which used the previously implemented blended score system to help answer conceptual and broader questions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structured query&lt;/strong&gt; that allowed the agent to search the graph for rankings, stats, filtering, and relationship traversal.&lt;/p&gt;
&lt;p&gt;After going back and forth with Claude Code, “Ruthless Claude”, and Gemini, we came up with a library of queries that a user may ask. Based on that, we identified a whole slew of gaps within the current tools.&lt;/p&gt;
&lt;p&gt;The structured query tool was expanded with 4 additional intents for pulling from the graph database, including references, similar repos, and related topics.&lt;/p&gt;
&lt;p&gt;README.md and Claude.md tools were added to allow users to do deep dives into how to use a specific repo and the workflows around it.&lt;/p&gt;
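&lt;p&gt;Put together, the expanded toolset looks roughly like this; the tool names and descriptions are my paraphrase of the post, and the handlers are stubs rather than real queries:&lt;/p&gt;

```javascript
// Sketch of the agent's tool registry after the expansion. Names,
// descriptions, and handlers are illustrative stand-ins.
const TOOLS = {
  vector_search: {
    description: "Blended section/repo similarity for conceptual questions",
    run: (query) => `vector results for: ${query}`,
  },
  structured_query: {
    description: "Rankings, stats, filtering, references, similar repos, related topics",
    run: (query) => `graph results for: ${query}`,
  },
  readme_lookup: {
    description: "Deep dive into a repo's README.md",
    run: (repo) => `README for ${repo}`,
  },
  claude_md_lookup: {
    description: "Deep dive into a repo's Claude.md and workflows",
    run: (repo) => `Claude.md for ${repo}`,
  },
};
```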
&lt;h4 id=&quot;improving-the-agent-prompt&quot;&gt;Improving The Agent Prompt&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/agent-prompt.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;As embarrassingly shown above, the initial prompt had a LOT of issues…&lt;/p&gt;
&lt;p&gt;Diving head first into the problem again with 3 of my favourite friends (Claude Code, “Ruthless Claude”, and Gemini), we slowly refined and tested the prompt to get to a place where the responses were much more useful.&lt;/p&gt;
&lt;p&gt;A few key changes that were made included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Providing a strategy (chain of thought) on how to approach a user’s query&lt;/li&gt;
&lt;li&gt;Describing each tool in greater detail along with labeled example questions&lt;/li&gt;
&lt;li&gt;Details on how to choose which tool to use and when&lt;/li&gt;
&lt;li&gt;Guidelines on how to deliver the response to the user in a helpful way&lt;/li&gt;
&lt;li&gt;Forcing no hallucinations&lt;/li&gt;
&lt;li&gt;Important rules that should always be followed&lt;/li&gt;
&lt;/ul&gt;
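&lt;p&gt;Assembling those pieces into a system prompt might look something like this; the wording is illustrative, not CCGraph’s actual prompt:&lt;/p&gt;

```javascript
// Sketch of how the improved agent prompt could be assembled from the
// parts listed above. All wording here is illustrative.
function buildAgentPrompt(toolDescriptions) {
  return [
    "## Strategy",
    "Think step by step: classify the question, pick a tool, then answer.",
    "## Tools",
    ...toolDescriptions.map(
      (t) => `- ${t.name}: ${t.description}. Example: ${t.example}`
    ),
    "## Response guidelines",
    "Be helpful and concise; cite repos by name.",
    "## Rules",
    "Never invent repos or stats; only use tool output.",
  ].join("\n");
}
```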
&lt;p&gt;After rolling out the new and improved tools along with a shiny new prompt, the agent was finally starting to be useful for users (at least for me).&lt;/p&gt;
&lt;h4 id=&quot;adding-response-feedback&quot;&gt;Adding Response Feedback&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/langfuse-feedback.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Though I’m happy with how the agent is performing, it’s still a long way from perfect.&lt;/p&gt;
&lt;p&gt;To help with this, and to get more experience with production-ready AI apps, scoring and feedback were added. This allows users to give the agent’s response a thumbs up or down along with a reason why the response was bad.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/langfuse-scores.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The results from the feedback are collected in Langfuse, where they can be used both to debug the agent and as evals for future tests.&lt;/p&gt;
&lt;h3 id=&quot;milestone-5-ux-overhaul--graph-view&quot;&gt;Milestone 5: UX Overhaul &amp;#x26; Graph View&lt;/h3&gt;
&lt;p&gt;The next step was polishing the UX of the app and making it look half decent.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/blog/design-system.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Severely lacking any sense of design and creativity, I fed &lt;a href=&quot;https://www.pencil.dev/&quot;&gt;Pencil&lt;/a&gt; my personal website (both URL and screenshots) as a baseline and had it build out a design system that was practically copied and pasted into &lt;a href=&quot;https://ccgraph.jordanchoo.com/&quot;&gt;CCGraph&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&quot;building-the-graph-visualization&quot;&gt;Building the Graph Visualization&lt;/h4&gt;
&lt;p&gt;Now that I had a half-decent look, the next step was to build out a graph visualization to allow users to visualize repos, authors, and topics and explore them further.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://jordanchoo.com/images/seo/graph-visualization-scaled.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Using &lt;a href=&quot;https://d3js.org/&quot;&gt;D3 as the library&lt;/a&gt;, a visualization of the LLM’s first response was integrated into the main chat interface as a sidebar.&lt;/p&gt;
&lt;p&gt;The initial graphs contained a lot of orphan nodes regardless of type (repo, author, and tag), which required going back and forth with Claude to identify the reason why and create a set of conditions for how each node should be displayed and interacted with.&lt;/p&gt;
&lt;p&gt;Ultimately, I ended up with a graph visualization where users can click into each node and find repos that fit within a certain topic or author, pre-populate a query about a specific repo, or even visit a repo’s GitHub page.&lt;/p&gt;
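&lt;p&gt;The orphan-node cleanup boils down to something like this before handing the data to D3; this is a sketch of the idea, not CCGraph’s actual code:&lt;/p&gt;

```javascript
// Sketch of orphan-node filtering for a force-directed graph: keep only
// nodes that appear in at least one link. Field names are assumptions.
function dropOrphans(nodes, links) {
  const connected = new Set();
  for (const link of links) {
    connected.add(link.source);
    connected.add(link.target);
  }
  return nodes.filter((n) => connected.has(n.id));
}
```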
&lt;h3 id=&quot;milestone-6-deployment&quot;&gt;Milestone 6: Deployment&lt;/h3&gt;
&lt;p&gt;With the development done, it was time to deploy on Firebase, and it could not have been easier.&lt;/p&gt;
&lt;p&gt;Thankfully, Claude Code makes it super simple, taking you by the hand and walking you step by step through what it’s doing and what it needs from you along the way.&lt;/p&gt;
&lt;h2 id=&quot;learnings-and-whats-next&quot;&gt;Learnings and What’s Next&lt;/h2&gt;
&lt;p&gt;Though it’s not the sexiest or most useful of apps, unlike &lt;a href=&quot;https://theseocommunity.com/resources/blog/building-ai-search-what-we-learned-along-the-way&quot;&gt;other ones that I’ve seen rolled out&lt;/a&gt;, it was the perfect excuse to learn new technology and get more comfortable with Claude Code.&lt;/p&gt;
&lt;p&gt;With that being said, a few key learnings that I had from my journey include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you have garbage responses, it’s because you have garbage prompts&lt;/li&gt;
&lt;li&gt;Create a ./tmp directory that doesn’t get included in git (include it in .gitignore) to have Claude save mini-PRDs, prompts, and other items that you can then edit or feed into other LLMs/sessions for feedback on&lt;/li&gt;
&lt;li&gt;Workflows and agents are not the same, and the way you prompt them and feed them context needs to be approached differently&lt;/li&gt;
&lt;li&gt;If you want to integrate a tool or library include links directly to the documentation and repos to make Claude’s life easier&lt;/li&gt;
&lt;li&gt;Take time to research, learn, and plan what you want to do rather than blindly trusting in Claude&lt;/li&gt;
&lt;li&gt;Observability is KEY when building AI powered applications&lt;/li&gt;
&lt;li&gt;Graph databases and GraphRAG come off as more intimidating than they actually are&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As far as what’s next, &lt;a href=&quot;https://ccgraph.jordanchoo.com/&quot;&gt;CCGraph&lt;/a&gt; will stay live and occasional updates will be rolled out based on insights from Langfuse and any mad scientist ideas I may get.&lt;/p&gt;
&lt;p&gt;On my side it’s a few things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using the learnings on graph databases and GraphRAG to build out production-grade applications that drive enterprise value&lt;/li&gt;
&lt;li&gt;Diving deeper into Claude Code by exploring tools like &lt;a href=&quot;https://github.com/steveyegge/beads/&quot;&gt;Beads&lt;/a&gt;, &lt;a href=&quot;https://github.com/ruvnet/claude-flow&quot;&gt;Claude Flow&lt;/a&gt;, &lt;a href=&quot;https://github.com/Dicklesworthstone/agentic_coding_flywheel_setup&quot;&gt;Agentic Flywheel&lt;/a&gt;, and &lt;a href=&quot;https://github.com/steveyegge/gastown&quot;&gt;Gas Town&lt;/a&gt; to name a few&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Have thoughts? Leave a comment below&lt;/p&gt;
&lt;iframe src=&quot;https://www.linkedin.com/embed/feed/update/urn:li:share:7424872409771528192?collapsed=1&quot; height=&quot;600&quot; width=&quot;504&quot; frameborder=&quot;0&quot; allowfullscreen title=&quot;Embedded post&quot;&gt;&lt;/iframe&gt;</content:encoded></item><item><title>Finding Internal Link Opportunities at Scale with Vector Search</title><link>https://jordanchoo.com/blog/finding-internal-link-opportunities-at-scale-with-vector-search/</link><guid isPermaLink="true">https://jordanchoo.com/blog/finding-internal-link-opportunities-at-scale-with-vector-search/</guid><description>It seems that all the rage when it comes to AI and SEO has been around using it for some form of text generation. But, one of the most interesting features that I have yet to see really discussed is the usage of embeddings and vector search</description><pubDate>Fri, 08 Dec 2023 21:32:43 GMT</pubDate><content:encoded>&lt;p&gt;It seems that all the rage when it comes to AI and SEO has been around using it for some form of text generation. But, one of the most interesting features that I have yet to see really discussed is the usage of embeddings and vector search.&lt;/p&gt;
&lt;h2 id=&quot;what-are-emebddings&quot;&gt;What are Embeddings?&lt;/h2&gt;
&lt;p&gt;To understand what vector search is, you first need to know what embeddings are.&lt;/p&gt;
&lt;p&gt;Embeddings are essentially the translation of bodies of text (which I&apos;ll call documents) into numbers, which allows algorithms to better understand the content of the document.&lt;/p&gt;
&lt;p&gt;These documents could be as short as an H1 to as long as an in-depth article.&lt;/p&gt;
&lt;h2 id=&quot;what-is-vector-search&quot;&gt;What is Vector Search?&lt;/h2&gt;
&lt;p&gt;Once you have these embeddings (i.e. number representations of your documents), a vector search compares those numbers against other numbers (i.e. compares documents against each other) to measure how similar they are.&lt;/p&gt;
&lt;p&gt;The higher the similarity score, the more likely the documents are related.&lt;/p&gt;
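&lt;p&gt;In code, that comparison is typically cosine similarity between two embedding vectors; here&apos;s a minimal implementation:&lt;/p&gt;

```javascript
// Cosine similarity between two embedding vectors: 1 means the same
// direction (very similar documents), 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i !== a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```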
&lt;p&gt;&lt;em&gt;If you&apos;d like to dive deeper into the nitty gritty details of how vector search works, you can &lt;a href=&quot;https://openai.com/blog/introducing-text-and-code-embeddings&quot;&gt;read more about it on OpenAI&apos;s blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;why-use-vector-search-for-internal-links&quot;&gt;Why Use Vector Search for Internal Links?&lt;/h2&gt;
&lt;p&gt;So why the heck should you use vector search instead of using something like ScreamingFrog + regex?&lt;/p&gt;
&lt;p&gt;Well... instead of trying to find cases of whether a keyword is on a page or not, you&apos;re now able to find opportunities based on semantic similarity. In plain English that means you can flag internal links based on topical similarity.&lt;/p&gt;
&lt;h2 id=&quot;how-to-find-internal-linking-opportunities&quot;&gt;How To Find Internal Linking Opportunities&lt;/h2&gt;
&lt;p&gt;The following sections provide a step-by-step breakdown of this &lt;a href=&quot;https://github.com/JordanChoo/semantic-links&quot;&gt;GitHub repo and how the script works&lt;/a&gt;. Please note that the repo is simply a proof of concept and would need to be refined further to be production-ready.&lt;/p&gt;
&lt;h3 id=&quot;1-exporting--prepping-your-documents&quot;&gt;1. Exporting &amp;#x26; Prepping Your Documents&lt;/h3&gt;
&lt;p&gt;In my case, WordPress is typically the go-to CMS for the clients that I work with, and the platform thankfully allows you to &lt;a href=&quot;https://wordpress.com/support/export/&quot;&gt;export all of the pages or posts as an XML document&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once exported, I parse the XML file into an easy-to-use JSON object and then strip all of the internal links from the text:&lt;/p&gt;
&lt;pre class=&quot;shiki shiki-themes github-light github-dark&quot; style=&quot;--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e&quot; tabindex=&quot;0&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Get XML file&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;let&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; articlesXml &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; await&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; fs.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;readFileSync&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;ARTICLE_POSTS&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;utf8&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Parse XML file to JSON&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;let&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; articlesJson &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; await&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; convertXml.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;xml2js&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(articlesXml, {compact: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, spaces: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, ignoreComment: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Map (HTML to text + strip internal links)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;let&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; formattedArticles &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; articlesJson.rss.channel.item.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;map&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;((&lt;/span&gt;&lt;span style=&quot;--shiki-light:#E36209;--shiki-dark:#FFAB70&quot;&gt;article&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    return&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; { &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;...&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;article,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        articleText: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;convertHtml&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(article[&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;content:encoded&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;]._cdata, {linkBrackets: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;false&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, ignoreHref: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    };&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;});&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;2-translate-your-documents-into-embeddings&quot;&gt;2. Translate Your Documents into Embeddings&lt;/h3&gt;
&lt;p&gt;Once you have the documents ready, you need to get embeddings for them (that is, translate them into numerical vectors).&lt;/p&gt;
&lt;p&gt;OpenAI provides an easy-to-use &lt;a href=&quot;https://platform.openai.com/docs/api-reference/embeddings/create&quot;&gt;Embeddings endpoint&lt;/a&gt;: you send it a document and it returns the embedding.&lt;/p&gt;
&lt;p&gt;You can see how to do that here:&lt;/p&gt;
&lt;pre class=&quot;shiki shiki-themes github-light github-dark&quot; style=&quot;--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e&quot; tabindex=&quot;0&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// OpenAI Vectorize + Push to Pinecone&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;for&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;let&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; article &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;; article &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; formattedArticles.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;length&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;; article&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;++&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Create embedding via OpenAI&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    let&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; embedding &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; await&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; openai.embeddings.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;create&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;({&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        model: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;text-embedding-ada-002&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        input: formattedArticles[article].articleText,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        encoding_format: &lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;float&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    });&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Add embedding data to JSON object&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    formattedArticles[article].embedding &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; embedding&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;3-save-your-embeddings-to-a-vector-database&quot;&gt;3. Save Your Embeddings to a Vector Database&lt;/h3&gt;
&lt;p&gt;Now that you have the embeddings, you can save them into a vector database; in my case I&apos;m using &lt;a href=&quot;https://www.pinecone.io/&quot;&gt;Pinecone&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You not only want to push the embeddings to Pinecone but also make sure the ID you&apos;re using can easily be cross-referenced (&lt;em&gt;pro tip: use the document&apos;s unique ID from your CMS as the ID in Pinecone&lt;/em&gt;). You may also want to include additional metadata about the document, such as its category or tags from your CMS.&lt;/p&gt;
&lt;pre class=&quot;shiki shiki-themes github-light github-dark&quot; style=&quot;--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e&quot; tabindex=&quot;0&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;  // Chunk the articles&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;  const&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt; chunkedArticles&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; formattedArticles.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;reduce&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;((&lt;/span&gt;&lt;span style=&quot;--shiki-light:#E36209;--shiki-dark:#FFAB70&quot;&gt;chunkedResults&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;--shiki-light:#E36209;--shiki-dark:#FFAB70&quot;&gt;article&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;--shiki-light:#E36209;--shiki-dark:#FFAB70&quot;&gt;index&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&gt;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; { &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Determine which chunk this article belongs to (50 per chunk)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    const&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt; chunkIndex&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; Math.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;floor&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(index&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;50&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Start a new chunk&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    if&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;!&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;chunkedResults[chunkIndex]) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        chunkedResults[chunkIndex] &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; [];&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Add the article to the chunk&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    chunkedResults[chunkIndex].&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;push&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(article)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    return&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; chunkedResults&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;}, []);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Target a Pinecone index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt; pineconeIndex&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; pinecone.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;index&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;PINECONE_INDEX&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Send the chunks to Pinecone&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;for&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt; chunk&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; of&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; chunkedArticles) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Create an empty embeddings array&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    let&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; embeddings &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; [];&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Push the embedding of each article to the embeddings array&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    for&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt; article&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt; of&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; chunk) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        embeddings.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;push&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;({&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;            id: article[&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;wp:post_id&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;]._text,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;            values: article.embedding.data[&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;].embedding,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;            metadata: {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;                category: article.category._cdata.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;toLowerCase&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;            }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;        });&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Push embedding to Pinecone&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;    await&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; pineconeIndex.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;upsert&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(embeddings);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;    // Provide confirmation of saving&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;    console.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;log&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;`Pushed ${&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;chunk&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;length&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;} article embeddings to Pinecone`&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Save data to a JSON file&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;fs.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;writeFileSync&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;./output/article-embeddings.json&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;JSON&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;stringify&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(formattedArticles));&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;4-compare-your-link-target-embedding-with-your-vector-database&quot;&gt;4. Compare Your Link Target Embedding with Your Vector Database&lt;/h3&gt;
&lt;p&gt;This is where the rubber finally hits the road. Take the WordPress post ID of the URL you&apos;re trying to find links to (I&apos;ll call this the target document), which should also be the ID of that document in Pinecone, and ask Pinecone for documents that are similar to it. In my case I am requesting the top 50 similar documents.&lt;/p&gt;
&lt;p&gt;Pinecone will then send back a slew of results, each with a score between 0 and 1, where 0 is irrelevant and 1 is identical.&lt;/p&gt;
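Under the hood, that score is typically a cosine similarity between the two embedding vectors (the exact metric depends on how your Pinecone index was configured). As a rough illustration of what is being computed, not Pinecone's actual implementation:

```javascript
// Cosine similarity: dot(a, b) divided by the product of the vector magnitudes.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

// Orthogonal (unrelated) vectors score 0; identical directions score ~1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
console.log(cosineSimilarity([1, 2, 3], [1, 2, 3])); // ≈ 1
```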
&lt;p&gt;The list is a great start, but next we need to filter it down to actual opportunities. I do this by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Excluding the target document itself&lt;/li&gt;
&lt;li&gt;Removing results that are below a certain score threshold (I recommend a minimum of 0.7)&lt;/li&gt;
&lt;li&gt;Removing results that are already linking to your target document&lt;/li&gt;
&lt;li&gt;Cleaning up the results into something that is human readable&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;// Get matched opportunities from Pinecone
let opps = await pinecone.index(PINECONE_INDEX).query({ topK: 50, id: TARGET_ARTICLE_ID})

// Get Target Article Info
let targetArticleInfo = formattedArticles.filter(function(target) {
    return target[&apos;wp:post_id&apos;]._text === TARGET_ARTICLE_ID
})

// Filter
let filteredOpps = opps.matches.filter(function(opp) {
    // Remove target article &amp;#x26; articles below the scoreThreshold
    return opp.id !== TARGET_ARTICLE_ID &amp;#x26;&amp;#x26; opp.score &gt;= SCORE_THRESHOLD;
})

// Merge Pinecone Results + WP Data
let finalOpp = filteredOpps
    // Keep only opportunities that exist in the WordPress export
    .filter(opp =&gt; formattedArticles.some(wp =&gt; wp[&apos;wp:post_id&apos;]._text === opp.id))
    // Add the WP link, title, category and HTML
    .map(opp =&gt; {
        const wp = formattedArticles.find(article =&gt; article[&apos;wp:post_id&apos;]._text === opp.id);
        return {
            targetUrl: targetArticleInfo[0].link._text,
            ...opp,
            link: wp.link._text,
            category: wp.category ? wp.category._cdata : &apos;&apos;,
            title: wp.title._cdata,
            htmlContent: wp[&apos;content:encoded&apos;]._cdata
        };
    })
    // Remove articles already linking to the target
    .filter(opp =&gt; !opp.htmlContent.includes(targetArticleInfo[0].link._text))
    // Drop fields that aren&apos;t needed in the CSV output
    .map(({ htmlContent, values, sparseValues, metadata, ...rest }) =&gt; rest);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;5-profit&quot;&gt;5. Profit!&lt;/h3&gt;
&lt;p&gt;Last but most definitely not least is saving the link opportunities into a nice and tidy CSV file so you can do a final manual spot check and start building those internal links:&lt;/p&gt;
&lt;pre class=&quot;shiki shiki-themes github-light github-dark&quot; style=&quot;--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e&quot; tabindex=&quot;0&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Save output as CSV&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;fs.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;writeFileSync&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;./output/opps-&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;TARGET_ARTICLE_ID&lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;&apos;.csv&apos;&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;--shiki-light:#D73A49;--shiki-dark:#F97583&quot;&gt;await&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; json2csv.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;json2csv&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(finalOpp));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#6A737D;--shiki-dark:#6A737D&quot;&gt;// Send success message&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;console.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#6F42C1;--shiki-dark:#B392F0&quot;&gt;log&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;`There were ${&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;finalOpp&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;length&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;} link opportunities found for the URL ${&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt; targetArticleInfo&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;[&lt;/span&gt;&lt;span style=&quot;--shiki-light:#005CC5;--shiki-dark:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;].&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;link&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;_text&lt;/span&gt;&lt;span style=&quot;--shiki-light:#032F62;--shiki-dark:#9ECBFF&quot;&gt;}`&lt;/span&gt;&lt;span style=&quot;--shiki-light:#24292E;--shiki-dark:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;closing-thoughts&quot;&gt;Closing Thoughts&lt;/h2&gt;
&lt;p&gt;In the testing I conducted, I found accuracy to be the biggest issue: some opportunities with high similarity scores were not topically relevant, while some lower-scoring opportunities were more relevant.&lt;/p&gt;
&lt;p&gt;A few ideas for improving the accuracy of your results include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.pinecone.io/docs/metadata-filtering&quot;&gt;Filtering results by metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Vectorizing and searching by page title rather than body content&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.pinecone.io/docs/weighting-sparse-and-dense-vectors&quot;&gt;Using hybrid search with weighted sparse and dense vectors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
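As an example of the first idea, metadata filtering is a small change to the query from step 4. This is only a sketch: `TARGET_ARTICLE_ID` and the `'seo'` category value are placeholders, and it assumes the `category` metadata attached during the upsert in step 3.

```javascript
// Sketch: restrict the similarity search to articles in the same category.
const TARGET_ARTICLE_ID = '123'; // placeholder WordPress post ID
const TARGET_CATEGORY = 'seo';   // placeholder category value

const query = {
  topK: 50,
  id: TARGET_ARTICLE_ID,
  filter: { category: { $eq: TARGET_CATEGORY } }, // only match same-category articles
  includeMetadata: true
};

// Then, as in step 4 (assuming an initialized Pinecone client):
// let opps = await pinecone.index(PINECONE_INDEX).query(query);
```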
&lt;p&gt;I hope you enjoyed the walkthrough and that it got your gears turning on how you can use embeddings and vector search in your day-to-day SEO tasks.&lt;/p&gt;</content:encoded></item><item><title>How to Save GoogleBot &amp; BingBot IP Addresses to BigQuery</title><link>https://jordanchoo.com/blog/how-to-save-googlebot-bingbot-ip-addresses-to-bigquery/</link><guid isPermaLink="true">https://jordanchoo.com/blog/how-to-save-googlebot-bingbot-ip-addresses-to-bigquery/</guid><description>Big Crawler IPs is part of a much larger initiative of providing SEOs and marketers with open source data warehousing tools</description><pubDate>Fri, 30 Jun 2023 19:10:07 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Big Crawler IPs is part of a much larger initiative of providing SEOs and marketers with open source data warehousing tools&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;As I dive deeper into the world of SEO, whether with my own personal projects or with the clients I work with, I&apos;m realizing that having a data warehouse is becoming more and more important.&lt;/p&gt;
&lt;p&gt;One data source that I&apos;ve found to be a big pain to consistently collect is log files, as they typically require server-level access and a developer.&lt;/p&gt;
&lt;p&gt;Thankfully, there is a super handy tool called &lt;a href=&quot;https://logflare.app/&quot;&gt;LogFlare&lt;/a&gt; that sits on top of &lt;a href=&quot;https://cloudflare.com/&quot;&gt;CloudFlare&lt;/a&gt; and takes care of all of the heavy lifting when it comes to &lt;a href=&quot;https://docs.logflare.app/backends/bigquery/&quot;&gt;collecting and storing log data in BigQuery&lt;/a&gt;, which happens to be my data warehouse of choice.&lt;/p&gt;
&lt;p&gt;As amazing as LogFlare is, it&apos;s simply a firehose of log data into your warehouse (the extracting and loading parts of &lt;a href=&quot;https://en.wikipedia.org/wiki/Extract,_load,_transform&quot;&gt;ELT&lt;/a&gt;). There isn&apos;t any filtering or transformation; that onus is on you (as it should be).&lt;/p&gt;
&lt;h2 id=&quot;why-build-this&quot;&gt;Why Build This&lt;/h2&gt;
&lt;p&gt;So if you&apos;re trying to build out an SEO data warehouse, one of the first filtering steps you should take with log files is &quot;authenticating&quot; the data to make sure that it is actually &lt;a href=&quot;https://developers.google.com/static/search/apis/ipranges/googlebot.json&quot;&gt;GoogleBot&lt;/a&gt; or &lt;a href=&quot;https://www.bing.com/toolbox/bingbot.json&quot;&gt;BingBot&lt;/a&gt; crawling your site rather than a tool such as SiteBulb or ScreamingFrog.&lt;/p&gt;
&lt;p&gt;To do this you have to rely on the IP address rather than the User-Agent. Thankfully both GoogleBot and BingBot provide you with a list of IP addresses.&lt;/p&gt;
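For illustration, here is a rough sketch of pulling the CIDR ranges out of those published JSON files. The `extractPrefixes` helper and the trimmed-down sample are my own; the file shape assumed here (a `prefixes` array of `ipv4Prefix`/`ipv6Prefix` entries) is based on the GoogleBot file, so verify it before relying on it.

```javascript
// Sketch: extract the IPv4/IPv6 CIDR ranges from a published bot IP JSON file.
function extractPrefixes(botJson) {
  return botJson.prefixes
    .map(p => p.ipv4Prefix || p.ipv6Prefix) // each entry has one or the other
    .filter(Boolean);
}

// Trimmed-down sample in the GoogleBot file's shape:
const sample = {
  creationTime: '2023-06-30T00:00:00.000000',
  prefixes: [
    { ipv4Prefix: '66.249.64.0/27' },
    { ipv6Prefix: '2001:4860:4801:10::/64' }
  ]
};

console.log(extractPrefixes(sample));
// [ '66.249.64.0/27', '2001:4860:4801:10::/64' ]
```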
&lt;p&gt;So how do you get these IP addresses into your warehouse?&lt;/p&gt;
&lt;p&gt;Well, that is where &lt;a href=&quot;https://github.com/JordanChoo/big-crawler-ips&quot;&gt;Big Crawler IPs&lt;/a&gt; comes into play.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/JordanChoo/big-crawler-ips&quot;&gt;Download the code on GitHub Here&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How It Works&lt;/h2&gt;
&lt;p&gt;The code lives within a &lt;a href=&quot;https://cloud.google.com/functions&quot;&gt;Google Cloud Function&lt;/a&gt; which is periodically triggered by a &lt;a href=&quot;https://cloud.google.com/scheduler&quot;&gt;Cloud Scheduler&lt;/a&gt; HTTP request.&lt;/p&gt;
&lt;p&gt;Once the Cloud Function is triggered, the official &lt;a href=&quot;https://developers.google.com/static/search/apis/ipranges/googlebot.json&quot;&gt;GoogleBot&lt;/a&gt; and &lt;a href=&quot;https://www.bing.com/toolbox/bingbot.json&quot;&gt;BingBot&lt;/a&gt; IP address JSON files are read and cross-referenced with the IPs already in BigQuery; any IP address missing from BigQuery is then added to the table.&lt;/p&gt;
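The cross-referencing step above boils down to a set difference. Here is a minimal sketch of just that diff; `missingIps` is a hypothetical helper, and fetching the published files and querying BigQuery for the stored rows are left out.

```javascript
// Sketch: given the prefixes published in the bot JSON files and the prefixes
// already stored in BigQuery, return only the ones that still need inserting.
function missingIps(publishedIps, storedIps) {
  const stored = new Set(storedIps);
  return publishedIps.filter(ip => !stored.has(ip));
}

console.log(missingIps(['66.249.64.0/27', '66.249.64.32/27'], ['66.249.64.0/27']));
// [ '66.249.64.32/27' ]
```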
&lt;h2 id=&quot;how-to-deploy-your-own&quot;&gt;How To Deploy Your Own&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://youtu.be/la8kppC8Fwk&quot;&gt;Watch the deployment walkthrough on YouTube&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hope you find this tool handy, and if you have any feedback, feel free to reach out on Twitter (&lt;a href=&quot;https://twitter.com/JordanChoo&quot;&gt;@JordanChoo&lt;/a&gt;).&lt;/p&gt;