How does docs crawling work?

Is there a way to debug what info is fed into the docs crawler? I’m trying to add our internal docs and I don’t think it’s working, maybe because the site is a SPA. I’d love to know what’s happening internally so I can just fix the docs.

3 Likes

+1, would love to see the indexing in action if possible so I can format my docs correctly too.

I know they’re still improving the product, but this feature would be killer if they can get it working reliably to process and index any kind of documentation website.

1 Like

We plan on adding a visualization of what the model actually sees when using docs in a future release. We’ll also have better docs management and auditing of the text that is shown for each page.

For now, we use an HTML-to-markdown parser, then do n-gram deduplication across the crawled pages (to take out useless info like navbars). Finally, we use some simple chunking heuristics to break a page down into multiple ~500-token chunks.
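If it helps to picture it, here is a simplified sketch of the dedup + chunking step. The shingle size, thresholds, and words-per-token estimate are made-up illustrations, not our exact values:

```ts
// Simplified sketch of cross-page n-gram dedup + chunking.
// All constants are illustrative guesses.
const NGRAM = 8;              // words per shingle
const SHARED_FRACTION = 0.5;  // shingle counts as boilerplate if on >= 50% of pages
const CHUNK_TOKENS = 500;     // target chunk size in tokens
const WORDS_PER_TOKEN = 0.75; // rough words-per-token estimate

function shingles(text: string): string[] {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i + NGRAM <= words.length; i++) {
    out.push(words.slice(i, i + NGRAM).join(" "));
  }
  return out;
}

// How many pages does each shingle appear on?
function shingleDocFreq(pages: string[]): Map<string, number> {
  const df = new Map<string, number>();
  for (const page of pages) {
    for (const s of new Set(shingles(page))) {
      df.set(s, (df.get(s) ?? 0) + 1);
    }
  }
  return df;
}

// Drop lines (navbars, footers, ...) whose shingles recur across most pages.
function dedupe(page: string, df: Map<string, number>, nPages: number): string {
  return page
    .split("\n")
    .filter((line) => {
      const sh = shingles(line);
      if (sh.length === 0) return true; // too short to judge; keep it
      const shared = sh.filter((s) => (df.get(s) ?? 0) / nPages >= SHARED_FRACTION);
      return shared.length / sh.length < 0.5;
    })
    .join("\n");
}

// Naive chunker: pack paragraphs into ~500-token chunks.
function chunk(markdown: string): string[] {
  const budget = CHUNK_TOKENS * WORDS_PER_TOKEN; // approximate word budget
  const chunks: string[] = [];
  let current: string[] = [];
  let words = 0;
  for (const para of markdown.split(/\n{2,}/)) {
    const n = para.split(/\s+/).filter(Boolean).length;
    if (words + n > budget && current.length > 0) {
      chunks.push(current.join("\n\n"));
      current = [];
      words = 0;
    }
    current.push(para);
    words += n;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}

// Demo: three pages sharing one navbar line, which gets stripped.
const nav = "Home Docs API Blog Pricing Contact About Careers";
const pages = [
  `${nav}\n# Install\nRun the install command to get started with the library today.`,
  `${nav}\n# Usage\nImport the module and call the main entry point function first.`,
  `${nav}\n# FAQ\nAnswers to the questions users ask most frequently in practice here.`,
];
const df = shingleDocFreq(pages);
console.log(chunk(dedupe(pages[0], df, pages.length)));
```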

The library name is node-html-markdown if you’d like to see what the raw markdown looks like for a webpage.
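For example (the sample HTML here is just a stand-in; feed in a real page’s source to see what the crawler’s markdown pass would produce):

```ts
import { NodeHtmlMarkdown } from "node-html-markdown";

// Stand-in HTML for illustration.
const html = `
  <nav><a href="/">Home</a> | <a href="/docs">Docs</a></nav>
  <h1>Getting Started</h1>
  <p>Install with <code>npm install mylib</code>.</p>
`;

console.log(NodeHtmlMarkdown.translate(html));
```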

7 Likes

Ahh, so if the docs are client-side rendered, it’s not gonna work?

1 Like

What do you mean by client-side rendered?

Like, it’s a Create React App, so it’s all rendered using React. The HTML just has the script tag, so I’m guessing the Cursor server won’t execute the JavaScript.

@amanrs Thank you, Aman, very helpful!

Are the requests made from the client running Cursor, or from a cloud-based service? Would it be possible to access non-public resources like Azure DevOps wikis or internally hosted sites?

I came here to mention that I think there might need to be some kind of user-assisted check or something…

I want to play with a VSCode extension for my workflow, so I entered the URL of Extension API | Visual Studio Code… 10 mins later it’s still learning. It’s following every link (I didn’t know it did link following!), and there must be a heap going on behind the scenes. And there’s no cancel button…

1 Like

They are rendered client-side! We use Puppeteer to control Chromium and run the JavaScript.
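In rough terms, the render step is something like this. It’s a simplified sketch; the wait condition and example URL are illustrative, not our exact settings:

```ts
import puppeteer from "puppeteer";

// Render a client-side app in headless Chromium, then grab the
// post-render HTML that the markdown parser would consume.
async function renderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for the network to go quiet so the SPA can hydrate.
    await page.goto(url, { waitUntil: "networkidle0" });
    return await page.content();
  } finally {
    await browser.close();
  }
}

renderedHtml("https://example.com/docs").then((html) => {
  console.log(`${html.length} bytes of rendered HTML`);
});
```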

Is there currently any way to control which links get followed? I’m trying to index Modules - io, but it appears Cursor doesn’t follow the links. If I want it to index the actual docs for each module, I have to do each one individually, and there are a lot of modules.

BTW Cursor is absolutely amazing – it’s the first AI code tool I’ve found that can properly assist with newer and highly complex libraries like Effect!

2 Likes