How does docs crawling work?

Is there a way to debug what info is fed into the docs crawler? I'm trying to add our internal docs and I don't think it's working. Maybe it's because the site is an SPA. I'd love to know what's happening internally so I can just fix the docs.


+1, would love to see the indexing in action if possible so I can format my docs correctly too.

I know they’re still improving the product but this feature would be killer if they can get it reliably working to process and index any kind of documentation website


We plan on adding a visualization of what it is the model actually sees when using docs in a future release. We'll also add better docs management and auditing of the text shown for each page.

For now, we use an HTML-to-markdown parser, then do n-gram deduplication across the crawled pages (to strip out useless info like navbars). Finally, we use some simple chunking heuristics to break a page down into multiple ~500-token chunks.
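The dedup-and-chunk steps described above might look roughly like this. This is a minimal sketch, not Cursor's actual implementation; the n-gram size, the "page-unique n-gram" rule, and the 4-chars-per-token estimate are all made-up assumptions:

```javascript
// Sketch of the crawl post-processing described above.
// NOT Cursor's real code: the n-gram size (8 words), the dedup rule,
// and the 4-chars-per-token estimate are illustrative assumptions.

// Split text into word-level n-grams.
function ngrams(text, n = 8) {
  const words = text.split(/\s+/).filter(Boolean);
  const out = [];
  for (let i = 0; i + n <= words.length; i++) {
    out.push(words.slice(i, i + n).join(" "));
  }
  return out;
}

// Drop lines whose n-grams all repeat on other pages too
// (e.g. navbars and footers duplicated across the whole site).
function dedupePages(pages, n = 8) {
  const counts = new Map();
  for (const page of pages) {
    for (const g of new Set(ngrams(page, n))) {
      counts.set(g, (counts.get(g) ?? 0) + 1);
    }
  }
  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => {
        const gs = ngrams(line, n);
        // Keep short lines, and lines with at least one page-unique n-gram.
        return gs.length === 0 || gs.some((g) => counts.get(g) === 1);
      })
      .join("\n")
  );
}

// Break a page into ~500-token chunks, assuming ~4 chars per token,
// splitting only on paragraph boundaries.
function chunk(page, maxTokens = 500) {
  const maxChars = maxTokens * 4;
  const chunks = [];
  let current = "";
  for (const para of page.split("\n\n")) {
    if (current && current.length + para.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? current + "\n\n" + para : para;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A real pipeline would chunk on headings and sentence boundaries rather than raw paragraph length, but the overall shape is the same: dedupe site-wide boilerplate first, then split what's left.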

The library name is node-html-markdown if you’d like to see what the raw markdown looks like for a webpage.


Ahh, so if the docs are client-side rendered, it's not gonna work?


What do you mean by client-side rendered?

Like, it's a Create React App, so it's all rendered using React. The HTML just has the script tag, so I'm guessing the Cursor server won't execute the JavaScript.
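For anyone wondering whether their site falls into this bucket, one rough way to check is to look at the raw HTML a plain HTTP fetch returns: a CRA-style SPA shell has script tags but almost no visible body text, just a root div. This is a hypothetical heuristic for self-diagnosis, not anything Cursor does, and the 200-character threshold is an arbitrary assumption:

```javascript
// Hypothetical heuristic: guess whether raw HTML is an SPA shell,
// i.e. the real content would only appear after running JavaScript.
function looksClientSideRendered(html) {
  // Strip script/style blocks and all tags, keeping only visible text.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // A shell page has script tags but almost no visible text.
  return /<script/i.test(html) && text.length < 200;
}
```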

@amanrs Thank you Aman, very helpful!

Are the requests made from the client running cursor or from a cloud-based service? Would it be possible to access non-public resources like Azure DevOps wikis or internally hosted sites?

I came here to mention that I think there might need to be some kind of user-assisted check or something…

I want to play with a VS Code extension for my workflow, so I entered the URL of Extension API | Visual Studio Code Extension API … 10 mins later it's still learning. It's following every link (I didn't know it did link following!), and there must be a heap going on behind the scenes. And there's no cancel button…


Client-side rendered pages do work! We use puppeteer to control Chromium and run the JavaScript.

Is there currently any way to control what links get followed? I’m trying to index Modules - io but it appears Cursor doesn’t follow the links. If I want it to index the actual docs for each module I have to do each one individually, and there are a lot of modules.

BTW Cursor is absolutely amazing – it’s the first AI code tool I’ve found that can properly assist with newer and highly complex libraries like Effect!
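Since Cursor doesn't currently expose any control over link following, one workaround is to scrape the index page yourself and collect the per-module URLs so each one can be pasted into the docs UI individually. A sketch of that, where the `/modules/` path filter is an illustrative assumption about how a docs site is laid out:

```javascript
// Workaround sketch: pull doc links out of an index page's HTML so
// each one can be added to Cursor's docs individually.
// The pathPrefix filter ("/modules/") is an illustrative assumption.
function extractDocLinks(html, baseUrl, pathPrefix = "/modules/") {
  const links = new Set();
  const re = /href="([^"#]+)"/g; // naive href scan; skip fragment links
  let m;
  while ((m = re.exec(html)) !== null) {
    const url = new URL(m[1], baseUrl); // resolve relative hrefs
    if (url.pathname.startsWith(pathPrefix)) links.add(url.href);
  }
  return [...links];
}
```

You could fetch the index page (or save it from the browser), run this over it, and print the list — tedious, but it at least makes "add each module individually" mechanical.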


I still can’t get cursor to index Modules - effect

It won't follow the links for whatever reason, which makes it more or less impossible to use Cursor features with my codebase. Without the docs it gives completely wrong and nonsensical answers about Effect code: it uses APIs that don't exist, imports the wrong library, makes up APIs that were part of fp-ts two years ago, etc.

I still can't get it to index the Effect docs, which really limits the usefulness of Cursor. Without the docs I get wildly incorrect answers from the AI about any Effect code (which is most of my codebase).

Hey everyone!

Sorry to resurrect a “dead” thread, but would love to bring the discussions back around here…

I too am quite seriously struggling to index some documentation. It's not clear to me what the issues are, however, and debugging/fixing issues around docs is tough and opaque, at best…

As this is a key feature, I'm sure it's on the "radar" or whatever, but I just wanted to know if there's some additional help that can be provided here to diagnose things and help this incredibly useful feature be… well, useful? Any ideas @truell20 et al.?

Cheers to the team - love the product.

@jldb What documentation are you trying to index?

If I remember rightly it was Timefold's documentation, but it's one of a number of times I have had documentation crawling fail.

I would hazard to say that not a single set of docs I have added manually has given the AI the information I know is in the documentation, but because the docs crawling is so opaque, I couldn't really tell you why.

Sorry I can’t be more useful!