I would like to try building a search index for this instance (and maybe others), so I would like to crawl the site with automated spiders. With the shutdown of the Reddit API, I expect the site to come under quite substantial load, and I would of course try not to spam it with too many requests, so that I don’t get banned or blocked for looking like a DoS attack. Could anyone provide some information on this?
Many existing Fediverse services are operated by people who are opposed to having the content on their instance(s) indexed. You may run into resistance from that angle.
I mean, unless they make their instance private, I don’t see why you wouldn’t index them. That’s literally why Google provided so much value in its early days.
Even Google doesn’t index webpages that carry a “noindex” directive, whether in a robots meta tag or in an X-Robots-Tag HTTP header. You are going to run into a lot of people who don’t agree with what you are trying to do. If you reach out to the people running Fediverse services and let them know that you’re trying to index the data on their services, you can learn what they think of the idea.
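For what it's worth, here is a rough sketch of how a crawler could honor that “noindex” signal before adding a page to its index. It is only an illustration, not how Google or any particular Fediverse software does it: the function name `is_noindex` and the helper class are made up for this post, and it ignores user-agent-scoped forms such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def is_noindex(headers, html):
    """Return True if the response asks not to be indexed.

    headers: dict of HTTP response headers (hypothetical input shape).
    html:    the response body as a string.
    Simplification: only unscoped directives are recognized.
    """
    # 1. Check the X-Robots-Tag HTTP header (case-insensitive name).
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            tokens = {t.strip().lower() for t in value.split(",")}
            if tokens & {"noindex", "none"}:
                return True

    # 2. Check the robots meta tag in the HTML.
    parser = _RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & {"noindex", "none"})
```

You would call `is_noindex(response_headers, response_text)` after fetching each page and skip the page when it returns `True`. Operators who object to indexing may not have set any of these signals, so asking them directly is still worthwhile.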