null.media search has become degraded: excessive traffic has led many of the search providers to rate limit us. perhaps an alternative design could supplement the search results without writing a full meta search engine. a lot of work has gone into making searx work with many search providers, and it would be a waste to abandon all of that because null is being rate limited, so we’ll address that first. keep in mind that this is only a mitigation: distributing traffic over many ips is expensive and doesn’t scale well.
for most of 2018-2019 null was a cluster of searx servers running on 4 geographically distributed vms. this made it possible for us to scale to the present level of traffic without too much trouble. however, each vm can only serve a limited amount of search traffic before getting rate limited, so eventually more complex routing logic will need to be introduced to make this workable. clustering is certainly a necessity anyway if we want to be able to serve large amounts of traffic.
searx-compatible search proxy
another idea that i’ve yet to explore involves some custom middleware: a searx-compatible api that can hit a revolving set of searx servers, with the set generated from uptime stats. the list could be scraped from the public searx list or be populated by our own searx workers. the middleware would use the backend results to populate an index that can be queried without hitting any external apis. this middleware/api could be used to enable most of the ideas that follow.
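as a rough illustration of the "revolving set" idea, here is a minimal python sketch that filters candidate servers by uptime and then rotates through the survivors. the `Instance` type, the 0.95 threshold and the example urls are all assumptions made up for this sketch, not part of searx or of null's current setup.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass(frozen=True)
class Instance:
    url: str       # base url of a searx server (hypothetical examples below)
    uptime: float  # fraction of successful health checks, 0.0 to 1.0


def build_pool(instances: List[Instance], min_uptime: float = 0.95) -> List[Instance]:
    """keep only instances at or above the uptime threshold, best first."""
    healthy = [i for i in instances if i.uptime >= min_uptime]
    return sorted(healthy, key=lambda i: i.uptime, reverse=True)


def rotate(pool: List[Instance]) -> Iterator[Instance]:
    """return an endless round-robin iterator over the pool."""
    if not pool:
        raise ValueError("no healthy searx instances available")
    return cycle(pool)


# hypothetical uptime stats, e.g. scraped from a public instance list
candidates = [
    Instance("https://searx-a.example", 0.99),
    Instance("https://searx-b.example", 0.80),
    Instance("https://searx-c.example", 0.97),
]
backends = rotate(build_pool(candidates))
picked = [next(backends).url for _ in range(3)]
```

plain round-robin is the simplest possible rotation; a real deployment would probably rebuild the pool on a schedule and weight backends by their remaining capacity.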
it may be interesting to additionally query YaCy, but only if it can be operated inexpensively. if there are community members who operate searx or YaCy servers and wish to contribute to null’s search results, it may be a good idea to give them a way to submit their server names and allowable traffic limits.
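one way to honor contributor-submitted traffic limits is a simple per-server request budget that resets every window. the sketch below is only an assumption about how that could look: the `TrafficBudget` class, the one hour window and the server names are hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class TrafficBudget:
    """tracks requests per contributed server within a fixed time window."""

    def __init__(self, limits: Dict[str, int], window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limits = dict(limits)      # server name -> allowed requests per window
        self.window = window_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.window_start = clock()
        self.used: Dict[str, int] = {name: 0 for name in self.limits}

    def _roll_window(self) -> None:
        """reset all counters once the current window has elapsed."""
        now = self.clock()
        if now - self.window_start >= self.window:
            self.window_start = now
            self.used = {name: 0 for name in self.limits}

    def acquire(self) -> Optional[str]:
        """return the server with the most remaining budget, or None if all are spent."""
        self._roll_window()
        remaining = {n: self.limits[n] - self.used[n] for n in self.limits}
        name = max(remaining, key=remaining.get, default=None)
        if name is None or remaining[name] <= 0:
            return None
        self.used[name] += 1
        return name


# hypothetical submissions: server name -> requests allowed per hour
budget = TrafficBudget({"searx.friend.example": 2, "yacy.friend.example": 1})
first = budget.acquire()
```

when `acquire` returns `None`, the middleware would fall back to its local index rather than exceed anyone's stated limit.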
custom web spider
a cluster of web spiders could also be used to update the local index and supplement the ad hoc index results pulled from metasearch by the custom middleware. these would have to be fairly specialized in order to provide useful results, since we don’t have the resources to do any of the fancy ML stuff that google uses to process their index. as such, the spiders will need to be paired with high-quality content curation from the community.
curated content index
in order to provide high quality results and make good strategic use of limited index capacity, it may be a good idea to index only curated sources. these curated sites would be added to the system by trusted community members who can vouch for the quality of the submitted content. this could be done using an automated process that scrapes content links from trusted sources. an example of this could be the inspiration page on walkaway.wiki or the resources page on 100 rabbits.
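the link scraping step could be sketched with nothing but the python standard library. this example works on an html string that has already been fetched, and keeps only outbound http(s) links; the function name and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """collects the href of every <a> tag, resolved against the page url."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def external_links(html: str, base_url: str) -> List[str]:
    """return unique http(s) links that point away from the trusted page's host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    seen: Set[str] = set()
    result: List[str] = []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https") or parsed.netloc == own_host:
            continue  # skip mailto:, internal navigation, etc.
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result


# hypothetical trusted page content
page = (
    '<a href="/about">about</a>'
    '<a href="https://example.org/tool">tool</a>'
    '<a href="https://example.org/tool">tool again</a>'
    '<a href="mailto:someone@example.com">mail</a>'
)
queue = external_links(page, "https://wiki.example/inspiration")
```

the resulting list could then feed a crawl queue, with the trusted page itself acting as the community's vouching mechanism.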
fediverse aggregate index
direct search on a fediverse database will yield mostly useless results for a general purpose search engine. however, using domain or url mentions in aggregate with boost and like numbers could yield some interesting results without directly exposing fediverse content to a search index. it may also be a good strategy for extracting useful information from other social networks.
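to make the aggregate idea concrete, here is a small python sketch that reduces posts to per-domain scores, so only domains and numbers ever reach the index, never the post text. the post dictionaries, the weights and the log damping are assumptions picked for the example, not a tested ranking formula.

```python
import math
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def score_domains(posts: Iterable[dict], boost_weight: float = 2.0,
                  like_weight: float = 1.0) -> List[Tuple[str, float]]:
    """rank domains by how often they are mentioned, weighted by engagement."""
    scores: Dict[str, float] = defaultdict(float)
    for post in posts:
        # count each domain at most once per post
        domains = {urlparse(u).netloc.lower() for u in URL_PATTERN.findall(post["text"])}
        engagement = boost_weight * post["boosts"] + like_weight * post["likes"]
        weight = 1.0 + math.log1p(engagement)  # damp the effect of viral posts
        for domain in domains:
            if domain:
                scores[domain] += weight
    # highest score first, ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# hypothetical posts with their boost and like counts
posts = [
    {"text": "see https://a.example/x and https://a.example/y", "boosts": 0, "likes": 0},
    {"text": "https://b.example/ is neat", "boosts": 3, "likes": 1},
    {"text": "no links here", "boosts": 9, "likes": 9},
]
ranked = score_domains(posts)
```

the same function would work on any network that exposes a text body plus share and like counts.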
walkaway handbook / tracker meshwiki
future versions of walkaway handbook will store information using a semantic wiki model in a distributed database, which could also be used to provide rich search results without expensive ML. ideally much of the work behind null.media could be packaged as a tracker application to enable decentralized search infrastructure to replace monolithic search systems.
Another suggestion for a search provider to query: the Algolia frontend search for Hacker News. Lots of really good stuff there, even if it is a bit special interest
(and yeah, I think curated resource lists will win out long term over anything else)
HN and other curated niche discussion platforms are definitely good targets for indexing.