

I’d highly recommend using ZIM to save offline copies of the websites you want! (https://wiki.openzim.org/wiki/Build_your_ZIM_file)
Once downloaded, you can honestly probably get better results from a basic Notepad search than from Google/DuckDuckGo/Bing.


Super useful plugin! You can also subscribe to lists that block SEO/AI-generated websites. Now if only there were a whitelist plugin that placed forums higher up.


Someday this will be possible when an open source search engine comes around.


I’ve noticed that some of the best resources from the past can’t be found through any search engine. For example, some science YouTube channels with amazing-quality content seem to be unfindable; they’ve been replaced by channels that clickbait their way to the top. The same goes for websites that pile on as much SEO as they can. The highest-quality resources also tend to be the fewest in number. Quantity over quality gets favored and amplified, and quality resources are sometimes even censored (Anna’s Archive, for example).


It’s quite sad that we are now at a point where we are forced to make our own search engines from scratch. Search engines are hard! Google’s original search algorithm (about two decades ago) was quite amazing: you could give vague search terms and still find the answer you wanted. The secret sauce was ranking results by relevance to the search query. I’m not aware of any guides or projects on building search engines. I wish there were a good way to search for one. (The irony!) But a great starting resource is Wikipedia’s series on networks. (https://en.wikipedia.org/wiki/Network_theory)
Some random tips:
As a side note, you can tune your model to your own search preferences with little data. You can also trade computation time for search quality, which is amazing! If computation is a concern, traditional traversal algorithms and basic relevance/ranking algorithms work too, but at the cost of more engineering.
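To make the "basic relevance/ranking algorithms" option a bit more concrete, here is a minimal sketch of TF-IDF scoring in plain Python. TF-IDF is just one common choice I'm picking for illustration, not necessarily what any particular engine uses, and the function names (`tokenize`, `rank_tfidf`) and the sample documents are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_tfidf(query, docs):
    """Return (doc_index, score) pairs sorted by descending TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                # Smoothed IDF: rare terms weigh more, common terms still count a little.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((i, score))

    # sorted() is stable, so ties keep their original document order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents, only for demonstration.
docs = [
    "How to build a ZIM file for offline browsing",
    "Top 10 clickbait tricks to boost your SEO ranking",
    "Forum thread: building a self-hosted search engine index",
]
ranking = rank_tfidf("self-hosted search engine", docs)
best_index, best_score = ranking[0]
print(docs[best_index], round(best_score, 3))
```

This only does exact word matching, which is part of the "more engineering" cost: stemming, synonyms, and typo tolerance would all have to be added by hand.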
I hope this helps somewhat; if you have any other questions, feel free to ask! The future of search will likely be self-hosted, as conflicts of interest within current search engine providers degrade quality to the point where the engines are unusable.


Finding the balance of what to keep in the index is hard! The attention mechanism in transformers should be pretty good at ranking results. The idea is to feed titles, top answers, etc. into the context in bulk along with a search query. The attention heatmap relative to the query gives you a general rank for how good each result is. Ironically enough, this is probably the most powerful indexer, yet as far as I know no big tech company uses it this way; they have the model generate answers instead of ranking them. The best part is that this system is tunable and can be adjusted to user preference with little data. The overall goal should be to minimize the number of results a user has to check. (This should be what other engines are doing in the first place.)
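Here is a rough sketch of the scoring idea, heavily simplified. It does not read attention out of a real transformer; it only applies the single-head scaled dot-product attention formula, softmax(q·k / sqrt(d)), using toy bag-of-words vectors as a stand-in for learned embeddings. The names `embed` and `attention_rank` and the sample titles are hypothetical, so treat this as an illustration of the ranking step rather than a working indexer.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text, vocab):
    """Toy bag-of-words vector (ASSUMPTION: stands in for learned embeddings)."""
    tokens = tokenize(text)
    return [float(tokens.count(word)) for word in vocab]


def attention_rank(query, candidates):
    """Rank candidates by the attention weight the query assigns to each one."""
    if not candidates:
        return []

    vocab = sorted({w for text in [query, *candidates] for w in tokenize(text)})
    d = max(len(vocab), 1)  # avoid dividing by zero when nothing tokenizes

    q = embed(query, vocab)
    keys = [embed(c, vocab) for c in candidates]

    # Scaled dot-product logits: q . k / sqrt(d)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]

    # Numerically stable softmax turns logits into weights that sum to 1.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    order = sorted(range(len(candidates)), key=lambda i: weights[i], reverse=True)
    return [(candidates[i], weights[i]) for i in order]


# Hypothetical result titles, only for demonstration.
candidates = [
    "Why is the sky blue? Rayleigh scattering explained",
    "You won't BELIEVE these 10 sky photos",
    "Best pizza recipes",
]
results = attention_rank("why is the sky blue", candidates)
for title, weight in results:
    print(f"{weight:.3f}  {title}")
```

With a real model, the query and keys would come from trained projections, which is where the tuning to user preference would happen; the softmax-and-sort step at the end would stay much the same.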
Thanks! That’s a good idea!