YaCy. Self-hosted P2P (or local-only) search engine. Crawl the web yourself.
What kind of storage space does it take up, and how good are the results?
Of course, it depends on how you configure the crawler, i.e. how deep it goes into subdomains and how far it follows links to other domains.
I crawl 71 sites: my index is currently 4,576,319 documents (I crawled sites like GitHub too) and occupies just under 14 GB.
The quality of the results depends on several factors, for example whether you search only locally or use the p2p network. It also has a number of settings, and you can control what goes into the results down to the smallest detail. But I have to be honest and say that I haven't dealt with that side at all (especially since it's a bit complex in places), because I first want to expand my own list of sites to crawl, and I only use it locally. I still regularly use DuckDuckGo to search. However, if you take the time for it, you will get the quality of results you want.
Ah well, depending on how you set up the crawler, it consumes system resources accordingly. However, you can cap RAM and storage usage, and the same goes for network utilization, which is pretty important because otherwise no other connections would be possible besides crawling xD
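If you want to poke at a local instance from a script, here's a minimal sketch that queries YaCy's JSON search endpoint on the default port 8090. The response field names ("channels", "items", "title", "link") are assumptions based on its OpenSearch-style output and may vary between versions.

```python
# Minimal sketch: query a locally running YaCy instance's JSON search API.
# Assumes YaCy is listening on the default port 8090; response field names
# are assumptions and may differ depending on the YaCy version.
import json
import urllib.parse
import urllib.request

YACY_SEARCH_URL = "http://localhost:8090/yacysearch.json"

def search(query):
    # Build the query string and fetch the JSON result set.
    params = urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(f"{YACY_SEARCH_URL}?{params}") as resp:
        data = json.load(resp)
    # OpenSearch-style payload: results are expected under channels[0]["items"].
    for item in data.get("channels", [{}])[0].get("items", []):
        print(item.get("title"), "-", item.get("link"))

if __name__ == "__main__":
    search("self-hosted search")
```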
This is exciting, TBH.
I’m going to try it out! Storage space be damned 😂
looks really good, something to rewrite in Rust some day!