I feel like there are probably some ad-based search engines that are privacy- and service-oriented, but in general, even for those, there remains a misalignment problem. Hence, if I don’t want to be a product now or in the future, what good search engines are there that I can pay for?

  • 𞋴𝛂𝛋𝛆@piefed.world
    English
    arrow-up
    8
    ·
    1 day ago

    There are only 2 relevant web crawlers: Google’s and Microsoft’s. All queries from every search engine go through these two crawlers, either directly or through a middle layer of obfuscation.

    The issue is that the internet is too large to index. This has been a known emerging issue for a long time. This is the real reason search sucks. It is not deterministic because it cannot be, but therein lies the issue. Without deterministic unbiased information, democracy is dead. And so search sucks. No one has been able to find a solution for efficient access to enormous databases like this except through the methodologies behind AI. At least not for real time search queries.

    • DaGeek247@fedia.io
      arrow-up
      6
      ·
      14 hours ago

      The issue is that the internet is too large to index.

      It’s really not. At least, not yet. It’s a large part of why it isn’t done, but it’s not the only one, and I’d argue, not even the main reason it isn’t really done.

      A complete crawl of the internet with metadata in 2025 is only 424 TiB. For comparison, my $1,000 home setup can handle about a tenth of that (in storage at least). The hardware to maintain a single database of the internet with metadata could easily cost under $100,000.
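As a rough check on those figures, here is a minimal sketch that uses only the numbers quoted above (424 TiB, a $1,000 home setup covering about a tenth of it, and a $100,000 ceiling). It assumes storage cost scales linearly with capacity, which is a simplification that ignores redundancy, compute, and bandwidth; the variable names are illustrative.

```python
# Back-of-envelope storage estimate using only the figures quoted in the comment.
# Assumption: cost scales linearly with capacity (no redundancy, compute, or bandwidth).

TIB = 2**40  # bytes in one tebibyte

crawl_tib = 424           # quoted crawl size, in TiB
home_budget_usd = 1_000   # quoted cost of the home setup
home_fraction = 0.10      # "about a tenth" of the crawl, storage only
budget_usd = 100_000      # quoted hardware ceiling

# Convert the crawl size to decimal terabytes, the unit drives are sold in.
crawl_tb = crawl_tib * TIB / 10**12

# Implied price per TiB of the home setup, then scaled to the full crawl.
home_tib = crawl_tib * home_fraction
home_usd_per_tib = home_budget_usd / home_tib
naive_storage_usd = home_usd_per_tib * crawl_tib

# How many times over the quoted ceiling covers the naive storage bill.
headroom = budget_usd / naive_storage_usd

print(f"crawl size: {crawl_tb:.1f} TB")
print(f"home setup: ${home_usd_per_tib:.2f} per TiB")
print(f"naive storage cost for full crawl: ${naive_storage_usd:,.0f}")
print(f"headroom under ${budget_usd:,}: {headroom:.1f}x")
```

Under that linear assumption, raw storage for the whole crawl lands around a tenth of the $100,000 ceiling, which leaves the rest for redundancy and compute.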

      Dave, your comment about it costing a billion to run Bing or Google might be true, but it is completely unrelated to the realities of running a small search engine and has everything to do with the fact that they are Google and Microsoft products respectively.

      The real issue isn’t the physical size of the internet; it’s much more likely to be the complexity of making a search algorithm that can compete with the $75 billion SEO market that exists to break search engines.

      • 𞋴𝛂𝛋𝛆@piefed.world
        English
        arrow-up
        2
        ·
        13 hours ago

        The original comment was made in good faith, but from sketchy long-term memory of stuff I’ve come across. It seems like it was in a Lex Fridman or similar podcast at some point in the last 3-10 years. I may have conflated or misunderstood things, as I am not experienced with such complexity. I seem to recall it coming up around the time several astronomers were speaking publicly about issues with processing large amounts of data and soliciting solutions. I just recall wondering why search started to suck around 2017, and putting the pieces together when I heard this. Now, in retrospect, it seems many of the changes were also adversarial toward rival AI training after the Transformers paper. At least, looking at how search results are salted now, the way images are selected for search is absolutely adversarial for AI training datasets… but that is all I know, and it should be taken as friendly neighborhood water cooler talk, always with the best of intentions.

        • DaGeek247@fedia.io
          arrow-up
          2
          ·
          13 hours ago

          I think most startup search engines use Google/Bing because it’s free or way cheaper than running their own database, not because running one is impossible. It also likely sidesteps a lot of the SEO bullshit, simply because Google/Bing have more experience working around it.

          So, short term and at a small size, it’s cheaper and straight up easier to piggyback off of the big two companies than to manage your own data set. Long term, if you get popular enough to be noticed, I expect that the SEO business would wreck any self-hosting search engine startup’s results pretty regularly.

    • Dave@lemmy.nz
      arrow-up
      2
      ·
      22 hours ago

      I once read that running a search crawler costs upwards of a billion dollars a year. Anyone other than Microsoft or Google running their own search index is either not getting a wide spread of the internet or is using their own index to supplement Google or Bing results.

      • DaGeek247@fedia.io
        arrow-up
        5
        ·
        14 hours ago

        That’s like saying that it’s impossible to run a car manufacturing company without 100 billion because that’s how much Ford spends on their car manufacturing processes. It makes no sense.

        Yes, making an original search engine is hard, just like making trucks is. But that doesn’t mean that running either one requires billions of dollars to do.

        Common Crawl is a nonprofit that regularly shares free copies of every internet page with metadata, and it damn well doesn’t take billions to do it either. https://commoncrawl.org/

        • Dave@lemmy.nz
          arrow-up
          2
          ·
          edit-2
          10 hours ago

          That website claims they add 3-5 billion pages a month. Google is doing that in a day or three, as recency of information is very important in search. Plus, that site claims 100 billion pages to Google’s 400 billion. It’s still an impressive project.
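To put that gap in numbers, here is a small sketch built only from the figures in the comment above. It reads "a day or three" as Google crawling the same 3-5 billion pages in 1 to 3 days, and it assumes a 30-day month; both are interpretations for illustration, not measured values.

```python
# Rough crawl-rate comparison from the figures quoted in the comment.
# Assumptions: a 30-day month, and "a day or three" meaning 1 to 3 days
# for the same 3-5 billion pages.

DAYS_PER_MONTH = 30

cc_pages_per_month = (3e9, 5e9)   # Common Crawl's claimed monthly additions
google_days_for_same = (1, 3)     # days Google is said to need for that volume

# Common Crawl's daily rate, low and high.
cc_per_day = tuple(pages / DAYS_PER_MONTH for pages in cc_pages_per_month)

# Google's daily rate: slowest case is the low volume over 3 days,
# fastest case is the high volume in 1 day.
google_per_day_low = cc_pages_per_month[0] / google_days_for_same[1]
google_per_day_high = cc_pages_per_month[1] / google_days_for_same[0]

# Range of how much faster Google crawls under these assumptions.
ratio_low = google_per_day_low / cc_per_day[1]
ratio_high = google_per_day_high / cc_per_day[0]

# Index size ratio: 400 billion pages versus 100 billion.
index_ratio = 400e9 / 100e9

print(f"Google crawls roughly {ratio_low:.0f}x to {ratio_high:.0f}x faster")
print(f"Google's index is about {index_ratio:.0f}x larger")
```

So by these numbers the difference in crawl speed is much larger than the difference in index size.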

          Size isn’t everything, so the real question is: what search site uses only the common crawl index and has results on par with bing or google?

          • DaGeek247@fedia.io
            arrow-up
            3
            ·
            9 hours ago

            Size isn’t everything, so the real question is: what search site uses only the common crawl index and has results on par with bing or google?

            None of them. At least, none that I’m aware of. I just don’t think that direct expenses are the reason that there are only two major web search tools. I also don’t think Google and Bing are good examples to point at when estimating the cost of running a complete search engine.

            If you read all of your article, the author notes that while Google has an index of about 400 billion pages, the Internet Archive’s index is actually bigger, at around 865 billion.

            The Internet Archive has an operating cost of about $33 million a year. I think that is a much more reasonable example to point to and say “running a complete search engine would have a similar price to that”.

            Also, very neat article btw. I would have never guessed that Google’s search index count has been shrinking for the past little while, or that Google actively culls results from its database that it thinks people won’t ever want to see.

            • Dave@lemmy.nz
              arrow-up
              1
              ·
              6 hours ago

              I’m not disputing that you might be right, but the Internet Archive runs a very different service. The main difference is that Google needs to continuously prune its 400 billion page index because of link rot. The Internet Archive has the opposite aim: they are preserving sites that no longer exist.

              I’m also not sure they even crawl. Do sites get added on user request? When looking at a medium popularity page, you see it only has a couple of scrapes a year.

              None of them. At least, none that I’m aware of. I just don’t think that direct expenses are the reason that there are only two major web search tools. I also don’t think Google and Bing are good examples to point at when estimating the cost of running a complete search engine.

              I would suggest direct expenses are the barrier, but perhaps crawling is not the main expense. I would be interested to know any speculations you have outside of expenses that cause a barrier?

              • DaGeek247@fedia.io
                arrow-up
                2
                ·
                5 hours ago

                When I said ‘direct expenses’ I mostly meant the cost of owning and running a database of internet pages and metadata comprehensive enough to be considered part of a ‘fully featured search engine’. There’s also the other half: the compute required to obtain the pages and create that metadata. At most, I would guess that this would equal the cost of just having the space for a database of all the internet pages (scaling up after that based on how many users you need to support). In short, a scaled-down web engine that had access to every page on the internet that people would want to find could cost as little as $100,000 for the initial hardware purchase.

                The Internet Archive does in fact have their own web crawler. They also archive sites upon request; I’ve had my personal website on there for almost two decades now, specifically at my request.

                They also have a full-featured search function available for anyone on their website at archive.org. This is why I say they’re a reasonable price comparison for a full-featured search engine. They may spend more on storage and less on metadata compute than a theoretical smaller search engine, but at the end of the day, that’s just a re-balancing of the cost, not a completely new and more excessive cost.

                I think direct expenses (the cost of owning and maintaining an internet index database) are definitely significant. The completely free access that Google gives to anyone who wants it costs way more than any single private entity or company could support just because they want to have it. I don’t think it would be anywhere even close to a billion dollars, though.

                I think the hardest part of having an internet index database would be the knowledge required to create and maintain it, especially under the hostile forces of the $75 billion SEO industry. If a self-hosted search engine became big enough that the SEO industry started trying to break it, I don’t think that company would survive for very long at all.

                Google is losing that battle, like, almost completely. What hope would a small startup style company have of battling it and staying financially solvent, especially if they’re trying to be different from google and bing and actually showing results without the pressure of advertisers breathing down their necks?

                I think the hardware side of a search engine is solvable with a Silicon Valley startup level of funding. I think it’s impossible for anyone in the current day and age to make that sort of project solvent while keeping the user (instead of the advertiser) as the main customer. Anyone else who can’t get those funds, or doesn’t actually want to build a results-oriented search engine, can just mooch off of Google and Bing for free.

                • Dave@lemmy.nz
                  arrow-up
                  2
                  ·
                  5 hours ago

                  I think you’d be right that the direct cost of running the crawler and index would not be the issue. But fighting SEO to keep your results decent is probably a cost that dwarfs the basic technical cost of running the crawler and index.

                  And you’d need a technical security team on top of things, as link farms aren’t your only risk. I’m sure there are countless ways to manipulate the algorithm to put your site on top, and Google probably has multiple teams working full time on fighting them.

                  Many of these things would likely not be a problem for a startup, though. No one is paying SEO firms big money to get into a search index no one has heard of and hardly anyone uses, so these costs probably grow exponentially over time as you become more well known.

                  • DaGeek247@fedia.io
                    arrow-up
                    2
                    ·
                    5 hours ago

                    Yeah, and on the smaller / earlier side of a theoretical search engine company, Google offers their API for free. I think this is actually another one of the biggest reasons why nobody has tried to make a new search engine with their own index. Why waste hundreds of thousands of dollars on hardware, and even more on personnel costs, when you can just have Google do it for you instead?