July 2nd, 2024

Google rejected me and now I'm building a search engine

The article discusses a personal experience of being rejected by Google after an interview, which led the author to build their own search engine as an alternative. The rejection came despite positive feedback from three interviewers. It also prompted the author to criticize Google's practices, accusing the company of unethical behavior such as misleading advertising, tax evasion, privacy violations, and supporting controversial causes. The new search engine project aims to be non-profit and community-driven, allowing users to contribute to its development and funding. The project is open source, welcoming contributions from anyone interested, and emphasizes collaboration to create a search engine that prioritizes ethical values over profit. The author acknowledges the long road ahead to compete with Google but remains optimistic about the potential for a community-driven alternative.

28 comments
By @kevmo314 - 5 months
I'm sure the post's author doesn't need interview advice anymore, but in case there are any prospective interview candidates out there: completely freezing during an interview is a super negative signal. Even if you need to manually multiply out 2's on a whiteboard, it would be more productive than saying "I don't know".

In my experience, the only reason you should say "I don't know" is if you're going to follow it with "but if I had to guess" or similar. It sounds like the interviewer definitely came on strong, but being able to ace the psychological part of an interview is often as important as, or more important than, the actual solution.

By @crazygringo - 5 months
This user already submitted this same article yesterday and it was flagged:

https://news.ycombinator.com/item?id=40850725

Rather than this clickbaity "Google rejected me" story about something that happened 15 years ago, here's a link to the actual project:

https://github.com/mwmbl/mwmbl

By @harles - 5 months
> He continued to ask more questions about numbers of bits. I couldn’t answer any of them without a lot of help. He didn’t ask me about my PhD work building a new theory of natural language semantics.

This strikes me as fairly petty: “I didn’t answer wrong, you asked me the wrong questions!” Honestly, it’s the recruiting process working as intended - folks with this type of attitude don’t make good team members, in my experience.

Also:

> At the time “Don’t be evil” still meant something. Now it seems like their mantra is just “Be evil”.

This seems really petty. It’s a shame, because we could use good big tech alternatives, but building something out of spite without much perspective is unlikely to create a good one.

By @Imnimo - 5 months
I like this interview question. It's perfectly solvable without a calculator as the interviewer said. It doesn't rely on having memorized some weird binary tree inversion algorithm. It tests the ability to take facts that you already know (e.g. 2^8 or 2^10) and use them to solve a problem that might appear out of reach at first glance.

By @daemonologist - 5 months
Page appears to have been taken down, but is available on archive.org: https://web.archive.org/web/20240702162540/https://daoudclar...

By @philipwhiuk - 5 months
> you who ranks the search results

The page it actually links to says the curated rankings are used, like everywhere else, for machine learning:

> To train a learning to rank model. No matter how many queries are manually curated, most user queries will be organic because of the natural diversity of user queries. Curation is still important for these results since it impacts the machine learning model that will be trained on the curated rankings.

so this is not true in the long term.
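
For context, "learning to rank" means fitting a scoring model to human-ordered examples so it can rank results for queries nobody curated. Here's a minimal pairwise sketch in Python with scikit-learn; it illustrates the general technique under made-up features, and is not Mwmbl's actual pipeline:

```python
# Minimal pairwise learning-to-rank sketch (an illustration of the
# general technique only - not Mwmbl's actual pipeline; the features
# are made up).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-(query, result) features: [term-match score, domain quality].
# Rank position comes from human curation; lower = better.
curated = [
    (np.array([0.9, 0.8]), 0),
    (np.array([0.7, 0.3]), 1),
    (np.array([0.2, 0.5]), 2),
]

# Pairwise transform: train on feature differences, labelled 1 when
# the first result of the pair was curated above the second.
X, y = [], []
for fi, ri in curated:
    for fj, rj in curated:
        if ri != rj:
            X.append(fi - fj)
            y.append(1 if ri < rj else 0)

model = LogisticRegression().fit(np.array(X), np.array(y))

# At query time, score candidates with the learned weights and sort.
# This is what serves the "organic" queries no human ever curated.
candidates = np.array([[0.5, 0.9], [0.8, 0.1]])
scores = candidates @ model.coef_.ravel()
print(np.argsort(-scores))  # candidate indices, best first
```

This is why philipwhiuk's point holds: once most traffic is organic, the model's learned weights rank the results, and curation only influences them indirectly through the training data.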

By @dmitrygr - 5 months
Assuming the quotes are accurate, the interviewer was indeed being a bit of a dick, but being able to tell approximately how many bits a number needs is something I'd expect any programmer to be able to do, and I would also give negative feedback to someone who could not do that in an interview.

By @foota - 5 months
It's at the bottom of the article, but note that this interview experience is from 15 years ago.

By @eterm - 5 months
It wasn't Google, but last year I had the worst interview experience of my life, when I was berated for not being able to remember whether a System.Tick was 10 nanoseconds or 100 nanoseconds.

I remarked that in any circumstance where I'd need to know, I'd google it and check the documentation to make sure I got it right.

The interviewer (who I later found out was the founder/CEO) absolutely laid into me for that answer.

I tried to argue that I was looking to be employed for my problem-solving skills and experience rather than rote knowledge, but he was really angry. He literally said to me, verbatim, "Let me give you some interview advice: NEVER tell an interviewer you'd google something". He also made a mildly off-colour remark that if he "wanted someone just to google, [he] could hire one of thousands of fresh graduates coming out of India".

It was an experience so bad that it inspired me to create a glassdoor account just to leave negative feedback, something I've never done before or since. The recruiter was absolutely pissed, and still doesn't provide me leads, which is kind of annoying since he's the most active C#/.Net recruiter in my area.

But my point is that some people have absolutely atrocious interview manners. Interviews are a two-way street, and I discovered that there was absolutely no way I'd want to work with them. (Even when I just thought they were a team lead rather than the CEO, it was enough to put me off.)

By @Lockal - 5 months
I'm not sure how "need a few more to get 56, well 6 would be enough. So 26 bits?" counts as a solution.

If he remembers that max signed int is ~2 billion, then it's easier to divide 4 billion by 2 repeatedly: 2b/1b/500m/250m/125m/62.5m - that's 6 halvings, so 32-6=26 bits.

If you think that max int is irrelevant to the position: it is so relevant I can't even begin to describe it. This number is everywhere, from database design to js-wasm (limited to 32 bits), from deep learning (where some libraries are still limited to 32-bit buffers) to networking (hello, IPv4).
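
For what it's worth, the arithmetic is easy to sanity-check in a few lines of Python (assuming the number in the interview question was 56 million, which the quoted snippet suggests):

```python
# Sanity check of the bit count (assuming the interview number was
# 56 million, as the quoted "get 56 ... 6 would be enough" suggests).
n = 56_000_000

# Mental shortcut: 2^20 ~ 1 million, and 56 < 64 = 2^6, so 20 + 6 = 26 bits.
assert 2**25 < n < 2**26
print(n.bit_length())  # 26
```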

By @financltravsty - 5 months
Search engines are dying. Information retrieval and recommendation engines are still mostly living in the dark ages, despite all the work that's been done in the last 50 years.

Figure that problem out first (something novel and useful), then start marketing yourself.

Right now you just gave us a story we've all lived (academic hazing) without any plan of action -- so 2010.

By @1vuio0pswjnm7 - 5 months
When I try to use the provided URL, I get this:

      Sorry this page does not exist =(
Alternative:

https://cc.bingj.com/cache.aspx?d=4652446581392&w=-V-8V9bl07...

By @yashasolutions - 5 months
Competition is good. We need diverse search products again.

Kagi is great but more options would be good too.

OP's product is clearly at a very early stage. OP's post is also pretty opinionated.

Hard to say what impact that will have on the product - but as long as we have more options for search engines, this will be one of many.

By @swyx - 5 months
ah, Spite, the ultimate developer fuel.

By @daoudc - 5 months
I took the page down as it was attracting the wrong sort of attention. As some commenters surmised, the goal was to promote the search engine, but it wasn't working out that way...

By @joatmon-snoo - 5 months
It's really easy to read this as "shitty interviewer runs off good candidate".

It's also easy to read this as "interviewer hand-held a candidate through a problem".

By @bko - 5 months
Whenever I hear about alternative search engines, I try searching for a few famous people, hoping to see Wikipedia entries toward the top. And almost always I see nonsense.

For instance, if you search for 'Trump', the top links are

```

1. http://www.trump.de — found via Mwmbl -- Trump

2. https://itep.org/md/ — found via Mwmbl -- Trump Tax Proposals Would Provide Richest One Percent in Maryland with 69.7 Percent of the State’s Tax Cuts Earlier this year, the Trump administration r…

3. https://is.gd/mUHYTg — found via Mwmbl --- Trump embraces QAnon conspiracy because ‘they like me’ After skirting the issue for weeks, President Donald Trump offered an embrace Wednesday of the fri…

4. http://dict.cn/trump — found via Mwmbl -- trump是什么意思_trump在线翻译_英语_读音_用法_例句_海词词典

```

Surely there are millions of results more relevant to the phrase 'Trump' than trump.de. The other links aren't better. A random article from 2017? Another one from 2020. A Chinese dictionary definition of 'Trump'?

I get that search is hard, but what's going on here? You can try any phrase, and you just get weird results.

By @bentobean - 5 months
I sympathize with some of what the author has to say. That said, Google's choice to do business with Israel does not represent "support for genocide." It is also within their prerogative to dismiss employees who protest company policy.

Naive/biased statements such as these cause me to lend less credence to the author's other points.

By @derefr - 5 months
> It’s you who chooses what sites we crawl

Yeah, but you still reserve the right to not crawl sites (or to remove them from your index), yes? So there's still the opportunity to do evil.

I'm still waiting for a "raw" search spidering provider. One that:

1. runs a web-spidering cluster — one that's only smart enough to know what robots.txt is, to know how to follow links in HTML pages, and to obey response caching-policy headers;

2. captures the spidering process losslessly, as e.g. HAR transcript files;

3. packs those HAR transcript files, a few million at a time, into tar.xz.tar files (i.e. grab a "chunk" of N HAR files; group them into subdirs by request Host header; archive each subdir, and compress those archives independently; then archive all the compressed archives without compression) — and then uploads these semi-random-access archives to a CDN or private BitTorrent tracker (or any other data delivery system that enables clients to only retrieve the blocks/byte-ranges of files they're interested in);

4. generates a TOC for the semi-random-access files, as a stream of tuples (signed archive URL, chunk byte-range, hostname, compressed URL-list); pushes these to a managed reliable message queue on an IaaS, publishing each entry to both an all-hostnames topic and a per-hostname topic. (I say an IaaS, as this allows consumers to set up their own consumer-groups on these topics within their own IaaS project, and then pay the costs of message retention in these consumer-groups themselves.)

5. also buffers these TOC-entry streams into files (e.g. Parquet files), one archive series per topic, and hosts these alongside the HAR archives; prunes TOC topic stream entries once they are at least N days old AND have been successfully "offlined" into a hosted TOC-stream archive.

---

This "web-spidering-firehose data-lake as-a-Service" architecture, would enable pretty much anyone to build whatever arbitrary search index they want downstream of it, containing as much or as little of the web as they want — where each consumer only needs to do as much work as is required to fetch and parse the HARs of the domains they've decided they care about indexing something under.

This architecture would also be "temporal" (akin to a temporal RDBMS table) — as a consumer of this service, you wouldn't see "the current version" of a scraped URL, but rather all previous attempts to scrape that URL, and what happened each time. (This would mean that no website could ever censor the dataset retroactively by adding a robots.txt "Disallow *" after scrapes have already happened. Their robots.txt config would prevent further scraping, but previous scraping would be retained.)

And in fact, in this architecture, the HTTP interaction to retrieve /robots.txt for a domain would produce a HAR transcript that would get archived like any other. Domains restricted from crawling by robots.txt would still get regular HAR transcripts recorded of the result of checking that their /robots.txt still restricts crawling. (Reducing over these /robots.txt HAR transcripts is how a consumer-indexer would determine whether they should currently be showing/hiding a domain in their built index.)
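
To make the consumer side of this concrete, here's a hypothetical sketch (Python; all field names are mine, since the comment doesn't pin down a schema) of the TOC-entry tuple from step 4 and how a downstream indexer would filter it:

```python
# Hypothetical sketch of the TOC-entry tuple from step 4 and a consumer
# that fetches only the byte ranges it cares about. All names are made
# up; the comment above doesn't specify a schema.
from dataclasses import dataclass

@dataclass
class TocEntry:
    archive_url: str             # signed URL of the semi-random-access archive
    byte_range: tuple[int, int]  # (start, end) of this host's compressed chunk
    hostname: str                # request Host header the chunk was grouped by
    url_list: list[str]          # URLs whose HAR transcripts the chunk contains

def interesting(entry: TocEntry, wanted_hosts: set[str]) -> bool:
    """A consumer subscribed to the all-hostnames topic filters locally."""
    return entry.hostname in wanted_hosts

# A downstream indexer would then issue an HTTP Range request for just
# entry.byte_range against entry.archive_url, decompress that chunk,
# and parse the HAR transcripts inside - paying only for the domains
# it has decided to index.
```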

By @roschdal - 5 months
Good luck competing with the Alphabet monopoly. See Peter Thiel's books on monopoly.