10% of the Top Million Sites Are Dead
Craig Campbell's research suggests that roughly 10% of the top million sites are dead. He found data issues in the Majestic Million dataset that warrant caution: 10.7% of its domains were unreachable, casting doubt on the list's reliability. Campbell suggests exploring alternative top-domain lists for comparison.
According to Craig Campbell's research, 10% of the top million sites are dead. He analyzed the Majestic Million dataset, which ranks websites by the number of links pointing to them. Campbell found data issues in the dataset and stressed the importance of verifying information before use. He also discussed challenges with domain normalization: domains with and without the www prefix were not handled consistently. Campbell then checked whether the top sites responded to HTTP requests. The results showed that 10.7% of the domains were unreachable, raising concerns about the quality of the list. While acknowledging possible benign explanations for the connectivity failures, Campbell expressed doubts about the dataset's reliability and suggested investigating alternative top-domain lists for comparison. He shared a CSV file containing the HTTP response codes for anyone interested in exploring the data further.
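The two steps described above (collapsing www/non-www duplicates, then probing each domain over HTTP and recording the response code) could be sketched roughly as follows. This is a minimal illustration, not Campbell's actual script; the function names and the simple "strip a leading www." normalization rule are assumptions for the example.

```python
import urllib.request
import urllib.error

def normalize(domain):
    """Collapse 'www.example.com' and 'example.com' into one entry.
    (A hypothetical handling of the www-prefix inconsistency.)"""
    return domain[4:] if domain.startswith("www.") else domain

def check(domain, timeout=5.0):
    """Send an HTTP GET to the domain and return the status code,
    or None if the domain is unreachable."""
    try:
        with urllib.request.urlopen(f"http://{domain}", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # the server answered, even if with 4xx/5xx
    except (urllib.error.URLError, OSError):
        return None    # DNS failure, connection refused, timeout, ...
```

Collecting `(domain, check(domain))` pairs into a CSV would yield a file like the response-code dump Campbell shared; domains where `check` returns `None` correspond to the unreachable 10.7%.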
Related
How I scraped 6 years of Reddit posts in JSON
The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. Pushshift is suggested for Reddit archives. Extracting URLs and checking website status are explained. Findings reveal that 40% of the sites are inactive, and trends in online startups are discussed.
Many website admins have yet to get the memo to remove Polyfill.io links
More than 384,000 websites linked to a code library involved in a supply-chain attack by a Chinese firm. Industry responses included domain suspensions and ad blocks. Over 1.6 million sites linked to potentially malicious domains. The incident highlights supply-chain attack risks.
384k sites pull code from sketchy code library recently bought by Chinese firm
Over 384,000 websites linked to a code library in a supply-chain attack by a Chinese firm. Altered JavaScript code redirected users to inappropriate sites. Industry responses included suspensions and replacements.