Nowadays, Big Data and discussions regarding its analysis are trendy topics. Analysing huge arrays of diverse information yields all sorts of benefits. But before information can be analysed, it must be gathered. However, using leaked information (even if the leak was purely accidental, as happened in the case of Facebook) is never a good idea. Customers aren’t always willing to turn a blind eye to dubious information sources.
When it comes to collecting data on the World Wide Web, web crawlers and scrapers—special bots that index and extract information by imitating user behaviour—come to the rescue.
Modern websites usually consist of dynamic webpages whose appearance and behaviour depend on the client-side operating system, the browser, access permissions, possible restrictions imposed by local legislation, the plugins being used, etc. That's why simply going to a site and downloading all its content is not possible: scripts that arrange information in the browser window and perform other tasks must be executed at the user's end.
That's why user behaviour needs to be imitated. Moreover, it has to be the behaviour of a user who is actually working on that site. Why is this necessary? Well, how else can one download content that only becomes accessible after a password or a CAPTCHA code has been entered?
Search engine spiders operate in exactly the same way. They only differ in how law-abiding they are. For example, sites may use the robots.txt exclusion standard to specify which site sections aren't intended for the general public and shouldn't be indexed or appear in search results. But robots aren't obliged to comply with these rules of conduct. Unscrupulous scrapers can just ignore robots.txt and harvest any information, including private user data. And since many people use "12345678" as a password, this mission is not impossible.
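To illustrate what complying with robots.txt looks like on the crawler's side, here is a minimal sketch that uses Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `example-bot` and the `example.com` URLs are assumptions made up for the illustration; a real crawler would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
# A real crawler would fetch it from the site (set_url() followed by read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /accounts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("https://example.com/catalogue/1",
                "https://example.com/private/users"):
        verdict = "allowed" if is_allowed(parser, "example-bot", url) else "disallowed"
        print(url, "->", verdict)
```

Nothing in this check is enforced by the server: a scraper that never calls `is_allowed` will fetch the disallowed pages anyway, which is exactly the problem described above.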
The famous quote from Pirates of the Caribbean, “Take what you can, and give nothing back”, has become our motto.
Where can this be used? For example, a company may be eager to acquire the content of their competitor's online store and track changes in their goods catalogue and prices. What will the company gain? Well, if the competitor starts offering discounts, the company may set discounts of its own or launch a promo.
What else is required for data mining? A parsing service or a number of parsing applications are needed to break down large pieces of data into smaller portions.
- How long does it take to parse the content of one site?
- And how much does one pill cost?
It depends on how much content the site has and how promptly the server responds to queries. In our experience, it could take almost a week to thoroughly parse the data from one site. The contents of another site were parsed in 44 minutes and 10 seconds; 1,897 queries generated 1,550 entries.
And the most important thing:
And in conclusion, I'd like to say a few words about parsing in general and why you don't necessarily need to use Tor. Mining data is trendy and interesting. You can get yourself datasets that no one else has ever processed and discover something new and take a look at all the world’s memes at once. However, bear in mind that server restrictions, including incidents of clients getting banned, were introduced for a reason, specifically, to prevent DDoS attacks. Respect other people's labour. Even if a server uses no means of protection whatsoever, it doesn't mean that you should carpet-bomb it with queries, especially if it may render the server non-operational—crimes committed with no malicious intent are still punishable by law.
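One way to avoid carpet-bombing a server is to enforce a minimum pause between consecutive requests. The sketch below is only an illustration of that idea: the class name `PoliteThrottle` and the two-second interval are assumptions, and an appropriate delay depends on the site being visited.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = PoliteThrottle(min_interval=2.0)  # assumed value
    for page in range(3):
        throttle.wait()
        print("it is now polite to request page", page)
```

Calling `wait()` before every request keeps the load on the server flat no matter how quickly the rest of the parser runs.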
A site’s contents can be protected from web crawlers, although this is no simple task. For example, you can rename classes and variables in the site files with every update or identify spikes in the number of similar queries and block them.
- Protect your site from SQL injection attacks.
- Don't use a predictable pattern to generate folder names and file paths. For example, paths like /topic/11 and /topic/12 clearly suggest that more data can be extracted by submitting URLs containing similar strings.
- Use dynamic webpages, but be reasonable: search engine bots may fail to find the information you actually want to appear on the Web.
- Make sure the server doesn't accept too many search queries from one page, and restrict the duration and number of sessions per IP address or domain name.
- Use strong passwords and a CAPTCHA that is hard to break.
- Don't forget to check site logs for signs of intrusion.
- Restrict the number of IP addresses belonging to popular proxy servers that can be used to visit your site.
- If you detect suspicious activity, don't notify the respective "visitor" about it, and don't let them know how they've managed to expose themselves.
- Establish content usage rules, and employ competent lawyers.
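The advice above about restricting the number of queries per IP address can be sketched as a sliding-window counter. This is a minimal in-memory illustration rather than a production design: the class name `SlidingWindowLimiter`, the limit of 3 requests per 10 seconds and the sample IP addresses are all assumptions, and a real site would more likely rely on its web server or a shared store for this.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds (hypothetical helper)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                # injectable so the class can be tested
        self._hits = defaultdict(deque)    # client IP -> timestamps of accepted requests

    def allow(self, client_ip):
        """Return True if the request is accepted, False if the client is over the limit."""
        now = self._clock()
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)  # assumed limits
    for attempt in range(5):
        accepted = limiter.allow("203.0.113.5")  # documentation-range address
        print("attempt", attempt, "accepted" if accepted else "rejected")
```

In line with the earlier point about not tipping off the "visitor", a rejected request does not have to produce an explicit error; what the site returns in that case is a separate design decision.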
Don't assume that you are completely safe just because you've done everything by the book. It is quite possible that THEY have already come up with something new. Your team should never stop analysing visitor behaviour. Unfortunately, this takes time and costs money, but it is vitally important for any project.