
How To Filter Out Useless Footprints To Massively Improve Your Target Scraping Speed!

Over the past few weeks, I have seen an increasing number of people on the various internet marketing forums mentioning that they are scraping their own targets with GSA Search Engine Ranker rather than using a premium list. When done correctly, this can be a solid strategy for building out a verified list.

Over the past few years I have flip-flopped between scraping my own list and using a premium list a number of times, depending on what I required at the time. Even when using a premium list, it is still a good idea to supplement the contextual targets it provides with your own scraping, provided you have the resources available.

One mistake I see people making time and time again when scraping their own targets is to just take the default footprints from SER, load them into a scraper and press start. Unfortunately, many of the default footprints are a total waste of your time. In this post, I am going to go over why you should filter your footprints and exactly how I do it when I am scraping my own list.

What Exactly Is A Footprint

Every content management system out there has a number of repeating words and phrases built into its pages, URLs, or code. We can scrape the internet for these words and phrases to build lists of pages that contain them, maximising the chance of getting a hit on the specific content management system we wish to target.

Drupal is one of my favorite content management systems to target with GSA Search Engine Ranker, as the tool’s Drupal – Blog engine can provide a large number of links. I am going to use four footprints from the default Drupal footprint list that comes with the tool and explain my thoughts on each. The footprints are:

  • “Powered by Drupal”
  • inurl:”/node/1″ “You are here”
  • “This question is for testing whether you are a human visitor and to prevent automated spam submissions”
  • “Anmelden oder Registrieren um Kommentare zu schreiben” “Startseite Ÿ Weblogs”

The first is the “Powered by Drupal” footprint. As you can probably guess, it is added by default by many Drupal themes to the site’s footer to show the site is powered by Drupal. The screenshot below shows an example of this footprint.

Powered By Drupal Footer

Now, let’s run a quick Google search on the footprint and go over the information its search results provide. Below is a screenshot of the results from Google at the time of writing; if you want to see the most up-to-date results, then click here.

Powered By Drupal Footprint Google search

  1. The footprint we have used as the search query. We put the footprint into quotation marks to force Google to only return results that have the words in that exact order. If we ran the search query without the quotation marks, our results would contain many useless targets containing the words in any order, provided they were in close proximity to each other, such as in the example blog extract below.
  2. The total number of results Google currently holds in its index matching the search query. When I ran the search, Google held 4,190,000 pages matching the query. When the volume is this large, the number of pages bouncing in and out of the index will change massively every day, but from that number alone most would think it is a good footprint to scrape with.
  3. The first four results are from the exact same domain, in this case drupal.org.
  4. The first domain not related to the content management system itself, and even then it is a site talking about how to remove the footer to stop people scraping their site with this footprint.

In my opinion, this is far too broad a footprint. Even though it has over 4 million potential targets, I would not use it, as there is just too much potential noise to process. It is also very short, meaning the phrase can appear on unrelated pages in the correct order and end up generating false positives.

Next, we will look at the inurl:”/node/1″ “You are here” footprint. As you can see, this footprint targets the way the Drupal content management system structures the URLs of the pages on a domain. Below is a screenshot of the results from Google at the time of writing; if you want to see the most up-to-date results, then click here.

Second Drupal footprint search results.

As you can see, it still returns 225,000 results for the query, and all of the visible domains are different from each other. Personally, I would keep this footprint and test it on a live run to see how it performs, then make a final decision based on its URL yield.

Although the third footprint is meant to keep automated tool users off a site, it actually attracts more of them. The footprint “This question is for testing whether you are a human visitor and to prevent automated spam submissions” is added to a page when the default Drupal captcha is used. Below is a screenshot of the results from Google at the time of writing; if you want to see the most up-to-date results, then click here.

Third drupal footprint search results

As you can see, this footprint returns 2,580,000 results. Although its first few results are from the same domain, I would still move forward with the footprint, as you should expect the official website of the content management system we are targeting to have questions about it.

Additionally, it is seventeen words long, meaning it is far less likely that people will type it in general conversation in their posts and generate false positives. On top of that, many users of the content management system may leave this footprint in place, thinking they are protected by the captcha it provides, not knowing that most tools have an over 50% success rate on the Drupal captcha type these days.
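One rough heuristic, if you want to spot these short, false-positive-prone footprints in bulk, is to sort your extracted footprint file by word count. The Python sketch below is a minimal illustration of that idea; footprints.txt is a hypothetical file holding one footprint per line.

```python
# Rough heuristic: sort footprints by visible word count so the very short,
# false-positive-prone ones (such as "Powered by Drupal") float to the top.
# Assumes one footprint per line in footprints.txt (a hypothetical file name).

def word_count(footprint: str) -> int:
    # Strip quotes and common search operators so only the visible words count.
    cleaned = footprint.replace('"', ' ').replace('intitle:', ' ').replace('inurl:', ' ')
    return len(cleaned.split())

with open("footprints.txt", encoding="utf-8") as handle:
    footprints = [line.strip() for line in handle if line.strip()]

for footprint in sorted(footprints, key=word_count):
    print(f"{word_count(footprint):>2} words - {footprint}")
```

Anything in the one-to-three word range is worth a manual sanity check before you spend scraping time on it.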

Why Should I Bother Filtering My Keywords

The final query is a good example of why you should filter your keywords. It is a non-English footprint, “Anmelden oder Registrieren um Kommentare zu schreiben” “Startseite Ÿ Weblogs” (German for “Log in or register to write comments” “Home Ÿ Weblogs”). Below is a screenshot of the results from Google at the time of writing; if you want to see the most up-to-date results, then click here.

Drupal fourth footprint

As you can see, this footprint returns a grand total of six results. It is an excellent example to show that not all footprints included in SER are worth your time. However, many people seem to think that if they add a keyword to the search query then these results will increase.

Fourth drupal footprint with keyword

As you can see from the screenshot above, this is not the case. Adding a keyword to the footprint actually reduces the number of results returned, as it forces the results to match additional criteria.

Imagine if you had a list of 100,000 keywords merged with this footprint. That is 100,000 search queries to run for a grand total of six potential domains; all of that time and those resources would be totally wasted.
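To make the waste concrete, here is a minimal Python sketch of the arithmetic. The figures are the hypothetical ones from the example above (plus the 70-second per-proxy limit discussed later in the post), not measured data.

```python
# Back-of-the-envelope cost of merging a keyword list with a low-volume footprint.
# The numbers mirror the hypothetical example above, not real measurements.

keywords = 100_000        # size of the keyword list
footprints = 1            # the single low-volume German footprint
indexed_results = 6       # pages Google reported for that footprint

queries_to_run = keywords * footprints
print(f"Search queries to run: {queries_to_run:,}")        # 100,000
print(f"Best possible unique targets: {indexed_results}")  # 6

# At roughly one query per proxy every 70 seconds (see the proxy section later),
# a single proxy would need about this long to work through the queue:
hours_on_one_proxy = queries_to_run * 70 / 3600
print(f"Hours on one proxy: {hours_on_one_proxy:,.0f}")    # ~1,944 hours
```

Run the same numbers against a high-volume footprint and the imbalance disappears, which is exactly why the result counts matter.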

How Do I Filter My Footprints

My own method has two stages. The first is to go over all of the footprints for the engines I plan to use and get the number of results held in the Google index for each footprint, to see if it has the volume to make it worth my time. Yes, this will take you a crazy amount of time if you do it manually, but I will show you an easy way to automate it later in the post.

The second stage is to complete a process similar to the one I went through for the footprints covered in this post. Use your head and think of reasons why you may or may not want to keep a footprint for scraping, for example being too broad to be worth your time, like the “Powered by Drupal” footprint.

Where Do I Find Footprints For Scraping

GSA Search Engine Ranker comes with a built-in set of footprints for its default engines that users are able to extract from the footprint studio to use when scraping. If you are using GSA Search Engine Ranker itself to scrape targets, then you aren’t required to extract the footprints at all; simply enabling the engine in the scraping project will have the tool automatically use them as it gathers targets.

If you want to use an external tool such as Scrapebox to scrape for targets, then you will have to open the tool’s footprint studio via the navigation shown in the screenshot below, found in the advanced tab of the options menu.

GSA Search Engine Ranker Footprint Studio

Completing that navigation brings up the tool’s footprint studio, as shown in the screenshot below.

GSA Search Engine Ranker Footprint Studio Open

  1. Engine Selection
  2. Engine Footprints

Simply select the engine you wish to extract the default footprints for, then select the footprints in the footprint pane, right-click and copy them, and paste them into a text file for safekeeping. For this next example, I am going to pull the footprints for some of my favorite engines: Drupal – Blog, BuddyPress, WordPress, MacOSWiki and Joomla.

Below are the default footprints I extracted for each of these engines.

DRUPAL BLOG

“Create new account Log in Request new password”
“Posted by Anonymous (not verified)”
inurl:”/node/1″ “You are here”
inurl:”/node/2″ “You are here”
inurl:”/node/3″ “You are here”
“Login to post new content in the forum.” “Powered by Drupal”
“Startseite Ÿ Weblogs Ÿ Weblog von”
“Submitted by” “Login or register to post comments” “Search this site:”
“Submitted by Anonymous” “Login or register to post comments”
“Anmelden oder Registrieren um Kommentare zu schreiben” “Startseite Ÿ Weblogs”
“Home › Blogs” “Login or register to post comments”
“About the Author” “Recent posts” “Add new comment”
“The content of this field is kept private and will not be shown publicly”
“Provide a password for the new account in both fields”
“This question is for testing whether you are a human visitor and to prevent automated spam submissions”
“More information about text formats”
“Notify me when new comments are posted”
“Lines and paragraphs break automatically”
“To prevent automated spam submissions leave this field empty”
“Your virtual face or picture”
“This question is for testing whether or not you are a human visitor and to prevent automated spam submissions”
“Notify me of new posts by email”
“More information about formatting options”
“Provide a password for the new account in both fields Password must be at least”
“Navigation” “Add content” “User login” “Username” “Password” “Request new password” “Submitted by Anonymous”
“ajouter un commentaire” “powered by drupal”
“ajouter un commentaire” inurl:node
“ajouter un commentaire” inurl:content

BUDDYPRESS

“Proudly powered by WordPress and BuddyPress”
“Public Group” “Popular Search Terms” “Recent Search Terms”
Home Members RSS “created the group” “Please create an account to get started.”
inurl:”/activity/p/”
“Sign Up” “Arclite theme by digitalnature”
“became a registered member”
“Group Admins” “Public Group” “created the group”

WORDPRESS

“Additional Articles From”
“Do not submit articles filled with spelling errors and bad grammar”
“If you have hired a ghost writer, you agree that you have”
“Powered by WordPress ¾ Using Article Directory plugin”
“Publish your article in RSS format for other websites to syndicate”
“registered authors in our article directory”
“RSS Articles” “RSS comments” “Recent Articles”
“RSS Articles” “RSS comments” “Recent Articles” “Authorization” “Username:” “Password:” “Remember Me” “Register” “Lost your password?”
“There are * published articles and * registered authors in our article directory.”
“There are * published articles and * registered authors”
“This author has published * articles so far. More info about the author is coming soon.”
“Using Article Directory plugin”
“Welcome to article directory *. Here you can find interesting and useful information on most popular themes.”
“Powered by WordPress + Article Directory plugin”
inurl:”/wp-login.php?action=register”
“This entry was posted in Uncategorized by” “Bookmark the permalink.”

JOOMLA

“Additional Articles From”
“Do not submit articles filled with spelling errors and bad grammar”
“If you have hired a ghost writer, you agree that you have”
“Powered by WordPress ¾ Using Article Directory plugin”
“Publish your article in RSS format for other websites to syndicate”
“registered authors in our article directory”
“RSS Articles” “RSS comments” “Recent Articles”
“RSS Articles” “RSS comments” “Recent Articles” “Authorization” “Username:” “Password:” “Remember Me” “Register” “Lost your password?”
“There are * published articles and * registered authors in our article directory.”
“There are * published articles and * registered authors”
“This author has published * articles so far. More info about the author is coming soon.”
“Using Article Directory plugin”
“Welcome to article directory *. Here you can find interesting and useful information on most popular themes.”
“Powered by WordPress + Article Directory plugin”
inurl:”/wp-login.php?action=register”
“This entry was posted in Uncategorized by” “Bookmark the permalink.”

MACOSWIKI

“The wiki, blog, calendar, and mailing list”
“Log in to my page” “wikis”
inurl:groups “log in to my page”
“updates” “wikis” “blogs” “calendar” “mail”
“Mac OS X Server – Wikis”
“first” “prev” “1-20 of” “next” inurl:groups
“What’s Hot” “Recent Changes”
“What’s Hot” “Recent Changes” “Upcoming Events”
“What’s Hot” “Recent Changes” “Upcoming Events” “Tags”
“What’s Hot” “Recent Changes” “Upcoming Events” “Tags” “Edited”
intitle:”Mac OS X Server”
“Collaborate with online document creation, editing, and comments.”

Automating The Google Queries

If you are just starting out and don’t have much money to invest in tools, then you can do this step manually to reduce costs; although it is easy to complete, it is very monotonous and boring. Personally, I use Scrapebox with its free Google Competition Finder addon to automate the process for me. Essentially, this addon automates the exact task we require: it queries Google and pulls the total number of results for each footprint.

To set it up, open Scrapebox, click the Addons tab at the top of the tool, select Show Available Addons and select the Google Competition Finder. Once it has downloaded, make sure you have added proxies to the tool and add your footprints to the keywords field as shown in the screenshot below.

Scrapebox footprints in the keywords field

Then open the Google Competition Finder addon by clicking the Addons tab and selecting it. In the bottom left of the Google Competition Finder’s window, select Import Keywords and then select Import from Scrapebox.

I set my connections to one, as it makes it easier to work out the delay you require. At the time of writing, a proxy can be used to query Google with these footprints about once every 70 seconds; this is because many of the footprints contain advanced search operators such as intitle: and inurl:, meaning Google will soft-ban the IPs of people running many simultaneous searches with them. This means we want to set our delay so each of our proxies will query Google a maximum of once every 70 seconds, then press the “Start/Recheck” button.
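If you ever want to script this check yourself rather than use the addon, the part that matters is the same per-proxy rate limit. The sketch below is a minimal, hypothetical illustration of that throttling: the proxy list is a placeholder, fetch_result_count is left unimplemented because the actual result-count retrieval is up to you (Scrapebox handles it for you), and the 70-second cooldown matches the figure above.

```python
import time
from itertools import cycle

PROXIES = ["proxy1:8080", "proxy2:8080", "proxy3:8080"]  # placeholder proxy list
COOLDOWN = 70  # seconds each proxy should rest between queries, per the text above

def fetch_result_count(footprint: str, proxy: str) -> int:
    """Placeholder: query Google through `proxy` and parse the result count.
    Left unimplemented here because the parsing details change frequently."""
    raise NotImplementedError

last_used = {proxy: 0.0 for proxy in PROXIES}

def check_footprints(footprints):
    counts = {}
    for footprint, proxy in zip(footprints, cycle(PROXIES)):
        # Wait until this proxy has rested for the full cooldown period.
        wait = COOLDOWN - (time.time() - last_used[proxy])
        if wait > 0:
            time.sleep(wait)
        last_used[proxy] = time.time()
        try:
            counts[footprint] = fetch_result_count(footprint, proxy)
        except Exception:
            counts[footprint] = None  # recheck failures later, like the 503s below
    return counts
```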

Scrapebox Google Competition Finder Footprint Check

As you can see, the tool then queries Google and saves each footprint’s competition results for you. You will always get some Error 503s from Google not liking your proxies, but once the run is complete, simply press the “Start/Recheck” button and it will recheck the footprints with errors for you.

Below are the complete footprint results from the run.

DRUPAL BLOG

8140 – “Create new account Log in Request new password”
38600 – “Posted by Anonymous (not verified)”
226000 – inurl:”/node/1″ “You are here”
66700 – inurl:”/node/2″ “You are here”
115000 – inurl:”/node/3″ “You are here”
38600 – “Login to post new content in the forum.” “Powered by Drupal”
16 – “Startseite Ÿ Weblogs Ÿ Weblog von”
220000 – “Submitted by” “Login or register to post comments” “Search this site:”
15400 – “Submitted by Anonymous” “Login or register to post comments”
16 – “Anmelden oder Registrieren um Kommentare zu schreiben” “Startseite Ÿ Weblogs”
187000 – “Home › Blogs” “Login or register to post comments”
29300 – “About the Author” “Recent posts” “Add new comment”
3450000 – “The content of this field is kept private and will not be shown publicly”
150000 – “Provide a password for the new account in both fields”
1680000 – “This question is for testing whether you are a human visitor and to prevent automated spam submissions”
4150000 – “More information about text formats”
314000 – “Notify me when new comments are posted”
5440000 – “Lines and paragraphs break automatically”
1080000 – “To prevent automated spam submissions leave this field empty”
29200 – “Your virtual face or picture”
2840000 – “This question is for testing whether or not you are a human visitor and to prevent automated spam submissions”
22400000 – “Notify me of new posts by email”
544000 – “More information about formatting options”
67300 – “Provide a password for the new account in both fields Password must be at least”
11200 – “Navigation” “Add content” “User login” “Username” “Password” “Request new password” “Submitted by Anonymous”
1150 – “ajouter un commentaire” “powered by drupal”
544000 – “ajouter un commentaire” inurl:node
208000 – “ajouter un commentaire” inurl:content

BUDDYPRESS

245000 – “Proudly powered by WordPress and BuddyPress”
18000 – “Public Group” “Popular Search Terms” “Recent Search Terms”
19100 – Home Members RSS “created the group” “Please create an account to get started.”
871000 – inurl:”/activity/p/”
3340 – “Sign Up” “Arclite theme by digitalnature”
1060000 – “became a registered member”
167000 – “Group Admins” “Public Group” “created the group”

WORDPRESS

144000 – “Additional Articles From”
1860 – “Do not submit articles filled with spelling errors and bad grammar”
2980 – “If you have hired a ghost writer, you agree that you have”
84 – “Powered by WordPress ¾ Using Article Directory plugin”
794 – “Publish your article in RSS format for other websites to syndicate”
43500 – “registered authors in our article directory”
14800 – “RSS Articles” “RSS comments” “Recent Articles”
2400 – “RSS Articles” “RSS comments” “Recent Articles” “Authorization” “Username:” “Password:” “Remember Me” “Register” “Lost your password?”
1380000 – “There are * published articles and * registered authors in our article directory.”
3 – “There are * published articles and * registered authors”
11500000 – “This author has published * articles so far. More info about the author is coming soon.”
12100 – “Using Article Directory plugin”
1930000 – “Welcome to article directory *. Here you can find interesting and useful information on most popular themes.”
1030 – “Powered by WordPress + Article Directory plugin”
32000 – inurl:”/wp-login.php?action=register”
1040000 – “This entry was posted in Uncategorized by” “Bookmark the permalink.”

JOOMLA

144000 – “Additional Articles From”
1850 – “Do not submit articles filled with spelling errors and bad grammar”
2990 – “If you have hired a ghost writer, you agree that you have”
83 – “Powered by WordPress ¾ Using Article Directory plugin”
808 – “Publish your article in RSS format for other websites to syndicate”
43600 – “registered authors in our article directory”
14800 – “RSS Articles” “RSS comments” “Recent Articles”
2250 – “RSS Articles” “RSS comments” “Recent Articles” “Authorization” “Username:” “Password:” “Remember Me” “Register” “Lost your password?”
1420000 – “There are * published articles and * registered authors in our article directory.”
3 – “There are * published articles and * registered authors”
11500000 – “This author has published * articles so far. More info about the author is coming soon.”
12100 – “Using Article Directory plugin”
1930000 – “Welcome to article directory *. Here you can find interesting and useful information on most popular themes.”
1020 – “Powered by WordPress + Article Directory plugin”
32100 – inurl:”/wp-login.php?action=register”
1040000 – “This entry was posted in Uncategorized by” “Bookmark the permalink.”

MACOSWIKI

2500 – “The wiki, blog, calendar, and mailing list”
46400 – “Log in to my page” “wikis”
3090 – inurl:groups “log in to my page”
1800000 – “updates” “wikis” “blogs” “calendar” “mail”
7260 – “Mac OS X Server – Wikis”
4530 – “first” “prev” “1-20 of” “next” inurl:groups
179000 – “What’s Hot” “Recent Changes”
12300 – “What’s Hot” “Recent Changes” “Upcoming Events”
7350 – “What’s Hot” “Recent Changes” “Upcoming Events” “Tags”
3990 – “What’s Hot” “Recent Changes” “Upcoming Events” “Tags” “Edited”
25900 – intitle:”Mac OS X Server”
5100 – “Collaborate with online document creation, editing, and comments.”

As you can see from the results for the 80 footprints above:

  • Six footprints had results between 0-100.
  • Two footprints had results between 101-1000.
  • Seventeen footprints had results between 1001-10,000.

That means that 25 out of the 80 default footprints have fewer than 10,000 total results; that is almost one third! You may wish to scrape with these footprints once a week or once a month to pick up any new targets in the Google index, but if you are scraping with them every day, just think of all the time and resources you are wasting, especially if you are merging them with keywords!

I would then go over the footprints with large target volumes to see if I felt they were too broad or useless, like the “Powered by Drupal” footprint from earlier. The footprints with a large number of targets will have pages dancing in and out of the index every day, making it beneficial to scrape these more often.

How Do I Remove Unwanted Footprints

If you are using a tool such as Scrapebox, simply remove them from the text file you use to hold your footprints and the tool will not load them anymore.
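If you have already collected the result counts, a short script can make that cut for you. The sketch below is a minimal example, assuming a hypothetical file named footprint_counts.txt where each line pairs a count with a footprint, roughly like the results listed above; the 10,000 threshold is simply the cut-off discussed earlier.

```python
# Minimal sketch: drop footprints whose Google result count is under a threshold.
# Assumes a hypothetical file footprint_counts.txt with one "count - footprint"
# pair per line, e.g.:
#   8140 - "Create new account Log in Request new password"
# Writes the survivors to footprints_filtered.txt for your scraper to load.

THRESHOLD = 10_000  # the cut-off discussed in the post

kept, dropped = [], []
with open("footprint_counts.txt", encoding="utf-8") as handle:
    for line in handle:
        line = line.strip()
        if not line:
            continue
        count_part, _, footprint = line.partition(" - ")
        try:
            count = int(count_part.replace(",", ""))
        except ValueError:
            continue  # skip lines that don't match the expected format
        (kept if count >= THRESHOLD else dropped).append(footprint.strip())

with open("footprints_filtered.txt", "w", encoding="utf-8") as handle:
    handle.write("\n".join(kept))

print(f"Kept {len(kept)} footprints, dropped {len(dropped)} low-volume ones.")
```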

If you are using GSA Search Engine Ranker to scrape, then open the tool’s footprint studio as shown earlier, navigate to the engine whose footprint you wish to remove, select the footprint and select delete, as shown in the screenshot below.

GSA Search Engine Ranker Purge Bad Footprints

When To Scrape With Footprints

Although I don’t want to go into it too much here, as I have a full scraping tutorial planned, I will cover it quickly. Merging footprints and keywords together does have its place in a specific strategy, but for the most part it is a waste of time if you are just looking for mass links, as the contextual articles touching your money site will create a fresh page on the domain whose content is controlled by you, enabling you to make it niche relevant.

If, for example, you wish to target relevant blogs with high metrics, then you would merge all of the blog comment footprints with relevant keywords to ensure the targets you scrape have the potential to be relevant to the site you are wishing to promote.
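As a quick illustration of that merge, the sketch below combines every footprint with every keyword into one query per pair. The file names and example keywords are hypothetical; any scraper that accepts a plain list of queries could consume the output.

```python
# Minimal sketch: merge blog comment footprints with niche keywords into
# one search query per pair. File names and keywords are placeholders.
from itertools import product

with open("blog_comment_footprints.txt", encoding="utf-8") as handle:
    footprints = [line.strip() for line in handle if line.strip()]

keywords = ["dog training", "puppy crate training", "dog obedience"]  # example niche

queries = [f"{footprint} {keyword}" for footprint, keyword in product(footprints, keywords)]

with open("merged_queries.txt", "w", encoding="utf-8") as handle:
    handle.write("\n".join(queries))

print(f"{len(footprints)} footprints x {len(keywords)} keywords = {len(queries)} queries")
```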

Wrapping It Up

It doesn’t matter if you are scraping for bulk targets or high-metric, niche-relevant targets; removing footprints that are not very common frees up your time and resources to scrape higher-volume footprints that have a much better chance of providing results.

I know this post was long but I hope it has helped people understand what footprints are, what makes them useful and how you can filter them to improve your results.
