Spam removal on marc.info

Proposals for new surrogate scripts, updates/bug fixes to existing ones, tips and tricks to work around the lazy web.
Thrawn
Master Bug Buster
Posts: 3106
Joined: Mon Jan 16, 2012 3:46 am
Location: Australia

Post by Thrawn »

If anyone is in the habit of browsing around mail archives on marc.info, I've drafted a quick surrogate for wiping out spam entries:

Code: Select all

function is_spam(link) {
  return /\[spam\]|\(no subject\)|[0-9]usd|affor[dt]able|attire|accessories|barite|belle|cheapest|detergent|dress size|factory|infrared|new.*invitation|promotion|secrets|sunglasses|(ceiling|flood|panel|spot) ?light|lighting|\bled\b|\btoys?\b|^=/i.test(link.innerHTML);
}

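function delete_around(link) {
  // Declare i locally so it cannot clobber a loop counter outside this function.
  var i;
  // Remove the two nodes before the link and the three after it, then the link itself.
  for (i = 0; i < 2; i++) {
    link.parentNode.removeChild(link.previousSibling);
  }
  for (i = 0; i < 3; i++) {
    link.parentNode.removeChild(link.nextSibling);
  }
  link.parentNode.removeChild(link);
}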

var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
  var link = links[i];
  if (is_spam(link)) {
    delete_around(link);
  }
}
I suggest using a 'sources' value of

Code: Select all

!@marc.info
You might notice that it runs multiple passes until it gets them all. I'm not sure exactly why it misses some the first time, but if anyone can figure it out, great.
Last edited by Thrawn on Wed Sep 16, 2015 3:54 am, edited 9 times in total.
Reason: add more spam patterns
======
Thrawn
------------
Religion is not the opium of the masses. Daily life is the opium of the masses.

True religion, which dares to acknowledge death and challenge the way we live, is an attempt to wake up.
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0
barbaz
Senior Member
Posts: 10847
Joined: Sat Aug 03, 2013 5:45 pm

Post by barbaz »

Thrawn wrote:You might notice that it runs multiple passes until it gets them all. I'm not sure exactly why it misses some the first time, but if anyone can figure it out, great.
Can you please provide a link where it doesn't get them in one pass? (in code tags please)
Otherwise we can only guess..

Does this version get them all?

Code: Select all

function is_spam(link) {
  return /\[spam\]|attire|sunglasses|detergent|belle|dress size|[0-9]usd|^=/i.test(link.innerHTML);
}

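function delete_around(link) {
  // Declare i locally so it cannot clobber a loop counter outside this function.
  var i;
  // Remove the two nodes before the link and the three after it, then the link itself.
  for (i = 0; i < 2; i++) {
    link.parentNode.removeChild(link.previousSibling);
  }
  for (i = 0; i < 3; i++) {
    link.parentNode.removeChild(link.nextSibling);
  }
  link.parentNode.removeChild(link);
}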

window.addEventListener('load', function(){
  var links = document.getElementsByTagName('a');
  for (let link of links) {
    if (is_spam(link)) {
      delete_around(link);
    }
  }
}, false);
*Always* check the changelogs BEFORE updating that important software!
-

Post by Thrawn »

Nope, that version doesn't work either when there are multiple consecutive spam entries...

Code: Select all

http://marc.info/?l=haproxy&r=1&b=201503&w=2
There are still some missed spam entries there, but it's significantly cleaned up. The mailing list itself catches a lot of them and flags them as [spam], and removing those flagged entries is the primary benefit of the script.

Post by barbaz »

OK, I think I've found the problem. Try this version (it should also zap an additional class of spam links):

Code: Select all

function is_spam(link) {
  return /\[spam\]|\(no subject\)|attire|accessories|sunglasses|detergent|belle|dress size|[0-9]usd|affor[dt]able|cheapest|promotion|secrets|new.*invitation|^=/i.test(link.innerHTML);
}

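function delete_around(link) {
  // Declare i locally so it cannot clobber a loop counter outside this function.
  var i;
  // Remove the two nodes before the link and the three after it, then the link itself.
  for (i = 0; i < 2; i++) {
    link.parentNode.removeChild(link.previousSibling);
  }
  for (i = 0; i < 3; i++) {
    link.parentNode.removeChild(link.nextSibling);
  }
  link.parentNode.removeChild(link);
}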

window.addEventListener('load', function(){
  var links = document.getElementsByTagName('a');
  var toZap = [];
  for (let link of links) {
    if (is_spam(link)) {
      toZap.push(link);
    }
  }
  for (let a of toZap) {
    delete_around(a);
  }
}, false);
Basically, you've seen why it's bad practice to modify the thing you're iterating over while you're iterating over it.
Last edited by barbaz on Sat Apr 11, 2015 8:35 pm, edited 2 times in total.
Reason: Add more spam patterns [pulled from OP]

Post by barbaz »

barbaz wrote:Basically, you've seen why it's bad practice to modify the thing you're iterating over while you're iterating over it.
Wow that was a vague statement.. Try again

Here's what I think was the problem. You are iterating over a live list of all the a's in the document, but you are removing some of them as you iterate. The list mutates during the loop and nothing is cached, so when the current link is removed, the next link shifts down into the index the loop has just finished with, and the JS engine steps past it without ever testing it. That matches what you saw with multiple consecutive spam entries.
For this reason, MDN warns against adding or deleting from the thing you're iterating over while iterating over it.
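
A minimal sketch of the skipping effect, assuming a plain array can stand in for the live list of links. The `splice()` call imitates a link dropping out of the list when it is removed from the document; the names `items` and `visited` are made up for the illustration.

```javascript
// Plain array standing in for the live list of links (assumption: splice()
// imitates what happens to the list when a matching link is removed).
var items = ['spam1', 'spam2', 'ok1', 'spam3', 'spam4', 'ok2'];
var visited = [];

for (var i = 0; i < items.length; i++) {
  visited.push(items[i]);
  if (items[i].indexOf('spam') === 0) {
    // Removing index i shifts every later element down by one, so the
    // element that was at i + 1 now sits at i and the i++ steps past it.
    items.splice(i, 1);
  }
}

console.log(visited); // [ 'spam1', 'ok1', 'spam3', 'ok2' ]
console.log(items);   // [ 'spam2', 'ok1', 'spam4', 'ok2' ]
```

The second entry of each consecutive spam pair is never looked at, so it survives the pass.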

My latest modification to your surrogate makes sure that nothing is mutated as it's being iterated over - first cache the list of links to zap, then iterate over this cache to actually zap them.
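
A minimal sketch of the cache-then-zap idea, assuming a plain array can stand in for the document's list of links. The names `items` and `toZap` here are only illustrative, and `splice()` plays the role of removing a link.

```javascript
// Plain array standing in for the list of links (assumption for illustration).
var items = ['spam1', 'spam2', 'ok1', 'spam3', 'spam4', 'ok2'];

// Pass 1: only read the list and remember what should go.
var toZap = [];
for (var i = 0; i < items.length; i++) {
  if (items[i].indexOf('spam') === 0) {
    toZap.push(items[i]);
  }
}

// Pass 2: iterate over the separate cache, which nothing modifies,
// and remove each remembered entry from the original list.
for (var j = 0; j < toZap.length; j++) {
  items.splice(items.indexOf(toZap[j]), 1);
}

console.log(items); // [ 'ok1', 'ok2' ]
```

Because the second loop walks `toZap` rather than `items`, removals can no longer shift entries out from under the loop counter.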

The reason you didn't see any errors is probably that the length of the list changes along with its contents. A for loop that re-checks the length on every pass is comparing against a moving target that always reflects the current state of the list, so the index never runs past the end.
(I'm not sure why the for (let .. of ..) loop didn't error there. Similar idea?)

Anyway, hope that helps.

Post by Thrawn »

barbaz wrote: Wow that was a vague statement.. Try again
:D Don't worry, I understood it. The problem doesn't surprise me, either.
You are iterating over a live list
That part is interesting. I hadn't realised that interactions with the document return live lists. I'm primarily a Java programmer.

Anyway, latest script works, thanks :) (and you might have noticed I added some patterns).

Post by barbaz »

Thrawn wrote:
barbaz wrote: Wow that was a vague statement.. Try again
:D Don't worry, I understood it.
Not just for you but for anyone else who might read this thread ;)
Thrawn wrote:
You are iterating over a live list
That part is interesting. I hadn't realised that interactions with the document return live lists.
Most do but some don't. document.querySelectorAll() doesn't, for instance: it returns a static NodeList.
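
The difference between a live list and a static one can be modelled without a browser. This is a hypothetical model and not the DOM API: `nodes` plays the document, `makeLiveList()` plays the role of getElementsByTagName(), and `snapshot()` plays the role of querySelectorAll().

```javascript
// Hypothetical model, not the DOM: `nodes` stands in for the document.
var nodes = ['a', 'p', 'a', 'a'];

function makeLiveList(tag) {
  // Every read goes back to `nodes`, so the view always reflects its current state.
  return {
    get length() {
      return nodes.filter(function (n) { return n === tag; }).length;
    }
  };
}

function snapshot(tag) {
  // A one-off copy taken at call time; later changes to `nodes` do not reach it.
  return nodes.filter(function (n) { return n === tag; });
}

var live = makeLiveList('a');
var fixed = snapshot('a');

nodes.splice(0, 1); // "remove" the first link from the document

console.log(live.length);  // 2
console.log(fixed.length); // 3
```

The live view shrinks as soon as the node goes away, while the snapshot keeps its original three entries.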
Thrawn wrote:Anyway, latest script works, thanks :) (and you might have noticed I added some patterns).
You're welcome. Didn't catch that you had tweaked the spam regex until well after I had posted the mod :?