Spam removal on marc.info

Posted: Thu Mar 19, 2015 2:33 am
by Thrawn
If anyone is in the habit of browsing around mail archives on marc.info, I've drafted a quick surrogate for wiping out spam entries:

Code:

function is_spam(link) {
  return /\[spam\]|\(no subject\)|[0-9]usd|affor[dt]able|attire|accessories|barite|belle|cheapest|detergent|dress size|factory|infrared|new.*invitation|promotion|secrets|sunglasses|(ceiling|flood|panel|spot) ?light|lighting|\bled\b|\btoys?\b|^=/i.test(link.innerHTML);
}

function delete_around(link) {
  for (i = 0; i < 2; i++) {
    link.parentNode.removeChild(link.previousSibling); 
  }
  for (i = 0; i < 3; i++) {
    link.parentNode.removeChild(link.nextSibling); 
  }
  link.parentNode.removeChild(link);
}

var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
  var link = links[i];
  if (is_spam(link)) {
    delete_around(link);
  }
}
I suggest using a 'sources' value of

Code:

!@marc.info
You might notice that it runs multiple passes until it gets them all. I'm not sure exactly why it misses some the first time, but if anyone can figure it out, great.

Re: Spam removal on marc.info

Posted: Thu Mar 19, 2015 5:46 am
by barbaz
Thrawn wrote:You might notice that it runs multiple passes until it gets them all. I'm not sure exactly why it misses some the first time, but if anyone can figure it out, great.
Can you please provide a link where it doesn't get them in one pass? (in code tags please)
Otherwise we can only guess..

Does this version get them all?

Code:

function is_spam(link) {
  return /\[spam\]|attire|sunglasses|detergent|belle|dress size|[0-9]usd|^=/i.test(link.innerHTML);
}

function delete_around(link) {
  var i; // declared locally so the loops don't create an implicit global
  for (i = 0; i < 2; i++) {
    link.parentNode.removeChild(link.previousSibling);
  }
  for (i = 0; i < 3; i++) {
    link.parentNode.removeChild(link.nextSibling);
  }
  link.parentNode.removeChild(link);
}

window.addEventListener('load', function(){
  var links = document.getElementsByTagName('a');
  for (let link of links) {
    if (is_spam(link)) {
      delete_around(link);
    }
  }
}, false);

Re: Spam removal on marc.info

Posted: Fri Mar 20, 2015 12:40 am
by Thrawn
Nope, that version doesn't work either when there are multiple consecutive spam entries...

Code:

http://marc.info/?l=haproxy&r=1&b=201503&w=2
There are still some missed spam entries there, but it's significantly cleaned up. The mailing list itself catches a lot of them and flags them, which is the primary benefit of the script.

Re: Spam removal on marc.info

Posted: Fri Mar 20, 2015 12:59 am
by barbaz
OK, I think I've found the problem. Try this version (it should also zap an additional class of spam links):

Code:

function is_spam(link) {
  return /\[spam\]|\(no subject\)|attire|accessories|sunglasses|detergent|belle|dress size|[0-9]usd|affor[dt]able|cheapest|promotion|secrets|new.*invitation|^=/i.test(link.innerHTML);
}

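// Remove a spam link together with the 2 nodes before it and the 3 nodes after it.
function delete_around(link) {
  var i; // declared locally so the loops don't create or clobber a global i
  for (i = 0; i < 2; i++) {
    link.parentNode.removeChild(link.previousSibling);
  }
  for (i = 0; i < 3; i++) {
    link.parentNode.removeChild(link.nextSibling);
  }
  link.parentNode.removeChild(link);
}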

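window.addEventListener('load', function(){
  // getElementsByTagName returns a live list, so collect the spam links first...
  var links = document.getElementsByTagName('a');
  var toZap = [];
  for (let link of links) {
    if (is_spam(link)) {
      toZap.push(link);
    }
  }
  // ...and only then remove them, iterating over the cached array instead.
  for (let a of toZap) {
    delete_around(a);
  }
}, false);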
Basically, you've seen why it's bad practice to modify the thing you're iterating over while you're iterating over it.

Re: Spam removal on marc.info

Posted: Fri Mar 20, 2015 3:31 am
by barbaz
barbaz wrote:Basically, you've seen why it's bad practice to modify the thing you're iterating over while you're iterating over it.
Wow that was a vague statement.. Try again

Here's what I think was the problem. You are iterating over a live list of all the a's in the document, but you are removing some of them as you iterate. The list mutates under you and nothing is cached, so when a link is removed, every later link shifts down by one index. The "real" next item ends up at the index of the item you just removed, the loop then moves on to the following index, and that item is skipped.
For this reason, MDN warns against adding or deleting from the thing you're iterating over while iterating over it.
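
To make the index shift concrete, here is a minimal sketch that uses a plain array as a stand-in for the live list. That is an assumption for illustration: a real live list shrinks by itself when a node is removed from the document, which `splice` imitates here. The function name `removeWhileIterating` and the sample subjects are made up.

```javascript
// Stand-in for a live list: removing an item shifts every later item down by one.
function removeWhileIterating(items, isSpam) {
  var list = items.slice(); // copy, so the caller's array is left untouched
  for (var i = 0; i < list.length; i++) {
    if (isSpam(list[i])) {
      list.splice(i, 1); // the next item now sits at index i...
    }
    // ...and i++ moves past it without ever testing it
  }
  return list;
}

var subjects = ['[spam] a', '[spam] b', 'real message', '[spam] c'];
var left = removeWhileIterating(subjects, function (s) {
  return s.indexOf('[spam]') === 0;
});
console.log(left); // the second of the two consecutive spam entries survives
```

With two spam entries in a row, the second one slides into the slot that was just vacated and never gets checked, which matches the "multiple consecutive spam entries" case.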

My latest modification to your surrogate makes sure that nothing is mutated as it's being iterated over - first cache the list of links to zap, then iterate over this cache to actually zap them.
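
The same idea as a rough sketch outside the browser, again with a plain array standing in for the live list (an assumption, as are the name `removeInTwoPhases` and the sample data):

```javascript
// Two-phase removal: collect the matches first, then remove them.
function removeInTwoPhases(items, isSpam) {
  var list = items.slice(); // stand-in for the live list
  var toZap = [];
  for (var i = 0; i < list.length; i++) {
    if (isSpam(list[i])) {
      toZap.push(list[i]); // phase 1: read only, never mutate
    }
  }
  toZap.forEach(function (item) {
    // phase 2: mutate the list while iterating over the snapshot instead
    list.splice(list.indexOf(item), 1);
  });
  return list;
}

var sampleSubjects = ['[spam] a', '[spam] b', 'real message', '[spam] c'];
var cleaned = removeInTwoPhases(sampleSubjects, function (s) {
  return s.indexOf('[spam]') === 0;
});
console.log(cleaned); // only 'real message' is left
```

In a page, `Array.from(document.getElementsByTagName('a'))` takes a comparable snapshot in one step. Walking the live list backwards from the last index is another common way around the shift.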

The reason you didn't see any errors is probably that the length of the list keeps changing. A for loop that checks against the length on every iteration is chasing a moving target: the length always reflects the current state of the list, so the loop never reads past the end.
(I'm not sure why the for (let .. of ..) loop didn't error there. Similar idea?)
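
A small sketch of the difference, under the same assumption that a plain array can stand in for the live list (`countUndefinedReads` and its `cacheLength` flag are hypothetical names):

```javascript
// Count how often the loop reads past the end of a list that shrinks under it.
function countUndefinedReads(items, cacheLength) {
  var list = items.slice();
  var fixedLength = list.length; // length as it was before any removal
  var undefinedReads = 0;
  for (var i = 0; i < (cacheLength ? fixedLength : list.length); i++) {
    if (list[i] === undefined) {
      undefinedReads++; // with a real link, reading link.innerHTML would throw here
      continue;
    }
    if (list[i].indexOf('[spam]') === 0) {
      list.splice(i, 1);
    }
  }
  return undefinedReads;
}

var entries = ['[spam] a', '[spam] b', 'real message', '[spam] c'];
var withLiveLength = countUndefinedReads(entries, false);
var withCachedLength = countUndefinedReads(entries, true);
console.log(withLiveLength, withCachedLength); // 0 1
```

Re-reading the length each time means the loop simply stops early instead of failing. A `for...of` loop over a plain array also re-checks the length on every step, so it skips items quietly in the same way.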

Anyway, hope that helps.

Re: Spam removal on marc.info

Posted: Fri Mar 20, 2015 4:01 am
by Thrawn
barbaz wrote: Wow that was a vague statement.. Try again
:D Don't worry, I understood it. The problem doesn't surprise me, either.
barbaz wrote: You are iterating over a live list
That part is interesting. I hadn't realised that interactions with the document return live lists. I'm primarily a Java programmer.

Anyway, latest script works, thanks :) (and you might have noticed I added some patterns).

Re: Spam removal on marc.info

Posted: Fri Mar 20, 2015 5:16 am
by barbaz
Thrawn wrote:
barbaz wrote: Wow that was a vague statement.. Try again
:D Don't worry, I understood it.
Not just for you but for anyone else who might read this thread ;)
Thrawn wrote:
barbaz wrote: You are iterating over a live list
That part is interesting. I hadn't realised that interactions with the document return live lists.
Most do but some don't. document.querySelectorAll(), for instance, returns a static list rather than a live one.
Thrawn wrote:Anyway, latest script works, thanks :) (and you might have noticed I added some patterns).
You're welcome. Didn't catch that you had tweaked the spam regex until well after I had posted the mod :?