WebSVN - LCARS - Path Comparison - /trunk Rev 12 and /trunk/ Rev 13

Regard whitespace Rev 12 → Rev 13

 /trunk/tools/network/news/newsstat/newsstat.pl
 ,80 → 1,35
 #!/usr/bin/env perl
 use strict;
 use warnings;
+use diagnostics;
 use utf8;
-use encoding 'utf-8';
 use Encode;
-###########################
-# newsstat.pl version 0.4.2
+## Print out all text to STDOUT UTF-8 encoded
+binmode STDOUT, ':encoding(UTF-8)';
-############################################################################
-# Collect statistics about a newsgroup (specified by first argument) in
-# the local news spool. Check all articles in the last 30-day period.
-# Rank posters by number of posts and by volume of posts, report on top and
-# bottom 20 posters. Show their name, number of posts, size of posts,
-# percentage of quoted lines. Rank user-agents used, by poster rather than
-# by post. Rank top 20 threads. Rank top 10 cross-posted groups.
-#
-# (Numbers and paths can be configured below.  -- PE)
-############################################################################
+############################
+## newsstat.pl version 0.4.3
-############################################################################
-#                       RECENT CHANGES                                     #
-# 2011-10-03  PE  - Use more compatible shebang
-#                 - Fixed some Perl::Critic-ized code
-#                 - Fixed wrong indent for non-ASCII names
-#                 - Formatted source code
-# 2011-07-03  PE  - Use Encode to decode/encode MIME encodings
-#                 - Use warnings, utf8 (just in case)
-#                 - Documentation update
-# N/A         NN  - Take newsgroup name as argument
-# 2004-06-19  NN  - newsgroup name is $ARGV[0]
-#                 - Allow command line flags for subtracting
-#                   output if not pertinent for a group
-# 2002-11-09  NN  - Put Garry's writedata() function back in.
-#                 - added "rn" to my list of UA's
-#                 - Started using %distinct_agent for both User agent
-#                   sections
-#                 - named it newsstat.pl version 0.3
-# 2002-11-06  NN  - Fixed the earliest/latest file problem by using
-#                   mtime rather than ctime, and simplifying the logic
-# 2002-11-05  NN  - moved user configurations to the top
-#                 - fixed the cross-posting section
-#                 - introduced the $newsgroup_name variable which
-#                   later becomes $news$group
-#                 - changed $name to $agent_name in countagents()
-#
-# Contributors
-# -------------
-# NN  Nomen nominandum (name to be determined later)
-# PE  Thomas 'PointedEars' Lahn <startrek@PointedEars.de>
+###########################################################################
+## Collect statistics about a newsgroup (specified by first argument)
+## in the local news spool. Check all articles in the last 30-day period.
+## Rank posters by number of posts and by volume of posts, report on top
+## and bottom 20 posters. Show their name, number of posts, size of posts,
+## percentage of quoted lines. Rank user-agents used, by poster rather
+## than by post. Rank top 20 threads. Rank top 10 cross-posted groups.
+##
+## Numbers and paths can be configured below.  See ChangeLog and TODO
+## for more.  -- PE
+###########################################################################
-########### TODO #############
-# Commas in bottom section of report
-# Show date the figures were compiled
-# No. of HTML articles (Content-Type: text/html)
-# No. of quoted sigs (/>\s*-- /)
-# Per cent of top-posted articles
-# Top 10 cross-posters
-# Top 20 news posting hosts (from Path)
-# Count of certain subject words: newbie, kde, burner, sendmail, etc.
-# Count *all* User Agents that each poster uses
-# What do we do about Bill Unruh's ] quote style?
-# Change the way dates/times are checked
-# include % share in posters by no. of arts
-# include % share in posters by size
-# Total, orig & quoted lines by user agent with per cent
-# Take more arguments
-#######################################################
 ###################### USER CONFIGURATIONS ############################
-# The name of the group to do stats for
+## The name of the group to do stats for
 my $newsgroup_name = $ARGV[0];
 $newsgroup_name or &usage;
-# Check for removal flags
+## Check for removal flags
 my $ix;
 my $j;
 my %skipSec;
 ,25 → 52,25
   $skipSec{$_} = 1;
 }
-# Leafnode users will want /var/spool/news for this variable.
+## Leafnode users will want /var/spool/news for this variable.
 my $news = "/var/spool/news/";
-# How many days are we doing statistics for?
+## How many days are we doing statistics for?
 my $numdays = 30;
-# no. of agents we list
+## Number of agents we list
 my $topagents = 10;
-# no. of threads we want to know about
+## Number of threads we want to know about
 my $topthreads = 20;
-# no. of top or bottom posters to show
+## Number of top or bottom posters to show
 my $topposters = 20;
-# no. of cross-posted threads to show
+## Number of cross-posted threads to show
 my $topcrossposts = 10;
-# no. of time zones to show
+## Number of time zones to show
 my $toptz = 10;
 ###################### DATA STRUCTURES ######################
 ,29 → 100,31
 my $replies   = 0; # total no. of replies
 my $i;             # general purpose
 my %distinct_agent;
-my %agents =       # used to hold counts of User Agents used
-  (
-  "KNode"                     => 0,
-  "Pan"                       => 0,
-  "Mozilla"                   => 0,
-  "Sylpheed"                  => 0,
-  "Gnus"                      => 0,
+## Used to hold counts of User Agents used
+my %agents = (
+  "Compuserver"               => 0,
+  "Foorum"                    => 0,
   "Forte Agent"               => 0,
   "Forte Free Agent"          => 0,
+  "Gnus"                      => 0,
+  "KNode"                     => 0,
+  "MacSOUP"                   => 0,
+  "MT-NewsWatcher"            => 0,
   "MicroPlanet Gravity"       => 0,
   "Microsoft Outlook Express" => 0,
-  "Xnews"                     => 0,
+  "Microsoft Windows Mail"    => 0,
+  "Mozilla"                   => 0,
+  "News Rover"                => 0,
+  "NN"                        => 0,
+  "Pan"                       => 0,
+  "rn"                        => 0,
   "slrn"                      => 0,
+  "Sylpheed"                  => 0,
   "tin"                       => 0,
-  "rn"                        => 0,
-  "NN"                        => 0,
-  "MacSOUP"                   => 0,
-  "Foorum"                    => 0,
-  "MT-NewsWatcher"            => 0,
-  "News Rover"                => 0,
+  "VSoup"                     => 0,
   "WebTV"                     => 0,
-  "Compuserver"               => 0,
-  "VSoup"                     => 0
+  "Xnews"                     => 0
   );
 ######################## MAIN CODE ########################
 ,17 → 140,17
   next if ( -M $filename > $numdays );     # only want articles <= a certain age
   $earliest = ( stat $filename )[9] unless defined($earliest);
   $latest   = ( stat $filename )[9] unless defined($latest);
-  &getarticle($filename);                  # read in the article
-  &getdata;                                # grab the data from the article
+  &get_article($filename);                 # read in the article
+  &get_data;                               # grab the data from the article
   $totalposts++;                           # bump count of articles considered
 }
 closedir(DIR);                             # finished with the directory
-# post-processing
-&countagents;    # count agents, collapsing versions
-&fixpercent;     # check percentages orig/total for posters
+## Post-processing
+&count_agents;    # count agents, collapsing versions
+&fix_percent;     # check percentages orig/total for posters
-&writedata;
+&write_data;
 #################### DISPLAY RESULTS #####################
 print "=" x 76, "\n";
 ,15 → 165,15
 printf "Latest article:   %s\n",               scalar localtime($latest);
 printf "Original articles: %s, replies: %s\n", commify($origposts),
   commify($replies);
-printf "Total size of posts: %s bytes (%sK) (%.2fM)\n", commify($totsize),
+printf "Total size of posts: %s bytes (%s KiB) (%.2f MiB)\n", commify($totsize),
   commify( int( $totsize / 1024 ) ), $totsize / 1048576;    #
-printf "Average %s articles per day, %.2f MB per day, %s bytes per article\n",
+printf "Average %s articles per day, %.2f MiB per day, %s bytes per article\n",
   commify( int( $totalposts / $numdays ) ), $totsize / $numdays / 1048576,
   commify( int( $totsize / $totalposts ) );
 my $count = keys %data;
-printf "Total headers: %s KB  bodies: %s KB\n",
+printf "Total headers: %s KiB  bodies: %s KiB\n",
   commify( int( $totheader / 1024 ) ), commify( int( $totbody / 1024 ) );
-printf "Body text - quoted: %s KB,  original: %s KB = %02.2f%%, sigs: %s KB\n",
+printf "Body text - quoted: %s KiB,  original: %s KiB = %02.2f%%, sigs: %s KiB\n",
   commify( int( $totquoted / 1024 ) ), commify( int( $totorig / 1024 ) ),
   ( $totorig * 100 ) / ( $totorig + $totquoted ),
   commify( int( $totsig / 1024 ) );
 ,12 → 182,12
 $count = keys %threads;
 printf "Total number of threads: %s, average %s bytes per thread\n",
   commify($count), commify( int( $totsize / $count ) );     #/
-printf "Total number of User-Agents: %d\n", scalar keys %agents;
+printf "Total number of user agents: %d\n", scalar keys %agents;
 print "\n", "=" x 76, "\n";
-###############################
-# show posters by article count  Sec 1;
-###############################
+########################################
+## Show posters by article count  Sec 1;
+########################################
 unless ( $skipSec{1} )
 {
   if ( keys %data < $topposters )
 ,9 → 212,9
   print "\n", "=" x 76, "\n";
 }
-################################
-# show posters by size in Kbytes Sec 2;
-################################
+######################################
+## Show posters by size in KiB  Sec 2;
+######################################
 unless ( $skipSec{2} )
 {
   if ( keys %data < $topposters )
 ,7 → 225,7
   {
     $count = $topposters;
   }
-  printf "%s\n", &centred( "Top $count posters by article size in Kbytes", 76 );
+  printf "%s\n", &centred( "Top $count posters by article size in KiB", 76 );
   print "=" x 76, "\n";
   $i = 0;
   foreach my $poster ( sort { $data{$b}{size} <=> $data{$a}{size} } keys %data )
 ,9 → 238,9
   print "\n", "=" x 76, "\n";
 }
-####################################
-# show top posters for original text
-####################################
+#####################################
+## Show top posters for original text
+#####################################
 unless ( $skipSec{3} )
 {
   if ( keys %data < $topposters )
 ,9 → 270,9
   print "\n", "=" x 76, "\n";
 }
-#######################################
-# show bottom posters for original text
-#######################################
+########################################
+## Show bottom posters for original text
+########################################
 unless ( $skipSec{4} )
 {
   if ( keys %data < $topposters )
 ,9 → 302,9
   print "\n", "=" x 76, "\n";
 }
-####################################
-# show threads by number of articles
-####################################
+#####################################
+## Show threads by number of articles
+#####################################
 unless ( $skipSec{5} )
 {
   if ( keys %threads < $topthreads )
 ,9 → 330,10
   }
   print "\n", "=" x 76, "\n";
 }
-################################
-# show threads by size in Kbytes
-################################
+##############################
+## Show threads by size in KiB
+##############################
 unless ( $skipSec{6} )
 {
   if ( keys %threads < $topthreads )
 ,7 → 344,7
   {
     $count = $topthreads;
   }
-  printf "%s\n", &centred( "Top $count threads by size in KB", 76 );
+  printf "%s\n", &centred( "Top $count threads by size in KiB", 76 );
   print "=" x 76, "\n";
   $i = 0;
   foreach my $thread (
 ,9 → 360,9
   print "\n", "=" x 76, "\n";
 }
-#################################
-# show top 10 cross-posted groups
-#################################
+##################################
+## Show top 10 cross-posted groups
+##################################
 unless ( $skipSec{7} )
 {
   delete $crossposts{"$newsgroup_name"};    # don't include ours
 ,9 → 386,10
   }
   print "\n", "=" x 76, "\n";
 }
-#######################
-#show agents and counts
-#######################
+#########################
+## Show agents and counts
+#########################
 unless ( $skipSec{8} )
 {
   if ( keys %agents < $topagents )
 ,7 → 413,7
 }
 #######################
-#show distinct agents
+## Show distinct agents
 #######################
 unless ( $skipSec{9} )
 {
 ,9 → 441,9
   print "\n", "=" x 76, "\n";
 }
-##########################
-#show timezones and counts
-##########################
+############################
+## Show timezones and counts
+############################
 unless ( $skipSec{10} )
 {
   if ( keys %tz < $toptz )
 ,15 → 467,15
 ################################ SUBROUTINES ################################
-#######################################
-# get current article's header and body
-#######################################
-sub getarticle
+########################################
+## Get current article's header and body
+########################################
+sub get_article
 {
   %headers = ();    # dump old headers
   my $filename = shift;    # get the name of the file
-  # get stats about the file itself
+  ## get stats about the file itself
   $filesize = -s $filename;    # get total size of file
   $totsize += $filesize;       # bump total sizes of all files
 ,13 → 489,13
     $latest = $mtime;
   }
-  # now read the file
-  open( my $FILE, $filename ) or die "Can't open $filename: $!\n";
+  ## now read the file
+  open( my $FILE, '<', $filename ) or die "Can't open $filename: $!\n";
   while (<$FILE>)
   {
     $totheader += length($_);    # bump total header size
     last if (/^\s*$/);           # end of header?
-    if (/^([^:\s]*):\s+(.*)/)
+    if (/^([^:\s]*):\s*(.*)/)
     {
       my ( $key, $val ) = ( $1, $2 );
       $headers{$key} = decode( 'MIME-Header', $val );
 ,17 → 504,27
   }
   @body = <$FILE>;               # slurp up body
   close($FILE);
-}    # getarticle
+}    # get_article
-###################################
-# get data from the current article
-###################################
-sub getdata
+####################################
+## Get data from the current article
+####################################
+sub get_data
 {
 #### First, analyse header fields ####
-  # Set up this poster if not defined, get counts, sizes
+  ## Set up this poster if not defined, get counts, sizes
   my $poster = $headers{From};    # get the poster's name
+  # Convert old to new format
+  $poster =~ s/^\s*(.+?\@.+?)\s*\((.+?)\)\s*$/$2 <$1>/;
+  # Collapse whitespace
+  $poster =~ s/\s+/ /;
+  # Remove outer quotes
+  $poster =~ s/^["'](.+?)["']\s+(.*)/$1 $2/;
   if ( !defined( $data{$poster} ) )
   {                                   # seen this one before?
     $data{$poster}{agent}  = 'Unknown';    # comes after For: field
 ,8 → 534,8
   $data{$poster}{count}++;                 # bump count for this poster
   $data{$poster}{size} += $filesize;       # total size of file
-  # The User-Agent and/or X-Newsreader fields
-  # for User-Agent by poster
+  ## The User-Agent and/or X-Newsreader fields
+  ## for User-Agent by poster
   if ( defined $lcheader{"user-agent"} )
   {
     $data{$poster}{agent} = $lcheader{"user-agent"};
 ,7 → 545,7
     $data{$poster}{agent} = $lcheader{"x-newsreader"};
   }
-  # The User Agent for User-Agent by number of posts
+  ## The User Agent for User-Agent by number of posts
   my $UA = "unknown";
   foreach my $keys ( keys %lcheader )
   {
 ,11 → 596,14
     if ( $raw =~ /^microsoft/i ) { $raw =~ s/-/ /g; }
     ## Pick out the popular agents
-    if ( $raw =~ /(outlook express)/i
+    if (
+           $raw =~ /(outlook express)/i
+        || $raw =~ /(windows mail)/i
       || $raw =~ /(microplanet gravity)/i
       || $raw =~ /(news rover)/i
       || $raw =~ /(forte agent)/i
-      || $raw =~ /(forte free agent)/i )
+        || $raw =~ /(forte free agent)/i
+      )
     {
       $agent = $1;
     }
 ,13 → 658,13
     return $agent;
   }
-  # Get all cross-posted newsgroups
+  ## Get all cross-posted newsgroups
   for ( split /,/, $headers{"Newsgroups"} )
   {
     $crossposts{$_}++;    # bump count for each
   }
-  # Get threads
+  ## Get threads
   my $thread = $headers{"Subject"};
   $thread =~ s/^re: //i;    # Remove Re: or re: at start
   $thread =~ s/\s+/ /g;     # collapse whitespace
 ,7 → 671,7
   $threads{$thread}{count} += 1;            # bump count of this subject
   $threads{$thread}{size}  += $filesize;    # bump bytes for this thread
-  # Is this an original post or a reply?
+  ## Is this an original post or a reply?
   if ( defined $headers{"References"} )
   {
     $replies++;
 ,9 → 681,9
     $origposts++;
   }
-  # Get the time zone
+  ## Get the time zone
   $_ = $headers{"Date"};
-  my ($tz) = /\d\d:\d\d:\d\d\s+(.*)/;
+  my ($tz) = /\d\d:\d\d(?::\d\d)?\s+(.*)/;
   if ( ( $tz =~ /UTC/ ) or ( $tz =~ /GMT/ ) or ( $tz =~ /0000/ ) )
   {
     $tz = "UTC";
 ,7 → 700,7
     {
       $totsig += length($_);    # bump total sig size
-      # Bill Unruh uses ] quotes, and another poster uses ::
+      ## Bill Unruh uses ] quotes, and another poster uses ::
     }
     elsif ( /^\s*[>\]]/ or /^\s*::/ )
     {                           # are we in a quote line?
 ,19 → 714,19
     else
     {
-      # we must be processing an original line
+      ## We must be processing an original line
       $data{$poster}{orig} += length($_);      # bump count of original chrs
       $totorig += length($_);
     }
   }    # end for (@body)
-}    # getdata
+}    # get_data
-########################################
-# Count the User-Agents used, collapsing
-# different versions into one per agent.
-########################################
-sub countagents
+#########################################
+## Count the User-Agents used, collapsing
+## different versions into one per agent.
+#########################################
+sub count_agents
 {
 POSTER:
   foreach my $poster ( keys %data )
 ,12 → 741,12
     }
     $agents{ $data{$poster}{agent} }++;
   }
-}    # countagents
+}    # count_agents
-############################################
-# set orig/total percentages for all posters
-############################################
-sub fixpercent
+#############################################
+## Set orig/total percentages for all posters
+#############################################
+sub fix_percent
 {
   foreach my $poster ( keys %data )
   {
 ,15 → 765,15
   }
 }
-##############################
-# right pad a string with '.'s
-##############################
+###############################
+## Right pad a string with '.'s
+###############################
 sub rpad
 {
-  # get text to pad, length to pad, pad chr
+  ## Get text to pad, length to pad, pad chr
   my ( $text, $pad_len, $pad_chr ) = @_;
-  ### DEBUG
+  ## DEBUG
 #  printf "|%s| = %d\n", $text, length($text);
   if ( length($text) > $pad_len )
 ,9 → 784,9
   return $padded;
 }
-#################
-# centre a string
-#################
+##################
+## Centre a string
+##################
 sub centred
 {
   my ( $text, $width ) = @_;    # text to centre, size of field to centre in
 ,25 → 795,24
   return $centred;
 }
-##########################
-# put commas into a number
-##########################
+###########################
+## Put commas into a number
+###########################
 sub commify
 {
-  $_ = shift;
-  1 while s/^(-?\d+)(\d{3})/$1,$2/;
+  local $_ = shift;
+  1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
   return $_;
 }
-#########################
-# clean
-#########################
+################################################################
+## Returns a string with leading and trailing whitespace removed
+################################################################
 sub clean
 {
   my $dirty = shift;
   my $clean = $dirty;
-  $clean =~ s/^\s*//;
-  $clean =~ s/\s*$//;
+  $clean =~ s/^\s*|\s*$//g;
   return $clean;
 }
 ,18 → 819,18
 sub usage
 {
   print "usage: newstat.pl newsgroupname\n";
   exit 1;
 }
-###################################
-# Write data structures to a file #
-###################################
-sub writedata
+##################################
+## Write data structures to a file
+##################################
+sub write_data
 {
-  open my $OUTF, ">/tmp/XDATA" or die "Can't create XDATA: $!\n";
-  print $OUTF "Data collected from alt.os.linux.mandrake\n\n";
+  open my $OUTF, ">:encoding(UTF-8)", "/tmp/XDATA"
+    or die "Can't create XDATA: $!\n";
+  print $OUTF "Data collected from $newsgroup_name\n\n";
   print $OUTF
     "Poster Data\nname : agent : count : size: orig : quoted : per cent\n";
   foreach my $name ( keys %data )
 ,7 → 857,7
   {
     print $OUTF "$name : $crossposts{$name}\n";
   }
-  print $OUTF print $OUTF
+  print $OUTF
 "============================================================================\n";
   print $OUTF "User agents\n";
   print $OUTF
 ,4 → 876,4
     print $OUTF "$name : $tz{$name}\n";
   }
   close $OUTF;
-}    # writedata
+}    # write_data
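
The two small helpers touched by this revision, commify() and clean(), can be tried in isolation. The following is a minimal standalone sketch, not a copy of the script: commify() follows the perlfaq5 idiom that the ChangeLog names, while clean() here uses `\s+` instead of the `\s*` seen in the diff, which is an assumption made so that the pattern never matches an empty string.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

## Put thousands separators into an integer (perlfaq5 idiom).
## `local $_' keeps the caller's $_ untouched.
sub commify
{
  local $_ = shift;
  1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
  return $_;
}

## Return a copy of the string without leading and trailing whitespace.
## Variant of the diff's pattern: \s+ rather than \s* (assumption, see above).
sub clean
{
  my $text = shift;
  $text =~ s/^\s+|\s+$//g;
  return $text;
}

print commify(1234567), "\n";
print commify(-1234),   "\n";
print '[', clean("  alt.test \n"), "]\n";
```

The `1 while` loop reruns the substitution until no group of three trailing digits is left to split off, so each pass inserts one separator from the right.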

 /trunk/tools/network/news/newsstat/ChangeLog
 ,30 → 1,64
-############################################################################
-#                       RECENT CHANGES                                     #
-# 2011-10-03  PE  - Use more compatible shebang
-#                 - Fixed some Perl::Critic-ized code
-#                 - Fixed wrong indent for non-ASCII names
-#                 - Formatted source code
-# 2011-07-03  PE  - Use Encode to decode/encode MIME encodings
-#                 - Use warnings, utf8 (just in case)
-#                 - Documentation update
-# N/A         NN  - Take newsgroup name as argument
-# 2004-06-19  NN  - newsgroup name is $ARGV[0]
-#                 - Allow command line flags for subtracting
-#                   output if not pertinent for a group
-# 2002-11-09  NN  - Put Garry's writedata() function back in.
-#                 - added "rn" to my list of UA's
-#                 - Started using %distinct_agent for both User agent
-#                   sections
-#                 - named it newsstat.pl version 0.3
-# 2002-11-06  NN  - Fixed the earliest/latest file problem by using
-#                   mtime rather than ctime, and simplifying the logic
-# 2002-11-05  NN  - moved user configurations to the top
-#                 - fixed the cross-posting section
-#                 - introduced the $newsgroup_name variable which
-#                   later becomes $news$group
-#                 - changed $name to $agent_name in countagents()
-#
-# Contributors
-# -------------
-# NN  Nomen nominandum (name to be determined later)
-# PE  Thomas 'PointedEars' Lahn <startrek@PointedEars.de>
+Changelog
+==========
+-10-04  PE
+  - Added diagnostics (just in case)
+  - Use `binmode STDOUT' instead of `use encoding' (compat.)
+  - Documentation update, moved changelog and TODO to files
+  - `##' for leading comments to handle dev artifacts better
+  - Sorted supported newsreaders alphabetically
+  - Added support for Microsoft Windows Mail (OE successor)
+  - Use uniform sub identifiers (words delimited with `_')
+  - Use ISO/IEC units of data storage (KiB, MiB) uniformly
+  - Whitespace after a header field's `:' is optional now,
+    see RFC 5536, section 2.2 ("MAY")
+  - Convert old `From' format to new one, collapse whitespace,
+    remove outer ("protocol") quotes
+  - Seconds are optional in `Date' header field values now,
+    see grammar in RFC 5322, section 3.3 (ref. by RFC 5536, 2.2)
+  - commify() adapted to perlfaq5
+  - clean(): Simplified whitespace stripping
+  - write_data(): writes XDATA using UTF-8, removed bogus print()
+  - Fixed all Perl::Critic-ized code except nested get_agent()
+2011-10-03  PE
+  - Use more compatible shebang
+  - Fixed some Perl::Critic-ized code
+  - Fixed wrong indent for non-ASCII names
+  - Formatted source code
+2011-07-03  PE
+  - Use Encode to decode/encode MIME encodings
+  - Use warnings, utf8 (just in case)
+  - Documentation update
+N/A         NN
+  - Take newsgroup name as argument
+2004-06-19  NN
+  - newsgroup name is $ARGV[0]
+  - Allow command line flags for subtracting
+    output if not pertinent for a group
+2002-11-09  NN
+  - Put Garry's writedata() function back in.
+  - added "rn" to my list of UA's
+  - Started using %distinct_agent for both User agent
+    sections
+  - named it newsstat.pl version 0.3
+2002-11-06  NN
+  - Fixed the earliest/latest file problem by using
+    mtime rather than ctime, and simplifying the logic
+2002-11-05  NN
+  - moved user configurations to the top
+  - fixed the cross-posting section
+  - introduced the $newsgroup_name variable which
+    later becomes $news$group
+  - changed $name to $agent_name in countagents()
+Contributors
+-------------
+NN  Nomen nominandum (name to be determined later)
+PE  Thomas 'PointedEars' Lahn <startrek@PointedEars.de>
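
The header-parsing changes listed in the ChangeLog (optional whitespace after `:', conversion of the old `From' format, optional seconds in `Date') can be exercised outside the script. The sketch below reuses the regular expressions from the diff with three differences, each of them an assumption of this example: the sub names parse_header_line(), normalize_from() and time_zone() are hypothetical and do not exist in newsstat.pl; the header name pattern uses `+' rather than `*' so that an empty name is rejected; and the whitespace collapse carries the /g modifier, whereas the diff's `s/\s+/ /' replaces only the first run.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

## Split a header line into (name, value).
## Whitespace after the colon is optional.
sub parse_header_line
{
  my $line = shift;
  return $line =~ /^([^:\s]+):\s*(.*)/ ? ( $1, $2 ) : ();
}

## Normalise a From value to the form "Name <address>".
sub normalize_from
{
  my $poster = shift;

  ## Old format "address (Name)" becomes "Name <address>"
  $poster =~ s/^\s*(.+?\@.+?)\s*\((.+?)\)\s*$/$2 <$1>/;

  ## Collapse every run of whitespace (note the /g)
  $poster =~ s/\s+/ /g;

  ## Remove quotes around the display name
  $poster =~ s/^["'](.+?)["']\s+(.*)/$1 $2/;
  return $poster;
}

## Extract the zone part of a Date value; seconds are optional.
sub time_zone
{
  my $date = shift;
  my ($tz) = $date =~ /\d\d:\d\d(?::\d\d)?\s+(.*)/;
  return $tz;
}

my ( $name, $value ) = parse_header_line('Subject:Test');
print "$name => $value\n";
print normalize_from('jane@example.org (Jane Doe)'),      "\n";
print normalize_from('"Jane  Doe" <jane@example.org>'),  "\n";
print time_zone('Tue, 04 Oct 2011 12:34 +0200'),         "\n";
```

Normalising the `From' value before it is used as a hash key means that one poster who appears in both address formats is counted once rather than twice.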
