Kaffis Mark V wrote:
Now, what will it take to get these calculations (once the algorithms have been tweaked to whoever's satisfaction) to be re-calculated on demand and included in profile pages?
Hm... I'm not sure it can be automated from this side. At least, not with my poor bash scripting and sed skills. There are definite limitations to parsing the HTML output from phpbb versus just running queries against the actual database.
What I did was something like this:
Code:
wget -O in25.txt 'http://gladerebooted.org/viewforum.php?f=2&start=25'
egrep 'topicdetails">[^>]+>[^<]+<\/a>' in25.txt | \
sed 's/^.*>\([^<]\+\)<\/a.*$/\1/' > out25.txt
wget -O in50.txt 'http://gladerebooted.org/viewforum.php?f=2&start=50'
egrep 'topicdetails">[^>]+>[^<]+<\/a>' in50.txt | \
sed 's/^.*>\([^<]\+\)<\/a.*$/\1/' > out50.txt
[...]
cat out* > all.txt
rm in* out*
sort all.txt | uniq -c | sort -rn
The grep and sed stuff is just a regex search-and-replace. It's not as nasty as it looks. The grep statement finds and outputs all lines that look like this:
Quote:
topicdetails"><a [blah, blah, blah]>Stathol</a>
Sed then processes that list and replaces each entire line with just the last poster's name (Stathol, in this example).
After dumping the list of last posters for each page into its own outXXX.txt file, all of them are globbed together with cat. The list is sorted alphabetically, then handed off to uniq, which counts the number of duplicate lines, prepends that count to each line, and removes the duplicates. The final sort is optional - it just puts the whole thing into reverse numerical order (highest kills first).
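For anyone unfamiliar with uniq -c, here's a toy version of that last pipeline (the names are made up) just to show the output format:

```shell
# Stand-in for all.txt: three "last poster" lines, one name repeated
printf 'Stathol\nKaffis\nStathol\n' | sort | uniq -c | sort -rn
# Prints "2 Stathol" (with leading spaces) above "1 Kaffis" -
# count first, then the name, highest kills at the top.
```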
There are several problems with this:
- The wget and egrep statements were generated "in line" by Excel because I suck and couldn't be bothered to learn rudimentary Bash control structures. This could easily be done with a 'for' loop, I'm sure, but the problem is knowing when to stop. As with the Excel approach, the only thing I know to do is manually check the last page in each forum to figure out what the max value should be for 'start='. Since this is a moving target, it presents an automation problem. There might be a way to infer it from the thread count on the main board index, though.
- There's no automated way to tell if a forum is a PbP forum. The forum IDs (f=) are intermixed. You could always use a static list, but it would have to be updated every time a forum was created or destroyed.
- Offhand, I don't know of any way to automatically ignore the announcement threads using only grep and sed. Both pattern match on a line-by-line scope, and the line of text in the HTML that includes the last poster's name doesn't contain anything that would allow you to distinguish between the two. This was a manual adjustment that I had to make for each forum based on which announcements were visible. It can differ from forum to forum. Michael's announcement in General is not global like the other two, for instance; it only exists in the General forum.
- Correlating the post counts to my list of thread-killers had to be done manually as well. I actually did the data entry manually too, but it would be possible to automate that part of it. The trick would be correlating a list of all users and their postcounts to a list of some users and their thread kill counts. This is really a job for a database engine rather than a bash script or an Excel spreadsheet.
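For what it's worth, the 'for' loop version isn't bad. Here's a rough sketch - the forum ID list and the max 'start=' value are placeholders that, per the problems above, would still have to be found and maintained by hand:

```shell
#!/bin/bash
# Rough sketch only. The forum list and page limit are placeholders;
# both still have to be checked/maintained manually (see above).

# Pull the last-poster names out of one page of viewforum HTML (stdin).
extract_posters() {
    egrep 'topicdetails">[^>]+>[^<]+</a>' |
    sed 's/^.*>\([^<]\+\)<\/a.*$/\1/'
}

forums="2"      # placeholder: static list of PbP forum IDs
max_start=50    # placeholder: checked manually against the last page

for f in $forums; do
    for ((start=0; start<=max_start; start+=25)); do
        wget -q --tries=1 --timeout=5 -O - \
            "http://gladerebooted.org/viewforum.php?f=${f}&start=${start}" |
        extract_posters >> all.txt
    done
done

sort all.txt | uniq -c | sort -rn
```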
The straightest path through the mud would be to build some SQL queries that you could run directly against the phpbb database. This would give you a much cleaner and easier way to do all of the above. For instance, all of the PbP forums are in their own section, which has some kind of implication in the database. I don't know what, exactly, but it's something that you could build a query around.
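As a sketch of what that might look like - with the big caveat that I'm guessing at the phpBB table and column names (phpbb_topics.topic_last_poster_name and phpbb_forums.parent_id are assumptions, and the parent forum ID is a placeholder):

```shell
# Write the query to a file; run it later with something like
#   mysql -u phpbb -p phpbb_database < threadkills.sql
# Every table/column name below is assumed, not verified against phpBB.
cat > threadkills.sql <<'EOF'
-- Count last posts ("thread kills") per user across every forum
-- under the PbP section. parent_id = 5 is a placeholder.
SELECT t.topic_last_poster_name AS killer, COUNT(*) AS kills
FROM   phpbb_topics t
JOIN   phpbb_forums f ON f.forum_id = t.forum_id
WHERE  f.parent_id = 5
GROUP  BY t.topic_last_poster_name
ORDER  BY kills DESC;
EOF
```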
Edit:
Of course, someone will surely come along and say, "Oh, that's easy in Python! Just 'import parsePhpbbThreadKills'!"