Showing posts with label rosettacode.

Tuesday, January 26, 2010

Graphic background doubles nicely as a desktop background

Multilingual Background

This is one of the images I came up with as a component for artwork on fundraising items. However, I happen to think it makes an awesome background image. It's made for tiling, not stretching. Enjoy.

(Sorry, I'm keeping the 300dpi version for myself...)

Tuesday, December 22, 2009

Techtalk Tuesday: Nexus bot

This is an idea I've been chewing on for a while now. Simply put, it's a chat room bridge, but it's not quite that simple.

Normally, I sit in rosettacode, haskell, proggit, perl, tcl and any number of other channels, ready to offer insight or assistance, or even just observe, if someone mentions Rosetta Code. What Would Be Nice is if I could just sit in rosettacode and let a bot handle the rest for me.

The general sequence might go as follows:

proggit - * soandso thinks Rosetta Code needs better algorithms
rosettacode - * soandso thinks Rosetta Code needs better algorithms
proggit - <jubilee> soandso: What are you looking for?
rosettacode - <#jubilee> soandso: What are you looking for?
rosettacode - <shortcircuit> nexusbot: soandso: Did you see this category? (some url)
proggit - <#shortcircuit> soandso: Did you see this category? (some url)

Nexusbot has to perform several complicated behaviors there, so let's look at them.

First:
* soandso thinks Rosetta Code needs better algorithms

"Rosetta Code" matches one of nexusbot's highlight rules for forwarding to rosettacode, so nexusbot relays the message to rosettacode, thinks of it as a "connection", and associates soandso as a primary for that connection, with a most recent related activity timestamp attached to his association with the connection.

Next:
<jubilee> soandso: What are you looking for?

soandso is associated with a current connection (and that association hasn't timed out), and jubilee just said something to him. nexusbot associates jubilee with soandso and, through soandso, with the relay to rosettacode. jubilee is attached to the relay with his own related-activity timestamp, copied from soandso's.

rosettacode - <shortcircuit> nexusbot: soandso: Did you see this category? (some url)

shortcircuit addresses nexusbot, and indicates he's addressing soandso through nexusbot. Nexusbot sees that soandso is associated with a connection between rosettacode and proggit, associates shortcircuit with that connection (along with a recent activity timestamp), and passes shortcircuit's message along to proggit.

Each time someone triggers a highlight, they're considered a primary for the connection that highlight creates (or would create, if one already exists), and their "recent related activity" timestamp is updated. Each time someone talks to a primary for a connection, they're also associated with the connection, and their "recent related activity" timestamp is set to the primary's.

Whenever a primary or secondary talks, their communications are relayed across the connection, but their recent-related-activity (RRA) timestamps are not updated.

When a primary's RRA grows old past a certain point, they're disassociated from the connection. When all of a connection's primaries are gone, the connection is ended.
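
To make the bookkeeping concrete, here's a rough sketch in Python. Nothing here is implemented; the class shape, the timeout value and the field names are all hypothetical.

import time

HIGHLIGHT_TIMEOUT = 15 * 60  # seconds before a primary goes stale; the value is a guess

class Connection:
    """A relay between two channels, created when a highlight rule fires."""
    def __init__(self, source, target):
        self.channels = (source, target)
        self.primaries = {}    # nick -> RRA timestamp
        self.secondaries = {}  # nick -> RRA timestamp, copied from a primary

    def add_primary(self, nick):
        # Triggering a highlight (re)marks the speaker as a primary and
        # refreshes their RRA.
        self.primaries[nick] = time.time()

    def add_secondary(self, nick, primary):
        # Talking to a primary attaches you to the connection; your RRA is
        # copied from the primary's, not set to "now".
        if primary in self.primaries:
            self.secondaries[nick] = self.primaries[primary]

    def expire(self):
        # Relaying doesn't refresh RRAs, so primaries age out; the
        # connection ends when the last primary is gone.
        now = time.time()
        self.primaries = {n: t for n, t in self.primaries.items()
                          if now - t < HIGHLIGHT_TIMEOUT}
        return bool(self.primaries)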


There are a couple of scenarios this logic doesn't quite resolve. What if jubilee is a channel champion, someone who talks to everyone, and whom everyone talks to? It's probable that his side of a conversation with someone else would leak across the connection to the other channel. What if someone talks to a secondary on a related subject, but doesn't trigger a highlight keyword? Well, that line would be lost.

No solution is perfect.

Now to deal with the Big Brother concerns. Ideally, nexusbot would only be in a channel if he were legitimately asked to be there. That means joining only on a /invite, and preferably checking that the user who sent the invite is, in fact, in the destination channel. Likewise, nexusbot should stay only until he's asked to leave. That means no autojoin after a /kick.
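
As a sketch of that policy (the Bot class here is a stub standing in for whatever IRC framework this ends up built on):

class Bot:
    def __init__(self):
        self.kicked_from = set()  # channels we were kicked from: stay out

    def names(self, channel):
        """Nicks present in a channel (stub; a real bot asks the server)."""
        return set()

    def join(self, channel):
        print("JOIN", channel)

    def on_invite(self, inviter, channel):
        # Join only on a legitimate /invite, and only if the inviter is
        # actually present in the destination channel.
        if channel not in self.kicked_from and inviter in self.names(channel):
            self.join(channel)

    def on_kick(self, channel):
        # A /kick means leave and stay gone: no autojoin.
        self.kicked_from.add(channel)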

There's also the consideration that the bot should let someone with authority in the channel know he's there and what he is, and offer a command set to control his behavior in the channel.

Random braindump of possible commands:

HILIGHT LIST/ADD/REMOVE [#channel] -- lists, adds or removes a highlight rule, optionally associated with a channel. Lists include who requested the rule, and when.

RATELIMIT GET/SET -- get or set the maximum number of lines per minute.

LINEBUFFER GET/SET -- get or set the size of the buffer for queuing lines when the ratelimit is hit (see the sketch after this list).

REPLYMODE USER/HIGHLIGHT/CHANNEL/AUTO +/- b/m/v -- treat connections derived from highlights or associated with particular remote channels as channels themselves, and allow some channel modes like +/-m to be applied to them. Likewise, allow user modes like +/-b and +/-v to be associated with remote users. AUTO means having the bot automatically sync its remote user modes (as they apply in that channel) with the channel's mutes, voices and bans.

Ideally, only channel members with +O or +o would have access to the setter commands.
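
Here's the sketch promised above for how RATELIMIT and LINEBUFFER might interact; the defaults are invented.

import collections
import time

class RelayThrottle:
    """Caps relayed lines per minute; overflow waits in a bounded queue."""
    def __init__(self, lines_per_minute=20, buffer_size=50):  # made-up defaults
        self.rate = lines_per_minute
        self.queue = collections.deque(maxlen=buffer_size)  # oldest lines fall off
        self.sent = []  # timestamps of lines relayed in the last minute

    def submit(self, line):
        self.queue.append(line)

    def pump(self, send):
        # Called periodically: relay queued lines while under the cap.
        now = time.time()
        self.sent = [t for t in self.sent if now - t < 60]
        while self.queue and len(self.sent) < self.rate:
            send(self.queue.popleft())
            self.sent.append(time.time())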

Monday, September 7, 2009

Not done *yet*, but...

I sprang for a Linode 540 account. I copied RC over, and have been tuning everything all evening. The new VPS is serving up 5-7Mb/s of dynamically-generated pages under my stress testing.

Points of interest:

  • 540MB of RAM.
  • memcached -k -m128M
  • php5-xcache
  • Tuned MySQL parameters
  • The site's code is on a tmpfs mount
  • Not using Squid
  • My stress testing was done from a connection capped at 6Mb/s
  • My stress testing was eight concurrent runs of wget -m --spider http://hostname

Kinda pleased.

Still have to set up all the subdomains, but the site should run a lot faster once I'm ready to switch DNS over. I'm also spending most of my CPU time in system, and that's coming from the forked Apache processes (post-fork, mind you). I haven't figured out why.

You might have noticed the mention of tmpfs... I had a lot of spare RAM. I have a hard upper limit on the number of running forked PHP processes, an upper limit on memcached, and I can't seem to coax MySQL into keeping its M_DRS memory resident. So I had about 300MB of RAM used only for file cache.

Since I don't need to change the PHP files often, losing the contents of the tmpfs mount in an outage isn't a major concern, and I can cron an rsync to disk.
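
The cron job's payload could be as small as this (both paths are made up; I'm just shelling out to rsync):

#!/usr/bin/env python
# Sync the tmpfs-hosted webroot back to disk, so an outage only costs the
# changes since the last run. Both paths are hypothetical.
import subprocess

TMPFS_WEBROOT = "/mnt/tmpfs/www/"   # trailing slash: copy contents, not the dir
DISK_COPY = "/var/backups/www/"

subprocess.check_call(["rsync", "-a", "--delete", TMPFS_WEBROOT, DISK_COPY])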

Tuesday, August 11, 2009

Complete site backups in under five minutes

I've changed how site backups for Rosetta Code work.

Previously, a site backup was a rather manual affair of mysqldump, tar and scp. I've got a fair number of large tarballs that contain nothing but the contents of the httpd root, plus database SQL dumps. Consumed space and bandwidth grow fairly quickly; this is part of why incremental backup strategies get devised.

Now, I have a set of nested makefiles on an offsite system with a few different targets. The root-level makefile has 'backup', 'recurse' and 'git' targets. The 'backup' target depends on the other two. The 'recurse' target drops into a few different subdirectories, one each for databases, webroot and logs. Each subdirectory has its own makefile with a 'backup' target.

The databases subdirectory connects to the server, has the server do a mysqldump of the databases to a server-local file, and then uses rsync to copy the SQL dump file to the local system.

The webroot subdirectory uses rsync to copy the webroot to the local system.

The logs subdirectory uses rsync to copy system log files to the local system.

After running the recurse target, the root makefile runs the git target, which updates a local git repository with the modifications since the last time a backup was done. This is relatively cheap; since the data has already been copied to the local system, the server isn't loaded down with the subsequent processing.
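
I won't reproduce the makefiles here, but the whole pipeline boils down to something like this sketch; the hostname, paths and database name are all invented.

#!/usr/bin/env python
# Rough equivalent of the 'backup' flow: dump on the server, rsync the
# results down, then commit locally. Run from the root of the backup tree.
import subprocess

SERVER = "rosettacode.example.org"
REMOTE_DUMP = "/var/backups/rcode.sql"

# databases: the server dumps to a server-local file, rsync copies it down.
subprocess.check_call(["ssh", SERVER, "mysqldump rcode > " + REMOTE_DUMP])
subprocess.check_call(["rsync", "-az", SERVER + ":" + REMOTE_DUMP, "databases/"])

# webroot and logs: plain rsync, so only the deltas cross the wire.
subprocess.check_call(["rsync", "-az", SERVER + ":/var/www/", "webroot/"])
subprocess.check_call(["rsync", "-az", SERVER + ":/var/log/", "logs/"])

# git: all further processing happens locally, so the server is off the hook.
subprocess.check_call(["git", "add", "-A"])
subprocess.check_call(["git", "commit", "-m", "site backup"])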

Once the whole thing has been primed (the first backup takes quite a while, as all of the data has to be copied), a full backup run takes less than five minutes to save off the changes from an hour's subsequent site traffic.

The biggest problem with the system is that server-side CPU usage is fairly heavy with the mysqldump and the rsync work. Server-side work currently takes four of the five minutes, while the git processing takes the rest. Hopefully, I'll be able to offset this by moving some batch processing that's typically done on the server offsite, to better, faster hardware.

In the near future, I plan to add chunks of /etc to the backup process.

Sunday, August 2, 2009

A comparison of compressors on SQL

I was cleaning and reorganizing data on my computers, and taking the opportunity to compress anything large that I wouldn't need to see inside as part of, e.g., indexing.



I looked at one of my snapshots of Rosetta Code's database; uncompressed, it occupied over 600MB. Compressed with bzip2, it occupied about one tenth of that. I decompressed and recompressed it with rzip, and was sufficiently surprised at the results that I tried to do a fairly thorough comparison of bzip2, rzip and gzip. Based on my use case, I collected data on compression ratio and speed. I did not collect data on RAM usage. (Though I do know that rzip at max compression exceeds the amount of RAM available on my basic Slice.)



It took (bzip2, rzip, gzip) (5m29s, 3m1s, 1m33s) to achieve compression ratios of (11.2, 60.5, 7.48).
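
The script I ran is a shell script, but the loop amounts to this sketch. The flags mirror the session below; gzip's -k needs a reasonably recent gzip.

#!/usr/bin/env python
# Time each compressor against the same SQL dump and report the ratio.
import os
import subprocess
import time

SRC = "rcode_20090704_2029.sql"
RUNS = [(["bzip2", "-fk9", SRC], SRC + ".bz2"),
        (["rzip", "-k9", SRC], SRC + ".rz"),
        (["gzip", "-9", "-k", SRC], SRC + ".gz")]

for cmd, out in RUNS:
    # Stream the source through dd first, to pull it back into page cache.
    subprocess.check_call(["dd", "if=" + SRC, "of=/dev/null"])
    start = time.time()
    subprocess.check_call(cmd)
    elapsed = time.time() - start
    ratio = os.path.getsize(SRC) / os.path.getsize(out)
    print("%s: %.0fs, ratio %.1f" % (cmd[0], elapsed, ratio))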



Here's the raw data:

shortcircuit@dodo~/comprcompa
04:54:56 $ ./comprcompa.sh rcode_20090704_2029.sql
The primary purpose here is to compare compression ratios for database SQL dump backups.
Running environment is a Gentoo system running an AMD Phenom 9650.
As most things in Gentoo are typically compiled from source, these are the CFLAGS used:
CFLAGS="-march=amdfam10 -O2 -pipe"
our source database dump
-rw------- 1 shortcircuit shortcircuit 633986350 2009-08-02 02:58 rcode_20090704_2029.sql
Starting memory conditions; If there's a great deal of room for cache, we won't hit disk as frequently
total used free shared buffers cached
Mem: 7936588 7730932 205656 0 207452 6504216
-/+ buffers/cache: 1019264 6917324
Swap: 0 0 0
streaming original file into /dev/null via dd, to pull it into cache
1238254+1 records in
1238254+1 records out
633986350 bytes (634 MB) copied, 1.07369 s, 590 MB/s
starting uptime: 04:55:01 up 13 days, 17:13, 3 users, load average: 0.67, 0.85, 0.71
Starting bzip2 -fk9

real 5m28.838s
user 5m28.279s
sys 0m0.537s
Post-bzip2 uptime: 05:00:29 up 13 days, 17:18, 3 users, load average: 1.05, 1.03, 0.83
Pull original back into cache, for fair comparison
1238254+1 records in
1238254+1 records out
633986350 bytes (634 MB) copied, 1.097 s, 578 MB/s
Starting rzip -k9

real 3m0.765s
user 3m0.445s
sys 0m0.307s
Post-rzip uptime: 05:03:31 up 13 days, 17:21, 3 users, load average: 1.21, 1.08, 0.87
Pull original back into cache, for fair comparison
1238254+1 records in
1238254+1 records out
633986350 bytes (634 MB) copied, 1.0498 s, 604 MB/s
Starting gzip -9

real 1m33.356s
user 1m32.011s
sys 0m0.890s
Post-gzip uptime: 05:05:06 up 13 days, 17:23, 3 users, load average: 1.11, 1.07, 0.89
Final file sizes:
-rw------- 1 shortcircuit shortcircuit 56789389 2009-08-02 02:58 rcode_20090704_2029.sql.bz2
-rw------- 1 shortcircuit shortcircuit 84747906 2009-08-02 02:58 rcode_20090704_2029.sql.gz
-rw------- 1 shortcircuit shortcircuit 10472407 2009-08-02 05:03 rcode_20090704_2029.sql.rz
shortcircuit@dodo~/comprcompa
05:05:06 $

Saturday, August 1, 2009

Harvesting of SourceForge projects and spamming SF users

I got an email from "Apparition" telling me my ActivityRank value was 0, and that I should add photo and blog entries to increase it. The email included a link to "http://grou.ps/apparition".



"Apparition" is the name of a project I created years ago when I was in college and was convinced that I could do a better job writing computer lab imaging software than the "Ghost" software that was being used in the lab I worked in at the time. (Heck, I suspect that's even more true, now. Imagine imaging a lab, but having the imaging software using bittorrent on the local switch to distribute the drive images. Certainly would have worked better than imaging one machine in the lab to get past the building-building bottleneck, then having that machine serve up to the other 63 PCs in the lab...)



I never went anywhere with it, and haven't even really thought about it in the last few years. The website that appears to have sent me the email seems to have found my old project on SourceForge and sent an email to my SourceForge account, which was forwarded to my personal email. My first impression was a targeted malware campaign. I've grabbed grou.ps and grou.ps/apparition with wget and examined them with less and links, and neither *appears* to contain malicious code to my untrained eye; it's mostly jQuery code to control the interface to the social networking site. That's not to say it's safe; I wouldn't open it in a full browser outside of a clean VM, for the sake of being paranoid.



I'm fairly confident it's a programmatic attack, as Apparition is probably the least interesting of the SF projects I started that never went anywhere. Even if it's not an attempt at spreading malware or collecting personal info from technical users and people with access to source code repos, it bothers me that someone appears to be using programmatic means to harvest SF accounts and create places for them on a social networking site, and it bothers me that that email somehow got through SourceForge's filters.



Here are the headers:


Delivered-To: mikemol@gmail.com
Received: by 10.150.123.8 with SMTP id v8cs281312ybc;
Fri, 31 Jul 2009 23:44:12 -0700 (PDT)
Received: by 10.100.216.7 with SMTP id o7mr4434688ang.120.1249109052582;
Fri, 31 Jul 2009 23:44:12 -0700 (PDT)
Return-Path:
Received: from mx.sourceforge.net (mx.sourceforge.net [216.34.181.68])
by mx.google.com with ESMTP id 13si10290449yxe.76.2009.07.31.23.44.10;
Fri, 31 Jul 2009 23:44:11 -0700 (PDT)
Received-SPF: fail (google.com: domain of bounce@grou.ps does not designate 216.34.181.68 as permitted sender) client-ip=216.34.181.68;
Authentication-Results: mx.google.com; spf=hardfail (google.com: domain of bounce@grou.ps does not designate 216.34.181.68 as permitted sender) smtp.mail=bounce@grou.ps; dkim=pass (test mode) header.i=@grou.ps
Received-SPF: pass (3b2kzd1.ch3.sourceforge.com: domain of grou.ps designates 67.228.206.32 as permitted sender) client-ip=67.228.206.32; envelope-from=bounce@grou.ps; helo=mail01.grou.ps;
Received: from mail01.grou.ps ([67.228.206.32])
by 3b2kzd1.ch3.sourceforge.com with esmtp
(Exim 4.69)
id 1MX8K6-0001oT-B0
for shortcircuit@users.sourceforge.net; Sat, 01 Aug 2009 06:44:10 +0000
Received: from mail01.grou.ps (localhost [127.0.0.1])
by mail01.grou.ps (Postfix) with ESMTP id 354613106EB
for ; Thu, 30 Jul 2009 19:41:38 -0500 (CDT)
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=grou.ps; h=date:to:from
:subject:message-id:list-unsubscribe:mime-version:content-type;
s=s1; bh=1s8xCFXOMjsEryNQjM/Jgn/L6VI=; b=D1qzcHRVWOs5w7fXza79KX
QN5oOAE19VQ2tLJZsXbuSYJ22ZUBqdp5RoA4cXBbxta4f+9VOc8QSaPmytOFcURt
0gQ2k9LeWahR63fxVLPDqkLpBmtRl59VKZN7TF4f9IfJ19/RdfhYqvnV/GbCcoE1
XNNHwnMdiKiDfxSmZTxAI=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=grou.ps; h=date:to:from:subject
:message-id:list-unsubscribe:mime-version:content-type; q=dns;
s=s1; b=HLOKH9ucB8bSXrWREFZ47U5qEHfgyCo2LN/5MvP+rrc6A1ZrDmF8zdL
cpZzq1P4n43XFjssW18HRk/076lQHYxvi8XlcMuOk9hleImk22W366VZo+mnID+V
5JYJlMNR1nMrB0x76i9RJ9fiCcSGTivRoDi6vrOOVmyj/FIIhqM0=
Received: from localhost.localdomain (unknown [67.228.115.98])
by mail01.grou.ps (Postfix) with ESMTP id 2A2103106EA
for ; Thu, 30 Jul 2009 19:41:38 -0500 (CDT)
Date: Thu, 30 Jul 2009 19:41:38 -0500
To: shortcircuit
From: Apparition
Subject: Apparition: Weekly Newsletter
Message-ID: <8507431ef5f59bdc6fecbb3f67dfa0e1@localhost.localdomain>
X-Priority: 3
X-Mailer: GROU.PS Mailer
List-Unsubscribe: http://grou.ps/noemail.php?x1=%25qCGbT5-%3B%5C9Wr%2BK8Asq4%27%3FWmJIX6%24%272%23xR&x2=%251H3qV%7B%23O%5Bd%5Bb%27%7E1%27%27%3A1%7CV%5C6vC%2FA%7ByN%21lU
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="b1_8507431ef5f59bdc6fecbb3f67dfa0e1"
X-Spam-Score: -0.5 (/)
X-Spam-Report: Spam Filtering performed by mx.sourceforge.net.
See http://spamassassin.org/tag/ for more details.
-1.5 SPF_CHECK_PASS SPF reports sender host as permitted sender for
sender-domain
-0.0 SPF_PASS SPF: sender matches SPF record
-0.0 DKIM_VERIFIED Domain Keys Identified Mail: signature passes
verification
0.0 DKIM_SIGNED Domain Keys Identified Mail: message has a signature
1.0 HTML_MESSAGE BODY: HTML included in message
0.0 AWL AWL: From: address is in the auto white-list
X-Headers-End: 1MX8K6-0001oT-B0

Sunday, July 26, 2009

On syntax highlighting and artificial intelligence

So on Rosetta Code, we use GeSHi for syntax highlighting. The relationship between Rosetta Code, GeSHi, a programming language and the code written in that language is fairly simple. (The exact order of events inside GeSHi might be slightly different; I haven't delved deeply into its core.)



Rosetta Code (by way of a MediaWiki parser extension) gives GeSHi a few pointers about how it wants the code formatted, the language the code sample will be in, and, finally, the code sample itself.



GeSHi takes the code example and loads the language file named after the language in question. Each language file defines a PHP associative array that contains (among a couple of other things) simple rules for how GeSHi can apply formatting to the code in a way that will clarify it to the viewer. These rules include lists of known keywords of various classifications, symbols used for normal commenting conventions, and optional regex matching rules for each.
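
The real files are PHP, but the shape of the data is roughly this, rendered as a Python dict for illustration; this is not GeSHi's actual schema, and the field names are approximate.

# Roughly the shape of a GeSHi language file, as a Python dict.
language = {
    "LANG_NAME": "c",
    "COMMENT_SINGLE": ["//"],
    "COMMENT_MULTI": {"/*": "*/"},
    "QUOTEMARKS": ['"', "'"],
    "KEYWORDS": {
        1: ["if", "else", "for", "while", "return"],    # control flow
        2: ["int", "char", "float", "double", "void"],  # types
    },
    # Optional regexes catch things keyword lists can't, like number literals.
    "REGEXPS": {0: r"\b\d+\b"},
}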



It's a perfectly reasonable, fairly static approach that allows syntax highlighting to cover a broad variety of languages without knowing how to parse each language's actual syntax, and so avoids having a syntax error break the whole process.



Unfortunately, it requires Rosetta Code to be able to tell GeSHi what language a code sample is written in. It also leads to odd scenarios where a supported language and an unsupported language are so closely related that examples written in the unsupported language can be comfortably highlighted using the rules for the supported one.



And I have yet to learn of a good way to do syntax highlighting for Forth. (The Forth developers appear to pretty much keep to their own community, and don't seem to do much in the way of outreach, which makes finding a solution relatively difficult, but I digress...)



So what does this have to do with artificial intelligence? Well, in identifying a language without being told what it is, of course!



A few solutions have been discussed. One approach that has been attempted had something to do with Markov chains; the code is in the GeSHi repos, and I haven't looked at it.



One solution I suggested was to run the code example through all the supported languages (yes, I know that's expensive; not something to be done in real time), and select the ruleset based on how many rules (X) were matched for a language and how much of the code sample was identified (Y). Using a simple heuristic of (a*X)/(b*Y), you can reward a large number of matched rules while hopefully penalizing an overly-greedy regex rule.
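
As a sketch, modeling each ruleset as a list of bare regexes, which is a gross simplification of what GeSHi actually stores:

import re

def score_language(rules, sample, a=1.0, b=1.0):
    # X = rules that matched at least once; Y = fraction of the sample
    # consumed by matches. A greedy regex inflates Y and drags the score down.
    matched_rules = 0
    matched_chars = 0
    for pattern in rules:
        hits = list(re.finditer(pattern, sample))
        if hits:
            matched_rules += 1
            matched_chars += sum(len(h.group()) for h in hits)
    x, y = matched_rules, matched_chars / len(sample)
    return (a * x) / (b * y) if y else 0.0

def best_language(rulesets, sample):
    # Run the sample through every supported language; keep the top score.
    return max(rulesets, key=lambda lang: score_language(rulesets[lang], sample))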



How can we take this a step farther? How about formatting languages we don't know about?



Well, many, many languages have rules in common. Common keywords, common code block identifiers, common symbols for comments, common symbols for quotation, etc. This tends to result from their being derived from, or inspired in some way by, another language. For the sake of avoiding pedantry, I'll just say that C, C++, Perl, Python, PHP, Pascal and Java all have a few common ancestors.



One way would be to note the best N language matches, take the intersection of their common rules, and apply that intersection as its own ruleset. This would certainly work for many of the variants of BASIC out there, as well as for specialized variants of common languages like C, and for low-level ISAs.
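
Continuing the sketch above (and again treating rules as plain strings, which real GeSHi rules are not):

def fallback_ruleset(rulesets, sample, n=3):
    # Rank the supported languages with the heuristic above, then keep only
    # the rules the top n candidates share, and highlight with that subset.
    ranked = sorted(rulesets,
                    key=lambda lang: score_language(rulesets[lang], sample),
                    reverse=True)
    return set.intersection(*(set(rulesets[lang]) for lang in ranked[:n]))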