Below is a "Tag Cloud" - a list of the most-used words on this journal/sketchbook thing. The larger the word, the more times it's used. Clicking on a word will bring you to a Google search for that word as it's used on this site. Interesting? Almost. Below the cloud is the code I used to create it, and more blather about it.
I skipped the local Wednesday night 80's dance club makeout fest to give you the code below. I think I needed a break from socializing and working the "oh, I'm this amazing artist" rap, and more time alone in my studio, in front of a computer.
Note: This is very much a proof-of-concept.
#!/usr/bin/perl -w
use strict;
# Where I put the Text::Ngrams Module:
use lib qw(/home/alex/perllib);
# lowest font size, in px
my $base_size = 10;
# where are our text entries?
my $dir_to_search = '/home/alex/www/alex/ephemeris';
# what's the site URL?
my $site_url = 'http://skazat.com';
# what's the most words you want back? (before we filter for common words)
my $word_limit = 1250;
# No more to really change;
#---------------------------------------------------------------------#
use File::Find;
use Text::Ngrams;
my @files;
my %count = ();
my %results;
# A list of words, we won't take into account, since either:
# (1 They're part of the HTML of the page,
# (2 They're locally weird
# (3 they're very common.
my @banned_words = qw(
p strong img href em value param width border height com net org align style
images target blank skazat src solid black NAME PARAM http alt And s px www I
jpg gif table tr td id indexOf FFFFFF index serif row url br san meta size
header blockquote bgcolor html text li ol font quot ul nbsp hspace vspace cd
cccccc bold arial mailto span hr valign
currents verdana cgi am gt didn don wasn ssi th nav div
the of to and a in is it you that he was for on are with
as I his they be at one have this from or had by hot word
but what some we can out other were all there when up use
your how said an each she which do their time if will way
about many then them write would like so these her long make
thing see him two has look more day could go come did number
sound no most people my over know water than call first who
may down side been now find any new work part take get place
made live where after back little only round man year came
show every good me give our under open seem together next
white children begin got walk example ease paper group always
music those both mark often letter until mile river car feet
care second book carry took science eat room friend began idea
fish mountain stop once base hear horse cut sure watch color
face wood main enough plain girl usual young ready above ever
red list though feel talk bird soon body dog family direct
pose leave song measure door product black short numeral class
wind question happen complete ship area half rock order fire
south problem piece told knew pass since top whole king space
heard best hour better true . during hundred five remember step
early hold west ground interest reach fast verb sing listen
six table travel less morning gentle woman captain practice
separate difficult doctor please protect noon whose locate ring
character insect caught period indicate radio spoke atom human
history effect electric expect crop modern element hit student
corner party supply bone rail imagine provide agree thus
capital won't chair danger fruit rich thick soldier process
operate guess necessary sharp wing create neighbor wash bat
rather crowd corn compare poem string bell depend meat rub
tube famous dollar stream fear sight thin triangle planet hurry
chief colony clock mine tie enter major fresh search send
yellow gun allow print dead spot desert suit current lift rose
continue block chart hat sell success company subtract event
particular deal swim term opposite wife shoe shoulder spread
arrange camp invent cotton born determine quart nine name very
through just form sentence great think say help low line differ turn
cause much mean before move right boy old too same tell does set three
want air well also play small end put home read hand port large spell
add even land here must big high such follow act why ask men change went
light kind off need house picture try us again animal point mother world
near build self earth father head stand own page should country found
answer school grow study still learn plant cover food sun four between
state keep eye never last let thought city tree cross farm hard start
might story saw far sea draw left late run don't while press close night
real life few north ten simple several vowel toward war lay against
pattern slow center love person money serve appear road map rain rule
govern pull cold notice voice unit power town fine certain fly fall lead
cry dark machine note wait plan figure star box noun field rest correct
able pound done beauty drive stood contain front teach week final gave
green oh quick develop ocean warm free minute strong special mind behind
clear tail produce fact street inch multiply nothing course stay wheel
full force blue object decide surface deep moon island foot system busy
test record boat common gold possible plane stead dry wonder laugh
thousand ago ran check game shape equate hot miss brought heat snow tire
bring yes distant fill east paint language among truck noise level
chance gather shop stretch throw shine property column molecule select
wrong gray repeat require broad prepare salt nose plural anger claim
continent oxygen sugar death pretty skill women season solution magnet
silver thank branch match suffix especially fig afraid huge sister steel
discuss forward similar guide experience score apple bought led pitch
coat mass card band rope slip win dream evening condition feed tool
total basic smell valley nor double seat arrive master track parent
shore division sheet substance favor connect post spend chord fat glad
original share station dad bread charge proper bar offer segment slave
duck instant market degree populate chick dear enemy reply drink occur
support speech nature range steam motion path liquid log meant quotient
teeth shell neck
);
# make a simple lookup table...
my %banned_words = ();
$banned_words{lc($_)} = 1 foreach @banned_words;
my $ng = Text::Ngrams->new(type => 'word', windowsize => 1, limit => $word_limit);
# walk the directory structure, make note of the plain text (HTML) files...
find(\&wanted, $dir_to_search);
sub wanted {
if(-r $File::Find::name && -f _ && -s _){
push(@files, $File::Find::name);
}
}
my $count = $ng->process_files(@files);
my @results = split("\n", $ng->to_string());
# Take note of what words were found, and how many
foreach(@results){
next if $_ =~ /\<NUMBER\>/; # We don't care about numbers.
next unless $_ =~ m/\t/; # a tab usually delimits
# the word from the number
my ($word, $count) = split("\t");
next if length($word) <= 1; # has to be a word of
# more than one letter...
next if its_banned($word); # can't be a banned word...
# make sure everything is lowercased...
# always add to the lowercased key, so "Art" and "art" share one count
$results{lc($word)} += $count;
}
# Find the smallest amount, and set as, "0",
# adjust all other numbers to relate to that
# baseline.
%results = massage_for_size(\%results);
# Build our links.
my $str;
foreach my $w(sort keys %results){
$str .= wb_link($w, $results{$w});
}
# print it out!
print $str;
sub massage_for_size {
my $results = shift;
my @numbers = values %$results;
@numbers = sort { $a <=> $b} @numbers;
my $smallest = $numbers[0];
foreach(keys %$results){
$results->{$_} -= $smallest; # go through the reference we were passed, not the global hash
}
return %$results;
}
sub wb_link {
my ($word, $count) = @_;
return '<span style="font-size:'. ( (int($count/3)) + $base_size ) .'px"><a href="http://www.google.com/search?q=site:'.$site_url.'%20' . $word . '">'. $word . '</a> </span> ' . "\n";
}
sub its_banned {
my $word = shift;
return exists $banned_words{lc($word)} ? 1 : 0; # always return an explicit true/false
}
__END__
=pod
Copyright (c) 2005 alex Simoni
This program is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
=cut
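In case the sizing math is not obvious from the script, here is a minimal sketch of just that part, pulled out so it runs on its own. The word counts are made up for illustration (they are not real numbers from this site), and font_size is a hypothetical helper name. The arithmetic is meant to match what massage_for_size and wb_link do together: subtract the smallest count, then add one px for every three uses on top of $base_size.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $base_size = 10;    # lowest font size in px, same default as the script

# Hypothetical counts, invented for this example only.
my %counts = ( art => 95, studio => 41, perl => 5 );

# The smallest count becomes the baseline, as in massage_for_size().
my ($smallest) = sort { $a <=> $b } values %counts;

# Same arithmetic as wb_link(): one extra px per three uses over the baseline.
sub font_size {
    my ($count) = @_;
    return int( ( $count - $smallest ) / 3 ) + $base_size;
}

foreach my $word ( sort keys %counts ) {
    printf "%-8s %3d uses -> %dpx\n", $word, $counts{$word},
        font_size( $counts{$word} );
}
```

With these made-up numbers, "perl" sits at the 10px floor, "studio" comes out at 22px and "art" at 40px. Nothing caps the top end, so a very common word can get very large.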
A few things to consider:
The above program uses a file system-based database - meaning: you have directories, and inside those directories, you have files, and more directories, that have files - we're not using SQL here.
I'm using some home-brewed stuff for my... whatever you want to call it, but you can pick up something called Blosxom and use this script with it. I may write a plugin for that program using the above as a starter - we'll see.
I decided to hack together one of these cloud things to see if it would give me any insight into my six or so years of writing. The verdict? Not really. It's neat, but not useful enough.
Some problems:
How often a specific word is used does not always tell you that the word is, well, really important. Sure, you can tell from the above cloud that "art" is used a lot - so I guess that's important - but the cloud does not even begin to tell you how important related, less-used words are to the context of the writing. For example, ocularium does not show up at all. Common words will show up more, no matter what.
Speaking of context, we've completely obliterated it and made some sort of statistical graph. Go us. Nothing I like more than statistics.
Some Creme
What I guess is interesting is that the information (raw data - blah blah blah) is attempting to sort "itself", meaning: we don't have to make a hierarchy of subjects to use in the journal thing, it's now doing it itself... sort of. I'm not sure if "Something" is a subject worth pursuing - it's still sort of dense and stupid. I really wish I could get back a list of the most... least used words. Does that make any sense?
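For what it's worth, a least-used list looks doable from the same data. This is only a sketch: the %results hash below is filled with invented counts standing in for the one the script builds, and least_used is a hypothetical helper, not part of the script. I suspect $word_limit would also have to go up first, since it caps how many words come back and the rare ones are probably the first to go.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented counts standing in for the %results hash built by the script.
my %results = (
    art       => 95,
    studio    => 41,
    perl      => 5,
    ocularium => 2,
    gouache   => 1,
);

# Return up to $n words with the lowest counts; ties are broken alphabetically.
sub least_used {
    my ( $counts, $n ) = @_;
    my @words =
        sort { $counts->{$a} <=> $counts->{$b} or $a cmp $b } keys %$counts;
    $n = @words if $n > @words;    # don't ask for more words than we have
    return @words[ 0 .. $n - 1 ];
}

print "$_ ($results{$_})\n" foreach least_used( \%results, 3 );
```

Run against the real counts, a list like this would likely be full of typos and one-off words, so it would probably want its own filtering before it says anything interesting.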
I hate the word "blogging". I hate that most of the creative output put forth on personal sites has come down to using simple web services like blogging, photo sharing and the like, and less of it is used simply to experiment. That's not totally what I want to say, but there's gotta be an artist hired by one of these large blogging software companies to actually make something *cool* out of all this data processing. Because. There is no soul to sorted words with a graphical cue of amount used. What is something interesting I can do with this?