April 20, 2005: Cloud Nine

< Absent to all, abstaining nothing.

| ??? |

Be happy. It is a way of being wise. - Colette >

Below is a "Tag Cloud": a list of the most-used words on this journal/sketchbook thing. The larger the word, the more times it's used. Clicking on a word will bring you to a Google search for that word as it's used on this site. Interesting? Almost. Below the cloud is the code I used to create it, and more blather about it.

abortion absolutely actual actually added afterwards airplane alarm almost alone along already although amount andenken andy ani another anyone anything anyways apartment application appname april arapahoe around art artist artists artwork ashleysbday asked ass atomandhispackage attempt attention attractive author away bad bag based basement basically beau beautiful because become bed bedroom beer being believe belt bike bill birthday bit bite blink blocks board books boston bottom boulder break breaks broke brother building bunny bus business buy calculator called camera cannot canvas cards cars cart case cat cellpadding cellspacing changed cheap checkforshockwave child childhood christmas circle classes clothes code coffee coin collection college colorado colored comes comic coming community companies completely complex composition computer conclusion cookie cool couldn couple cramps crayon crayons creases created creating credit cube cunt damn dance danced dancing Dhalia database days decided deleted denver description design deviant different digital dimensional discipline display dj document doesn dogs doing dorm download downtown drawing drawings drawn dress dressed drew drinking driving drove drunk easily easy eating either else english entire entry eof eq events everyone everything exact exactly except expensive eyes fairly fake false fashion favorite feeling felt file film finally finding finished flash flattened floor flyer flyers folding following foreach friday friends frozen fucking fun funny future gallery gas general geocities gets getting giotto given giving glasses goal god goes goin going gonna gotten grade guitar guy hadn hair halfpipe handssmall hang hanging happened happy harry hate haven having heart highschool himself holding holly hope hours hug hung hurt icq ideas image important impossible incredibly information input inside instead interesting internal into ironclad isn its itself jack jar jarpath jars jason javascript jeremy jerimiah 
jessica job journal js jug juggling alex alexsimoni kelly kept kerouac key keys keywords kids kiss knee knowing known laptop larger later learned least letters license liked likely lines link linkto living ll local longer looked looking lost lot lots lovely lt lux macromedia magazine mail makes making mania math matter maybe meaning means medium meet melissa memories mental met middle midnight miles mimetypes minutes mirror moby model mojo monkey month months mostly motoko mountains moved movie movies mp mr mu mural museum myself natalie navigator neat needed needs netscape nice non none not num ny odd oil ok ones onto opening option organizer origami outside owner pacman pads pages paid painted painting paintings pants parents park parking passed past pastel patterns pay pd pencil pencils Penelope perfect performance perhaps personal pete phone photo photodb photos physical physically physics pic pick pictures pieces pin pins pipeline pixies pizza places plastic platform played playing plugins pm polaroid police popular portrait position pre price probably problems prod program project pub pure push putting quality quarter questions quite rains ralph random re reading reality realize realized really reason reasons received recently release respect return ride riot roommate roughcut rudy rules ryan sale sarah sat saturday saying scared scene screen sculpture seeing seemed seems seen sense session setting sex shapes shift shirt shit shockmode shockwave shoes shot shots shown shows shtml sick sign silly simoni simply singer single sk skate skateboard skateboarders skateboarding skatepark skater skaters skating sketch sketchbook sketches sleep slowly smart smile snowboarding soap social socko softupdate solve someone something sometimes somewhat songs sort sounds sp spdraw specific spent split spray stage started statement stayed stick stolen stopped store str stuck studio stuff stupid sub summersault sunday sup supposed swear swf tags taken takes taking talked 
talking teacher technique techno telling template tent Jack thanks thats themselves things thinking throwing ticket tickets times tip tired title today totally trick tried trigger trip trying tuesday turned twenty twice type understand unflattering unknown update upon upside used using usually valign ve via view visibility waking walked walking wall walls wanted wants warhol ways wear wearing web weeks weight weird whatever whom wig window wish without woke won wondering woodstock words worked working works wouldn writing written wrote xxxx yeah years yet yourself zombie

I skipped going to the local Wednesday night 80's dance club makeout fest to give you the code below. I think I needed a break from socializing and working the "oh, I'm this amazing artist" rap, and more time alone in my studio, in front of a computer.

Note: This is very much a proof-of-concept.

#!/usr/bin/perl -w
use strict; 

# Where I put the Text::Ngrams Module:
use lib qw(/home/alex/perllib); 

# lowest font size, in px
my $base_size = 10; 

# where are our text entries?
my $dir_to_search = '/home/alex/www/alex/ephemeris'; 

# what's the site URL?
my $site_url = 'http://skazat.com'; 

# what's the most words you want back? (before we filter for common words)
my $word_limit = 1250; 


# No more to really change; 
#---------------------------------------------------------------------#

use File::Find; 
use Text::Ngrams; 

my @files; 
my %count = (); 
my %results; 





# A list of words we won't take into account, because either: 
# (1) they're part of the HTML of the page, 
# (2) they're locally weird, or
# (3) they're very common. 

my @banned_words = qw(

p strong img href em value param width border height com net org align style
images target blank skazat src solid black NAME PARAM http alt And s px www I
jpg gif table tr td id indexOf FFFFFF index serif row url br san meta size 
header blockquote bgcolor html text li  ol font quot ul nbsp hspace vspace cd
cccccc bold arial mailto span hr valign

currents verdana cgi am gt didn don wasn ssi th nav div

the of to and a in is it you that he was for on are with 
as I his they be at one have this from or had by hot word 
but what some we can out other were all there when up use 
your how said an each she which do their time if will way 
about many then them write would like so these her long make 
thing see him two has look more day could go come did number
 sound no most people my over know water than call first who 
may down side been now find any new work part take get place
 made live where after back little only round man year came 
show every good me give our under open seem together next 
white children begin got walk example ease paper group always 
music those both mark often letter until mile river car feet 
care second book carry took science eat room friend began idea
 fish mountain stop once base hear horse cut sure watch color 
face wood main enough plain girl usual young ready above ever 
red list though feel talk bird soon body dog family direct 
pose leave song measure door product black short numeral class 
wind question happen complete ship area half rock order fire 
south problem piece told knew pass since top whole king space 
heard best hour better true during hundred five remember step
 early hold west ground interest reach fast verb sing listen 
six table travel less morning gentle woman captain practice 
separate difficult doctor please protect noon whose locate ring 
character insect caught period indicate radio spoke atom human 
history effect electric expect crop modern element hit student 
corner party supply bone rail imagine provide agree thus 
capital won't chair danger fruit rich thick soldier process 
operate guess necessary sharp wing create neighbor wash bat 
rather crowd corn compare poem string bell depend meat rub 
tube famous dollar stream fear sight thin triangle planet hurry
 chief colony clock mine tie enter major fresh search send 
yellow gun allow print dead spot desert suit current lift rose
 continue block chart hat sell success company subtract event 
particular deal swim term opposite wife shoe shoulder spread 
arrange camp invent cotton born determine quart nine name very
through just form sentence great think say help low line differ turn
cause much mean before move right boy old too same tell does set three
want air well also play small end put home read hand port large spell
add even land here must big high such follow act why ask men change went
light kind off need house picture try us again animal point mother world
near build self earth father head stand own page should country found
answer school grow study still learn plant cover food sun four between
state keep eye never last let thought city tree cross farm hard start
might story saw far sea draw left late run don't while press close night
real life few north ten simple several vowel toward war lay against
pattern slow center love person money serve appear road map rain rule
govern pull cold notice voice unit power town fine certain fly fall lead
cry dark machine note wait plan figure star box noun field rest correct
able pound done beauty drive stood contain front teach week final gave
green oh quick develop ocean warm free minute strong special mind behind
clear tail produce fact street inch multiply nothing course stay wheel
full force blue object decide surface deep moon island foot system busy
test record boat common gold possible plane stead dry wonder laugh
thousand ago ran check game shape equate hot miss brought heat snow tire
bring yes distant fill east paint language among truck noise level
chance gather shop stretch throw shine property column molecule select
wrong gray repeat require broad prepare salt nose plural anger claim
continent oxygen sugar death pretty skill women season solution magnet
silver thank branch match suffix especially fig afraid huge sister steel
discuss forward similar guide experience score apple bought led pitch
coat mass card band rope slip win dream evening condition feed tool
total basic smell valley nor double seat arrive master track parent
shore division sheet substance favor connect post spend chord fat glad
original share station dad bread charge proper bar offer segment slave
duck instant market degree populate chick dear enemy reply drink occur
support speech nature range steam motion path liquid log meant quotient
teeth shell neck
); 

# make a simple lookup table...
my %banned_words = (); 
$banned_words{lc($_)} = 1 foreach @banned_words; 


my $ng = Text::Ngrams->new(type => 'word', windowsize => 1, limit => $word_limit);

# walk the directory structure, make note of the plain text (HTML) files...
find(\&wanted, $dir_to_search);
sub wanted {
	if(-r $File::Find::name && -f _ && -s _){ 
 		push(@files, $File::Find::name); 
	}
}
    
my $count = $ng->process_files(@files);

my @results = split("\n", $ng->to_string()); 

# Take note of what words were found, and how many 

foreach(@results){ 

	next if $_ =~ /\<NUMBER\>/;	    # We don't care about numbers.
	next unless $_ =~ m/\t/; 	    # a tab usually delimits 
								    # the word from the number
								
	my ($word, $count) = split("\t"); 
	
	next if length($word) <= 1; 	# has to be a word of 
									# more than one letter...
	next if its_banned($word); 	    # can't be a banned word...

	# make sure everything is lowercased...
	if(exists($results{lc($word)})){ 
		$results{lc($word)} += $count;  
	}else{ 
		$results{lc($word)} = $count;  
	}
}


# Find the smallest amount, and set as, "0", 
# adjust all other numbers to relate to that
# baseline.

%results = massage_for_size(\%results); 

# Make out our links. 
my $str; 
foreach my $w(sort keys %results){ 
	$str .= wb_link($w, $results{$w});
}

# print it out!
print $str; 




sub massage_for_size { 

	my $results = shift; 
	
	my @numbers = values %$results; 
	
	@numbers = sort { $a <=> $b} @numbers; 
	my $smallest = $numbers[0];	
	foreach(keys %$results){ 
		$results->{$_} -= $smallest; 
	}

	return %$results; 
}




sub wb_link { 

	my ($word, $count) = @_; 
	return '<span style="font-size:'. ( (int($count/3))  + $base_size ) .'px"><a href="http://www.google.com/search?q=site:'.$site_url.'%20' . $word . '">'. $word . '</a> </span> ' . "\n"; 

}




sub its_banned { 

	my $word = shift; 
	return exists $banned_words{lc($word)} ? 1 : 0; 

}


__END__

=pod

Copyright (c) 2005 alex Simoni 

This program is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

=cut


A few things to consider:

The above program uses a file system-based database - meaning: you have directories, and inside those directories, you have files, and more directories, that have files - we're not using SQL here.

I'm using some home-brewed stuff for my... whatever you want to call it, but you can pick up something called Blosxom and use this script with it. I may write a plugin for that program using the above as a starter - we'll see.

I decided to hack together one of these cloud things to see if it would give me any insight into my six or so years of writing. The verdict? Not really. It's neat, but not useful enough.

Some problems:

The number of times a specific word is used does not always mean that word is, well, really important. Sure, you can tell from the above cloud that "art" is used a lot, so I guess that's important, but the cloud does not even begin to tell you how important related but less-used words are to the context of the writing. For example, "ocularium" does not show up at all. Common words will show up more, no matter what.

Speaking of context, we've completely obliterated it and made some sort of statistical graph. Go us. Nothing I like more than statistics.
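On the first problem: here is a rough sketch of the sizing arithmetic the script does, pulled out on its own so you can see why frequent words swamp everything else. The helper name font_size and the counts are made up for illustration; the formula is the one spread across massage_for_size() and wb_link() above, and 10 is the script's $base_size.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not in the script above): the same arithmetic
# that massage_for_size() and wb_link() do between them.
sub font_size {
    my ($count, $smallest, $base_size) = @_;

    # subtract the baseline, then grow one px for every three uses
    return int(($count - $smallest) / 3) + $base_size;
}

# made-up counts; the smallest count in this pretend data set is 3
printf "%-8s %dpx\n", 'art',    font_size(90, 3, 10);
printf "%-8s %dpx\n", 'wig',    font_size(5,  3, 10);
printf "%-8s %dpx\n", 'rarest', font_size(3,  3, 10);
```

With these numbers "art" comes out at 39px while "wig" and "rarest" both land on 10px: anything within a couple of uses of the baseline is indistinguishable, and only the heavy hitters stand out.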

Some Creme

What I guess is interesting is that the information (raw data, blah blah blah) is attempting to sort "itself", meaning: we don't have to make a hierarchy of subjects to use in the journal thing; it's now doing it itself... sort of. I'm not sure if "something" is a subject worth pursuing. It's still sort of dense and stupid. I really wish I could get back a list of the most... least used words. Does that make any sense?
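For the least-used list, a minimal sketch might look like the following. It assumes you already have a word => count hash like the script's %results; the helper name least_used and the sample counts are hypothetical. One caveat, as far as I understand the module: the limit passed to Text::Ngrams keeps the most frequent words, so the rare ones may already be gone before this ever runs, and you would probably have to raise or drop $word_limit first.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not part of the script above): given a reference
# to a word => count hash, return the $limit least-used words,
# rarest first, ties broken alphabetically.
sub least_used {
    my ($counts, $limit) = @_;

    my @words = sort {
        $counts->{$a} <=> $counts->{$b}   # fewest uses first
            or
        $a cmp $b                         # then alphabetical
    } keys %$counts;

    # never ask for more words than we actually have
    $limit = scalar @words if $limit > @words;

    return @words[0 .. $limit - 1];
}

# made-up counts, just for illustration
my %sample = (art => 40, skateboard => 12, ocularium => 1, wig => 2);

print join(' ', least_used(\%sample, 2)), "\n";
```

In the script itself you would call it as least_used(\%results, 50) or similar, any time after the counting loop. The baseline shift in massage_for_size() doesn't change the ordering, so it works before or after that step.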

I hate the word "blogging". I hate that most of the creative output on personal sites has come down to using simple web services (blogging, photo sharing and the like), and that less of it is used simply to experiment. That's not totally what I want to say, but there's gotta be an artist hired by one of these large blogging software companies to actually make something *cool* out of all this data processing. Because. There is no soul to sorted words with a graphical cue of how often they're used. What is something interesting I can do with this?
