Home • Essays • Lost Articles • Loose Ends • Collections • Computing • Projects • Widdershins • Quotations • Links • Us

 

Namemaker

The other day I reconstructed an old program I wrote once, many years ago. It had gotten lost somewhere in the smoke and clatter of various hard drive crashes and changes of computers and operating systems -- but I still somehow retained its essence in that steel-trap brain of mine (hah! a funny smiley face set of text characters should certainly follow that statement).

This program generates "fake text", based on the structure of "seed" text fed to it. The output seems "almost" reasonable. Given enough initial text data of a particular type, it outputs words that sort of sound like the seed text -- but not exactly. That is to say, the structure of the outputted "fake" words are not necessarily proper English -- but they do reflect somewhat the makeup of the seeded English words. It does best on simple lists of words. One good example is proper names of people or articles. If I were a Marketing or Advertising guy, I would use it to spit out names of new drugs, automobiles, genetically created animals, novel products, etc. It would also be useful for an author searching for a unique name for a character or an imaginary entity.

The seed text is first analyzed for its letter sequence structure. Arrays are set up that tell, for each given character in the text, the frequency of occurrence of the 3 characters before it. For example, if the seed text contains the word "human", and the letter being analyzed is the "a", the letter "m" will be counted as occurring immediately prior to an "a" once, the letter "u" will be counted at occurring as the 2nd letter preceding an "a" once, and the "h" as a 3rd letter preceding an "a" once. The counts of preceding letters in the text are incremented according to how often they are seen preceding any particular letter.

To be clearer, a 3-dimensional array is set up covering each letter in the alphabet. The declaration is given as:  LTR(128,3,128), where the first dimension's element is the ASCII-numbered bin for the primary letter in question, the second relates to whether the array element is covering its 1st-preceding, 2nd-preceding or 3rd-preceding letter history, and the last is the ASCII-numbered bin of that preceding letter. (You may ask why I included 128 bins when there are only 26 letters plus a space needed in the output. I can only answer that the ASCII values for lower case text runs in the range from 97 to 122, and the ASCII value for a space character is 32.  So most of the array is actually unused.  But since RAM is not as precious nowadays as it used to be, doing it this way simplifies identifying the applicable array elements when doing the pre-analysis counting and tabulating of the seed text.  In this case, I merely assign the array element "count container" number to equal the ASCII number of the character it represents.  It would be much more complicated to do it in a space-efficient manner.  In the old days, I would have been forced to do so.)

The fake words are constructed one character at a time. Each tentative character is tested to see if it has ever followed the last 3 new characters, back in the seed text. The test is skewed a bit to make the final chosen character one that has been more "commonly" (rather than more "rarely") followed by the 3 preceding characters. If the tentative selection is a complete bust, the program flow jumps out of the subroutine and tries for a better tentative letter. There are routines in place to try to limit the total length of the word, but in a manner that is "natural".

For example, say we want to generate new female proper names. We can generate a "seed" string listing many common female names, plug it into the program, and look through the output for interesting new names. Here is an example of some output fashioned in this manner:

The program (VB) code is given below. Establish a form called "Namemaker" and put a good-sized text box in it called "text1". (Change its property to "yes" for "multiline".) Create the various procedure subroutines as noted below and see if you can copy this text into them. You can try changing the seed string to other lists of articles, say, the elements, car names, animal names, etc. Amaze your friends with your newfound naming creativity!

'variable declarations

Dim ltr(128, 3, 128)
Dim a$, b$, c$
Dim r, s, t, u, v
Dim i, j, k
Dim mt, ct, cx
Dim sumbef1

 

 

Public Sub Form_Load()

'here is the VB form page, governing the program flow

Randomize Timer

Namemaker.WindowState = 2
Namemaker.Show

Call targetstring
Call fillarray
Call mainloop

End Sub

 

 

Public Sub targetstring()

'c$ is the feed datastring -- put double spaces between each name!
c$ = " alice annie arlene annette anna barbara bambi beatrice bonnie carol "
c$ = c$ & "constance cheryl cherise cynthia candy celia clarice diedre darlene "
c$ = c$ & "earlene erma esther felicia frances florence faye gloria georgia grace "
c$ = c$ & "gladys helen heide henrietta iona julia janette jasmine karen kerry "
c$ = c$ & "linda lynn leona mandy marlene marty marilyn norma nanette opal "
c$ = c$ & "peony pearl phoebe queenie ruth susan suzette tanny ursula violet "
c$ = c$ & "wanda yolanda zane zinnia charlene sherry cherry lillian libby mary "
c$ = c$ & "yvonne yvette patricia patsy jade ruby sapphire rose daisy lucy "
c$ = c$ & "chris donna dorothy lisa elizabeth maria obediah olivia loretta "
c$ = c$ & "chloe uma evah winifred thelma inez victoria vicky rheba jewel "
c$ = c$ & "diana dianne dionne heather anita "

End Sub

 

 

Public Sub fillarray()

'analyze the feed string for the frequency of characters preceding

'each one, as they are interrogated in sequence one-by-one through the string.

'establish the relevant arrays for this.

'for reference:
'ASCII 65-90=caps A-Z
'ASCII 32=space
'ASCII 97-122=small case a-z

For i = 4 To Len(c$)

'target letter in feed datastring
a$ = Mid$(c$, i, 1)
r = Asc(a$)
If r <> 32 Then
'if a capital make a lower case
If r >= 65 And r <= 65 + 25 Then r = r + 32
'if not a lower case letter or space
If r < 97 Or r > 97 + 25 Then r = 0
End If

'the character before
a$ = Mid$(c$, i - 1, 1)
s = Asc(a$)
If s <> 32 Then
'if a capital make a lower case
If s >= 65 And s <= 65 + 25 Then s = s + 32
'if not a lower case letter or space
If s < 97 Or s > 97 + 25 Then s = 0
End If
If s > 0 Then ltr(r, 1, s) = ltr(r, 1, s) + 1
'the character before that
a$ = Mid$(c$, i - 2, 1)
s = Asc(a$)
If s <> 32 Then
'if a capital make a lower case
If s >= 65 And s <= 65 + 25 Then s = s + 32
'if not a lower case letter or space
If s < 97 Or s > 97 + 25 Then s = 0
End If
If s > 0 Then ltr(r, 2, s) = ltr(r, 2, s) + 1

'and the character before that one
a$ = Mid$(c$, i - 3, 1)
s = Asc(a$)
If s <> 32 Then
'if a capital make a lower case
If s >= 65 And s <= 65 + 25 Then s = s + 32
'if not a lower case letter or space
If s < 97 Or s > 97 + 25 Then s = 0
End If
If s > 0 Then ltr(r, 3, s) = ltr(r, 3, s) + 1

Next

End Sub

 

 

Public Sub mainloop()

'construct and display the output string

'default temp ASCII number variables to blanks
r = 32: s = 32: u = 32

'go thru while/wend at least once
'and make sure you get non-blanks to start with
While r = 32 Or s = 32 Or u = 32
'get a starting point in the target string
v = 3 + Int(Rnd * (Len(c$) - 2))
'set a 3-character substring
a$ = Mid$(c$, v, 3)
'start off the output string b$ with it
b$ = a$
'collect the ASCII numbers of the 3 characters
r = Asc(Mid$(a$, 3, 1))
s = Asc(Mid$(a$, 2, 1))
u = Asc(Mid$(a$, 1, 1))
Wend

'set the total length of the output string
While Len(b$) < 500

'intialize "successful run" counter
cx = 0

'look for a letter with some good history behind it;
'that is, one that has been seen following the 3 before it
While mt < 3
mt = 0
'pick a letter to try
t = Int(Rnd * 129)
'don't want a blank following too near a blank
While t = 32 And (r = 32 Or s = 32)
t = Int(Rnd * 129)
Wend

'pick a good name starting letter if the last letter was a blank
If r = 32 Then
While ltr(t, 1, 32) = 0 And ltr(t, 2, 32) = 0
t = Int(Rnd * 129)
Wend
End If

'lengthen if too short
If t = 32 And Len(b$) > 6 Then
For j = Len(b$) - 5 To Len(b$)
If Asc(Mid$(b$, j, 1)) = 32 Then t = Int(Rnd * 129)
Next
End If
'if the last letter has been seen to precede this one
If ltr(t, 1, r) > 0 Then
'sum up the history array contents for the 1st-preceding letter
For j = 1 To 128: sumbef1 = sumbef1 + ltr(t, 1, j): Next
If sumbef1 = 0 Then sumbef1 = 1
'tentative pick, if it passes a frequency test
If ltr(t, 1, r) / sumbef1 > (Rnd * 5) / sumbef1 Then mt = mt + 1
End If
'if the letter before that has been seen to precede this one
If ltr(t, 2, s) > 0 Then
'sum up the history array contents for the 2nd-receding letter
For j = 1 To 128: sumbef1 = sumbef1 + ltr(t, 2, j): Next
If sumbef1 = 0 Then sumbef1 = 1
'tentative pick, if it passes a frequency test
If ltr(t, 2, s) / sumbef1 > (Rnd * 5) / sumbef1 Then mt = mt + 1
End If
'and if the letter before that one has been seen to precede this one
If ltr(t, 3, u) > 0 Then
'sum up the history array contents for the 3rd-preceding letter
For j = 1 To 128: sumbef1 = sumbef1 + ltr(t, 3, j): Next
If sumbef1 = 0 Then sumbef1 = 1
'tentative pick, if it passes a frequency test
If ltr(t, 3, u) / sumbef1 > (Rnd * 5) / sumbef1 Then mt = mt + 1
End If
'give a good chance of shortening the name if it is too long, in a natural way

If ct > 7 And ltr(32, 1, t) > 0 And ltr(32, 2, t) > 0 Then
If Rnd > 0.9 Then
t = 32
mt = 3
End If
End If
'increment wend counter -- if >5000, then call this round "unsuccessful"
'and jump out of while/wend
'("unsuccesful" probably means the history for the selected letter did not include
'the previous letters that were already in the output string)
cx = cx + 1
If cx > 5000 Then mt = 3
'mt=3 if the randomly chosen letter passed all the tests! If so, get out
Wend
'reset match flag for next time thru
mt = 0
'increment name length counter
ct = ct + 1
'reset counter if a blank was returned (end of name)
If t = 32 Then ct = 0
'if the letter pick was truly successful
If cx < 5001 Then
'reset 3rd, 2nd, 1st-preceding letters in string
u = s
s = r
r = t

'add the successful character to output string
b$ = b$ & Chr$(t)
End If
Wend

'print final output string
Namemaker.text1.Text = b$

End Sub



Postscript -- You might play around with changing the multiplier for "RND" in the test lines of mainloop. Higher numbers make the testing harder (and that causes the output string to take longer to appear), but it will produce more "realistic" names, depending on the size of the seed string. You have to try to balance this multiplier according to what the total "array counts" of preceding letters are obtained as the pre-analysis routine goes through the seed string. I am running a 2 gigaHz machine, so I can afford to be a bit choosy before my patience runs out. If the output takes too long to produce, reduce the multiplier a little bit. You know you have the number about right when you see a name produced verbatim from the seed listing once in awhile. (I think I had a better way of tapping into and utilizing a "frequency histogram" for the array counts, but I don't recall exactly how I originally did that, when I rewrote this program.)  Also, you can reduce the size of the output string (given as 500 above) to cut down the time it takes to output. If the name lengths don't seem quite to your liking, try changing the "lengthening" and "shortening" routine constants.

You also may want to put a "done" button on the main form. I didn't bother to show that control code (which is simply "end"). On my program, I also have a "repeat" button to redo the output string. You will need to reinitialize b$="" under the repeat control button's code, and set Namemaker.text1.text="". After that, just add "call mainloop" again at the end of the "repeat" control button's code.


  Back to Recreational Computing... 

 

First-time visitors -- including you!

Free Web Counter

Free Hit Counter The Foggiest Notion The Foggiest Notion The Foggiest Notion The Foggiest Notion The Foggiest Notion

 

Luck Favors the Prepared Mind...

Essays • Lost Articles • Loose Ends • Collections • Computing • Projects • Widdershins • Quotations • Links • Us

Site contents Copyright 2004-2008 by Gary Cuba       Email: webmeister at thefoggiestnotion dot com