Some people eat, sleep and chew gum, I do genealogy and write...

Sunday, December 3, 2017

Can a computer do genealogy?


Computers are complex calculating machines. They process unimaginable amounts of data using 1s and 0s. You might have heard the term "artificial intelligence" bandied about from time to time. Many genealogists, not all by any means, now benefit from using tremendously powerful computers and very sophisticated computer programs to assist with their genealogical research. The ultimate question is simply this: will everything we now do as genealogists be replaced by more highly sophisticated computer programs?

In thinking about this question, I began mentally constructing a hierarchy of the complexity associated with what I, and others, do as genealogists. Some of those tasks involve simple search and retrieval functions, such as looking for a name in an index of names. At the level of a computer program, this search involves the process of looking for matching strings of characters. A simple example is the "search and replace" function of a word processing program. At this level, there is an assumption that everything that is being searched is in "text" format. In a genealogical context, when we are "Indexing" records, we are converting the printed or handwritten information on the record into a "text" file the computer can use to search for equivalent patterns. Computer programs are very good at identifying certain types of patterns.
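To make the "matching strings of characters" idea concrete, here is a minimal sketch in Python. The function name search_index and the index entries are made up for illustration; they do not come from any real database or genealogy program.

```python
def search_index(index, query):
    """Return the index entries that contain the query text, ignoring case."""
    needle = query.casefold()
    return [entry for entry in index if needle in entry.casefold()]


# Hypothetical index entries, invented for this example.
index = [
    "Smith, John",
    "Smith, Mary Ann",
    "Jones, William",
]

matches = search_index(index, "smith")
print(matches)
```

The program has no idea that "Smith" is a surname. It only finds the same characters in the same order, which is why everything has to be in "text" format before it can be searched.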

As we move up the scale of genealogical complexity, we quickly leave the search and replace or search and match level and move on to ideas and concepts that are much more difficult to program into a computer. One step up from search and replace, we arrive at the challenges of optical character recognition. Genealogists are benefitting from this particular level of computer programming by having the ability to search through huge amounts of text from books and printed records. The challenges of optical character recognition mainly focus on background noise and poorly formed characters. For example, there may be little contrast between the text characters being recognized and the substrate they are printed on. In another case, the characters may be malformed, i.e. broken, and hard to recognize. If you have used an optical character recognition program, you realize that the output is not perfect. But even if you were copying the text manually, your copy would also probably be imperfect.
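One consequence of imperfect output is that an exact search can miss a name the recognition step got slightly wrong. The sketch below assumes a made-up list of surnames in which "Bernard" was misread as "Bemard" (the letters "rn" read as "m"), and uses the approximate matching in Python's standard difflib module to find it anyway. The function name fuzzy_lookup, the surnames, and the 0.7 cutoff are all my own assumptions for illustration, not the method used by any particular genealogy website.

```python
from difflib import get_close_matches


def fuzzy_lookup(surnames, query, cutoff=0.7):
    """Return surnames that approximately match the query, ignoring case."""
    # Map lowercased spellings back to the spelling found in the text.
    lowered = {name.casefold(): name for name in surnames}
    hits = get_close_matches(query.casefold(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


# Hypothetical surnames as they might come out of a recognition program.
ocr_surnames = ["Bemard", "Barker", "Benson"]

exact = [name for name in ocr_surnames if name == "Bernard"]
close = fuzzy_lookup(ocr_surnames, "Bernard")
print(exact, close)
```

Loosening the cutoff finds more misread names but also brings in more false matches, so a person still has to evaluate what comes back.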

By using both optical character recognition and manual indexing, online genealogical database programs are able to provide "record hints" at an amazing level of accuracy. Couldn't that level of accuracy simply be extended to the point of having the program construct your family tree? Right now, the answer is no. An obvious fact of genealogical life is that many records we deal with are handwritten. Handwriting recognition is a major increase in complexity over text recognition. I have written about the status of handwriting recognition programs recently. There is a distinct possibility that handwriting recognition programs will achieve the same level of accuracy as human-based handwriting recognition.

In all this, there is the metaphorical "elephant in the room" that severely limits the ability of any one program to "do genealogy." That is the simple fact that genealogical research is spread both physically and content-wise all over the world. No one database or computer system or program has access to even a very small percentage of the total number of records available. In making this observation, I am not referring to the fact that many records remain in paper format but to the fact that records are scattered physically across the world. Even with the huge advances made by the Internet, information is still largely compartmentalized.

But what if we took a completely different approach to doing genealogy? Couldn't we do DNA testing of every person on the face of the earth and thereby construct an existing relationship tree of everyone? The answer is probably yes. But the answer begs the question. Even if we knew how everyone living was related, that would give us no information about the identity of our ancestors. Showing degrees of relatedness only provides an incentive for doing historical research.

What if every historical record in existence was identified, digitized, indexed and made completely and freely available online? Could a computer program then construct a family tree for everyone on the earth? Theoretically yes, but practically no. However, in limited contexts, finding information about a particular individual is already possible. All four of the large online genealogical database programs that supply record hints already do this to some extent. To a large extent, the relationships they rely on are established by the fact that millions of people have entered their historical genealogical information into online family trees. Regardless of the accuracy of the information, computers are able to evaluate and match relationships, especially when supported by DNA testing. But unfortunately, there are many, many people in the world who fall outside of the system.

But what may not be obvious is that regardless of the sophistication of the genealogical database programs, they rely entirely upon individual research including the evaluation of relationships and records by individuals and not by their computer programs.

In writing this commentary, I am not trying to denigrate the advances made by computerization of portions of the genealogical research process. For example, I have written this entire blog post using a voice recognition program. You may be able to see the limitations of such a program in the typos in my text.

Theoretically, given enough computer power, assuming that all available historical records were digitized and made available for text recognition and/or handwriting recognition, and further assuming that privacy and political concerns could be satisfied, I can see where substantial amounts of the routine genealogical research could be automated and extended beyond its present capabilities.

Back in 1950, Alan Turing proposed a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Essentially, the Turing test was whether or not a computer could carry on a conversation with a human without the human determining that a computer program was involved. The test is used essentially to determine the level of artificial intelligence attained by a computer program. From my standpoint, the Turing test could be satisfied long before a program could be developed to do accurate genealogical research.

Because of my background in linguistics, I am also unconvinced that we presently have the ability to replicate human speech with a computer. Notwithstanding programs like Siri, human speech is one of the most complex phenomena there is. Genealogical research approaches the complexity of human speech. Although I see some significant advances in computer search programs associated with genealogical research, I do not see computers taking over my job anytime in the near future.
