Monday, December 10, 2007

Surveying the Scene

Versions
It's important to know as soon as possible what version of Perl the program was developed for. This isn't necessarily the same as the version of Perl it may currently be running against, but find that out anyway so you have an upper bound. Again, get this information from the gold source: the original running environment. Type the complete path to perl that appears in the main program's shebang line (see below) followed by the -v argument to find out the version; for example:


% /opt/bin/perl -v

This is perl, v5.6.1 built for i386-linux

Copyright 1987-2003, Larry Wall


If this output indicates that they're running on a newer perl than the one you have (run the same command on your perl), do whatever you can to upgrade. Although upgrading may be unnecessary, if you have any difficulties getting the code to work, your energy for debugging will be sapped by the nagging fear that the problem is due to a version incompatibility.

One reason upgrading may be unnecessary is that the operating group upgraded their perl after the original program was written, and the program did not trigger any of the forward incompatibilities. A program written for Perl 4 could easily work identically under Perl 5.8.3 and probably would; the Perl developers went to fanatical lengths to preserve backward compatibility across upgrades.

Look at the dates of last modifications of the source files. You may need to visit the original operational system to be able to determine them. Although the dates may be more recent than any significant code changes (due to commenting, or insignificant changes to constants), the earlier the dates are, the more they can bound for you the most recent version of Perl the program was developed for. See the version history in Chapter 7 to find out how to determine that version.
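A quick way to survey those modification dates is to list the files oldest first. The following is only a minimal sketch: the name files_by_mtime is made up for illustration, and it assumes you run it on the original system so that the timestamps are the real ones.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Return [mtime, path] pairs for the given files, oldest first.
# Files that cannot be stat'ed are silently skipped.
sub files_by_mtime {
    my @files = @_;
    my @pairs;
    for my $file (@files) {
        my $mtime = (stat $file)[9];
        next unless defined $mtime;
        push @pairs, [ $mtime, $file ];
    }
    return sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @pairs;
}

# Print one dated line per file named on the command line.
for my $pair (files_by_mtime(@ARGV)) {
    my ($mtime, $file) = @$pair;
    print strftime('%Y-%m-%d', localtime $mtime), "  $file\n";
}
```

Run it with the source files as arguments; the earliest dates give the tightest bound on the Perl version in use when the code was written.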

Part or Whole?
Are you in fact taking over a complete program or a module used by other programs (or both)? Let's see how we can find out.

2.2.1 Shebang-a-Lang-a-Ding-Dong
You can recognize a Perl program file by the fact that the first two characters are:


#!


and somewhere on the rest of that line the word "perl" appears.[1] Developers call this the shebang line. It is possible to create a Perl program without this line by requiring it to be run explicitly with a command something like


% perl program_file_name


although it would be strange to receive a main program in this state. Don't depend on the filename ending with an extension like .pl or .plx; this is not necessary on many systems. A .pl extension is commonplace on Windows systems, where the file extension is required to tell the operating system the type of the file; aside from that, .pl extensions were often used as a convention for "Perl Library": files containing specialized subroutines. These mostly predate the introduction in Perl 5 of objects, which provide a better paradigm for code reuse.

[1] A Perl program on Windows could get away without this line, because the .pl suffix is sufficient to identify it as a Perl program, but it is good practice to leave the line in on Windows anyway. (The path to Perl isn't important in that case.)

One time when the extension of a file is guaranteed, however, is for a Perl module; if the filename ends in .pm, then it's a module, intended to be used by another file that's the actual program.

Caveat: Sometimes a .pm file may begin with a shebang; this almost certainly means that someone created a module that contains its own tests so that executing the module as a program also works. If you see a .pm like this, try running it to see what happens. A file that's not a .pm can't be a module, but could still be a dependency rather than the main program. If it begins with a shebang, it could be a library of subroutine or constant definitions that's been endowed with self-testing capabilities. It may not be possible to tell the difference between this type of file and the main program without careful inspection or actual execution if the developer did not comment the file clearly.
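The rules of thumb above can be turned into a rough first-pass classifier. This is a sketch only; classify_perl_file and its result strings are invented for illustration, and as the caveat says, a shebang alone cannot distinguish a main program from a self-testing library.

```perl
use strict;
use warnings;

# Classify a file using the rules of thumb above:
#   .pm suffix   -> module (possibly self-testing if it has a shebang)
#   perl shebang -> program, or a self-testing library
#   otherwise    -> unknown; inspect by hand
sub classify_perl_file {
    my ($path) = @_;
    open my $fh, '<', $path or return 'unreadable';
    my $first = <$fh>;
    close $fh;
    $first = '' unless defined $first;

    my $has_shebang = $first =~ /^#!.*perl/ ? 1 : 0;
    if ($path =~ /\.pm$/) {
        return $has_shebang
            ? 'module with shebang (may be self-testing)'
            : 'module';
    }
    return $has_shebang ? 'program or self-testing library' : 'unknown';
}

# Report on every file named on the command line.
print $_, ': ', classify_perl_file($_), "\n" for @ARGV;
```

Anything reported as "unknown" may still be a library loaded with require, so treat the output as a starting point for inspection.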

It is quite possible that you will have to change the shebang line to refer to a different perl. The previous owners of the program may have installed their perl in a different location from the one you plan to use. If the code consists of a lot of files containing that path, here's how you can change them all at once, assuming that /their/path/to/perl is on the original shebang line and /your/path/to/perl is the location of your perl:


% perl -pi.bak -e \
    's#/their/path/to/perl#/your/path/to/perl#g' *


This command puts the original version of each file—before any changes were made—in a file of the same name but with .bak appended to it. If you've been using a revision control system to store the files in, you don't need to make copies like that. (I told you that would turn out to be a good decision.) Leaving out the .bak:


% perl -pi -e 's#/their/path/to/perl#/your/path/to/perl#g' *


results in in-place editing; that is, the original files are overwritten with the new contents.

This command assumes that all the files to be changed are in the current directory. If they are contained in multiple subdirectories, you can combine this with the find command like this:


% find . -type f -print | xargs perl -pi -e \
    's#/their/path/to/perl#/your/path/to/perl#g'


Of course, you can use this command for globally changing other strings besides the path to perl, and you might have frequent occasion to do so. Put the command in an alias, like so:


% alias gchange "find . -type f -print | xargs \
    perl -pi.bak -e '\!:1'"


the syntax of which may vary depending on which shell you are using. Then you can invoke it thusly:


% gchange s,/their/path/to/perl,/your/path/to/perl,g


Note that I changed the substitution delimiter from # to a comma, because a shell might take the # as a comment-introducing character. Because you might be using this alias to change pieces of code containing characters like $ and ! that can have special meaning to your shell, learn about how your shell does quoting and character escaping so you'll be prepared to handle those situations.

Note also that I put the .bak back. Otherwise, one day you'll forget to check the files into your version control system first, because the alias isn't called something like gchange_with_no_backups.

If you want to develop this concept further, consider turning the alias into a script that checks each file in before altering it.
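One possible shape for such a script is sketched below. It is narrower than the alias: it edits only the shebang line, so other occurrences of the path are left alone. The name fix_shebang is hypothetical, and the check-in step is left as a comment because the command depends on your revision control system.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Find qw(find);

# Replace $old with $new on the shebang line only, keeping FILE.bak.
# Returns 1 if the file was changed, 0 otherwise.
sub fix_shebang {
    my ($file, $old, $new) = @_;
    open my $in, '<', $file or die "Can't read $file: $!";
    my @lines = <$in>;
    close $in;

    return 0 unless @lines && $lines[0] =~ /^#!/;
    return 0 unless $lines[0] =~ s/\Q$old\E/$new/;

    # A real script would check the file into revision control here.
    copy($file, "$file.bak") or die "Can't back up $file: $!";
    open my $out, '>', $file or die "Can't write $file: $!";
    print {$out} @lines;
    close $out or die "Can't close $file: $!";
    return 1;
}

# Usage: fix_shebang.pl /their/path/to/perl /your/path/to/perl [dir]
if (@ARGV >= 2) {
    my ($old, $new, $dir) = @ARGV;
    $dir = '.' unless defined $dir;
    find(sub {
        return unless -f $_ && !/\.bak$/;
        print "changed $File::Find::name\n" if fix_shebang($_, $old, $new);
    }, $dir);
}
```

Because File::Find changes into each directory as it descends, the callback can pass the bare filename in $_ straight to fix_shebang.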

2.2.2 .ph Files
You may encounter a .ph file. This is a Perl version of a C header (.h) file, generated by the h2ph program that comes with perl. The odds are that you can eliminate the need for this file in rewriting the program. These .ph files have not been commonly used since Perl 4 because the dynamic module loading capability introduced in Perl 5 made it possible, and desirable, for modules to incorporate any header knowledge they required. A private .ph file is probably either a copy of what h2ph would have produced from a system header (but the author lacked the permission to install it in perl's library), or a modified version of the same. Read the .ph file to see what capability it is providing and then research modules that perform the same function.

Find the Dependencies
Look for documentation that describes everything that needs to exist for this program to work. Complex systems could have dependencies on code written in other languages, on data files produced by other systems, or on network connections with external services. If you can find interface agreements or other documents that describe these dependencies they will make the job of code analysis much easier. Otherwise you will be reduced to a trial-and-error process of copying over the main program and repeatedly running it and identifying missing dependencies until it appears to work.

A common type of dependency is a custom Perl module. Quite possibly the program uses some modules that should have been delivered to you but weren't. Get a list of modules that the program uses in operation and compare it with what you were given and what is in the Perl core. Again, this is easier to do with the currently operating version of the program. First try the simple approach of searching for lines beginning with "use " or "require ". On UNIX, you can use egrep:


% egrep '^(use|require) ' files...


Remember also to search all the modules that are part of the code you were given. Let's say that I did that and the output was:


use strict;
use warnings;
use lib qw(/opt/lib/perl);
use WWW::Mechanize;


Can I be certain I've found all the modules the program loads? No. For one thing, there's no law that use and require have to be at the beginning of a line; in fact I commonly have require statements embedded in do blocks in conditionals, for instance.
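A scan that drops the start-of-line anchor catches statements embedded like that. The following is a heuristic sketch, not a parser: find_module_names is a made-up name, it skips only whole-line comments, and it can be fooled by strings, POD, and here-documents.

```perl
use strict;
use warnings;

# Return the sorted, de-duplicated module names that appear after
# "use" or "require" anywhere on a line of the given source text.
# Version requirements (use 5.006) and quoted filenames are ignored.
sub find_module_names {
    my ($source) = @_;
    my %seen;
    for my $line (split /\n/, $source) {
        next if $line =~ /^\s*#/;    # skip whole-line comments
        while ($line =~ /\b(?:use|require)\s+([A-Za-z_]\w*(?:::\w+)*)/g) {
            $seen{$1} = 1;
        }
    }
    return sort keys %seen;
}

# Scan every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Can't read $file: $!";
    my $source = do { local $/; <$fh> };
    close $fh;
    print "$file: $_\n" for find_module_names($source);
}
```

Expect some false positives, such as the word "use" in a string followed by another word; they are cheap to weed out by eye.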

The other reason this search can't be foolproof is that Perl programs are capable of loading modules dynamically based on conditions that are unknown until run time. Although there is no completely foolproof way of finding out all the modules the program might use, a pretty close way is to add this code to the program:


END {
    print "Directories searched:\n\t",
          join ("\n\t" => @INC),
          "\nModules loaded:\n\t",
          join ("\n\t" => sort values %INC),
          "\n";
}


Then run the program. You'll get output looking something like this:


Directories searched:
    /opt/lib/perl
    /usr/lib/perl5/5.6.1/i386-linux
    /usr/lib/perl5/5.6.1
    /usr/lib/perl5/site_perl/5.6.1/i386-linux
    /usr/lib/perl5/site_perl/5.6.1
    /usr/lib/perl5/site_perl/5.6.0
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.6.1/i386-linux
    /usr/lib/perl5/vendor_perl/5.6.1
    /usr/lib/perl5/vendor_perl
    .
Modules loaded:
    /usr/lib/perl5/5.6.1/AutoLoader.pm
    /usr/lib/perl5/5.6.1/Carp.pm
    /usr/lib/perl5/5.6.1/Exporter.pm
    /usr/lib/perl5/5.6.1/Exporter/Heavy.pm
    /usr/lib/perl5/5.6.1/Time/Local.pm
    /usr/lib/perl5/5.6.1/i386-linux/Config.pm
    /usr/lib/perl5/5.6.1/i386-linux/DynaLoader.pm
    /usr/lib/perl5/5.6.1/lib.pm
    /usr/lib/perl5/5.6.1/overload.pm
    /usr/lib/perl5/5.6.1/strict.pm
    /usr/lib/perl5/5.6.1/vars.pm
    /usr/lib/perl5/5.6.1/warnings.pm
    /usr/lib/perl5/5.6.1/warnings/register.pm
    /usr/lib/perl5/site_perl/5.6.1/HTML/Form.pm
    /usr/lib/perl5/site_perl/5.6.1/HTTP/Date.pm
    /usr/lib/perl5/site_perl/5.6.1/HTTP/Headers.pm
    /usr/lib/perl5/site_perl/5.6.1/HTTP/Message.pm
    /usr/lib/perl5/site_perl/5.6.1/HTTP/Request.pm
    /usr/lib/perl5/site_perl/5.6.1/HTTP/Response.pm
    /usr/lib/perl5/site_perl/5.6.1/HTTP/Status.pm
    /usr/lib/perl5/site_perl/5.6.1/LWP.pm
    /usr/lib/perl5/site_perl/5.6.1/LWP/Debug.pm
    /usr/lib/perl5/site_perl/5.6.1/LWP/MemberMixin.pm
    /usr/lib/perl5/site_perl/5.6.1/LWP/Protocol.pm
    /usr/lib/perl5/site_perl/5.6.1/LWP/UserAgent.pm
    /usr/lib/perl5/site_perl/5.6.1/URI.pm
    /usr/lib/perl5/site_perl/5.6.1/URI/Escape.pm
    /usr/lib/perl5/site_perl/5.6.1/URI/URL.pm
    /usr/lib/perl5/site_perl/5.6.1/URI/WithBase.pm
    /opt/lib/perl/WWW/Mechanize.pm
    /usr/lib/perl5/site_perl/5.6.1/i386-linux/Clone.pm
    /usr/lib/perl5/site_perl/5.6.1/i386-linux/HTML/Entities.pm
    /usr/lib/perl5/site_perl/5.6.1/i386-linux/HTML/Parser.pm
    /usr/lib/perl5/site_perl/5.6.1/i386-linux/HTML/PullParser.pm
    /usr/lib/perl5/site_perl/5.6.1/i386-linux/HTML/TokeParser.pm


That doesn't mean that the user code loaded 35 modules; in fact, it loaded 4 (strict, warnings, lib, and WWW::Mechanize), one of which (WWW::Mechanize) loaded the rest, mostly via other modules that in turn loaded other modules; you get the picture. Now you want to verify that the program isn't somehow loading modules that your egrep command didn't find; so create a program containing just the results of the egrep command and add the END block, like so:


use strict;
use warnings;
use lib qw(/opt/lib/perl);
use WWW::Mechanize;

END {
    print "Directories searched:\n\t",
          join ("\n\t" => @INC),
          "\nModules loaded:\n\t",
          join ("\n\t" => sort values %INC),
          "\n";
}


Run it. If the output is identical to what you got when you added the END block to the entire program, then egrep almost certainly found all the dependencies. If it isn't, you'll have to dig deeper.
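Comparing two long listings by eye is error-prone, so it may help to automate the comparison. The sketch below assumes you saved each run's report to a file in the format printed by the END block; read_module_report and compare_module_lists are hypothetical names, and any program output that follows the report in the saved file would need trimming first.

```perl
use strict;
use warnings;

# Read the "Modules loaded:" part of a saved report into a list.
sub read_module_report {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't read $file: $!";
    my ($in_modules, @paths);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^Modules loaded:/) { $in_modules = 1; next }
        next unless $in_modules;
        $line =~ s/^\s+//;
        push @paths, $line if length $line;
    }
    close $fh;
    return @paths;
}

# Given two array refs of module paths, return two array refs:
# paths only in the first list and paths only in the second.
sub compare_module_lists {
    my ($full, $stub) = @_;
    my %in_full = map { $_ => 1 } @$full;
    my %in_stub = map { $_ => 1 } @$stub;
    my @only_full = grep { !$in_stub{$_} } sort keys %in_full;
    my @only_stub = grep { !$in_full{$_} } sort keys %in_stub;
    return (\@only_full, \@only_stub);
}

# Usage: compare_reports.pl full_program.out stub_program.out
if (@ARGV == 2) {
    my @full = read_module_report($ARGV[0]);
    my @stub = read_module_report($ARGV[1]);
    my ($only_full, $only_stub) = compare_module_lists(\@full, \@stub);
    print "Only loaded by the full program:\n";
    print "\t$_\n" for @$only_full;
    print "Only loaded by the stub:\n";
    print "\t$_\n" for @$only_stub;
}
```

Anything listed under the first heading is a module the full program loaded by some route that egrep did not reveal.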

Even if the outputs match, it's conceivable, although unlikely, that you haven't found all the dependencies. Why? Just because one set of modules was loaded by the program the time you ran it with your reporting code doesn't mean it couldn't load another set some other time. You can't be certain the code isn't doing that until you've inspected every eval and require statement in it. For instance, DBI (the database-independent interface module) decides which DBD driver module it needs depending on part of a string passed to its connect() method. Fortunately, code that complicated is rare.

Now check that the system you need to port the program to contains all the required modules. Take the list output by egrep and prefix each module with -M in a one-liner like so:


% perl -Mstrict -Mwarnings -MWWW::Mechanize -e 0


This runs a trivial program (0) after loading the required modules. If the modules loaded okay, you won't see any errors. If one or more modules don't exist on this system, you'll see a message starting, "Can't locate module.pm in @INC . . . "

That's quite likely what will happen with the preceding one-liner, and the reason is the use lib statement in the source. Like warnings and strict, lib is a pragma, meaning that it's a module that affects the behavior of the Perl compiler. In this case it was used to add the directory /opt/lib/perl to @INC, the list of directories perl searches for modules in. Seeing that in a program you need to port indicates that it uses modules that are not part of the Perl core. It could mean, as it did here, that it is pointing perl toward a non-core Perl module (WWW::Mechanize) that is nevertheless maintained by someone else and downloaded from CPAN. Or it could indicate the location of private modules that were written by the developers of the program you are porting. Find out which case applies: Look on CPAN for any missing modules. The easiest way to do this is to go to http://search.cpan.org/ and enter the name of each missing module, telling the search engine to search in "modules".[2]

[2] Unless you're faced with a huge list to check, in which case you can script searches using the CPAN.pm module's expand method.

So if we want to write a one-liner that searches the same module directories as the original code, we would have to use Perl's -I flag:


% perl -Mstrict -Mwarnings -I/opt/lib/perl -MWWW::Mechanize \
    -e 0


However, in the new environment you're porting the program to, there may not be a /opt/lib/perl; there may be another location you should install third-party modules to. If possible, install CPAN modules where CPAN.pm wants to put them; that is, in the @INC site-specific directory. (Local business policies might prevent this, in which case you put them where local policy specifies and insert use lib statements pointing to that location in your programs.)
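When the list of modules is long, the one-liner gets unwieldy and stops at the first failure. A small script can try each module in turn and report all the missing ones. This is a sketch under the assumption that loading the modules has no harmful side effects; check_modules is an invented name, and /opt/lib/perl is simply the directory from the example above.

```perl
use strict;
use warnings;

# Try to load each named module, optionally searching extra
# directories first. Returns a hash: module name => '' on success,
# or the error message from the failed require.
sub check_modules {
    my ($modules, @extra_dirs) = @_;
    local @INC = (@extra_dirs, @INC);
    my %result;
    for my $module (@$modules) {
        (my $file = "$module.pm") =~ s{::}{/}g;
        $result{$module} = eval { require $file; 1 } ? '' : $@;
    }
    return %result;
}

# Usage: check_modules.pl strict warnings WWW::Mechanize
my %status = check_modules([@ARGV], '/opt/lib/perl');
for my $module (sort keys %status) {
    print $module, ': ', ($status{$module} eq '' ? 'ok' : 'MISSING'), "\n";
}
```

The extra directories play the same role as -I on the command line; drop the second argument to test against the default @INC only.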

If you find a missing module on CPAN, see if you can download the same version that is used by the currently operational program—not necessarily the latest version. Remember, you want first of all to re-create the original environment as closely as possible to minimize the number of places you'll have to look for bugs if it doesn't work. Again, if you're dealing with a relatively small, unprepossessing program, this level of caution may not be worth the trouble and you will usually spend less time overall if you just run it against the latest version of everything it needs.

To find out what version of a module (Foo::Bar, say) the original program uses, run this command on the operational system:


% perl -MFoo::Bar -le 'print $Foo::Bar::VERSION'

0.33


Old or poorly written modules may not define a $VERSION package variable, leaving you to decide just how much effort you want to put into finding exactly the same historical version, because you'll have to compare the actual source code texts (unless you have the source your module was installed from and the version number is embedded in the directory name). Don't try getting multiple versions of the same module to coexist in the same perl installation unless you're desperate; this takes considerable expertise.
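To collect versions for a whole list of modules at once, and to spot the ones that define no $VERSION, something like the following could be run on the operational system. It is a sketch; module_version is a hypothetical helper built on the VERSION method that every Perl 5 class inherits from UNIVERSAL.

```perl
use strict;
use warnings;

# Return the version a module reports, or undef if the module loads
# but defines no $VERSION. Dies if the module cannot be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    require $file;
    return $module->VERSION;
}

# Usage: module_versions.pl Foo::Bar WWW::Mechanize ...
for my $module (@ARGV) {
    my $version = eval { module_version($module) };
    if ($@) {
        print $module, ": not installed\n";
    }
    elsif (defined $version) {
        print $module, ": $version\n";
    }
    else {
        print $module, ": no \$VERSION defined\n";
    }
}
```

Modules reported as having no $VERSION are the ones for which you would have to fall back on comparing source texts.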

You can find tools for reporting dependencies in programs and modules in Tom Christiansen's pmtools distribution (http://language.perl.com/misc/pmtools-1.00.tar.gz).

2.3.1 Gobbledygook
What if you look at a program and it really makes no sense at all? No indentation, meaningless variable names, line breaks in bizarre places, little or no white space? You're looking at a deliberately obfuscated program, likely one that was created by running a more intelligible program through an obfuscator.[3]

[3] Granted, some programs written by humans can appear obfuscated even when there was no intention that they appear that way. See Section 1.5.

Clearly, you'd prefer to have the more intelligible version. That's the one the developer used; what you've got is something they delivered in an attempt to provide functionality while making it difficult for the customer to make modifications or understand the code. You're now the developer, so you're entitled to the original source code; find it. If it's been lost, don't despair; much of the work of reconstructing a usable version of the program can be done by a beautifier, discussed in Section 4.5. A tool specifically designed for helping you in this situation is Joshua ben Jore's module B::Deobfuscate (http://search.cpan.org/dist/B-Deobfuscate/).
