Monthly Archives: February 2019

MALLET: A Text Analysis Tool

MALLET is an intriguing piece of software developed by Andrew McCallum with the aid of several graduate students and staff. MALLET is an open source project, meaning that the tool can be used freely for research and commercial purposes so long as its users credit MALLET appropriately. MALLET is also a machine learning program. Machine learning is a branch of Artificial Intelligence, or AI for short. In plain terms, the program learns as it runs. As it learns from its mistakes, it readjusts its algorithm in order to give better and smarter outputs. MALLET allows users to quickly extract information from documents and turn it into text-based files. On top of this text conversion, MALLET also supports the classification of data. Through the process of sequence tagging, MALLET allows users to tag certain data, which helps refine searches through large amounts of text.

One major weakness of MALLET is the process of setting up and using the actual tool. The download instructions appear to assume that the user has a Computer Science background. For example, a user who clicks the Download hyperlink is redirected to a new page with information on how to download MALLET. A user who wants the Windows version downloads a zip file. The zip file then needs to be extracted, and the user needs to set an environment variable. This is the stage where the download tutorial becomes grey with confusion. What is an environment variable, and where does a user find it? An environment variable is a named setting that the operating system stores and passes to the programs it runs. Here it has to be created, through the system settings or the command line, and set to the location of the extracted MALLET folder so that the program can find its own files. The tutorial on the website does not answer these fundamental questions for someone using MALLET for the first time. A user who wants to use MALLET for research but does not have a Computer Science background would find it difficult to get the maximum benefit from the tool. This is a recurring theme on the MALLET website, which is evidently directed towards a Computer Science oriented audience.
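A small sketch may make the idea of an environment variable less abstract. It assumes the variable is named MALLET_HOME, which is the name I understand the Windows instructions to use; the class EnvCheckSketch and its describe method are hypothetical helpers written for this post and are not part of MALLET.

```java
import java.io.File;

// Hypothetical helper, not part of MALLET: reports whether an environment
// variable is set and whether its value is an existing directory.
class EnvCheckSketch {

    // Returns a short, human-readable status for the given variable value.
    static String describe(String name, String value) {
        if (value == null || value.isEmpty()) {
            return name + " is not set";
        }
        File dir = new File(value);
        if (!dir.isDirectory()) {
            return name + " is set to " + value + " but that is not a directory";
        }
        return name + " points to " + value;
    }

    public static void main(String[] args) {
        // Assumption: MALLET_HOME is the variable the install instructions ask for.
        String name = "MALLET_HOME";
        // System.getenv returns null when the variable has not been defined.
        System.out.println(describe(name, System.getenv(name)));
    }
}
```

Running this before and after following the install instructions shows what changed: the operating system now hands the program a value it did not have before.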

Another weakness of MALLET is that it forces the use of the command line. The command line is very useful: it allows Computer Scientists to quickly perform tasks without having to go through graphical windows, which at times can be tedious. However, there is a reason why modern operating systems use graphical interfaces. It is simply easier to interact with a graphical menu than with a black window full of text that only responds to special commands. This feeds into the concern that MALLET is not very user-friendly. It does not matter how advanced an algorithm is if a human cannot use it effectively. Graphics have been a major push in computing because they enable everyday people to interact with systems without having to worry about the complex operations being done under the hood of the computer.

The main strength of MALLET is its use of machine learning algorithms to quickly process unlabeled text into usable data. For a digital historian this is huge, as it allows historians to input a large number of files and search for a certain topic. The MALLET website calls this way of working through large quantities of unlabeled text topic modeling. To MALLET, a topic is a cluster of words that tend to appear together across a large collection of documents. A topic then makes it possible to find documents that use words with similar meanings, which connects the data together and displays similarities. This is important for historians, as it allows them to find similarities in large amounts of text very quickly. MALLET is designed for finding data quickly and effectively, and it offers specialized command options to optimize the results. The code used to implement optimization is interesting. To implement an effective optimization algorithm, the code must have access to the values of all the parameters being optimized. MALLET handles this by keeping track of every parameter in one place. For topics, this process is called hyperparameter optimization. Hyperparameter optimization allows the model to treat certain topics as more prominent than others, rather than assuming that every topic is equally common.
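To make the idea of a topic as a cluster of words concrete, here is a toy sketch. It is not MALLET's algorithm or API: MALLET infers its topics from the documents themselves, whereas the word list here is written by hand, and the class TopicOverlapSketch and its topicShare method are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration only: the topic is hand-written, not learned from data.
class TopicOverlapSketch {

    // Fraction of a document's words that belong to the topic's word cluster.
    static double topicShare(Set<String> topicWords, String document) {
        // Lowercase the text and split on anything that is not a letter a-z.
        String[] tokens = document.toLowerCase().split("[^a-z]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can produce an empty leading token
            }
            total++;
            if (topicWords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        Set<String> railways = new HashSet<>(Arrays.asList("railway", "train", "station", "track"));
        String diary = "The train left the station late";
        String recipe = "Stir the soup and add salt";
        // Two of the diary's six words are in the cluster; none of the recipe's are.
        System.out.println(topicShare(railways, diary));
        System.out.println(topicShare(railways, recipe));
    }
}
```

A document that shares many words with the cluster scores high, which is roughly how topics let a historian group related documents without reading each one first.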

 

import cc.mallet.optimize.Optimizable;

public class OptimizerExample implements Optimizable.ByGradientValue {

    // Optimizables encapsulate all state variables, 
    //  so a single Optimizer object can be used to optimize 
    //  several functions.

    double[] parameters;

    public OptimizerExample(double x, double y) {
        parameters = new double[2];
        parameters[0] = x;
        parameters[1] = y;
    }

    public double getValue() {

        double x = parameters[0];
        double y = parameters[1];

        return -3*x*x - 4*y*y + 2*x - 4*y + 18;

    }

    public void getValueGradient(double[] gradient) {

        gradient[0] = -6 * parameters[0] + 2;
        gradient[1] = -8 * parameters[1] - 4;

    }

    // The following get/set methods satisfy the Optimizable interface

    public int getNumParameters() { return 2; }
    public double getParameter(int i) { return parameters[i]; }
    public void getParameters(double[] buffer) {
        buffer[0] = parameters[0];
        buffer[1] = parameters[1];
    }

    public void setParameter(int i, double r) {
        parameters[i] = r;
    }
    public void setParameters(double[] newParameters) {
        parameters[0] = newParameters[0];
        parameters[1] = newParameters[1];
    }
}
Example code from the MALLET documentation showing a class that implements the Optimizable interface for a quadratic function.
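The class above only describes the function and its gradient; something else has to search for the best parameters. As a rough sketch of that search, the following plain-Java example uses simple fixed-step gradient ascent on the same quadratic. This is a stand-in chosen for readability and is not the optimizer MALLET itself provides; the class GradientAscentSketch, the step size and the iteration count are all assumptions made for this illustration.

```java
// Plain-Java sketch; it does not use MALLET's optimizer classes.
class GradientAscentSketch {

    // Same quadratic as getValue() in the example above.
    static double value(double x, double y) {
        return -3 * x * x - 4 * y * y + 2 * x - 4 * y + 18;
    }

    // Same partial derivatives as getValueGradient() in the example above.
    static double[] gradient(double x, double y) {
        return new double[] { -6 * x + 2, -8 * y - 4 };
    }

    // Repeatedly steps uphill along the gradient with a fixed step size.
    static double[] maximize(double x, double y, double stepSize, int iterations) {
        for (int i = 0; i < iterations; i++) {
            double[] g = gradient(x, y);
            x += stepSize * g[0];
            y += stepSize * g[1];
        }
        return new double[] { x, y };
    }

    public static void main(String[] args) {
        // Start at (0, 0); step size and iteration count are illustrative choices.
        double[] best = maximize(0.0, 0.0, 0.1, 200);
        System.out.printf("x=%.4f y=%.4f value=%.4f%n",
                best[0], best[1], value(best[0], best[1]));
    }
}
```

Setting both gradient components to zero by hand gives x = 1/3 and y = -1/2, where the function reaches its maximum of 58/3, so the loop should settle near those values.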

In the end, MALLET is a very powerful tool for conducting searches on vast amounts of data. It can prove very useful for finding commonalities in data very quickly. However, the experience of actually using the program could and should be improved, so that working with MALLET is as easy as possible. In the computing world, there is always a struggle between complexity and ease of use. Major companies spend millions, sometimes even billions, trying to perfect the balance between user-friendly software and optimized software.

              Bibliography

  1. McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” 2002. http://mallet.cs.umass.edu.
  2. Mimno, David, Charles Sutton, Gaurav Chandalia, and Al Hough. “MAchine Learning for LanguagE Toolkit.” MALLET. Accessed February 6, 2019. http://mallet.cs.umass.edu/about.php.