Machine Learning for Sequential Data: A Review

Thomas G. Dietterich

Oregon State University, Corvallis, Oregon, USA
tgd@cs.orst.edu
WWW home page: http://www.cs.orst.edu/~tgd

Abstract. Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer networks. The paper also discusses some open research issues.

1 Introduction

The classical supervised learning problem is to construct a classifier that can correctly predict the classes of new objects given training examples of old objects [19]. A typical application is in optical character recognition, where the objects are images of hand-written characters and the classes are the 26 alphabetic letters. The classifier takes an image as input and produces a letter as output. This task is typically formalized as follows.

Let x denote an image of a hand-written character and y ∈ {A, ..., Z} denote the corresponding letter class. A training example is a pair (x, y) consisting of an image and its associated class label. We assume that the training examples are drawn independently and identically from the joint distribution P(x, y), and we will refer to a set of N such examples as the training data.

A classifier is a function h that maps from images to classes. The goal of the learning process is to find an h that correctly predicts the class y = h(x) of new images x. This is accomplished by searching some space H of possible classifiers for a classifier that gives good results on the training data without overfitting.
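As a concrete illustration (a minimal sketch, not code from the paper), the setup above can be written out in a few lines of Python. Here toy 4-dimensional feature vectors stand in for character images, and the hypothesis space H is restricted, purely for brevity, to 1-nearest-neighbor classifiers.

```python
# Illustrative sketch of the iid supervised learning setup.
# The feature vectors and labels below are placeholders, not real data.
import numpy as np

def learn(train_x: np.ndarray, train_y: np.ndarray):
    """Search H (here, trivially) for a classifier h: image -> class."""
    def h(x: np.ndarray) -> str:
        # Predict the label of the nearest training example.
        distances = np.linalg.norm(train_x - x, axis=1)
        return str(train_y[np.argmin(distances)])
    return h

# Training data: N examples (x, y) assumed drawn iid from P(x, y).
train_x = np.array([[0.0, 0.0, 1.0, 1.0],
                    [1.0, 1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0]])
train_y = np.array(["Q", "U", "A"])  # y in {A, ..., Z}

h = learn(train_x, train_y)
print(h(np.array([0.0, 0.0, 1.0, 0.0])))  # class prediction for a new image x
```

Note that such a classifier treats every example in isolation; it is exactly this independence assumption that the applications below call into question.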
Over the past 10 years, supervised learning has become a standard tool in many fields, and practitioners have learned how to take new application problems and view them as supervised learning problems. For example, in cellular telephone fraud detection, each x describes a telephone call, and y is 0 if the call is legitimate and 1 if the call originated from a stolen (or cloned) cell phone [8]. Another example involves computer intrusion detection, where each x describes a request for a computer network connection and y indicates whether that request is part of an intrusion attempt. A third example is part-of-speech tagging, in which each x describes a word and each y gives the part of speech of that word (noun, verb, adjective, etc.).

One thing that is apparent in these (and other) applications is that they do not quite fit the supervised learning framework. Rather than being drawn independently and identically (iid) from some joint distribution P(x, y), the training data actually consist of sequences of (x, y) pairs. These sequences exhibit significant sequential correlation. That is, nearby x and y values are likely to be related to each other.

For example, before a cell phone is stolen, all of the y values will be 0. Afterwards, all of the y values will be 1. Similarly, computer intrusions exhibit significant clustering, particularly denial of service attacks. Other kinds of attacks are deliberately spaced over time to avoid detection, which is a form of temporal anti-correlation. In part-of-speech tagging, sequences of parts of speech are constrained by the grammar of the language. Hence, in English, a sequence such as (verb verb adjective verb verb) would be very unlikely. Sequential patterns are present even in the original task of character recognition: character sequences usually form words rather than random sequences of letters.

Sequential patterns are important because they can be exploited to improve the prediction accuracy of our classifiers. In English, for example, if the classifier determines that one letter is Q, then the next letter is almost certain to be U. In telephone fraud detection, it is only possible to detect fraud by looking at the distribution of typical (legitimate) phone calls and then seeing that this distribution changes when the telephone is stolen. Any single phone call, viewed in isolation, appears to be perfectly legitimate.

The sequential supervised learning problem can be formulated as follows. Let {(x_i, y_i)}, i = 1, ..., N, be a set of N training examples. Each example is a pair of sequences (x_i, y_i), where x_i = ⟨x_{i,1}, x_{i,2}, ..., x_{i,T_i}⟩ and y_i = ⟨y_{i,1}, y_{i,2}, ..., y_{i,T_i}⟩. For example, in part-of-speech tagging, one (x_i, y_i) pair might consist of x_i = ⟨do, you, want, fries, with, that⟩ and y_i = ⟨verb, pronoun, verb, noun, prep, pronoun⟩. The goal is to construct a classifier h that can correctly predict a new label sequence y = h(x) given an input sequence x.

This task should be contrasted with two other, closely-related tasks. The first of these is the time-series prediction problem. Here the task is to predict the (t+1)st element of a sequence ⟨y_1, ..., y_t⟩. This can be extended in two ways. First, we can consider the case where each y_t is a vector; the time-series task then becomes to predict the vector y_{t+1} from ⟨y_1, ..., y_t⟩.
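Returning to the sequential supervised learning formulation above, the following sketch (again illustrative; the toy corpus and function names are assumptions, not the paper's) shows the data format and the simplest technique named in the abstract, the sliding window method, which reduces sequential supervised learning to the classical setting by pairing each window of inputs centered on x_{i,t} with the single label y_{i,t}.

```python
# Sketch of the sequential data format and the sliding window reduction.
from typing import Iterator, List, Tuple

# Each training example (x_i, y_i) is a pair of equal-length sequences.
train: List[Tuple[List[str], List[str]]] = [
    (["do", "you", "want", "fries", "with", "that"],
     ["verb", "pronoun", "verb", "noun", "prep", "pronoun"]),
]

def sliding_windows(x: List[str], y: List[str], half_width: int = 1,
                    pad: str = "<pad>") -> Iterator[Tuple[List[str], str]]:
    """Pair each window <x_{t-w}, ..., x_{t+w}> with the single label y_t,
    yielding ordinary (input, label) examples for a classical classifier."""
    padded = [pad] * half_width + x + [pad] * half_width
    for t in range(len(x)):
        yield padded[t : t + 2 * half_width + 1], y[t]

for window, label in sliding_windows(*train[0]):
    print(window, "->", label)
# ['<pad>', 'do', 'you'] -> verb
# ['do', 'you', 'want'] -> pronoun
# ...
```

A classifier trained on these (window, label) pairs still predicts each y_{i,t} independently; capturing the correlations among the y values themselves is what motivates the richer methods listed in the abstract, such as hidden Markov models and conditional random fields.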