Wednesday, May 31, 2006

Summer of SQL and Data Mining

Summer may be the best semester at George Washington University because all the undergrads are gone. The Marvin Center is almost completely vacant, and the food court is closed. I am taking two electives this summer, which will give me more electives than I need. Why am I doing this? I'm actually learning something useful. People keep asking me why I don't go for the CIO Certification. Instead of doing databases, I'd be taking MGT 272 Information Resource Management and MGT 274 Survey of Advanced IT Technologies. I have yet to read a single government job description that even mentions CIO Certification. (The General Services Administration "invented" the CIO Certification, but the Office of Personnel Management sets job standards.)

I’ve taken academic classes in Information Systems at GWU for a while, and I have also taken classes at Learning Tree in SQL, Exchange, Solaris, Security and the programming language, C. The more advanced my classes get at GWU, the more they resemble a Learning Tree class, with one exception: at GWU, they teach theory and practice. At Learning Tree, it’s just practice. You can learn how Microsoft SQL Server works without learning a thing about normalization. Learning Tree is training database administrators, not database designers. My professor for both classes, John Artz, argues that vendor certifications will include more theory in the future; otherwise the vendor certs will become less relevant over time.

So what are my classes? Data Warehouse Design; and Database and Expert Systems. Database and Expert Systems includes no expert systems (I covered them in Decision Systems), and is mostly T-SQL for Microsoft SQL Server 2005. Data Warehouse Design is mostly theory with some implementation on SQL 2005 Analysis Services. Relational database theory hasn't changed in about 30 years now, so you'd think I would have learned more about relational database management systems earlier, but hey, it's easy to get distracted by security and email and web servers and Linux. Not that I haven't used plenty of relational databases as back-ends to applications; I just didn't think about the relational algebra that drives my queries.
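
To give a flavor of the coursework, here's a minimal sketch of the kind of query we write, with the relational algebra it comes from spelled out in a comment. The table and column names are my own invention, not from the class:

-- Hypothetical schema: Students(StudentID, Name) and Enrollments(StudentID, CourseID).
-- Relational algebra: project Name, CourseID over (Students natural-join Enrollments).
SELECT s.Name, e.CourseID
FROM Students AS s
INNER JOIN Enrollments AS e
    ON s.StudentID = e.StudentID;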

While I can’t post class notes here, I can tell you which books the professor has chosen. When I’m searching for technology books, it’s hard to tell which ones are good.
Database and Expert Systems:
Dusan Petkovic’s SQL Server 2005: A Beginner’s Guide
Ken England’s Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook. (There isn’t one for SQL 2005 yet.)
Data Warehouse Design:
Teo Lachev's Applied Microsoft Analysis Services 2005 and Microsoft Business Intelligence Platform.
(And the John Artz manuscript.)
Both classes:
Connolly and Begg's Database Systems. This is the book I wish my database class from last semester had used. It covers a lot more than that class's textbook, Hoffer, Prescott, and McFadden's Modern Database Management.

Wednesday, May 24, 2006

Exchange 12 Offline Address List Issue

I'm continuing to use the Exchange 12 preview beta, with mostly good results. Everybody who sees the new Outlook Web Access loves it and wants it. Thus users will have a reason to upgrade, and it won't be just us messaging engineers who care. Since Exchange 12 will require 64-bit hardware, having a great user interface may help justify the cost to many organizations.

I had one no-connect issue, which I tracked down to a DNS problem between my Windows Server 2003 Domain Controller (ADI-DNS) and my Red Hat Fedora Core 4 box running BIND 9.3.1. While I don't do secure dynamic updates to BIND (yet), I do secure zone transfers from AD to BIND. (That's how I make sure BIND knows AD is authoritative for my AD zone.) It's funny seeing all the Active Directory DNS records in BIND: _gc, _kerberos, _ldap, etc. (Which brings up a typical AD interview question: what version of BIND do you need to run AD? BIND 8.1.2 is the usual answer, since that's when SRV record support arrived. BIND 9 is better because it won't make you set the check-names option just to allow the underscore records.)

I recently set up a lab account here in AD that forwards mail to a couple of folks running Outlook 2003 on Windows 2000 SP4. This account is not an administrator on the local box, and it sends and receives mail fine, but on Send/Receive I get a
Task ‘Microsoft Exchange Server’ reported error (0x8004010F): ‘The operation failed. An object could not be found.’


Microsoft has a solution for Exchange 2003 (http://support.microsoft.com/?kbid=905813), but it isn't possible in the current Exchange 12 implementation. The error doesn't occur when I log on as a user with local admin privileges (yet). The server event logs don't show any errors. (I can grep my event logs because they go to my Linux box via syslogd.)

I'm left with using the Exchange Management Shell's new-OfflineAddressBook command; first you must create an address list with new-AddressList. Fortunately, the shell features tab completion. All I have left to do is add addresses to my offline address list, but at least I have an offline address book, so my error should disappear, right? Wrong. I still get the error above. I'd fire up Ethereal to do some more diagnosis, but I don't get the error on accounts with local admin privileges. Aargh.

Sometimes I miss the GUI.

Saturday, May 20, 2006

Google Analytics: Malkovich of the Internet?

If you’ve ever seen Being John Malkovich, you should remember what happens when Malkovich himself enters the Malkovich portal: everyone is John Malkovich and all they say is “Malkovich Malkovich.” On the Internet these days, I feel like it’s “Google Google.”

After months of waiting, I got my free Google Analytics invitation. I had set up Urchin reporting on a number of sites years ago, and I was disappointed that I couldn't buy Urchin anymore because Google had swallowed them whole. Google Analytics is even better than Urchin was, and it includes Urchin's campaign and e-commerce tracking modules. I also liked Urchin a lot more than Webtrends, which seems to have grown into bloatware since my first (positive) encounter with it in 1999. The trouble with Webtrends and Omniture is that they are focused on big business, because that's where the big money is. Google Analytics is simple enough to use on this blog, but it can scale as large as needed. GA gives me the same tools that the big guys have. Thus, Google will make its money scaling out rather than scaling up.

Unlike Urchin, Google Analytics doesn't run on my server. The JavaScript reports back to Google, and I'm sure they can also see how many people are visiting my sites. On the up side, I didn't have to remember to reconfigure Apache logging.
This is what I used to have to enter as root into httpd.conf and hope that I didn't make any typos:

LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"" special 


Also, the Googlebot doesn't show up in Analytics reports. Browsers that reject cookies or block the JavaScript won't be trackable. And Google will know a lot more about all of our browsing habits once more sites start implementing Google Analytics.

As far as the markets go, Google Analytics is a shot at the other web analytics packages, and it may surpass them because Google can integrate your Google AdWords with your Google Analytics. (Malkovich Malkovich.) This is something the others can't do, because Omniture and Overture are separate companies. Of course, Google Analytics, AdWords, and everything else Google offers integrate with your Gmail account. Before long, Google may be able to offer IP address resolution into Google Earth. (Malkovich Malkovich Malkovich.)

I also recently created Google Sitemaps for a couple of sites. Google uses sitemaps to find new and changed pages on sites faster. The sitemap generator is a Python script that looks at your log files and your web file system and creates an XML file that the Googlebot can download. Since I created the sitemaps, both sites have gotten a lot of hits from the Googlebot: 1,671 for this site, and 5,000 or so for the other site, which gets a lot more traffic than mine. By getting me to do some of the indexing work, Google makes its indexing more efficient.

Before long, Google will know more about traffic on the web than the NSA knows about traffic on the Public Switched Telephone Network.

Wednesday, May 17, 2006

More on How the NSA Tracking System Might Work

The Washington Post's David Ignatius speculated about how the National Security Agency's system might work. In doing so, he provided an excellent example of data mining. What the NSA is trying to do is simple and complex at the same time: the data structure is simple, but the sheer volume makes it complex.

The problem may seem hopelessly complex, but if you use common sense, you can see how the NSA has tried to solve it. Suppose you lost your own cellphone and bought a new one, and people really needed to find out that new number. If they could search all calling records, they would soon find a number with the same pattern of traffic as your old one -- calls to your spouse, your kids, your office, your golf buddies. They wouldn't have to listen to the calls themselves to know it was your phone. Simple pattern analysis would be adequate -- so long as they had access to all the records.


The trouble is, simple pattern analysis isn't that simple when you start trying to code it. You would have a giant data cube with millions of slices to compare against each other. On the other hand, if you have one target number and a query that pulls all its callees, you can craft another query that searches for those same numbers. You could then score new numbers against the old queries: each candidate number would get a rating between 0 and 1, with 1 meaning its calling pattern looks just like the original number's.
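
Here's a rough sketch of that scoring in T-SQL, assuming a hypothetical Calls(Caller, Callee) table and a made-up target number; the score is just the fraction of the target's callees that each candidate number also calls:

-- Hypothetical Calls(Caller, Callee) table; @Target is the old, known number.
DECLARE @Target bigint;
SET @Target = 12025551234;  -- made-up number

SELECT c.Caller,
       COUNT(DISTINCT c.Callee) * 1.0 /
       (SELECT COUNT(DISTINCT Callee) FROM Calls WHERE Caller = @Target) AS Score
FROM Calls AS c
WHERE c.Caller <> @Target
  AND c.Callee IN (SELECT Callee FROM Calls WHERE Caller = @Target)
GROUP BY c.Caller
ORDER BY Score DESC;  -- a score of 1 means the candidate calls everyone the target called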

If you have voice matching that can confirm a 1, you could design an artificial neural network that learns as it targets new numbers. Voice matching would require eavesdropping -- but if you got a score of 1, it would be worth the trouble. This way, your neural network could learn which scores between 0 and 1 should trigger voice matching.

Monday, May 15, 2006

Reflexive vs. Recursive Queries and Self-Joins

People (mostly database people) keep asking what I meant by reflexive query in my previous posts. Some thought I was confusing a self-join with a recursive query, the kind that lets my DNS server answer a query for a domain for which it is not authoritative. What I mean is a query that returns a caller and a callee, and then another query that returns the callee's callees. In a self-join, I can match employees to their managers, since both are in the same table. While the NSA phone-tracking system might use some self-joins, what makes the network part work is building queries from queries and jumping from callers to callees. Of course, I'd like all the results timestamped, too. Recursive queries are a part of this, and SQL 2005 can do them. If I had a few gigabytes' worth of phone data, I'd love to let SQL 2005 loose and see what connections I could find. When I say reflexive, though, I mean that I'm going to use my queries to start other queries, not just as sub-queries. The recursive part could lead to infinite loops. I wonder if the NSA hit any infinite loops when testing their system. Fortunately, you can specify limits on recursion in SQL 2005.
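
For example, here's a minimal sketch of a SQL 2005 recursive common table expression that walks outward from one target number, with the recursion capped; the Calls(Caller, Callee) table and the number are hypothetical:

-- Walk from a target to its callees, their callees, and so on, up to three hops out.
WITH Contacts (Phone, Hops) AS
(
    SELECT Callee, 1
    FROM Calls
    WHERE Caller = 12025551234          -- made-up target number

    UNION ALL

    SELECT c.Callee, ct.Hops + 1
    FROM Calls AS c
    INNER JOIN Contacts AS ct ON c.Caller = ct.Phone
    WHERE ct.Hops < 3                   -- stop at three hops
)
SELECT Phone, MIN(Hops) AS Hops
FROM Contacts
GROUP BY Phone
OPTION (MAXRECURSION 10);               -- hard cap, so a calling loop can't run forever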

The scary part is what would happen if I had the resources to make it run really fast and tuned it to be as efficient as possible. I could select a target and find all of its connections in a few seconds or a few minutes, and the gap between seconds and minutes matters. In a system where it takes a couple of seconds to generate results, nobody would notice if I ran a few "unofficial" queries on my friends. If it took a few minutes and precious computer time, then people would notice. Utilitarian ethics.

This raises the question: how much computer time (and how many tax dollars) does our government use tracking down everyone who calls reporters? Then again, leaking classified information is unethical, but people do it anyway. Contextual ethics. Of course, there's also the issue of selective enforcement, but that's a legal issue, not an ethical one.

Thus we're left with utilitarian ethics vs. contextual ethics. Who knew that efficient queries and more processing power could give one type of ethics an advantage over the other?

Saturday, May 13, 2006

NSA has Greatest Sociological Dataset of the 21st Century

Between the phone records, emails, and instant messaging, the NSA can now map the social and business fabric of America and see how it is changing over time. By collecting our phone records, the NSA has created a dataset that can map our social connections. As I mentioned in my previous post, the NSA can now map the calls we make and the calls our callees make, creating a giant tree of connections. A wrong number might get you placed on a no-fly list, but I'm not going into morality. (The current administration is the de facto law, so legality becomes a moot argument.) I'm just going to explore how they might use this data and what they could find in it.

The phone call data structure (date/time, origin number, destination number, duration, ID) could easily be adapted to scan emails and instant messaging as well: date/time, origin address, destination address, length, ID. Using the same reflexive queries, the NSA can track our social and business connections through email. They could also tell how those annoying chain letters get spread (Good Times virus warning: forward this to all your friends). However, each record would be larger, probably more than double the ~48 bytes required for a phone record. The larger size would slow down queries and require more storage. Also, geocoding the emails would be more difficult, although it could be done based on where the sender maintains a physical address, or on the originating Internet Protocol address.
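
A sketch of what the parallel table might look like, with table and column names of my own invention; the variable-length addresses are what push the record size up:

-- Hypothetical table mirroring the phone-record structure for email/IM traffic.
CREATE TABLE MessageRecords
(
    MsgDate     datetime     NOT NULL,   -- date/time
    OriginAddr  varchar(320) NOT NULL,   -- origin address
    DestAddr    varchar(320) NOT NULL,   -- destination address
    MsgLength   int          NOT NULL,   -- message length
    MsgID       bigint       NOT NULL PRIMARY KEY
);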

With all this data, the NSA is in a position to know the difference between social networks that use telephones and those that use email. If they've had this program in place long enough, they could even recognize trends in social communication. With the right set of queries, you could spot the difference between neighborhoods where people know each other and neighborhoods where nobody knows each other, and plot them in different colors on a map. What percentage of Americans calls their mom on Mother's Day? In which neighborhoods do people call their mothers most often?
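
Just to make that concrete, here's a hypothetical version of the Mother's Day query; the CallRecords and Subscribers tables, their columns, and the geocoded Neighborhood and MotherNumber fields are all invented for illustration:

-- Percentage of subscribers in each neighborhood who called mom on Mother's Day 2006.
SELECT s.Neighborhood,
       COUNT(DISTINCT CASE WHEN c.ToNumber = s.MotherNumber THEN s.PhoneNumber END) * 100.0
           / COUNT(DISTINCT s.PhoneNumber) AS PctCalledMom
FROM Subscribers AS s
LEFT JOIN CallRecords AS c
    ON  c.FromNumber = s.PhoneNumber
    AND c.CallDate >= '20060514' AND c.CallDate < '20060515'
GROUP BY s.Neighborhood
ORDER BY PctCalledMom DESC;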

It would also be somewhat easy to plot our connections in a GIS tool such as Google Earth. You would just export your target's connection tree into a KML (or zipped KMZ) file and import it into Google Earth or ESRI's ArcGIS. The graphs derived from the data would be anonymous. (Of course, research ethics might prevent us from using the data.)

Sadly, none of the derived information will ever see the light of day. Academically, it would produce fascinating results that could teach us about how we communicate with each other and how our communications are changing. Then again, no local, state, or federal government information system about citizens has ever been immune from misuse. IRS workers read their neighbors' tax returns. Cops run license plates "for dates" and use NCIC (National Crime Information Center) background checks to help those running for election denounce their opponents. If a dataset about us exists, it will be abused by those with access unless appropriate security controls prevent it.

Thursday, May 11, 2006

How the NSA Might use our Phone Records

Today, USA Today reported that the National Security Agency has been collecting the domestic phone records of many of us U.S. citizens. Unlike everyone else blogging about this today, I'm taking no position on the ethics of this activity. Instead, I'm going to tell you what I would do with those phone records from the perspective of a database geek. There's plenty of other analysis going on elsewhere, and I'm no constitutional lawyer.

I've been using Vonage for a while now, and I have access to my own phone records on the computer. It's easy enough to cut and paste my Vonage call records into Excel and from there into Access. From Access, I can easily export them to the Relational Database Management System of my choice, which for now is MS SQL 2005. However, there are many more out there.

Each record looks something like this: Date, Time (you can combine these into a LongDate), From phone number, To phone number, Duration, and a unique transaction ID. I get all this for incoming and outgoing calls. It's great for anyone who does billing for phone time. I'm assuming these are the same kind of records the NSA gets. Once the NSA gets these records, they do a data transform to make all the fields fit into their system in a uniform manner. Since the data is already fairly simple, they don't have to do much, and even a moderately skilled programmer like me could write something to transform phone records almost as fast as they arrive.
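
As a rough sketch of how I'd stage my own records in SQL 2005, assuming the date and time are already combined, and with table, column, and file names that are purely my own invention:

-- Hypothetical staging table matching the call-record layout described above.
CREATE TABLE CallRecords
(
    CallDate   datetime NOT NULL,        -- date and time combined
    FromNumber bigint   NOT NULL,
    ToNumber   bigint   NOT NULL,
    Duration   int      NOT NULL,        -- seconds
    CallID     bigint   NOT NULL PRIMARY KEY
);

-- Load a CSV export; the path and file layout are assumptions about my own setup.
BULK INSERT CallRecords
FROM 'C:\data\vonage_calls.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);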

If I had phone records from other people, I could combine them with my own into one massive table (relation, in database-speak). I could then do a reflexive query on them to pull a list of all the people I had contact with, through incoming or outgoing calls. I could then do another query to pull all contacts of all the people who had called me; this would show me my friends' friends. If I had access to more data about the phone numbers, say through geocoding (a fancy way of saying latitude and longitude attached to each phone number), I could create a map and track a phone tree. If I call someone in New York, and they call someone in Paris, and the person in Paris calls someone in Amman, I could draw the connecting lines on a map.
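
Here's roughly what those two queries look like against the hypothetical CallRecords table sketched above; the second query just feeds the first one's results back in:

-- My own number (made up for the example).
DECLARE @Me bigint;
SET @Me = 12025551234;

-- First hop: everyone I had contact with, incoming or outgoing.
SELECT DISTINCT CASE WHEN FromNumber = @Me THEN ToNumber ELSE FromNumber END AS Contact
FROM CallRecords
WHERE FromNumber = @Me OR ToNumber = @Me;

-- Second hop: my contacts' contacts (my friends' friends).
SELECT DISTINCT CASE WHEN c.FromNumber = f.Contact THEN c.ToNumber ELSE c.FromNumber END AS FriendOfFriend
FROM CallRecords AS c
INNER JOIN (SELECT DISTINCT CASE WHEN FromNumber = @Me THEN ToNumber ELSE FromNumber END AS Contact
            FROM CallRecords
            WHERE FromNumber = @Me OR ToNumber = @Me) AS f
    ON c.FromNumber = f.Contact OR c.ToNumber = f.Contact;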

For this level of tracking to work, the NSA has to have absolutely all the phone records they can possibly get their hands on. If they have a target talking to someone, and that someone talks to someone else, but the NSA's records stop at the target's first contact, they're lost. It would be a dead end. If they get all the records, building a massive data warehouse that shows connections between people is pretty much academic. The budget for doing all this has dropped dramatically in recent years: you might be able to do it with a couple of Netezza data warehouse appliances. Rumor has it the NSA was Netezza's first customer. All the hardware might cost under a million dollars. The tricky part, as with all data mining projects, is getting good data, and the NSA has that problem solved.

The hardest part left for them is scalability: they're trying to drink from a firehose, but the records aren't that big, which makes it feasible. You might be able to store all the number-only data in a record as short as 40 bytes: LongDate, Number, Number, Number, Number. (I'm not going to get into data types in depth here, but let's assume we can store phone numbers as numbers and not text to save space.) Thus one million phone records would occupy 40 megabytes. If the US makes a hundred million phone calls a day, that's about 4 GB a day of data. Large, but manageable if you have a large budget. Even if you double the key identifier size to 16 bytes (to cover hundreds of millions of calls), you're still only up to 4.8 GB per 100 million calls.

Only after you've identified a target would you want to create a join query that connects names and addresses with phone numbers; this would be far more efficient than attaching names to the phone record tables, and it would give the NSA a chance to say they're recording numbers only. If the NSA uses a consumer data company like, say, Acxiom, to get information on phone numbers post-targeting, then they're not even subject to the Freedom of Information Act or US privacy law.
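
The post-targeting join might look something like this sketch, assuming the targeted numbers have been dumped into a temp table and that a hypothetical Subscribers table comes from a consumer data provider:

-- #Targets(PhoneNumber) holds the numbers a reflexive query already surfaced.
-- Subscribers(PhoneNumber, Name, Address) is the purchased lookup data.
SELECT t.PhoneNumber, s.Name, s.Address
FROM #Targets AS t
LEFT JOIN Subscribers AS s
    ON s.PhoneNumber = t.PhoneNumber;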

The end result is that the NSA has the capability to map our social and business networks; given enough time and hardware, they could even plot them on satellite photos, creating a cool mish-mash of lines across neighborhoods. They could create files on us all, the way Friendster lists our friends and their connections. Whether the NSA's system actually works efficiently, we'll never know.

Wednesday, May 10, 2006

The Job Interview

I used to think it happened only to me, and then I read Dilbert last Saturday. My interviewers are asking more and more real-world technical questions about real problems they have. More often than not, I can solve them in a couple of sentences. And later I wonder if they really had a job to offer or if they were just looking for cheap solutions.

Even months later, I can see users from my interviewers' companies reading my blog entries about how IIS, SQL, AD, and SharePoint work together. Thank you, Google Analytics. And no, I would never name your company here, because it just wouldn't be professional.

Friday, May 5, 2006

Finals: Neural Nets in a Nutshell?

I have three finals: one Monday evening and two Wednesday evening. The bad part about having two finals on one evening is having to prepare for two exams at once. The good part is the crossover between the two classes: both cover neural networks. Although the topic is definitely on my database exam, my decision support systems class covered it better. The Marakas text had problems, including a typo and a cut-off paragraph, but it explained neural nets better than the database textbook or the db professor did, and I've gone over his PowerPoint several times looking for a good definition. Marakas:
neural networks attempt to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn.


One type of human intelligence is the ability to recognize patterns, and then learn to recognize patterns better. Thus, neural networks are one form of artificial intelligence.

If you search the web, though, you'll find as many definitions of neural networks as you do result pages.

Monday, May 1, 2006

More Ruby on Rails Security

I do Windows, Unix (Solaris), and Linux (mostly Red Hat). Everyone who's into "open-source" keeps telling me how much more secure it is. I'm a CISSP and I've been installing open-source OSes since I had to know the chipset, IRQ and DMA of the NICs in my box to get networking to work. (The DEC Tulip was my favorite.) When I started working with Solaris 7 and Red Hat 4.x, telnet was enabled by default. I still wonder if telnet was enabled on a Trusted Solaris 7 default install. People who tell me any form of Unix is inherently more secure than any Windows don't seem to be familiar with the Morris worm, the Leshka Sendmail exploit, or BIND vulnerabilities. In fact, just mentioning BIND and sendmail in the same sentence is likely to send your security coordinator into the bunker for the rest of the day. Mind you, I've also seen IIS flaws. Can't we all just get along and implement security best practices on whatever platforms we're using?

Ruby on Rails shows a lot of promise for helping people get applications up and running quickly. The tutorials are pretty helpful, but there are a couple of caveats:
In the configuration wizard, you can also just accept all of the defaults, except that in the security panel you must uncheck the "Modify Security Settings" checkbox (Figure 4). This is because starting with version 4.1.7, MySQL uses a new authentication algorithm that is not compatible with older client software, including the current version of Rails. By unchecking this box, you can access MySQL without a password.


This is not the path to secure computing. MySQL should NOT ship with a blank root password. Tutorials should not encourage the use of blank root passwords.
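
For what it's worth, you don't have to choose between old client libraries and a blank root password. MySQL 4.1 can store a password hashed the pre-4.1 way so older clients (like the one the Rails tutorial relies on) can still connect; something like this sketch, where the password is obviously just a placeholder:

-- Run as root in the mysql client; OLD_PASSWORD() keeps pre-4.1 clients working.
SET PASSWORD FOR 'root'@'localhost' = OLD_PASSWORD('SomethingStrong');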

And they have you set up your server in a way that leaves database.yml publicly available. I see Drupal attacks (xmlrpc.php) every day; it's only a matter of time before I start seeing RoR attacks.

It's the developers' job to make it work. It's your job to make it work securely. Today's hackers don't even know C and have never heard of Kernighan and Ritchie; all they need is a script and an Internet connection to take advantage of your vulnerabilities.