Today, USA Today reported that the National Security Agency has been collecting the domestic phone records of many of us U.S. citizens. Unlike everyone else blogging on this today, I'm taking no position on the ethics of this activity. Instead, I'm going to tell you what I would do with those phone records from the perspective of a database geek. There's plenty of other analysis going on elsewhere, and I'm no constitutional lawyer.
I've been using Vonage for a while now, and I have access to my own phone records on the computer. It's easy enough to cut and paste my Vonage call records into Excel and from there into Access. From Access, I can easily export them into the Relational Database Management System of my choice, which for now is MS SQL 2005, though there are many others out there.
Each record looks something like this: Date, Time (you can combine these into a LongDate), From phone number, To phone number, Duration, and a unique transaction ID. I get all this for incoming and outgoing calls. It's great for anyone who bills for phone time. I'm assuming that these are the same kind of records that the NSA gets. Once the NSA gets these records, they do a data transform to make all the fields fit into their system in a uniform manner. Since the data is already fairly simple, they don't have to do much, and even a moderately skilled programmer like me could write something to transform phone records almost as fast as they could get them.
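Here is a minimal sketch of that kind of transform in Python. The comma-separated layout, the date and duration formats, and the names `CallRecord`, `digits_only`, and `parse_row` are all my own assumptions for illustration; I have no idea what the carriers' or the NSA's real formats look like.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    started: datetime   # Date and Time combined into one value
    from_number: int    # phone numbers stored as integers, not text
    to_number: int
    duration_s: int     # call length in seconds
    txn_id: int         # unique transaction ID


def digits_only(phone: str) -> int:
    """Strip punctuation so '(212) 555-0101' and '212-555-0101' match."""
    return int("".join(ch for ch in phone if ch.isdigit()))


def parse_row(row: str) -> CallRecord:
    """Turn one exported line into a uniform record (assumed layout)."""
    date, time, src, dst, duration, txn = [f.strip() for f in row.split(",")]
    started = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return CallRecord(
        started=started,
        from_number=digits_only(src),
        to_number=digits_only(dst),
        duration_s=hours * 3600 + minutes * 60 + seconds,
        txn_id=int(txn),
    )


# A made-up row using reserved 555 numbers.
record = parse_row("05/11/2006, 14:05:09, (212) 555-0101, 212-555-0199, 00:03:20, 1001")
print(record)
```

The only real work is normalizing the phone numbers and collapsing date and time into one field; everything else is a straight copy.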
If I had phone records from other people, I could combine them with my phone records into one massive table (relation, in database-speak). I could then do a reflexive query (a self-join) on them to pull a list of all the people I had contact with, through incoming or outgoing calls. I could then do another query to pull all contacts of all the people I had contact with; this would show me my friends' friends. If I had access to more data about the phone numbers, say through geocoding (a fancy way of saying latitude and longitude attached to each phone number), I could create a map and track a phone tree. If I call someone in New York, and they call someone in Paris, and the person in Paris calls someone in Amman, I could draw lines making the connections on a map.
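To make the reflexive query concrete, here is a rough sketch using Python's built-in `sqlite3` module rather than SQL Server. The `calls` table, the `links` view, the two helper functions, and the single-digit stand-in phone numbers are all hypothetical, chosen only to mirror the New York, Paris, Amman example.

```python
import sqlite3

ME, NEW_YORK, PARIS, AMMAN, OTHER = 1, 2, 3, 4, 5  # stand-in phone numbers

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (from_number INTEGER, to_number INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [(ME, NEW_YORK), (NEW_YORK, PARIS), (PARIS, AMMAN), (OTHER, ME)],
)

# Treat every call as a two-way link so incoming and outgoing both count.
conn.execute("""
    CREATE VIEW links AS
    SELECT from_number AS a, to_number AS b FROM calls
    UNION
    SELECT to_number AS a, from_number AS b FROM calls
""")


def contacts(number):
    """Everyone this number has called or been called by."""
    rows = conn.execute("SELECT b FROM links WHERE a = ? ORDER BY b", (number,))
    return [row[0] for row in rows]


def contacts_of_contacts(number):
    """Self-join: my contacts' contacts, minus me and my direct contacts."""
    rows = conn.execute("""
        SELECT DISTINCT second.b
        FROM links AS first
        JOIN links AS second ON second.a = first.b
        WHERE first.a = ?
          AND second.b <> ?
          AND second.b NOT IN (SELECT b FROM links WHERE a = ?)
        ORDER BY second.b
    """, (number, number, number))
    return [row[0] for row in rows]


print(contacts(ME))              # direct contacts
print(contacts_of_contacts(ME))  # friends' friends
```

Joining `links` to itself once more would reach Amman from me; each extra hop is just another join, which is why the whole thing only works if no link in the chain is missing from the table.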
For this level of tracking to work, the NSA has to have absolutely all the phone records they can possibly get their hands on. If they have a target talking to someone and that someone talks to someone else and the NSA's records drop at the first friend of the target, they're lost. It would be a dead end. If they get all the records, the creation of a massive data warehouse that shows connections between people is pretty much academic. The budget for doing all this has dropped dramatically over recent years: you might be able to do it with a couple of Netezza data warehouse appliances. Rumor has it the NSA was Netezza's first customer. All the hardware to do it might cost under a million dollars. The tricky part, as with all data mining projects, is getting good data, and the NSA has that problem solved.
The hardest part left for them is scalability: they're trying to drink from a firehose, but the records aren't that big, which makes it feasible. You might be able to store all the number-only data in a record as short as 40 bytes, five 8-byte fields: LongDate, Number, Number, Number, Number. (I'm not going to get into data types in depth here, but let's assume we can store phone numbers as numbers and not text to save space.) Thus one million phone records would occupy 40 megabytes. If the US makes a hundred million phone calls a day, that's about 4 GB a day of data. Large, but manageable if you have a large budget. Even if you double the key identifier size to 16 bytes (to cover hundreds of millions of calls) you're still only up to 48 bytes per record, or 4.8 GB per 100 million calls.
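The arithmetic is easy to check with Python's `struct` module. This is only a back-of-the-envelope sketch: the five 8-byte integer layout and the hundred-million-calls-a-day figure are the assumptions from the paragraph above, and the names `RECORD`, `WIDE_KEY_RECORD`, and `gigabytes` are mine.

```python
import struct

# Five 8-byte integers: LongDate, from, to, duration, transaction ID.
RECORD = struct.Struct("<5q")
# Same record with the transaction ID widened to a 16-byte key.
WIDE_KEY_RECORD = struct.Struct("<4q16s")

CALLS_PER_DAY = 100_000_000  # assumed volume, not a measured figure


def gigabytes(record_size: int, calls: int) -> float:
    """Raw storage in decimal gigabytes, ignoring indexes and overhead."""
    return record_size * calls / 1_000_000_000


# Pack one made-up call to confirm the record really is 40 bytes.
packed = RECORD.pack(1147356309, 2125550101, 2125550199, 200, 1001)

print(len(packed), "bytes per record")
print(gigabytes(RECORD.size, CALLS_PER_DAY), "GB per day")
print(gigabytes(WIDE_KEY_RECORD.size, CALLS_PER_DAY), "GB per day with a 16-byte key")
```

Indexes and whatever overhead the database adds would push the real number higher, but not by an order of magnitude.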
Only after you've identified a target would you want to create a join query that connects names and addresses with phone numbers; this would be far more efficient than attaching names to the phone record tables, and would give the NSA a chance to say they're recording numbers only. If the NSA uses a consumer data company like, say, Acxiom, to get information on phone numbers post-targeting, then they're not even subject to the Freedom of Information Act or US Privacy Law.
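A sketch of that post-targeting join, again with Python's `sqlite3`. The `subscribers` table stands in for whatever a consumer data company would supply; its columns, the `named_contacts` helper, and the placeholder names and addresses are entirely hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Numbers-only call table: no names anywhere in it.
    CREATE TABLE calls (from_number INTEGER, to_number INTEGER);
    -- Separate lookup table, consulted only after a target is chosen.
    CREATE TABLE subscribers (number INTEGER PRIMARY KEY, name TEXT, address TEXT);
""")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [(2125550101, 2125550199), (2125550150, 2125550101), (2125550199, 2125550123)],
)
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?, ?)",
    [
        (2125550150, "Subscriber A", "Placeholder address 1"),
        (2125550199, "Subscriber B", "Placeholder address 2"),
        (2125550123, "Subscriber C", "Placeholder address 3"),
    ],
)


def named_contacts(target):
    """Attach names and addresses to one target's direct contacts only."""
    rows = conn.execute("""
        SELECT s.number, s.name, s.address
        FROM (
            SELECT to_number AS number FROM calls WHERE from_number = ?
            UNION
            SELECT from_number AS number FROM calls WHERE to_number = ?
        ) AS c
        JOIN subscribers AS s ON s.number = c.number
        ORDER BY s.number
    """, (target, target))
    return rows.fetchall()


for row in named_contacts(2125550101):
    print(row)
```

The big table never carries a name; the join touches the subscriber data only for the handful of numbers connected to the target.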
The end result is that the NSA has the capability to map our social and business networks; given enough time and hardware, they could even plot them on satellite photos, creating a cool mish-mash of lines across neighborhoods. They could create files on us all, the way Friendster lists our friends and their connections. Whether the NSA's system actually works efficiently, we'll never know.