If I had more time, I would have written a shorter letter. - Blaise Pascal
[This is the article published on DZone: https://dzone.com/articles/sql-on-twitter-twitter-analysis-made-easy]
There have been lengthy articles on analyzing Twitter data: from Cloudera here, here, and here, and from Hortonworks here and here. This one, from Couchbase, is going to be short, apart from the examples and results.
Step 1: Install Couchbase 4.5. Use the Couchbase console to create a bucket called Twitter, then run CREATE PRIMARY INDEX ON Twitter using the query workbench or cbq shell.
Step 2: Request your Twitter archive. Once you receive it, unzip it (you can use larger Twitter archives as well) and change to the tweets directory: cd <to the unzipped location>/data/js/tweets
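The tweet files in that directory need to be plain JSON before they can be loaded. As a minimal sketch, assuming the classic archive layout in which each monthly `.js` file starts with a JavaScript assignment such as `Grailbird.data.tweets_2016_05 =` followed by a JSON array, the prefix can be stripped like this. The function names and the output directory are hypothetical, not part of the original steps.

```python
import json
from pathlib import Path


def js_to_json_text(js_text: str) -> str:
    """Return only the JSON payload of an archive file.

    Assumption: the file is either plain JSON already, or a single
    JavaScript assignment of the form `<name> = <JSON array>`.
    """
    text = js_text.lstrip()
    if text.startswith(("[", "{")):
        return text.strip()
    # Everything after the first '=' is the JSON array.
    _, _, payload = text.partition("=")
    return payload.strip().rstrip(";")


def convert_directory(src_dir: str, dst_dir: str) -> int:
    """Write one .json file per .js file; return the number of tweets seen."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    total = 0
    for js_file in sorted(Path(src_dir).glob("*.js")):
        raw = js_file.read_text(encoding="utf-8")
        tweets = json.loads(js_to_json_text(raw))
        target = out / (js_file.stem + ".json")
        target.write_text(json.dumps(tweets), encoding="utf-8")
        total += len(tweets)
    return total
```

Calling `convert_directory(".", "json_out")` from the tweets directory would leave one JSON array per month in `json_out`, ready for whichever loader you use.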
Step 3:
Step 4: Update your IP, username, and password before you run this:
Step 5: There is no step 5!
Log into Couchbase's query workbench or cbq shell and start playing! Simply use SQL-based N1QL to query and play with the data. This online interactive tutorial will get you started with N1QL.
Here are some example queries on my Twitter archive.
1. Give me the count of my tweets.
Results:
2. Get me a sample Twitter document.
Results: A Twitter document is rich. It has nested objects, arrays, and arrays of objects.
3. What days did I tweet most?
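The logic behind this question is a group-by on the date part of each tweet's timestamp. Here is a minimal Python sketch of that grouping, assuming `created_at` starts with a `YYYY-MM-DD` date, as in the archive files (the live Twitter API uses a different timestamp format). The sample tweets are made up for illustration. In N1QL the same idea would presumably be a `GROUP BY` on a substring of `created_at`.

```python
from collections import Counter


def top_tweet_days(tweets, limit=5):
    """Count tweets per calendar day and return the busiest days first."""
    # Assumption: the first 10 characters of created_at are YYYY-MM-DD.
    days = Counter(t["created_at"][:10] for t in tweets if t.get("created_at"))
    return days.most_common(limit)


# Hypothetical sample documents, not real archive data.
sample = [
    {"created_at": "2016-05-17 18:30:00 +0000"},
    {"created_at": "2016-05-17 09:02:11 +0000"},
    {"created_at": "2016-05-18 12:00:00 +0000"},
]

print(top_tweet_days(sample))  # [('2016-05-17', 2), ('2016-05-18', 1)]
```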
4. Give me the top 5 hashtags and counts in my tweets:
(Yes, I worked for Informix and IBM!)
5. How many of my tweets are about Couchbase, N1QL, NoSQL, or SQL?
Because hashtags are stored in an array, you need to UNNEST it so you can group by the hashtag.
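To make the effect of UNNEST concrete, here is a small Python sketch of the same idea: each tweet is joined with every element of its own hashtag array, producing one row per hashtag, and those rows are then grouped. It assumes hashtags live under `entities.hashtags` with a `text` field, as in Twitter's JSON. The sample documents and function names are hypothetical.

```python
from collections import Counter


def unnest_hashtags(tweets):
    """Yield one (tweet, hashtag) pair per array element, like an inner UNNEST."""
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            yield tweet, tag


def hashtag_counts(tweets, only=None):
    """Group the unnested rows by lower-cased hashtag text."""
    counts = Counter()
    for _tweet, tag in unnest_hashtags(tweets):
        name = tag["text"].lower()
        if only is None or name in only:
            counts[name] += 1
    return counts


# Hypothetical sample documents, not real archive data.
sample = [
    {"entities": {"hashtags": [{"text": "N1QL"}, {"text": "NoSQL"}]}},
    {"entities": {"hashtags": [{"text": "n1ql"}]}},
    {"entities": {"hashtags": []}},  # contributes no rows, as with an inner UNNEST
]

wanted = {"couchbase", "n1ql", "nosql", "sql"}
print(hashtag_counts(sample, only=wanted).most_common())
# [('n1ql', 2), ('nosql', 1)]
```

Lower-casing before grouping keeps `#N1QL` and `#n1ql` in the same bucket.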
6. Let’s see who I’ve mentioned in my tweets, and how many times.
I've only given partial results below. @N1QL and @Couchbase were top mentions. Note Twitter itself doesn't store the @ character in its data.
7. Let’s get all the tweets in which I’ve mentioned @sangudi, the creator of N1QL.
While this works fine, it scans the whole bucket using a primary index scan.
Let’s create an index on this array element to make it go faster.
Now, see the plan for the same query. This uses the index and pushes down the predicate to the index, making the query faster.
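The difference between the two plans can be pictured with a conceptual Python model. This is a sketch of the idea only, not how Couchbase implements its indexes. The scan version tests every document's `user_mentions` array, much like an `ANY ... SATISFIES ... END` predicate evaluated over a primary scan, while the index version keeps one entry per distinct `screen_name` per document and answers the same question with a single lookup. Field names follow Twitter's JSON, and the sample documents are hypothetical.

```python
from collections import defaultdict


def mentions_of(tweet):
    """Return the user_mentions array of a tweet, or an empty list."""
    return tweet.get("entities", {}).get("user_mentions", [])


def scan_for_mention(tweets, screen_name):
    """Primary-scan style: inspect every document and test its array."""
    return [
        t["id_str"]
        for t in tweets
        if any(m["screen_name"] == screen_name for m in mentions_of(t))
    ]


def build_mention_index(tweets):
    """Array-index style: map each distinct screen_name to document ids."""
    index = defaultdict(list)
    for t in tweets:
        for name in {m["screen_name"] for m in mentions_of(t)}:
            index[name].append(t["id_str"])
    return index


# Hypothetical sample documents, not real archive data.
sample = [
    {"id_str": "1", "entities": {"user_mentions": [{"screen_name": "sangudi"}]}},
    {"id_str": "2", "entities": {"user_mentions": [{"screen_name": "N1QL"}]}},
    {"id_str": "3", "entities": {"user_mentions": [
        {"screen_name": "sangudi"}, {"screen_name": "sangudi"}]}},
]

index = build_mention_index(sample)
print(scan_for_mention(sample, "sangudi"))  # ['1', '3']
print(index.get("sangudi", []))             # ['1', '3']
```

The scan does work proportional to the whole bucket on every query. The index pays that cost once, when it is built, and each later lookup touches only the matching entries.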
Couchbase 4.5 makes it very easy to ingest JSON so you can get insight into your data. For more advanced questions and usage, use array indexes.
Try it out with your own Twitter data or a public JSON archive. Create indices on fields and arrays. Ask more questions, find more insights!