![]() |
| SQL on Twitter |
There have been lengthy articles on analyzing twitter data by Cloudera here, hereandhere. More from Hortonworks here and here. This article is going to be short. Thanks to features in Couchbase 4.5!
There have been lengthy articles on analyzing twitter data by Cloudera here, hereandhere. More from Hortonworks here and here. This article is going to be short. Thanks to features in Couchbase 4.5!
Step 1: Install Couchbase 4.5 Use Couchbase console create a bucket called twitterandCREATE PRIMARY INDEX on twitter using query workbench or any other tool.
Step 2: Request for your twitter archive. Once you receive it, unzip it. (You can use larger twitter archives as well).
cd <to the unzipped location>/data/tweets
Step 3: $ for i in `ls`;
do
grep -i ^Grailbird $i > $i.out ;
done
Step 4: update your ip, username & password before you run this.
$ for i in `ls *.out`;
do
/opt/couchbase/bin/cbbackupmgr json --host http://172.23.123.107:8091 --username Administrator --password password --bucket default --dataset file:///root/$i --format list --generate-key %id_str%;
done
Step 5: There is no step 5. Login to Couchbase query workbench and start playing! Simply use SQL based N1QL to query and play with the data. Online interactive tutorial makes this even easier.
Here are some observations for my own twitter deck:
In the query tab, simply click on the bucketname twitter, it'll automatically analyze the data to give you the structure of the documents with sample data. Move your cursor over the fields to see the sample data.

2. Get me a sample twitter document.
select *
Step 5: There is no step 5. Login to Couchbase query workbench and start playing! Simply use SQL based N1QL to query and play with the data. Online interactive tutorial makes this even easier.
Here are some observations for my own twitter deck:
In the query tab, simply click on the bucketname twitter, it'll automatically analyze the data to give you the structure of the documents with sample data. Move your cursor over the fields to see the sample data.

2. Get me a sample twitter document.
select *
from twitter
limit 1;

3. Give me the count of my tweets.
select count(*) as my_tweet_count

3. Give me the count of my tweets.
select count(*) as my_tweet_count
from twitter;

4. Give me top 5 hashtags and counts in my tweet

Yes. I worked for Informix and IBM :-)
5. What days did I tweet most?

4. Give me top 5 hashtags and counts in my tweet
select ht.text, count(1)
from twitter unnest entities.hashtags ht
group by ht
order by count(1) desc limit 5;

Yes. I worked for Informix and IBM :-)
5. What days did I tweet most?
select substr(created_at, 0, 10), count(1)
from twitter
group by substr(created_at, 0, 10)
order by count(1) desc
limit 5;

Couchbase 4.5 makes it very easy to ingest JSON so you can get insights.
Try out with your own twitter data or a public JSON archive. Create indices on fields, and arrays. Ask more questions, find more insights!
from twitter
group by substr(created_at, 0, 10)
order by count(1) desc
limit 5;

Couchbase 4.5 makes it very easy to ingest JSON so you can get insights.
Try out with your own twitter data or a public JSON archive. Create indices on fields, and arrays. Ask more questions, find more insights!

Comments
Post a Comment