
Getting data from the Twitter API

We use the popular twitter4j package to invoke the Twitter Search API, search for tweets, and save them to disk. The Twitter API requires authentication as of Version 1.1, so before we get started, we will need to get authentication tokens and save them in the twitter4j.properties file.

Getting ready

If you don't have a Twitter account, go to twitter.com/signup and create one. You will also need to sign in at dev.twitter.com to enable developer access for your account. Once you have a Twitter login, we'll be on our way to creating the Twitter OAuth credentials. Be prepared for this process to differ from what we present here; in any case, we supply example results in the data directory. Let's now create the Twitter OAuth credentials:

  1. Log in to dev.twitter.com.
  2. Find the little pull-down menu next to your icon on the top bar.
  3. Choose My Applications.
  4. Click on Create a new application.
  5. Fill in the form and click on Create a Twitter application.
  6. The next page contains the OAuth settings.
  7. Click on the Create my access token link.
  8. Copy the Consumer key and Consumer secret values.
  9. Also copy the Access token and Access token secret values.
  10. These values should go into the twitter4j.properties file in the appropriate locations. The properties are as follows:
    debug=false
    oauth.consumerKey=ehUOExampleEwQLQpPQ
    oauth.consumerSecret=aTHUGTBgExampleaW3yLvwdJYlhWY74
    oauth.accessToken=1934528880-fiMQBJCBExamplegK6otBG3XXazLv
    oauth.accessTokenSecret=y0XExampleGEHdhCQGcn46F8Vx2E
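
If you prefer to set the credentials in code rather than in twitter4j.properties, twitter4j's ConfigurationBuilder offers the same settings. The following is a minimal sketch; the token strings are placeholders for your own values:

// Sketch: configure twitter4j in code instead of via twitter4j.properties.
// Uses twitter4j.conf.ConfigurationBuilder; the token strings below are
// placeholders for the values copied from your application page.
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(false)
  .setOAuthConsumerKey("yourConsumerKey")
  .setOAuthConsumerSecret("yourConsumerSecret")
  .setOAuthAccessToken("yourAccessToken")
  .setOAuthAccessTokenSecret("yourAccessTokenSecret");
Twitter twitter = new TwitterFactory(cb.build()).getInstance();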

How to do it...

Now, we're ready to access Twitter and get some search data using the following steps:

  1. Go to the directory of this chapter and run the following command (on Windows, use ; instead of : as the classpath separator):
    java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch
    
  2. The code displays the output file (in this case, the default value); supplying a path as the first argument writes to that file instead. Then, type in your query at the prompt:
    Writing output to data/twitterSearch.csv
    Enter Twitter Query:disney
    
  3. The code then queries Twitter and reports every 100 tweets found (output truncated):
    Tweets Accumulated: 100
    Tweets Accumulated: 200
    
    Tweets Accumulated: 1500
    writing to disk 1500 tweets at data/twitterSearch.csv 
    

This program takes the search query, searches Twitter for the term, and writes the output (limited to 1500 tweets) to the .csv file that you specified on the command line, or to the default file otherwise.
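
For example, to direct the output to a file of your choosing, supply the path as the first argument (the path below is hypothetical):

    java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch data/disneyQuery.csv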

How it works...

The code uses the twitter4j library to instantiate a TwitterFactory and search Twitter with the user-entered query. The start of main() in src/com/lingpipe/cookbook/chapter1/TwitterSearch.java is:

String outFilePath = args.length > 0 ? args[0] : "data/twitterSearch.csv";
File outFile = new File(outFilePath);
System.out.println("Writing output to " + outFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter Twitter Query:");
String queryString = reader.readLine();

The preceding code sets up the output file, supplying a default if no path is provided on the command line, and then reads the query from standard input.

The following code sets up the query according to the vision of the twitter4j developers; for more information on this process, read their Javadoc, but it should be fairly straightforward. To make our result set more unique, notice that when we create the query string, we filter out retweets using the -filter:retweets option. This is only somewhat effective; see the Eliminate near duplicates with the Jaccard distance recipe later in this chapter for a more complete solution:

Twitter twitter = new TwitterFactory().getInstance();
Query query = new Query(queryString + " -filter:retweets");
query.setLang("en"); // restrict results to English
query.setCount(TWEETS_PER_PAGE); // the API caps each page at 100 tweets
query.setResultType(Query.RECENT); // prefer recent tweets over popular ones

We then page through the results, accumulating tweets until we reach the maximum:

List<String[]> csvRows = new ArrayList<String[]>();
while (csvRows.size() < MAX_TWEETS) {
  QueryResult result = twitter.search(query);
  List<Status> resultTweets = result.getTweets();
  for (Status tweetStatus : resultTweets) {
    String[] row = new String[Util.ROW_LENGTH];
    row[Util.TEXT_OFFSET] = tweetStatus.getText(); // store the tweet text
    csvRows.add(row);
  }
  System.out.println("Tweets Accumulated: " + csvRows.size());
  if ((query = result.nextQuery()) == null) { // null once no pages remain
    break;
  }
}

The preceding snippet is pretty standard code slinging, albeit without the usual hardening for external-facing code: try/catch, timeouts, and retries. One potentially confusing bit is the reassignment of query to page through the search results; result.nextQuery() returns null when no more pages are available. The current Twitter API allows a maximum of 100 results per page, so to get 1500 results, we rerun the search until there are no more pages or until we have accumulated 1500 tweets.
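
Since the recipe leaves that hardening out, here is a minimal sketch of what a retry wrapper around the search call might look like. This is our own illustration, not part of the recipe's code; the retry count and pause are arbitrary, and we assume the enclosing method declares throws TwitterException, InterruptedException:

// Hardening sketch (not in the recipe's code): retry twitter.search()
// a few times with a fixed pause before giving up. twitter4j signals
// failures, including rate limiting, with TwitterException.
QueryResult result = null;
int attempts = 0;
while (result == null) {
  try {
    result = twitter.search(query);
  } catch (TwitterException e) {
    if (++attempts >= 3) {
      throw e; // give up after three failed attempts
    }
    System.out.println("Search failed, retrying: " + e.getMessage());
    Thread.sleep(5000); // crude fixed backoff; tune as needed
  }
}

The next step involves a bit of reporting and writing: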

System.out.println("writing to disk " + csvRows.size() + " tweets at " + outFilePath);
Util.writeCsvAddHeader(csvRows, outFile);

The list of tweets is then written to a .csv file using the Util.writeCsvAddHeader method:

public static void writeCsvAddHeader(List<String[]> data, File file) throws IOException {
  CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(new FileOutputStream(file), Strings.UTF8));
  csvWriter.writeNext(ANNOTATION_HEADER_ROW);
  csvWriter.writeAll(data);
  csvWriter.close();
}

We will be using this .csv file to run the language ID test in the next section.
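
If you want to inspect the saved tweets programmatically, the file can be read back with opencsv's CSVReader. The following is a minimal sketch, assuming the UTF-8 encoding and header row written by writeCsvAddHeader:

// Sketch: read the saved tweets back with opencsv's CSVReader.
// Assumes the header row and UTF-8 encoding used by writeCsvAddHeader.
CSVReader csvReader = new CSVReader(new InputStreamReader(new FileInputStream(outFile), Strings.UTF8));
List<String[]> rows = csvReader.readAll();
csvReader.close();
// rows.get(0) is the annotation header; tweet text sits at Util.TEXT_OFFSET
for (int i = 1; i < rows.size(); ++i) {
  System.out.println(rows.get(i)[Util.TEXT_OFFSET]);
}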

See also

For more details on using the Twitter API and twitter4j, please go to their documentation pages at http://twitter4j.org/ and https://dev.twitter.com/docs.
