
Getting data from the Twitter API

We use the popular twitter4j package to invoke the Twitter Search API, search for tweets, and save them to disk. The Twitter API requires authentication as of Version 1.1, so we will need to get authentication tokens and save them in the twitter4j.properties file before we get started.

Getting ready

If you don't have a Twitter account, go to twitter.com/signup and create one. You will also need to go to dev.twitter.com and sign in to enable your account for developer access. Once you have a Twitter login, we'll be on our way to creating the Twitter OAuth credentials. Be prepared for this process to differ from what we present here; in any case, we will supply example results in the data directory. Let's now create the Twitter OAuth credentials:

  1. Log in to dev.twitter.com.
  2. Find the little pull-down menu next to your icon on the top bar.
  3. Choose My Applications.
  4. Click on Create a new application.
  5. Fill in the form and click on Create a Twitter application.
  6. The next page contains the OAuth settings.
  7. Click on the Create my access token link.
  8. You will need to copy the Consumer key and Consumer secret values.
  9. You will also need to copy the Access token and Access token secret values.
  10. These values should go into the twitter4j.properties file in the appropriate locations; an in-code alternative is sketched just after this list. The properties are as follows:
    debug=false
    oauth.consumerKey=ehUOExampleEwQLQpPQ
    oauth.consumerSecret=aTHUGTBgExampleaW3yLvwdJYlhWY74
    oauth.accessToken=1934528880-fiMQBJCBExamplegK6otBG3XXazLv
    oauth.accessTokenSecret=y0XExampleGEHdhCQGcn46F8Vx2E
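
Note that twitter4j reads twitter4j.properties automatically from the working directory or the classpath. If you would rather configure the credentials in code, twitter4j's ConfigurationBuilder provides an equivalent route; the following is a minimal sketch, with placeholder strings standing in for your own tokens:

import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

// Equivalent to the twitter4j.properties file, but set in code.
// Replace the placeholder strings with your own credentials.
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(false)
  .setOAuthConsumerKey("yourConsumerKey")
  .setOAuthConsumerSecret("yourConsumerSecret")
  .setOAuthAccessToken("yourAccessToken")
  .setOAuthAccessTokenSecret("yourAccessTokenSecret");
Twitter twitter = new TwitterFactory(cb.build()).getInstance();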

How to do it...

Now, we're ready to access Twitter and get some search data using the following steps:

  1. Go to the directory of this chapter and run the following command:
    java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch
    
  2. The code displays the output file path (in this case, the default). Supplying a path as a command-line argument will write to that file instead. Then, type in your query at the prompt:
    Writing output to data/twitterSearch.csv
    Enter Twitter Query:disney
    
  3. The code then queries Twitter and reports every 100 tweets found (output truncated):
    Tweets Accumulated: 100
    Tweets Accumulated: 200
    
    Tweets Accumulated: 1500
    writing to disk 1500 tweets at data/twitterSearch.csv 
    

This program takes the search query, searches Twitter for the term, and writes the output (limited to 1500 tweets) to the .csv file that you specified on the command line, or to the default file.
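
For example, to write the tweets somewhere other than the default location, supply the path as the first argument (the file name here is our own illustration):

java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch data/disneySearch.csv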

How it works...

The code uses the twitter4j library to instantiate TwitterFactory and searches Twitter using the user-entered query. The start of main() in src/com/lingpipe/cookbook/chapter1/TwitterSearch.java is as follows:

String outFilePath = args.length > 0 ? args[0] : "data/twitterSearch.csv";
File outFile = new File(outFilePath);
System.out.println("Writing output to " + outFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter Twitter Query:");
String queryString = reader.readLine();

The preceding code gets the output file, supplying a default if none is provided, and then reads the query from standard input at the prompt.

The following code sets up the query according to the vision of the twitter4j developers. For more information on this process, read their Javadoc; it should be fairly straightforward, though. To make our result set more unique, note that when we create the query string, we filter out retweets using the -filter:retweets option. This is only somewhat effective; see the Eliminate near duplicates with the Jaccard distance recipe later in this chapter for a more complete solution:

Twitter twitter = new TwitterFactory().getInstance();
Query query = new Query(queryString + " -filter:retweets");
query.setLang("en");//English
query.setCount(TWEETS_PER_PAGE);
query.setResultType(Query.RECENT);

The search loop that follows pages through the results, accumulating tweets until MAX_TWEETS is reached or no more pages are available:

List<String[]> csvRows = new ArrayList<String[]>();
while (csvRows.size() < MAX_TWEETS) {
  QueryResult result = twitter.search(query);
  List<Status> resultTweets = result.getTweets();
  for (Status tweetStatus : resultTweets) {
    // Each tweet becomes one row; only the text column is filled in here.
    String[] row = new String[Util.ROW_LENGTH];
    row[Util.TEXT_OFFSET] = tweetStatus.getText();
    csvRows.add(row);
  }
  System.out.println("Tweets Accumulated: " + csvRows.size());
  // nextQuery() returns null when there are no more pages of results.
  if ((query = result.nextQuery()) == null) {
    break;
  }
}

The preceding snippet is pretty standard code slinging, albeit without the usual hardening for externally facing code: try/catch, timeouts, and retries. One potentially confusing bit is the reassignment of query to handle paging through the search results; result.nextQuery() returns null when no more pages are available. The current Twitter API allows a maximum of 100 results per page, so in order to get 1500 results, we need to rerun the search until there are no more results or until we have 1500 tweets.
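
As an aside, the following is a minimal sketch of what such hardening might look like; the helper method name, the retry budget, and the back-off delay are our own illustrative choices, not part of the recipe's source:

// Hypothetical helper, not in the recipe's source: retry twitter.search()
// a few times before giving up, since it can throw TwitterException on
// network errors or rate limiting.
static QueryResult searchWithRetries(Twitter twitter, Query query)
    throws TwitterException, InterruptedException {
  int maxAttempts = 3;       // illustrative retry budget
  long backoffMillis = 5000; // illustrative back-off delay
  TwitterException lastFailure = null;
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    try {
      return twitter.search(query);
    } catch (TwitterException e) {
      lastFailure = e;
      if (attempt < maxAttempts) {
        Thread.sleep(backoffMillis); // back off before the next attempt
      }
    }
  }
  throw lastFailure; // all attempts failed
}

The next step involves a bit of reporting and writing: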

System.out.println("writing to disk " + csvRows.size() + " tweets at " + outFilePath);
Util.writeCsvAddHeader(csvRows, outFile);

The list of tweets is then written to a .csv file using the Util.writeCsvAddHeader method:

public static void writeCsvAddHeader(List<String[]> data, File file) throws IOException {
  CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(new FileOutputStream(file), Strings.UTF8));
  csvWriter.writeNext(ANNOTATION_HEADER_ROW); // write the column headers first
  csvWriter.writeAll(data);                   // then write all the tweet rows
  csvWriter.close();
}
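
If you want to sanity-check the output, you can read the file back with opencsv's CSVReader; the following is our own quick sketch, not part of the recipe's source, using the default output path:

// Read the .csv file back with opencsv and count the rows.
CSVReader csvReader = new CSVReader(new InputStreamReader(new FileInputStream("data/twitterSearch.csv"), Strings.UTF8));
List<String[]> rows = csvReader.readAll();
csvReader.close();
System.out.println("Read " + rows.size() + " rows, header included");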

We will be using this .csv file to run the language ID test in the next section.

See also

For more details on using the Twitter API and twitter4j, please go to their documentation pages at https://dev.twitter.com/docs and http://twitter4j.org.
